Architectural overview of the proposed model. AMR labels of spectrum-drug pairs can be represented in an incomplete matrix. A microbial sample that is susceptible to a drug is denoted by a negative label (orange), whereas positive labels (blue) signify an intermediate or resistant combination. Instance (spectrum) and target (drug) embeddings x_i and t_j are obtained by passing the respective input representations through the corresponding neural network branch. The two resulting embeddings are aggregated into a single score by their (scaled) dot product. The cross-entropy loss optimizes this score to be maximal for positive and minimal for negative combinations of microbial spectra and drugs.
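The scoring and loss described in this caption can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the scaling factor (one over the square root of the embedding dimension), the use of `None` for unobserved matrix entries, and all function names are assumptions made for the example.

```python
import math

def scaled_dot_score(x, t):
    """Score for one spectrum embedding x and one drug embedding t.

    Assumption: the dot product is scaled by 1/sqrt(d); the caption only says "(scaled)".
    """
    assert len(x) == len(t)
    return sum(a * b for a, b in zip(x, t)) / math.sqrt(len(x))

def binary_cross_entropy(score, label):
    """Cross-entropy of a logit against a 0/1 label (1 = intermediate or resistant)."""
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    z = score if label == 1 else -score
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def masked_loss(scores, labels):
    """Mean loss over observed entries of the incomplete label matrix.

    Assumption: unobserved spectrum-drug pairs are marked with None and skipped.
    """
    losses = [binary_cross_entropy(s, y)
              for row_s, row_y in zip(scores, labels)
              for s, y in zip(row_s, row_y) if y is not None]
    return sum(losses) / len(losses)
```

Minimizing this loss pushes the score up for positive (intermediate or resistant) pairs and down for negative (susceptible) pairs, which matches the behavior the caption describes.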

All tested model sizes for the instance (spectrum) branch. Hidden sizes represent the evolution of the hidden state dimensionality as it passes through the model, with every hyphen defining one fully connected layer. The listed number of parameters includes only those of the instance (spectrum) branch.
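As a rough illustration of how a hyphenated hidden-size string maps to a parameter count, the sketch below counts weights and biases of plain fully connected layers. The example size string is hypothetical, and the count ignores anything beyond linear layers (for example normalization), so it need not match the numbers in the table.

```python
def count_mlp_parameters(hidden_sizes, bias=True):
    """Count parameters of a stack of fully connected layers.

    `hidden_sizes` is a string such as "6000-512-64" (hypothetical example):
    every hyphen corresponds to one linear layer between adjacent sizes.
    """
    dims = [int(d) for d in hidden_sizes.split("-")]
    return sum(n_in * n_out + (n_out if bias else 0)
               for n_in, n_out in zip(dims, dims[1:]))
```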

Barplots showing test performance results for all trained models. ROC-AUC evaluates overall ranking of predictions. Prec@1(-) evaluates how often the top suggested treatment would be effective. Both metrics are calculated per spectrum/patient and then averaged. Errorbars represent the standard deviation over five random model seeds. The x-axis and colors show the different drug and spectrum embedders, respectively.
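The two per-spectrum metrics can be sketched as follows. This is an illustrative reading of the caption, not the evaluation code used in the work: the data layout (a dict from spectrum id to `(label, score)` pairs), the choice to skip spectra with only one label class when averaging ROC-AUC, and the function names are assumptions. Higher scores are taken to mean "more likely resistant", so the top suggested treatment is the drug with the lowest score.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC; returns None when only one class is present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def spectrum_macro_metrics(records):
    """records: dict mapping spectrum id -> list of (label, score) pairs.

    Returns (mean per-spectrum ROC-AUC, mean Prec@1(-)).
    """
    aucs, hits = [], []
    for pairs in records.values():
        labels = [y for y, _ in pairs]
        scores = [s for _, s in pairs]
        auc = roc_auc(labels, scores)
        if auc is not None:  # assumption: single-class spectra are skipped
            aucs.append(auc)
        # Prec@1(-): is the drug predicted most likely susceptible truly negative?
        top = min(range(len(scores)), key=scores.__getitem__)
        hits.append(1.0 if labels[top] == 0 else 0.0)
    mean_auc = sum(aucs) / len(aucs) if aucs else None
    return mean_auc, sum(hits) / len(hits)
```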

Test performance of selected general and species-specific dual branch recommender models. The listed averages and standard deviations are calculated over five independent runs of the same model. Performance is computed on the subset of labels spanning the 25 most-common species in DRIAMS-A.

Test performance of selected recommender models, compared to the performance of a collection of models, each trained on only one species-drug combination, termed “species-drug classifiers”.

“Species-drug classifiers” refers to a collection of binary classifiers, each trained to predict AMR status for a subset of data comprising a single species-drug combination. “Simulated expert’s best guess” refers to counting AMR label frequencies in single species-drug combinations and taking those as predictions. The listed averages and standard deviations are calculated over five independent runs of the same model. Given the non-stochastic nature of the logistic regression and XGBoost implementations, only one set of models is trained and, hence, no standard deviations are reported. Performance is computed on the subset of labels spanning the 200 most common species-drug combinations.
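The “simulated expert’s best guess” baseline amounts to a frequency lookup per species-drug combination. A minimal sketch, assuming training labels arrive as `(species, drug, label)` triples with label 1 for intermediate or resistant; the fallback value for unseen combinations and the function names are illustrative assumptions.

```python
from collections import defaultdict

def fit_label_frequencies(train):
    """train: iterable of (species, drug, label) with label in {0, 1}.

    Returns the fraction of positive labels per species-drug combination.
    """
    counts = defaultdict(lambda: [0, 0])  # [number of positives, total]
    for species, drug, label in train:
        counts[(species, drug)][0] += label
        counts[(species, drug)][1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}

def predict_frequency(freqs, species, drug, default=0.5):
    """Predicted resistance probability; `default` for unseen pairs is an assumption."""
    return freqs.get((species, drug), default)
```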

Transfer learning of DRIAMS-A models to other hospitals. Errorbands show the standard deviation over five runs. Results in terms of other evaluation metrics are shown in Appendix C Figure 11.

UMAP scatterplots of test set MALDI-TOF spectra embeddings x_i. Top: Embeddings from a “general” (trained on all species) recommender. Only embeddings belonging to the 25 most frequently occurring species in the test set are shown. The panels on the right show the same embeddings as on the left, but colored according to their AMR status for a given drug. The four displayed drugs are selected based on a ranking of the product of the number of positive and negative labels. In this way, the drugs that have many observed labels, both positive and negative, are displayed. Bottom: Highlighted embeddings from an S. epidermidis-specific recommender model.
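The drug selection rule in this caption (rank by the product of positive and negative label counts) can be sketched as below. The input layout of `(drug, label)` pairs, the alphabetical tie-break, and the function name are assumptions for illustration.

```python
from collections import Counter

def rank_drugs_by_label_balance(labels, top_k=4):
    """labels: iterable of (drug, label) pairs with label in {0, 1}.

    Ranks drugs by (#positives * #negatives), so that drugs with many labels
    of both classes come first. Ties are broken alphabetically (assumption).
    """
    pos, neg = Counter(), Counter()
    for drug, y in labels:
        (pos if y == 1 else neg)[drug] += 1
    drugs = set(pos) | set(neg)
    return sorted(drugs, key=lambda d: (-pos[d] * neg[d], d))[:top_k]
```

A drug with only positive or only negative labels gets a product of zero and therefore ranks last, regardless of how many labels it has.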

Full list of modifications made to drug names in DRIAMS. Modifications consist of (1) removal of drugs, (2) merging of drugs, and (3) renaming drugs.

Structure of the residual blocks used in the 1D CNN, 2D CNN, and transformer. In the case of convolutions, zero padding is applied so that the output has the same dimensions as the input.

Overview of all different drug embedders tested in this work. One-hot embeddings are the only technique not incorporating prior knowledge of the structure of the compound. Hence, they are the only technique incapable of directly transferring to new compounds. Morgan fingerprints produce a bit-vector containing information on the presence of certain substructures. DeepSMILES strings are encoded and processed with a 1D CNN, GRU, or transformer. Drawings of molecules are processed with a 2D CNN. A string kernel on SMILES strings produces a numerical vector for every drug (taken as the row in the resulting Gram matrix).

All hyperparameter tuning experiments. All evaluations are listed in terms of validation ROC-AUCs. All numbers are averages of five model runs, with errorbars showing standard deviations. In every experiment, the setting with the highest average is used in the final models.

Spectrum-macro ROC curve for the best-performing model (Morgan Fingerprints drug embedder, Medium-sized spectrum embedder). The y-axis shows the average sensitivity (across patients), while the x-axis shows one minus the average specificity. Note that this ROC curve is not a traditional ROC curve constructed from one single label set and one corresponding prediction set. Rather, it is constructed from spectrum-macro metrics as follows: for any possible threshold value, binarize all predictions. Then, for every spectrum/patient independently, compute the sensitivity and specificity for the subset of labels corresponding to that spectrum/patient. Finally, those sensitivities and specificities are averaged across patients to obtain one point on the ROC curve. In blue, the optimal sensitivity and specificity (according to the Youden index) are indicated (Youden, 1950).
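The construction described in this caption can be sketched as follows. This is an illustrative reading, not the authors' code: the data layout, the rule that a spectrum without positives (or negatives) is left out of the sensitivity (or specificity) average, and the function names are assumptions. The sketch also assumes at least one spectrum contributes to each average.

```python
from statistics import fmean

def spectrum_macro_roc(records, thresholds):
    """records: dict mapping spectrum id -> list of (label, score) pairs.

    Returns one (threshold, mean sensitivity, mean specificity) per threshold.
    """
    points = []
    for thr in thresholds:
        sens, spec = [], []
        for pairs in records.values():
            tp = sum(1 for y, s in pairs if y == 1 and s >= thr)
            fn = sum(1 for y, s in pairs if y == 1 and s < thr)
            tn = sum(1 for y, s in pairs if y == 0 and s < thr)
            fp = sum(1 for y, s in pairs if y == 0 and s >= thr)
            if tp + fn:  # assumption: skip spectra without positive labels
                sens.append(tp / (tp + fn))
            if tn + fp:  # assumption: skip spectra without negative labels
                spec.append(tn / (tn + fp))
        points.append((thr, fmean(sens), fmean(spec)))
    return points

def youden_optimum(points):
    """Point maximizing the Youden index J = sensitivity + specificity - 1."""
    return max(points, key=lambda p: p[1] + p[2] - 1)
```

Plotting mean sensitivity against one minus mean specificity for all thresholds gives the curve; `youden_optimum` returns the operating point of the kind highlighted in blue.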

Barplots showing test performance results for all trained models. Colors represent the different spectrum embedder model sizes. Performance is shown in terms of Macro ROC-AUC (computed per drug and averaged). Errorbars represent the standard deviation over five random seeds.

Full table of test results. The listed averages and standard deviations are calculated over five independent runs of the same model. The best models for every metric per drug embedder are underlined. The overall best model for every metric is in bold face.

Performance of models compared against a linear spectrum embedder baseline. The comparison is only shown for the best-performing drug embedder (Morgan Fingerprints). Errorbars represent the standard deviation over five random seeds.

Transfer learning of DRIAMS-A models to other hospitals. Errorbands show the standard deviation over five runs.

Test ROC-AUC performance per species. Reported figures are averages across the five different Medium-sized Morgan Fingerprint-based recommenders.

UMAP scatterplots of test set MALDI-TOF spectra embeddings x_i. Embeddings from a “general” (trained on all spectra across species) recommender are shown. Only embeddings belonging to the 25 most frequently occurring species in the test set are shown. Spectra are colored according to their AMR status for a given drug. The twenty displayed drugs are selected based on a ranking of the product of the number of positive and negative labels. In this way, the drugs that have many observed labels, both positive and negative, are displayed. The drugs here are ranked 5-24 (the first four are shown in Figure 4). To map the clusters back to species, readers are referred to Figure 4.

UMAP scatterplots of test set MALDI-TOF spectra embeddings x_i. Embeddings from two “species-specific” recommenders are shown. Spectra are colored according to their AMR status for a given drug.