Test performance of selected recommender models, compared with that of “species-drug classifiers”: a collection of binary classifiers, each trained to predict AMR status on the subset of data comprising a single species-drug combination. “Simulated expert’s best guess” refers to counting AMR label frequencies within each species-drug combination and using those frequencies as predictions. The listed averages and standard deviations are calculated over five independent runs of the same model. Because the logistic regression and XGBoost implementations are non-stochastic, only one set of models is trained and, hence, no standard deviations are reported. Performance is computed on the subset of labels spanning the 200 most common species-drug combinations.
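
The frequency-counting baseline described in the caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(species, drug, label)` record layout, the convention that label 1 marks the positive AMR class, the fallback value for unseen combinations, and the toy records are all assumptions made for illustration only.

```python
from collections import defaultdict


def fit_frequency_baseline(train_records):
    """Return the fraction of positive AMR labels per (species, drug) pair.

    `train_records` is a hypothetical iterable of (species, drug, label)
    tuples, where label is 1 for the positive class and 0 otherwise.
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [positives, total]
    for species, drug, label in train_records:
        counts[(species, drug)][0] += int(label)
        counts[(species, drug)][1] += 1
    return {pair: pos / total for pair, (pos, total) in counts.items()}


def predict_frequency_baseline(frequencies, species, drug, default=0.5):
    """Predict the stored label frequency for a species-drug combination.

    The `default` for combinations absent from the training data is an
    assumption of this sketch, not something the source specifies.
    """
    return frequencies.get((species, drug), default)


# Toy records, invented purely to exercise the functions.
train = [
    ("E. coli", "ciprofloxacin", 1),
    ("E. coli", "ciprofloxacin", 0),
    ("E. coli", "ciprofloxacin", 0),
    ("E. coli", "ciprofloxacin", 0),
    ("S. aureus", "oxacillin", 1),
    ("S. aureus", "oxacillin", 1),
]
freqs = fit_frequency_baseline(train)
```

Every sample of a given species-drug combination receives the same predicted score under this baseline, so it cannot rank samples within a combination; any discriminative signal comes only from differences in label frequency between combinations.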