Participant demographic and clinical characteristics.

For demographics and clinical information, we report percentage for categorical variables, medians and [Q1, Q3] for real-value variables. ART is antiretroviral therapy. CD4 counts reported in cells/mm3. Years of HIV has 1 missing value, Years of ART has 7, and CD4 Nadir has 3; consequently, these missing values are not included in median and quantiles computations. Years before ART means years of HIV infection before ART initiation.

PWH features correlate with HIV reservoir characteristics.

The abundance of 144 immune cell populations were determined by flow cytometry and the HIV reservoir was quantified by intact proviral DNA assay for a cohort of 115 people with HIV (PWH). Each abundance and clinical and demographic variable was correlated with total HIV reservoir size, intact reservoir size and the percentage of intact proviruses. Spearman correlation coefficients (bold) are shown for 36 variables that had significant p-values (< 0.05) after Benjamini–Hochberg correction for multiple comparisons. Each feature/subset is ranked by the absolute value of the correlation coefficient for the total reservoir size. For years of ART, we compute correlation based on 108 participants, excluding participants with missing years of ART values.

Duration of treatment and the HIV reservoir.

Scatterplots for years of ART vs HIV reservoir size total reservoir size (A), intact reservoir size (B) and percent intact (C) are shown. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. Patients that have missing values of years of ART were not included in the plot. For percent intact, piece-wise linear function with two breaks is fitted. For total reservoir size the linear function is fitted.

Leave-one-covariate-out (LOCO) analysis for clinical-demographic features and reservoir characteristics while predicting immunophenotypes.

A. Explanation of leave-one-covariate-out (LOCO) analysis based on example of %Tn CD8 T for clinical-demographic features and reservoir characteristics while predicting immunophenotypes. Analysis was performed for all 133 immunophenotypes considered in the study. The top 10 biggest drops in adjusted R2 scores are reported for models that use total reservoir size (B.), intact reservoir size (C.), or percent intact (D.) as features in addition to clinical and demographic information. Participants with missing years of ART values are excluded from this analysis. The missing value of the CD4 nadir for one participant is imputed. In Figure S6, the LOCO analysis results are visualized for all 133 immunophenotypes.

Receiver operating characteristic (ROC) curves identify PWH parameters that can classify reservoir characteristics.

For total reservoir (A), intact reservoir (B), and percent intact (C), ROC curves are plotted for all 144 immune markers, demographics, and clinical variables (shown in gray). Axes represent the true positive rate (TPR) and the false positive rate (FPR) for each variable for classifying study participants into low (below median) versus high (above median) reservoir size. ROC curves for ten variables with the highest AUC values are shown in color for every HIV reservoir characteristic. Striped black lines represent the ROC curves of a random model. For years of ART ROC curves, we exclude participants with missing years of ART values.27

Dimension reduction reveals two major clusters of PWH with distinct immune systems and reservoirs.

A. PaCMAP was applied to the data using the ten immune cell features with the highest AUC values for classifying participants based on total reservoir size, and two clusters (cluster 1 and cluster 2) are identified. B. Same as A but data points are color-coded by total reservoir size (high=purple, low=pink). Total reservoir size (C.), intact reservoir size (D.), and percent intact (E.) is shown for participants within each cluster. F. Key immune cell features that distinguish cluster 1 from cluster 2 are identified by visualizing the features with the highest AUC values with respect to classifying cohort participants based on cluster membership. Axes represent the true positive rate (TPR) and the false positive rate (FPR) for each variable. Immune markers and clinical-demographic features are shown for each cluster in Figure S8.

Decision tree visualization of the association of immune cell subsets with reservoir characteristics.

(A, C, E) Host variables (immune cell frequencies, demographic, and clinical information) were used to visualize the PWH dataset using the optimal sparse decision trees algorithm GOSDT. The overall set of PWH was classified as likely having high (above median, orange “leaves”) or low (below median, blue “leaves”) total reservoir size (A), intact reservoir size (C), and percent intact (E). In each leaf, “med” denotes the median HIV characteristic of PWH, N is the number of PWH in the leaf, and MN is the number of misslabeled PWH. (B, D, F) PWH in model leaves associated with high (orange) or low (blue) reservoir size characteristics were aggregated and a Mann-Whitney U test was performed to determine statistical significance between the actual total reservoir size of the “high” and “low” groups for total reservoir size (B), intact reservoir size (D), and percent intact (F). For the percent intact tree we exclude participants with missing values of years of ART. For total and intact reservoir size, missing values of years of ART were imputed, however, since the trees do not use this variable, imputations do not influence results. Visualization trees are explained with sets of rules in Supplemental Materials: Figure S9 for total reservoir size, Figure S10 for intact reservoir size, and Figure S11 for percent intact.

Predicting HIV reservoir characteristics with machine learning.

Average training and test accuracies over ten training and test data splits for Random Forest (RF), Gradient Boosted Trees (GBT), Support Vector Machines with RBF kernel (SVM), Logistic Regression (LR), and CART models for total reservoir size (A), intact reservoir size (C), and percent intact (E) are shown. For one split of training and test sets, logistic regression models are visualized for total reservoir (B), intact reservoir (D), and percent intact (F). On the y-axis we show variables used by the model, while the x-axis displays coefficient values for individual variables used by models. For percent intact models we exclude participants with missing values of years of ART. For total and intact reservoir size, missing values of years of ART were imputed. The missing value of the CD4 nadir for one participant was imputed as well using the MICE algorithm.

Representative flow cytometry gating is shown for one sample from the 115-person cohort.

Intact, Total reservoir size and % intact for demographic subgroups.

We report medians and [Q1, Q3] for total reservoir size, intact reservoir size, and percent intact. Mann-Whitney U test was used to compute p-values for gender and Kruskal-Wallis H-test for race.

Immune subsets characteristics.

We report medians, 25% and 75% percentiles for every immune variable used in the study.

Host features correlate with HIV reservoir characteristics.

The abundance of 119 immune cell populations was determined by flow cytometry and the HIV reservoir was quantified by intact proviral DNA assay for a cohort of 115 people with HIV (PWH). Each abundance and clinical and demographic variable was correlated with total HIV reservoir size, intact reservoir size, and the percentage of intact proviruses. Spearman correlation coefficients are shown for 69 variables that were correlated (p<0.05) with one or more reservoir characteristics. Each feature/subset is ranked by the absolute value of the correlation coefficient for the total reservoir size. Variables that had significant p-values (<0.05) after Benjamini–Hochberg correction for multiple comparisons (144 comparisons in our case) with one of the characteristics are shown in bold. For years of ART, correlations are computed based on 108 participants, excluding participants with missing years of ART values. For CD4 Nadir, correlations are computed based on 114 patients, excluding participants with the missing CD4 Nadir value.

Abundance of immune cell subsets correlates with HIV reservoir. Continuation in Figures S3, S4. Scatterplots for selected examples of immune cell subsets versus HIV reservoir size or percent intact are shown. Additionally, scatterplots of immune cell subsets versus time on therapy are displayed. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. For time plots, participants that have missing values of years of ART were not included in the plot. The piece-wise linear function with one break is fitted if a break is between 1 and 20 years, otherwise, a linear function is fitted.

Abundance of immune cell subsets correlates with HIV reservoir. Continuation in Figures S2, S4. Scatterplots for selected examples of immune cell subsets versus HIV reservoir size or percent intact are shown. Additionally, scatterplots of immune cell subsets versus time on therapy are displayed. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. For time plots, participants that have missing values of years of ART were not included in the plot. The piece-wise linear function with one break is fitted if a break is between 1 and 20 years, otherwise, a linear function is fitted.

Abundance of immune cell subsets correlates with HIV reservoir. Continuation in Figures S2, S3. Scatterplots for selected examples of immune cell subsets versus HIV reservoir size or percent intact are shown. Additionally, scatterplots of immune cell subsets versus time on therapy are displayed. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. For time plots, participants that have missing values of years of ART were not included in the plot. The piece-wise linear function with one break is fitted if a break is between 1 and 20 years, otherwise, a linear function is fitted.

CD4/CD8 and (%CD127+ CD4T)/CD8 ratios correlate with total and intact reservoir size. Normalized (values transformed to be between 0 and 1) CD4/CD8 (A) and (%CD127+ CD4T)/CD8 (B) ratios are shown on the x-axis and the normalized natural logarithm (loge) of the total reservoir, intact reservoir and percent intact on the y-axis. Spearman correlation was computed between ratios and HIV reservoir characteristics. Outliers (red data points) were removed with the DBSCAN clustering algorithm and a linear regression model was fitted (black line) to the remaining data points. Spearman correlations, R2 scores and mean squared error after outlier removal are displayed.

PWH immune features correlate with years of ART.

The abundance of 119 immune cell populations was determined by flow cytometry and the HIV reservoir was quantified by intact proviral DNA assay for a cohort of 115 people with HIV (PWH). Each abundance and clinical and demographic variable was correlated with years of therapy for all participants (middle column)and for participants who had been on therapy for less than 10 years (right column). Spearman correlation coefficients are shown for all variables that were correlated (p<0.05) after Benjamini-Hochberg correction with years of treatment for all participants or for participants that were on ART for less than 10 years. Variables that had significant p-value (<0.05) after correction for multiple comparisons are shown in bold. Correlation is computed based on 108 PWH for all participants and on 60 PWH for PWH who were on therapy for less than 10 years). Computations exclude PWH with missing years of ART values. For CD4 Nadir, correlations are computed excluding the participant with missing CD4 Nadir value.

Multicolinearity analysis for variables used in models to predict immunophenotypes.

Variance Inflation Factor (VIF) for every variable is computed for models that fit immunophenotypes as a target and are based on clinical, demographic information and HIV characteristics (total, intact reservoir size or percent intact) as features. Note that here “Years before ART” is a continuous variable that we used for VIF computations only. In our other analysis we categorize this variable and use “Years before ART=NA”, “Years before ART< 1”, and “Years before ART 1” instead.

Leave-one-covariate-out (LOCO) analysis for clinical-demographic features and reservoir characteristics while predicting immunophenotypes. Analysis that is described in Figure 2 was performed for all 133 immunophenotypes considered in the study. Drops in adjusted R2 scores are reported for models that use total (A.), intact reservoir size (B.), or percent intact (C.) as features in addition to clinical and demographic information such as age, biological sex, years of treatment, CD4 nadir, recent CD4 count, and years of HIV before treatment (=NA, < 1, 1). On the x-axis, we show features that were dropped from the model. On the y-axis, we display immunophenotypes, which are targets (outcomes) for the linear regression models. In Tables S6-S8 we show the actual values of drops in adjusted R2 score.

Coefficient visualization for linear regression models that predict immunophenotypes based on clinical, demographic information and HIV characteristics.

(A., B., C.) LOCO analysis from Figure 3 for total-based (A.), intact-based (B.) and percent intact-based (C.) models. The drops in adjusted R2 scores are shown after removing a feature and training a new model without it. (D., E., F.) Coefficient visualization for models that include clinical and demographic information such as age, biological sex, years of treatment, CD4 nadir, recent CD4 count, and years of HIV before treatment = NA, years of HIV before treatment < 1, years of HIV before treatment 1 and total reservoirs size (D.) or intact reservoir size (E.), or percent intact (F.). No features are dropped from these models, they are “Include all” models from tables Tables S6-S8. On the x-axis, the feature is shown, and on the y-axis the target (immunophenotypes from (A., B., C.)). The heatmap displays the coefficient in front of that variable in the model (if the model is %CD4 T= β1Total + β2Age + β3Sex + , then β1, β2, β3, … are visualized), where positive coefficients are shown in red and negative in blue.

Adjusted R2 scores and differences in adjusted R2 for LOCO analysis for the model that contains total reservoir size.

The first column contains all immunophenotypes and is a target for the regression model. The second column contains adjusted R2 of the linear regression model with features such as total reservoir size, age, biological sex, years of treatment, CD4 nadir, recent CD4 count, and years of HIV before treatment (=NA, < 1, 1). The next seven columns show differences in adjusted R2 score (ΔR2) after removing a specific feature. Participants with missing years of ART values are excluded from this analysis. The missing value of the CD4 nadir for one participant is imputed.

Dimension reduction supplemental figures.

A. Participant age is shown within each cluster. B. Participant years of ART are shown within each cluster. C. Participant CD4 nadir is shown within each cluster. D. Participant CD4 count is shown within each cluster. E-O: Participant immune features of interest are shown, where plots of immune features with similar names are placed nearby. P.: Clusters with data points color-coded by intact reservoir size (high=purple, low=pink). Q.: Clusters with data points color-coded by percent intact reservoir size (high=purple, low=pink). R. Relative proportions of cannabis (CB) users and non-users (non-CB) are shown for each cluster. S. Total reservoir sizes for non-users and CB users are shown. T. The ages of study participants for non-users and CB users are shown.

Adjusted R2 scores and differences in adjusted R2 for LOCO analysis for the model that contains intact reservoir size.

The first column contains all immunophenotypes and is a target for the regression model. The second column contains adjusted R2 of the linear regression model with features such as intact reservoir size, age, biological sex, years of treatment, CD4 nadir, recent CD4 count, and years of HIV before treatment (=NA, < 1, 1). The next seven columns show differences in adjusted R2 score (ΔR2) after removing a specific feature. Participants with missing years of ART values are excluded from this analysis. The missing value of the CD4 nadir for one participant is imputed.

Adjusted R2 scores and differences in adjusted R2 for LOCO analysis for the model that contains percent intact.

The first column contains all immunophenotypes and is a target for the regression model. The second column contains adjusted R2 of the linear regression model with features such as percent intact, age, biological sex, years of treatment, CD4 nadir, recent CD4 count, and years of HIV before treatment (=NA, < 1, 1). The next seven columns show differences in adjusted R2 score (ΔR2) after removing a specific feature. Participants with missing years of ART values are excluded from this analysis. The missing value of the CD4 nadir for one participant is imputed.

Host features classify PWH with respect to HIV reservoir characteristics.

The abundance of each of the 144 variables was analyzed for performance in classifying each PWH within the cohort as having reservoir size above or below the median using a receiver operating characteristics (ROC) curve. The area under the curve (AUC) is reported. The table includes variables that had an AUC value larger than 0.6 for one of the reservoir size characteristics. We highlight variables that had an AUC value higher than 0.65 in bold. Variables are ranked by AUC for the total reservoir size. For years of ART, AUC values are computed based on 108 PWH, excluding participants with missing years of ART values. For CD4 Nadir, AUC values are computed based on 114 PWH, excluding one participant] with the missing CD4 Nadir value.

The total reservoir size visualization tree from Figure 6 is explained with a set of rules. For every leaf, the path that leads to this leaf is described. The histogram of the variable and split value for every node used in the tree is shown.

The intact reservoir size visualization tree from Figure 6 is explained with a set of rules. For every leaf, the path that leads to this leaf is described. The histogram of the variable and split value for every node used in the tree is shown.

The percent intact visualization tree from Figure 6 is explained with a set of rules. For every leaf, the path that leads to this leaf is described. The histogram of the variable and split value for every node used in the tree is shown.

Using machine learning to predict reservoir size. Average training and test R2 scores over different training and test data splits for Linear Regression (LR), Ridge Regression (RR), Kernel Regression with RBF kernel (KR), Decision Tree Regressor (DT), Random Forest (RF), and Gradient Boosted Trees (GBT) models are shown for predicting total reservoir size (A), intact reservoir size (B), and percentage intact (C). For one split of training and test sets, ridge regression performance is shown for total reservoir size (D), intact reservoir (E), and linear regression for percentage intact (F). On the x-axis, the actual values of HIV reservoir characteristics are shown, while on the y-axis the outputs of models are shown for training data (blue dots) and test data (red dots). For the same training and test split, ridge regression model feature coefficients are visualized for total reservoir (G), intact reservoir (H), and linear regression model feature coefficients are visualized for percent intact (I). On the y-axis we show variables used by the model, while on the x-axis coefficient values for linear models based on these variables. For percent intact models we exclude participants with missing values of years of ART. For total and intact reservoir size, missing values of years of ART were imputed. The missing value of the CD4 nadir for one participant was imputed as well using the MICE algorithm.

Training procedure for classification (regression)

Ranges of hyperparameters values that we used to perform grid search for classification and regression.