Machine learning approaches identify immunologic signatures of total and intact HIV DNA during long-term antiretroviral therapy

  1. Lesia Semenova
  2. Yingfan Wang
  3. Shane Falcinelli
  4. Nancie Archin
  5. Alicia D Cooper-Volkheimer
  6. David M Margolis
  7. Nilu Goonetilleke
  8. David M Murdoch
  9. Cynthia D Rudin (corresponding author)
  10. Edward P Browne (corresponding author)
  1. Microsoft Research, Duke University, United States
  2. Department of Computer Science, Duke University, United States
  3. UNC HIV Cure Center UNC Chapel Hill, United States
  4. Department of Microbiology and Immunology, UNC Chapel Hill, United States
  5. Department of Medicine, UNC Chapel Hill, United States
  6. Department of Medicine, Duke University, United States
6 figures, 2 tables and 2 additional files

Figures

Figure 1 with 5 supplements
Duration of treatment and the HIV reservoir.

Scatterplots for years of antiretroviral therapy (ART) versus total HIV reservoir frequency (A), intact reservoir frequency (B), and percent intact (C) are shown. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. Participants with missing values for years of ART were not included in the plots. For percent intact, a piece-wise linear function with two breaks is fitted; for total HIV reservoir frequency, a linear function is fitted.

Figure 1—figure supplement 1
Representative flow cytometry gating is shown for one sample from the 115-person cohort.
Figure 1—figure supplement 2
Abundance of immune cell subsets correlates with HIV reservoir (part I).

Continued in Figure 1—figure supplement 3 and Figure 1—figure supplement 4. Scatterplots are shown for selected immune cell subsets, (A) %CD8 T, (B) %CD38+/HLA-DR- CD4 T, (C) %KLRG1-/PD-1- CD4 T, (D) %Tn CD4 T, (E) %NKG2A+ CD4 T, (F) %PD-1-/CCR7+ CD4 T, (G) %CD4 T, (H) %Tcm CD8 T, (I) %CD38+ CD4 T, (J) %PD-1-/CCR7- CD4 T, (K) %CD38+/HLA-DR- CD8 T, (L) %PD-1+ CD4 T, versus HIV reservoir frequency or percent intact. Scatterplots of immune cell subsets versus time on therapy are also displayed. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. For time plots, participants with missing values for years of antiretroviral therapy (ART) were not included. A piece-wise linear function with one break is fitted if the break falls between 1 and 20 years; otherwise, a linear function is fitted. A red trendline is shown for significant correlations.

Figure 1—figure supplement 3
Abundance of immune cell subsets correlates with HIV reservoir (part II).

Continued from Figure 1—figure supplement 2 and continued in Figure 1—figure supplement 4. Scatterplots are shown for selected immune cell subsets or participant characteristics, (A) Age, (B) CD4 nadir, (C) %PD-1+ Tn CD4 T, (D) %Tn CD8 T, (E) %PD-1+/CCR7+ CD8 T, (F) %CD38-/HLA-DR+ CD4 T, (G) %KLRG1-/PD-1- CD8 T, (H) %PD-1+ Tn CD8 T, (I) %CD38+/HLA-DR- Tn CD8 T, (J) %PD-1+/CCR7+ CD4 T, (K) %KLRG1+/CD27+ CD8 T, (L) %CD27+ CD4 T, versus HIV reservoir frequency or percent intact. Scatterplots of immune cell subsets versus time on therapy are also displayed. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. For time plots, participants with missing values for years of antiretroviral therapy (ART) were not included. A piece-wise linear function with one break is fitted if the break falls between 1 and 20 years; otherwise, a linear function is fitted. A red trendline is shown for significant correlations.

Figure 1—figure supplement 4
Abundance of immune cell subsets correlates with HIV reservoir (part III).

Continued from Figure 1—figure supplement 2 and Figure 1—figure supplement 3. Scatterplots are shown for selected immune cell subsets, (A) %KLRG1-/CD27+ CD4 T, (B) %HLA-DR+ CD4 T, (C) %PD-1+ CD8 T, (D) %CD38+ CD8 T, (E) %CD38-/HLA-DR+ Tn CD8 T, (F) %Tem CD4 T, (G) %CD8 T, (H) %CD4 T, (I) %CD127+ CD4 T, (J) %CD38+ CD8 T, (K) %CD107-/IFNγ-/IL-2+/TNFα+ CD4 T, (L) %CD107-/IFNγ+/IL-2+/TNFα- CD4 T, versus HIV reservoir frequency or percent intact. Scatterplots of immune cell subsets versus time on therapy are also displayed. Each dot represents an individual study participant. Correlation coefficients and corresponding p-values are shown for each plot. For time plots, participants with missing values for years of antiretroviral therapy (ART) were not included. A piece-wise linear function with one break is fitted if the break falls between 1 and 20 years; otherwise, a linear function is fitted. A red trendline is shown for significant correlations.

Figure 1—figure supplement 5
CD4/CD8 and (%CD127+ CD4T)/CD8 ratios correlate with total and intact reservoir frequency.

Normalized (values transformed to be between 0 and 1) CD4/CD8 (A) and (%CD127+ CD4T)/CD8 (B) ratios are shown on the x-axis, and the normalized natural logarithm (logₑ) of the total reservoir, intact reservoir, and percent intact on the y-axis. Spearman correlation was computed between the ratios and HIV reservoir characteristics. Outliers (red data points) were removed with the DBSCAN clustering algorithm, and a linear regression model was fitted (black line) to the remaining data points. Spearman correlations, R² scores, and mean squared errors after outlier removal are displayed.

Figure 2 with 2 supplements
Leave-one-covariate-out (LOCO) analysis for clinical–demographic features and reservoir characteristics while predicting immunophenotypes.

(A) Explanation of the LOCO analysis, using %CD4 T as an example, for clinical–demographic features and reservoir characteristics while predicting immunophenotypes. The analysis was performed for all 133 immunophenotypes considered in the study. The 10 largest drops in adjusted R² scores are reported for models that use total reservoir frequency (B), intact reservoir frequency (C), or percent intact (D) as features in addition to clinical and demographic information. Participants with missing years of antiretroviral therapy (ART) values are excluded from this analysis. The missing value of the CD4 nadir for one participant is imputed.
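
The LOCO idea in this caption can be illustrated with a minimal sketch: fit a linear regression with all features, refit it with one feature left out, and report the difference in adjusted R². The code below is an illustration under stated assumptions, not the authors' implementation. The feature names, the toy values, and the helper names `solve`, `adjusted_r2`, and `loco_drops` are all hypothetical, and ordinary least squares is solved here with a small hand-written routine so the example needs no external libraries.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def adjusted_r2(rows, y):
    """Fit least squares with an intercept and return the adjusted R^2."""
    n, p = len(rows), len(rows[0])
    design = [[1.0] + list(r) for r in rows]
    k = p + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def loco_drops(rows, y, names):
    """Adjusted R^2 of the full model minus that of each leave-one-out model."""
    full = adjusted_r2(rows, y)
    drops = {}
    for j, name in enumerate(names):
        reduced = [[v for i, v in enumerate(r) if i != j] for r in rows]
        drops[name] = full - adjusted_r2(reduced, y)
    return drops


# Hypothetical toy data: the target depends on the first feature only.
names = ["total_reservoir", "age"]
rows = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 1.0],
        [5.0, 5.0], [6.0, 9.0], [7.0, 2.0], [8.0, 6.0]]
y = [2.1, 3.8, 6.05, 8.1, 9.9, 12.0, 14.15, 15.9]
drops = loco_drops(rows, y, names)
print(drops)
```

A large drop means the model loses explanatory power without that feature. In this toy example, removing `total_reservoir` removes almost all of the explained variance, while removing `age` changes the adjusted R² very little.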

Figure 2—figure supplement 1
Leave-one-covariate-out (LOCO) analysis visualization for all 133 immunophenotypes.

LOCO analysis for clinical–demographic features and reservoir characteristics while predicting immunophenotypes. The analysis described in Figure 2 was performed for all 133 immunophenotypes considered in the study. Drops in adjusted R² scores are reported for models that use total reservoir frequency (A), intact reservoir frequency (B), or percent intact (C) as features in addition to clinical and demographic information such as age, biological sex, race, years of treatment, CD4 nadir, recent CD4 count, and years of HIV before treatment (= NA, <1, ≥1). The x-axis shows the features that were dropped from the model. The y-axis shows the immunophenotypes, which are the targets (outcomes) for the linear regression models. The actual values of the drops in adjusted R² score are given in Supplementary file 1f–h.

Figure 2—figure supplement 2
Coefficient visualization for linear regression models that predict immunophenotypes in Figure 2B–D.

Coefficient visualization for linear regression models that predict immunophenotypes based on clinical and demographic information and HIV characteristics. Leave-one-covariate-out (LOCO) analysis from Figure 2 for total reservoir-based (A), intact reservoir-based (B), and percent intact-based (C) models. The drops in adjusted R² scores are shown after removing a feature and training a new model without it. Coefficient visualization for models that include clinical and demographic information such as age, biological sex, years of treatment, CD4 nadir, recent CD4 count, and years of HIV before treatment (= NA, <1, ≥1), together with total reservoir frequency (D), intact reservoir frequency (E), or percent intact (F). No features are dropped from these models; they are the ‘Include all’ models from Supplementary file 1f–h. The x-axis shows the feature, and the y-axis shows the target (immunophenotypes from A–C). The heatmap displays the coefficient of that variable in the model (if the model is $\%\text{CD4 T} = \beta_1\,\text{Total} + \beta_2\,\text{Age} + \beta_3\,\text{Sex} + \dots$, then $\beta_1, \beta_2, \beta_3, \dots$ are visualized), where positive coefficients are shown in red and negative in blue.

Figure 3
Receiver operating characteristic (ROC) curves identify people with HIV (PWH) parameters that can classify reservoir characteristics.

For total reservoir frequency (A), intact reservoir frequency (B), and percent intact (C), ROC curves are plotted for all 144 immune markers, demographics, and clinical variables (shown in gray). Axes represent the true positive rate (TPR) and the false positive rate (FPR) for each variable for classifying study participants into low (below median) versus high (above median) reservoir frequency. ROC curves for 10 variables with the highest area under the curve (AUC) values are shown in color for each HIV reservoir characteristic. Striped black lines represent the ROC curves of a random model. For years of antiretroviral therapy (ART) ROC curves, we exclude participants with missing years of ART values.
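
The per-variable ROC analysis described above can be sketched as follows: binarize the reservoir measurement at the cohort median, sweep a threshold over a single variable, and integrate the resulting curve. This is a minimal illustration, not the authors' code. The function names and all numeric values are hypothetical, and ties and missing values are not given any special treatment.

```python
from statistics import median


def binarize_by_median(values):
    """Label each participant 1 if above the cohort median, else 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy cohort of six participants.
reservoir = [10, 20, 30, 40, 50, 60]      # reservoir frequency (made-up values)
labels = binarize_by_median(reservoir)    # low = 0, high = 1
marker_a = [1, 2, 3, 4, 5, 6]             # perfectly ordered with the reservoir
marker_c = [1, 4, 2, 5, 3, 6]             # partially ordered with the reservoir

auc_a = auc(roc_points(marker_a, labels))
auc_c = auc(roc_points(marker_c, labels))
print(auc_a, auc_c)
```

An AUC near 0.5 corresponds to the random model, and an AUC well below 0.5 indicates a variable that is inversely associated with the high-reservoir group.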

Figure 4 with 2 supplements
Dimension reduction reveals two major clusters of people with HIV (PWH) with distinct immune systems and reservoirs.

(A) PaCMAP was applied to the data using the ten immune cell features with the highest area under the curve (AUC) values for classifying participants based on total reservoir frequency, and two clusters (clusters 1 and 2) are identified. (B) Same as A but data points are color-coded by total reservoir frequency (high = pink, low = gray). Total reservoir frequency (C), intact reservoir frequency (D), and percent intact (E) are shown for participants within each cluster. (F) Key immune cell features that distinguish cluster 1 from cluster 2 are identified by visualizing the features with the highest AUC values with respect to classifying cohort participants based on cluster membership. Axes represent the true positive rate (TPR) and the false positive rate (FPR) for each variable. Immune markers and clinical–demographic features are shown for each cluster in Figure 4—figure supplement 1.

Figure 4—figure supplement 1
Additional dimension reduction results.

Dimension reduction supplemental figures. (A) Participant age is shown within each cluster. (B) Participant years of antiretroviral therapy (ART) are shown within each cluster. (C) Participant CD4 nadir is shown within each cluster. (D) Participant CD4 count is shown within each cluster. (E–O) Participant immune features of interest are shown, where plots of immune features with similar names are placed nearby. (P) Clusters with data points color-coded by intact reservoir frequency (high = pink, low = gray). (Q) Clusters with data points color-coded by percent intact reservoir frequency (high = pink, low = gray). (R) Relative proportions of cannabis (CB) users and non-users (non-CB) are shown for each cluster. (S) Total reservoir frequencies (per million CD4 T cells) for non-users and CB users are shown. (T) The ages of study participants for non-users and CB users are shown.

Figure 4—figure supplement 2
Principal component analysis (PCA) visualization.

Left: PCA plot with data points color-coded by the membership of clusters identified in Figure 4. Right: PCA plot with data points color-coded by total reservoir frequency (high = pink, low = gray).

Figure 5 with 3 supplements
Decision tree visualization of the association of immune cell subsets with reservoir characteristics.

(A, C, E) Host variables (immune cell frequencies, demographic, and clinical information) were used to visualize the people with HIV (PWH) dataset using the Generalized and Scalable Optimal Sparse Decision Trees (GOSDT) algorithm. The overall set of PWH was classified as likely having high (above median, orange ‘leaves’) or low (below median, blue ‘leaves’) total reservoir frequency (A), intact reservoir frequency (C), and percent intact (E). In each leaf, ‘med’ denotes the median HIV characteristic of PWH, N is the number of PWH in the leaf, and MN is the number of mislabeled PWH. (B, D, F) PWH in model leaves associated with high (orange) or low (blue) reservoir characteristics were aggregated, and a Mann–Whitney U test was performed to determine the statistical significance of the difference in actual values between the ‘high’ and ‘low’ groups for total reservoir frequency (B), intact reservoir frequency (D), and percent intact (F). For the percent intact tree, we exclude participants with missing values of years of antiretroviral therapy (ART). For total and intact reservoir frequency, missing values of years of ART were imputed; however, since the trees do not use this variable, the imputations do not influence the results. The visualization trees are explained with sets of rules in the figure supplements.
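
The three quantities reported in each leaf (‘med’, N, and MN) can be computed as in the short sketch below. It assumes that a participant counts as mislabeled when the leaf's predicted class (high or low) disagrees with whether the participant's actual value lies above the cohort median. The function name `leaf_summary` and the numbers are hypothetical and are not taken from the study.

```python
from statistics import median


def leaf_summary(values, predicted_high, cohort_median):
    """Summarize one leaf: median value, size N, and mislabeled count MN."""
    # A participant is "high" if the actual value is above the cohort median.
    actual_high = [v > cohort_median for v in values]
    mislabeled = sum(1 for a in actual_high if a != predicted_high)
    return {"med": median(values), "N": len(values), "MN": mislabeled}


# Hypothetical leaf predicted as "high", with a made-up cohort median of 500.
summary = leaf_summary([620, 710, 480, 900], predicted_high=True, cohort_median=500)
print(summary)
```

In this toy leaf, one of the four participants (value 480) falls below the cohort median and is therefore counted in MN.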

Figure 5—figure supplement 1
The total reservoir frequency visualization tree is explained with a set of rules.

For every leaf, the path that leads to this leaf is described. The histogram of the variable and split value for every node used in the tree is shown.

Figure 5—figure supplement 2
The intact reservoir frequency visualization tree is explained with a set of rules.

For every leaf, the path that leads to this leaf is described. The histogram of the variable and split value for every node used in the tree is shown.

Figure 5—figure supplement 3
The percent intact visualization tree is explained with a set of rules.

For every leaf, the path that leads to this leaf is described. The histogram of the variable and split value for every node used in the tree is shown.

Figure 6 with 1 supplement
Predicting HIV reservoir characteristics with machine learning.

Average training and test accuracies over 10 training and test data splits for Random Forest (RF), Gradient Boosted Trees (GBT), Support Vector Machines with RBF kernel (SVM), Logistic Regression (LR), and CART models for total reservoir frequency (A), intact reservoir frequency (C), and percent intact (E) are shown. For one split of training and test sets, LR models are visualized for total reservoir (B), intact reservoir (D), and percent intact (F). On the y-axis, we show variables used by the model, while the x-axis displays coefficient values for individual variables used by models. For percent intact models, we exclude participants with missing values of years of antiretroviral therapy (ART). For total and intact reservoir frequency, missing values of years of ART were imputed. The missing value of the CD4 nadir for one participant was imputed using the Multivariate Imputation by Chained Equations (MICE) algorithm.

Figure 6—figure supplement 1
Using machine learning to predict reservoir frequency.

Average training and test R² scores over different training and test data splits for Linear Regression (LR), Ridge Regression (RR), Kernel Regression with RBF kernel (KR), Decision Tree Regressor (DT), Random Forest (RF), and Gradient Boosted Trees (GBT) models are shown for predicting total reservoir frequency (A), intact reservoir frequency (B), and percent intact (C). For one split of training and test sets, ridge regression performance is shown for total reservoir frequency (D) and intact reservoir frequency (E), and linear regression performance for percent intact (F). The x-axis shows the actual values of HIV reservoir characteristics, and the y-axis shows the model outputs for training data (blue dots) and test data (red dots). For the same training and test split, ridge regression model feature coefficients are visualized for total reservoir (G) and intact reservoir (H), and linear regression model feature coefficients are visualized for percent intact (I). The y-axis shows the variables used by the model, and the x-axis shows the coefficient values of the linear models for these variables. For percent intact models, we exclude participants with missing values of years of antiretroviral therapy (ART). For total and intact reservoir frequency, missing values of years of ART were imputed. The missing value of the CD4 nadir for one participant was also imputed using the MICE algorithm.

Tables

Table 1
Participant demographic and clinical characteristics.

For demographics and clinical information, we report percentages for categorical variables, and medians and [Q1, Q3] for real-valued variables. ART is antiretroviral therapy. CD4 counts are reported in cells/mm³. Years of HIV has 1 missing value, years of ART has 7, and CD4 nadir has 3; these missing values are not included in the median and quantile computations. Years before ART means years of HIV infection before ART initiation.

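| | Percentage (count) | Median | [Q1, Q3] | [Min, Max] |
|---|---|---|---|---|
| Age | | 45 | [37, 53] | [23, 65] |
| Sex (% male) | 76.52% (88) | | | |
| Race | | | | |
| Black | 60% (69) | | | |
| White | 37.39% (43) | | | |
| Other | 2.61% (3) | | | |
| Years of HIV | | 11 | [7, 19.85] | [1, 33.6] |
| Years before ART <1 | 55.65% (64) | | | |
| Years before ART ≥1 | 38.26% (44) | | | |
| Years before ART = NA | 6.09% (7) | | | |
| Years of ART | | 9 | [5.23, 16.63] | [0.9, 33.5] |
| Recent CD4 count | | 799 | [624.5, 962] | [319, 1970] |
| CD4 nadir | | 313.5 | [163.25, 463.25] | [2, 1080] |
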
Table 2
PWH features correlate with HIV reservoir characteristics.

The abundance of 144 immune cell populations was determined by flow cytometry and the HIV reservoir was quantified by intact proviral DNA assay for a cohort of 115 people with HIV (PWH). The abundance of each population and each clinical and demographic variable were correlated with total HIV reservoir frequency, intact reservoir frequency, and the percentage of intact proviruses. Spearman correlation coefficients (bold) are shown for 36 variables that had significant p-values (<0.05) after Benjamini–Hochberg correction for multiple comparisons. Each feature/subset is ranked by the absolute value of the correlation coefficient for the total reservoir frequency. For years of ART, we compute the correlation based on 108 participants, excluding participants with missing years of ART values.

| Variable | Total | Intact | Percent intact |
|---|---|---|---|
| %CD8 T | 0.4052 | 0.3562 | 0.0068 |
| %CD38+/HLA-DR− CD4 T | 0.3891 | −0.1098 | 0.2664 |
| %KLRG1−/PD-1− CD4 T | 0.3808 | −0.1289 | 0.2334 |
| %Tn CD4 T | 0.3802 | −0.2031 | 0.1714 |
| %NKG2A+ CD4 T | 0.3618 | 0.2904 | 0.0179 |
| %PD-1−/CCR7+ CD4 T | −0.3590 | −0.1082 | 0.2283 |
| %CD4 T | −0.3564 | −0.3195 | −0.0079 |
| %Tcm CD8 T | 0.3466 | 0.1752 | −0.1814 |
| %CD38+ CD4 T | −0.3366 | −0.0611 | 0.2829 |
| %PD-1−/CCR7− CD4 T | 0.3300 | 0.1824 | −0.0938 |
| %CD38+/HLA-DR− CD8 T | −0.3267 | −0.0837 | 0.2636 |
| %PD-1+ CD4 T | 0.3222 | 0.0664 | −0.2470 |
| Age | 0.3172 | 0.1669 | −0.1471 |
| CD4 nadir | −0.3164 | −0.1512 | 0.1927 |
| %PD-1+ Tn CD4 T | 0.3119 | 0.1246 | −0.0962 |
| %Tn CD8 T | −0.3028 | −0.2697 | 0.0154 |
| Years of ART | 0.3062 | −0.0688 | −0.4523 |
| %PD-1+/CCR7+ CD8 T | 0.2926 | 0.1361 | −0.1254 |
| %CD38−/HLA-DR+ CD4 T | 0.2849 | 0.0920 | −0.1500 |
| %KLRG1−/PD-1− CD8 T | −0.2757 | −0.2182 | 0.0162 |
| %PD-1+ Tn CD8 T | 0.2738 | 0.1411 | −0.0705 |
| %CD38+/HLA-DR− Tn CD8 T | −0.2676 | −0.0610 | 0.2119 |
| %PD-1+/CCR7+ CD4 T | 0.2665 | 0.0859 | −0.1673 |
| %KLRG1+/CD27+ CD8 T | 0.2606 | 0.2212 | 0.0319 |
| %CD27+ CD4 T | −0.2602 | −0.0531 | 0.1767 |
| %KLRG1−/CD27+ CD4 T | −0.2575 | −0.0464 | 0.2171 |
| %HLA-DR+ CD4 T | 0.2565 | 0.1110 | −0.0970 |
| %PD-1+ CD8 T | 0.2541 | 0.1969 | 0.0035 |
| %CD38+ CD8 T | −0.2418 | 0.0109 | 0.3114 |
| %CD38−/HLA-DR+ Tn CD8 T | 0.2404 | 0.0738 | −0.1346 |
| %Tem CD4 T | 0.2402 | 0.0888 | −0.1669 |
| %CD127+ CD4 T | −0.1298 | −0.3160 | −0.2539 |
| %CD107a−IFNγ−IL-2+TNFα+ CD4 T | 0.0568 | −0.1775 | −0.3223 |
| %CD107a−IFNγ+IL-2+TNFα− CD4 T | 0.0285 | −0.2455 | −0.3265 |
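
The Benjamini–Hochberg correction named in the Table 2 caption can be sketched in a few lines. This is a generic illustration of the procedure, not the authors' code. The function name `benjamini_hochberg` and the example p-values are hypothetical, and the 0.05 threshold follows the caption.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and a significance flag for each input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted, [q < alpha for q in adjusted]


# Hypothetical raw p-values for four variables (not taken from the study).
raw = [0.001, 0.04, 0.03, 0.20]
adjusted, significant = benjamini_hochberg(raw)
print(adjusted, significant)
```

In this toy example only the first variable remains significant after correction, although three of the four raw p-values are below 0.05.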

Additional files

MDAR checklist
https://cdn.elifesciences.org/articles/94899/elife-94899-mdarchecklist1-v1.docx
Supplementary file 1

Additional tables with raw data values for main figures.

(a) Intact, total reservoir frequency, and %intact for demographic subgroups. (b) Immune subsets characteristics. (c) Host features correlate with HIV reservoir characteristics. (d) People with HIV (PWH) immune features correlate with years of antiretroviral therapy (ART). (e) Multicollinearity analysis for variables used in models to predict immunophenotypes. (f) Adjusted R² scores and differences in adjusted R² for leave-one-covariate-out (LOCO) analysis for the model that contains total reservoir frequency. (g) Adjusted R² scores and differences in adjusted R² for LOCO analysis for the model that contains intact reservoir frequency. (h) Adjusted R² scores and differences in adjusted R² for LOCO analysis for the model that contains percent intact. (i) Host features classify PWH with respect to HIV reservoir characteristics. (j) Training procedure for classification (regression). (k) Ranges of hyperparameter values that we used to perform grid search for classification and regression.

https://cdn.elifesciences.org/articles/94899/elife-94899-supp1-v1.pdf

Cite this article

Lesia Semenova, Yingfan Wang, Shane Falcinelli, Nancie Archin, Alicia D Cooper-Volkheimer, David M Margolis, Nilu Goonetilleke, David M Murdoch, Cynthia D Rudin, Edward P Browne (2024)
Machine learning approaches identify immunologic signatures of total and intact HIV DNA during long-term antiretroviral therapy
eLife 13:RP94899.
https://doi.org/10.7554/eLife.94899.3