Deep-sequence phylogenetics to quantify patterns of HIV transmission in the context of a universal testing and treatment trial – BCPP/Ya Tsie trial

  1. Lerato E Magosi (corresponding author)
  2. Yinfeng Zhang
  3. Tanya Golubchik
  4. Victor DeGruttola
  5. Eric Tchetgen Tchetgen
  6. Vladimir Novitsky
  7. Janet Moore
  8. Pam Bachanas
  9. Tebogo Segolodi
  10. Refeletswe Lebelonyane
  11. Molly Pretorius Holme
  12. Sikhulile Moyo
  13. Joseph Makhema
  14. Shahin Lockman
  15. Christophe Fraser
  16. Myron Max Essex
  17. Marc Lipsitch (corresponding author)
  18. On behalf of The Botswana Combination Prevention Project and PANGEA consortium
  1. Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, United States
  2. Division of Molecular & Genomic Pathology, University of Pittsburgh Medical Center Presbyterian Shadyside, United States
  3. Oxford Big Data Institute, Li Ka Shing Center for Health Information and Discovery, Nuffield Department of Medicine, Old Road Campus, University of Oxford, United Kingdom
  4. Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, United States
  5. Department of Statistics, The Wharton School, University of Pennsylvania, United States
  6. Harvard T.H. Chan School of Public Health AIDS Initiative, Department of Immunology and Infectious Disease, Harvard T.H. Chan School of Public Health, Harvard University, United States
  7. Botswana Harvard AIDS Institute Partnership, Botswana
  8. Division of Global HIV/AIDS and TB, Centers for Disease Control and Prevention, United States
  9. HIV Prevention Research Unit, Centers for Disease Control and Prevention, Botswana
  10. Ministry of Health, Republic of Botswana, Botswana
  11. Brigham and Women's Hospital, Division of Infectious Diseases, United States

Abstract

Background:

Mathematical models predict that community-wide access to HIV testing-and-treatment can rapidly and substantially reduce new HIV infections. Yet several large universal test-and-treat HIV prevention trials in high-prevalence epidemics demonstrated variable reduction in population-level incidence.

Methods:

To elucidate patterns of HIV spread in universal test-and-treat trials, we quantified the contribution of geographic-location, gender, age, and randomized-HIV-intervention to HIV transmissions in the 30-community Ya Tsie trial in Botswana. We sequenced HIV viral whole genomes from 5114 trial participants among the 30 trial communities.

Results:

Deep-sequence phylogenetic analysis revealed that most inferred HIV transmissions within the trial occurred within the same community or between neighboring communities, and between similarly aged partners. Transmissions into intervention communities from control communities were more common than the reverse, and this asymmetry was larger post-baseline (30% [12.2 – 56.7] vs. 3% [0.1 – 27.3]) than at baseline (7% [1.5 – 25.3] vs. 5% [0.9 – 22.9]), compatible with a benefit from treatment-as-prevention.

Conclusions:

Our findings suggest that population mobility patterns are fundamental to HIV transmission dynamics and to the impact of HIV control strategies.

Funding:

This study was supported by the National Institute of General Medical Sciences (U54GM088558), the Fogarty International Center (FIC) of the U.S. National Institutes of Health (D43 TW009610), and the President’s Emergency Plan for AIDS Relief through the Centers for Disease Control and Prevention (CDC) (Cooperative agreements U01 GH000447 and U2G GH001911).

Editor's evaluation

The study by Magosi et al. evaluates the impact of targeted public health interventions on the HIV-1 transmission rate in Botswana. Using data from a large trial in Botswana, the authors found that HIV-1 transmission occurred more commonly from control population groups into targeted population groups than vice versa. The study is of public health interest, showing how some public health interventions are powerful in reducing HIV-1 transmission but only among the population targeted. This is a very comprehensive research study showing the advantages of using deep sequencing data in combination with phylogenetic tools to assess the positive impact of public health interventions in reducing HIV-1 transmission.

https://doi.org/10.7554/eLife.72657.sa0

Introduction

The global number of new infections with HIV-1, the virus that causes AIDS, has gradually declined since the peak in 1997, yet population-level HIV incidence remains high in East and Southern Africa, especially among young women (UNAIDS, 2019; UNAIDS, 2021). Despite efforts spanning four decades, there is still no successful HIV vaccine or widely administrable cure. This emphasizes the importance of curbing new infections in order to bring the HIV epidemic under control.

Four large cluster-randomized trials were recently conducted in East (Kenya and Uganda) and Southern Africa (Botswana, South Africa, and Zambia) to evaluate the effect of universal HIV testing and treatment in reducing population-level HIV incidence (Hayes et al., 2019; Makhema et al., 2019; Havlir et al., 2019; Iwuji et al., 2018). The trials showed variable outcomes, ranging from modest to no reduction in the occurrence of new HIV infections (Abdool Karim, 2019). The trials were motivated by evidence that early initiation of antiretroviral therapy substantially reduced onward transmission in people with HIV owing to suppressed virus levels (Cohen et al., 2016) and by mathematical models predicting that widely expanded access to HIV testing and treatment would rapidly and substantially reduce the occurrence of new infections (Granich et al., 2009).

The Joint United Nations Programme on HIV/AIDS (UNAIDS) set a global target of fewer than 500,000 new HIV infections by 2020, which was not met. To stem the spread of new HIV infections we need to better understand patterns of HIV transmission in high-prevalence, generalized epidemics in sub-Saharan Africa. Obtaining such understanding requires a better grasp of where and from whom those most at risk acquire infection and to whom they are likely to spread infection. Previous work in a high HIV burden setting in KwaZulu-Natal, South Africa identified age-disparate sexual partnerships between younger women under 25 years and older men as a primary contributor to HIV transmission and incidence in younger women (de Oliveira et al., 2017).

We hypothesized that geographic proximity, gender, age, and ready access to HIV care (randomized-HIV-intervention) contribute to shaping patterns of HIV transmission. To test these hypotheses, we performed a deep-sequence phylogenetic study of adults with HIV aged 16–64 years who consented to participate in the Botswana Ya Tsie HIV prevention trial to quantify the contribution of the above factors to HIV transmission patterns in Botswana adjusting for sampling heterogeneity.

Botswana is a sparsely populated, landlocked country in southern Africa, roughly the size of France, 581,730 km² (224,610 square miles), with a population of about 2.3 million people. The Kalahari Desert occupies approximately 70% of the country, and the population is largely distributed along an eastern corridor bordered by South Africa, Zambia, and Zimbabwe (GISGeography, 2022). It is an upper-middle-income country with a Gini index of 53.3 (The World Bank, 2015). One in five adults in Botswana aged 15–49 years were reported to be living with HIV in 2019, representing the third-highest HIV prevalence in the world after Lesotho and Eswatini (UNAIDS, 2020).

Our results produced three key findings: first, most HIV transmissions inferred in the trial population occurred between partners residing in the same community or neighboring communities, and between partners of similar ages. Second, transmission events identified among trial participants sampled one or more years post-baseline were consistent with a greater flow of HIV transmissions into intervention communities from control communities than vice versa during the trial, although dates of transmission were not identified. Third, men and women sampled in the Botswana/Ya Tsie trial whose sexual partners could be inferred phylogenetically contributed similarly to the spread of HIV infection. Overall, our findings suggest that population mobility patterns are central to understanding HIV transmission dynamics and should be considered when designing and evaluating HIV control strategies.

Materials and methods

Overview of trial population

The Ya Tsie trial or Botswana Combination Prevention Project (BCPP) was a pair-matched community randomized trial to evaluate whether an optimized combination of effective HIV treatment and prevention interventions would substantially reduce population-level HIV incidence over 29 months. The trial was conducted from 2013 to 2018 in 30 rural and peri-urban communities (average population size: 5855) in the Central, South-East, and North/North-East regions of Botswana that represented an estimated total population of 175,664 (7.6% of the national population) (Figure 1). Among eligible community residents, 13,131 adults aged 16–64 years gave informed consent to participate in the trial. See Makhema et al., 2019; Gaolathe et al., 2016 for detailed eligibility criteria. The estimated HIV prevalence among participating communities at the start of the trial was 25.7% (27,446 / 106,712) (Supplementary file 1-Table 1A) (see Materials and methods section on computing age-gender estimates of the number of people with HIV in each trial community). Communities were matched into 15 pairs according to population size and age structure, access to health services, and geographic location relative to major urban centers; communities in each pair were then randomized into intervention and control arms. Intervention communities received expanded HIV testing, early initiation of antiretroviral treatment, and strengthened linkage-to-care, for example, clinic referrals with appointment dates, text alert reminders of appointments, and following-up with those who missed appointments. Intervention communities also received wider access to prevention of mother-to-child transmission services and male circumcision services, compared with control communities that received the standard of care.

Figure 1. Location of 30 rural and peri-urban communities that participated in the Botswana/Ya Tsie trial.

(A) A map of Botswana showing the spatial distribution of intervention and control communities in the Botswana/Ya Tsie trial. Intervention communities are denoted by filled blue circles and control communities are represented by filled red triangles. Trial communities are shown in the context of level-2 administrative subdivisions (solid black lines) and 2011 census enumeration areas (solid gray lines). Major urban areas are symbolized as filled orange squares. (B) Spatial distribution of the estimated population density of persons per square kilometer in Botswana in 2011. Filled blue circles represent the location of intervention communities and filled green triangles denote control communities in the Botswana/Ya Tsie trial.

Data set of people with HIV in the Botswana/Ya Tsie trial

To better understand HIV transmission patterns among participating communities, residents with HIV-1 were invited to provide a sample for viral phylogenetic analyses. This included (1) all people with HIV-1 identified during a baseline survey of 20% of households randomly sampled from each trial community to establish an incidence follow-up cohort, (2) all people with HIV-1 identified through annual household surveys conducted in this 20% household sample in all 30 communities during the trial, (3) all people with HIV-1 who were enrolled in a community-wide survey in 100% of 6 communities, 3 in the intervention and 3 in the control arms (the ‘end of study survey’), (4) all newly identified people with HIV-1 discovered during an initial community-wide HIV testing and counseling campaign followed by more targeted HIV testing in the 15 intervention communities, (5) all people with HIV-1 in intervention communities who later presented at healthcare facilities after being identified during community-wide testing campaigns, and (6) all people with HIV-1 who were already receiving HIV care at health facilities in the 15 intervention communities. Altogether, 5114 participants consented to a blood draw for HIV-1 viral genotyping and their viral genomes were successfully deep-sequenced, representing the largest phylogenetics study conducted in Botswana (Supplementary file 1-Table 1A).

Computing age-gender estimates of the number of people with HIV in each trial community

Age-gender HIV prevalence estimates in each trial community

To obtain HIV prevalence estimates, we first fit a random-intercept Poisson regression model of HIV status, with age and gender as fixed effects, using nonparametric maximum likelihood estimation in the gllamm program (Sophia, 1999; Rabe-Hesketh and Skrondal, 2012) in Stata 13.1. The model was fit to data indicating the HIV status of 12,570 participants who had consented to an HIV test and were part of a baseline survey of 20% of households randomly sampled from each of the 30 trial communities to establish an incidence follow-up cohort; the fraction of positive tests for HIV-1 infection was 28.6% (3596 / 12,570). Ages of participants were grouped in two ways: as 5-year age categories and as four age categories (16–24, 25–34, 35–49, and 50–64 years). Furthermore, the counts of people with HIV for a specific age-gender grouping $i$ in community $j$ were assumed to have a Poisson distribution with mean $u_{ij}$, so that

$$\ln(u_{ij}) = \ln(e_{ij}) + \beta_1 + \beta_2 x_{2ij} + \beta_3 x_{3ij} + \zeta_j \tag{1}$$

Accordingly, $\beta_1$ denotes the intercept, $\zeta_j$ the predicted random intercept representing unobserved heterogeneity across trial communities, $\beta_2 x_{2ij}$ and $\beta_3 x_{3ij}$ the fixed-effect covariates for age group and gender, respectively, and $\ln(e_{ij})$ an offset, where $e_{ij}$ represents the number of people tested for HIV in a specific age-gender grouping $i$ in community $j$ (i.e., exposed), so that

$$\ln(u_{ij}) - \ln(e_{ij}) = \beta_1 + \beta_2 x_{2ij} + \beta_3 x_{3ij} + \zeta_j \tag{2}$$
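As an illustration of how the offset in Equation 1 works, the following minimal Python sketch evaluates the expected count and the implied prevalence for one age-gender group in one community. It is not the gllamm fit used in the study: the function names and all numeric inputs are hypothetical, and the age-group and gender contributions are passed in as already-multiplied terms.

```python
import math


def expected_hiv_count(n_tested, beta1, age_term, gender_term, zeta_j):
    """Expected count u_ij = e_ij * exp(beta1 + beta2*x2ij + beta3*x3ij + zeta_j).

    n_tested    : e_ij, number of people tested in the group (the exposure)
    age_term    : beta2 * x2ij, already multiplied (hypothetical input)
    gender_term : beta3 * x3ij, already multiplied (hypothetical input)
    zeta_j      : community-level random intercept
    """
    linear_predictor = beta1 + age_term + gender_term + zeta_j
    return n_tested * math.exp(linear_predictor)


def implied_prevalence(n_tested, beta1, age_term, gender_term, zeta_j):
    """Prevalence implied by the model: u_ij / e_ij (Equation 2, exponentiated)."""
    count = expected_hiv_count(n_tested, beta1, age_term, gender_term, zeta_j)
    return count / n_tested


# Hypothetical coefficients, for illustration only.
example = implied_prevalence(100, math.log(0.2), 0.1, -0.05, 0.02)
```

Because the offset enters with a coefficient fixed at one, the exponentiated linear predictor can be read directly as a prevalence rather than a raw count.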

Computing empirical Bayes estimates of HIV prevalence

Next, to account for heterogeneity in HIV prevalence across trial communities we computed empirical Bayes estimates (or posterior means) for the age-gender specific HIV prevalence in each community. The empirical Bayes estimate of HIV prevalence for the jth community-age-gender combination is

$$\hat{\zeta}_j^{EB} = \hat{R}_j \, \hat{\zeta}_j^{ML} \tag{3}$$

where $\hat{R}_j = \dfrac{\hat{\psi}_j}{\hat{\psi}_j + \hat{\theta} / \hat{n}_j}$.

Here, $\hat{R}_j$ denotes a shrinkage factor that describes the proportion of the total variation, $\hat{\psi}_j + \hat{\theta} / \hat{n}_j$, attributed to differences in the age-gender specific HIV prevalence among trial communities, $\hat{\psi}_j$. Furthermore, $\hat{\zeta}_j^{ML}$ specifies the maximum likelihood estimate of $\zeta_j$. Empirical Bayes estimates were obtained with the gllapred program (Sophia, 1999; Rabe-Hesketh and Skrondal, 2012) in Stata 13.1.
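The shrinkage in Equation 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the gllapred computation: the text does not define θ or n, so the sketch treats `theta` as a within-community variance term and `n` as the number of observations contributing to community j, and all values are hypothetical.

```python
def shrinkage_factor(psi, theta, n):
    """R_j = psi / (psi + theta / n).

    psi   : between-community variance term
    theta : within-community variance term (assumed interpretation)
    n     : number of observations for community j (assumed interpretation)
    """
    if psi < 0 or theta < 0 or n <= 0:
        raise ValueError("psi and theta must be non-negative and n positive")
    return psi / (psi + theta / n)


def empirical_bayes_estimate(ml_estimate, psi, theta, n):
    """Shrink the maximum likelihood estimate toward zero by the factor R_j."""
    return shrinkage_factor(psi, theta, n) * ml_estimate


# Hypothetical values: a community with few observations is shrunk more strongly.
small_community = empirical_bayes_estimate(0.3, psi=0.2, theta=1.0, n=10)
large_community = empirical_bayes_estimate(0.3, psi=0.2, theta=1.0, n=1000)
```

With these inputs, the estimate for the smaller community is pulled further toward the overall mean than the estimate for the larger one, which is the intended behavior of the shrinkage factor.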

Estimates of size of population with HIV in trial communities

We thereafter estimated the size of the population with HIV for each age-gender category in each trial community. Empirical Bayes estimates of the age-gender specific HIV prevalence for each trial community were standardized to the proportion of participants in the 2011 Botswana population and housing census that were in the same age and gender categories of these communities. Then for a specific community-age-gender combination, for example, females aged 16–24 years in community $j$, the estimated number of people with HIV, $\hat{N}_{HIV+}[\text{Female, 16--24}]$, is computed as

$$\hat{N}_{HIV+}[\text{Female, 16--24}] = \hat{N}_{total} \cdot E \cdot \left\{ \Pr\nolimits_{census}[\text{Female, 16--24 years}] \cdot \Pr[\text{HIV+} \mid \text{Female, 16--24 years}] \right\} \tag{4}$$

where

$\hat{N}_{total}$: Estimated population size for the jth trial community based on plot and household enumeration data (Makhema et al., 2019).

$E$: Proportion of enumerated household members that were eligible to participate in the trial.

$\Pr_{census}[\text{Female, 16--24 years}]$: Joint probability denoting the proportion of census participants who are female and aged 16–24 years within community $j$.

$\Pr[\text{HIV+} \mid \text{Female, 16--24 years}]$: Empirical Bayes estimate of HIV prevalence for females aged 16–24 years that participated in a baseline survey of 20% of households randomly sampled from community $j$.

The age-gender estimates of the number of people with HIV in each trial community were used to adjust for sampling heterogeneity when estimating the proportions of HIV transmissions in the trial population (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). Population-size estimates of people with HIV were computed in R version 3.5.2 (R Development Core Team, 2020).
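The product in Equation 4 is simple enough to sketch directly. The Python function below is a minimal illustration, not the authors' R code; the function name and every input value are hypothetical.

```python
def estimated_people_with_hiv(n_total, eligible_fraction, census_share, hiv_prevalence):
    """Equation 4 for one community-age-gender combination.

    n_total           : estimated population size of the community
    eligible_fraction : E, share of enumerated household members eligible for the trial
    census_share      : census proportion in the age-gender category
    hiv_prevalence    : empirical Bayes HIV prevalence for the same category
    """
    for name, value in (("eligible_fraction", eligible_fraction),
                        ("census_share", census_share),
                        ("hiv_prevalence", hiv_prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a proportion between 0 and 1")
    return n_total * eligible_fraction * census_share * hiv_prevalence


# Hypothetical inputs: a community of 5855 people, 80% eligible,
# 10% of the census population in the category, 25% prevalence in the category.
females_16_24 = estimated_people_with_hiv(5855, 0.80, 0.10, 0.25)
```

Summing this quantity over all age-gender categories of a community would give the community-level estimate of people with HIV that serves as the denominator for sampling density.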

Deep-sequencing of HIV viral genomes for phylogenetic analyses

Because most participants were virally suppressed at the time of sampling, deep sequencing of HIV-1 viral genomes was generally done on proviral DNA (81%, n = 4142) (Novitsky et al., 2015) (see Materials and methods section on paired-end deep-sequencing of HIV viral genomes for phylogenetic analyses). We deep-sequenced near-full length viral genomes to optimize resolution and power to detect viral genetic clusters of similar HIV-1 infections (Dennis et al., 2014; Figure 1—figure supplements 1 and 2). For each participant, we thus obtained mapped short reads generated with deep-sequencing that capture the HIV-1 viral diversity in an individual and a corresponding consensus sequence that represents a summary of the HIV-1 viral population in that individual. Consensus sequences were generated from alignments of the mapped short reads by identifying the majority nucleotide call at each base position along the HIV-1 genome (Wymant et al., 2018a).
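The majority-call rule for building a consensus sequence can be sketched as follows. This is a simplified illustration, not the shiver implementation: it assumes the reads have already been placed on a common coordinate system and padded to equal length, and the function name, the gap symbols, and the toy reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, uncovered="-N"):
    """Return the most frequent base at each position of equal-length aligned reads.

    Positions where no read carries a called base are reported as "N".
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base not in uncovered)
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


# Toy example with three reads covering four positions.
toy_consensus = majority_consensus(["ACGT", "ACGA", "TCGA"])
```

The consensus keeps one base per position, whereas the mapped reads retain the within-host diversity that the later deep-sequence analyses rely on.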

Paired-end deep-sequencing of HIV viral genomes for phylogenetic analyses

Paired-end deep-sequencing was performed with Illumina MiSeq and HiSeq instruments at two locations, the BioPolymers sequencing core facility at Harvard Medical School, Boston, United States (n = 4246 participants) (Figure 1—figure supplement 1), and at the Wellcome Trust Sanger Institute, Hinxton, United Kingdom (n = 868 participants) (Figure 1—figure supplement 2) through the PANGEA consortium (Pillay et al., 2015). The sequencing success rate was relatively high, 75%–80% for first round amplicons and 95%–98% for second round amplicons. Moreover, the quality of sequencing was assessed with standard metrics for deep (or next-generation) sequencing data; however, we cannot exclude the potential for sequencing errors arising from hypermutations. The shiver sequence assembly software (Wymant et al., 2018a) was used to assemble and map each participant’s deep-sequencing short reads to a de-novo reference sequence tailored to the participant’s viral population. A listing of command-line parameters used to assemble HIV viral whole genomes with Shiver is provided in Supplementary file 1-Table 1B.

Criteria for inclusion in phylogenetic analyses

Our analysis was restricted to samples from 3832 participants (approximately 14% of the estimated 27,446 individuals living with HIV in trial communities) who had available demographic information and met minimum criteria on genome missingness (≥6300 nucleotides available) for inclusion in phylogenetic analyses. More precisely, individuals whose viral consensus genomes had fewer than 30% of the bases missing beyond the first 1000 nucleotides were retained for analysis. Sampling (or genotyping) density is defined as the estimated proportion of individuals living with HIV in a trial community that were included in our sample. Intervention communities were sampled more densely (16%; 2281 / 14,263) than control communities (12%; 1551 / 13,183), and sampling densities among all 30 trial communities ranged from 2.7% to 36.2% (Figure 2, Supplementary file 1-Table 1C) (see Materials and methods section on computing age-gender estimates of the number of people with HIV in each trial community). Based on the level of sexual mixing predicted for the trial (Novitsky et al., 2013), namely that 21% (standard error: 2.6%) of the relationships would be out-of-community partnerships (assumed for this calculation to be with communities that were not in the trial), and a genotyping density of 14%, we expected our data set to contain the viral transmitter to a genotyped individual in about one out of every nine cases of HIV-1 transmission, that is, 0.14 * (1 − 0.21) = 0.11. This estimate reflects simplifying assumptions that: (i) for a pair of individuals whose viral sequences are genetically linked, the probability of sampling one member of the pair is independent of that of sampling the other member of the pair, such that the probability of observing a pair that is genetically linked is the product of the individual sampling probabilities, and (ii) the sampling probabilities were equal for men and women with HIV. 
The genotyping densities for women and men were 15% (2730 / 17,815) and 11% (1102 / 9631), respectively. Therefore, we expected the proportion of female viral transmitters in our data set to be 0.153 * (1 − 0.21) = 0.12 and male viral transmitters to be 0.114 * (1 − 0.21) = 0.09.
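The arithmetic behind these expectations can be reproduced with a short Python sketch. The function name is hypothetical; the counts and the 21% out-of-community share are the figures quoted in the text.

```python
def expected_transmitter_coverage(n_genotyped, n_with_hiv, out_of_community=0.21):
    """Genotyping density times the share of partnerships inside trial communities."""
    genotyping_density = n_genotyped / n_with_hiv
    return genotyping_density * (1.0 - out_of_community)


overall = expected_transmitter_coverage(3832, 27446)   # all participants
women = expected_transmitter_coverage(2730, 17815)     # female transmitters
men = expected_transmitter_coverage(1102, 9631)        # male transmitters

for label, value in (("overall", overall), ("women", women), ("men", men)):
    print(f"{label}: {value:.2f}")
```

Rounded to two decimals these reproduce the 0.11, 0.12, and 0.09 reported above.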

Figure 2. A plot comparing sampling densities of HIV-1 viral genomes in intervention and control communities in the Botswana/Ya Tsie trial.

Intervention communities (blue curve) were sampled more densely than control communities (green curve). Raw data points for the density plots are displayed as a rug-plot (red) on the x-axis. The overall genotyping density across all 30 trial communities was 14% (3832 / 27,446).

Definition of seroconverters

The HIV-incidence prospective cohort of the Botswana/Ya Tsie trial comprised a random sample of 20% of households in each community to evaluate the incidence of HIV infection and adoption of the intervention during the trial. The 3832 participants included in phylogenetic analyses included 85% (124 / 146) of the seroconverters in the HIV-incidence cohort; seroconverters in the HIV-incidence cohort were defined as individuals that acquired HIV infection during the trial (Supplementary file 1-Table 1C). Seven of the 2465 baseline samples included in phylogenetic analyses were seroconverters in the HIV-incidence cohort (i.e., individuals with a negative-HIV test and a subsequent positive-HIV test sample over the course of less than a year, and thus defined as baseline) compared with 117 of the 1367 post-baseline samples.

Comparing genetic distances between HIV-1 viral consensus sequences of trial participants included in phylogenetic analyses

We compared the distances between all possible pairs (n = 7,340,196) of the 3832 HIV-1 viral whole-genome consensus sequences available for phylogenetic analyses (Figure 3A). Sequences were codon-aligned to HXB2, a standard HIV reference, with MAFFT v7.407 (Katoh et al., 2002; Katoh and Standley, 2013) and HIVAlign (https://www.hiv.lanl.gov/content/sequence/VIRALIGN/viralign.html); genetic distances were then computed under the Tamura-Nei-1993 nucleotide substitution model (Tamura and Nei, 1993) with TN93 v1.0.6 (https://github.com/veg/tn93; iGEM/UCSD evolutionary biology and bioinformatics group, 2021). A listing of parameters used to align sequences with MAFFT and HIVAlign, and to compute genetic distances with TN93, is provided in Supplementary file 1-Table 1B.
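To make the pair count and the distance measure concrete, the Python sketch below checks the number of pairwise comparisons and evaluates the textbook Tamura-Nei (1993) distance for two aligned sequences. It is not the TN93 program used in the study: the function name is hypothetical, base frequencies are estimated from the two sequences being compared (an assumption of this sketch), and ambiguous bases and gaps are simply skipped.

```python
import math
from collections import Counter
from math import comb

PURINES = frozenset("AG")


def tn93_distance(seq_a, seq_b):
    """Tamura-Nei 1993 distance (substitutions per site) between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    n = len(sites)
    if n == 0:
        raise ValueError("no comparable sites")
    counts = Counter(base for site in sites for base in site)
    pi = {base: counts[base] / (2 * n) for base in "ACGT"}
    if min(pi.values()) == 0:
        raise ValueError("all four bases must be present to estimate frequencies")
    g_r = pi["A"] + pi["G"]                     # purine frequency
    g_y = pi["C"] + pi["T"]                     # pyrimidine frequency
    p1 = sum(1 for a, b in sites if {a, b} == {"A", "G"}) / n   # A<->G transitions
    p2 = sum(1 for a, b in sites if {a, b} == {"C", "T"}) / n   # C<->T transitions
    q = sum(1 for a, b in sites if (a in PURINES) != (b in PURINES)) / n  # transversions
    k1 = 2 * pi["A"] * pi["G"] / g_r
    k2 = 2 * pi["C"] * pi["T"] / g_y
    k3 = 2 * (g_r * g_y - pi["A"] * pi["G"] * g_y / g_r - pi["C"] * pi["T"] * g_r / g_y)
    w1 = 1 - p1 / k1 - q / (2 * g_r)
    w2 = 1 - p2 / k2 - q / (2 * g_y)
    w3 = 1 - q / (2 * g_r * g_y)
    return -k1 * math.log(w1) - k2 * math.log(w2) - k3 * math.log(w3)


n_pairs = comb(3832, 2)             # all unordered pairs of 3832 sequences
reference = "ACGT" * 25             # toy 100-nt sequence
one_transition = "G" + reference[1:]
toy_distance = tn93_distance(reference, one_transition)
```

For the toy pair, a single A-to-G change over 100 sites gives a distance slightly above the raw 1% difference, reflecting the model's correction for unobserved substitutions.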

Figure 3. Histograms showing the distribution of genetic distances between HIV-1 viral whole-genome consensus sequences.

(A) A histogram showing the distribution of genetic distances between HIV-1 viral whole-genome consensus sequences of trial participants included in phylogenetic analyses. Of the 5114 trial participants who consented to viral genotyping, 3832 met criteria for inclusion in phylogenetic analyses; these were individuals whose viral genomes had fewer than 30% of the bases missing beyond the first 1000 nucleotides and a minimum genome length of 6300 nucleotides. Tamura-Nei 93 genetic distances were computed between HIV-1 viral whole-genome consensus sequences of all possible pairs (n = 7,340,196) of the 3832 trial participants available for phylogenetic analyses. To improve visibility, pairwise genetic distances below 0.05 (5%) nucleotide substitutions per site are also shown in a zoomed-in plot. The mean (± standard deviation) pairwise distance separating sequences was 12.5 ± 3.0%, and the maximum pairwise distance was 40.0% substitutions per site. (B) A histogram showing the distribution of genetic distances between viral whole-genome consensus sequences of epidemiologically linked HIV-1 subtype-C couples in the HIV Prevention Trials Network 052 study. Tamura-Nei 93 genetic distances were computed between viral whole-genome consensus sequences of all possible pairs (n = 6670) of the 116 epidemiologically linked couples with HIV-1 subtype-C infections in the HIV Prevention Trials Network 052 study. Pairwise comparisons between epidemiologically linked sequences (yellow) and ones between epidemiologically unlinked sequences (blue) are highlighted in color. The mean (± standard deviation) pairwise distance separating the sequences was 12.0 ± 4.1%, and the maximum pairwise distance was 19.6% substitutions per site. Genetic distances separating viral sequences of most epidemiologically linked couples were below 5.0% substitutions per site.

Consensus sequence phylogenetics to identify clusters of participants with genetically similar HIV-1 infections

To avoid spending time and computational resources on sequences that were too distantly related, we first identified clusters of participants with genetically similar HIV-1 infections as a filtering step, before performing ancestral host state reconstruction with Phyloscanner to detect probable directed transmission pairs. Two clustering algorithms, HIV Transmission Cluster Engine (HIV-TRACE) v0.4.4 (Wertheim et al., 2014; Kosakovsky Pond et al., 2018) and Cluster Picker v1.2.3 (Ragonnet-Cronin et al., 2013), were used to identify clusters of individuals whose HIV-1 viral whole-genome consensus sequences were genetically similar, suggesting they were probably members of a transmission chain (Poon, 2016; Rose et al., 2017). HIV-TRACE defines clusters based on pairwise genetic distances only; by comparison, Cluster Picker identifies clusters using pairwise genetic distances with the guidance of a phylogenetic tree. A multiple sequence alignment (as described in the Materials and methods section on comparing genetic distances between HIV-1 viral consensus sequences of trial participants included in phylogenetic analyses) was provided as input to HIV-TRACE and Cluster Picker. Additionally, for Cluster Picker, a corresponding phylogenetic tree inferred with FastTree2 v2.1.10 and bootstrap support values approximated with the Shimodaira-Hasegawa test (Shimodaira and Hasegawa, 1999) were provided as inputs. We defined genetic similarity clusters as groups of two or more participants whose viral whole-genome consensus sequences were separated by a genetic distance of at most 4.5% nucleotide substitutions per site and, for Cluster Picker, a bootstrap support value of at least 80%. 
The genetic distance threshold of 4.5% nucleotide substitutions per site was motivated by the distribution of genetic distances separating HIV-1 subtype C viral whole genomes of epidemiologically linked couples in the HIV Prevention Trials Network (HPTN) 052 trial (Figure 3B; Cohen et al., 2016; Eshleman et al., 2011). A listing of parameters used for consensus-sequence phylogenetics with HIV-TRACE and Cluster Picker is provided in Supplementary file 1-Table 1B.
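Distance-only clustering at a fixed threshold can be sketched as finding connected components in a graph whose edges join pairs at or below the threshold. The Python example below is a simplified illustration in that spirit and is not the HIV-TRACE or Cluster Picker code; the tree and bootstrap criteria used by Cluster Picker are not reproduced, and the function name, participant labels, and distances are hypothetical.

```python
def distance_clusters(pairwise_distances, threshold=0.045):
    """Group participants linked, directly or through others, by distances <= threshold.

    pairwise_distances maps (participant_a, participant_b) to a genetic distance.
    Only participants with at least one qualifying link appear in the output,
    so every returned cluster has two or more members.
    """
    parent = {}

    def find(member):
        parent.setdefault(member, member)
        while parent[member] != member:
            parent[member] = parent[parent[member]]   # path halving
            member = parent[member]
        return member

    for (a, b), distance in pairwise_distances.items():
        if distance <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for member in list(parent):
        clusters.setdefault(find(member), set()).add(member)
    return sorted((sorted(group) for group in clusters.values()), key=lambda g: g[0])


# Hypothetical distances in substitutions per site.
toy_distances = {
    ("p1", "p2"): 0.010,
    ("p2", "p3"): 0.040,
    ("p1", "p3"): 0.060,
    ("p3", "p4"): 0.100,
    ("p4", "p5"): 0.200,
}
toy_clusters = distance_clusters(toy_distances)
```

In the toy data, p1 and p3 are more than 4.5% apart yet fall in the same cluster because both are close to p2, which shows why such clusters are treated only as a filter before directed pairs are inferred.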

Deep-sequence phylogenetics to infer the probable order of transmission events within identified clusters of genetically similar HIV-1 infections

We performed parsimony-based ancestral host state reconstruction of the mapped deep-sequencing short reads of participants in clusters of genetically similar HIV-1 infections (n = 525) with the Phyloscanner software v1.8.0 (Grabowski et al., 2018; Ratmann et al., 2019; Wymant et al., 2018b) to identify probable transmission pairs within clusters and the probable direction of transmission between them. Phyloscanner identifies potential transmission pairs by inferring whether the viral population in an individual is ancestral to or descendant from that of another individual, and crucially addresses contamination by excluding duplicates and phylogenetic outliers. To infer phylogenetic linkage and direction of transmission for a set of individuals, Phyloscanner aligns submitted mapped reads in sliding windows along the genome that are matched across individuals and, at each window, infers a phylogeny for ancestral state reconstruction to identify probable ancestral relationships between all possible pairs of individuals in the set. Pairs of individuals are then classified as phylogenetically linked or unlinked based on the distance, adjacency, and topology of their subgraphs. A subgraph in Phyloscanner refers to all tips and internal nodes of a phylogeny assigned to an individual through parsimony-based ancestral state reconstruction. A pair of individuals, i and j, is considered phylogenetically linked (more specifically, phylogenetic linkage is not excluded) when the minimum patristic distance between their subgraphs, Δij, is below a set threshold and i and j are adjacent, meaning that the shortest path connecting their subgraphs in a phylogeny does not pass through a third individual. Tree topology is used to infer the probable direction of transmission between a pair of individuals and refers to the number of subgraphs belonging to each individual in the pair that are ancestral-to or descendant-from those of the other individual. 
We classified pairs of individuals separated by a subgraph distance below 0.035 substitutions per site as phylogenetically linked based on previous work by Ratmann et al., 2019. The subgraph distance distribution of all pairs identified with Phyloscanner within genetic similarity clusters is shown in Figure 4. A listing of command-line parameters used for deep-sequence phylogenetics with Phyloscanner is provided in Supplementary file 1-Table 1B. Identified genetic similarity clusters were analyzed with Phyloscanner in parallel for computational efficiency using shell scripts that are available upon request.
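A per-pair summary across genomic windows, followed by the 0.035 distance cut-off, might be sketched as below. This is an illustration only and does not reproduce Phyloscanner's internal logic: the record layout, the function name, and in particular the rule that a pair must be adjacent in more than half of its windows are assumptions made for this sketch.

```python
from collections import Counter
from statistics import median


def summarise_pair(windows, distance_threshold=0.035):
    """Summarise one pair of individuals across genomic windows.

    windows is a list of (subgraph_distance, adjacent, topology) tuples,
    one per window in which both individuals had enough reads.
    """
    if not windows:
        raise ValueError("at least one window is required")
    median_distance = median(distance for distance, _, _ in windows)
    topology = Counter(label for _, _, label in windows).most_common(1)[0][0]
    # Assumption of this sketch: adjacency must hold in a majority of windows.
    mostly_adjacent = sum(1 for _, adjacent, _ in windows if adjacent) > len(windows) / 2
    return {
        "median_distance": median_distance,
        "most_frequent_topology": topology,
        "linked": median_distance < distance_threshold and mostly_adjacent,
    }


# Hypothetical window-level results for one pair.
toy_pair = summarise_pair([
    (0.010, True, "ancestral"),
    (0.020, True, "ancestral"),
    (0.050, True, "sibling"),
])
```

Using the median rather than the mean keeps a single divergent window, such as the third one in the toy data, from changing the classification.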

Figure 4. Barplots showing subgraph distance distributions between deep-sequenced HIV-1 viral whole genomes of trial participants in clusters of genetically similar HIV-1 infections.

Within each genetic cluster parsimony-based ancestral host state reconstruction was done with Phyloscanner (Grabowski et al., 2018; Ratmann et al., 2019; Wymant et al., 2018b) to identify probable transmission pairs and resolve the probable order of transmission events. Thereafter, within each identified genetic cluster, the median subgraph distance and most frequent subgraph topology between each pair of individuals were determined across all deep-sequence phylogenies along the genome where the pair had sufficient mapped reads for phylogenetic inference. A subgraph refers to all tips and internal nodes of a phylogeny assigned to an individual through parsimony-based ancestral state reconstruction. Subgraph distances were standardized to the mean rate of evolution for HIV-1 group-specific antigen (GAG) and polymerase (POL) genes to account for variation in mutation rates along the genome. The three panels show median subgraph distance distributions of pairs in the union (A) of genetic clusters detected with HIV-TRACE (B) and Cluster Picker (C) algorithms. Within each panel, phylogenetically linked pairs with an inferred direction of transmission are summarized into same-sex (male-male and female-female) and opposite-sex pairs (male-female). The subgraphs of deep-sequenced HIV-1 viral genomes of most phylogenetically linked directed pairs were separated by a patristic distance below 0.035 (3.5%). Same-sex pairs were presumed to be members of transmission chains with missing intermediates.

Identifying probable source-recipient pairs with strong phylogenetic evidence for linkage and direction of transmission

Probable transmission pairs identified with Phyloscanner were further classified on the strength of evidence for phylogenetic linkage, $\hat{\lambda}_{ij}$, to separate pairs with strong support for linkage and direction of transmission from other potentially linked pairs, accounting for the extent of overlap in the read alignments from which the deep-sequence phylogenies were inferred. The strength of phylogenetic evidence, or linkage score, is defined as $\hat{\lambda}_{ij} = k_L / n$, where $n$ is the number of deep-sequence phylogenies inferred in windows along the genome for which individuals $i$ and $j$ had sufficient deep-sequence reads for phylogenetic inference, and $k_L$ is the number of those phylogenies that support a specific phylogenetic relationship type between individuals $i$ and $j$, for example, that the subgraphs of individual $i$ are ancestral-to, descendent-from, sibling-to, or intermingled-with those of individual $j$. Sibling and intermingled topologies are ones where there is insufficient evidence to make inferences about the probable direction of transmission (Ratmann et al., 2019). We used a linkage and direction of transmission score threshold of 57% ($\hat{\lambda}_{ij} > c$; $c = 0.57$) or, on average, at least 24 out of 42 windows supporting linkage and direction of transmission for a highly supported transmission pair. Our Phyloscanner analyses were based on windows (mapped read alignments) that were 200 bp long. We excluded the first 1000 nucleotides of the HIV-1 genome from analysis as they are typically poorly sequenced, leaving a genome length of 8400 bp and therefore $n = 42$ (8400 / 200) non-overlapping alignments and deep-sequence trees along the genome.
The linkage and direction of transmission score threshold, $c \in (0, 1)$, was selected such that the posterior probability that $\hat{\lambda}_{ij} > 50\%$ (that is, that at least half of the windows along the genome support an ancestral relationship type between a pair of individuals, $i$ and $j$) exceeds 80%: $p(\hat{\lambda}_{ij} > 0.5 \mid k_L, n) > \alpha$, where $\alpha = 80\%$ and $k_L \sim \text{Binomial}(n, \hat{\lambda}_{ij})$. The threshold was computed with the Bayesian binomial test function in the BayesianFirstAid R package v0.1.
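
The threshold selection can be illustrated with a short calculation. The sketch below is not the authors' code (they used the BayesianFirstAid R package); it is a minimal Python illustration that assumes a uniform Beta(1, 1) prior on the linkage score, so that the posterior is Beta($k_L$ + 1, $n$ − $k_L$ + 1) and its upper tail can be written as an exact binomial sum. The function names are hypothetical.

```python
from math import comb


def posterior_prob_above(k, n, x=0.5):
    """P(lambda > x | k, n), assuming a uniform Beta(1, 1) prior.

    The posterior is Beta(k + 1, n - k + 1). For integer shape parameters its
    upper tail equals the binomial sum P(Binomial(n + 1, x) <= k).
    """
    m = n + 1
    return sum(comb(m, j) * x**j * (1 - x) ** (m - j) for j in range(k + 1))


def min_supporting_windows(n, x=0.5, alpha=0.8):
    """Smallest number of supporting windows k with P(lambda > x | k, n) > alpha."""
    for k in range(n + 1):
        if posterior_prob_above(k, n, x) > alpha:
            return k
    return None


# n = 42 non-overlapping 200-bp windows, as described in the text.
k_min = min_supporting_windows(42)
threshold = k_min / 42
```

Under this assumed prior, the smallest number of supporting windows out of 42 is 24, which matches the 57% threshold quoted above.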

Estimating error rates in phylogenetic inference of direct transmission between sampled males and females

HIV-1 infections in southern Africa are more commonly transmitted through heterosexual contact between males and females, compared with sexual contact between same-sex couples. Thus, we can use the probability of inferring a phylogenetically linked male-male pair to calibrate an upper bound on the false discovery rate (FDR) of inferring a phylogenetically linked male-female pair, if we assume the probability of erroneously inferring direct transmission between a male-female pair is the same as the probability of inferring a phylogenetically linked male-male pair (Ratmann et al., 2019).

Calibrating a false discovery rate for inferring direct transmission

Consider a deep-sequence phylogenetic analysis where $L_{mm}$ male-male pairs and $L_{mf}$ male-female pairs were phylogenetically inferred in the sequenced sample. For $S_m$ successfully sequenced males and $S_f$ successfully sequenced females, let the probability of inferring a phylogenetically linked male-male pair in the sample be

$$\frac{L_{mm}}{S_m (S_m - 1) / 2}, \tag{5}$$

where $L_{mm}$ denotes the number of phylogenetically linked male-male pairs that were identified and $S_m (S_m - 1) / 2$ represents the number of distinct possible male-male pairs that could have been identified in the sample. The estimated number of phylogenetically linked male-female pairs between whom direct transmission did not occur would then be

$$\hat{F}^{c}_{mf} = \frac{L_{mm}}{S_m (S_m - 1) / 2} \, S_m S_f. \tag{6}$$

Thereby affording a FDR estimate of,

$$\hat{\rho}^{c}_{mf} = \frac{\hat{F}^{c}_{mf}}{L_{mf}}. \tag{7}$$

This represents an upper bound on the FDR of inferring direct transmission between males and females in phylogenetically linked male-female pairs.
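
As a worked illustration of equations 5 to 7, the sketch below computes the calibrated upper bound in Python. The function name is hypothetical, and the counts in the example call are toy values chosen for easy arithmetic; they are not the trial's data.

```python
def fdr_upper_bound(l_mm, l_mf, s_m, s_f):
    """Return (expected falsely linked male-female pairs, FDR upper bound).

    l_mm, l_mf: numbers of linked male-male and male-female pairs.
    s_m, s_f:   numbers of successfully sequenced males and females.
    """
    # Equation 5: probability of inferring a linked male-male pair.
    pair_prob = l_mm / (s_m * (s_m - 1) / 2)
    # Equation 6: apply that probability to all possible male-female pairs.
    expected_false = pair_prob * s_m * s_f
    # Equation 7: divide by the number of linked male-female pairs found.
    return expected_false, expected_false / l_mf


# Toy counts (assumed for illustration only).
false_pairs, fdr = fdr_upper_bound(l_mm=2, l_mf=40, s_m=10, s_f=20)
```

With these toy counts, 2 of 45 possible male-male pairs are linked, so about 8.9 of the 200 possible male-female pairs would be expected to be linked without direct transmission, giving a bound of roughly 22%.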

Adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions

To estimate HIV transmission flows in the trial population within and between different age-gender groups and locations (trial communities, trial arms, and geographic regions), accounting for variable rates of sampling, we used the method described by Carnegie et al., 2014. The method by Carnegie et al. uses a frequentist approach; alternatively, a Bayesian approach could be employed as described by Ratmann et al., 2020.

Data set and sampling probability

Consider a population of individuals with HIV, $\Omega$, of finite size $N$, partitioned into $u = 1, \ldots, G$ disjoint groups so that $N_u$ represents the number of individuals with HIV in group $u$. Groups might represent communities (or geographic locations), age-categories, genders, or trial arms. Let us suppose that we would like to estimate the proportions of HIV transmissions that occurred within and between intervention and control communities in the Botswana/Ya Tsie trial so that group $u$ are individuals residing in intervention communities and group $v$ are individuals residing in control communities. We denote $n_u$ and $n_v$ as the number of individuals randomly sampled from groups $u$ and $v$, respectively, whose HIV-1 viral genomes were successfully sequenced, yielding a sampling probability $s_u = n_u / N_u$ for group $u$ and, similarly, $s_v = n_v / N_v$ for group $v$. The vector

$$z = (z_{uu}, z_{uv}, z_{vu}, z_{vv}) \tag{8}$$

describes the counts of directed (source-recipient) transmission pairs identified in the sequenced sample within and between intervention and control communities, where $z_{uu}$ denotes transmission pairs within intervention communities; $z_{vv}$ denotes transmission pairs within control communities; $z_{uv}$ denotes transmission pairs for whom a member of an intervention community was the inferred source of transmission to a control community member; and $z_{vu}$ denotes the reverse. We assume that HIV transmission pairs that are absent from the sequenced sample are missing at random conditional on group membership, meaning that the identified probable source-recipient transmission pairs are a random sample of the source-recipient transmission pairs present in the trial population. Let

$$n_{uv} = \begin{cases} n_u n_v, & u \neq v \\ n_u (n_u - 1) / 2, & u = v \end{cases} \tag{9}$$

be the number of distinct possible transmission pairs among individuals sampled from $u$ and $v$; similarly, let $N_{uv} = N_u N_v$ (or $N_u (N_u - 1) / 2$ if $u = v$) be the number of distinct transmission pairs available between trial population groups $u$ and $v$.

Probability of viral linkage between a pair of sampled individuals

Estimates of probabilities of linkage between two individuals randomly sampled from within demographic groups are obtained using the method of Carnegie et al., 2014. The method assumes that samples are missing at random from within demographic groups and that the relative increase in weight of individuals from a given group between the observed and true sample is the same as the relative increase in weight between sub-samples of the observed data (obtained from the observed sampling rate) and the observed data.

In accordance with Carnegie et al., 2014 and accounting for the probable direction of transmission, we estimate the probability of viral linkage between two individuals randomly sampled from their respective groups for each group-pairing as

$$\hat{p} = (\hat{p}_{uu}, \hat{p}_{uv}, \hat{p}_{vu}, \hat{p}_{vv}), \tag{10}$$

where $\hat{p}_{ij} = z_{ij} / n_{ij}$, $i = u, v$ and $j = u, v$.

Proportion of HIV transmissions in the trial population adjusted for sampling heterogeneity

We next estimate the relative probability that an HIV transmission event in the trial population occurred from sexual contact between partners residing within intervention communities or within control communities, compared with partners where one resides in an intervention community and the other in a control community or vice versa. Accordingly, we estimate the conditional probability, $\hat{\theta}_{ij}$, that a pair of individuals are from a specific group-pairing given that their HIV-1 viral genomes are linked as

$$\hat{\theta}_{ij} = P(\text{pair is from groups } i \text{ and } j \mid \text{pair is linked}) = \frac{N_{ij} \hat{p}_{ij}}{\sum_{m} \sum_{n \geq m} N_{mn} \hat{p}_{mn}}, \quad i = u, v \text{ and } j = u, v. \tag{11}$$

Here, $N_{ij} \hat{p}_{ij}$ denotes the estimated number of HIV transmissions in the trial population attributed to groups $i$ and $j$. Similarly, $\sum_{m} \sum_{n \geq m} N_{mn} \hat{p}_{mn}$ represents the number of HIV transmissions in the trial population for all group-pairings. Thus, the vector

$$\hat{\theta} = (\hat{\theta}_{uu}, \hat{\theta}_{uv}, \hat{\theta}_{vu}, \hat{\theta}_{vv}) \tag{12}$$

describes estimated proportions of HIV transmissions in the trial population within and between intervention and control communities adjusted for differential sampling by trial arm.
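
The adjustment in equations 8 to 12 can be sketched in a few lines of Python. This is an illustrative sketch, not the bumblebee R package: the function names and the toy counts are hypothetical, and the denominator is taken over every directed group-pairing supplied in `z` so that the estimated proportions sum to one, which is an assumption about how the normalizing sum is evaluated for directed pairs. It also returns the counts rescaled to the observed number of pairs.

```python
def pair_count(size_i, size_j, same_group):
    """Distinct possible pairs within one group or between two groups (eq. 9)."""
    return size_i * (size_i - 1) / 2 if same_group else size_i * size_j


def estimate_flows(z, n, N):
    """Sampling-adjusted transmission flows.

    z: {(source_group, recipient_group): observed directed pair count}
    n: {group: number sampled}; N: {group: estimated population with HIV}
    Returns (p_hat, theta_hat, delta), each keyed like z.
    """
    p_hat, weighted = {}, {}
    for (i, j), count in z.items():
        # Eq. 10: probability of linkage among sampled pairs.
        p_hat[(i, j)] = count / pair_count(n[i], n[j], i == j)
        # Numerator of eq. 11: estimated transmissions in the population.
        weighted[(i, j)] = pair_count(N[i], N[j], i == j) * p_hat[(i, j)]
    total = sum(weighted.values())  # assumed: sum over all directed pairings
    theta_hat = {pair: w / total for pair, w in weighted.items()}
    z_plus = sum(z.values())
    # Weighted counts rescaled to the observed number of pairs.
    delta = {pair: z_plus * t for pair, t in theta_hat.items()}
    return p_hat, theta_hat, delta


# Toy example (assumed values): u = intervention, v = control communities.
z = {("u", "u"): 12, ("u", "v"): 3, ("v", "u"): 5, ("v", "v"): 10}
n = {"u": 40, "v": 20}
N = {"u": 100, "v": 100}
p_hat, theta_hat, delta = estimate_flows(z, n, N)
```

When every group is fully sampled ($n = N$), the adjusted proportions reduce to the raw proportions $z_{ij} / z_{+}$, which is a convenient sanity check.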

Computation of confidence intervals for estimated HIV transmission flows in the trial population

We use the Goodman method (Goodman, 1965) to compute simultaneous confidence intervals (CIs) at the 5% significance level for the estimated proportions of HIV transmissions within and between population groups (or strata). A continuity correction factor is implemented for the Goodman method’s CIs to account for small sample sizes (Cherry, 1996). The Goodman method estimates simultaneous CIs for the parameters of a multinomial distribution and thus assumes independence of observations.

To estimate Goodman’s CIs, the estimated number of HIV transmission pairs in the trial population within and between intervention and control communities are scaled such that the sum of the weights is equal to the total number of probable male-female transmission pairs identified in the deep-sequenced sample (n = 82), and the weights are treated as known; this is broadly similar to calibration of weights in survey sampling. For example, the weighted counts of probable male-female transmission pairs within and between intervention and control communities are given by the vector

$$\delta = z_{+} \cdot (\hat{\theta}_{uu}, \hat{\theta}_{uv}, \hat{\theta}_{vu}, \hat{\theta}_{vv}) = (z_{+}\hat{\theta}_{uu}, z_{+}\hat{\theta}_{uv}, z_{+}\hat{\theta}_{vu}, z_{+}\hat{\theta}_{vv}), \tag{13}$$

where $z_{+} = \sum_{i} \sum_{j} z_{ij}$, $i = u, v$ and $j = u, v$.

This ensures that the estimated Goodman’s CIs of HIV transmission flows within and between groups/strata in the trial population reflect the amount of information in the sample.

Let the weighted counts of the probable male-female transmission pairs identified within and between intervention and control communities, $(z_{+}\hat{\theta}_{uu}, z_{+}\hat{\theta}_{uv}, z_{+}\hat{\theta}_{vu}, z_{+}\hat{\theta}_{vv})$, denote the observed cell frequencies, $\delta_1, \delta_2, \ldots, \delta_k$, of a sample of size $z_{+}$ from a multinomial distribution with population parameters $\pi_1, \pi_2, \ldots, \pi_k$, and $l = 1, 2, \ldots, k$ population strata, respectively.

For example, $\delta_l$ represents the weighted counts of probable male-female transmission pairs identified in the $l$th stratum, and $\pi_l$ the corresponding probability that a probable male-female transmission pair falls within the $l$th stratum. In this case, we have $k = 4$ population strata that represent HIV transmission pairs within intervention communities; within control communities; transmission pairs for whom a member of an intervention community was the inferred source of transmission to a control community member; and vice versa. The point estimate for $\pi_l$, or estimated proportion of HIV transmissions in the trial population in the $l$th stratum, is $\hat{\pi}_l = \delta_l / z_{+}$. The set of $k$ simultaneous CIs at the 5% significance level for the $k$ population-level HIV transmission flows, $\pi_1, \pi_2, \ldots, \pi_k$, is given by (Cherry, 1996)

$$\hat{\pi}_l^{-} \leq \pi_l \leq \hat{\pi}_l^{+} \quad \text{for } l = 1, 2, \ldots, k,$$

where the lower confidence bounds are described by,

$$\hat{\pi}_l^{-} = \frac{B + 2(\delta_l - 0.5) - \sqrt{B \left( B + 4(\delta_l - 0.5)(z_{+} - \delta_l + 0.5) / z_{+} \right)}}{2(z_{+} + B)}$$

and the upper confidence bounds are described by,

$$\hat{\pi}_l^{+} = \frac{B + 2(\delta_l + 0.5) + \sqrt{B \left( B + 4(\delta_l + 0.5)(z_{+} - \delta_l - 0.5) / z_{+} \right)}}{2(z_{+} + B)}.$$

Here, $B$ denotes the 95th percentile of a chi-squared distribution with $k - 1$ degrees of freedom. Note that $\hat{\pi}_l^{-} = 0$ if $\delta_l = 0$, and $\hat{\pi}_l^{+} = 1$ if $\delta_l = z_{+}$.
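
The continuity-corrected bounds above translate directly into code. The Python sketch below is illustrative only and is not the bumblebee implementation; the function name and the example weighted counts are hypothetical. The critical value $B$ is passed in as a parameter, and the example uses 7.815, the tabulated 95th percentile of a chi-squared distribution with 3 degrees of freedom ($k = 4$ strata).

```python
from math import sqrt


def goodman_ci(delta, b):
    """Continuity-corrected simultaneous CIs for multinomial proportions.

    delta: weighted counts per stratum (they sum to z_plus).
    b:     chi-squared critical value, supplied by the caller.
    Returns a list of (lower, point_estimate, upper) tuples.
    """
    z_plus = sum(delta)
    denom = 2 * (z_plus + b)
    intervals = []
    for d in delta:
        if d == 0:
            lower = 0.0  # boundary case stated in the text
        else:
            lo = d - 0.5
            root = sqrt(b * (b + 4 * lo * (z_plus - d + 0.5) / z_plus))
            lower = max(0.0, (b + 2 * lo - root) / denom)
        if d == z_plus:
            upper = 1.0  # boundary case stated in the text
        else:
            hi = d + 0.5
            root = sqrt(b * (b + 4 * hi * (z_plus - d - 0.5) / z_plus))
            upper = min(1.0, (b + 2 * hi + root) / denom)
        intervals.append((lower, d / z_plus, upper))
    return intervals


# Hypothetical weighted counts for k = 4 strata summing to z_plus = 82.
example = goodman_ci([41.0, 20.5, 12.3, 8.2], b=7.815)
```

Each interval brackets its point estimate, and for a stratum holding exactly half of the pairs the lower and upper bounds are symmetric about 0.5.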

We have implemented the algorithms to estimate HIV transmission flows within and between population groups accounting for sampling heterogeneity, together with the corresponding CIs, as an R package, bumblebee, that will be made available at the following URL: https://magosil86.github.io/bumblebee/ . A step-by-step tutorial on how to estimate HIV transmission flows with bumblebee and accompanying example data sets can be accessed at the following URL: https://github.com/magosil86/bumblebee/blob/master/vignettes/bumblebee-estimate-transmission-flows-and-ci-tutotial.md (Magosil, 2021).

Calculation of the weighted mean age gap between males and females in inferred probable transmission pairs accounting for variability in sampling of the trial population by gender and 5-year age group

Overview

To compare the ages of males and females in inferred male-female probable transmission pairs we computed weighted (arithmetic) mean age-gaps between males and females in (1) all inferred probable transmission pairs (n = 82), (2) male-to-female transmission events (n = 45), and (3) female-to-male transmission events (n = 37).

Data set and sampling probabilities

Consider a deep-sequence phylogenetic analysis where $L_{mf}$ male-female probable transmission pairs/events were phylogenetically inferred in the sequenced sample. Here, the sequenced sample refers to individuals whose HIV-1 viral genomes were successfully sequenced and met minimum criteria for inclusion in phylogenetic analysis. Let individuals in the sequenced sample be grouped by gender into 5-year age-categories according to their age at enrollment in the study so that the number of successfully sequenced males in a specific 5-year age category is denoted by $n_{m(\text{5-year age group})}$ and the number of successfully sequenced females by $n_{f(\text{5-year age group})}$. Similarly, let $N_{m(\text{5-year age group})}$ and $N_{f(\text{5-year age group})}$ denote the estimated number of males and females, respectively, with HIV in the trial population in a specific 5-year age category. The sampling probabilities of male and female partners in a probable transmission pair are then given by $s_{m(\text{5-year age group})} = n_{m(\text{5-year age group})} / N_{m(\text{5-year age group})}$ for the male partner and $s_{f(\text{5-year age group})} = n_{f(\text{5-year age group})} / N_{f(\text{5-year age group})}$ for the female partner.

Computation of mean weighted age gap

Let $g_i$ denote the age gap in years between male and female partners in the $i$th transmission pair (i.e., male age – female age) and $w_i$ represent the weight assigned to the $i$th transmission pair. The weight of a male-female probable transmission pair is computed as the inverse of the product of the sampling probabilities of the male and female partners in the pair,

$$w_i = \frac{1}{s_{m(\text{5-year age group})} \, s_{f(\text{5-year age group})}}. \tag{14}$$

Then the weighted mean age gap is given by

$$\bar{g} = \frac{\sum_{i=1}^{n} w_i g_i}{\sum_{i=1}^{n} w_i}, \tag{15}$$

which can be expanded as:

$$\bar{g} = \frac{w_1 g_1 + w_2 g_2 + \cdots + w_n g_n}{w_1 + w_2 + \cdots + w_n}. \tag{16}$$

Estimation of the weighted age gap variance: we compute the variance of the age-gaps between males and females in probable transmission pairs as:

$$\text{Var(weighted age gap)} = \frac{\sum_{i=1}^{n} w_i (g_i - \bar{g})^2}{V_1 - (V_2 / V_1)}, \tag{17}$$

where $V_1 = \sum_{i=1}^{n} w_i$ represents the sum of the weights, and $V_2 = \sum_{i=1}^{n} w_i^2$ the sum of the squared weights, respectively.
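
Equations 14 to 17 can be checked with a small Python sketch. The function name and the example values are hypothetical and serve only to illustrate the calculation; they are not trial data.

```python
def weighted_age_gap(gaps, sampling_probs):
    """Weighted mean and variance of age gaps.

    gaps:           age gap (male age minus female age) for each pair.
    sampling_probs: (s_m, s_f) sampling probabilities for each pair's
                    male and female 5-year age groups.
    """
    # Eq. 14: inverse of the product of the partners' sampling probabilities.
    weights = [1.0 / (s_m * s_f) for s_m, s_f in sampling_probs]
    v1 = sum(weights)
    v2 = sum(w * w for w in weights)
    # Eq. 15: weighted mean age gap.
    mean = sum(w * g for w, g in zip(weights, gaps)) / v1
    # Eq. 17: weighted variance with denominator V1 - V2 / V1.
    var = sum(w * (g - mean) ** 2 for w, g in zip(weights, gaps)) / (v1 - v2 / v1)
    return mean, var


# Hypothetical pairs: equal sampling probabilities for every pair.
mean_equal, var_equal = weighted_age_gap([2, 4, 6], [(0.5, 0.5)] * 3)
# Hypothetical pairs: the second pair comes from sparsely sampled age groups.
mean_w, var_w = weighted_age_gap([2, 4, 6], [(0.5, 0.5), (0.1, 0.5), (0.5, 0.5)])
```

With equal weights the estimator reduces to the ordinary arithmetic mean and the unbiased sample variance, because $V_1 - V_2 / V_1 = n - 1$ in that case.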

Permutation test to evaluate whether there was preferential sexual mixing among trial communities by geographic proximity

To evaluate whether people were more likely to form out-of-community sexual partnerships with partners from nearby communities than from communities further away, we performed a permutation test under the null hypothesis that the mean travel distance (kilometers) and drive time (hours) between communities where HIV transmission events were identified was no different from that between any pair of randomly selected trial communities. We obtained null distributions for the mean travel distance and drive time, respectively, by permuting the order of community pairs over 10,000 iterations. The underlying correlation structure of viral genetic linkage was preserved, for example, transmission events that occurred between members of two communities or between members of a community and two or more other communities. A one-sided p value was obtained as the fraction of iterations where the mean travel distance between permuted community pairs was smaller than that for community pairs where HIV transmission events were identified; similarly, a one-sided p value was derived for travel time. Travel distances and drive times were sourced from the Google Distance Matrix API (application programming interface) with the mapsapi R package v0.4.2 and will be made available as an R data set. The permutation test was performed using the R statistical software version 3.5.2.
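
One way to read the permutation scheme is sketched below in Python. This is an assumption-laden illustration, not the authors' R code: it permutes community labels, which keeps the linkage structure intact (a community linked to two others stays linked to two others), and it counts permuted means that are less than or equal to the observed mean, a slightly conservative variant of the "smaller than" rule in the text. The function names and the toy distance matrix are hypothetical.

```python
import random


def mean_linked_distance(dist, linked_pairs, labels):
    """Mean distance over linked community pairs after relabelling."""
    total = sum(dist[labels[a]][labels[b]] for a, b in linked_pairs)
    return total / len(linked_pairs)


def permutation_p_value(dist, linked_pairs, n_iter=10000, seed=0):
    """One-sided p value that linked communities are closer than random ones."""
    rng = random.Random(seed)
    communities = list(range(len(dist)))
    observed = mean_linked_distance(dist, linked_pairs, communities)
    hits = 0
    for _ in range(n_iter):
        shuffled = communities[:]
        rng.shuffle(shuffled)  # random relabelling of communities
        if mean_linked_distance(dist, linked_pairs, shuffled) <= observed:
            hits += 1
    return observed, hits / n_iter


# Toy data: ten communities spaced 50 km apart along one road.
toy_dist = [[abs(i - j) * 50.0 for j in range(10)] for i in range(10)]
toy_links = [(0, 1), (2, 3), (4, 5), (6, 7)]  # only neighbours are linked
observed_mean, p_value = permutation_p_value(toy_dist, toy_links, n_iter=2000)
```

In the toy example every linked pair is a pair of neighbours, so very few relabellings produce a mean distance as small as the observed one and the p value is small.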

Results

Consensus sequence phylogenetics to identify clusters of participants with genetically similar HIV-1 infections

Most participants with HIV at time of enrollment in the trial were on antiretroviral treatment and virally suppressed

Out of the 5114 participants who provided a sample for HIV viral genotyping, 3178 were sampled at baseline and 1936 were sampled post-baseline, that is, about a year or more after the end of baseline survey activities in their communities. At the time of identification, most (86%, n = 4410) individuals with HIV-1 infection were on antiretroviral therapy and virally suppressed (≤400 copies per ml of blood plasma). Compared to the 2011 Botswana census population of trial communities, our data set underrepresents men (n = 1475 men vs. 3639 women) and individuals aged 16–24 years (Supplementary file 1-Table 1A).

Most of the HIV viral genomes sampled were from prevalent infections, and typical genetic distances were large

Out of the 5114 participants whose HIV-1 viral genomes were deep-sequenced, 3832 participants met minimum criteria for inclusion in phylogenetic analyses (see Materials and methods); and of those 3832 participants, 2465 were sampled at baseline and 1367 sampled post-baseline. The mean ± standard deviation pairwise genetic distance between HIV-1 viral whole-genome consensus sequences of participants included in phylogenetic analyses was 12.5 ± 3.0%, and the maximum pairwise distance was 40% substitutions per site (Figure 3A) (see Materials and methods on comparing genetic distances between HIV-1 viral consensus sequences of trial participants included in phylogenetic analyses). The large distances between sequences possibly reflect sparse sampling and/or the sampling of a large percentage of trial participants relatively late in infection. HIV-1 viral populations typically diverge over time owing in part to within-host evolution.

About one in seven genotyped participants included in consensus sequence phylogenetic analyses were assigned to genetic similarity clusters of people with closely related HIV-1 infections

We identified 236 genetic similarity clusters comprising HIV-1 viral consensus sequences from 14% (525 / 3832) of the participants (Figure 5) (see Materials and methods section on consensus sequence phylogenetics to identify clusters of participants with genetically similar HIV-1 infections). This represents the union of clusters detected with HIV-TRACE (122 clusters, comprising 283 consensus sequences) and Cluster Picker (231 clusters, comprising 484 consensus sequences) algorithms (Figure 6). To maximize the number of transmission pairs that could be identified between participants in the Botswana/Ya Tsie trial with Phyloscanner we used the union of clusters detected with HIV-TRACE and Cluster Picker. There was substantial overlap between the two clustering algorithms, 96% (117 / 122) of the genetic clusters found with HIV-TRACE were also detected with Cluster Picker (Figure 6). Only 4% (21 / 525) of the participants assigned to genetic similarity clusters were seroconverters from the HIV-incidence cohort (see Materials and methods section on definition of seroconverters).

A dendrogram showing clusters of genetically similar HIV-1 infections in the Botswana/Ya Tsie trial.

Genetic clusters are summarized by randomized-HIV-intervention condition, geographical region, and occurrence within and between trial communities. Clusters were identified from 3832 HIV-1 viral whole-genome consensus sequences as two or more sequences separated by a genetic distance not exceeding 0.045 (4.5%) substitutions per site and a bootstrap support threshold of at least 80%. More genetic clusters occurred within trial communities and geographical regions than between them.

Barplots showing the size distribution of clusters of genetically similar HIV-1 infections in the Botswana/Ya Tsie trial.

The bottom two plots show the size distribution of genetic clusters identified with HIV-TRACE (n = 122) and Cluster Picker algorithms (n = 231), respectively; and the top two plots show the size distribution of the union (n = 236) and overlap (n = 117) of genetic clusters detected with HIV-TRACE and Cluster Picker. Most clusters were small-sized comprising two or three members.

Most genetic clusters had few members and limited geographic spread

The sizes of genetic clusters ranged from 2 to 12 members, with 97% (230 / 236) of clusters comprising two or three members (Figure 6). The two largest clusters had 9 and 12 members, respectively, none of whom were seroconverters from the HIV-incidence cohort; their members resided in communities in all three geographic regions. Overall, however, genetic clusters were more often concentrated within geographic regions (Central-East, North/North-East, and South-East) than spread out across them (Figure 5), with 52% (124 / 236) of the clusters entirely within a single community and 42% (100 / 236) localized to two communities (Figure 7). Moreover, genetic clusters were relatively evenly spread across trial arms; 38% (90 / 236) were unique to intervention communities; 33% (78 / 236) to control communities; and 29% (68 / 236) had genetic cluster members from both intervention and control communities (Figure 5). The number of genetic clusters identified within trial communities generally increased with sampling density; for example, communities such as Shakawe, Ramokgonami, Mmathethe, Gumare, Maunatlala, and Tati Siding, which had larger numbers of unique and shared clusters compared to other trial communities, were also among the most densely sampled (Figure 8, Supplementary file 1-Table 1C). The abundance of small-sized genetic similarity clusters may reflect a moderate (14%) sampling fraction (Murray and Alland, 2002).

Barplots showing the spread of genetic clusters across trial communities.

The bottom two plots show the number of genetic clusters identified with HIV-TRACE (n = 122) and Cluster Picker algorithms (n = 231), respectively; and the top two plots show the union (n = 236) and overlap (n = 117) of genetic clusters found with HIV-TRACE and Cluster Picker. Most lineages localized to one or two trial communities; with few clusters having members spread out across five or more trial communities.

A barplot summarizing clusters of genetically similar HIV-1 infections by trial community.

Bars represent the total number of genetic clusters identified in each trial community partitioned into clusters that are specific to a community (lighter shade) and those that are shared with at least one other trial community (darker shade). Genetic clusters in intervention communities are shown in blue and those in control communities are represented in green. The number of genetic clusters identified among intervention and control communities varied.

Characteristics of participants in clusters

About 70% (366 / 525) of the HIV-1 viral consensus sequences in clusters were from women (Figure 9). This result was consistent with the overrepresentation of women among sequenced infections (Supplementary file 1, Table 1A). The ages of clustered men and women were similar with women having a median [lower–upper quartile] age of 37.8 [29.6 – 45.1] years and men 43.0 [34.9 – 51.0] years at time of sampling. 33% (78 / 236) of clusters included at least one participant with unsuppressed virus. Additionally, there were few clusters (4% (10 / 236)) where all members had HIV-1 infections with unsuppressed virus (Supplementary file 1-Table 1D). 54% (286 / 525) of the participants in clusters of genetically similar HIV-1 infections were also part of a baseline survey of 20% of households randomly sampled from each trial community to establish an incidence follow-up cohort and gather information on different socio-demographic variables including sexual risk behavior; see Makhema et al., 2019; Gaolathe et al., 2016 for details. Among them, 10% (30 / 286) of participants self-reported having multiple partners in the past twelve months; 73% (208 / 286) of participants self-reported a single partner in that period, 15% (44 / 286) of participants self-reported zero partners in that period and the rest did not disclose their number of partners. Of the 30 participants involved in multiple relationships, 90% (27 / 30) disclosed having one other sexual partner, and 10% (3 / 30) reported having two other sexual partners in addition to their primary sexual partner.

Density plots summarizing trial participants in clusters of genetically similar HIV-1 infections by age and gender.

The top two panels show the age-gender distribution of trial participants whose HIV-1 viral whole-genome consensus sequences were in the union (n = 525) and overlap (n = 281) of genetic clusters identified with HIV-TRACE and Cluster Picker. The bottom two plots show the age-gender distribution of trial participants whose HIV-1 viral whole-genome consensus sequences were in genetic clusters identified with HIV-TRACE (n = 283) and Cluster Picker (n = 484) clustering algorithms. The raw data is shown as a rug plot (red tick marks) on the x-axis. Men and women in genetic clusters had similar ages.

Deep-sequence phylogenetics to infer the probable order of transmission events within identified clusters of genetically similar HIV-1 infections

We identified 153 highly supported probable source-recipient pairs within 236 clusters of genetically similar HIV-1 infections. Seventy-one were same-sex linkages between women (n = 65) or men (n = 6), and the remaining were male-female pairs (n = 82). Considering that the predominant mode of HIV-1 transmission in Botswana and most of southern Africa is through heterosexual contact, and that direct transmission between women is rare (Chan et al., 2014), same-sex pairs were presumed to be members of transmission chains with unsampled intermediates. We used the probability of inferring a phylogenetically linked directed male-male pair in the sample to calibrate an upper bound on the FDR of inferring direct transmission between males and females in phylogenetically linked male-female pairs; the estimated number of linked male-female pairs in the sample with unsampled intermediates was approximately 30, corresponding to a FDR of 36% (30 / 82) (see Materials and methods section on estimating error rates in phylogenetic inference of direct transmission between sampled males and females). This estimated FDR is likely inflated given that two individuals would need to be missing from the sequenced sample to incorrectly infer transmission in a male-female pair, whereas only a single female would need to be missing to erroneously infer transmission in a male-male pair.
As indicated earlier (see Materials and methods section on criteria for inclusion in phylogenetic analyses), we expected our data set to contain the HIV-1 viral sequence of the transmitter for 11% (0.14 × (1 – 0.21)) of sequenced participants included in phylogenetic analyses (n = 3832). By comparison, we identified 82 male-female probable transmission pairs, suggesting that consensus sequence phylogenetic analysis missed some viral genetic linkages between participants with chronic infections, and that deep-sequence phylogenetic analysis identified only a subset of HIV-1 transmission events between participants assigned to genetic clusters. We restricted further analyses on inference of HIV transmission patterns to the highly supported 82 male-female probable transmission pairs. Analyses are presented first without an adjustment for sampling variability to illuminate patterns of viral transmission within the deep-sequenced sample. Thereafter an adjustment for variable sampling rates by demographic group (age, gender, trial community, and geographical region) or randomized-HIV-intervention (trial arm) is made to estimate the flow of HIV transmissions within the trial population (see Materials and methods sections on computing age-gender estimates of the number of people with HIV in each trial community and adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions).

Age and sex distribution of sources and recipients in inferred transmission pairs

The age distribution of transmitters and recipients, by gender, for the 82 male-female probable transmission pairs is shown in Figure 10. Inferred male-female and female-male HIV transmissions had similar age distributions for the inferred source of infection. Males were a mean ± standard deviation of 3.5 ± 9.6 years older than females in probable transmission pairs overall, with a difference of 4.9 ± 7.5 in male-to-female transmission events, and 1.8 ± 11.6 years in female-to-male transmission events. Adjusted for variability in sampling of the trial population by gender and 5-year age group, these differences were somewhat reduced: males in probable transmission pairs were a mean ± standard deviation of 1.3 ± 10.4 years older than females overall; with a difference of 3.8 ± 6.9 years in male-to-female transmission events, and −1.3 ± 12.7 years in female-to-male transmission events (see Materials and methods section on calculation of the weighted mean age gap between males and females in inferred probable transmission pairs accounting for variability in sampling of the trial population by gender and 5-year age group). The difference between the unadjusted and the weighted mean age gap between males and females in probable transmission pairs is consistent with the undersampling of younger males. Quantiles of age differences between males and females in inferred transmission pairs before and after adjustment for variability in sampling are presented in Supplementary file 1-Table 1E.

Age distribution of 82 male-female probable transmission pairs sampled in the Botswana/Ya Tsie trial.

Forty-five male-to-female and 37 female-to-male transmission events were identified from the deep-sequenced viral whole genomes of 525 trial participants in clusters of genetically similar HIV-1 infections. Men were typically older than women in both male-to-female and female-to-male transmission events.

Most HIV transmissions occurred between similarly aged partners

Supplementary file 1-Table 1F shows the proportions (unadjusted) of viral transmission events within and between 5-year age groups among the 82 probable male-female pairs inferred from the deep-sequenced sample. After adjusting for differential sampling by 5-year age group, the predicted proportions of HIV-1 transmissions in the trial population suggested preferential sexual mixing between similarly aged partners (Figure 11 and Supplementary file 1-Table 1F) (see Materials and methods sections on computing age-gender estimates of the number of people with HIV in each trial community and adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). An adjustment by both 5-year age group and gender revealed broadly similar transmission patterns in the trial population (Figure 12 and Supplementary file 1-Table 1G).

Estimated transmission flows of HIV-1 infection within and between 5-year age groups in the Botswana/Ya Tsie trial population.

Transmission flows of HIV-1 infection in the trial population were estimated from 82 male-female probable transmission pairs identified from the deep-sequenced HIV-1 viral whole genomes of trial participants (n = 525) in clusters (n = 236) of genetically similar HIV-1 infections. Furthermore, transmission flows were adjusted for differential sampling among age groups (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). Most viral transmission events occurred between similarly aged partners.

Estimated transmission flows of HIV-1 infection by gender within and between 5-year age groups in the Botswana/Ya Tsie trial population.

Transmission flows were estimated as described in Figure 11. Additionally, transmission flows were adjusted for differential sampling by gender and age group.

Similar proportions of men and women as inferred sources of transmission

Of the 82 inferred male-female transmission events, 55% [95% CI: 39.8 – 69.1] originated from men and the remaining 45% [30.9 – 60.2] from women (Figure 13). Adjusting for variable sampling by gender revealed similar transmission patterns in the trial population (Supplementary file 1-Table 1H).

A barplot showing counts of male-female probable transmission pairs sampled in the Botswana/Ya Tsie trial by geographical location.

The 82 probable male-female infector-infectee pairs that were identified from clusters of genetically similar HIV-1 infections (n = 236) in the Botswana/Ya Tsie trial are first summarized into those with a male versus a female source of viral transmission. Male-to-female and female-to-male transmission events are then further partitioned according to whether partners in a transmission pair resided in the same or different trial communities and whether those trial communities belonged to the same or different trial arms and geographical regions. C to I denotes transmission into an intervention community from a control community and I to C denotes transmission into a control community from an intervention community. Male-to-female transmission events are shown in green and female-to-male transmission events in yellow. About 69% (n = 57) of the sampled transmission events in the Botswana/Ya Tsie trial occurred within the same trial community.

Most inferred HIV transmissions occurred within communities or between neighboring communities

Trial communities could be broadly grouped into three geographical areas: in the North/North-East, South-East, and Central-East region of Botswana, which borders South Africa and Zimbabwe (Figure 1). The proportion of inferred transmission events of HIV-1 infection in the deep-sequenced sample between members of the same trial community was 69% (57 / 82) compared with 21% (17 / 82) for members of different trial communities in the same region, and 10% (8 / 82) for members of trial communities in different regions (Figures 13 and 14). Of the 74 (74 / 82) transmission events within a single region (including those within a single community), 22 were in the Central-East, 28 in the North/North-East, and 24 in the South-East (Figure 13). For pairs where transmission events linked people in different trial communities, the median [lower–upper quartile] driving distance between the trial communities was 161 km [108 – 420 km] or 100 mi [67 – 261 mi] and the median drive time was 1.86 hr [1.41 – 4.68 hr]. Furthermore, a comparison of mean drive times between pairs of trial communities for which transmission events were identified (n = 22) with those of all other possible pairs of participating trial communities excluding same community pairs (n = 900 – 22 – 30 = 848) revealed a shorter drive time on average between genetically linked trial communities compared with what would be expected under random sexual mixing (permutation-test, p < 0.01) (see Materials and methods section on permutation test to evaluate whether there was preferential sexual mixing among trial communities by geographic proximity). These results suggest that out-of-community sexual partnerships were more likely to form between partners residing in nearby communities compared with more distant communities.
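
The logic of such a permutation test can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the drive times are randomly generated placeholders rather than trial data, and the reading of 900 as the 30 × 30 ordered pairs of trial communities is an inference from the arithmetic in the text.

```python
import random
from statistics import mean

def permutation_p_value(linked_times, other_times, n_permutations=10_000, seed=0):
    """One-sided permutation p-value for a shorter mean drive time.

    Counts how often a random relabelling of community pairs gives the
    'linked' group a mean drive time at least as short as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(linked_times)
    pooled = list(linked_times) + list(other_times)
    n_linked = len(linked_times)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean(pooled[:n_linked]) <= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Pair counts as described in the text (assuming 30 x 30 ordered pairs).
n_communities = 30
n_linked_pairs = 22
n_other_pairs = n_communities * n_communities - n_linked_pairs - n_communities

# Hypothetical drive times in hours, for illustration only (not trial data).
rng = random.Random(1)
linked = [rng.uniform(0.5, 3.0) for _ in range(n_linked_pairs)]
others = [rng.uniform(0.5, 8.0) for _ in range(n_other_pairs)]
p_value = permutation_p_value(linked, others, n_permutations=2000)
```

A small p-value indicates that the observed mean drive time between linked communities is shorter than most means obtained when the "linked" label is assigned at random, which is the comparison against random sexual mixing described above.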

A transmission matrix summarizing sampled transmission events within and between Botswana/Ya Tsie trial communities.

Numbered tiles represent counts of transmission events identified between a pair of trial communities. There were 32 transmission events identified within control communities (magenta), 33 within intervention communities (cyan), 13 from control-to-intervention communities (green), and 4 from intervention-to-control communities (red). Most transmission events were identified within trial communities. The number of sampled transmission events into intervention communities from control communities was higher than the reverse (n=13 vs. 4).

After adjusting for differential sampling by community, an estimated 24% of transmissions to a resident of a trial community originated from another trial community (Supplementary file 1-Table 1I) (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). This is consistent with the high levels of mobility in Botswana, where people typically have family ties in ancestral villages and work in peri-urban or urban areas (Essex et al., 2019). Furthermore, trial communities were distributed along two principal highways that might have contributed to increased mobility: the A1 highway that connects the Northern and Southern parts of the country and the A3 highway that branches from the A1 highway connecting the North-Eastern and North-Western regions.

After adjusting for differential sampling by geographic region, the estimated proportion of HIV transmissions in the trial population that occurred within the South-East region, 50% [95% CI: 29.6 – 70.5], was higher than that within the Central-East region, 23% [9.7 – 45.4], or the North/North-East region, 18% [6.5 – 39.8] (Figure 15A and Supplementary file 1-Table 1J) (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). For comparison, the proportions of trial participants with viral genomes that met criteria for inclusion in phylogenetic analyses were 26% (1000 / 3832) in the South-East region, 42% (1630 / 3832) in the North/North-East region, and 31% (1202 / 3832) in the Central-East region (Supplementary file 1-Table 1C). The South-East region includes Gaborone city, the economic and administrative capital of Botswana; by contrast, economic activity in the North-East and Central-East regions is centered around mining and agriculture.

Estimated transmission flows of HIV-1 infection in the Botswana/Ya Tsie trial population within and between geographical regions and trial arms.

Transmission flows were estimated as described in Figure 11. Additionally, transmission flows were adjusted for differential sampling among geographical regions and trial arms respectively. (A) Estimated transmission flows of HIV-1 infection in the Botswana/Ya Tsie trial population within and between geographical regions. Communities in the Botswana/Ya Tsie trial were broadly grouped into three geographical regions in the Central-East, North/North-East, and South-Eastern parts of the country. The flow of HIV-1 transmissions in the trial population was highest in the South-East region (50.1%) compared with the Central-East (23.1%) and North/North-East (17.9%) regions, respectively. (B) Estimated transmission flows of HIV-1 infection within and between intervention communities and control communities in the Botswana/Ya Tsie trial population. Most transmissions of HIV-1 infection occurred within the same trial arm and the flow of viral transmissions into intervention communities from control communities (15.3%) was about three times higher than the reverse (4.7%).

More HIV transmission events to residents of intervention communities originated from residents of control communities than vice versa

Overall analysis

The proportion of inferred transmission events of HIV-1 infection in the deep-sequenced sample occurring within intervention (n = 33) or within control communities (n = 32) was 79% (65 / 82) compared with 21% (17 / 82) across intervention and control communities (Figures 13 and 14). Of the 33 transmission events inferred within intervention communities, 7 occurred between partners residing in different communities; and for those inferred within control communities, 1 of the 32 transmission events occurred between partners from different communities. There were more transmission events identified in intervention communities that originated from control communities than the reverse (n = 13 vs. 4).

After adjusting for differential sampling by trial arm, the estimated proportion of HIV transmissions in the trial population within control communities, 51% [35.9 – 66.5], was about twice that within intervention communities, 29% [16.6 – 44.6] (Figure 15B and Supplementary file 1-Table 1K), and the estimated proportion flowing into intervention communities from control communities, 15% [6.9 – 30.1], was about three times higher than the reverse, 5% [1.0 – 16.9] (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). For comparison, 40% (1551 / 3832) of the trial participants who met criteria for inclusion in phylogenetic analyses were from control communities and 60% (2281 / 3832) were from intervention communities.

Baseline analysis

Of the 82 male-female probable transmission pairs identified in the deep-sequenced sample, 51 involved a recipient that was first sampled during the period of baseline household survey activities in their community, that is, a transmission that could not have been affected by the intervention because it occurred before the start of the intervention. Among these (n = 51), the proportion of inferred transmission events that occurred within intervention (n = 19) or within control communities (n = 25) was 86% (44 / 51) compared with 14% (7 / 51) across intervention and control communities (Supplementary file 1-Table 1L). Of the 19 transmission events identified within intervention communities, 4 were between partners living in different communities; compared with 1 out of 25 transmission events identified within control communities. The number of transmission events identified in intervention communities that originated from control communities was similar to the reverse (n = 4 vs. 3).

After adjusting for sampling variability among trial arms, the flow of HIV transmissions in the trial population within control communities, 62% [41.6 – 78.8], was more than two times that within intervention communities, 25% [11.8 – 45.9] (Supplementary file 1-Table 1L), and the flow of HIV transmissions into intervention communities from control communities, 7% [1.5 – 25.3], was similar to the reverse, 5% [0.9 – 22.9] (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions).

Post-baseline analysis

By contrast, 31 of the 82 male-female probable transmission pairs inferred in the deep-sequenced sample involved a recipient that was first sampled about a year or more after the end of baseline household survey activities in their community; this subset reflects transmissions that could have in principle been affected by the intervention, as they may have occurred after the intervention began. Among these (n = 31), the proportion of inferred transmission events that occurred within intervention (n = 14) or within control communities (n = 7) was 68% (21 / 31) compared with 32% (10 / 31) across intervention and control communities (Supplementary file 1-Table 1M). Moreover, 3 of the 14 transmission events identified within intervention communities were between partners residing in different communities, and all 7 transmission events identified within control communities were between partners living in the same community. We identified more transmission events in intervention communities that originated from control communities than vice versa (n = 9 vs. 1).

After adjusting for variability in sampling among trial arms, the flow of HIV transmissions within control communities, 32% [13.3 – 58.3], was similar to that within intervention communities, 34% [15.0 – 60.9], and HIV transmissions into intervention communities from control communities, 30% [12.2 – 56.7], were more common than the reverse, 3% [0.1 – 27.3] (Supplementary file 1-Table 1M) (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). This result is consistent with a predicted benefit of the treatment-as-prevention intervention in reducing HIV transmission in the Botswana/Ya Tsie trial, although precise dates of transmission were not identified.
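
The unadjusted counts reported in the overall, baseline, and post-baseline analyses can be cross-checked with a few lines of arithmetic. The sketch below uses only the pair counts stated in the text; the dictionary keys and function name are labels chosen for illustration. It confirms that the baseline and post-baseline strata add up to the overall counts and recomputes the within-arm and across-arm sample percentages.

```python
# Counts of sampled transmission pairs reported in the text, by analysis.
# within_I / within_C: both partners in intervention / control communities.
# C_to_I / I_to_C: control-to-intervention / intervention-to-control.
counts = {
    "overall": {"within_I": 33, "within_C": 32, "C_to_I": 13, "I_to_C": 4},
    "baseline": {"within_I": 19, "within_C": 25, "C_to_I": 4, "I_to_C": 3},
    "post_baseline": {"within_I": 14, "within_C": 7, "C_to_I": 9, "I_to_C": 1},
}

def summarize(cells):
    """Total pairs and rounded within-arm / across-arm sample percentages."""
    total = sum(cells.values())
    within = cells["within_I"] + cells["within_C"]
    return {
        "total": total,
        "within_arm_pct": round(100 * within / total),
        "across_arm_pct": round(100 * (total - within) / total),
    }

summaries = {name: summarize(cells) for name, cells in counts.items()}

# The two strata should partition the overall set of pairs cell by cell.
strata_sum = {
    key: counts["baseline"][key] + counts["post_baseline"][key]
    for key in counts["overall"]
}
```

These are sample proportions; the bracketed estimates in the text are the corresponding values after adjustment for differential sampling by trial arm.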

A sensitivity analysis to evaluate the impact of probable transmission pairs with unsampled intermediates on the patterns of HIV transmission within and between intervention communities and control communities in the Botswana/Ya Tsie trial

The first set of sensitivity analyses was performed with highly supported directed same- and opposite-sex transmission pairs identified in: (1) HIV-TRACE clusters, (2) Cluster Picker clusters, (3) the overlap of HIV-TRACE and Cluster Picker clusters, and (4) the union of HIV-TRACE and Cluster Picker clusters (Supplementary file 1-Table 1N). The second set of sensitivity analyses was performed for the same four categories but restricted to highly supported directed opposite-sex pairs only where the recipient in a transmission pair was first sampled about a year or more after the end of baseline household survey activities in their community (i.e., post-baseline) (Supplementary file 1-Table 1O). In both sets of sensitivity analyses, the results were consistent with the primary analysis: transmissions into intervention communities from control communities were more common than the reverse post-baseline (Supplementary file 1-Table 1N and O) (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). The signal was stronger as well as being more interpretable in the analysis restricted to opposite-sex pairs compared to the analysis that included same-sex pairs, which likely included one or more unsampled intermediates.

Discussion

HIV incidence remains high in East and Southern Africa especially among young women. Promising HIV prevention intervention packages that combine population-level HIV testing coupled with strengthened linkage-to-care and early initiation of antiretroviral therapy, jointly termed universal test-and-treat, have yielded lower than expected reductions in the occurrence of new HIV-1 infections (Hayes et al., 2019; Makhema et al., 2019; Havlir et al., 2019; Iwuji et al., 2018; Abdool Karim, 2019). This makes it more challenging to attain the UNAIDS goal of ending the HIV/AIDS epidemic as a public health emergency by 2030. An emerging hypothesis among four of the largest universal test-and-treat HIV prevention trials ever conducted in East and Southern Africa in Kenya, Uganda, Botswana, South Africa, and Zambia (Abdool Karim, 2019), comprising tens of thousands of participants, is that the intervention impact to reduce population-level HIV incidence was dampened by population mobility; in particular, sexual partnerships formed (1) between individuals in communities randomized to different HIV-intervention conditions and (2) with individuals in communities outside the trial population.

In this study, we focused on one of the four cluster/community-randomized trials, the Ya Tsie trial conducted in Botswana (Makhema et al., 2019), to quantify the extent of sexual mixing within and between: trial communities, trial arms, and geographic regions. Furthermore, we quantified the contribution of age-difference and gender to the spread of HIV infections in Eastern Botswana. We first used HIV-1 whole-genome viral consensus sequences to identify clusters of trial participants with genetically similar HIV-1 infections. This was done as a filtering step to save time and computational resources by excluding distantly related sequences. Thereafter, we employed deep-sequence phylogenetics to resolve the probable order of HIV transmission events within each identified cluster. Identified transmission events in the deep-sequenced sample were then used to estimate the flow of HIV transmissions in the trial population adjusting for sampling variability.

We found that HIV transmissions in the trial were more likely to occur within communities or between neighboring communities than between distant trial communities, as well as between similarly aged partners. One of the most striking findings of our analysis was that an estimated 24% of transmissions involving residents of trial communities occurred between different communities in the trial. Given that the trial covered only 7.6% of the national population, this suggests that a large fraction of all transmissions involved a partner outside the trial.

Furthermore, there was substantial sexual mixing between intervention and control communities. Overall, about one in every five HIV transmissions in the trial population involved sexual partnerships between residents of intervention communities and control communities. After baseline, that is, about a year or more after the introduction of the intervention, one in three HIV transmissions occurred between intervention and control communities. HIV transmissions into intervention communities from control communities were similar to the reverse at baseline and ten times more common post-baseline, concordant with a predicted benefit of a universal test-and-treat HIV prevention intervention, though dates of transmission were not identified.

The extent of sexual mixing between residents of intervention and control communities highlights high mobility between the two arms of the Botswana/Ya Tsie trial. For the primary trial endpoint comparing HIV incidence in intervention versus control communities, this kind of mixing will tend to dilute the effect. Some of the prevention value of the intervention would appear in control communities, while intervention communities would have incidence that comes from control communities and from communities outside the trial and is therefore not reachable by the intervention. Both sources of incidence, from control communities and from outside the trial, could dilute the intervention effect observed in the trial. Although we could not evaluate it in this study, it is possible that out-of-trial index cases would dilute the effect even more strongly than control-arm ones because the control arms comprise only about 5% of Botswana’s population, while about 90% of the population lives outside the trial.

These findings might be relevant for the interpretation of the results of the other three universal test-and-treat HIV prevention trials conducted in Kenya, Uganda, South Africa, and Zambia. An important next step would be to quantify the extent to which the size of the observed effect in the Botswana/Ya Tsie trial was reduced by imported infections between partners in communities randomized to different HIV-intervention conditions and partners in communities outside the trial population.

There was a deliberate effort to enroll men and young individuals in the Botswana/Ya Tsie study as these groups are less likely to be diagnosed with HIV and engaged in care. Following an adjustment for variable sampling by age and gender (Figure 12), we found that men and women for whom sexual partners could be inferred phylogenetically contributed similarly to viral transmission of HIV-1 infection in the Botswana/Ya Tsie trial (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). There would also be a contribution to HIV incidence from individuals whose source partner resided in a community outside the trial, though not evaluated in this analysis as HIV viral sequences from communities outside the trial population were not available.

Prior to this study, a recent consensus sequence phylogenetic study assessing age-disparate sexual partnerships in KwaZulu-Natal, South Africa identified men older than 25 years as an important source of viral transmission to younger women aged 15–25 years. Age-disparate partnerships were defined as ones with an age gap of more than 5 years. The KwaZulu-Natal study was based on a community sampling fraction of 4% that was not specifically targeted to people with HIV and assumed that the direction of viral transmission within clusters would be from older partners to younger ones (de Oliveira et al., 2017). Phylogenetic analyses in the Botswana/Ya Tsie trial were based on an overall sampling fraction of 14% of people aged 16–64 years living with HIV in trial communities; the sampling fraction ranged from 2.7% to 36.2% among the 30 trial communities. In the phylogenetic analyses of the Botswana/Ya Tsie study, there was little genetic evidence from analyses of deep-sequence data to suggest that older men were a substantial source of HIV-1 infection to younger women in Botswana. This finding is relevant to programs aimed at reducing HIV incidence in young women in sub-Saharan Africa such as the Determined, Resilient, Empowered, AIDS-free, Mentored, and Safe women (DREAMS) partnership. Consistent with our results, the Vaginal and Oral Interventions to Control the Epidemic (VOICE) trial, a placebo-controlled randomized study of pre-exposure prophylaxis to test the efficacy of providing oral and vaginal gel tenofovir to women in Southern Africa to prevent HIV infection (Saag, 2015; Balkus et al., 2015), found no association between age-disparate relationships and risk of HIV-1 infection in young women under 25 years residing in the South African cities of Durban, Johannesburg, and Klerksdorp who had male partners that were >5 years older at enrollment.

Taken together, these findings might be relevant for understanding heterogeneity of HIV transmission in similarly designed phylogenetic studies in sub-Saharan Africa based on the sample size, demographic population, and adjustments for variable sampling.

Participants with genetically similar HIV-1 infections in the Botswana/Ya Tsie trial tended to aggregate in small-sized (two- or three-person) clusters. This is consistent with phylogenetic clustering studies in generalized epidemics in Southern Africa but contrasts with larger clusters typically found in concentrated HIV epidemics in Europe and North America (Grabowski et al., 2018). The small-sized genetic similarity clusters identified in trial communities might reflect undersampling of key populations, for example, men and younger individuals, in the HIV transmission network. Alternatively, the predominance of smaller clusters in generalized epidemics compared with larger clusters in concentrated epidemics might reflect differences in the manner in which sexual networks under the two settings evolve over time.

Most participants in the household survey of the Botswana/Ya Tsie trial indicated that they had not engaged in transactional sex in the last 12 months. This suggested that there was a limited contribution of sex workers to the transmission patterns identified in the trial.

Our findings provide insight on HIV transmission patterns in Botswana over a 5-year period between 2013 and 2018 during which the Botswana/Ya Tsie trial was conducted. An important next step would be the integration of the large database of HIV viral genomes assembled by the Botswana/Ya Tsie trial into a real-time HIV genomic surveillance program to highlight emerging hubs of HIV spread and inform targeted studies of HIV prevention. Such an effort would be facilitated by samples routinely collected for HIV viral load monitoring and the continuously decreasing cost of high-throughput sequencing.

Our study had several limitations. First, viral genomes from an estimated 14% of all people with HIV aged 16–64 years were sampled among participating trial communities. Comparatively, the sampling fraction of phylogenetic studies based on African genomes is commonly under 10% (Grabowski et al., 2018). Although our sampling fraction was a modest improvement over previous studies, it is still a relatively small sample of HIV viral genomes in the trial population. To estimate HIV transmission flows in the trial population based on probable viral transmission events identified in the deep-sequenced sample, we used the method of Carnegie et al., 2014, in which undetected HIV transmissions were assumed to be missing at random conditional on group membership (see Materials and methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). Here, group membership refers to demographic groups (age group, gender, trial community, and geographical region) or randomized-HIV-interventions.
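
To illustrate what an adjustment under this missing-at-random assumption can look like, the sketch below reweights observed source-to-recipient pair counts by the inverse of the probability that both partners were sampled, taken here as the product of their groups' sampling fractions, and then normalizes to proportions. The product-of-fractions weight, the function name, and the sampling fractions are assumptions made for illustration; they are not taken from the bumblebee package or from the trial, and only the four pair counts come from the text.

```python
def adjusted_flows(pair_counts, sampling_fraction):
    """Sampling-adjusted transmission flow proportions.

    Each observed (source group, recipient group) count is divided by the
    assumed probability that both partners were sampled, then all weighted
    counts are normalised so that the flows sum to one.
    """
    weighted = {
        (src, rec): n / (sampling_fraction[src] * sampling_fraction[rec])
        for (src, rec), n in pair_counts.items()
    }
    total = sum(weighted.values())
    return {cell: w / total for cell, w in weighted.items()}

# Observed pair counts from the overall analysis (source, recipient).
pair_counts = {
    ("control", "control"): 32,
    ("intervention", "intervention"): 33,
    ("control", "intervention"): 13,
    ("intervention", "control"): 4,
}

# Hypothetical sampling fractions, for illustration only (not trial values).
hypothetical_fractions = {"control": 0.12, "intervention": 0.16}
flows = adjusted_flows(pair_counts, hypothetical_fractions)

# With equal sampling fractions the adjustment leaves sample proportions unchanged.
unadjusted = adjusted_flows(pair_counts, {"control": 0.1, "intervention": 0.1})
```

The design point is that pairs drawn from more lightly sampled groups are less likely to be observed, so they receive a larger weight; when one arm is sampled more densely than the other, its within-arm share shrinks after adjustment.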

Of the 82 probable male-female transmission pairs identified in the deep-sequenced sample, a single individual was involved in multiple transmission events: a male who transmitted to two females, each residing in a different community from his own. This suggested that only a small degree of bias in the estimation of CIs would arise from a simplifying assumption of independence between inferred transmission events, and it allowed us to estimate CIs using methods for multinomial proportions (see Materials and methods section on computation of CIs for estimated HIV transmission flows in the trial population).
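
As one example of a method for multinomial proportions, the sketch below computes Goodman (1965) simultaneous confidence intervals from a vector of cell counts. This section does not state which multinomial method was used, so the choice of Goodman intervals is an assumption, and the function name is hypothetical. The sketch is applied to the unadjusted overall counts and is therefore not expected to reproduce the sampling-adjusted intervals reported above.

```python
from math import sqrt
from statistics import NormalDist

def goodman_intervals(counts, alpha=0.05):
    """Goodman simultaneous confidence intervals for multinomial proportions.

    Uses the chi-square quantile with 1 degree of freedom at level
    1 - alpha / k, obtained as the square of a standard normal quantile.
    """
    k = len(counts)
    n = sum(counts)
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    a = z * z  # chi-square(1 df) quantile at 1 - alpha / k
    intervals = []
    for x in counts:
        half_width = sqrt(a * (a + 4 * x * (n - x) / n))
        lower = (a + 2 * x - half_width) / (2 * (n + a))
        upper = (a + 2 * x + half_width) / (2 * (n + a))
        intervals.append((lower, upper))
    return intervals

# Unadjusted overall counts: within control, within intervention,
# control-to-intervention, intervention-to-control.
cells = [32, 33, 13, 4]
intervals = goodman_intervals(cells)
```

Because the intervals are simultaneous across all cells, they are wider than single-proportion intervals computed one cell at a time.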

Second, although our estimates of HIV transmission flows in the trial population were adjusted for differential sampling, we cannot exclude the possibility that unmeasured factors with influence on HIV transmission flows may have affected our results. Most participants with HIV in this trial were virally suppressed. This might have biased our results, in the sense that the HIV transmission patterns we found may differ from those that exist among viral genetic clusters largely populated by people with unsuppressed virus. Such heterogeneity in HIV transmission patterns might occur due to differences in risk behavior. Nevertheless, this study was based on a relatively representative sample of adults with HIV living in rural and peri-urban villages across Eastern Botswana.

In sum, we identified several key factors in the Botswana/Ya Tsie trial that impacted HIV transmission dynamics with potential relevance to similar studies done in sub-Saharan Africa. We found that most HIV transmissions in the Botswana/Ya Tsie trial occurred between similarly aged partners within the same trial community or between trial communities in close proximity. Moreover, there was a greater flow of HIV transmissions into intervention communities from control communities than vice versa potentially reducing the observed effect-size of the trial, as would transmissions averted by the intervention that were not picked up because the recipient was not in a trial community.

We recommend widely distributed and easily accessible HIV testing (e.g., universal HIV testing campaigns), treatment, and linkage-to-care to support people as they intermix within and across communities; augmented by targeted programs that might offer pre-exposure prophylaxis to younger women and the option of self-testing for men. Population-level genomic surveillance programs to identify communities with high flows of viral transmission within and between them would be helpful in spotting emerging hubs of transmission that can be prioritized for intervention. Taken together, these measures may reduce new HIV infections and shorten the time to epidemic control.

Code availability

Algorithms to estimate HIV transmission flows within and between population groups accounting for sampling variability and corresponding CIs have been implemented as an R package, bumblebee that will be made available at the following URL: https://magosil86.github.io/bumblebee. A step-by-step tutorial on how to estimate HIV transmission flows with bumblebee and accompanying example data sets can be accessed at: https://github.com/magosil86/bumblebee/blob/master/vignettes/bumblebee-estimate-transmission-flows-and-ci-tutotial.md (Magosil, 2021).

Data availability

All relevant data are within the paper, figures and tables. HIV-1 viral whole genome consensus sequences are provided as a Dryad dataset (https://doi.org/10.5061/dryad.0zpc86706). HIV-1 reads are available on reasonable request through a concept sheet proposal to the PANGEA consortium. Contact details are provided on the consortium website (https://www.pangea-hiv.org). A copy of the bumblebee code and tutorial described under Code availability is archived at swh:1:rev:e44b55de833780defd37c81d2bb94f65ed1dff12.

The following data sets were generated
    Magosi LE, Zhang Y, Golubchik T, DeGruttola V, Tchetgen Tchetgen E, Novitsky V, Moore J, Bachanas P, Segolodi T, Lebelonyane R, Pretorius Holme M, Moyo S, Makhema J, Lockman S, Fraser C, Essex M, Lipsitch M (2022)
    Deep-sequence phylogenetics to quantify patterns of HIV transmission in the context of a universal testing and treatment trial - BCPP/ Ya Tsie trial.
    Dryad Digital Repository
    https://doi.org/10.5061/dryad.0zpc86706

Decision letter

  1. Francesca Di Giallonardo
    Reviewing Editor; University of New South Wales, Australia
  2. Miles P Davenport
    Senior Editor; University of New South Wales, Australia
  3. Nadine Tschumi
    Reviewer; University Hospital Zurich, Switzerland

Decision letter after peer review:

Thank you for submitting your article "Deep-sequence phylogenetics to quantify patterns of HIV transmission in the context of a universal testing and treatment trial – BCPP/ Ya Tsie trial" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Miles Davenport as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Nadine Tschumi (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Overall, we think the study was well designed and that its large sample size is highly valuable. The study provides detailed information on the methods, including the release of a new R package for calculating transmission flow among populations. This is highly valuable for the interpretation of public health interventions, particularly test-and-treat. However, we all agree that it is important to add some clarifications of the methods used.

1) The authors use a combination of HIV-TRACE and Cluster Picker as tools for cluster identification. As such, the authors implement a genetic threshold of 4.5% genetic diversity. This is the core outcome of the study; all the subsequent analyses depend on the clusters identified. HIV-TRACE is often used instead of a phylogenetic tree, as the latter can be computationally time-consuming with large data sets. However, the study also estimates a phylogenetic tree; we therefore wish for more clarification on why both methods were used and how different the cluster outcomes are with each method.

2) The cluster analysis resulted in many infection pairs between same-sex couples, indicating missing data in the transmission chains. The study excluded these pairs from subsequent analysis, resulting in a small number of overall pairs for the final transmission flow count. We think that such 'missing link' pairs could still be included in the analysis of estimating transmission flow between geographic regions. We were wondering if the study has investigated this and believe a constructive discussion in that regard could be favorable. See comments of reviewer 2.

3) We agree that the study is overall well-presented and well-written, however, we think it would be useful to clarify some of the methods in the main text. We also agreed that the figures were overall informative and useful, however, we believe that some of the illustrations could be improved to aid with the understanding of the results. See comments of reviewer 3.

Reviewer #1 (Recommendations for the authors):

1. I think my concerns 8-11 above could be relatively easily addressed by sensitivity analyses showing that a change in respective definitions did not affect the main outcomes qualitatively.

2. The estimated proportion of HIV transmissions in the trial population flowing into intervention communities from control communities and vice-versa was a key finding of the manuscript. Since the number of probable transmission pairs is low, the confidence intervals are wide and overlapping. I appreciated / think that it is important and correct that you only considered probable transmission pairs for the age gap analysis. However, for this analysis I wouldn't see a problem for including pairs with probable unsampled intermediates. Was this attempted / why not?

Reviewer #2 (Recommendations for the authors):

The methods describing the initial sequence quality control and alignment need to be expanded. Currently, it is unclear how the shiver assembly software was run. While the sequencing and bioinformatics pipelines used here have been used in previous studies the possible impact of sequencing errors should be discussed. Ideally, replicates would be used to determine the expected rate of sequencing errors. However, other approaches such as the proportion of minority variants at different coding locations for different variant frequency cutoffs could also be used (Dyrdak et al., 2019). This is particularly important given that most trial participants were virally suppressed which could impact the overall quality of the sequencing data.

Much of the downstream analysis and results rely on the initial separation of sequenced samples into putative transmission clusters. As a result, the choice of the 4.5% genetic distance threshold and the decision to combine the results of HIV-TRACE and Cluster Picker need much stronger motivation. This is particularly so as Supplementary Figures 7 and 8 suggest that the two methods give quite different results. Whilst I realise Cluster Picker is commonly used, HIV-TRACE does not rely on the accuracy of a provided phylogeny and, as the following transmission analysis only considers pairwise relationships, HIV-TRACE seems more appropriate.

I may have misunderstood but the method for calibrating the false discovery rate (FDR) for the inference of direct transmission described in Supplementary note 3 appears to consider all pairwise relationships (equation 6). This would only be possible if a person was able to be infected by multiple different transmitters. Assuming that the authors have only considered a single transmitter per infection then this would lead to an overestimate of the FDR. This should be clarified in the text.

Finally, it would be useful if the authors could discuss what impact high risk communities such as sex workers could have on these results and the estimates of intercommunity transmission.

– Many of the methods are described in the supplementary materials which makes the interpretation of the results challenging. The manuscript would be improved by moving some of the supplementary methods into the main text.

– The command line parameters used to run MAFFT and HIValign should be provided.

– While I was able to successfully download and install the 'bumblebee' R package I could not get a small trial example to run. Example data and a short 'how to guide' should be included as part of the package.

– Figure 7 is quite confusing and I think relatively little is gained by the use of an alluvial plot. I think it would be clearer to separate this plot into multiple bar plots of the total number of transmissions rather than the proportion within each category.

– It would be better to combine Figures 9 and 10 into a multi panel figure

– Lines 177, 184, 216, 219, 279, 342, 345 etc: The multiplication symbols look like a variable and should be removed.

– Line 262: I think this should say that the Goodman method estimates CIs for the parameters of the multinomial distribution.

https://doi.org/10.7554/eLife.72657.sa1

Author response

Essential revisions:

Overall, we think the study was well designed and that its large sample size is highly valuable. The study provides detailed information on the methods, including the release of a new R package for calculating transmission flow among populations. This is highly valuable for the interpretation of public health interventions, particularly test-and-treat. However, we all agree that it is important to add some clarifications of the methods used.

We thank the reviewers for their interest in this novel approach that uses deep-sequence phylogenetics to improve our understanding of HIV transmission patterns in the context of the universal HIV test-and-treat intervention.

1) The authors use a combination of HIV-TRACE and Cluster Picker as tools for cluster identification. As such, the authors implement a genetic threshold of 4.5% genetic diversity. This is the core outcome of the study; all the subsequent analyses depend on the clusters identified. HIV-TRACE is often used instead of a phylogenetic tree, as the latter can be computationally time-consuming with large data sets. However, the study also estimates a phylogenetic tree; we therefore wish for more clarification on why both methods were used and how different the cluster outcomes are with each method.

We thank the reviewers for their comments and are happy to provide additional results that motivate our decision to use the union of clusters detected with HIV-TRACE and Cluster Picker to estimate HIV transmissions within and between demographic sub-groups in the Botswana – Ya Tsie trial population. The primary motivation was that a filtering step was required to avoid spending time and computational resources on sequences that were too distantly related, before applying the “gold standard” of Phyloscanner to detect directed (when possible) transmission pairs. Accordingly, clustering algorithms plus a distance threshold helped to achieve this filtering. Because we shared what we take to be the reviewers’ concerns about either of the algorithms alone, we sought to maximize the number of transmission pairs that could be identified between participants in the Botswana – Ya Tsie trial with Phyloscanner by using the union of clusters detected with HIV-TRACE and Cluster Picker. This also served as a sensitivity analysis that allowed us to evaluate the extent to which the clustering patterns observed were specific to a single algorithm.

Furthermore, a previous study by Rose and colleagues (PMID: 27824249) that compared the number and size of clusters identified with the HIV-TRACE and Cluster Picker clustering algorithms found that HIV-TRACE generally identified fewer but larger clusters, whereas Cluster Picker identified more numerous clusters that were mostly small 2-person clusters (see Figure 3B in Rose and colleagues, PMID: 27824249). This suggested that HIV-TRACE would be helpful in detecting potentially larger transmission chains and Cluster Picker would be valuable in revealing potential transmission events between pairs of individuals.

Of the 236 genetic clusters detected with the two algorithms, we identified 19 full or partial clusters (including 41 sequences) that included members that were only detected with HIV-TRACE and 122 full or partial clusters (including 242 sequences) that were unique to Cluster Picker. Moreover, of the 82 directed male-female transmission pairs inferred from the sample, 5 were from genetic clusters that were unique to HIV-TRACE compared with 27 that were from clusters unique to Cluster Picker. Of the five transmission events unique to HIV-TRACE clusters, three occurred in intervention communities originating from control communities. By contrast, four of the twenty-seven transmission events unique to Cluster Picker clusters occurred in intervention communities from control communities.
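
The effect of restricting the analysis to clusters found by both algorithms can be checked directly from the pair counts reported above. The short sketch below only restates that arithmetic; the function and variable names are illustrative and are not taken from the manuscript or from the bumblebee package.

```python
# Counts of directed male-female pairs, as reported in the paragraph above.
TOTAL_PAIRS = 82
PAIRS_UNIQUE_TO_HIVTRACE = 5
PAIRS_UNIQUE_TO_CLUSTER_PICKER = 27


def pairs_retained_by_overlap(total, unique_a, unique_b):
    """Pairs left if only clusters detected by both algorithms are kept."""
    return total - (unique_a + unique_b)


retained = pairs_retained_by_overlap(
    TOTAL_PAIRS, PAIRS_UNIQUE_TO_HIVTRACE, PAIRS_UNIQUE_TO_CLUSTER_PICKER
)
excluded = TOTAL_PAIRS - retained
print(f"retained by overlap: {retained}, excluded: {excluded} ({excluded / TOTAL_PAIRS:.0%})")
# prints: retained by overlap: 50, excluded: 32 (39%)
```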

In summary, estimates of HIV transmissions in the trial population based on the full overlap of clusters detected with HIV-TRACE and Cluster Picker would have excluded 32 of the 82 male-female pairs used for the primary analysis. To improve clarity, we have updated the methods, results and discussion with the following statements respectively:

“Consensus sequence phylogenetics to identify clusters of participants with genetically similar HIV-1 infections. To avoid spending time and computational resources on sequences that were too distantly related, we first identified clusters of participants with genetically similar HIV-1 infections as a filtering step, before performing ancestral host state reconstruction with Phyloscanner to detect probable directed transmission pairs. Two clustering algorithms, HIV Transmission Cluster Engine (HIV-TRACE) v0.4.4 [26, 27] and Cluster Picker v1.2.3 [28], were used to identify clusters of individuals whose HIV-1 viral whole genome consensus sequences were genetically similar, suggesting they were probably members of a transmission chain [29, 30]. HIV-TRACE defines clusters based on pairwise genetic distances only; comparatively, Cluster Picker identifies clusters using pairwise genetic distances with the guidance of a phylogenetic tree. A multiple sequence alignment (as described in the methods section on comparing genetic distances between HIV-1 viral consensus sequences of trial participants included in phylogenetic analyses) was provided as input to HIV-TRACE and Cluster Picker. Additionally, for Cluster Picker, a corresponding phylogenetic tree inferred with FastTree2 v2.1.10 and bootstrap support values approximated with the Shimodaira-Hasegawa test [31] were provided as inputs. We defined genetic similarity clusters as groups of two or more participants whose viral whole genome consensus sequences were separated by a genetic distance at or smaller than 4.5% nucleotide substitutions per site and, for Cluster Picker, a bootstrap support value of at least 80%. The genetic distance threshold of 4.5% nucleotide substitutions per site was motivated by the distribution of genetic distances separating HIV-1 subtype C viral whole genomes of epidemiologically-linked couples in the HIV Prevention Trials Network (HPTN) 052 trial (Figure 6) [8, 32].
A listing of parameters used for consensus-sequence phylogenetics with HIV-TRACE and Cluster Picker is provided in Supplementary file 1 Table 1B.”

“About 1 in 7 genotyped participants included in consensus sequence phylogenetic analyses were assigned to genetic similarity clusters of people with closely related HIV-1 infections. … To maximize the number of transmission pairs that could be identified between participants in the Botswana – Ya Tsie trial with Phyloscanner we used the union of clusters detected with HIV-TRACE and Cluster Picker.”

“We first used HIV-1 whole genome viral consensus sequences to identify clusters of trial participants with genetically similar HIV-1 infections. This was done as a filtering step to save time and computational resources by excluding distantly related sequences. Thereafter, we employed deep-sequence phylogenetics to resolve the probable order of HIV transmission events within each identified cluster.”
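
The distance-based filtering step described in the statements above can be illustrated with a minimal sketch. It assumes single-linkage grouping (connected components of a graph whose edges join sequences at or below the 4.5% threshold), which is in the spirit of distance-only clustering but is not the HIV-TRACE implementation, and it omits the tree and bootstrap support that Cluster Picker additionally requires. The participant labels and distances are made up for illustration.

```python
THRESHOLD = 0.045  # 4.5% nucleotide substitutions per site, as stated in the methods text


def threshold_clusters(distances, threshold=THRESHOLD):
    """Group sequences into connected components of the graph whose edges
    join pairs separated by a distance at or below the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for (a, b), d in distances.items():
        root_a, root_b = find(a), find(b)
        if d <= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    # Keep clusters of two or more participants only.
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)


# Hypothetical pairwise distances (made-up values, not trial data).
toy_distances = {
    ("P1", "P2"): 0.020,
    ("P2", "P3"): 0.040,
    ("P1", "P3"): 0.060,
    ("P4", "P5"): 0.045,
    ("P3", "P4"): 0.090,
    ("P5", "P6"): 0.120,
}

print(threshold_clusters(toy_distances))
# prints: [['P1', 'P2', 'P3'], ['P4', 'P5']]
```

In this toy example P1 and P3 are more than 4.5% apart yet fall in the same cluster because both are linked to P2, and P6 is filtered out because it is not within the threshold of any other sequence.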

2) The cluster analysis resulted in many infection pairs between same-sex couples, indicating missing data in the transmission chains. The study excluded these pairs from subsequent analysis, resulting in a small number of overall pairs for the final transmission flow count. We think that such 'missing link' pairs could still be included in the analysis of estimating transmission flow between geographic regions. We were wondering if the study has investigated this and believe a constructive discussion in that regard could be favorable. See comments of reviewer 2.

We thank the reviewers for their suggestion and now report in the results the findings of a sensitivity analysis as follows:

“A sensitivity analysis to evaluate the impact of probable transmission pairs with unsampled intermediates on the patterns of HIV transmission within and between intervention communities and control communities in the Botswana – Ya Tsie trial. The first set of sensitivity analyses was performed with highly supported directed same- and opposite-sex transmission pairs identified in: (1) HIV-TRACE clusters, (2) Cluster Picker clusters, (3) the overlap of HIV-TRACE and Cluster Picker clusters and (4) the union of HIV-TRACE and Cluster Picker clusters (Supplementary file 1 Table 1N). The second set of sensitivity analyses was performed for the same four categories but restricted to highly supported directed opposite-sex pairs only where the recipient in a transmission pair was first sampled about a year or more after the end of baseline household survey activities in their community (i.e. post-baseline) (Supplementary file 1 Table 1O). In both sets of sensitivity analyses the results were consistent with the primary analysis: transmissions into intervention communities from control communities were more common than the reverse post-baseline (Supplementary file 1 Tables 1N and 1O) (see methods section on adjustment for variable sampling rates across different demographic groups or randomized-HIV-interventions). The signal was stronger as well as being more interpretable in the analysis restricted to opposite-sex pairs compared to the analysis that included same-sex pairs, which likely included one or more unsampled intermediates.”

However, it is important to note that in a predominantly heterosexual epidemic, same-sex pairs might represent pairs with one or more missing intermediates or false positives. Therefore, it would be challenging to interpret the estimated HIV transmission flows in the trial population without information on the number of missing intermediates, their gender, community and trial arm. For example, consider a directed female-female pair where female-1 was sampled from an intervention community (F1-intervention) and female-2 from a control community (F2-control). Let the two sampled females be connected by a single unsampled male, giving the following transmission chain: F1-intervention -> M -> F2-control. If the male belongs to a control community, the true transmission flows would be intervention-to-control and control-to-control, whereas if the male belongs to an intervention community they would be intervention-to-intervention and intervention-to-control. These two scenarios have very different implications, and we have no way to distinguish them. Thus same-sex pairs were excluded from the primary analyses to limit false positives and aid interpretation of the analysis.
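
The ambiguity in the female-female example above can be made explicit by enumerating the arm-to-arm flows implied by each possible trial arm of the unsampled male. This is only an illustrative sketch; the function name and arm labels are hypothetical and not part of the study's code.

```python
def implied_flows(source_arm, intermediate_arm, recipient_arm):
    """Arm-to-arm flows implied by the chain source -> intermediate -> recipient."""
    return [(source_arm, intermediate_arm), (intermediate_arm, recipient_arm)]


# F1 is sampled in an intervention community and F2 in a control community;
# the trial arm of the unsampled male M is unknown, so both options are listed.
for arm in ("control", "intervention"):
    flows = implied_flows("intervention", arm, "control")
    print(f"unsampled male in {arm} community:", flows)
# prints:
# unsampled male in control community: [('intervention', 'control'), ('control', 'control')]
# unsampled male in intervention community: [('intervention', 'intervention'), ('intervention', 'control')]
```

Both scenarios are compatible with the same observed female-female pair, which is why the observed pair alone cannot be assigned to a single flow.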

3) We agree that the study is overall well-presented and well-written, however, we think it would be useful to clarify some of the methods in the main text. We also agreed that the figures were overall informative and useful, however, we believe that some of the illustrations could be improved to aid with the understanding of the results. See comments of reviewer 3.

We thank the reviewers for their suggestion. To improve clarity we have moved all Supplementary Notes to the methods section and we now also provide a detailed listing of parameters used for Shiver, MAFFT, HIValign, HIV-TRACE, Cluster Picker and FastTree2 in Supplementary file 1 Table 1B. Furthermore, we have replaced the alluvial plot with a barplot (Figure 16) and combined the figures that show the estimated transmission flows of HIV-1 infection in the Botswana – Ya Tsie trial population within and between geographical regions and trial arms into a single multi-panel figure (Figures 18A and 18B) in line with the reviewer’s suggestions.

Reviewer #1 (Recommendations for the authors):

1. I think my concerns 8-11 above could be relatively easily addressed by sensitivity analyses showing that a change in respective definitions did not affect the main outcomes qualitatively.

2. The estimated proportion of HIV transmissions in the trial population flowing into intervention communities from control communities and vice-versa was a key finding of the manuscript. Since the number of probable transmission pairs is low, the confidence intervals are wide and overlapping. I appreciated / think that it is important and correct that you only considered probable transmission pairs for the age gap analysis. However, for this analysis I wouldn't see a problem for including pairs with probable unsampled intermediates. Was this attempted / why not?

Kindly refer to the responses to essential revision numbers 1 and 2.

Reviewer #2 (Recommendations for the authors):

The methods describing the initial sequence quality control and alignment need to be expanded. Currently, it is unclear how the shiver assembly software was run. While the sequencing and bioinformatics pipelines used here have been used in previous studies the possible impact of sequencing errors should be discussed. Ideally, replicates would be used to determine the expected rate of sequencing errors. However, other approaches such as the proportion of minority variants at different coding locations for different variant frequency cutoffs could also be used (Dyrdak et al., 2019). This is particularly important given that most trial participants were virally suppressed which could impact the overall quality of the sequencing data.

Kindly refer to the response to essential revision number 3. Furthermore, we agree with the reviewer that the impact of sequencing errors for sequences generated from proviral DNA is an important issue to discuss. Accordingly, we have expanded the methods as follows:

“Paired-end deep-sequencing of HIV viral genomes for phylogenetic analyses. … Moreover, the quality of sequencing was assessed with standard metrics for deep (or next-generation) sequencing data; however, we cannot exclude the potential for sequencing errors arising from hypermutations. The shiver sequence assembly software [20] was used to assemble and map each participant’s deep-sequencing short reads to a de-novo reference sequence tailored to the participant’s viral population. A listing of command-line parameters used to assemble HIV viral whole genomes with Shiver is provided in Supplementary file 1 Table 1B.”

Studying the impact of hypermutations on the quality of proviral DNA sequences is an interesting question that we prefer to reserve for future study, not least as it is not trivial.

Much of the downstream analysis and results rely on the initial separation of sequenced samples into putative transmission clusters. As a result, the choice of the 4.5% genetic distance threshold and the decision to combine the results of HIV-TRACE and Cluster Picker need much stronger motivation. This is particularly so as Supplementary Figures 7 and 8 suggest that the two methods give quite different results. Whilst I realise Cluster Picker is commonly used, HIV-TRACE does not rely on the accuracy of a provided phylogeny and, as the following transmission analysis only considers pairwise relationships, HIV-TRACE seems more appropriate.

Kindly refer to the response to essential revision number 1.

I may have misunderstood but the method for calibrating the false discovery rate (FDR) for the inference of direct transmission described in Supplementary note 3 appears to consider all pairwise relationships (equation 6). This would only be possible if a person was able to be infected by multiple different transmitters. Assuming that the authors have only considered a single transmitter per infection then this would lead to an overestimate of the FDR. This should be clarified in the text.

We are happy to clarify for the reviewer that the estimated false discovery rate is indeed an upper bound as described in the results.

“Deep-sequence phylogenetics to infer the probable order of transmission events within identified clusters of genetically similar HIV-1 infections … We used the probability of inferring a phylogenetically-linked directed male-male pair in the sample to calibrate an upper bound on the false-discovery rate of inferring direct transmission between males and females in phylogenetically-linked male-female pairs; the estimated number of linked male-female pairs in the sample with unsampled intermediates was approximately 30, corresponding to a false discovery rate of 36% (30 / 82) (see methods section on estimating error rates in phylogenetic inference of direct transmission between sampled males and females). This estimated false positive rate is likely inflated given that two individuals would need to be missing from the sequenced sample to incorrectly infer transmission in a male-female pair, whereas only a single female would need to be missing to erroneously infer transmission in a male-male pair.”

In line with the reviewer’s suggestion this point is further emphasized in the methods section on estimating error rates in phylogenetic inference of direct transmission between sampled males and females as follows:

“This represents an upper bound on the false-discovery rate of inferring direct transmission between males and females in phylogenetically-linked male-female pairs.”
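
The upper bound quoted above is a simple ratio of the estimated number of linked male-female pairs with unsampled intermediates to the number of inferred male-female pairs. The sketch below only restates that ratio; it does not reproduce the calibration from male-male pairs, and the function name is illustrative. Because the count is reported only approximately, the ratio from the rounded value of 30 (about 0.366) may differ slightly from the reported 36%.

```python
def fdr_upper_bound(pairs_with_unsampled_intermediates, total_inferred_pairs):
    """Upper bound on the false-discovery rate as a plain proportion."""
    return pairs_with_unsampled_intermediates / total_inferred_pairs


# Rounded count taken from the quoted text: approximately 30 of 82 pairs.
bound = fdr_upper_bound(30, 82)
print(f"upper bound on FDR from rounded count: {bound:.3f}")
```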

Finally, it would be useful if the authors could discuss what impact high risk communities such as sex workers could have on these results and the estimates of intercommunity transmission.

In line with the reviewer’s suggestions we have updated the discussion as follows:

“Most participants in the household survey of the Botswana – Ya Tsie trial indicated that they had not engaged in transactional sex in the last twelve months. This suggested that there was a limited contribution of sex workers to the transmission patterns identified in the trial.”

– Many of the methods are described in the supplementary materials which makes the interpretation of the results challenging. The manuscript would be improved by moving some of the supplementary methods into the main text.

We thank the reviewer for their suggestion and would like them to kindly note that all previous Supplementary Notes are now presented in the methods section.

– The command line parameters used to run MAFFT and HIValign should be provided.

A detailed listing of parameters used to run MAFFT and HIValign is now provided in Supplementary file 1 Table 1B.

– While I was able to successfully download and install the 'bumblebee' R package I could not get a small trial example to run. Example data and a short 'how to guide' should be included as part of the package.

We thank the reviewer for their comment and apologize that this was insufficiently clear in the paper. To improve clarity we have updated the code availability section and methods as follows:

“A step-by-step tutorial on how to estimate HIV transmission flows with bumblebee and accompanying example datasets can be accessed at: https://github.com/magosil86/bumblebee/blob/master/vignettes/bumblebee-estimate-transmission-flows-and-ci-tutotial.md ”.

– Figure 7 is quite confusing and I think relatively little is gained by the use of an alluvial plot. I think it would be clearer to separate this plot into multiple bar plots of the total number of transmissions rather than the proportion within each category.

Kindly refer to the response to essential revision number 3.

– It would be better to combine Figures 9 and 10 into a multi panel figure

Kindly refer to the response to essential revision number 3.

– Lines 177, 184, 216, 219, 279, 342, 345 etc: The multiplication symbols look like a variable and should be removed.

We thank the reviewer for their suggestion and have updated the manuscript accordingly.

– Line 262: I think this should say that the Goodman method estimates CIs for the parameters of the multinomial distribution.

We thank the reviewer for their suggestion and have updated the manuscript accordingly.
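
For readers unfamiliar with the Goodman method mentioned in the comment above, the following is a minimal sketch of Goodman-style simultaneous confidence intervals for the parameters of a multinomial distribution. It is written with the Python standard library only and is not the bumblebee implementation; the counts are hypothetical. The chi-square quantile with one degree of freedom is obtained as the square of a normal quantile.

```python
from math import sqrt
from statistics import NormalDist


def goodman_intervals(counts, alpha=0.05):
    """Goodman-style simultaneous confidence intervals for multinomial proportions."""
    n = sum(counts)
    k = len(counts)
    # Chi-square quantile (1 d.f.) at 1 - alpha/k, computed as a squared normal quantile.
    z = NormalDist().inv_cdf(1 - alpha / (2 * k))
    a = z * z
    intervals = []
    for c in counts:
        half_width = sqrt(a * (a + 4 * c * (n - c) / n))
        lower = (a + 2 * c - half_width) / (2 * (n + a))
        upper = (a + 2 * c + half_width) / (2 * (n + a))
        intervals.append((lower, upper))
    return intervals


# Hypothetical counts for four categories (made-up values, not trial data).
example_counts = [10, 20, 30, 40]
for count, (lower, upper) in zip(example_counts, goodman_intervals(example_counts)):
    print(f"count={count}: {lower:.3f} to {upper:.3f}")
```

Dividing alpha by the number of categories is what makes the intervals hold simultaneously across all categories rather than one at a time.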

https://doi.org/10.7554/eLife.72657.sa2

Article and author information

Author details

  1. Lerato E Magosi

    Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, United States
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review and editing
    For correspondence
    lmagosi@hsph.harvard.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-3388-9892
  2. Yinfeng Zhang

    Division of Molecular & Genomic Pathology, University of Pittsburgh Medical Center Presbyterian Shadyside, Pittsburgh, United States
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Tanya Golubchik

    Oxford Big Data Institute, Li Ka Shing Center for Health Information and Discovery, Nuffield Department of Medicine, Old Road Campus, University of Oxford, Oxford, United Kingdom
    Contribution
    Data curation, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-2765-9828
  4. Victor DeGruttola

    Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, United States
    Contribution
    Methodology, Validation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  5. Eric Tchetgen Tchetgen

    Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, United States
    Contribution
    Methodology, Validation, Writing – review and editing
    Competing interests
    No competing interests declared
  6. Vladimir Novitsky

    1. Harvard T.H. Chan School of Public Health AIDS Initiative, Department of Immunology and Infectious Disease, Harvard T.H. Chan School of Public Health, Harvard University, Boston, United States
    2. Botswana Harvard AIDS Institute Partnership, Gaborone, Botswana
    Contribution
    Data curation, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  7. Janet Moore

    Division of Global HIV/AIDS and TB, Centers for Disease Control and Prevention, Atlanta, United States
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  8. Pam Bachanas

    Division of Global HIV/AIDS and TB, Centers for Disease Control and Prevention, Atlanta, United States
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  9. Tebogo Segolodi

    HIV Prevention Research Unit, Centers for Disease Control and Prevention, Gaborone, Botswana
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  10. Refeletswe Lebelonyane

    Ministry of Health, Republic of Botswana, Gaborone, Botswana
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  11. Molly Pretorius Holme

    Harvard T.H. Chan School of Public Health AIDS Initiative, Department of Immunology and Infectious Disease, Harvard T.H. Chan School of Public Health, Harvard University, Boston, United States
    Contribution
    Project administration, Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  12. Sikhulile Moyo

    Botswana Harvard AIDS Institute Partnership, Gaborone, Botswana
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  13. Joseph Makhema

    Botswana Harvard AIDS Institute Partnership, Gaborone, Botswana
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  14. Shahin Lockman

    1. Harvard T.H. Chan School of Public Health AIDS Initiative, Department of Immunology and Infectious Disease, Harvard T.H. Chan School of Public Health, Harvard University, Boston, United States
    2. Botswana Harvard AIDS Institute Partnership, Gaborone, Botswana
    3. Brigham and Women's Hospital, Division of Infectious Diseases, Boston, United States
    Contribution
    Project administration, Resources, Writing – review and editing
    Competing interests
    Participates in a data safety monitoring board for an NIH-funded study of the PK of TB drugs and antiretrovirals in children and on a scientific advisory board for an observational study of DTG programmatic rollout in Botswana. Is also a member of the Finance Board and a member of the Board of Directors for the Botswana Harvard AIDS Institute Partnership. Receives no financial compensation for these roles, and has no other competing interests to declare
  15. Christophe Fraser

    Oxford Big Data Institute, Li Ka Shing Center for Health Information and Discovery, Nuffield Department of Medicine, Old Road Campus, University of Oxford, Oxford, United Kingdom
    Contribution
    Data curation, Methodology, Resources, Validation, Writing – review and editing
    Competing interests
    No competing interests declared
  16. Myron Max Essex

    1. Harvard T.H. Chan School of Public Health AIDS Initiative, Department of Immunology and Infectious Disease, Harvard T.H. Chan School of Public Health, Harvard University, Boston, United States
    2. Botswana Harvard AIDS Institute Partnership, Gaborone, Botswana
    Contribution
    Conceptualization, Funding acquisition, Resources, Supervision, Writing – review and editing
    Competing interests
    No competing interests declared
  17. Marc Lipsitch

    Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Harvard University, Boston, United States
    Contribution
    Conceptualization, Funding acquisition, Methodology, Resources, Supervision, Validation, Writing – original draft, Writing – review and editing
    For correspondence
    mlipsitc@hsph.harvard.edu
    Competing interests
    Is a Reviewing Editor for eLife. Has received consultancy fees from Merck, the University of Virginia Miller Center, and Janssen, and has performed unpaid consultancy work for Janssen, Pfizer, and AstraZeneca. Has also received payments or honoraria from Sanofi Pasteur and Bristol Myers Squibb. ML has no other competing interests to declare
    ORCID: 0000-0003-1504-9213

Funding

Fogarty International Center (D43 TW009610)

  • Lerato E Magosi

Centers for Disease Control and Prevention (U01 GH000447 and U2G GH001911)

  • Lerato E Magosi
  • Janet Moore
  • Pam Bachanas
  • Refeletswe Lebelonyane
  • Molly Pretorius Holme
  • Shahin Lockman
  • Myron Max Essex

National Institutes of Health

  • Christophe Fraser
  • Marc Lipsitch

Bill and Melinda Gates Foundation

  • Christophe Fraser

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors are grateful to participants and collaborators from the Botswana Combination Prevention Project and PANGEA consortium for their support during this work. Additionally, the authors thank Susan Eshleman for contributing data from the HPTN 052 study and Roger Shapiro for his helpful comments on early versions of this manuscript. This study was supported by the National Institute of General Medical Sciences (U54GM088558), the Fogarty International Center (FIC) of the U.S. National Institutes of Health (D43 TW009610), and the President’s Emergency Plan for AIDS Relief through the Centers for Disease Control and Prevention (CDC) (Cooperative agreements U01 GH000447 and U2G GH001911).

Ethics

Human subjects: The BCPP study was approved by the Botswana Health Research and Development Committee and the institutional review board of the Centers for Disease Control and Prevention, and was monitored by a data and safety monitoring board and Westat. Written informed consent for enrollment in the study and viral HIV genotyping was obtained from all participants.

Senior Editor

  1. Miles P Davenport, University of New South Wales, Australia

Reviewing Editor

  1. Francesca Di Giallonardo, University of New South Wales, Australia

Reviewer

  1. Nadine Tschumi, University Hospital Zurich, Switzerland

Publication history

  1. Preprint posted: June 23, 2021
  2. Received: July 30, 2021
  3. Accepted: February 8, 2022
  4. Accepted Manuscript published: March 1, 2022 (version 1)
  5. Version of Record published: March 10, 2022 (version 2)

Copyright

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Cite this article

Lerato E Magosi, Yinfeng Zhang, Tanya Golubchik, Victor DeGruttola, Eric Tchetgen Tchetgen, Vladimir Novitsky, Janet Moore, Pam Bachanas, Tebogo Segolodi, Refeletswe Lebelonyane, Molly Pretorius Holme, Sikhulile Moyo, Joseph Makhema, Shahin Lockman, Christophe Fraser, Myron Max Essex, Marc Lipsitch, On behalf of The Botswana Combination Prevention Project and PANGEA consortium (2022)
Deep-sequence phylogenetics to quantify patterns of HIV transmission in the context of a universal testing and treatment trial – BCPP/Ya Tsie trial
eLife 11:e72657.
https://doi.org/10.7554/eLife.72657
