Estimating the potential to prevent locally acquired HIV infections in a UNAIDS FastTrack City, Amsterdam
Abstract
Background:
More than 300 cities including the city of Amsterdam in the Netherlands have joined the UNAIDS FastTrack Cities initiative, committing to accelerate their HIV response and end the AIDS epidemic in cities by 2030. To support this commitment, we aimed to estimate the number and proportion of Amsterdam HIV infections that originated within the city, from Amsterdam residents. We also aimed to estimate the proportion of recent HIV infections during the 5year period 2014–2018 in Amsterdam that remained undiagnosed.
Methods:
We located diagnosed HIV infections in Amsterdam using postcode data (PC4) at time of registration in the ATHENA observational HIV cohort, and used HIV sequence data to reconstruct phylogeographically distinct, partially observed Amsterdam transmission chains. Individuallevel infection times were estimated from biomarker data, and used to date the phylogenetically observed transmission chains as well as to estimate undiagnosed proportions among recent infections. A Bayesian Negative Binomial branching process model was used to estimate the number, size, and growth of the unobserved Amsterdam transmission chains from the partially observed phylogenetic data.
Results:
Between 1 January 2014 and 1 May 2019, there were 846 HIV diagnoses in Amsterdam residents, of whom 516 (61%) were estimated to have been infected in 2014–2018. The rate of new Amsterdam diagnoses since 2014 (104 per 100,000) remained higher than the national rates excluding Amsterdam (24 per 100,000), and in this sense Amsterdam remained a HIV hotspot in the Netherlands. An estimated 14% [12–16%] of infections in Amsterdan MSM in 2014–2018 remained undiagnosed by 1 May 2019, and 41% [35–48%] in Amsterdam heterosexuals, with variation by region of birth. An estimated 67% [60–74%] of Amsterdam MSM infections in 2014–2018 had an Amsterdam resident as source, and 56% [41–70%] in Amsterdam heterosexuals, with heterogeneity by region of birth. Of the locally acquired infections, an estimated 43% [37–49%] were in foreignborn MSM, 41% [35–47%] in Dutchborn MSM, 10% [6–18%] in foreignborn heterosexuals, and 5% [2–9%] in Dutchborn heterosexuals. We estimate the majority of Amsterdam MSM infections in 2014–2018 originated in transmission chains that preexisted by 2014.
Conclusions:
This combined phylogenetic, epidemiologic, and modelling analysis in the UNAIDS FastTrack City Amsterdam indicates that there remains considerable potential to prevent HIV infections among Amsterdam residents through citylevel interventions. The burden of locally acquired infection remains concentrated in MSM, and both Dutchborn and foreignborn MSM would likely benefit most from intensified citylevel interventions.
Funding:
This study received funding as part of the HTEAM initiative from Aidsfonds (project number P29701). The HTEAM initiative is being supported by Aidsfonds (grant number: 2013169, P29701, P60803), Stichting Amsterdam Dinner Foundation, BristolMyers Squibb International Corp. (study number: AI424541), Gilead Sciences Europe Ltd (grant number: PAHIVPREP160024), Gilead Sciences (protocol numbers: CONL2764222, COUS2761712, CONL9856195), and M.A.C AIDS Fund.
Editor's evaluation
Congratulations on this impressive paper which combines clinical biomarker data, patient specific data and viral genetics data to estimate the proportion of HIV infections occurring within key subgroups of the population in Amsterdam. The work is methodologically impressive and also may be of high utility for understanding the spread of HIV and other viral infections through the population.
https://doi.org/10.7554/eLife.76487.sa0Introduction
Human immunodeficiency virus (HIV) is concentrated in metropolitan areas (Joint United Nations Programme on HIV/AIDS, 2014). In response, as of March 2021 over 300 cities have joined the FastTrack Cities initiative (www.fasttrackcities.org) by signing the Paris Declaration, committing to end the AIDS epidemic by 2030, by addressing disparities in access to basic health and social services, social justice and economic opportunities (UNAIDS, 2019). Several of these fasttrack cities have successfully developed strategies which best address the needs of the local epidemic, including London’s HIV Prevention Programme and early ART initiation, and New York’s Status Neutral Prevention and Treatment Cycle (Public Health England, 2018; Myers et al., 2018). A central milestone in this agenda is to characterise the number of HIV infections that are acquired from sources within cities and are thus preventable through local interventions, as well as to identify the primary risk groups with infections from local sources.
In the Netherlands, Amsterdam is the city with the greatest HIV burden nationally, reflecting in part large communities of MSM and foreignborn individuals. Amsterdam has a long history of a collaborative HIV approach in combating the epidemic and joined the UNAIDS FastTrack Cities initiative on 1 December 2014. Citylevel HIV responses were galvanised in the HIV Transmission Elimination Amsterdam project (HTeam) that same year (de Bree et al., 2019). The HTeam fasttrack response, amongst others, focussed on outreach activities, encouraging repeat testing every 3–6 months to identify acute and early HIV infection, followed by immediate initiation of combination antiretroviral therapy (cART) in newly diagnosed patients, and rollout of preexposure prophylaxis (PreP) in populations at increased risk of HIV infection (den Daas et al., 2018; Bartelsman et al., 2017; Hoornenborg et al., 2019; Dijkstra et al., 2019). Prior to the COVID19 pandemic, the number of annual HIV diagnoses in Amsterdam residents has consistently declined from ~300 new citylevel HIV diagnoses in 2010 to ~120 in 2018, primarily in Dutchborn and foreignborn MSM. Given these achievements, it is now unclear how many of the remaining new infections are locally acquired and could thus still be locally averted. Late diagnoses remain common and are a particular concern in this effort, both for individual health and the risk that unnoticed transmission chains pose to public health.
Here, we build on Amsterdam’s combined case and genomic surveillance data to reconstruct transmission chains at city level, defined as a single introduction of HIV into Amsterdam residents, followed by a direct infection chain among Amsterdam residents (Figure 1). We exploit clinical patient data to estimate times of HIV infection at individual level, which provides crucial temporal information for interpreting the observed transmission chains. This allows us to estimate the extent of undiagnosed infections at the forefront of the cities’ transmission chains, among infections that are estimated to have occured since Amsterdam joined the FastTrack Cities network in 2014. We then characterise the growth and origins of Amsterdam transmission chains in 2014–2018, and quantify in particular the proportion of Amsterdam infections in this time period that had an Amsterdam resident as source, and could have been locally averted.
Materials and methods
Demographic and clinical cohort data comprising citylevel infections
Request a detailed protocolData were obtained from the prospective ATHENA cohort of all people living with HIV (PLHIV) in care in the Netherlands, including patient demographics and longitudinal CD4, HIV viral load, viral sequence, and treatment data (see Appendix 1, Section 2) (Boender et al., 2018). Sequencing methods are described previously (Bezemer et al., 2004). Cohort data are near complete in the sense that 2% of individuals opted out of participating in the ATHENA study, and 5.2% of individuals who entered ATHENA were lost to followup (Boender et al., 2018; Sighem et al., 2020). We geolocated diagnosed infections to Amsterdam based on patients’ postcode of residence at time of first registration in ATHENA or the most recent registration update, which includes PLHIV that changed residence to Amsterdam at a registration update (4%), PLHIV that changed residence to another Dutch municipality after first registration (4%), and PLHIV that were consistently resident in Amsterdam (92%).
Participants were stratified by region of birth: MSM (The Netherlands; Western Europe, North America, Oceania; Eastern and Central Europe; South America and the Caribbean; Other), and heterosexual individuals (The Netherlands; South America and the Caribbean; SubSaharan Africa; Other), resulting in 9 risk groups in total. Throughout, we denote transmission group (Amsterdam MSM or heterosexuals) by $t$, and geographic region of birth by $r$.
We here focus on citylevel transmission chains growing in the period from 1 January 2014 to 31 December 2018, which for brevity we refer to as 2014–2018. Available demographic, clinical, and viral sequence data were obtained for HIV diagnoses in Amsterdam from the ATHENA database version closed on 1 May 2019.
Estimating HIV infection dates and undiagnosed infections
Request a detailed protocolUsing longitudinal viral load and CD4 count data and further demographic and clinical information, we estimated time from infection to diagnosis for all HIV diagnosed patients with a Bayesian approach (Pantazis et al., 2019). Briefly, data from the CASCADE collaboration on 19,788 observed HIV seroconverters were used to parameterize a bivariate normal linear model of the joint time evolution of HIV viral load and CD4 cell count decline since time of infection in the context of additional covariates (sex, region of origin, mode of infection, age at time of diagnosis). Then we used the trained model to estimate infection times from longitudinal biomarker data for Amsterdam patients, with an average of four viral load observations and six CD4 cell count observations per patient. We next reconstructed characteristic timetodiagnosis distributions for each of the nine Amsterdam risk groups (MSM/heterosexual, and region of birth) with a Bayesian hierarchical model from the individuallevel estimates, modelling the individuallevel estimates with a Weibull distribution. To avoid censoring of infectiontodiagnosis times, we focused analyses on the subset of infections in 2010–2012 which were diagnosed by 1 May 2019 since most infections in this window would have been diagnosed by the close of study, and assume as supported by mathematical models that timetodiagnosis did not change substantially in 2010–2019 (Sighem, 2017; Sighem et al., 2017). The model was implemented with Stan version 2.21 (Carpenter et al., 2017). Full details are provided in Appendix 1, Section 3.
We then calculated the proportion of infections in each year $y=2014,...,2018$ in each of the 9 Amsterdam risk groups that were not diagnosed by database closure (which we denote by ${\delta}_{try}$) from the fitted model. To adjust for trends in incidence over time, the annual estimates were weighted by the estimated number of HIV infections in each year among Amsterdam MSM and heterosexual individuals without stratifiction by inmigrant status, according to the European Centre for Disease Control and Prevention (ECDC) HIV modelling tool for Amsterdam, version 1.3.0 (Stockholm: European Centre for Disease Prevention and Control, 2017) through weights,
where y=2014,...,2018 and ${N}_{ty}^{InfECDC}$ are the estimated total number of infections in year $y$ in Amsterdam MSM or heterosexuals. We then obtained an overall estimate of the proportion of undiagnosed infections in 2014–2018, ${\delta}_{tr}$ , by applying these weights to the yearly proportions through
Recognizing the limitations in applying weights that do not account for differences by place of birth, we used in sensitivity analyses as weights the observed trends in the number of annual HIV diagnoses in the corresponding Amsterdam risk group. The total number of Amsterdam infections in 2014–2018 including the undiagnosed (which we denote by ${N}_{tr}^{Inf}$) was next estimated by dividing the number of diagnosed Amsterdam infections in 2014–2018 (which we denote by ${N}_{tr}^{D}$) with the estimated proportion of diagnosed individuals,
Phylogenetic reconstruction of transmission chains among Amsterdam residents
Request a detailed protocolTo reconstruct distinct HIV transmission chains among Amsterdam residents, we used the first available partial HIV1 polymerase (pol) sequence from Amsterdam PLHIV, Dutch PLHIV from outside Amsterdam, and ~82,000 pol sequences from nonDutch PLHIV. The nonDutch viral sequences were retrieved from the Los Alamos HIV1 sequence database subject to a length of at least 1300 in the pol gene on March 2, 2020 (www.hiv.lanl.gov). The basic local alignment search tool (BLAST v2.10.0) was used to select the top 20 closest background sequences to any Dutch sequence (Altschul et al., 1990). All sequences were subtyped using Comet v2.3 (Struck et al., 2014). Sequences with an uncertain subtype classification using Comet were analysed with Rega v3.0 (PinedaPeña et al., 2013). Any remaining sequences for which a subtype could not be resolved were discarded from further analysis (n=122). Subtypespecific alignments were generated with Virulign (Libin et al., 2019) (Appendix 1 Section 4.1) and sequences from other subtypes were added as outgroup for the purpose of phylogenetic rooting. The final alignments were trimmed to positions 2253–3870 in the reference genome HXB2 (Ratner et al., 1985).
Subtypespecific HIV phylogenetic trees were generated for alignments with at least 50 Amsterdam sequences (subtypes and recombinant forms B, 01AE, 02AG, C, D, G, A1 or 06 cpx) using FastTree v2.1.8 (Price et al., 2010) rooted at the outgroup, and the outgroup taxa were then pruned from the phylogeny. Next, we attributed to all viral lineages in the phylogenies a ‘state’ label that included information on the transmission risk group (MSM, heterosexual, other) and location with phyloscanner version 1.8.0 (Wymant et al., 2018); see Bezemer et al., 2022 for details. Locations were classified into Amsterdam (for ATHENA patients with an Amsterdam postcode at time of registration or a registration update), the Netherlands (for other ATHENA patients), and the 9 world regions Africa, Western Europe, Eastern Europe and Central Asia, North America, Latin America and the Caribbean, Dutch Caribbean and Suriname, Middle East and North Africa, South and SouthEast Asia and Oceania (for nonDutch sequences).
In the labelled phylogeny, the lineage labels jump backwards in time, for example from Amsterdam MSM associated with a lineage ending in a tip observed in Amsterdam MSM to Western Europe. Thus, we can group lineages according to the same label between jumps, and we follow Wymant et al., 2018 in referring to these groups as phyloscanner subgraphs. We assumed that we have sufficient background sequences such that no additional background sequences would further separate transmission chains among Amsterdam residents into more distinct chains. A subtle but important related point is that with the available location data at time of registration or a registration update, we are only able to phylogenetically reconstruct transmission chains by residence status rather than the location at which transmission actually occurred. For example, two Amsterdam residents appear in the same phyloscanner subgraph if they infected each other during a shortterm visit in another Dutch, European or global location, if they were both infected from a common source during such a shortterm visit and the source remained unsampled, if they infected each other before they began their residence in Amsterdam, or after they moved to another Dutch municipality. Diagnosed Amsterdam patients in the same subgraph were then interpreted as belonging to the same transmission chain, and the estimated state of the root of the subgraph was interpreted as the geographical origin of the transmission chain. Throughout, we refer to the subgraphs also as the phylogenetically observed (parts of) transmission chains. Using this approach, we note that unlike most phylogenetic clustering analyses (Burns et al., 2017), every infected patient with a sequence is included in one subgraph, and all partially observed transmission chains of size one are included in the analysis to ensure that the entire distribution of observed transmission chains is represented in the analysis (Bezemer et al., 2022). To capture phylogenetic uncertainty, phylogenetic analyses were repeated on 100 bootstrap replicates drawn from each subtype alignment, and transmission chains were enumerated across these replicate analyses.
We classified phylogenetically reconstructed transmission chains by the infection dates that we estimated from each patient’s diagnosis date, risk group, age, CD4 trajectory and viral load trajectory. Chains were classified as ‘preexisting’ if at least one of its members had a posterior median infection date before 2014, and as ‘emerging’ if all members had a posterior median infection date after January 1, 2014.
Virally unsuppressed transmission chains
Request a detailed protocolFor all preexisting chains, we determined the number of infectious individuals at the start of 2014 from viral load data. Specifically, we defined patients as suppressed by 2014 if their last viral load measurement before 2014 was below 100 copies/ml, and count for each preexisting chain its suppressed and unsuppressed members by 2014.
Estimating the growth of citylevel transmission chains
Request a detailed protocolBecause of the large number of late presenters and incomplete sequence coverage in diagnosed patients, the phylogenetically observed transmission chains are incomplete and statistical models were required to estimate the growth and origins of Amsterdam transmission chains. We here extended the Bayesian branching process model of Bezemer et al., 2022 to estimate the growth of preexisting transmission chains. Specifically, given $m=1,...,M$ index cases of a chain that preexisted, the final size distribution of stuttering transmission chains is under a Negative Binomial branching process model given by
where NegBin is the Negative Binomial distribution characterised by mean $\mu m$ and dispersion parameter $\varphi m$, $i=0,1,2,...$ is the number of new cases, and μ < 1. Incomplete sampling of new cases can be accommodated via
where $\rho$ denotes the probability that a new case in 2014–2018 is diagnosed and has a viral sequence sampled by database closure. In the model, the index cases are assumed to be infectious and defined by the number of unsuppressed members by 2014 in a preexisting chain, adjusted for the sampling probability of such members. We further capped the infinite sum in (3) in the model, recognizing that the summands rapidly tend to zero. The corresponding equation for emergent transmission chains (since 2014 as defined above) is similar,
where $n=1,2,...$ are the total number of observed cases in an emerging chain. We then denote with $x}_{s$ and $\stackrel{\sim}{x}}_{s$ respectively the observed growth distributions for the phylogenetically observed, preexisting and emergent transmission chains in the phylogeny of subtype/ recombinant form, and for either Amsterdam MSM or heterosexuals, which we denote by $s$. Here, $x}_{s$ is a matrix with rows indicating the number of index cases and columns indicating the number of new cases, and $\stackrel{\sim}{x}}_{s$ is a row vector with rows indicating the total number of cases in emerging chains. For ease of reading, we suppress the subscripts where possible from now on. The likelihood then comprises the growth distributions of emerging chains, preexisting chains that continued to grow, and preexisting chains with unsuppressed members that did not grow, with the following loglikelihood,
where $M$ is the largest number of index cases observed across the chains after adjusting for sampling, $I$ is the largest number of new cases observed in preexisting chains and $N$ is the largest number of new cases observed in emergent chains, including the first case. Preexisting chains for which all members were suppressed by 2014 and which did not grow were not included, because these chains had no unsuppressed index case. Due to small counts, we grouped the observed growth distributions for the phylogenetically observed transmission chains for nonB subtypes together before fitting the model. We fitted the branching process model under a Bayesian framework with Stan version 2.21 to the observed growth distributions among MSM, borrowing information across subtypes B and nonB, and similarly for heterosexuals. The primary output of the model are posterior predictive distributions on the number, size and growth of the actual transmission chains among Amsterdam residents, both for MSM and heterosexuals, and by viral subtype. This includes emerging chains that were entirely unsampled. Full details are provided in Appendix 1, Section 6.
Derived statistical estimates
Request a detailed protocolGiven estimates of the number and growth of both preexisting and emergent transmission chains, it is straightforward to derive estimates of the proportion of HIV infections among Amsterdam residents in 2014–2018 that had an Amsterdam resident as source (which we denote by $\gamma $ and refer to as the proportion of locally acquired infections). This is because all infections originating from an individual living in Amsterdam had a local source, except the index cases in the emerging chains that were introduced from outside of Amsterdam. Ignoring population subgroups for the derivation, we have
where ${N}^{I}$ is the estimated number of new infections between 2014 and 2018 in Amsterdam residents, ${N}^{C}$ is the estimated number of transmission chains which emerged between 2014 and 2018 and $\alpha $ is the estimated proportion of emergent transmission chains with an Amsterdam origin. Since each transmission chain has one index case, $\alpha {N}^{C}$ is the estimated number of infections with nonAmsterdam origin, and $N}^{I}\alpha {N}^{C$ is the estimated number of infections that had an Amsterdam resident as a source.
Using Equation 8, we were able to obtain estimates (8) for Amsterdam MSM residents and Amsterdam heterosexual residents, and for each phylogeny, that is stratified further by each of the major subtypes and recombinant forms (which we denote by ${\gamma}_{s}$). To obtain estimates stratified by the nine Amsterdam risk groups of interest (where $t$ denotes transmission group MSM or heterosexual and $r$ denotes geographic region of birth), we calculated weighted averages of the ${\gamma}_{ts}$ across chains and subtypes, with the weight determined as the proportion of the infected individuals in transmission group $t$ (i.e. either MSM or heterosexuals) from region of birth $r$ that are infected with subtype/recombinant form s. Specifically,
where the proportions ${\nu}_{tsr}$ are for brevity defined in Appendix 1 Section 7. We interpret ${\gamma}_{tr}$ as the proportion of Amsterdam infections in transmission risk group $t$, from geographic region $r$, that have the potential to be preventable through local interventions.
Ethics
As from 2002 ATHENA is managed by Stichting HIV Monitoring, the institution appointed by the Dutch Ministry of Public health, Welfare and Sport for the monitoring of people living with HIV in the Netherlands. People entering HIV care receive written material about participation in the ATHENA cohort and are informed by their treating physician on the purpose of data collection, thereafter they can consent verbally or elect to optout. Data are pseudonymised before being provided to investigators and may be used for scientific purposes. A designated data protection officer safeguards compliance with the European General Data Protection Regulation (Boender et al., 2018).
Results
Substantial declines in HIV diagnoses and infections in Amsterdam
Between 1 January 2014 and 1 May 2019, there were 846 HIV diagnoses in Amsterdam residents who selfidentified as MSM (75%) or heterosexual (20%). Of the remaining diagnoses, 1 (<1%) was among injecting drug users (IDU), 12 (1%) were through other modes of transmission and 30 (3%) had an unknown mode of transmission. A total of 275 (33%) of the diagnoses in MSM and heterosexuals presented with a CD4 count below 350, with late presentation being higher among heterosexuals. All diagnosed patients had biomarker data available to estimate time to diagnosis, and 516 of 846 (61%) were estimated to have been infected between 2014 and 2018 based on the posterior median infection time estimate (Table 1). In the preceding 5year period 2009–2013, there were 1436 HIV diagnoses in Amsterdam and a similar proportion of these presented late (567, 39%). There were 1128 diagnoses with estimated infection in 2009–2013, suggesting a substantial reduction in infections in 2014–2018. Yet, the rate of new Amsterdam diagnoses since 2014 (104 per 100,000) remained higher than the national rates excluding Amsterdam (24 per 100,000), and in this sense Amsterdam remains a HIV hotspot in the Netherlands.
Nine of ten Amsterdam diagnoses and infections are in MSM
A total of 190 (37%) Amsterdam diagnoses with estimated infection in 2014–2018 were in Dutchborn MSM, 256 (50%) in foreignborn MSM, 23 (4%) in Dutchborn men and women identifying as heterosexuals, and 47 (9%) in foreignborn heterosexuals. Thus, the large majority of Amsterdam diagnoses with infection dates between 2014 and 2018 were in foreignborn and Dutchborn MSM, and an important question that we address below is if these diagnoses also likely had an Amsterdam source.
Overall, we find the individuallevel timetodiagnosis estimates varied substantially within each of the 9 Amsterdam risk groups shown in Table 1 (see also Appendix 1—figures 1 and 2). The posterior median timetodiagnosis estimates among individuals were 14 months longer in heterosexuals than in MSM, 9 months longer in Dutchborn heterosexuals than Dutchborn MSM, and 19 months longer in foreignborn heterosexuals than foreignborn MSM (Appendix 1—figure 3). These substantial diagnosis delays continue to undermine the longterm prognosis of infected individuals and transmission prevention efforts.
High proportion of infections since 2014 that remained undiagnosed by May 2019
Local estimates of the continuum of care indicate that Amsterdam has surpassed the 959595 targets, with an estimated 5% of all people in Amsterdam living with HIV that remained undiagnosed by the end of 2019 (Sighem et al., 2020; UNAIDS, 2019). Based on the timetodiagnosis estimates in our cohort, we can focus here at the forefront of ongoing transmission chains and quantify the proportion of recent Amsterdam infections in 2014–2018 that remained undiagnosed by 1 May 2019. Figure 2 shows that the estimated undiagnosed proportions are considerably higher when we focus on infections acquired since 2014. Accounting for declining diagnosis and infection trends (see Materials and methods), an estimated 14% [12–16%] of infections in Amsterdan MSM in 2014–2018 remained undiagnosed, and 41% [35–48%] in Amsterdam heterosexuals (Table 1). The highest proportion of undiagnosed Amsterdam infections in 2014–2018 are in heterosexuals born in SubSaharan Africa, with 57% [47–67%].
While the bivariate model of biomarker data that underpins the individuallevel timetodiagnosis estimates has been validated (Pantazis et al., 2019), our estimates of the proportion of undiagnosed infections in 2014–2018 depend further on the trends in the number of infections in each year as shown in Equation 2. The main analysis is based on trends in HIV infections in Amsterdam MSM and heterosexuals that were estimated with the ECDC HIV Modelling Tool for Amsterdam. The ECDC estimates account for late diagnoses, but aggregate over region of birth. Recognizing this limitation, in sensitivity analyses we used instead trends in directly observed Amsterdam diagnoses, which apply to each Amsterdam risk group but do not account for confounding due to late diagnoses. In the sensitivity analysis, we estimate that 14% [13–17%] of infections in Amsterdam MSM in 2014–2018 remained undiagnosed, and 34% [28–41%] in Amsterdam heterosexuals. Further details are presented in Appendix 1, Section 3.3–3.5.
More than 1800 distinct transmission chains among Amsterdam residents
We next adopted viral phylogenetic methods to understand how the diagnosed Amsterdam infections since 2014 are distributed across Amsterdam’s HIV transmission networks. A total 378 of the 516 (73%) individuals had a pol sequence available, of whom 341 were of the major subtypes or recombinant forms that are circulating in Amsterdam (B, 01AE, 02AG, C, D, G, A1 and 06 cpx). 37 individuals were excluded from further analysis as their subtype identification was inconclusive, or they were associated with other subtypes or recombinant forms with fewer than 50 sequences in Amsterdam. Appendix 1—table 1 summarises the characteristics of the study population, and those with a sequence available. We reconstructed viral phylogenies using the HIV sequence data from these individuals combined with viral sequences from 3647 Amsterdam diagnoses with estimated infection prior to 2014, 6087 diagnosed individuals from the Netherlands outside Amsterdam, and 14,222 viral sequences from outside the Netherlands that were genetically closest to those circulating in the Netherlands (Appendix 1—figures 4–25). Key statistics based on the bootstrap analysis are reported in Appendix 1—Tables 2 and 3.
We identified across the major HIV1 subtypes and circulating recombinant forms 1829 distinct viral phylogenetic subgraphs that comprised at least one diagnosed Amsterdam infection prior to 2014, which we refer to as the phylogenetically observed preexisting transmission chains (Figure 3 and Appendix 1—figure 26). There were 1253 preexisting chains in MSM, of which 949 (76%) had all members virally suppressed as of 2014, and of those 906 (95%) had no new member in 2014–2018. The remaining 5% of subgraphs likely grew from unsuppressed index individuals that did not have an HIV sequence sampled. In heterosexuals, there were 576 preexisting chains, of which 401 (70%) had all members virally suppressed as of 2014, and of those 391 (98%) had no new member in 2014–2018. The proportion of unsuppressed subgraphs in Amsterdam heterosexuals was indeed statistically significantly lower than in Amsterdam MSM, but not strongly so (pvalue 0.02, onesided chisquare test). To summarise, transmission appears to have stopped since 2014 in almost all phylogenetically observed preexisting chains that had all their observed members suppressed by 2014.
Growth of the phylogenetically observed parts of citylevel transmission chains
Considering growth, 89 (7%) of the 1253 phylogenetically observed preexisting chains in Amsterdam MSM had at least one new member diagnosed in 2014–2018, and 114 chains emerged (Table 2 and Figure 3). In Amsterdam heterosexuals, 15 (3%) of the 576 phylogenetically observed preexisting chains had at least one new member diagnosed in 2014–2018, and 26 chains emerged. The emerging chains thus outnumbered the growing preexisting chains in both Amsterdam MSM and heterosexuals. However, the observed phylogenetic data are challenging to interpret directly because larger proportions of recent infections remain undiagnosed, approximately half of diagnosed individuals did not have a sequence sampled, and small chains are more likely to remain entirely unobserved (see Materials and methods).
Emerging transmission chains outnumber preexisting, growing transmission chains
We next used a Bayesian branching process growth model to predict the size and growth of the actual transmission chains (see Materials and methods and Appendix 1, Section 6). Model fit to the observed growth distributions was very good (Appendix 1—figure 27). We estimate that there are substantially more emerging chains in Amsterdam since 2014 than phylogenetically observed, 172 [154195] in MSM and 58 [4283] in heterosexuals, reflecting that emergent chains have a high probability to be entirely unobserved when growth is below the epidemic reproduction threshold of one (Table 2). Thus, the estimated actual, emerging chains outnumber the growing preexisting chains in both Amsterdam MSM and heterosexuals more strongly than the phylogenetic data suggest.
In terms of proportions, an estimated 61% [55–67%] of the growing chains among Amsterdam MSM were emerging, and 69% [56–81%] of the growing chains among Amsterdam heterosexuals. We estimate further that 47% [39–55%] of the estimated infections among Amsterdam MSM in 2014–2018 were in emerging chains, and 61% [45–77%] of the estimated infections among Amsterdam heterosexuals (Table 3). Thus, on average the preexisting chains contributed more new cases in 2014–2018 to Amsterdam infections than the emerging chains.
Proportion of locally preventable infections
From the emerging transmission chains, we can directly estimate the proportion of Amsterdam infections since 2014 that had an Amsterdam source (see Materials and methods). We interpret these infections as locally preventable, because they are within the reach of the HIV prevention efforts in Amsterdam. In Amsterdam MSM, an estimated 67% [60–74%] of infections in 2014–2018 were locally preventable, with little variation by region of birth (Figure 4, proportions next to error bars). In Amsterdam heterosexuals, an estimated 56% [41–70%] of infections in 2014–2018 were locally preventable, with more variation by region of birth, though we caution that the underlying sample sizes were small.
We next multiplied the proportions of locally preventable infections with the estimated number of infections in 2014–2018 in each of the 9 Amsterdam risk groups to obtain estimates of the absolute number of locally preventable infections in Amsterdam in 2014–2018 in each risk group (Figure 4, yaxis). Of the estimated 415 [316542] locally preventable Amsterdam infections in 2014–2018, an estimated 178 [129243] (43% [37–49%]) were in foreignborn MSM, 171 [124231] (41% [35–47%]) in Dutchborn MSM, 45 [2482] (10% [6–18%]) in foreignborn heterosexuals, and 21 [1039] (5% [2–9%]) in Dutchborn heterosexuals.
Discussion
More than 300 cities have by the end of 2021 signed the FastTrack Cities Paris Declaration and committed to end the AIDS epidemic by 2030, addressing disparities in access to basic health and social services, social justice and economic opportunities. The city of Amsterdam reached the UNAIDS Fasttrack Cities 959595 targets before the onset of the COVID19 pandemic, and has seen a decade of declines in citylevel HIV diagnoses. Here, we characterised the number, size and growth of HIV transmission chains among Amsterdam residents, and quantified the further potential of preventing HIV infection at city level. It is important to recognize that through the analyses conducted here, the exact location of infection events cannot be identified. Rather, the available location data enable us to identify groups of Amsterdam residents with phylogenetically distinct HIV, which are the inferential basis for estimating the number, size, and growth of the actual unobserved transmission chains among Amsterdam residents. Regardless of the exact infection location, Amsterdam residents live in Amsterdam, and are thus within reach of Amsterdam public health and local prevention interventions.
We can structure our insights in four themes. First, when focusing on the denominator of recent infections that are estimated to have occurred in the 5year period 2014–2018, the proportions of individuals that remained undiagnosed by early 2019 were high and variable, between 9% and 20% in (selfidentified) Amsterdam MSM risk groups, and between 28% and 57% in Amsterdam heterosexual risk groups. These results underscore that strategies aimed at raising awareness of HIV infection, providing easy access to checking symptoms of early HIV infection, encouraging frequent testing, PrEP provision, addressing fears of a positive test and reducing stigma are vital to break the forefront of ongoing HIV transmission chains (https://hebikhiv.nl/en/; Dijkstra et al., 2017; Heijman et al., 2009; Burns et al., 2017; Myers et al., 2018). The estimated times to diagnosis document substantial disparities across risk groups in entering HIV care in Amsterdam, and separate efforts have characterised individuals with late diagnoses (Op de Coul et al., 2016; Bil et al., 2019; Slurink et al., 2021). We explored the impact of assumptions on incidence trends to the undiagnosed estimates and found some sensitivities (Appendix 1, Section 3.3), although estimates were all very similar as long as the assumed incidence trends reflected available data. Further sensitivity analyses are reported in Appendix 1 Section 3.4–3.5. We further validated the timetodiagnosis estimates by comparing the estimated proportion of recent HIV infections (≤6 months) with those estimated in an independent study in Amsterdam using avidity assays (Slurink et al., 2021), and found them to be similar (Appendix 1—figure 28). The main limitation of our biomarker approach is thus that at present we cannot account for time trends in timetodiagnosis.
Second, we documented the growth of Amsterdam HIV transmission chains in which all phylogenetically observed members were virally suppressed by 2014. We find that regardless of risk group, almost all such virally suppressed chains did not grow in the sense that no new infections were phylogenetically observed. These results are unsurprising and mirror the established relationship that treatment for HIV infection, which results in undetectable viral load equals untransmittable virus (Rodger et al., 2019).
Third, we initially speculated that with a decade of declining HIV diagnoses in Amsterdam, those infections that still occur might be concentrated in newly seeded, emerging transmission chains. It is challenging to interpret the directly observed data because high proportions of individuals remain undiagnosed and/or are not sequenced, and emerging chains are more likely to be completely undetected. We thus used statistical growth models accounting for unsampled cases, and we estimate in contrast to our initial speculations that 53% of new Amsterdam MSM infections in 2014–2018 grew from chains that existed prior to 2014, and 39% of new Amsterdam heterosexual infections. Following up and tracing back from known transmission chains is easier than discovering emerging chains, and so the many new infections that originate in existing chains have particularly high prevention potential (Oster et al., 2018; Little et al., 2021; Dennis et al., 2021).
Fourth, we quantified the locally preventable infections among Amsterdam residents in 2014–2018, defined as the infections in Amsterdam residents in 2014–2018 who are estimated to have as source another Amsterdam resident. Using the virus’ genetic code as an objective marker into infection events, we estimate that regardless of declining diagnoses and incidence, the majority of infections in Amsterdam residents in 2014–2018 remained locally preventable in all risk groups investigated. The statistical strength of evidence into this finding was strong for Amsterdam MSM (all 95% credible intervals for the proportion of locally preventable infections were above 50%), but more moderate for Amsterdam heterosexuals (wider credible intervals including 50%), reflecting that relatively few infections in Amsterdam heterosexuals in 2014–2018 were observed with a viral sequence by early 2019 due to frequent late diagnosis and incomplete viral sequencing. These findings are consistent with data from clinic surveys in migrants across Europe (AlvarezDel Arco et al., 2017), which indicated similar levels of incountry HIV acquisition post migration of 51% in heterosexual women and 58% in heterosexual men.
In summary, our data from 2014 to 2018 indicates considerable potential to prevent HIV infections among Amsterdam residents through citylevel interventions, even in the context of substantial improvements in curbing the number of diagnoses and infections in Amsterdam over the past 10 years. Within the similarities in demographics, HIV burden, access to care, and prevention approaches between Amsterdam and many cities in Western Europe and worldwide, our conclusions are relevant to the wider UNAIDS FastTrack cities, and provide evidencebased support for locally targeted combination HIV prevention interventions in metropolitan areas. COVID19 has severely disrupted prevention messaging, testing and PrEP services and early pathways to care, making innovative and targeted HIV prevention approaches all the more important.
Appendix 1
Supplementary tables and figures
Data were obtained from Stichting HIV Monitoring, collected as part of the open ATHENA cohort of all patients in care in the Netherlands. The dataset includes includes the municipal health service (GGD) region of the patient at the time of registration to the cohort, or at their most recent registration update, based on their the postcode of their place of residency (PC4 code) either at time of registration to the cohort, or at their most recent registration update. PC4 is the most granular administrative city level in Amsterdam, with 12,000 residents on average per PC4 area and a number of residents ranging from 10 to 26,263. Appendix 1—figure 29 shows a map of the 81 Amsterdam PC4 areas. Amsterdam patients were identified as patients with a first or more recent registration in the Amsterdam GGD region.
The ATHENA database version was closed on March 31st 2019 (Boender et al., 2018). We obtained data for 19,204 patients from the Netherlands, with 7,773 of these having an Amsterdam postcode at first or last registration.
We leverage baseline data recorded at registration on year of birth, country of birth, mode of transmission, date of death (if patient has died), date of AIDS diagnosis, date of ART start, date of last HIV negative test and date of first HIV positive test.
We also obtained datasets from the ATHENA cohort of partial HIV1 polymerase (pol) sequences of Amsterdam patients, including date of sample, and of clinical data collected longitudinally of viral load measurements and CD4 counts.
In the study, we focus on infections estimated to have been acquired between 20142018 (see Section 3.1). We also consider MSM and heterosexual transmission groups only, since less than 2% of infections were in other transmission groups. Table 1 summarises patient characteristics for all Amsterdam individuals estimated to have been infected with HIV between 20142018, and those who have a viral sequence available. The cohort is predominantly male (92%), and MSM (86%). 41% of individuals were between 2534 years old at their estimated time of infection. Less than 3% of individuals were estimated to have been infected aged 60 or older. 41% of individuals infected between 20142018 were born in the Netherlands, followed by 13% from South America and the Caribbean, which are predominantly individuals from Suriname and the Dutch Caribbean. Appendix 1—table 1 also reports characteristics of patients with a viral sequence available. Empirically comparing only those with a sequence with the complete Amsterdam cohort of all individuals infected between 20142018, indicates that those patients with a sequence are representative of the whole diagnosed population.
For each transmission group, we define each strata by place of birth, according to the main migrant populations in Amsterdam. For Amsterdam MSM and heterosexuals, respectively, these are,
Since we focus on infections acquired between 20142018, we define the study start and end time by,
Estimating HIV infection dates and undiagnosed infections
In this section, we first describe how we fit a model to clinical biomarker data to estimate the time from infection to diagnosis, and consequently the date of infection. Next, we describe how we fit a model to the posterior median estimates of the timetodiagnosis, to estimate the proportion of Amsterdam infections which remained undiagnosed by the close of the study.
Estimating HIV infection dates
3.1.1 Data
We define the complete cohort of patients registered in Amsterdam by $\mathcal{N}$. We first follow methods in Pantazis et al., 2019 to estimate time from infection to diagnosis for individual $i\in \mathcal{N}$ by w_{i}. We use an indicator ${R}_{i}$ to denote transmission risk group of each individual, where,
We utilise clinical biomarker data for each patient on CD4 counts and viral loads, measured after diagnosis but before onset of AIDS or start of ART. As a caveat, we keep viral load measurements within one week of ART start, and CD4 counts within one month of ART start. This choice is supported by the fact that ART takes time to act. We denote CD4 counts by ${y}^{c}$, and viral loads by ${y}^{r}$, and encapsulate measurements for all $i$ individuals in a vector,
Each measurement is collected at an (unknown) time since infection,
We have clinical data prior to AIDS diagnosis or start of ART for 6,879 (88%) of patients. For the remaining 12% we are unable to estimate the time of infection. We then denote the time between diagnosis and each biomarker measurement by,
From this, we can then express the time from infection to measurement date in Equation S5 in terms of the estimated date of infection, w_{i}, and the time between diagnosis and each biomarker measurement as follows,
3.1.2 Model
We then use a bivariate linear mixed model for the joint distribution of the two biomarkers over time and denote their distribution by,
for the joint distribution of the two biomarkers over time. We place a uniform prior on w_{i} over (0,u_{i}), where u_{i} is the interval between time at risk for each individual and HIV diagnosis. We take the risk onset date to be the maximum of the time between the last negative and test and diagnosis, and the time between the individual turning 15 years of age and diagnosis.
The posterior distribution of w_{i} is as follows:
from which we estimate the median time from infection to diagnosis for w_{i}, and 95% credible intervals.
3.1.3 Estimated quantities
Then, if ${T}_{i}^{\text{diagnosis}}$ is the reported diagnosis date for individual $i$, we estimate their infection date, denoted by ${T}_{i}^{\text{infection}}$, with,
Appendix 1—figure 1 shows the distribution of individual median estimates for timetodiagnosis by the risk groups given by Equation (S1a) and (S1b) for MSM and heterosexuals, respectively. Figure 1 plots the diagnosis date against the estimated infection date for all individuals diagnosed between 2014 and the May 2019. 95% credible intervals indicate uncertainty around individual level estimates from the model. We note that treatment guidelines changed in 2015 from starting ART based on CD4 count, which is measured every 6 months, to immediate ART initiation. Since we only consider biomarker measurements taken prior to ART start, as a result we have fewer biomarker measurements per individual for PLHIV diagnosed since 2015, which leads to larger uncertainty around date of infection.
Estimating the proportion of infections in 20142018 that were undiagnosed by May 2019
3.2.1 Data
We next sought to estimate the proportion of infections in 20142018 that remained undiagnosed by May 2019. The patient data is rightcensored, so many recent infections may yet be undiagnosed in the patient data set. For this reason, we considered the subset of Amsterdam diagnoses that we estimated to have been acquired between 2010 and the end of 2012, since most infections acquired in this interval would have been diagnosed by early 2019 given typical disease progression (Pantaleo et al., 1993). We first define an indicator ${U}_{i}(\tau )$, which is a function of a given year $\tau $, in which,
We then define the synthetic cohort of infections in 20102012 by S12.
We then consider individuals $k\in {\mathcal{C}}^{\mathrm{M}\mathrm{S}\mathrm{M}}$ and $l\in {\mathcal{C}}^{\mathrm{H}\mathrm{S}\mathrm{X}}$. For each transmission group, we defined each strata by place of birth given in Equation (S1a) and (S1b). Appendix 1—table 6 shows the characteristics of patients used to fit the model.
3.2.2 Hierarchical model
We fit a hierarchical Weibull model to the estimated times from infection to diagnosis (w_{i}) in Stan, for MSM and heterosexuals separately. For MSM, we denote the function $j(k)$ which takes as value the place of birth of individual $k$, as defined in equation (S1a). We estimate ethnicityspecific shape and scale parameters ${\kappa}_{j(k)\in \mathcal{M}}$ and ${\lambda}_{j(k)\in \mathcal{M}}$ which can borrow information from each other through a hierarchical prior distribution. For convenience when choosing priors, we reparameterised the Weibull distribution in terms of its median and 80% quantile ($\mathrm{log}{\chi}_{j(k)}^{50}$, $\mathrm{log}{\chi}_{j(k)}^{80}\mathrm{log}{\chi}_{j(k)}^{50}$). The quantile function for the Weibull distribution is given by Equation (S13).
We then express the parameters of the Weibull distribution as follows:
and then specify the Weibull model and its prior distribution as follows,
where $Q(0.5)$ and $Q(0.8)$ are the empirical quantiles from the estimated times to diagnosis, for each transmission group.
The joint posterior distribution of model Equation S15a was estimated in rstan with Stan version 2.21 using three Hamiltonian Monte Carlo chains with 2000 samples each including a warmup of 500 samples. The models mixed well, with no correlation between parameters, and had at most one divergence (Appendix 1—figures 30–33). The smallest effective sample size across parameters for the MSM model was 1461 and 1059 for the heterosexual model.
3.2.3 Estimated quantities
We then estimate, for a given year $y\in \mathcal{Y}=2014,\dots ,2018$ of infection, the probability of an MSM individual not being diagnosed by 2019, given their place of birth, as follows:
To account for changes in infection rate over the study period, we generate weights ${\omega}_{y}$ for each year using the distribution of total infections among MSM by year estimated by the ECDC modelling tool (Stockholm: European Centre for Disease Prevention and Control, 2017),
where $N}_{y}^{InfECDC$ are the estimated total infections acquired among MSM in Amsterdam in year $y$. We then calculate the average probability that an individual infected in 20142018 remained undiagnosed by 2019 with
We then denote the number of diagnosed Amsterdam MSM infected in 20142018 and born in world region $j(k)$ by ${N}_{j(k)}^{D}$. Finally, we can estimate the total number of infections in 20142018 in Amsterdam MSM through,
and obtain numerical estimates of ${N}^{Inf(k)}$ via the Monte Carlo samples from the joint posterior and the calculated proportions ${\delta}_{j(k)}^{(k)}$ of undiagnosed infections. Poster median estimates and 95% credible intervals of (S16)(S19) are obtained by summarising the set of Monte Carlo samples after the transformations. The model for heterosexuals is formulated analogously.
Appendix 1—figure 34 shows the estimated Weibull distributions for the time to diagnoses, stratified by MSM and heterosexuals and place of birth. The empirical cumulative distribution functions (CDFs) of the times to diagnoses are for comparison shown as step functions (black). The fits were good, with the empirical CDFs generally lying within the 95% posterior intervals of the fitted CDFs for all risk groups. Appendix 1—figure 35 summarises the total number of estimated infections acquired between 2014 and 2018, the subset of those that were diagnosed by 2019, and the subset of those which have a viral sequence available. The sequence sampling fraction is shown above each bar.
Appendix 1—figure 35 shows the total estimated number of infections acquired between 2014 and 2018 among Dutchborn and foreignborn individuals in Amsterdam, by risk group, alongside the number of infections by date of diagnosis, and the number and proportion of those with a sequence available.
Sensitivity analysis: Estimating the average undiagnosed from infections acquired in 20142018 with weights
Our approach to estimating the proportion undiagnosed fits the model to the times to diagnoses for infections acquired in 20102012, and calculates a weighted average to account for a change in incidence rates over the study period 20142018 using estimated infection rates. We compared this approach to where we assume constant rates of infection in 20142018 (no weights), and where we weight by diagnosis rates in 20142018.
Appendix 1—table 7 compares the estimates for the undiagnosed population with the three approaches. There is evidence for declining incidence among MSM in both the diagnosis and estimated infection rates, so assuming constant weights may overestimate the proportion undiagnosed. However, whilst diagnosis rates appear to have also declined over time, estimated infection rates were relatively stable between 2014 and 2018, which leads to similar estimates of undiagnosed with equal weights.
Sensitivity analysis: Using only data on last negative tests
We carried out several sensitivity analyses to explore the impact of alternative approaches to estimating infection dates on the proportion of Amsterdam infections in 20142019 that are estimated to have remained undiagnosed by 2019. We first considered estimating the date of infection as the midpoint between last negative HIV test and first positive HIV test, where available. We therefore only considered patients with a last negative HIV test to fit the model for the timetodiagnosis distributions. In contrast, the approach taken to estimating the infection date in the main analysis considers the time at risk to be either the time since last negative HIV test, or the time since the patient was 15 years old where a last negative test is not available. Based on the midpoint estimates, each individual was classified to have been infected before or after 2014 in analogy to Equation (S11). We had 266 patients across the synthetic cohorts defined by Equation (S12), compared with 632 when using the estimated date of infection. This is reflective of the fact many individuals do not have a reported last negative test date.
Appendix 1—figure 36 compares the estimated proportion of undiagnosed Amsterdam infections obtained as in the main analysis from all biomarker data from all individuals (Appendix 1—figure 36A) to that obtained when using only midpoint estimates from seroconverters (Appendix 1—figure 36B). Estimates are compared by year of infection for each risk group. When using data only from the seroconverters, the estimated proportions of undiagnosed individuals are much smaller. This is likely driven by the fact we excluded patients without a last negative test, who may have typically had longer estimated times to diagnosis. This was also observed by Ratmann et al., 2016. There are also considerably fewer data points, particularly among heterosexuals, resulting in elevated uncertainty in these estimates. Appendix 1—figure 37 shows our estimates for the total number of infected individuals in Amsterdam. Clearly, whilst the estimates are more conservative where we use midpoint estimates than we find using the estimated times to diagnosis (see Appendix 1—figure 37), we still find a substantial proportion of individuals to be undiagnosed by 2019.
Sensitivity analysis: Estimates from ECDC modelling tool
Third, we considered utilising estimates for Amsterdam from the ECDC HIV modelling tool (Stockholm: European Centre for Disease Prevention and Control, 2017). We used model estimates of the estimated infections per year among MSM and heterosexuals, respectively, and the estimated proportion of infections diagnosed within $1,\mathrm{\dots},15$ years. Estimates were only available by transmission group, but not place of birth. Estimates also also only available by year, so we estimate the proportion of infections acquired between 2014 and 2018 undiagnosed by the end of 2019.
If ${I}_{y}$ are the estimated number of individuals infected in year $y$, and ${\delta}_{y,z}$ is the probability of an individual infected in year $y$ being diagnosed in year $z\in \{y,\cdots ,2018\}$. Then, the proportion of individuals infected between 2014 and 2018 who are undiagnosed by the end of 2019, are estimated by:
where ${\omega}_{y}$ are weights according to the proportion of individuals infected in year $y$:
Estimates of undiagnosed were similar for MSM (28.7%) of infections were undiagnosed by 2019, but considerably higher among heterosexuals compared to the estimates from the Weibull model (62.2%).
Viral phylogenetic analyses
Multiple sequence alignment
We used partial HIV pol sequences from Amsterdam and the rest of the Netherlands from the ATHENA cohort and aligned these to the reference genome HXB2 (Ratner et al., 1985) using Virulign (Libin et al., 2019). Sequences which failed to align were aligned globally using Mafft version 7 (Katoh and Standley, 2013). Nucleotide positions which were missing for most sequences, or not in the reference sequence HXB2 were removed. Known antiretroviral resistant mutations were masked using the R package big.phylo (Ratmann, 2019). The final alignment was 1302nt in length. We carried out some manual curation of the alignment, removing all gaps and resolving sequences which did not align correctly. We then classified sequences by subtype using COMET v2.3 (Struck et al., 2014) and verified any which were uncertain with REGA v3.0 (PinedaPeña et al., 2013).
We downloaded 82,708 background sequences from the Los Alamos HIV1 sequence database on 27th February 2020, specifying fragments in the POL region longer than 1300nt. We then used the Basic Local Alignment Search tool (BLAST, https://blast.ncbi.nlm.nih.gov/Blast.cgi) to identify the top 20 closest background sequences to each of the Dutch sequences, which we kept and aligned to the Dutch sequences using the HXB2 reference sequence. We created alignments by subtype, excluding the least common subtypes with fewer than 50 Dutch and background sequences.
Reconstruction of city transmission chains
We used FastTree v2.1.8 to reconstruct phylogenetic trees for each subtype (Price et al., 2010). We then assigned labels to each sequence. Sequences from Amsterdam were labelled according to their risk group, sequences from the rest of the Netherlands (excluding Amsterdam) were labelled as such, and background sequences were labelled according to the country they originated from. The geographic regions for the MSM trees were,
and similarly for heterosexual trees,
We then used phyloscanner v1.8.0 (Wymant et al., 2018) to assign one of the state labels to each viral lineage in the reconstructed phylogenies. Appendix 1—figures 4–14 show the annotated phylogenetic trees for all major subtypes and circulating recombinant forms that circulate in Amsterdam. From the annotated trees, we extracted the viral phylogenetic subgraphs that were assigned to Amsterdam individuals. We assume that viral phylogenetics correctly assigns individuals into subgraphs, which we interpret as the observed parts of distinct citylevel transmission chains.
Branching process model of partially observed, growing transmission chains
In this section, we describe how we build on existing methods to model the growth of the existing and newly introduced transmission chains. Utilising the phylogenetic subgraph data described in Section 4.2, we show how we can model their growth from a point in time, rather than from the first introduction, by utilising the number of infectious cases in the subgraph at a given point in time, and how many new cases were generated from those infectious cases. We also describe the model likelihood of new transmission chains which emerged.
We model the growth of transmission chains using putative infection dates, estimated in 3.1. For individuals with no estimate for date of infection, due to missing clinical biomarker data after diagnosis, we subtracted the posterior median timetodiagnosis for an individual estimated using the model described in equation (S15a) in the corresponding migrant group, defined by Equation (S1a) and (S1b).
Probability that $m$ index cases collectively generate $i$ new infections
We model the spread of HIV transmission chains that are characterised by reproduction numbers below one, through branching processes characterised by Negative Binomial offspring distributions (Blumberg et al., 2014). A central component of branching process theory is the probability generating function $Q(s)={\sum}_{i=0}^{\mathrm{\infty}}{q}_{i}{s}^{i}$, where q_{i} is the probability that one individual generates $i$ new infections in one generation, and q_{0} is the probability that one individual generates no further infections. For our purposes, we will use two fundamental formulae. First, the $k$ th derivative of $Q$ is
and so the probability q_{k} is recovered through
Second, the $k$ th coefficient of ${Q}^{2}(s)$ is
which is the probability that two individuals collectively generate $k$ new infections. Thus, the probability that $m$ index cases collectively generate $i$ new infections is given by the $i$ th coefficient of ${Q}^{m}(s)$. We denote this probability by
We consider a Negative Binomial offspring distribution, parameterised in terms of the mean μ and dispersion parameter $\varphi $, so that its variance is given by $\mu (1+\mu /\varphi )$. Thus, as $\varphi $ tends to zero, $\mu /\varphi $ increases, and so does the variance to mean ratio $(1+\mu /\varphi )$. This means that the Negative Binomial can simultaneously model average reproduction numbers as well as additional heterogeneity in the number of new infections per generation, that goes beyond the variation described by a Poisson offspring distribution. The probability generating function of the Negative Binomial offspring distribution is
Thus, we have that the probability that $m$ index cases generate $i$ new infections is
where $m=1,2,\mathrm{\dots}$ are fixed, and the number of new infections takes on values $i=0,1,\mathrm{\dots}$. It is helpful to note that equation (S29a)(S29d) has an intuitive interpretation, it is a Negative Binomial with mean $\mu m$ and dispersion parameter $\varphi m$, which we denote by
where $m=1,2,\mathrm{\dots}$ are fixed, and the number of new infections takes on values $i=0,1,\mathrm{\dots}$.
Equivalently, we can express the probability that $m$ index cases result in a total number of $n$ cases through
or more simply
where $m=1,2,\mathrm{\dots}$ are fixed, and the number of total cases are $n=m,m+1,\mathrm{\dots}$.
Probability that $m$ index cases result in a transmission chain with $i$ new infections
Transmission chains require that infections occur in a particular order, while in contrast equation (S29a)(S29d) do not impose in what generation how many infections occur. For example, with one index case $m=1$ and a total size $n$, Equation S31 quantifies the probability that $n1$ new infections occur, but there is no constraint that the index case generates at least one new infection in the next generation.
Dwass, 1969 derived the correction factor, and the probability that a transmission chain with $m$ index cases has $i$ new infections, or equivalently $n$ cases, is
where $m=1,2,\mathrm{\dots}$, $i=0,1,\mathrm{\dots}$, and $n=m,m+1,\mathrm{\dots}$.
Probability that $m$ index cases result in subgraphs with $i$ sampled, new infections
In practice, only a subset of new infections are captured in viral phylogenies because only a subset of infections are diagnosed, and of those only a subset have virus sequenced. We make two assumptions. First, infections are missing independently of each other with the same probability $1\rho $, so $\rho $ is the sampling probability of infections. Second, uncertainty in $\rho $ can be quantified within several percentage points through surveillance data and/or modelling; we use this assumption later to ensure that the remaining parameters are statistically identifiable.
Then, the probability of observing individuals in a subgraph that has known index cases is
where $m=1,2,\mathrm{\dots}$, $i=0,1,\mathrm{\dots}$, and $c(km,\mu ,\varphi )$ is from Equation S33a. It is possible that an observed subgraph has $m$ index cases by a particular study start time ${\psi}_{\text{start}}$ and no new infections between ${\psi}_{\text{start}}$ and ${\psi}_{\text{end}}$, as defined in , Equations S2a and S2b, and the probability of observing one such subgraph is ${c}_{\text{obs}}(0m,\mu ,\varphi ,\rho )$.
Probability that emergent subgraphs have $n$ sampled cases
Some observed subgraphs are emergent in the sense that they consist of individuals that were all diagnosed after the study start time $T$. In this case, Equation (S34) cannot be used because it assumes that subgraphs contain at least one index case prior to the study start time $T$. We assume that emergent subgraphs are seeded by one index case, which for example ignores the possibility that sexual partners infected each other and then moved to Amsterdam, and seeded a new transmission chain in Amsterdam. The probability of observing an emergent transmission chain of size $n$ is given by
where unlike Equation (S34), $n=1,2,\mathrm{\dots}$ may include in the count the index case (if it is sampled), and $\stackrel{~}{c}(zm=1,\mu ,\varphi )$ is from Equation (S33b). The denominator corrects for the event that the index case and all new infections in an emergent chain are unsampled, which is possible with nonzero probability, but always unobserved.
Likelihood of the growth distribution of phylogenetic subgraphs
We now describe the likelihood of the growth distribution of viral phylogenetic subgraphs, which throughout we identify as the observed parts of distinct citylevel transmission chains. In what follows, for brevity, we only consider one transmission group and omit reference to this transmission group. All equations are analogous for the other transmission group.
We start with the viral phylogenetic subgraphs in the viral phylogeny of one subtype, and omit for brevity also any indication of that subtype. The data consist of a twodimensional array $\mathbf{x}$, where ${x}_{mi}$ denotes the number of subgraphs that had $m$ index cases at the study start time ${\psi}_{\text{start}}$ and $i$ sampled, new infections by the study end time ${\psi}_{\text{end}}$. Here, $m=1,\mathrm{\dots},M$ and $i=0,\mathrm{\dots},I$ where $M$ denotes the largest number of index cases observed, and $I$ denotes the largest number of new infections observed. In addition, the data consist of a onedimensional array $\stackrel{~}{x}$, where ${\stackrel{~}{x}}_{n}$ denotes the number of emergent subgraphs that have $n$ sampled cases during the study period. Here, $n=1,\mathrm{\dots},N$, because at least one case needs to be sampled in order to observe the corresponding subgraph.
Then, we associate the following loglikelihood to the growth distributions of preexisting and emergent subgraphs,
The loglikelihood thus involves infinite sums through equations (S34) and (S35). We approximated these by summing up to the $10*I*({N}_{D}/{N}_{S})$ th term, where ${N}_{D}$ are the number of diagnosed individuals and ${N}_{S}$ are the number of sequenced individuals, so $I*({N}_{D}/{N}_{S})$ is the expected number of individuals in the transmission chain that corresponds to the largest observed subgraph. In applying this loglikelihood, we assume that (1) all transmission chains have reached their final size by the end of the study period, i.e. that they are complete; (2) that all emergent transmission chains have one index case; (3) that each case has an equal and independent probability of being sampled.
Next we consider the joint likelihood that arises from consideration of viral subgraphs of the same transmission group (e.g. MSM or heterosexual individuals) across all HIV subtypes or circulating recombinant forms. Since the number of subgraphs and new cases acquired between 20142018 are very small for some subtypes, we aggregate the subgraph size distributions for nonB subtypes. We index subtypes B and nonB by $s=1,\mathrm{\dots},S$, where $S=2$, and denote the corresponding subgraph growth distributions by ${\mathbf{x}}_{s}$, and $\stackrel{~}{x}}_{s$. Then, we model the loglikelihood of all the data for one transmission group through
where the ${\mu}_{s}$, ${\varphi}_{s}$, ${\rho}_{s}$ are specific to the corresponding transmission group and subtype.
Bayesian inference
We estimate citylevel transmission dynamics, the growth distribution of transmission chains, and the proportion of locally acquired infections through the loglikelihood (Equation S37) of phylogenetically observed subgraphs.
Preliminaries
6.1.1 Number of index cases
For each individual $i$ in the cohort $\mathcal{N}$, if r_{i} is their last viral load measurement taken before 2014, we define them to be not virally suppressed by 2014 through,
Then, for each observed subgraph $j$ where $(j=1,\mathrm{\dots},A)$, m_{j} are the observed index cases, we count the number of individuals infected by, but who were not virally suppressed, by the start of 2014. For example for MSM, if $\mathcal{C}}^{\mathrm{M}\mathrm{S}\mathrm{M}$ is the subset of MSM in Amsterdam,
the number of observed index cases in subgraph $j$ is,
and ${m}_{j}>0$. We count analogously for heterosexuals. The actual number of index cases ${m}_{j}^{*}\sim \text{NegBinom}({m}_{j},\nu )$, where $\nu $ is the sampling fraction of individuals who were not virally suppressed by 2014. We estimate the true number of index cases under complete sampling ${m}_{j}^{*}$ by,
When ${m}_{j}=0$, estimate ${m}_{j}^{*}$ from the mode of the pmf for the distribution $\text{Binomial}(0;{m}_{j}^{*},\nu )$.
6.1.2 Number of subgraphs with no new infections
For the subgraphs in which no individuals were not virally suppressed by 2014 (i.e. no observed index case), and no observed new case between 20142018, were not included in the subgraph sizes and assumed to have died out.
Hierarchical model
The parameters of the model (Equation S37) are the subtypespecific mean parameters of the offspring distributions, ${\mu}_{1},\mathrm{\dots},{\mu}_{S}$, the dispersion parameters ${\varphi}_{s}$ and the sampling parameter $\rho $. To estimate the ${\mu}_{1},\mathrm{\dots},{\mu}_{S}$, we borrow information across subtypes through a hierarchical prior distribution. We interpret the mean parameters of the offspring distributions as the effective reproduction numbers during the study period for the corresponding subtype. The variancetomean ratio of the Negative Binomial offspring distribution is $1+{\mu}_{s}/{\varphi}_{s}$ and measures the degree of dispersion of the size distribution of the transmission chains. For ease of inference, we reparameterize the dispersion parameter into the variancetomean ratio minus one and also specify a hierarchical prior distribution,
The log posterior density is given by
where the prior densities are specified as follows. For the effective reproduction numbers, we specified the normalnormal twolevel
The hyperprior of the grand mean was centred on the maximum likelihood estimate $\mathrm{log}{\widehat{\mu}}_{\text{MLE}}=\mathrm{log}(11/\overline{x})$, where $\overline{x}$ is the average subgraph size (Blumberg and LloydSmith, 2013). The hyperprior of the grand standard deviation $\sigma $ was specified by considering the differences in the log maximum likelihood estimates $\mathrm{log}{\widehat{\mu}}_{\text{MLE}}$ for each subtype.
For the variancetomean ratio, we specified
where 1 is the rate parameter for the exponential distribution. For the sampling parameter, we specified
where ${N}_{S}$ are the number of sequenced individuals, ${N}_{D}$ are the number diagnosed and $\delta $ are the proportion of undiagnosed individuals.
Numerical inference
The joint posterior distribution was estimated using Stan version 2.21 across three chains, each with 2,000 samples.
The models mixed well; Appendix 1—figures 40 and 41 show the trace plot for the parameter with the smallest effective sample size in each model, which was 1637 for the MSM model and 1622 for the heterosexual model. Appendix 1—figures 42 and 43 shows the pairs plot of parameters for the MSM and heterosexual models, respectively. We note that we did not observe multiplicative nonidentifiabilities (banana shape) between the reproduction rate R0 and the variancetomean ratio, as found by Blumberg and LloydSmith, 2013.
Target quantities derived from fitted model
Estimated number of new cases in transmission chains since 2014
To estimate the actual number of new infections in transmission chains since 2014 from the phylogenetically observed subgraphs, we use the model fits in combination with the size equations (5.2) and (5.2) to obtain the posterior predictive number of new cases in a transmission chain with $m=1,\mathrm{\dots}$ index cases in 2014. For emergent chains, we assume as before that there was one index case since 2014. Specifically, we have
where ${i}^{*}=0,1,\mathrm{\dots}$, and for ease of notation we have dropped the suffixes for different subtypes, transmission groups, or time intervals. We approximate Equation (S47) numerically from $k=1,\mathrm{\dots},K$ Monte Carlo samples ${\mu}^{(k)},{\varphi}^{(k)}$ of the joint posterior distribution by generating samples from
This is easily done since the inference algorithm already tabulates the probabilities $c({i}^{*}m,{\mu}^{(k)},{\varphi}^{(k)})$ for ${i}^{*}=0,1,\mathrm{\dots}$.
Equation (S48) allows us to generate one Monte Carlo sample of the actual growth of all transmission chains. We denote the number of all preexisting phylogenetically observed transmission chains with at least one index case by
and index each of them through ${j}^{x}=1,\mathrm{\dots},\mathbf{x}$. Correspondingly we denote the number of emergent subgraphs by
A certain proportion of emergent transmission chains remains phylogenetically unobserved owing to incomplete sampling. In our model, the probability that an emergent transmission chain is entirely unobserved is given by
as in Equation (S35). Thus, the expected number of emergent transmission chains is $\stackrel{~}{x}/(1{\rho}_{\text{notobs}}^{e})$. We obtain a Monte Carlo estimate of (Equation S51) by plugging in our estimates of the joint posterior density,
Using Equation (S52), we can predict the number of completely unobserved, emergent subgraphs through
where the Negative Binomial is specified in terms of the number of failures and success probabilities, with mean $(\stackrel{~}{x}(1{\rho}_{\text{notobs}}^{e(k)}))/(1{\rho}_{\text{notobs}}^{e(k)})$, so that the mean of $\stackrel{~}{x}+{N}_{\text{notobs}}^{\ast (k)}$ equals as desired $\stackrel{~}{x}/(1{\rho}_{\text{notobs}}^{e(k)})$. We index all emergent transmission chains through $j}^{e}=1,\dots ,\stackrel{~}{x}+{N}_{\text{notobs}}^{\ast (k)$, and note that the number of emergent transmission chains is uncertain.
Then, the actual number of new cases since 2014 in the chain corresponding to the observed, preexisting subgraph ${j}^{x}$ is predicted by sampling ${i}_{{j}^{x}}^{*(k)}\sim c(\cdot {m}_{{j}^{x}},{\mu}^{(k)},{\varphi}^{(k)})$, where ${m}_{{j}^{x}}$ is the number of index cases in the corresponding subgraph. Similarly, the actual number of new cases since 2014 of the chain corresponding to the emerging transmission chain ${j}^{e}$ is predicted by sampling ${i}_{{j}^{e}}^{*(k)}\sim \stackrel{~}{c}(\cdot 1,{\mu}^{(k)},{\varphi}^{(k)})$, and then calculating ${n}_{{j}^{e}}^{*(k)}={i}_{{j}^{e}}^{*(k)}+1$. For the emergent subgraphs, we add 1 since we assume as before that the index case occurred after 2014. Aggregating these sizes, we predict the size distribution of the number of chains with $i$ new cases since 2014 by
where $i=0,1,\mathrm{\dots}$ and 1 is the indicator function that evaluates to 1 if ${i}_{{j}^{x}}^{*(k)}$ is equal to $i$, and otherwise to zero. The median estimate and 95% credible intervals of ${x}_{i}^{*}$ are obtained by drawing posterior samples $(k)$, repeating the calculation of (S54) for each $k$, and then summarising the set of samples.
Appendix 1—figure 38 shows the observed growth of subgraphs in red next to the predicted actual growth of subgraphs in blue (with 95% credible intervals) for Amsterdam MSM and heterosexuals.
Estimated origins of transmission chains between 2014 and 2018
If ${\widehat{\pi}}_{r}=({\widehat{\pi}}_{1},\mathrm{\dots},{\widehat{\pi}}_{R})$ are the proportion of phylogenetically observed subgraphs since 2014 with geographic origin $r$, we can predict the origins of the preexisting and emergent transmission chains, y_{j}, for each Monte Carlo sample through,
We then denote the proportion of predicted emergent transmission chains since 2014 with Amsterdam origin ($A$) by
Appendix 1—table 5 reports the estimated ancestral origins of viral lineages in Amsterdam in the central phylogenetic analysis, and uncertainty estimates obtained from the bootstrapped analyses. The estimated origins predicted from the model are also reported, with 95% credible intervals.
Estimated number of new cases between 2014 and 2018
From Section 7.1, we can use the posterior predictive distribution of the number of new cases for a chain of a given index size Equation S47 to obtain a Monte Carlo prediction of the number of citylevel cases since 2014. This is given by
Estimated ethnicity of new cases between 2014 and 2018
For Amsterdam MSM, we consider geographic regions of birthplace of new cases $r\in \mathcal{M}$. For Amsterdam heterosexuals, we consider georegions $r\in \mathscr{H}$, defined in Equation (S1a) and (S1b). Consider, for MSM, ${\widehat{\zeta}}_{r}$ is a vector of the proportions of diagnosed individuals estimated to have been infected since 2014, born in each geographic region $r\in \mathcal{M}$. We then predict the birthplace regions of the total ${x}^{*(k)}$ new cases between 20142018, ${\mathbf{z}}_{n}$ , for each Monte Carlo sample through,
Proportion of locally acquired infections
The proportion of locally acquired infections since 2014 is defined by the proportion of citylevel cases since 2014 that acquired infections in Amsterdam. In the model, all secondary infections originating from index cases of preexisting transmission chains are infections that were acquired locally. Similarly, all secondary infections originating from index cases of emergent transmission chains are infections that were acquired locally. The index cases of preexisting transmission chains do not contribute to the denominator (S57) because they existed prior to 2014. This leaves the index cases of emergent transmission chains, for which we also have a Monte Carlo estimate,
A proportion of these index cases also acquired infection locally, from other risk groups in Amsterdam. We denote this proportion by $\lambda $ (Equation S56). This allows us to estimate the proportion of locally acquired infections since 2014 through
The median estimate and 95% credible intervals of $\gamma $ are obtained by drawing posterior samples $(k)$, repeating the calculation of Equation S60 for each $k$, and then summarising the set of samples.
Appendix 1—table 8 presents the estimated proportion of locally acquired infections by subtype, and the quantities used to calculate from Equation S60.
We then seek to estimate the proportion of locally acquired infections by ethnicity for each transmission group model separately. Until now, all generated quantities are calculated for each subtype, without specific indexing. To obtain estimates of locally acquired infections by ethnicity, we apply weights to the subtypespecific estimates of locally acquired infections, using the proportion of predicted individuals from each georegion with subtype B or nonB infections. To obtain estimates of locally acquired infections by place of birth, we first generate weights which correspond to the proportion of infected individuals from region of birth $r$ that were infected with subtype/circulating recombinant form $s$,
Then, if ${\gamma}_{s}^{(k)}$ is the proportion of locally acquired infections for subtype $s\in \{\text{B,nonB}\}$, we then calculate the proportion of locally acquired infections by place of birth $r$ as follows:
As before, the median estimate and 95% credible intervals of $\gamma $ are obtained by drawing posterior samples $(k)$, repeating the calculation of Equation S62 for each $k$, and then summarising the set of samples.
Appendix 1—table 9 presents the characteristics of the observed phylogenetically observed subgraphs alongside the model estimates for the parameters of the branching process model, and the proportion of infections estimated to have been acquired through citylevel transmissions by transmission group and subtype.
Appendix 1—figure 39 presents the estimated ${\gamma}_{r}^{(k)}$ from Equation S62 and the composition of subtypes among the predicted total new cases used to estimate locally acquired infections by ethnicity from the subtypespecific estimates.
Assessing model fit
To assess model fit, we perform posterior predictive checks against the phylogenetically observed growth distribution of subgraphs for each transmission group and subtype. To keep notations simple, we drop in what follows the suffixes that indicate dependence on transmission group and subtype.
We previously described the phylogenetically observed growth distribution through the number of preexisting subgraphs with $m$ index cases by 2014 and $i$ new cases since 2014, $x}_{mi$, and the number of emergent subgraphs since 2014 with new cases. To generate posterior predictions and , we index the preexisting, phylogenetically observed subgraphs by . With regard to emergent transmission chains, for the purpose of assessing model fit, we index only the corresponding phylogenetically observed subgraphs, . In analogy to Equation (S47), we use the samplingadjusted size equations (S34) and (S35), which lead to the posterior predictive densities
We use these posterior predictive densities to predict the (observed) growth of the preexisting subgraphs, and the (observed) growth of the emergent subgraphs, and compare these predictions to the observed values. Specifically, given a Monte Carlo sample ${\mu}^{(k)},{\varphi}^{(k)},{\rho}^{(k)}$ from the posterior distribution, we predict the growth of the preexisting, phylogenetically observed subgraph ${j}^{x}$ through
Similarly, we predict the growth of the emergent, phylogenetically observed subgraph ${j}^{e}$ through
Finally, we aggregate ((S65)(S66)) to obtain a posterior prediction of the observed growth distributions,
The posterior predictive check then tests if the observed ${x}_{mi}$, ${\stackrel{~}{x}}_{n}$ lie within the 95% range of the posterior predictive samples $\{{x}_{mi}^{*(k)},k=1,\mathrm{\dots},K\}$ and $\{{\stackrel{~}{x}}_{n}^{*(k)},k=1,\mathrm{\dots},K\}$.
Appendix 1—figure 27 shows the posterior predictive check for the MSM and heterosexual models, respectively, by subtype. 100% of the observed subgraph counts fall within the 95% credible intervals of the predicted subgraph size distribution, indicating very good model fit.
Sensitivity analysis
Observed subgraph size distribution considering infection date
Appendix 1—figure 44 shows how the observed growth distributions of the subgraphs compare when considering all diagnoses with a sequence between 2014 and 2018, and all diagnoses with a sequence estimated to have been infected between 2014 and 2018. There are fewer sequences in total when considering infection date, since some diagnoses since 2014 are estimated to be infections acquired before 2014.
Data availability
Anonymised data are available in the public Github repository https://github.com/alexblenkinsop/locally.acquired.infections, (copy archived at swh:1:rev:02b5e1150acce280def913a0d2e30ec13a880122). These include aggregated timetodiagnosis data, and reconstructed phylogenetic trees labelled by one of the 9 Amsterdam risk groups and year of sequence sample. Statistical information or data for separate research purposes from the ATHENA cohort can be requested by submitting a research proposal (https://www.hivmonitoring.nl/english/research/researchprojects/). HIV physicians can review the data of their own treatment centre and compare these data with the full cohort through an online report builder. For correspondence: hiv.monitoring@amc.uva.nl.
References

Basic local alignment search toolJournal of Molecular Biology 215:403–410.https://doi.org/10.1016/S00222836(05)803602

Inference of R(0) and transmission heterogeneity from the size distribution of stuttering chainsPLOS Computational Biology 9:e1002993.https://doi.org/10.1371/journal.pcbi.1002993

ReportEvaluation of the LondonWide HIV Prevention Programme (LHPP)NIHR School for Public Health Research.

Stan: a probabilistic programming languageJournal of Statistical Software 76:1–32.https://doi.org/10.18637/jss.v076.i01

Is reaching 909090 enough to end AIDS? Lessons from AmsterdamCurrent Opinion in HIV and AIDS 14:455–463.https://doi.org/10.1097/COH.0000000000000586

Increasing awareness and prompting HIV testing: contributions of amsterdam hiv testing week 2016International Journal of STD & AIDS 29:1057–1065.https://doi.org/10.1177/0956462418770014

ConferenceTargeted Screening and Immediate Start of Treatment for Acute HIV Infection Decreases Time between HIV Diagnosis and Viral Suppression among MSM at a Sexual Health Clinic in Amsterdam10th IAS Conference on HIV Science 2019.

The total progeny in a branching process and a related random walkJournal of Applied Probability 6:682–686.https://doi.org/10.2307/3212112

Opting out increases HIV testing in a large sexually transmitted infections outpatient clinicSexually Transmitted Infections 85:249–255.https://doi.org/10.1136/sti.2008.033258

MAFFT multiple sequence alignment software version 7: improvements in performance and usabilityMolecular Biology and Evolution 30:772–780.https://doi.org/10.1093/molbev/mst010

Effective human immunodeficiency virus molecular surveillance requires identification of incident cases of infectionClinical Infectious Diseases 73:842–849.https://doi.org/10.1093/cid/ciab140

Redefining prevention and care: a statusneutral approach to HIVOpen Forum Infectious Diseases 5:ofy097.https://doi.org/10.1093/ofid/ofy097

The immunopathogenesis of human immunodeficiency virus infectionThe New England Journal of Medicine 328:327–335.https://doi.org/10.1056/NEJM199302043280508

Determining the likely place of hiv acquisition for migrants in europe combining subjectspecific information and biomarkers dataStatistical Methods in Medical Research 28:1979–1997.https://doi.org/10.1177/0962280217746437

Sources of HIV infection among men having sex with men and implications for preventionScience Translational Medicine 8:320ra2.https://doi.org/10.1126/scitranslmed.aad1863

ReportMonitoring Report 2020. Human Immunodeficiency Virus (HIV) Infection in the NetherlandsAmsterdam: Stichting HIV Monitoring.

Monitoring recently acquired HIV infections in Amsterdam, the Netherlands: the Attribution of test locationsFrontiers in Reproductive Health 3:1–9.https://doi.org/10.3389/frph.2021.568611

SoftwareECDC HIV modelling tool, version Version 1.3.0European Centre for Disease Prevention and Control.

COMET: adaptive contextbased modeling for ultrafast HIV1 subtype identificationNucleic Acids Research 42:e144.https://doi.org/10.1093/nar/gku739

PHYLOSCANNER: Inferring transmission from within and betweenhost pathogen genetic diversityMolecular Biology and Evolution 35:719–733.https://doi.org/10.1093/molbev/msx304
Decision letter

Joshua T SchifferReviewing Editor; Fred Hutchinson Cancer Research Center, United States

Miles P DavenportSenior Editor; University of New South Wales, Australia

David A RasmussenReviewer; North Carolina State University, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Decision letter after peer review:
Thank you for submitting your article "Estimating the potential to prevent locally acquired HIV infections in a UNAIDS Fast–Track City, Amsterdam" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and David Serwadda as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: David A Rasmussen (Reviewer #2).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
Please clarify certain assumptions about the model regarding the source of local infections, as described mostly by reviewer 2 and a bit by the other 2 reviewers
Reviewer #1 (Recommendations for the authors):
This manuscript describes an analysis of the transmission dynamics of HIV–1 in Amsterdam in the context of a rapid scale–up of prevention measures. Rather than relying on a single source of information, it integrates both virus sequence diversity (by phylogenetic analysis) and statistical modelling to derive estimates of the numbers of unsampled infections and to use clinical biomarkers to estimate dates of infection.
Specifically, the authors reconstructed phylogenetic trees relating large numbers of HIV–1 sequences collected in Amsterdam, outside Amsterdam, and from around the world. The distribution of risk groups and locations was extrapolated by maximum parsimony from the observed sequences at the tips of the tree down towards their common ancestors, and clusters associated with local transmission within Amsterdam were extracted. This process was repeated for 100 replicate trees, which is an important step for accommodating the uncertainty in reconstructing phylogenetic trees.
The authors used a Bayesian method to fit a hierarchical model of the time elapsed between HIV infection and diagnosis ("age" of infection), as predicted by viral load and CD4 measurements, and stratify the model by risk group and location. The resulting estimated dates of infection provide an important context to interpreting the transmission clusters extracted from the phylogenetic trees – many studies have endeavoured to obtain similar information.
The presence of undiagnosed infections that are epidemiologically related to clusters in the study was estimated by fitting a branching process model to the clusters augmented with estimated infection dates. This is a very noteworthy aspect of the analysis, especially because the failure to observe undiagnosed infections is a significant issue that makes it difficult to interpret HIV–1 clusters derived from the comparative analysis of sequence variation. In this case, clusters of recently diagnosed individuals who were infected for a long period of time, as indicated by clinical biomarkers, implies a larger number of unsampled infections related to that particular group.
The authors report that, despite the success of scaling up prevention measures in Amsterdam at limiting the onward transmission of HIV–1, there remains a substantial undiagnosed population associated with a disproportionately greater number of recent infections, and that this largely affects foreign–born men who have sex with men. These methodological advances and unique data resources distinguish this study from an abundant literature on HIV–1 transmission clusters. Most of my comments, provided as recommendations for the authors, about the manuscript concern language and the presentation of results.
– Page 2, "(HIV) is concentrated in metropolitan areas, […] representing 26% of the global HIV burden" – minor point, it would be easier to interpret this statistic if it were accompanied with a rough estimate of what share of the global population is contained in metropolitan areas
– Page 2, is there a URL or reference for the UNAIDS fast–track Cities initiative?
– Page 3, "among infections that are estimated to have occurred since 2014" – is the significance of this cutoff year associated with the fast track initiative? It would be helpful to spell this out for the reader.
– Page 4, I have heard of the Weibull distribution but not the Weibull likelihood – please clarify.
– Page 4, "we used the first available partial HIV–1 polymerase (pol) sequence" – please provide HXB2 nucleotide coordinates
– Page 4, "retrieved from the Los Alamos HIV–1 sequence database" – the LANL database requests authors to provide the URL (https://www.hiv.lanl.gov)
– Page 4, "All sequences were subtyped using Comet and Rega." Please provide version numbers and briefly explain how discordant subtype assignments were resolved between these two algorithms. (I found the explanation in the supplementary material and it's not that much longer – why not explain it in main text?)
– Page 4, did you use a double–precision version of FastTree?
– Page 4, I have a problem with the term "phylogenetic subgraphs". A subgraph is a subset of nodes in the network that is not necessarily a single connected component, whereas the authors are presumably referring to either subtrees (all descendants from an internal node), or subset trees (a contiguous subset of descendants from an internal node). In the HIV molecular epidemiology literature, we would refer to these as subtrees.
– Page 4, "Diagnosed Amsterdam patients in the same subgraph are interpreted as belonging to the same transmission chain" – this interpretation is not very accurate – while individuals in a distinct subtree are more epidemiologically related, they are not necessarily in the same "transmission chain" and it is misleading to make this association. The conventional terminology would be "transmission cluster".
– Page 4, "the estimated state of the root of the subgraph is interpreted as the origin of the transmission chain" – this ancestral state reconstruction is going to be quite uncertain in many cases – was this uncertainty propagated from the replicate bootstrap trees? – it would also be good to emphasize that "origin" refers to location and risk group, and not a single individual
– Figure 1, please check this figure for color accessibility – also the use of thin (low weight) font and narrow lines in the tree may make the figure difficult to read for the vision impaired
– Figure 3, I don't understand why there is so much unused space at the top half of the lower plots – I surmise that the authors intend to preserve the y–axis scales between the top and bottom plots for each of A and B columns, but why not truncate the lower plots? I also think the "zoom in" plots have limited utility and could be moved to supplementary material. Otherwise the plots are a rather effective visualization method.
– Page 12, please provide some details on the branching process model, since it plays an important role in this analysis – although it is well described in supplementary material, it would be helpful to give a little more info in the main text
– Figure 4, for what it's worth, these palettes render well in black and white
– Page 15, "the proportion of undiagnosed individual individuals infected with HIV are" – singular or plural?
– Code availability – this GitHub repository appears to be set to private – please make this available for peer review
– Figures S4–S14 – I'm not a fan of radial layouts for displaying phylogenies with node annotations – also, are these trees rooted? this layout method implies that they are, but I did not see any methodological description for rooting trees. Also, these appear to be vectorized images – it's actually more efficient to use a rasterized format like PNG or TIFF for large annotated trees because you're otherwise storing information about an excessive number of objects to draw.
– Figure S4, what's circulating recombinant form A1? do you mean subtype A1?
– Equation S12, please use or something similar to prevent the MSM and HSX superscripts from being rendered in math mode
– Please explain how the pairs plots (Figures S18, S19) were used to assess convergence of the Hamiltonian Monte Carlo chain samples. The figure legends don't provide any information.
Reviewer #2 (Recommendations for the authors):
While Amsterdam has done a tremendous job identifying and treating HIV infections and a large proportion of the HIV infected population is in care, the authors results highlight that there are often still substantial delays in diagnosis and that as many as 20% of individuals infected between 2014–2018 remained undiagnosed by the end of 2018. These results have important implications for HIV prevention strategies and policy: long–term goals aimed at diagnosing or treating a certain fraction of the overall infected population may need to be reconsidered in a more dynamic context focused on the delays between infection and diagnosis/treatment.
Overall, the study design is strong and the statistical methods used to estimate the time to diagnosis and the proportion of undiagnosed infections are sound.
However, it's not clear to me what assumptions are being made to estimate the proportion of new infections arising locally from individuals living in Amsterdam. It seems there may be several latent assumptions here, namely (1) there are no unobserved transmission events in the transmission chains between individuals living in/outside of Amsterdam (2) sampling more extensively outside of Amsterdam would not break apart the observed transmission chains into a larger number of smaller chains (with more external introductions from outside of Amsterdam). It seems to me these assumptions could bias the estimated proportion of new infections arising locally, and therefore the "preventable fraction", upwards. At the very least I think it would be helpful to be more transparent about the assumptions behind the estimator given in eq. 2.
Pg. 5 "predicted actual transmission chains" – not clear what is meant by this.
Pg. 6 "N_C is the estimated number of transmission chains which emerged between 2014–2018" –– is N_C the estimated number of transmission chains or rather the estimated number of individuals in those transmission chains? It seems like it would need to be the later to make N_C directly comparable to N_I.
Reviewer #3 (Recommendations for the authors):
The paper uses various sources of data to identify when and among whom new HIV transmissions are occurring in Amsterdam. The models use multiple data inputs and the conclusions are of high public health importance. The results appear to be largely supported by the analyses. The multidisciplinary approach could be of use for tracking infectious diseases in other settings.
A few areas of the paper could use clarification as described below
– I am a bit confused by the term upwards bias in time to diagnosis estimates. Please be specific in what is meant by this phrase.
– The data in the 2nd and 3rd to last columns in Table 1 would be easier to digest in a histogram.
– Tables 2 and 3 would also be easier to digest in graphical form.
– Please provide 1–2 sentences summarizing the Bayesian methods in Pantazis et al. in the Methods.
– Please justify the assumption that time to diagnosis did not change substantially over 2010–2019. It seems like that with lower true incidence, in theory a higher proportion of diagnoses could occur during later stages of infection.
– Please more precisely define "uninterrupted state label".
– Why were other subtypes or recombinants excluded?
– The blue and turquoise in Figure 3 are difficult to discriminate.
– Discussion: What is the ECDC model? Please specify.
https://doi.org/10.7554/eLife.76487.sa1Author response
Essential revisions:
Please clarify certain assumptions about the model regarding the source of local infections, as described mostly by reviewer 2 and a bit by the other 2 reviewers
We have been very much encouraged by the overall positive feedback on our work from the editorial board and the reviewers. Thank you.
With regards to the essential revisions noted above, we carefully updated the entire manuscript as best as possible to describe our definitions, assumptions, and implications including possible caveats.
We further met one more time with the wider HTeam in Amsterdam (the PIs, physicians, patient representatives and city authorities involved in the HIV transmission elimination team Amsterdam) to clarify language around the notion of “Amsterdam transmission chains”, as we felt this is especially important in the wider context of concerns around using HIV phylogenetics methods.
We now define the term Amsterdam transmission chains more clearly in the Introduction,
“Here, we build on Amsterdam’s combined case and genomic surveillance data to reconstruct transmission chains at city level, defined as a single introduction of HIV into Amsterdam residents, followed by a direct infection chain among Amsterdam residents (Figure 1).”
We clarified the nature of our location data in the Methods section,
“We geolocated diagnosed infections to Amsterdam based on patients’ postcode of residence at time of first registration in ATHENA or the most recent registration update, which includes PLHIV that changed residence to Amsterdam prior to first registration and PLHIV that changed residence to another Dutch municipality after first registration.”
We further clarified on the exact search criteria to define the background sequence data set, which is crucial to be able to separate infected Amsterdam residents into distinct phylogenetic clades (or subgraphs using the preferred terminology associated with the phyloscanner software). We highlight that this was an iterative analysis process, involving manual inspection of identified Amsterdam clades with incountry experts to detect unlikely large clades and refine the search criteria,
“To reconstruct distinct HIV transmission chains among Amsterdam residents, we used the first available partial HIV1 polymerase (pol) sequence from Amsterdam PLHIV, Dutch PLHIV from outside Amsterdam, and ~82,000 pol sequences from nonDutch PLHIV. The nonDutch viral sequences were retrieved from the Los Alamos HIV1 sequence database subject to a length of at least 1300 in the pol gene on March 2, 2020 (www.hiv.lanl.gov). The basic local alignment search tool (BLAST v2.10.0) was used to select the top 20 closest background sequences to any Dutch sequence (Altschul et al. 1990).”
We further clarified our assumptions around our epidemiological interpretation of the observed phylogenetic data,
“In the labelled phylogeny, the lineage labels jump backwards in time, e.g. from Amsterdam MSM associated with a lineage ending in a tip observed in Amsterdam MSM to Western Europe. Thus, we can group lineages according to the same label between jumps, and we follow Wymant et al. in referring to these groups as phyloscanner subgraphs (Wymant et al. 2018). We assumed that we have sufficient background sequences such that no additional background sequences would further separate transmission chains among Amsterdam residents into more distinct chains. A subtle but important related point is that with the available location data at time of registration or a registration update, we are only able to phylogenetically reconstruct transmission chains by residence status rather than the location at which transmission actually occurred. For example, two Amsterdam residents appear in the same phyloscanner subgraph if they infected each other during a shortterm visit in another Dutch, European or global location, if they were both infected from a common source during such a shortterm visit and the source remained unsampled, if they infected each other before they began their residence in Amsterdam, or after they moved to another Dutch municipality. Diagnosed Amsterdam patients in the same subgraph were then interpreted as belonging to the same transmission chain, and the estimated state of the root of the subgraph was interpreted as the geographical origin of the transmission chain. Throughout, we refer to the subgraphs also as the phylogenetically observed (parts of) transmission chains. Using this approach, we note that unlike most phylogenetic clustering analyses (Brenner et al. 2017), every infected patient with a sequence is included in one subgraph, and all partially observed transmission chains of size one are included in the analysis to ensure that the entire distribution of observed transmission chains is represented in the analysis (Bezemer et al. 2022). To capture phylogenetic uncertainty, phylogenetic analyses were repeated on 100 bootstrap replicates drawn from each subtype alignment, and transmission chains were enumerated across these replicate analyses.”
We cut shorter descriptions of the directly observed data in the Results, section Growth of the phylogenetically observed parts of citylevel transmission chains,
“The emerging chains thus outnumbered the growing preexisting chains in both Amsterdam MSM and heterosexuals. However, the observed phylogenetic data are challenging to interpret directly because larger proportions of recent infections remain undiagnosed, approximately half of diagnosed individuals did not have a sequence sampled, and small chains are more likely to remain entirely unobserved (see Materials and methods).”
We more clearly describe the term ‘locally preventable infections’,
“From the emerging transmission chains, we can directly estimate the proportion of Amsterdam infections since 2014 that had an Amsterdam source (see Materials and methods). We interpret these infections as locally preventable, because they are within the reach of the HIV prevention efforts in Amsterdam.”
We added clarifications on the interpretability of the reconstructed transmission chains among Amsterdam residents at the start of the Discussion,
“More than 300 cities have by the end of 2021 signed the FastTrack Cities Paris Declaration and committed to end the AIDS epidemic by 2030, addressing disparities in access to basic health and social services, social justice and economic opportunities. The city of Amsterdam reached the UNAIDS Fasttrack Cities 959595 targets before the onset of the COVID19 pandemic, and has seen a decade of declines in citylevel HIV diagnoses. Here, we characterised the number, size and growth of HIV transmission chains among Amsterdam residents, and quantified the further potential of preventing HIV infection at city level. It is important to recognize that through the analyses conducted here, the exact location of infection events cannot be identified. Rather, the available location data enable us to identify groups of Amsterdam residents with phylogenetically distinct HIV, which are the inferential basis for estimating the number, size, and growth of the actual unobserved transmission chains among Amsterdam residents. Regardless of the exact infection location, Amsterdam residents live in Amsterdam, and are thus within reach of Amsterdam public health and local prevention interventions.”
We further clarified the strength of evidence of our primary findings in the Discussion,
“Fourth, we quantified the locally preventable infections among Amsterdam residents in 20142018, defined as the infections in Amsterdam residents in 20142018 who are estimated to have as source another Amsterdam resident. Using the virus’ genetic code as an objective marker into infection events, we estimate that regardless of declining diagnoses and incidence, the majority of infections in Amsterdam residents in 20142018 remained locally preventable in all risk groups investigated. The statistical strength of evidence into this finding was strong for Amsterdam MSM (all 95% credible intervals for the proportion of locally preventable infections were above 50%), but more moderate for Amsterdam heterosexuals (wider credible intervals including 50%), reflecting that relatively few infections in Amsterdam heterosexuals in 20142018 were observed with a viral sequence by early 2019 due to frequent late diagnosis and incomplete viral sequencing.”
Reviewer #1 (Recommendations for the authors):
This manuscript describes an analysis of the transmission dynamics of HIV–1 in Amsterdam in the context of a rapid scale–up of prevention measures. Rather than relying on a single source of information, it integrates both virus sequence diversity (by phylogenetic analysis) and statistical modelling to derive estimates of the numbers of unsampled infections and to use clinical biomarkers to estimate dates of infection.
Specifically, the authors reconstructed phylogenetic trees relating large numbers of HIV–1 sequences collected in Amsterdam, outside Amsterdam, and from around the world. The distribution of risk groups and locations was extrapolated by maximum parsimony from the observed sequences at the tips of the tree down towards their common ancestors, and clusters associated with local transmission within Amsterdam were extracted. This process was repeated for 100 replicate trees, which is an important step for accommodating the uncertainty in reconstructing phylogenetic trees.
The authors used a Bayesian method to fit a hierarchical model of the time elapsed between HIV infection and diagnosis ("age" of infection), as predicted by viral load and CD4 measurements, and stratify the model by risk group and location. The resulting estimated dates of infection provide an important context to interpreting the transmission clusters extracted from the phylogenetic trees – many studies have endeavoured to obtain similar information.
The presence of undiagnosed infections that are epidemiologically related to clusters in the study was estimated by fitting a branching process model to the clusters augmented with estimated infection dates. This is a very noteworthy aspect of the analysis, especially because the failure to observe undiagnosed infections is a significant issue that makes it difficult to interpret HIV–1 clusters derived from the comparative analysis of sequence variation. In this case, clusters of recently diagnosed individuals who were infected for a long period of time, as indicated by clinical biomarkers, implies a larger number of unsampled infections related to that particular group.
The authors report that, despite the success of scaling up prevention measures in Amsterdam at limiting the onward transmission of HIV–1, there remains a substantial undiagnosed population associated with a disproportionately greater number of recent infections, and that this largely affects foreign–born men who have sex with men. These methodological advances and unique data resources distinguish this study from an abundant literature on HIV–1 transmission clusters. Most of my comments, provided as recommendations for the authors, about the manuscript concern language and the presentation of results.
Thank you for this positive evaluation.
– Page 2, "(HIV) is concentrated in metropolitan areas, […] representing 26% of the global HIV burden" – minor point, it would be easier to interpret this statistic if it were accompanied with a rough estimate of what share of the global population is contained in metropolitan areas
Thank you for this comment. We have reviewed these estimates in the main paper as stated in the UNAIDS report, and found the actual basis of these estimates to be vague with little information on how they were generated. For this reason, we have decided to remove the quantitative part of this statement, and leave only the qualitative subsentence in the introduction. Specifically, we have changed the text in the Introduction as follows:
“Human immunodeficiency virus (HIV) is concentrated in metropolitan areas (Joint United Nations Programme on HIV/AIDS (UNAIDS) 2014).”
– Page 2, is there a URL or reference for the UNAIDS fast–track Cities initiative?
Thank you for highlighting this, we have added a reference to the UNAIDS FastTrack cities website where it is first mentioned,
“In response, as of March 2021 over 300 cities have joined the FastTrack Cities initiative (www.fasttrackcities.org) by signing the Paris Declaration, committing to end the AIDS epidemic by 2030, by addressing disparities in access to basic health and social services, social justice and economic opportunities (UNAIDS 2019).”
– Page 3, "among infections that are estimated to have occurred since 2014" – is the significance of this cutoff year associated with the fast track initiative? It would be helpful to spell this out for the reader.
Thank you for this comment. 2014 was the year Amsterdam joined the FastTrack Cities network and the HTeam was founded. We have amended the text as follows:
“…among infections that are estimated to have occurred since Amsterdam joined the FastTrack Cities network in 2014 and galvanised its local response with the inception of the HTeam on December 1, 2014.”
– Page 4, I have heard of the Weibull distribution but not the Weibull likelihood – please clarify.
Thanks for raising this. We have revised the text to clarify that we fit a model in which the times to diagnosis follow a Weibull distribution as follows:
“We next reconstructed characteristic timetodiagnosis distributions for each of the 9 Amsterdam risk groups (MSM/heterosexual, and region of birth) with a Bayesian hierarchical model from the individuallevel estimates, modelling the individuallevel estimates with a Weibull distribution.”
– Page 4, "we used the first available partial HIV–1 polymerase (pol) sequence" – please provide HXB2 nucleotide coordinates
We have added the coordinates and referenced the HXB2 genome as follows in the Methods section, subsection ‘Phylogenetic reconstruction of citylevel transmission chains’:
“Subtypespecific alignments were generated with Virulign (Libin et al. 2019) (Appendix 1 Section 4.1) and sequences from other subtypes were added as outgroup for the purpose of phylogenetic rooting. The final alignments were trimmed to positions 22533870 in the reference genome HXB2 (Ratner et al. 1985).”
– Page 4, "retrieved from the Los Alamos HIV–1 sequence database" – the LANL database requests authors to provide the URL (https://www.hiv.lanl.gov)
Thank you for pointing this out. We have added in a reference to the URL:
“The nonDutch viral sequences were retrieved from the Los Alamos HIV1 sequence database subject to a length of at least 1300 in the pol gene on March 2, 2020 (www.hiv.lanl.gov).”
– Page 4, "All sequences were subtyped using Comet and Rega." Please provide version numbers and briefly explain how discordant subtype assignments were resolved between these two algorithms. (I found the explanation in the supplementary material and it's not that much longer – why not explain it in main text?)
Thank you for highlighting this. We have added version numbers and clarified how Rega was used to verify uncertain subtypes from Comet as follows:
“All sequences were subtyped using Comet v2.3 (Struck et al. 2014). Sequences with an uncertain subtype classification using Comet were analysed with Rega v3.0 (PinedaPeña et al. 2013). Any remaining sequences for which a subtype could not be resolved were discarded from further analysis (n = 122).”
– Page 4, did you use a double–precision version of FastTree?
We did not use a doubleprecision version, and have noted this for future work. Thank you for pointing this out.
– Page 4, I have a problem with the term "phylogenetic subgraphs". A subgraph is a subset of nodes in the network that is not necessarily a single connected component, whereas the authors are presumably referring to either subtrees (all descendants from an internal node), or subset trees (a contiguous subset of descendants from an internal node). In the HIV molecular epidemiology literature, we would refer to these as subtrees.
Thanks for this query. We refer to subgraphs in the context of the phyloscanner software, and obtained the following clarifications from the lead authors: “Here, host subgraphs are the result of ancestralhost state reconstruction, defined as connected regions of the tree that are assigned the same host state. The background subgraphs do not necessarily include every descendant of a node, so we choose not to use the term ‘subtree’, which has a very well understood meaning in phylogenetics as you have defined it.”
To make the distinction clearer and honour this recommendation, the first time we refer to these subgraphs, we have corrected “phylogenetic subgraphs” to say “phyloscanner subgraphs”, as follows:
“In the labelled phylogeny, the lineage labels jump backwards in time, e.g. from Amsterdam MSM associated with a lineage ending in a tip observed in Amsterdam MSM to Western Europe. Thus, we can group lineages according to the same label between jumps, and we follow (Wymant et al. 2018) in referring to these groups as phyloscanner subgraphs.”
– Page 4, "Diagnosed Amsterdam patients in the same subgraph are interpreted as belonging to the same transmission chain" – this interpretation is not very accurate – while individuals in a distinct subtree are more epidemiologically related, they are not necessarily in the same "transmission chain" and it is misleading to make this association. The conventional terminology would be "transmission cluster".
Thank you for this comment. We have discussed this point at length amongst the writing team, and the wider HTeam in Amsterdam (the PIs, physicians, patient representatives and city authorities involved in the HIV transmission elimination team Amsterdam) to identify the best language to use. Ultimately, we decided against the word “transmission cluster” due to its similarity with the terms “phylogenetic cluster” and “phylogenetic clustering methods” that are ubiquitous in HIV phylogenetics, and which we have not adopted here.
Yet, your comment touches upon several important points and concepts, which we have modified throughout. We now define the term Amsterdam transmission chains more clearly in the Introduction,
“Here, we build on Amsterdam’s combined case and genomic surveillance data to reconstruct transmission chains at city level, defined as a single introduction of HIV into Amsterdam residents, followed by a direct infection chain among Amsterdam residents (Figure 1).”
We clarified the nature of our location data in the Methods section,
“We geolocated diagnosed infections to Amsterdam based on patients’ postcode of residence at time of first registration in ATHENA or the most recent registration update, which includes PLHIV that changed residence to Amsterdam prior to first registration and PLHIV that changed residence to another Dutch municipality after first registration.”
We further clarified our assumptions around our epidemiological interpretation of the observed phylogenetic data,
“In the labelled phylogeny, the lineage labels jump backwards in time, e.g. from Amsterdam MSM associated with a lineage ending in a tip observed in Amsterdam MSM to Western Europe. Thus, we can group lineages according to the same label between jumps, and we follow Wymant et al. in referring to these groups as phyloscanner subgraphs (Wymant et al. 2018). We assumed that we have sufficient background sequences such that no additional background sequences would further separate transmission chains among Amsterdam residents into more distinct chains. A subtle but important related point is that with the available location data at time of registration or a registration update, we are only able to phylogenetically reconstruct transmission chains by residence status rather than the location at which transmission actually occurred. For example, two Amsterdam residents appear in the same phyloscanner subgraph if they infected each other during a shortterm visit in another Dutch, European or global location, if they were both infected from a common source during such a shortterm visit and the source remained unsampled, if they infected each other before they began their residence in Amsterdam, or after they moved to another Dutch municipality. Diagnosed Amsterdam patients in the same subgraph were then interpreted as belonging to the same transmission chain, and the estimated state of the root of the subgraph was interpreted as the geographical origin of the transmission chain. Throughout, we refer to the subgraphs also as the phylogenetically observed (parts of) transmission chains. Throughout, we refer to the subgraphs also as the phylogenetically observed (parts of) transmission chains. Using this approach, we note that unlike most phylogenetic clustering analyses (Brenner et al. 2017), every infected patient with a sequence is included in one subgraph, and all partially observed transmission chains of size one are included in the analysis to ensure that the entire distribution of observed transmission chains is represented in the analysis (Bezemer et al. 2022). To capture phylogenetic uncertainty, phylogenetic analyses were repeated on 100 bootstrap replicates drawn from each subtype alignment, and transmission chains were enumerated across these replicate analyses.”
We more clearly describe the term ‘locally preventable infections’,
“From the emerging transmission chains, we can directly estimate the proportion of Amsterdam infections since 2014 that had an Amsterdam source (see Materials and methods). We interpret these infections as locally preventable, because they are within the reach of the HIV prevention efforts in Amsterdam.”
– Page 4, "the estimated state of the root of the subgraph is interpreted as the origin of the transmission chain" – this ancestral state reconstruction is going to be quite uncertain in many cases – was this uncertainty propagated from the replicate bootstrap trees? – it would also be good to emphasize that "origin" refers to location and risk group, and not a single individual
Thank you for this question. Indeed, we estimate the origins for the central trees and 100 replicate trees, generated from bootstrapped alignments, to capture phylogenetic uncertainty. We fully agree that there is uncertainty in exactly which of the location labels (the 11 nonAmsterdam locations, the Netherlands, Africa, Western Europe, Eastern Europe and Central Asia, North America, Latin America and the Caribbean, Dutch Caribbean and Suriname, Middle East and North Africa, and South and SouthEast Asia and Oceania). However please note that for our purposes and the calculation of the locally acquired infections, it is only relevant whether a transmission chain originated within or outside of Amsterdam, and this is associated with less phylogenetic uncertainty. Specifically, we find good agreement in Ams/nonAms origin allocations across all 100 bootstrap replicate trees: well over 80% of the Amsterdam phyloscanner subgraphs have the Ams/nonAms origin allocation (see Author response image 1). Please note that the tips in each subgraph may change across bootstrap trees, so we here mapped subgraphs by identifying the subgraph with the most common tips shared with each subgraph from the central tree. Note also that this uncertainty is reflected in our calculations, specifically equation S56 in Appendix 1 Section S7.2.
We have also clarified that “origin” refers to the geographic origin of the transmission chain in the Methods section:
“Diagnosed Amsterdam patients in the same subgraph are interpreted as belonging to the same transmission chain, and the estimated state of the root of the subgraph is interpreted as the geographical origin of the transmission chain”
– Figure 1, please check this figure for color accessibility – also the use of thin (low weight) font and narrow lines in the tree may make the figure difficult to read for the vision impaired
Thank you for highlighting this. We have modified the figure to improve colour accessibility, increased font size and revised the tree subfigure to improve legibility.
– Figure 3, I don't understand why there is so much unused space at the top half of the lower plots – I surmise that the authors intend to preserve the y–axis scales between the top and bottom plots for each of A and B columns, but why not truncate the lower plots? I also think the "zoom in" plots have limited utility and could be moved to supplementary material. Otherwise the plots are a rather effective visualization method.
Thank you for pointing this out. We have reduced the white space in the subfigure of subgraphs among heterosexuals. Since it is important to see both the transmission chains which did not grow further since 2014 (panel A) and also the detail of the transmission chains which grew since 2014 (panel B), we have added some annotation to indicate that these are ‘zoomed in’ from panel A.
– Page 12, please provide some details on the branching process model, since it plays an important role in this analysis – although it is well described in supplementary material, it would be helpful to give a little more info in the main text
Thank you for raising this. We agree and have added more details in the model, including the probability densities for the actual chain sizes and observed chain sizes as follows:
“Specifically, given $m=1,\dots ,M$ index cases of a chain that preexisted, the final size distribution of stuttering transmission chains is under a Negative Binomial branching process model given by
$c(im,\mu ,\varphi )\text{}=\text{}\frac{m}{m+i}$ NegBin
where NegBin is the Negative Binomial distribution characterised by mean $\text{\mu m}$ and dispersion parameter $\text{\varphi m}$, $i=0,1,2,\dots$ is the number of new cases, and μ < 1. Incomplete sampling of new cases can be accommodated via
where $\rho$ denotes the probability that a new case in 20142018 is diagnosed and has a viral sequence sampled by database closure. In the model, the index cases are assumed to be infectious and defined by the number of unsuppressed members by 2014 in a preexisting chain, adjusted for the sampling probability of such members. We further capped the infinite sum in (3) in the model, recognizing that the summands rapidly tend to zero. The corresponding equations for emergent transmission chains (since 2014 as defined above) is similar,
The likelihood then comprises the growth distributions of emerging chains, preexisting chains that continued to grow, and preexisting chains with unsuppressed members that did not grow, with the following loglikelihood,
where $M$ is the largest number of index cases observed across the chains after adjusting for sampling, $I$ is the largest number of new cases observed in preexisting chains and $N$ is the largest number of new cases observed in emergent chains, including the first case. Preexisting chains for which all members were suppressed by 2014 and which did not grow were not included, because these chains had no unsuppressed index case. We fitted the branching process model under a Bayesian framework with Stan version 2.21 to MSM chains, borrowing information across subtypes, and similarly for heterosexual chains.”
– Figure 4, for what it's worth, these palettes render well in black and white
Thank you for the feedback.
– Page 15, "the proportion of undiagnosed individual individuals infected with HIV are" – singular or plural?
Thank you for highlighting this. We have amended the sentence to reflect that there are several proportions, by place of birth, as follows:
“First, when focusing on the denominator of recent infections that are estimated to have occurred in 20142018, the proportions of undiagnosed individuals infected with HIV are all between 920% in (selfidentified) Amsterdam MSM risk groups, and between 2857% in Amsterdam heterosexual risk groups.”
– Code availability – this GitHub repository appears to be set to private – please make this available for peer review
Thanks for raising this. The repository has now been made public at github.com/alexblenkinsop/locally.acquired.infections.
– Figures S4–S14 – I'm not a fan of radial layouts for displaying phylogenies with node annotations – also, are these trees rooted? this layout method implies that they are, but I did not see any methodological description for rooting trees. Also, these appear to be vectorized images – it's actually more efficient to use a rasterized format like PNG or TIFF for large annotated trees because you're otherwise storing information about an excessive number of objects to draw.
Thank you for raising this. We have added some further details on how the trees were rooted to the methods section as follows:
“Subtypespecific alignments were generated with Virulign (Libin et al. 2019) (Appendix 1 Section 4.1) and sequences from other subtypes were added as outgroup for the purpose of phylogenetic rooting. The final alignments were trimmed to positions 22533870 in the reference genome HXB2 (Ratner et al. 1985).
Subtypespecific HIV phylogenetic trees were generated for alignments with at least 50 Amsterdam sequences (subtypes and recombinant forms B, 01AE, 02AG, C, D, G, A1 or 06cpx) using FastTree v2.1.8 (Price, Dehal, and Arkin 2010). Following tree reconstruction, trees were manually rooted at the outgroup, and the outgroup taxa were then pruned from the phylogeny. Next, we attributed”
We have now also plotted the inferred, rooted trees in standard layout, and added these due to size as Supplementary Figures to the manuscript (Appendix 1 – Figures 425).
– Figure S4, what's circulating recombinant form A1? do you mean subtype A1?
Thank you for spotting this. We have amended the legend to say subtype A1.
– Equation S12, please use or something similar to prevent the MSM and HSX superscripts from being rendered in math mode
Thank you for highlighting this. We have amended the risk groups to not be formatted in math mode.
– Please explain how the pairs plots (Figures S18, S19) were used to assess convergence of the Hamiltonian Monte Carlo chain samples. The figure legends don't provide any information.
Thank you for raising this. For both the models for the undiagnosed and the branching process model we have added trace plots and reported the parameter with the smallest effective sample size as follows.
In section S3.3.2 of Appendix 1:
“The models mixed well, with no correlation between parameters, and had at most one divergence (Figures 3033). The smallest effective sample size across parameters for the MSM model was 1562 and 1326 for the heterosexual model.”
In section S6.3 of Appendix 1:
“The models mixed well; Figures 4041 show the trace plot for the parameter with the smallest effective sample size in each model, which was 2377 for the MSM model and 1125 for the heterosexual model. Figures 4243 shows the pairs plot of parameters for the MSM and heterosexual models, respectively. ”
Reviewer #2 (Recommendations for the authors):
While Amsterdam has done a tremendous job identifying and treating HIV infections and a large proportion of the HIV infected population is in care, the authors results highlight that there are often still substantial delays in diagnosis and that as many as 20% of individuals infected between 2014–2018 remained undiagnosed by the end of 2018. These results have important implications for HIV prevention strategies and policy: long–term goals aimed at diagnosing or treating a certain fraction of the overall infected population may need to be reconsidered in a more dynamic context focused on the delays between infection and diagnosis/treatment.
Overall, the study design is strong and the statistical methods used to estimate the time to diagnosis and the proportion of undiagnosed infections are sound.
However, it's not clear to me what assumptions are being made to estimate the proportion of new infections arising locally from individuals living in Amsterdam. It seems there may be several latent assumptions here, namely (1) there are no unobserved transmission events in the transmission chains between individuals living in/outside of Amsterdam (2) sampling more extensively outside of Amsterdam would not break apart the observed transmission chains into a larger number of smaller chains (with more external introductions from outside of Amsterdam). It seems to me these assumptions could bias the estimated proportion of new infections arising locally, and therefore the "preventable fraction", upwards. At the very least I think it would be helpful to be more transparent about the assumptions behind the estimator given in eq. 2.
Thank you for this positive and indeed very helpful public review. Please note our response to the editor at the beginning of this document for a detailed summary on how we clarified definitions, assumptions, and the interpretation of our findings.
Pg. 5 "predicted actual transmission chains" – not clear what is meant by this.
Thank you for raising this point. We agree. We have clarified the text as follows:
“Given estimates of the number and growth of both preexisting and emergent transmission chains, it is straightforward to derive estimates of the proportion of HIV infections among Amsterdam residents in 20142018 that had an Amsterdam resident as source (which we denote by $\text{\gamma}$ and refer to as the proportion of locally acquired infections). This is because all infections originating from an individual living in Amsterdam had a local source, except the index cases in the emerging chains that were introduced from outside of Amsterdam.”
Pg. 6 "N_C is the estimated number of transmission chains which emerged between 2014–2018" –– is N_C the estimated number of transmission chains or rather the estimated number of individuals in those transmission chains? It seems like it would need to be the later to make N_C directly comparable to N_I.
Thank you for highlighting this. N_C is the number of transmission chains. By definition, each chain has one index case, and so it is comparable to N_I. We have revised our text to clarify our reasoning behind this key equation as follows
“Given estimates of the number and growth of both preexisting and emergent transmission chains, it is straightforward to derive estimates of the proportion of HIV infections among Amsterdam residents in 20142018 that had an Amsterdam resident as source (which we denote by and refer to as the proportion of locally acquired infections). This is because all infections originating from an individual living in Amsterdam had a local source, except the index cases in the emerging chains that were introduced from outside of Amsterdam. We have
where $N}_{\text{I}$ is the estimated number of new infections between 20142018 in both preexisting chains and emerging transmission chains including the index case, $N}_{C$ is the estimated number of transmission chains which emerged between 20142018 and $\alpha$ is the proportion of emergent transmission chains with an Amsterdam origin. Since each transmission chain has one index case, $\text{\alpha}{N}_{C}$ is the estimated number of infections with nonAmsterdam origin, and $N}_{I}\alpha \phantom{\rule{thinmathspace}{0ex}}{N}^{c$ is the estimated number of infections that had an Amsterdam resident as a source.”
Reviewer #3 (Recommendations for the authors):
The paper uses various sources of data to identify when and among whom new HIV transmissions are occurring in Amsterdam. The models use multiple data inputs and the conclusions are of high public health importance. The results appear to be largely supported by the analyses. The multidisciplinary approach could be of use for tracking infectious diseases in other settings.
A few areas of the paper could use clarification as described below
– I am a bit confused by the term upwards bias in time to diagnosis estimates. Please be specific in what is meant by this phrase.
Thank you for raising this. We agree. We have clarified our language in the Results and Discussion section as follows:
In the Results section:
“While the bivariate model of biomarker data that underpins the individuallevel timetodiagnosis estimates has been validated (Pantazis et al. 2019), our estimates of the proportion of undiagnosed infections in 20142018 depend further on the trends in the number of infections in each year as shown in Equation 2. The main analysis is based on trends in HIV infections in MSM and heterosexuals that were estimated with the ECDC HIV Modelling Tool for Amsterdam. The ECDC estimates account for late diagnoses, but apply to MSM and heterosexuals across Amsterdam. In sensitivity analyses we used instead trends in directly observed Amsterdam diagnoses, which apply to each Amsterdam risk group but do not account for confounding due to late diagnoses. In the sensitivity analysis, we estimate that 14% [1317%] of infections in Amsterdan MSM in 20142018 remained undiagnosed, and 34% [2841%] in Amsterdam heterosexuals. Further details are presented in Appendix 1, Section 3.33.5.”
In the Discussion section:
“We explored the impact of assumptions on incidence trends to the undiagnosed estimates and found some sensitivities (Appendix 1, Section 3.3), although estimates were all very similar as long as the assumed incidence trends reflected available data. Further sensitivity analyses are reported in Appendix 1 Section 3.43.5. We also further validated the timetodiagnosis estimates by comparing the estimated proportion of recent HIV infections (≤6 months) with those estimated in an independent study in Amsterdam using avidity assays (Slurink et al. 2021), and found them to be similar (Appendix 1 Figure 28). The main limitation of our biomarker approach is thus that at present we cannot account for time trends in timetodiagnosis.”
– The data in the 2nd and 3rd to last columns in Table 1 would be easier to digest in a histogram.
Thank you for this comment. We had indeed considered multiple different versions of Table 1, with the aim to place the derived estimates in the last column into a logical broader context. At the same time we appreciate that the table is busy, and several columns could also be presented in a figure.
In our revision we have opted to keep the column “Estimated undiagnosed” in the table, so that readers see the logical steps from the observed infections to the estimated total infections, and have direct access the numerical values, which are of interest. Based on this comment we have moved the column “Estimated time to diagnosis” into the new Appendix 1 – Figure 3:
We have further taken the opportunity to clarify the column names of the table as follows:
– Tables 2 and 3 would also be easier to digest in graphical form.
Thank you for this comment. We created the corresponding figures based on your input, however upon discussion with all coauthors, the majority felt that in this instance, it is important to explicitly list the numbers in favour of showing the overall trend in a figure. We have left tables 2 and 3 untouched.
– Please provide 1–2 sentences summarizing the Bayesian methods in Pantazis et al. in the Methods.
This is a good point. We realise our description has been very brief. We expanded the text in the Methods section as follows:
“Using longitudinal viral load and CD4 count data and further demographic and clinical information, we estimated time from infection to diagnosis for all HIV diagnosed patients with a Bayesian approach (Pantazis et al. 2019). Briefly, data from the CASCADE collaboration on 19,788 observed HIV seroconverters were used to parameterize a bivariate normal linear model of the joint time evolution of HIV viral load and CD4 cell count decline since time of infection in the context of additional covariates (sex, region of origin, mode of infection, age at time of diagnosis). Then we used the trained model to estimate infection times from longitudinal biomarker data for Amsterdam patients, with an average of 4 viral load observations and 6 CD4 cell count observations per patient.”
– Please justify the assumption that time to diagnosis did not change substantially over 2010–2019. It seems like that with lower true incidence, in theory a higher proportion of diagnoses could occur during later stages of infection.
Thank you, this is a good point. We have addressed this in two parts.
1/ At present our methodological approach cannot be used to derive time trends in the individuallevel timetodiagnosis estimates. Separate published modelling efforts suggest that this is not a major caveat, and we have added corresponding references in the Methods section:
“We next reconstructed characteristic timetodiagnosis distributions for each of the 9 Amsterdam risk groups (MSM/heterosexual, and region of birth) with a Bayesian hierarchical model from the individuallevel estimates, modelling the individuallevel estimates with a Weibull distribution. To avoid censoring of infectiontodiagnosis times, we focused analyses on the subset of infections in 20102012 which were diagnosed by May 1, 2019 since most infections in this window would have been diagnosed by the close of study, and assume time to diagnosis did not change substantially in 20102019 (Ard van Sighem et al. 2017; Ard van Sighem 2017).”
We call out this caveat in the Discussion as well:
“The main limitation of our biomarker approach is thus that at present we cannot account for time trends in time to diagnosis.”
2/ Regardless, even with an assumed time to diagnosis that is constant in 20142018 (importantly: separately for each pop strata), we find that among infections occurring in year y=2014, 2015, …, 2018, the proportion of infections that remains undiagnosed is increasing by year. This is a simple consequence of the fact that for each infection year cohort, we are progressively observing lower quantiles in the timetodiagnosis distribution (please recall Figure 2). Now, to obtain an overall estimate of the undiagnosed infections, we need to weight the annualised estimates, and we realised based on your very helpful comment that we did not account for the declining incidence trends in our calculations. This had a substantive impact on our estimates – thank you again for sharing your insight here.
We now make this methodological step clear in the Methods section:
“We then calculated the proportion of infections in each year $y=2014,\dots ,2018$ in each of the 9 Amsterdam risk groups that were not diagnosed by database closure (which we denote by $\delta}_{\text{y}$) from the fitted model. To adjust for trends in incidence over time, the annual estimates were weighted by the estimated number of HIV infections in each year among Amsterdam MSM and heterosexual individuals without stratifiction by inmigrant status, according to the European Centre for Disease Control and Prevention (ECDC) HIV modelling tool for Amsterdam, version 1.3.0 (Stockholm: European Centre for Disease Prevention and Control 2017) through weights,
where y = 2014,…,2018 and $N}_{y}^{InfECDC$ are the estimated total number of infections in year $y$ in Amsterdam MSM or heterosexuals. We then estimated the overall estimate of the proportion of undiagnosed infections in 20142018, $\delta $, by applying these weights to the yearly proportions through
Recognizing the limitations in applying weights that do not account for differences by place of birth, we used in sensitivity analyses as weights the observed trends in the number of annual HIV diagnoses in the corresponding Amsterdam risk group. ”
Considering incidence rates that decline at the estimated incidence rate from the ECDC modelling for the Dutch population, the weights are:
Assuming incidence rates that decline as the observed diagnosis rates in Amsterdam, the weights are:
Table 7 (added to Appendix 1) summarises the estimates of the proportion of undiagnosed infections using the various approaches, which indicates that our estimates into the proportion of undiagnosed infections is relatively insensitive provided the weighting used in the calculations reflects the declines in incidence rates:
We have modified the Results section as follows:
“Local estimates of the continuum of care indicate that Amsterdam has surpassed the 959595 targets, with an estimated 5% of all people in Amsterdam living with HIV that remained undiagnosed by the end of 2019 (A. van Sighem et al. 2020; UNAIDS 2019). Based on the timetodiagnosis estimates in our cohort, we can focus here at the forefront of ongoing transmission chains and quantify the proportion of recent Amsterdam infections in 20142018 that remained undiagnosed by May 1, 2019. Figure 2 shows that the estimated undiagnosed proportions are considerably higher when we focus on infections acquired since 2014. Accounting for declining diagnosis and infection trends (see Methods), an estimated 14% [1216%] of infections in Amsterdan MSM in 20142018 remained undiagnosed, and 41% [3548%] in Amsterdam heterosexuals (Table 1). The highest proportion of undiagnosed Amsterdam infections in 20142018 are in heterosexuals born in SubSaharan Africa, with 57% [4767%].
While the bivariate model of biomarker data that underpins the individuallevel timetodiagnosis estimates has been validated (Pantazis et al. 2019), our estimates of the proportion of undiagnosed infections in 20142018 depend further on the trends in the number of infections in each year as shown in Equation 2. The main analysis is based on trends in HIV infections in MSM and heterosexuals that were estimated with the ECDC HIV Modelling Tool for Amsterdam. The ECDC estimates account for late diagnoses, but apply to MSM and heterosexuals across Amsterdam. In sensitivity analyses we used instead trends in directly observed Amsterdam diagnoses, which apply to each Amsterdam risk group but do not account for confounding due to late diagnoses. In the sensitivity analysis, we estimate that 14% [1317%] of infections in Amsterdan MSM in 20142018 remained undiagnosed, and 34% [2841%] in Amsterdam heterosexuals. Further details are presented in Appendix 1, Section 3.33.5.”
We have also clarified the Discussion:
“We explored the impact of assumptions on incidence trends to the undiagnosed estimates and found some sensitivities (Appendix 1, Section 3.3), although estimates were all very similar as long as the assumed incidence trends reflected available data. We explored the impact of assumptions on incidence trends to the undiagnosed estimates and found some sensitivities (Appendix 1, Section 3.3), although estimates were all very similar as long as the assumed incidence trends reflected available data. Further sensitivity analyses are reported in Appendix 1 Section 3.43.5. We also further validated the timetodiagnosis estimates by comparing the estimated proportion of recent HIV infections (≤6 months) with those estimated in an independent study in Amsterdam using avidity assays (Slurink et al. 2021), and found them to be similar (Appendix 1 Figure 28). The main limitation of our biomarker approach is thus that at present we cannot account for time trends in timetodiagnosis.”
– Please more precisely define "uninterrupted state label".
Thank you for this comment. We decided to refer to this relatively minor detail through a citation, rather than explain it here once more, as follows:
“Next, we attributed to all viral lineages in the phylogenies a ‘state’ label that included information on the transmission risk group (MSM, heterosexual, other) and location with phyloscanner version 1.8.0 (Wymant et al. 2018); see (Bezemer et al. 2022) for details. Locations were classified into Amsterdam (for ATHENA patients with an Amsterdam postcode at time of registration or a registration update), the Netherlands (for other ATHENA patients), and the 10 world regions Africa, Western Europe, Eastern Europe and Central Asia, North America, Latin America and the Caribbean, Dutch Caribbean and Suriname, Middle East and North Africa, and South and SouthEast Asia and Oceania (for nonDutch sequences). In the labelled phylogeny, the lineage labels jump backwards in time, e.g. from Amsterdam MSM associated with a lineage ending in a tip observed in Amsterdam MSM to Western Europe. Thus, we can group lineages according to the same label between jumps, and we follow (Wymant et al. 2018) in referring to these groups as phyloscanner subgraphs.”
– Why were other subtypes or recombinants excluded?
In principle, we can perform phylogeographic analyses and subgraph identification for trees of all identified sequences. However from an analytical point of view this becomes increasingly laboursome while the epidemiological scientific insights at populationlevel are diminishing as sample sizes are decreasing. Our Bayesian hierarchical branching process model can handle subgraph data from all subtypes, but information is increasingly borrowed from the subtypes that are heavily represented in the data. In particular, given we have >1000 taxa associated with subtype B, the subtype specific parameters in the model associated with a rare recombinant will essentially look like those for subtype B. While logical from a statistical point, such estimates are at best uninformative and at worst misleading to public health practitioners. We have thus decided to exclude subtypes or CRFs that are at population level represented by less than 50 sequences, and state that for these subtypes/CRFs, we are unable to provide estimates.
We have modified the text as follows. In the Methods section:
“Subtypespecific HIV phylogenetic trees were generated for alignments with at least 50 Amsterdam sequences using FastTree v2.1.8. For the excluded subtypes/ CRFs, sequence sample sizes were too low to characterise transmission chains and chain growth at populationlevel.”
And in the Results section:
“43 individuals were excluded from further analysis as their subtype identification was inconclusive, or they were associated with other subtypes or recombinant forms with fewer than 50 sequences in Amsterdam. ”
– The blue and turquoise in Figure 3 are difficult to discriminate.
Thank you for pointing this out. We have amended the colour scheme to improve colour accessibility.
– Discussion: What is the ECDC model? Please specify.
This is a good point, we clarified the text and added a reference as stated below.
“We then calculated the proportion of infections in each year $y$ from 2014 to 2018 that were not diagnosed by database closure (which we denote by $\delta}_{\text{y}$) from the fitted model. To adjust for trends in incidence over time, the annual estimates were weighted by the estimated number of HIV infections in each year among Amsterdam MSM and heterosexual individuals, according to the European Centre for Disease Control and Prevention (ECDC) HIV modelling tool for Amsterdam, version 1.3.0 (Stockholm: European Centre for Disease Prevention and Control 2017) through weights,
where y = 2014,…,2018 and $N}_{y}^{InfECDC$ are the estimated total number of infections in year $y$ in Amsterdam.”
https://doi.org/10.7554/eLife.76487.sa2Article and author information
Author details
Funding
Aids Fonds (P29701)
 Alexandra Blenkinsop
 Ard van Sighem
 Oliver Ratmann
 Godelieve J de Bree
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank the steering committee of the Amsterdam HIV transmission initiative and the Machine Learning & Global Health network for earlier comments on this work; and Imperial College Research Computing Service, DOI: 10.14469/hpc/2232, for providing the computational resources to perform this study.
The HTEAM initiative is being supported by Aidsfonds (grant number: 2013169, P29701, P60803), Stichting Amsterdam Dinner Foundation, BristolMyers Squibb International Corp. (study number: AI424541), Gilead Sciences Europe Ltd (grant number: PAHIVPREP160024), Gilead Sciences (protocol numbers: CONL2764222, COUS2761712, CONL9856195), and M.A.C AIDS Fund.
Ethics
Human subjects: As from 2002 ATHENA is managed by Stichting HIV Monitoring, the institution appointed by the Dutch Ministry of Public health, Welfare and Sport for the monitoring of people living with HIV in the Netherlands. People entering HIV care receive written material about participation in the ATHENA cohort and are informed by their treating physician on the purpose of data collection, thereafter they can consent verbally or elect to optout. Data are pseudonymised before being provided to investigators and may be used for scientific purposes. A designated data protection officer safeguards compliance with the European General Data Protection Regulation (Boender et al. 2018).
Senior Editor
 Miles P Davenport, University of New South Wales, Australia
Reviewing Editor
 Joshua T Schiffer, Fred Hutchinson Cancer Research Center, United States
Reviewer
 David A Rasmussen, North Carolina State University, United States
Version history
 Received: December 17, 2021
 Preprint posted: March 15, 2022 (view preprint)
 Accepted: August 1, 2022
 Accepted Manuscript published: August 3, 2022 (version 1)
 Version of Record published: October 7, 2022 (version 2)
Copyright
© 2022, Blenkinsop et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 692
 Page views

 145
 Downloads

 2
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Epidemiology and Global Health
Background:
Machine learning (ML) techniques improve disease prediction by identifying the most relevant features in multidimensional data. We compared the accuracy of ML algorithms for predicting incident diabetic kidney disease (DKD).
Methods:
We utilized longitudinal data from 1365 Chinese, Malay, and Indian participants aged 40–80 y with diabetes but free of DKD who participated in the baseline and 6year followup visit of the Singapore Epidemiology of Eye Diseases Study (2004–2017). Incident DKD (11.9%) was defined as an estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m^{2} with at least 25% decrease in eGFR at followup from baseline. A total of 339 features, including participant characteristics, retinal imaging, and genetic and blood metabolites, were used as predictors. Performances of several ML models were compared to each other and to logistic regression (LR) model based on established features of DKD (age, sex, ethnicity, duration of diabetes, systolic blood pressure, HbA1c, and body mass index) using area under the receiver operating characteristic curve (AUC).
Results:
ML model Elastic Net (EN) had the best AUC (95% CI) of 0.851 (0.847–0.856), which was 7.0% relatively higher than by LR 0.795 (0.790–0.801). Sensitivity and specificity of EN were 88.2 and 65.9% vs. 73.0 and 72.8% by LR. The top 15 predictors included age, ethnicity, antidiabetic medication, hypertension, diabetic retinopathy, systolic blood pressure, HbA1c, eGFR, and metabolites related to lipids, lipoproteins, fatty acids, and ketone bodies.
Conclusions:
Our results showed that ML, together with feature selection, improves prediction accuracy of DKD risk in an asymptomatic stable population and identifies novel risk factors, including metabolites.
Funding:
This study was supported by the National Medical Research Council, NMRC/OFLCG/001/2017 and NMRC/HCSAINV/MOH00101900. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

 Epidemiology and Global Health
The NIHfunded RECOVER study is collecting clinical data on patients who experience a SARSCoV2 infection. As patient representatives of the RECOVER Initiative’s Mechanistic Pathways task force, we offer our perspectives on patient motivations for partnering with researchers to obtain results from mechanistic studies. We emphasize the challenges of balancing urgency with scientific rigor. We recognize the importance of such partnerships in addressing postacute sequelae of SARSCoV2 infection (PASC), which includes ‘long COVID,’ through contrasting objective and subjective narratives. Long COVID’s prevalence served as a call to action for patients like us to become actively involved in efforts to understand our condition. Patientcentered and patientpartnered research informs the balance between urgency and robust mechanistic research. Results from collaborating on protocol design, diverse patient inclusion, and awareness of community concerns establish a new precedent in biomedical research study design. With a public health matter as pressing as the longterm complications that can emerge after SARSCoV2 infection, considerate and equitable stakeholder involvement is essential to guiding seminal research. Discussions in the RECOVER Mechanistic Pathways task force gave rise to this commentary as well as other review articles on the current scientific understanding of PASC mechanisms.