Phylodynamics of SARS-CoV-2 in France, Europe, and the world in 2020
Abstract
Although France was one of the most affected European countries by the COVID-19 pandemic in 2020, the dynamics of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) movement within France, but also involving France in Europe and in the world, remain only partially characterized in this timeframe. Here, we analyzed GISAID deposited sequences from January 1 to December 31, 2020 (n = 638,706 sequences at the time of writing). To tackle the challenging number of sequences without the bias of analyzing a single subsample of sequences, we produced 100 subsamples of sequences and related phylogenetic trees from the whole dataset for different geographic scales (worldwide, European countries, and French administrative regions) and time periods (from January 1 to July 25, 2020, and from July 26 to December 31, 2020). We applied a maximum likelihood discrete trait phylogeographic method to date exchange events (i.e., a transition from one location to another one), to estimate the geographic spread of SARS-CoV-2 transmissions and lineages into, from and within France, Europe, and the world. The results unraveled two different patterns of exchange events between the first and second half of 2020. Throughout the year, Europe was systematically associated with most of the intercontinental exchanges. SARS-CoV-2 was mainly introduced into France from North America and Europe (mostly by Italy, Spain, the United Kingdom, Belgium, and Germany) during the first European epidemic wave. During the second wave, exchange events were limited to neighboring countries without strong intercontinental movement, but Russia widely exported the virus into Europe during the summer of 2020. France mostly exported B.1 and B.1.160 lineages, respectively, during the first and second European epidemic waves. At the level of French administrative regions, the Paris area was the main exporter during the first wave. But, for the second epidemic wave, it equally contributed to virus spread with Lyon area, the second most populated urban area after Paris in France. The main circulating lineages were similarly distributed among the French regions. To conclude, by enabling the inclusion of tens of thousands of viral sequences, this original phylodynamic method enabled us to robustly describe SARS-CoV-2 geographic spread through France, Europe, and worldwide in 2020.
Editor's evaluation
This paper is a comprehensive, quantitative, and robust overview of the global, European, and French genomic epidemiology of SARS-CoV-2 in the first year of the pandemic. It contributes methodological advances in maximum likelihood phylogeography, using multiple scales and providing a simulation-based validation. The results show two distinct patterns of SARS-CoV-2 exchange events between the first and second half of 2020, with Europe being involved in most intercontinental exchanges: France experienced viral introductions primarily from North America and Europe during the first wave, while the second wave saw limited intercontinental movement and a significant contribution of the virus from Russia into Europe.
https://doi.org/10.7554/eLife.82538.sa0Introduction
On December 1, 2019, an outbreak of severe respiratory disease was identified in the city of Wuhan, China (Huang et al., 2020). The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was rapidly identified as the agent of the disease (Zhu et al., 2020), responsible for the ongoing global pandemic of coronavirus disease 2019 (COVID-19). By the end of 2020, the virus caused over 1.8 million deaths worldwide including ~65,000 deaths in France, concomitantly with social and economic devastations in many regions of the world (Mofijur et al., 2021; Santomauro et al., 2021). Since the beginning of COVID-19 pandemic, the scientific community has thoroughly characterized the virus, including its pathogenesis, the monitoring of its circulation in human populations, and the development of several treatments or vaccines (Cevik et al., 2020; Krammer, 2020). Epidemiological models have been particularly helpful to quantify viral spread both in the short and long terms and to inform public health decisions (Hoertel et al., 2020; Kissler et al., 2020).
In addition to clinical and epidemiological insights, viral whole-genome sequencing has become a powerful and invaluable tool to better understand infection dynamics (Volz et al., 2013), including the COVID-19 pandemic. The number of available SARS-CoV-2 whole-genome sequences has rapidly grown thanks to the efforts of scientists and researchers gathered via international networks such as the Global Initiative on Sharing All Influenza Data, GISAID (https://www.gisaid.org/; Khare et al., 2021). These genomic sequences are essential to effectively reconstruct the global viral spread and the origins of variants. Genomic data have become a strong asset in addition to epidemiological data to inform governments and help public health decisions (Attwood et al., 2022; Rife et al., 2017). However, due to the computational time required for many analyses, existing phylogenetic tools are limited for studying large amounts of data such as those generated by widespread viral sequencing. Therefore, it is still necessary to develop methods to analyze large datasets while optimizing computational calculation times. Producing appropriate subsamples through several replicates may be an efficient approach in this matter.
In France, the first COVID-19 suspected case was identified in late December 2019 (Deslandes et al., 2020), and the first confirmed cases of SARS-CoV-2 infection were detected on January 24, 2020, in individuals who had recently traveled in China (Bernard Stoecklin et al., 2020). COVID-19 cases remained scarce until the end of February, when the national incidence curve of new SARS-CoV-2 infections started to rise (Figure 1). By the end of February, reinforced measures were announced, including social distancing, cessation of passenger flights to France, school closure, and finally, a complete lockdown across the entire country from March 17 to May 10, 2020. The reported daily incidence and numbers of severe cases peaked at the beginning of April 2020 before decreasing steadily until August 2020. However, after the relaxation of social distancing measures in June, a second wave of infections occurred in early September peaking at more than 100,000 positive cases and 1300 confirmed deaths in a single day on November 2, 2020 (Figure 1). After this peak, daily incidence and severe COVID-19 cases gradually diminished down to a number of positive daily cases varying between 2000 and 25,000 at the end of 2020 thanks to a second national lockdown applied between October 29 and December 15, 2020. Epidemiological trends were similar in most European countries except for Russia or Romania, where high rates of SARS-CoV-2-related deaths were reported even in the summer of 2020. Of note, the other continents showed different patterns of virus circulation: compared to Europe, the number of deaths increased about 2 weeks later in North America and remained high throughout 2020; and from early May, Asia and South America were also highly impacted by the pandemic (Figure 1—figure supplement 1).

Timeline of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-related deaths and stringency index in France, 2020.
Key events are indicated on the timeline. Official lockdowns included stay home orders and closure of schools and daycares. Based on SARS-CoV-2-related deaths, the two first French epidemic waves are, respectively, dated from March to July 2020, and September to December 2020. SARS-CoV-2-related deaths are displayed as the daily number of deaths (light blue area) and as the weekly average of daily number of deaths (dark blue curve). The stringency (Oxford) index is a composite measure based on different response indicators including school and workplace closures and travel bans, rescaled to a value from 0 to 100 (100 = strictest) (Hale et al., 2021).
Elucidating the SARS-CoV-2 dynamic throughout the various phases of the pandemic is paramount to better anticipate how to limit virus circulation for future viral epidemics (Rife et al., 2017). Here, we analyzed GISAID deposited sequences to elucidate the origins and spread of the virus in France, Europe, and the world from January 1 to December 31, 2020. Through a maximum likelihood discrete trait phylogeographic method, we estimated the main geographical areas that contributed to viral introduction into France and Europe, the countries/continents to which France exported SARS-CoV-2 the most and the contribution of the different French regions to the national circulation of the virus. The main exchanged lineages were also investigated. We looked at the differences in virus circulation during each of the two European epidemic waves of 2020 independently. Given France’s central geographic location in Europe and the high proportion of international travelers visiting this country before the pandemic, we aimed to explore the role that France played in SARS-CoV-2 exchanges both in Europe and worldwide.
Results
Defining appropriate subsamples using simulations
From January 1 to December 31, 2020, a total of 638,706 sequences were retained in our study. Inferring a phylogenic tree with such a large number of sequences would require very long calculation times. To overcome this limit, we constructed smaller datasets by randomly choosing subsamples (with replacement) of the sequences. The number of sequences for each country at each week was chosen to be proportional to the number of SARS-CoV-2-related deaths per country and per week with a 2-week shift to account for the time between infection and death.
As a proof of concept, we conducted an extensive simulation study to estimate the accuracy of the discrete trait phylogeographic inference for rates of transitions between two distinct locations. First, to evaluate the precision of such inference on a tree of 1000 leaves, we simulated a two-states model with different combinations of transition rates in 50 replicates. Parameters were correctly estimated with limited variability across the 50 replicates. The median parameters across replicates gave a very accurate estimation (Figure 2A).

Estimating variability in transition rates using simulations.
(A) Estimated versus true parameters in the simulation study of the two-states model. The two panels show the two transition rates. For each set of parameters, 50 replicates were conducted. The large red dot is the median of the replicates. The red cross is the true parameter value, on the bisector. (B) Estimated rate of transition in subsampled trees. For each replicate (n = 50), one point is the result of one subsampled smaller phylogenetic tree (from a large phylogenetic tree). The big dot shows the median for each replicate. The horizontal red line is the overall median (of the medians), across replicates. The horizontal dashed gray line is the true rate. Only one of the two rate parameters is shown. (C) Log-median error in parameter estimation as a function of the log number of replicates, when inference is conducted on truly independent replicate evolutionary histories, on a tree of 1000 leaves. The points are the data, the dashed line shows the line of slope ‘−1’ which is the expectation as the replicates are truly independent. (D) Log-median error as a function of log number of subsamples used for the inference done on subsampled phylogenetic trees. The colored points and lines show the inference done on 50 distinct realizations of the evolutionary process on the whole tree. The dashed line is the overall regression line with a slope of −0.7. We tested between 1 and 10 subsampled trees (x-axis).
Next, to evaluate how independent parameter estimates are done on randomly subsampled trees of the same larger phylogeny, we inferred parameters on 50 100-leaves trees randomly subsampled from a 10,000-leaves SARS-CoV-2 phylogenetic tree. For each resulting subtree, we conducted inferences on 50 replicates corresponding to 50 realizations of the stochastic process of evolution of the discrete character – as done in the first simulation – on the whole tree of 10,000 leaves. For each replicate, we observed some error on the estimation of the parameter, because one replicate only corresponds to one possible realization of the evolutionary process, although the overall median of inferred parameters across subsampled trees was closer to the true parameter values (Figure 2B).
Different estimations of the transition rates conducted on different subsampled trees are not expected to be fully independent because the subtrees partly share the same evolutionary history. Therefore, we estimated the level of independence of these estimations. When several estimates are perfectly independent from one another and are averaged to obtain the final estimate of the quantity of interest, we expect the error in parameter estimation to converge to 0 with a 1/N (N−1) scaling, where N is the number of replicates. This is indeed what we observed when we calculated the error on estimation of the parameter as a function of the chosen number of replicates N in the first set of simulations. Here, the replicates were truly independent replicate realizations of the evolutionary history and inference was conducted on the whole tree of 1000 leaves (Figure 2C). On the contrary, when estimates are perfectly dependent, error on the averaged parameter estimate is expected to not decrease with N. When evaluating the error on parameter estimates across subsamples of the large tree, we expected the scaling of error as a function of number of subsamples N to be intermediate between non-independence (~N0 scaling) and perfect independence (~N−1 scaling). Using the relationship between log(error) as a function of log(N), we estimated a slope of −0.7 (Figure 2D). Thus, inferences conducted on subsamples of the same phylogenetic tree are partly independent. The precise degree of independence is expected to depend on the shape of the phylogenetic tree, but the coefficient was similar when doing the same study on a randomly generated tree instead of the SARS-CoV-2 tree.
We finally conducted another round of simulations to evaluate the error on what we considered as exchange between multiple locations when using sparse subsampling. For that, a 1,000,000-leaves tree was simulated with a five-states discrete trait representing geographical units. Then, 100 subsampled 1000-leaves trees from the whole phylogenetic tree were produced and the ancestry for the discrete trait was reconstructed from the leaf data only. We estimated the number of transitions (exchanges) of each type and compared them with the one obtained from the main tree, finding a mean error rate of 2.7% over the 100 subsamples (Figure 2—figure supplement 1).
Altogether, these simulations suggested that using subsamples of 1000 sequences from a large dataset and performing partially independent replicates seems to be sufficient to accurately estimate transition events.
Description of the datasets and global diversity of SARS-CoV-2 sequences
We defined 100 subsamples of sequences proportionally to COVID-19 deaths across geographic locations and time for different geographic scales (worldwide, Europe, and French regions) and time periods (from January 1 to July 25, 2020, and from July 26 to December 31, 2020, respectively, covering the first and second European epidemic waves). We chose the sampling intensity guided by the weekly number of SARS-CoV-2-related deaths reported by public health organizations. Here, the number of SARS-CoV-2-related deaths was used rather than the number of detected cases because the latter was biased due to variable ascertainment rates across countries and time. For example, the larger number of PCR tests conducted in the second epidemic wave could wrongly suggest that the virus circulated much more during the second half of 2020 (Figure 1—figure supplement 2).
For each geographic scale and time period, there was a positive correlation between the weekly number of SARS-CoV-2-related deaths and the weekly number of sequences we included for a subsample (Spearman’s rank correlation, p < 0.001; r = 0.94 for the lowest correlation). We also confirmed that the number of sequences per territory was, on average, properly temporally distributed within each time period (Figure 2—figure supplement 2).
Some countries and French administrative regions were however discarded in the analyses because they were not sufficiently represented in the GISAID database. Overall, a total of 39,288 and 39,755 distinct SARS-CoV-2 sequences were included across the 100 sampled phylogenies for the worldwide dataset, respectively, for the first and the second time periods (Table 1). At the European scale, 26,757 and 27,658 different SARS-CoV-2 sequences covering 11 countries were analyzed across the 100 subsamples (Table 1). Focusing on French administrative regions, sequences available on the GISAID database were very sparse. The Provence-Alpes-Côte d’Azur (PACA, Marseille area) was the only region that highly sequenced SARS-CoV-2 in 2020. Île-de-France (IDF, Paris area), Auvergne-Rhône-Alpes (ARA, Lyon area), Occitanie (OCC, Toulouse and Montpellier area), and Bretagne (BRE, Rennes area) have sequenced much less than PACA, but provided sufficient data to investigate SARS-CoV-2 geographic exchange events in France. The remaining French administrative regions were discarded since too few sequences were available to properly match the number of weekly SARS-CoV-2 deaths (Figure 2—figure supplement 3). We thus considered 2543 unique sequences across the 100 subsamples between January 1 and July 25, 2020, and 3124 unique sequences between July 26 and December 31, 2020 (Table 1).
Number of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences investigated for each dataset.
Dataset | Geographies | Period investigated | Average number of sequences sampled in a subsample | Total number of sequences |
---|---|---|---|---|
World | Africa, Asia, Europe, France, North America, Oceania, South America | January 1 to July 25, 2020 | 846 | 39,288 |
July 26 to December 31, 2020 | 777 | 39,755 | ||
Europe | Belgium, France, Germany, Italy, The Netherlands, Poland, Romania, Russia, Spain, Sweden, United Kingdom | January 1 to July 25, 2020 | 904 | 26,757 |
July 26 to December 31, 2020 | 872 | 27,658 | ||
France | Auvergne-Rhône-Alpes (ARA), Bretagne (BRE), Île-de-France (IDF), Occitanie (OCC), Provence-Alpes-Côte d’Azur (PACA) | January 1 to July 25, 2020 | 416 | 2543 |
July 26 to December 31, 2020 | 433 | 3124 |
The genomic diversity of circulating SARS-CoV-2 in the different continents, countries, and French regions was found to be similar (Figure 2—figure supplement 4). Overall, genomes showed high sequence conservation compared to the Wuhan-Hu-1 reference in 2020 (mean and median of ~13 single nucleotide polymorphisms (SNPs) with 95% of the distribution comprised between 4 and 25 SNPs).
Which continents exchanged SARS-CoV-2 with Europe and France?
Through 100 distinct, dated and ancestrally reconstructed phylogenetic trees, we first studied SARS-CoV-2 exchanges worldwide for each of the time periods studied. Between January 1 and July 25, 2020 (covering the first European epidemic wave), we found that Europe (excluding France) accounted for 57.3% of the total number of exportation events, and was the main source of SARS-CoV-2 exportations toward the other continents in all of the subsamples (Figure 3A–D and Figure 3—figure supplement 1). North America also highly participated in virus exportation during this period time (24.3%). South America and Asia were each associated with 7.1% of the total number of exportation events, consistent with a later circulation of the virus in these continents (Figure 1—figure supplement 1). France was estimated to have contributed 4.2% of the total exportation events, indicating that France was not the major European source of SARS-CoV-2 at the international level between January 1 and July 25, 2020. The exportation events from France were mostly headed toward Europe and, to a lesser extent, to North America, South America, and Asia (Figure 3B and Figure 3—figure supplement 2). These events mostly consisted of the B.1 (80.2%), B.1.1 (6.1%), and B.1.356 (3.5%) lineages (Figure 3E). North America received a large proportion of SARS-CoV-2 from other continents (28.5% of the introduction events), followed by South America (23.1%) and Europe (17.3%) (Figure 3A–D). An average of 11.7% of all SARS-CoV-2 introductions were into France, and originated from North America (50.7%) and Europe (45.7%) (Figure 3—figure supplement 2). These introductions consisted of the B.1 (71.9%) and B.1.1 (18.2%) lineages (Figure 3E). The first introductions into France were detected at the beginning of February, and progressively increased to reach a peak just before the nationwide lockdown from March 17, 2020 (Figure 3D). Only South America and Asia were associated with a continuous increase in SARS-CoV-2 introductions after this date, probably because no such drastic measures were generalized there and the circulation of the virus remained limited in these regions.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exchange worldwide.
Exchange events were inferred with 100 subsampled phylogenies between January 1 and July 25, 2020, and between July 26 and December 31, 2020. (A) Number of introduction and exportation events for each subsample and for each continent and France. (B) SARS-CoV-2 exchange flows between continents and France during the two time periods investigated. In these plots, migration flow out of a particular location starts close to the outer ring and ends with an arrowhead at the destination location. Arrow width is proportional to the exchange strength. (C) Number of exportation and (D) introduction events per territory over time. The mean number of exchanges over the subsamples and for each week was calculated. Gray bars indicate the dates of the complete lockdowns in France. (E) Proportion of pango lineages exported from France and introduced into France. Lineages with a proportion <3% were grouped into the ‘other’ clade.
From July 26 to December 31, 2020 (second European epidemic wave), we observed 1.4 times fewer exchange events worldwide compared to the first half of 2020. Importantly, we showed the importance of analyzing several subsamples, as there was a large variation in the total number of exportation or introduction events, especially in Europe (Figure 3A). Europe was, as between January 1 and July 25, 2020, the main source of exchanges with a total of 49.9% of the exportation events across subsamples, followed by North America (18.0%), Asia (14.1%), and South America (9.5%) (Figure 3A–D and Figure 3—figure supplement 1). Most of the events occurred during the summer period (June to August 2020), corresponding to the summer holidays in most countries of the world. France accounted for 8.3% of the exportation events, but they were almost exclusively oriented toward other European countries (89.9%) and overall detected from August to November 2020 (Figure 3B and Figure 3—figure supplement 2), consistent with the SARS-CoV-2 incidence in this period in France (Figure 1—figure supplement 2). The B.1.160 lineage accounted for almost all the exportation events from France (97.9%) (Figure 3E). In a similar fashion, SARS-CoV-2 introductions into France mostly originated from Europe (81.9%) (Figure 3B) and were detected at a low rate from April 2020, then at a higher but always limited rate from June 2020, and at a strong level in September and October 2020 (Figure 3—figure supplement 2). These SARS-CoV-2 introductions into France consisted in majority of B.1.177 (28.0%), B.1.160 (24.7%), B.1 (17.7%), B.1.1 (10.0%), and B.1.258 (7.6%) lineages (Figure 3E).
How did the virus spread in Europe?
We then aimed to get a more comprehensive view of SARS-CoV-2 exchanges between France and other European countries with the same approach. Here, we only focused on European countries associated with a high incidence and without under-sampling due to a lack of data on GISAID (Table 1).
By calculating the count of introduction and exportation events between January 1 and July 25, 2020 across the subsamples, we observed that Italy was the major contributor to virus exportation toward other European countries, with an average of 41.5% of the total number of exportation events. The United Kingdom, France, and Spain also highly participated in virus exportation, and consisted of 21.6, 18.1, and 11.8% of the total number of exportation events, respectively (Figure 4A–D and Figure 4—figure supplement 1). These observations are in line with epidemiological data, since Italy was the first country in Europe to be heavily affected by the pandemic; and France, the United Kingdom, and Spain were the three other European countries associated with the highest number of SARS-CoV-2-related deaths during the first wave (Figure 1—figure supplement 1). The number of all exportation events however decreased after the implementation of lockdowns in the different countries (with the first one occurring in Italy on March 9, 2020) (Figure 4C). France mostly exported SARS-CoV-2 toward Belgium (25.5%), Germany (21.0%), and the United Kingdom (20.4%), and a little less toward Spain (10.1%) and the Netherlands (9.2%). All of these events occurred before the official lockdown in France (March 17, 2020) (Figure 4B and Figure 4—figure supplement 2), and consisted of the B.1 (78.5%), B.1.1 (8.8%), and B.1.356 (6.9%) lineages (Figure 4E). The rate of SARS-CoV-2 exportations from France then decreased until the second European epidemic wave, as it was also the case for other European countries except Russia (Figure 4C). For all introduction events, the proportions were more balanced: the United Kingdom accounted for a quarter of the total number of events, while Russia, Belgium, Germany, Italy, Spain, France, and the Netherlands represented between 6.5 and 11.9% of the total number of events (Figure 4D). In France, a high rate of introduction events was observed in February and March before the lockdown and originated mostly from Italy (44.3%), the United Kingdom (30.8%), and Spain (14.6%) (Figure 4—figure supplement 2). These introductions consisted in majority of the B.1.1 (47.9%) and B.1 (35.4%) lineages (Figure 4E).

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exchanges on the European scale.
Transmission events were calculated by averaging the results from 100 subsampled phylogenies between January 1 and July 25, 2020, and between July 26 and December 31, 2020. (A) Number of introduction and exportation events for each subsample and for each European country. (B) SARS-CoV-2 exchange flows between European countries during the two time periods investigated. In these plots, migration flow out of a particular location starts close to the outer ring and ends with an arrowhead at destination location. Arrow width is proportional to the exchange strength. (C) Number of exportation and (D) introduction events per territory over time. The mean number of exchanges over subsamples and for each week was calculated. Gray bars indicate the dates of the complete lockdowns in France. (E) Proportion of pango lineages exported from France and introduced into France. Lineages with a proportion <3% were grouped into the ‘other’ clade.
The second time period (from July 26 to December 31, 2020) showed a different pattern of exchanges. Here, we estimated 1.3 times fewer exchanges compared to the first half of 2020. Russia accounted for most of the exportation events (27.6%) (Figure 4A, B). These events were estimated to occur during the spring (after the relaxation of containment measures in most European countries) and the summer periods (Figure 4C). This result was expected since Russia was almost the sole European country to report a high number of SARS-CoV-2-related deaths during this period (Figure 1—figure supplement 1). Spain (16.9%), France (14.0%), Germany (10.2%), Italy (7.9%), Poland (7.5%), and the United Kingdom (6.6%) also highly participated in virus exportation (Figure 4A–D and Figure 4—figure supplement 1). Most of these events were detected between August and October 2020 (Figure 4C), and strongly decreased just before the second lockdown in most European countries (the first one occurring in Spain on October 9, 2020). Again, these observations are consistent with epidemiological reports, as Spain was the first country in European Union to be associated with a sharp increase of SARS-CoV-2-related deaths, rapidly followed by France. France mostly exported the virus toward Italy (22.2%), Germany (21.3%), and Belgium (19.1%) (Figure 4B and Figure 4—figure supplement 2), and mostly the B.1.160 lineage (79.2%) (Figure 4E). Focusing on introduction events, Germany accounted for 23.8% of the total number of introductions, followed by the United Kingdom (15.3%), Italy (15.1%), and France (9.8%). For the remaining European countries, the proportion of introduction events was comprised between 4.5 and 7.6%, except for Russia that accounted for <0.1% of the events. Introductions into France mostly originated from Russia (31.7%) during the summer period, and later from Spain (23.7%), Italy (12.5%), and Germany (11.0%) (Figure 4—figure supplement 2). These SARS-CoV-2 introductions into France consisted in majority of B.1.177 (31.0%), B.1.1 (22.6%) and B.1.258 (7.6%) lineages (Figure 4E).
The first and second French epidemic waves: two patterns of SARS-CoV-2 transmissions within France
We finally conducted an analysis of virus spread inside of France by studying exchanges between French administrative regions. As previously explained, only five regions (ARA, BRE, IDF, OCC, and PACA, which cover the most densely populated urban areas in France, that is, Paris, Lyon, Montpellier, Toulouse, and Marseille) were investigated since only a few, if any, sequences were available on GISAID for the other regions (Figure 2—figure supplement 3). Of note, the Grand-Est region was one of the first massively affected by SARS-CoV-2 in France during the first wave and, thus, must have played a major role. The fact that it was not sampled is a limitation that must be taken into account when reading these results.
During the first French epidemic wave, almost all exportation events originated from IDF (84.0%). ARA, BRE, OCC, and PACA only accounted for 8.2, 0.4, 2.3, and 5.1% of the total number of exportation events (Figure 5A, B and Figure 5—figure supplement 1). As for the wider geographic scales, most exportation events occurred between February and March, and largely decreased after the national lockdown (Figure 5C and Figure 5—figure supplement 2). ARA was the region having most SARS-CoV-2 introductions, with 48.3% of the total number of introduction events, followed by PACA (22.1%), OCC (17.3%), and BRE (7.0%) (Figure 5C). All the regions were mostly affected by the B.1 lineage at similar rates, except for OCC with more B.1.1 lineage (Figure 5E).

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exchanges in France.
Exchange events were calculated through 100 subsampled phylogenies between January 1 and July 25, 2020, and between July 26 and December 31, 2020. (A) Number of introduction and exportation events for each subsample and for each French administrative region. (B) SARS-CoV-2 exchange flows between French administrative regions during the two time periods investigated. Migration flow out correspond to the mean of the French sets of sequences. The number of exportation and introduction events per French region over time are detailed (C) between January 1 and July 25, 2020, and (D) between July 26 and December 31, 2020. The proportion of pango lineages exported or introduced are given for each French region during (E) the first and (F) the second time period investigated. Lineages with a proportion <3% were grouped into the ‘other’ clade.
During the second wave, IDF was no longer the single epicenter of SARS-CoV-2 inter-region exchanges. Both IDF and ARA participated concomitantly in virus spread at similarly large levels (43.3 and 42.6% of the total number of exportation events, respectively) (Figure 5A, B and Figure 5—figure supplement 1). The first exportation events mostly occurred in the spring and summer in IDF. ARA was associated with a high increase in exportation events by the end of July, consistent with the resumption of the epidemic. Again, the second national lockdown – implemented in October 29, 2020 – was associated with a strong decline of the number of exchange events (Figure 5D). Regarding introduction events, PACA received 31.3% of the total number of introductions, followed by ARA (24.7%), IDF (24.1%), OCC (11.5%), and BRE (8.2%). Overall, ARA, BRE, OCC, and PACA mostly sent and received the B.1.160 lineage. Only IDF showed a different pattern, exporting at similar rates both the B.1.160 and the B.1.177 lineages (Figure 5F).
Altogether, SARS-CoV-2 exchange events within France showed two distinct patterns between the first and second epidemic waves of 2020. While most SARS-CoV-2 exchanges involved the Paris area during the first half of 2020, the situation was much more balanced after the summer 2020 with the Lyon area (ARA) also contributing to many exchanges. Viral spread across French regions in summer 2020 could be explained by travel associated with the summer holidays.
Discussion
Prior to this analysis, the dynamics of SARS-CoV-2 transmissions within France or between France and other countries (in Europe or worldwide) has been only partially characterized. To our knowledge, three studies have explored the virus spread in France, but only focused on the first French epidemic wave and/or could not address SARS-CoV-2 introduction events from other continents or other European countries (Danesh et al., 2020; Elie and Alizon, 2020; Gámbaro et al., 2020). Other studies have focused on larger scales such as Europe as a whole, but did not explore in-depth the dynamics of exchanges especially in France (Hodcroft et al., 2021; Lemey et al., 2021; Nadeau et al., 2021). Here, we studied introduction and exportation events in France at different scales during the first year of the pandemic, covering the two first European epidemic waves. Our approach provides a way to analyze large amounts of genomic data for future epidemics and improve our understanding of viral spread in France and Europe.
Methodologically, most phylodynamics studies on SARS-CoV-2, as well as on other viruses or pathogens, are based on Bayesian approaches (notably with the popular and robust BEAST tool; Drummond and Rambaut, 2007) using one or more phylogenetic trees composed of several thousand sequences. To reconstruct the evolution parameters and ancestral states at nodes (here, the geographical area and the SARS-CoV-2 lineage), Bayesian methods require high computational power, representing their main limitation (Drummond and Rambaut, 2007). Overall, the generation of a dated phylogenetic tree and the ancestral state reconstruction for even a few thousands of sequences is no small feat, requiring a very long computational time, and sometimes parameter estimation fails to converge. Furthermore, conclusions obtained would be based on a single sample with a relatively limited number of sequences, but different samples may lead to different results (Hall et al., 2016). As stated by Dellicour et al., Bayesian phylogenetic analysis on about 2800 SARS-CoV-2 sequences required over 150 hr to obtain enough samples from the posterior distribution across 15 parallel runs with a GPU accelerated implementation (Dellicour et al., 2021). Since the GISAID database stores several hundred thousand SARS-CoV-2 sequences (n = 638,704 on May 8, 2022 for the year 2020 alone), we believe that using a phylogeny of a few thousand sequences only might not be the best way to accurately infer exchange events. In order to obtain more comprehensive trends drawing conclusions from a larger number of SARS-CoV-2 sequences, we constructed 100 semi-independent maximum likelihood phylogenies of relatively limited size (from 418 to 904 sequences depending on the geographic scale and time period investigated) and inferred ancestral states by a maximum likelihood discrete trait phylogeographic method. Bayesian approach seems not to be more accurate than maximum likelihood for the reconstruction of ancestral states (Hanson-Smith et al., 2010), and we gain a lot in terms of sampling, computational time, and trends by sampling 100 semi-independent subsamples. The maximum likelihood-based method was recently used to depict with accuracy the genomic epidemiology of the virus in Canada (McLaughlin et al., 2022), as well as in other contexts (Cunningham et al., 1998; Pupko et al., 2000).
We characterized SARS-CoV-2 spatial spread within France, and between France and European and all other countries in 2020. Between January 1 and July 25, 2020 (first European epidemic wave), France mostly exchanged SARS-CoV-2 with Europe and North America. The observation contrasts with the second European epidemic wave – that took place between late October and the end of 2020 – where France almost exclusively exchanged the virus with other European countries. This result is logical insofar as the borders of France have gradually reopened with European countries but remained mostly closed to the rest of the world. A very low rate of SARS-CoV-2 exchanges between Europe (including France) and Asia or South America was found in 2020, which is not surprising since the virus started to spread in these continents only from May 2020 where all borders of Europe were already closed. On the European scale, France was a major source of SARS-CoV-2 exchange. It ranked third regarding exportation events (mainly directed toward neighboring countries) only behind the United Kingdom and Italy between January 1 and July 25, 2020, and third again behind Spain and Russia (where exchange events happened mostly in summer 2020) between July 26 and December 31, 2020. The first cases introduced from European countries into France came from Italy and Spain, consistent with the early days of SARS-CoV-2 spread in Europe. These results are in line with the observations made across Europe through Bayesian inference (Lemey et al., 2021). In line with the lack of SARS-CoV-2 genomic diversity at the beginning of the pandemic, both B.1 and B.1.1 were the most representative lineages introduced into or exported from France. During the second European epidemic wave, some cases were introduced into France (and some other European countries) from Russia where virus incidence was high since the summer 2020. Then, most introductions into France originated from Spain which initiated the second wave in European Union, followed by Italy and Germany. Most of the introductions into France were the lineages B.1.1, B.1.177, and B.1.258 that highly circulated in Europe. By contrast, France mostly exported the B.1.160 lineage which was firstly detected in July 2020 in Marseille (PACA, France), and represented the majority of sequences in fall 2020. The frequency of the B.1.160 lineage then decreased rapidly at the beginning of 2021 in favor of B.1.1.7. Interestingly, the number of international exchanges was lower during the second wave compared to the first one, indicating the effectiveness of border control measures and that French people reduced travel outside France after the first national lockdown.
Our results on the French scale should be interpreted considering several limitations. In addition to a limited overall size, genomic data only covered 5 out of 13 administrative regions (Figure 2—figure supplement 3). The North and East of France (i.e., regions Hauts-de-France and Grand-Est, respectively) that border Belgium, Luxembourg, and Germany, were not investigated. More importantly, the region Grand-Est was associated with early large clusters that participated in virus spread across the country during the first European epidemic wave, such as the evangelical gathering of the Christian Open-Door Church (Mulhouse, Haut-Rhin) that took place from 17 to February 21, 2020 and consisted of approximately a thousand infected individuals. Keeping these limitations in mind, we observed two distinct patterns of SARS-CoV-2 exchanges within France in 2020. During the first wave, most of the exportations originated from IDF (Paris area). This may be linked with the larger number of individuals living in IDF who left to other French regions at the beginning of the epidemic. In the second epidemic wave, we observed a similar contribution of IDF to exchanges, but this time accompanied with ARA (Lyon area) which was associated with a strong SARS-CoV-2 incidence. The circulation of lineages was the same between the French regions, both during the first and second epidemic waves. The B.1 was the major circulating lineage during the first wave, while B.1.160 was mostly transmitted during the second one, concomitantly with B.1.177 in IDF.
To conclude, using an original phylodynamic approach to investigate a large SARS-CoV-2 sequence dataset, we characterized SARS-CoV-2 spatial spread within France, across Europe and in the world in 2020 and unraveled distinct patterns of exchanges between the two first epidemic waves.
Materials and methods
Sequence acquisition and curation, and dataset production
Request a detailed protocolA phylodynamics study (which aims to depict with accuracy exchange events between locations) requires a large number of sequences (Attwood et al., 2022). SARS-CoV-2 genome sequences were downloaded from the GISAID database on May 8, 2022 (n = 10,005,024). The following inclusion and exclusion criteria were applied to select samples for analysis: (1) only complete viral genomes from infected individuals (more than 29 kb in length) were included; (2) the maximal proportion of undetermined nucleotide bases was fixed to 5%; (3) sequences with unknown or incomplete sampling dates, or unknown geographical location, were excluded; (4) sampling dates were comprised between January 1, 2020 and December 31, 2020; and (5) sequences from non-human were excluded. The final dataset included 638,704 sequences. In a second pass, 26,103 genomes were considered as outliers as they were in the extreme 5% regarding the number of non-synonymous mutations.
We studied the epidemic at three different geographic scales: (1) worldwide, to identify SARS-CoV-2 exchanges between continents and France; (2) at the European scale, to determine more precisely which European countries mostly exchanged with France; and finally (3) at the French administrative region level, to get a better understanding of virus spread inside the country. For each geographic scale, the analyses were independently run for the two first European epidemic waves (January 1 to July 25, 2020 (weeks 1–30), and July 26 to December 31, 2020 (weeks 31–52); Figure 1).
Considering the large number of sequences in GISAID that passed our criteria (n = 638,704), we produced 100 subsamples of sequences for each investigated geographic scale analysis (i.e., worldwide, Europe, and French administrative regions) and time period (January 1 to July 25, 2020 and July 26 to December 31, 2020). To determine the number of SARS-CoV-2 genome sequences to be included so as to be representative of each location epidemic, we used the number of SARS-CoV-2-related deaths per territory and per week as a proxy for virus circulation taking place 2 weeks earlier. We did not use SARS-CoV-2 detected cases as the proxy for infections because the intensity of COVID-19 testing was increased after the first wave, which probably increased the ascertainment probability (Figure 1—figure supplement 2). For each subsample, sequences were drawn randomly from our GISAID extraction. Some countries and French regions had to be discarded because of their under-representation in the GISAID database. For the European scale, we selected countries with more than 10,000 sequences. Ukraine, Czech Republic, and Hungary passed this threshold, but were excluded for having too many weeks without enough data. At the French region scale, we included only regions with less than 20% of weeks lacking data (Figure 2—figure supplement 3).
For the first time period investigated (January 1 to July 25, 2020), an average of 846, 904 and 416 SARS-CoV-2 sequences were included in each set at the worldwide, European and French scales, respectively (Table 1). When considering the second time period (July 26 to December 31, 2020), the worldwide, European and French scale sequence subsamples contained 777, 872, and 433 genomic sequences (Table 1).
Defining appropriate subsample using simulations
Request a detailed protocolBefore performing phylodynamics analysis using SARS-CoV-2 phylogenetic trees of limited sizes, we performed a series of simulations to demonstrate the accuracy of our approach. In a first simulation, we investigated the inference of transition parameters on a random phylogenetic tree of 1,000 leaves produced with the rtree function of the ape package (Paradis et al., 2004) in R (R Development Core Team, 2022). For that, we simulated a two-states model, with two different transition rates, using the ape rTraitDisc function (Paradis et al., 2004) with the ‘all rates different’ (ARD) model. From each simulation, the two transition parameters were inferred using the ace function of the ape package (which performed maximum likelihood inference of the two rates). In this simulation, we tested different combinations of the two rates.
In a second simulation analysis, we tested the accuracy of inference done on subsamples of a larger phylogenetic tree. We considered a SARS-CoV-2 phylogenetic tree of 10,000 leaves retrieved from the GISAID database, and subsampled 50 randomly chosen 100-leaves phylogenies for inference. For each resulting subtree, we conducted inferences – using the same strategy as previously explained – on 50 replicates corresponding to 50 realizations of the stochastic process of evolution of the discrete character on the whole tree of 10,000 leaves.
Then, we evaluated the level of independence between estimations done on subsampled trees, using the SARS-CoV-2 phylogenetic tree of 10,000 leaves and following the same strategy as previously detailed.
In order to evaluate the error on counting transitions between states of a discrete trait along a phylogeny when using sparse subsampling, we simulated a main tree of 1,000,000 leaves with a five-states discrete trait again using rtree and rTraitDisc (ARD model) from the R ape package (Paradis et al., 2004). We then made 100 random subsampled 1000-leaves trees and reconstructed the ancestry for the discrete trait from the leaf data only using the ape function ace (once again ARD model). The mean error rate was obtained by calculating the number of transitions of each type and comparing them with the one obtained from the main tree. The non-parametric 95% confidence interval was calculated from 1000 bootstraps.
Multiple sequence alignment and maximum likelihood phylogenetic tree
Request a detailed protocolFor each subsample, genome sequences were independently aligned against the Wuhan-Hu-1 reference genome (GenBank accession: NC_045512.2) using MAFFT v7.450 then merged altogether (Katoh and Standley, 2013). The alignments were then masked for problematic sites listed by Nicola De Maio et al. (https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). All the sites present in the list were transformed as ‘N’ and thus eliminated from the subsequent analyses. From each masked alignment, we inferred a maximum likelihood phylogenetic tree using FastTree v.2.1.11 under a general time-reversible (GTR) model with a discrete gamma distribution to model inter-site rate variation and 1000 ultrafast-bootstraps to compute support values (Price et al., 2010). The choice of FastTree was mainly due to its short runtime while giving results very close to IQTree 2.1.2 (Minh et al., 2020) as described by Rob Lanfear (https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md). Each phylogenetic tree was dated using LSD2 v2.3 using Wuhan Hu-1 as outgroup for rooting (To et al., 2016), with the parameter ‘-e 2’ to be more stringent about outlier nodes. Phylogenetic trees were manipulated and viewed using treeio and ggtree R packages, respectively (Wang et al., 2020; Yu et al., 2017).
Worldwide, Europe, and French region SARS-CoV-2 phylodynamics analyses
Request a detailed protocolFor each phylogenetic tree we produced, ancestral state reconstructions of discrete geographical and SARS-CoV-2 lineage data were performed with the ace function of ape package in R (Paradis et al., 2004). Three distinct Markov models of discrete character evolution through maximum likelihood were tested and compared: the Equal Rates (ER) which assumes a single rate of transition among all possible states, the All Rates Different (ARD) which allows a distinct rate for each position transition between two states, and the Symmetrical Rates (SYM) where forward and reverse transitions share the same parameter. In this study, we only present the results obtained with the SYM model because both the ARD and ER models led to nearly identical observations compared to the SYM model. Of note, the use of ARD required extensive calculation times to infer maximum likelihood transition rates.
From the ancestral state reconstruction, we assigned nodes to a location when supported by ≥50% of the state likelihood. To detect and count each exchange event, we compared the geographic state of a node with the state of its two children. If the nodes corresponded to different geographic locations, we counted an exchange event from the parent node state to the given child node state. The midpoint of the parent–child branch was chosen as the date of exchange in all cases. The same strategy was applied to investigate the proportions of introduced or exported pango lineages. Exchange events across all the subsamples were then weekly averaged.
Data availability
All genome sequences and associated metadata in the dataset are published in GISAID's EpiCoV database. To view the contributors of each individual sequence with details such as accession number, virus name, collection date, originating lab and submitting lab and the list of authors, visit: https://doi.org/10.55876/gis8.230120zd. All the scripts developed for this study were deposited in the following GitHub repository: https://github.com/Rcoppee/PhyloCoV (copy archived at swh:1:rev:9a984b6414e8c2377694ce922eb31874cdaddf1a).
References
-
Reconstructing ancestral character states: a critical reappraisalTrends in Ecology & Evolution 13:361–366.https://doi.org/10.1016/s0169-5347(98)01382-2
-
A phylodynamic workflow to rapidly gain insights into the dispersal history and dynamics of SARS-cov-2 lineagesMolecular Biology and Evolution 38:1608–1613.https://doi.org/10.1093/molbev/msaa284
-
SARS-cov-2 was already spreading in France in late December 2019International Journal of Antimicrobial Agents 55:106006.https://doi.org/10.1016/j.ijantimicag.2020.106006
-
Beast: Bayesian evolutionary analysis by sampling treesBMC Evolutionary Biology 7:214.https://doi.org/10.1186/1471-2148-7-214
-
Sars-cov-2 genomic and phylodynamic analysesRevue Francophone Des Laboratoires 2020:57–62.https://doi.org/10.1016/S1773-035X(20)30314-2
-
Robustness of ancestral sequence reconstruction to phylogenetic uncertaintyMolecular Biology and Evolution 27:1988–1999.https://doi.org/10.1093/molbev/msq081
-
A stochastic agent-based model of the SARS-cov-2 epidemic in FranceNature Medicine 26:1417–1421.https://doi.org/10.1038/s41591-020-1001-6
-
MAFFT multiple sequence alignment software version 7: improvements in performance and usabilityMolecular Biology and Evolution 30:772–780.https://doi.org/10.1093/molbev/mst010
-
GISAID’s role in pandemic responseChina CDC Weekly 3:1049–1051.https://doi.org/10.46234/ccdcw2021.255
-
Corrigendum to: IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic eraMolecular Biology and Evolution 37:1530–1534.https://doi.org/10.1093/molbev/msaa131
-
Impact of COVID-19 on the social, economic, environmental and energy domains: lessons learnt from a global pandemicSustainable Production and Consumption 26:343–359.https://doi.org/10.1016/j.spc.2020.10.016
-
A fast algorithm for joint reconstruction of ancestral amino acid sequencesMolecular Biology and Evolution 17:890–896.https://doi.org/10.1093/oxfordjournals.molbev.a026369
-
SoftwareR: A language and environment for statistical computing, version 2.6.2R Foundation for Statistical Computing, Vienna, Austria.
-
Phylodynamic applications in 21st century global infectious disease researchGlobal Health Research and Policy 2:13.https://doi.org/10.1186/s41256-017-0034-y
-
Fast dating using least-squares criteria and algorithmsSystematic Biology 65:82–97.https://doi.org/10.1093/sysbio/syv068
-
Viral phylodynamicsPLOS Computational Biology 9:e1002947.https://doi.org/10.1371/journal.pcbi.1002947
-
Treeio: an R package for phylogenetic tree input and output with richly annotated and associated dataMolecular Biology and Evolution 37:599–603.https://doi.org/10.1093/molbev/msz240
-
Ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated dataMethods in Ecology and Evolution 8:28–36.https://doi.org/10.1111/2041-210X.12628
-
A novel coronavirus from patients with pneumonia in China, 2019New England Journal of Medicine 382:727–733.https://doi.org/10.1056/NEJMoa2001017
Decision letter
-
Caroline ColijnReviewing Editor; Simon Fraser University, Canada
-
Eduardo L FrancoSenior Editor; McGill University, Canada
-
Christopher JR IllingworthReviewer; MRC University of Glasgow Centre for Virus Research, United Kingdom
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Phylodynamics of SARS-CoV-2 transmissions in France, Europe and the world during 2020" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Christopher JR Illingworth (Reviewer #1); Angela McLaughlin (Reviewer #3).
As is customary in eLife, the reviewers have discussed their critiques with one another. What follows below is the Reviewing Editor's edited compilation of the essential and ancillary points provided by reviewers in their critiques and in their interaction post-review. Please submit a revised version that addresses these concerns directly. Although we expect that you will address these comments in your response letter, we also need to see the corresponding revision clearly marked in the text of the manuscript. Some of the reviewers' comments may seem to be simple queries or challenges that do not prompt revisions to the text. Please keep in mind, however, that readers may have the same perspective as the reviewers. Therefore, it is essential that you attempt to amend or expand the text to clarify the narrative accordingly.
Essential revisions:
Both reviewers and the Editor think that this is an important study that presents interesting findings about the spread of SARS-CoV-2 in France and in Europe, and are positive about the work's eventual publication. However, the reviewers brought up important considerations, including methodological concerns, that need to be addressed in a revised version. Following the consultation with the editor and reviewers, we would welcome a revised version that addresses the substantial methodological comments. In summary:
(1) Sampling: there are two questions. One is about the number of sequences and the number of deaths (both reviewers commented on this), as opposed to the number of infections; whether the correlation (sequences to deaths) is an artefact of the method used (Rev 1) and whether it validates the analysis (Rev2). In addition, since deaths and infections can be reported differently in different places, what is the impact of having different sampling fractions in different geographies? this could be addressed by looking at the robustness of the inferences to different choices of relative numbers of sequences from different regions
(2) Downsampling to < 1000 sequences, bootstrapping and adding: We were all concerned about the small individual datasets and the assumption that adding the results would generate the correct numbers of viral movements. After all, why this number of replicates – why not add after 5000 or 800 or some other number of replicates? Also, there are some instances where the same introduction event would be detected in many replicates, and somewhere it would be missed in most of them, so it's not clear that it's additive. Other authors have averaged, in similar analyses (but since each replicate might miss some events, averaging might not be). Given that this analytical method has the advantage that much larger numbers of sequences are tractable to analyze, the authors should analyze substantially larger datasets.
(3) The fast clock assumption has the potential to bias the analysis; a number of other related methodological issues were raised by reviewers. These should be addressed.
(4) Quantitative descriptions of the results and their variability should be included in the results (for example, the importation rate). This will aid in articulating the public health interpretation and relevance, which reviewers also note would add interest and weight to the paper.
Reviewer #1 (Recommendations for the authors):
The method is described as 'original' but line 377 implies that the same method was used to study data from Canada. Please clarify what is new about the approach used here relative to previous studies.
It is stated in line 65 that existing phylogenetic tools cannot include the large datasets generated by projects such as the international COVID sequencing effort. However, researchers on the COG-UK project generated a tree of sequences comprising hundreds of thousands, if not millions of sequences (https://www.cogconsortium.uk/priority-areas/data-linkage-analysis/public-data-analysis/). It would be helpful to clarify what can and cannot be achieved with which phylogenetic methods.
The positive correlations observed between the number of deaths and the number of sequences in a dataset (Figure 2) is a reassuring check but seems to be a direct artefact of the methods, whereby the number of deaths was used to determine the number of sequences. I suggest including these plots in supplemental material.
I would value more explicit information in the Methods section about how data from the inferences performed on different bootstrapped samples was combined.
The link to the stated GitHub repository did not work at the time I tried it.
Reviewer #2 (Recommendations for the authors):
L97: Figure comments: It is recommended to add an annotation for the date delineation of the two COVID waves analyzed in the main graphic. The legend should report how reproduction rate was calculated, and this should be clearly described in the methods. Relatedly, L447-448: state exact dates instead of 'mid' and 'late' months. Applies in results as well.
L106: Although the authors say that sequences are representative of 'sars-cov-2 infections', they then describe the comparison to deaths. Did the authors also compare case incidence to deaths and separately, to sequences over time? Need to better justify the use of deaths rather than new diagnoses. It would help the reader if there was also a presentation of cases over time in addition to deaths that could be done for example in Figure 1.
L114;L124;L209;L287,L465: Regarding the discarding of sequences from geographies that were underrepresented in the sequence set – how would the results be expected to change if these geographies were not discarded? They would be perhaps less likely to be sampled as well as represented in the ancestral nodes due to low representation, but they might have indicated additional nuances to the French regional transmission patterns.
Also, in the methods, what was the threshold by which the authors decided if a region was too underrepresented to be included? L404: what percent of cases and/or deaths were associated with the administrative regions that were excluded due to low or no sequences in each wave?
L115: Need to clarify how many sequences per phylogeny in the results. Were sequences sampled by geography randomly or proportionally to the relative number of deaths over time? Were the sampled sequences distributed uniformly over time or proportionally to deaths over time? Were sequences sampled entirely at random?
L133: Were sequences masked for problematic sites, i.e. due to homoplasy and likely sequencing errors, as identified by Nicola de Maio and group, prior to the phylogenetic inference? This is important and standard in these analyses; if the authors performed masking of problematic sites it is important to mention in the methods. If masking of problematic cites was not performed then this should be done and the analyses repeated during revision.
L156: why is the analysis delineated into periods in mid-July when the authors state here that the second European wave started in October? The delineation of epidemic waves and studied periods needs to be clarified and justified.
L161-165; L237-240; 374-376: the positive correlation between intra-territory transmission events and the estimated number of deaths does not convincingly validate the method, because this is what would be expected even in the presence of remaining bias. Why not total transmission in relation to deaths? Lastly, the authors have shown deaths are also associated with sequences, therefore this does not distinguish model accuracy from reflecting the relative inclusion of sequences.
L378: McLaughlin et al. 2022 used maximum likelihood ancestral reconstruction, but did not sum across replicates; rather, they averaged across replicates. Also, it would be good to include a couple other references as well that a similar approach, as well as the origins of the method algorithms (Pupko et al., 2000, Cunningham et al., 1998)
There are multiple instances in the Results section where relative flows are discussed for different geographies or over time, but only in a qualitative rather than qualitative manner. Please quantify rate estimates and/or relative changes/differences (as the authors did on L296 and L321) in other places. L186: 'at similar rates'. L192: 'quite similar'. L225: 'sequentially – and drastically – increased…'. L233: 'drastic'. L245: 'drastic'. L266: 'steep increase'. L268: 'sharp increase'. L294 'much more'. L308: 'a bit lower'.
L279: it is speculative that government measures had varying success rates based on the analyses conducted and presented in this manuscript.
L275; and generally: The manuscript would benefit from more discussion or stratification of which Pango lineages or Nextstrain clades dominated particular transmission types and routes in the two waves.
L299: Normally, Spearman rank correlation can be appropriately performed when there are 10 or more observations, here, as written the correlation has been performed with only three measurements – please clarify or rectify this.
L430; Supplementary: The GISAID identifiers should be a separate appendix and the authors need to follow the GISAID recommendations for inclusion of a table acknowledging labs for each sequence.
L443-446: What is missed by not conducting all three analyses together? How much international origin among a given region in France, for example?
L470: The inclusion of earlier diversity from the first wave in the trees for the second wave is laudable, however, how did the authors ignore transmission events involving those context sequences when analyzing the second wave in terms of intra- vs inter-territory transmission? Precisely how this was done needs to be clarified in the manuscript.
L480: How were trees rooted? This is an important omission and should be clearly specified in methods.
L483: how could the use of a strict clock rate have impacted the results? Employing model selection and describing in methods to justify would be valuable. If the authors have already performed model selection then clear description in the results is warranted.
L486: in regards to Figure 6-Figure supp 1 on estimating versus fixing the clock rate, the authors should move some of the text in the legend into the main results to better explain why the authors fixed a strict clock rate. Also, the reader may wonder to what extent the issue encountered with early node dating and long branches might have been related to specifying a strict instead of relaxed clock. Thus, please elaborate on this in the results.
Generally in the figure supplement legends, any interpretations of the plots should be moved into the Results section.
It would be informative to contextualize the authors' results within the literature available to date by having a more thorough comparison to other papers on the genomic epidemiology of SARS-CoV-2 in Europe in the Background and/or Discussion; for example, Worobey 2021, Hodcroft 2021, and Huisman et al. 2022.
https://doi.org/10.7554/eLife.82538.sa1Author response
Essential revisions:
Both reviewers and the Editor think that this is an important study that presents interesting findings about the spread of SARS-CoV-2 in France and in Europe, and are positive about the work's eventual publication. However, the reviewers brought up important considerations, including methodological concerns, that need to be addressed in a revised version. Following the consultation with the editor and reviewers, we would welcome a revised version that addresses the substantial methodological comments. In summary:
(1) Sampling: there are two questions. One is about the number of sequences and the number of deaths (both reviewers commented on this), as opposed to the number of infections; whether the correlation (sequences to deaths) is an artefact of the method used (Rev 1) and whether it validates the analysis (Rev2). In addition, since deaths and infections can be reported differently in different places, what is the impact of having different sampling fractions in different geographies? this could be addressed by looking at the robustness of the inferences to different choices of relative numbers of sequences from different regions
We thank the reviewers and editors for highlighting these various issues. First of all, the choice of sampling the number of sequences per week based on the number of deaths seems to us much more relevant than based on the number of infections. Indeed, confirming a death related to COVID-19 is globally robust and reported by health care centers, whatever the time period investigated. In sharp contrast, studying the number of infections completely biases any comparison between the first and second waves, since the availability of detection tools (RT-qPCR and antigenic tests) varied widely by country and throughout the year. PCR tests were almost not used outside of hospitals during the first wave in Europe, whereas their use has increased significantly in the second part of 2020. Therefore, we necessarily find many more infections than deaths during the second wave. Here we made this observation in France as an example (other countries were investigated and presented in the new Figure 1—figure supplement 2): for the first wave, about 30,000 deaths were recorded, and 34,000 during the second wave, suggesting that the virus had overall a similar impact on the French population, which we think is expected because no vaccine was available. Focusing on infections, there were about 215,000 confirmed infections in the first wave, and ten-fold more (about 2.5 million cases) during the second wave (based on COVID-19 data explorer from https://ourworldindata.org/). It is highly unlikely that the virus circulated ten-fold more in the population, since containment measures were still quite high as indicated by the stringency score index shown in Figure 1 (replacing the virus reproduction rate in the previous version). This difference cannot be explained by a change in virulence of the circulating strains, since the α variant started to increase in frequency only at the very end of 2020. Therefore, the drastic ratio of death to infection change is easily explained by the increase of testing during the year 2020. Since PCR tests were used in a large majority of countries from the summer of 2020 onwards, we thus believe that sampling the number of sequences on the basis of COVID-19-related deaths is a more accurate proxy than sampling from the number of confirmed infections. Since information is now stated at lines 165-171 and 458-464.
Concerning the correlation between the number of deaths and the number of sequences included for a subsample for a given time period of time and geographical scale, we aimed to show that our selection algorithm performs a consistent sampling with the epidemiological data, so as not to over- or under-represent a locality at a specific date. We however agree that the correlation of intra-regional transmission and the number of deaths is a very tautological argument (since our baseline sampling respects these spatio-temporal constraints), and is therefore not a direct validation of the method. We removed this element from the revised manuscript.
(2) Downsampling to < 1000 sequences, bootstrapping and adding: We were all concerned about the small individual datasets and the assumption that adding the results would generate the correct numbers of viral movements. After all, why this number of replicates – why not add after 5000 or 800 or some other number of replicates? Also, there are some instances where the same introduction event would be detected in many replicates, and somewhere it would be missed in most of them, so it's not clear that it's additive. Other authors have averaged, in similar analyses (but since each replicate might miss some events, averaging might not be). Given that this analytical method has the advantage that much larger numbers of sequences are tractable to analyze, the authors should analyze substantially larger datasets.
The low sampling and the production of some subsamples is, in our opinion, one of the strengths of our work and should be the subject of a precise justification. The use of about 1,000 sequences allows a non-negligible saving in computation time, both for phylogenetic reconstruction, temporal calibration, and the reconstruction of ancestral characters at nodes. To demonstrate the validity of multiple subsampling of the same pool of sequences, we included in the manuscript a simulation assay (lines 107-160 for the results, lines 475-502 for the method, figure 2 and figure 2—figure supplement 1). Based on our results, performing a number of subsamples greater than 100 is not useful. In our various analyses we systematically studied the number of introduction and exportation events per subsample, and calculated the mean and standard deviation over the course of the subsamples: we observed that each locality reaches an equilibrium of exchange events fairly quickly (often after 25 subsamples).
Summing exchange events by region and date across subsamples was unconventional. As recommended, we averaged the events by locality and week across all subsamples (the ratio between geographical units remains unchanged).
(3) The fast clock assumption has the potential to bias the analysis; a number of other related methodological issues were raised by reviewers. These should be addressed.
We would like to thank the editors and reviewers for this remark, as the molecular clock is a crucial feature of this phylodynamic study. Initially, our goal was primarily to correlate the number of SARS-CoV-2-related deaths with the number of intra-regional exchanges. To ensure that, we fixed a fast molecular clock, with a mutation rate about 10 times higher than expected (6 x 10-3 instead of ~7 x 10-4 substitution per site per year). After many tests, discussions and recommendations (in part with and by Dr. Angela McLaughlin, reviewer #2 of this work and first author of the genomic epidemiology of the first two waves of SARS-CoV-2 in Canada (doi: 10.7554/eLife.73896)), our initial attempt was not really justified and, as previously pointed out, the correlation between the number of intra-regional transmissions and the epidemiological data is a simple byproduct of the methodology. We agree that the choice of such a clock has the potential to strongly bias our observations. Therefore, we reassessed all our code and uncovered a simple error in our dating method. After correction of this error, the mutation rate – previously set at 6 x 10-3 substitution per site per year – is now estimated for each subsample and consistently with other studies, reaching ~7 x 10-4.
The other methodological concerns raised by the reviewers will be discussed specifically later.
(4) Quantitative descriptions of the results and their variability should be included in the results (for example, the importation rate). This will aid in articulating the public health interpretation and relevance, which reviewers also note would add interest and weight to the paper.
In the first version of our study, we mainly presented the results in a qualitative way. We agree with the editors and reviewers that a quantitative aspect is required for having a much more precise view of the exchanges between the different localities, but also of the main changes between the first and second epidemic waves. We therefore reworked all the results to put more emphasis on the quantitative aspect of the exchanges between localities. Now, we systematically present the percentages of introduction and exportation events in order to have a more precise idea of which continents/European countries/French regions exchanged the most.
Reviewer #1 (Recommendations for the authors):
The method is described as 'original' but line 377 implies that the same method was used to study data from Canada. Please clarify what is new about the approach used here relative to previous studies.
The similarity with a pre-existing study was a clear overstatement on our part. McLaughlin et al. 2022 (eLife, doi: 10.7554/eLife.73896) studied the genomic epidemiology of the first two waves of SARS-CoV-2 in Canada without resorting to Bayesian inference. The authors also used a subsampling strategy, however, the aim was to produce subsets using different proportions of Canadian sequences to identify the proportion maximizing the ability to identify domestically circulating sublineages (they ended up with 27,552 sequences, 75% of their total). The re-sampling strategy is indeed ‘original’ and we added a simulation assay to show its performance. All this information has been clarified in the revised version.
It is stated in line 65 that existing phylogenetic tools cannot include the large datasets generated by projects such as the international COVID sequencing effort. However, researchers on the COG-UK project generated a tree of sequences comprising hundreds of thousands, if not millions of sequences (https://www.cogconsortium.uk/priority-areas/data-linkage-analysis/public-data-analysis/). It would be helpful to clarify what can and cannot be achieved with which phylogenetic methods.
Building a phylogeny of several hundreds of thousands or millions of sequences is theoretically conceivable. When we look in depth at the astounding work done by COG-UK, we can see on the methods described on their ‘phylopipe’ github (https://github.com/virus-evolution/phylopipe) that they do not build a gigantic tree in one go but instead use a smart divide and conquer strategy bases on computing subtrees of known clades that they graft together at the end.
Whether using Bayesian or maximum likelihood approaches, dating such phylogenies and reconstructing ancestral characters at nodes (for example, the Pango lineages constituted of hundreds of factors) can require several weeks – if not more – of computational time. Therefore, we chose a strategy to overcome these limits by considering relatively small phylogenies across different sampling (i.e., subsamples). This information has been clarified in the revised version of the manuscript, especially at lines 357-360.
The positive correlations observed between the number of deaths and the number of sequences in a dataset (Figure 2) is a reassuring check but seems to be a direct artefact of the methods, whereby the number of deaths was used to determine the number of sequences. I suggest including these plots in supplemental material.
We agree that the positive correlations between the number of SARS-CoV-2-related deaths and the number of sequences in a set are simply a result of a sequence draw consistent with the epidemiological data. The plots were made to show that the datasets we produced do not include notable biases. As advised, these plots were put as a supplementary figure (figure 2—figure supplement 2), while we put more emphasis on a simulation analysis to prove that a set of 1,000 sequences gave satisfying results in describing more complex phylogenies.
I would value more explicit information in the Methods section about how data from the inferences performed on different bootstrapped samples was combined.
In the previous version, exportation and introduction events were summed, which is indeed not the most conventional strategy. In the revised version, we averaged the events within subsamples, by locality and by week. It did not change any trend or ratio but it is more standard now. This clarification has been introduced in the “Methods” section (lines 536-537).
The link to the stated GitHub repository did not work at the time I tried it.
The link to access the scripts produced for this study did not work. This has been corrected, and anyone can use and modify the scripts for their own works/projects.
Reviewer #2 (Recommendations for the authors):
L97: Figure comments: It is recommended to add an annotation for the date delineation of the two COVID waves analyzed in the main graphic. The legend should report how reproduction rate was calculated, and this should be clearly described in the methods. Relatedly, L447-448: state exact dates instead of 'mid' and 'late' months. Applies in results as well.
We annotated in the Figure 1—figure supplement 2 the two time periods investigated in the study, as well as in some parts in the main text. As far as the reproduction rate is concerned, since we did not talk about it or relate it to our study, it has been removed from the figure. Instead, we put the stringency index, which seems to us much more relevant. The definition of this index is explained in the legend of Figure 1 (lines 695-698). For lines 447-448 (as well as in other parts in the manuscript, notably the “Results” section), we modified the text by giving the exact dates rather than the notions of “mid” or “late”.
L106: Although the authors say that sequences are representative of 'sars-cov-2 infections', they then describe the comparison to deaths. Did the authors also compare case incidence to deaths and separately, to sequences over time? Need to better justify the use of deaths rather than new diagnoses. It would help the reader if there was also a presentation of cases over time in addition to deaths that could be done for example in Figure 1.
This is a misnomer on our part. The subsamples we produced are spatio-temporally representative of COVID-19-related deaths, not infections. We did not produce subsamples based on infections because we believe that this parameter is biased due to the heavy use of PCR tests only from the second epidemic wave, wrongly suggesting that the virus circulated much more during the second wave than the first wave. For more details, we refer the reviewer to point (1) of the essential revisions.
L114;L124;L209;L287,L465: Regarding the discarding of sequences from geographies that were underrepresented in the sequence set – how would the results be expected to change if these geographies were not discarded? They would be perhaps less likely to be sampled as well as represented in the ancestral nodes due to low representation, but they might have indicated additional nuances to the French regional transmission patterns.
The reviewer's remark is very pertinent, and we asked ourselves this question at the beginning of this study: should we or not discard countries/regions for which the number of sequences was (significantly) underestimated? We decided to discard such countries/regions in order to avoid misinterpretation. If we take the example of the French regions, based on the few reports from Santé Publique France (https://www.santepubliquefrance.fr/dossiers/coronavirus-covid-19), the Grand-Est region would have been strongly involved in the dissemination of the virus during the first French wave, in particular because of the evangelical gathering in Mulhouse, in the Haut-Rhin, where more than a thousand people were contaminated. This was one of the largest epidemic clusters in Europe at the beginning of the epidemic. However, on the GISAID platform, only a handful of sequences associated with the Haut-Rhin region were available, and we could not have reproduced such an observation. In the new version, we however added two additional French regions, Bretagne and Occitanie, which include some others very densely populated French cities (Toulouse, Montpellier and Rennes).
Also, in the methods, what was the threshold by which the authors decided if a region was too underrepresented to be included? L404: what percent of cases and/or deaths were associated with the administrative regions that were excluded due to low or no sequences in each wave?
This information was missing from the “Methods” section. We corrected that in the new version of the manuscript. Briefly put as can be seen in Figure 2—Figure Supplement 3, the regions were very unevenly represented in the data. We excluded regions with more than 20% of weeks insufficiently represented in the database (lines 464-469).
L115: Need to clarify how many sequences per phylogeny in the results. Were sequences sampled by geography randomly or proportionally to the relative number of deaths over time? Were the sampled sequences distributed uniformly over time or proportionally to deaths over time? Were sequences sampled entirely at random?
Before producing a phylogeny, we perform a random draw of sequences per country/French region and per week. For that, we first specify the desired sample size (e.g., 1,000 sequences). The next step consists in assigning the number of sequences associated with each country/French region according to the number of deaths related to COVID-19, and then to make a breakdown by week, still according to the number of deaths. In the revised version of the manuscript, we detail this procedure in the “Results” and "Methods" section. The dataset generation algorithm is provided on GitHub (https://github.com/Rcoppee/PhyloCoV). The figure showing the correlation, for a subsample, between the number of sequences included per continent/European country/French region and per week as a function of the number of deaths represents, according to us, a proof of the well constitution of the subsamples.
L133: Were sequences masked for problematic sites, i.e. due to homoplasy and likely sequencing errors, as identified by Nicola de Maio and group, prior to the phylogenetic inference? This is important and standard in these analyses; if the authors performed masking of problematic sites it is important to mention in the methods. If masking of problematic cites was not performed then this should be done and the analyses repeated during revision.
We mask the problematic sites using the list provided by De Maio and colleagues (filedate=2021-10-27) on GitHub. All the sites listed in the vcf file were changed to N’s after the alignment step. This has been now precised in the text (lines 506-509).
L156: why is the analysis delineated into periods in mid-July when the authors state here that the second European wave started in October? The delineation of epidemic waves and studied periods needs to be clarified and justified.
The second European epidemic wave effectively begins in October with an almost exponential increase in the number of SARS-CoV-2-related deaths. However, on a continental scale, Asia and South America (and to a lesser extent, North America) were particularly affected by the pandemic during the summer of 2020 (between June and September, based on COVID-19 data explorer from https://ourworldindata.org). Therefore, we thought it would be interesting to study the role of Europe during this period. Furthermore, while the epidemic has strongly declined in Europe in summer 2020, some European countries such as Russia continued to have a relatively high number of COVID-19-related deaths. It was therefore interesting for us to better understand what this epidemic in Russia may have entailed. In our study, it appears that Russia likely transmitted the virus to a number of European countries, possibly causing a resurgence of cases in Europe, particularly in Spain, before proliferating to the rest of Europe.
We clarified the concepts of time periods by giving the precise dates of each period to avoid ambiguity thorough the revised version of the manuscript.
L161-165; L237-240; 374-376: the positive correlation between intra-territory transmission events and the estimated number of deaths does not convincingly validate the method, because this is what would be expected even in the presence of remaining bias. Why not total transmission in relation to deaths? Lastly, the authors have shown deaths are also associated with sequences, therefore this does not distinguish model accuracy from reflecting the relative inclusion of sequences.
The reviewer made broadly the same point as Reviewer 1. We agree with their observations: the positive correlation between the number of intra-regional events and the number of deaths related to COVID-19 is only an artifact of the methodology. In the revised version of the manuscript, we removed all such notions of correlations because they did not ultimately add any weight to the observations. The number of intra-region transmissions was simply used to determine the total proportion of inter-region transmissions, and thus quantify the increase or decrease in the number of such events between the two time periods investigated.
L378: McLaughlin et al. 2022 used maximum likelihood ancestral reconstruction, but did not sum across replicates; rather, they averaged across replicates. Also, it would be good to include a couple other references as well that a similar approach, as well as the origins of the method algorithms (Pupko et al., 2000, Cunningham et al., 1998)
As in the work of McLaughlin et al. – which serves as a benchmark for our present work – we averaged the observations across subsamples in the revised version. We also included the two proposed citations introducing a similar “replicate” approach (line 373).
There are multiple instances in the Results section where relative flows are discussed for different geographies or over time, but only in a qualitative rather than qualitative manner. Please quantify rate estimates and/or relative changes/differences (as the authors did on L296 and L321) in other places. L186: 'at similar rates'. L192: 'quite similar'. L225: 'sequentially – and drastically – increased…'. L233: 'drastic'. L245: 'drastic'. L266: 'steep increase'. L268: 'sharp increase'. L294 'much more'. L308: 'a bit lower'.
We thank the reviewer for this comment because we feel that including many more quantitative, rather than qualitative, results give more weight to our observations. In addition to the methodology, we rewrote much of the “Results” section to provide a multitude of numerical data, providing a much better understanding of the overall involvement of each geographic region in COVID-19 transmission.
L279: it is speculative that government measures had varying success rates based on the analyses conducted and presented in this manuscript.
This concluding sentence of the results was indeed very presumptuous on our part and has been removed in the revised version.
L275; and generally: The manuscript would benefit from more discussion or stratification of which Pango lineages or Nextstrain clades dominated particular transmission types and routes in the two waves.
The reviewer rightly asked if it was possible to see which lineages dominated the exchanges at the different geographic scales. In the revised manuscript, we performed a reconstruction of ancestral states at nodes including both geographic region and Pango lineage. In this way, we were able to study, notably in relation to France, which lineages were strongly introduced in this country and, conversely, which were the main lineages exported from France. These analyses were carried out at the different periods of time investigated. Novel items were added in the main figures to summarize these results.
L299: Normally, Spearman rank correlation can be appropriately performed when there are 10 or more observations, here, as written the correlation has been performed with only three measurements – please clarify or rectify this.
These correlations have been removed in the revised version, as previously explained.
L430; Supplementary: The GISAID identifiers should be a separate appendix and the authors need to follow the GISAID recommendations for inclusion of a table acknowledging labs for each sequence.
In the revised version, we followed the recommendations provided by the GISAID application. A specific page presenting all the sequences included in the analysis is dedicated, with all the participating authors and laboratories (lines 432-435).
L443-446: What is missed by not conducting all three analyses together? How much international origin among a given region in France, for example?
Conducting an analysis by aggregating the different geographical scales into a single dataset is not compatible with our strategy of producing small phylogenies. Indeed, on the continental scale and in the case of small phylogenies, we only included few sequences per week and per country (depending on the number of deaths). Therefore, we could not model with sufficient accuracy the exchanges for smaller geographical scales such as the European scale or the scale of French regions. The separation of geographical scales and time periods allows us to include sufficiently “large” sets of sequences with a relatively limited computational time cost.
L470: The inclusion of earlier diversity from the first wave in the trees for the second wave is laudable, however, how did the authors ignore transmission events involving those context sequences when analyzing the second wave in terms of intra- vs inter-territory transmission? Precisely how this was done needs to be clarified in the manuscript.
After further testing, we found that the presence of some sequences from the first time period was not essential to produce phylogenies with proper dating and tMRCA. Therefore, we no longer included sequences from the first time period in the second time period set.
L480: How were trees rooted? This is an important omission and should be clearly specified in methods.
It was indeed an unforgivable omission and we amended our method in the manuscript (lines 515-517). We simply define Wuhan-Hu-1/NC_045512 as an outgroup in LSD2 v2.3 for rooting.
L483: how could the use of a strict clock rate have impacted the results? Employing model selection and describing in methods to justify would be valuable. If the authors have already performed model selection then clear description in the results is warranted.
The issue of molecular clock was, in our opinion, the main bias of our work and was used to correct discrepancies in dating that were due to a bug in our code. We would like to thank once again reviewer #2 with whom we were able to exchange and who gave us many advices on the good use of the clock and to eliminate the possible biases. In this revised version, and after code proofreading, the molecular clock is no longer fixed but estimated using LSD2 v2.3 following the production of phylogenies with FastTree. The mutation rate for each replicate was very stable across the subsamples, around 7 x 10-4 substitution per site per year, a value close to those found in other studies (Xia et al. 2021, Viruses, doi: 10.3390/v13091790). Concerning the dating of events, we added some clarifications, and in particular on the fact that we considered the middle of the branch as the date of the event (lines 534-535).
L486: in regards to Figure 6-Figure supp 1 on estimating versus fixing the clock rate, the authors should move some of the text in the legend into the main results to better explain why the authors fixed a strict clock rate. Also, the reader may wonder to what extent the issue encountered with early node dating and long branches might have been related to specifying a strict instead of relaxed clock. Thus, please elaborate on this in the results.
Following correction of our molecular clock, this additional figure has been removed.
Generally in the figure supplement legends, any interpretations of the plots should be moved into the Results section.
As recommended by the reviewer, we tried to limit the interpretations and presentations of results in the figure legends.
It would be informative to contextualize the authors' results within the literature available to date by having a more thorough comparison to other papers on the genomic epidemiology of SARS-CoV-2 in Europe in the Background and/or Discussion; for example, Worobey 2021, Hodcroft 2021, and Huisman et al. 2022.
We included comparisons with other studies in the “Discussion” section, highlighting convergences and divergences about genomic epidemiology in Europe in 2020.
https://doi.org/10.7554/eLife.82538.sa2Article and author information
Author details
Funding
No external funding was received for this work.
Acknowledgements
We gratefully thank all the laboratories and organizations that deposited SARS-CoV-2 sequences on the GISAID database, without which this study would not have been possible. We would like to thank the editors of eLife and the reviewers, who provided extremely valuable feedback and greatly improved the quality of this work. Funding This study was supported in part by the ANRS MIE (Agence Nationale de la Recherche sur le SIDA et les hépatites virales – Maladies Infectieuses Emergentes), the FRM (Fondation pour la Recherche Médicale), and the Inserm UMR1137 unit.
Senior Editor
- Eduardo L Franco, McGill University, Canada
Reviewing Editor
- Caroline Colijn, Simon Fraser University, Canada
Reviewer
- Christopher JR Illingworth, MRC University of Glasgow Centre for Virus Research, United Kingdom
Version history
- Received: August 8, 2022
- Preprint posted: August 11, 2022 (view preprint)
- Accepted: April 15, 2023
- Accepted Manuscript published: April 26, 2023 (version 1)
- Version of Record published: May 11, 2023 (version 2)
Copyright
© 2023, Coppée et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 515
- Page views
-
- 93
- Downloads
-
- 1
- Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Epidemiology and Global Health
- Medicine
Background:
Although there are several efficacious vaccines against COVID-19, vaccination rates in many regions around the world remain insufficient to prevent continued high disease burden and emergence of viral variants. Repurposing of existing therapeutics that prevent or mitigate severe COVID-19 could help to address these challenges. The objective of this study was to determine whether prior use of bisphosphonates is associated with reduced incidence and/or severity of COVID-19.
Methods:
A retrospective cohort study utilizing payer-complete health insurance claims data from 8,239,790 patients with continuous medical and prescription insurance January 1, 2019 to June 30, 2020 was performed. The primary exposure of interest was use of any bisphosphonate from January 1, 2019 to February 29, 2020. Bisphosphonate users were identified as patients having at least one bisphosphonate claim during this period, who were then 1:1 propensity score-matched to bisphosphonate non-users by age, gender, insurance type, primary-care-provider visit in 2019, and comorbidity burden. Main outcomes of interest included: (a) any testing for SARS-CoV-2 infection; (b) COVID-19 diagnosis; and (c) hospitalization with a COVID-19 diagnosis between March 1, 2020 and June 30, 2020. Multiple sensitivity analyses were also performed to assess core study outcomes amongst more restrictive matches between BP users/non-users, as well as assessing the relationship between BP-use and other respiratory infections (pneumonia, acute bronchitis) both during the same study period as well as before the COVID outbreak.
Results:
A total of 7,906,603 patients for whom continuous medical and prescription insurance information was available were selected. A total of 450,366 bisphosphonate users were identified and 1:1 propensity score-matched to bisphosphonate non-users. Bisphosphonate users had lower odds ratios (OR) of testing for SARS-CoV-2 infection (OR = 0.22; 95%CI:0.21–0.23; p<0.001), COVID-19 diagnosis (OR = 0.23; 95%CI:0.22–0.24; p<0.001), and COVID-19-related hospitalization (OR = 0.26; 95%CI:0.24–0.29; p<0.001). Sensitivity analyses yielded results consistent with the primary analysis. Bisphosphonate-use was also associated with decreased odds of acute bronchitis (OR = 0.23; 95%CI:0.22–0.23; p<0.001) or pneumonia (OR = 0.32; 95%CI:0.31–0.34; p<0.001) in 2019, suggesting that bisphosphonates may protect against respiratory infections by a variety of pathogens, including but not limited to SARS-CoV-2.
Conclusions:
Prior bisphosphonate-use was associated with dramatically reduced odds of SARS-CoV-2 testing, COVID-19 diagnosis, and COVID-19-related hospitalizations. Prospective clinical trials will be required to establish a causal role for bisphosphonate-use in COVID-19-related outcomes.
Funding:
This study was supported by NIH grants, AR068383 and AI155865, a grant from MassCPR (to UHvA) and a CRI Irvington postdoctoral fellowship, CRI2453 (to PH).
-
- Epidemiology and Global Health
A large observational study has found that irregular sleep-wake patterns are associated with a higher risk of overall mortality, and also mortality from cancers and cardiovascular disease.