1. Epidemiology and Global Health
  2. Microbiology and Infectious Disease
Download icon

Rapid feedback on hospital onset SARS-CoV-2 infections combining epidemiological and sequencing data

  1. Oliver Stirrup  Is a corresponding author
  2. Joseph Hughes
  3. Matthew Parker
  4. David G Partridge
  5. James G Shepherd
  6. James Blackstone
  7. Francesc Coll
  8. Alexander Keeley
  9. Benjamin B Lindsey
  10. Aleksandra Marek
  11. Christine Peters
  12. Joshua B Singer
  13. The COVID-19 Genomics UK (COG-UK) consortium
  14. Asif Tamuri
  15. Thushan I de Silva
  16. Emma C Thomson
  17. Judith Breuer  Is a corresponding author
  1. Institute for Global Health, University College London, United Kingdom
  2. MRC-University of Glasgow Centre for Virus Research, United Kingdom
  3. Sheffield Bioinformatics Core, The University of Sheffield, United Kingdom
  4. Sheffield Institute for Translational Neuroscience, The University of Sheffield, United Kingdom
  5. Sheffield Biomedical Research Centre, The University of Sheffield, United Kingdom
  6. Sheffield Teaching Hospitals NHS Foundation Trust, United Kingdom
  7. The Florey Institute for Host-Pathogen Interactions & Department of Infection, Immunity and Cardiovascular Disease, Medical School, University of Sheffield, United Kingdom
  8. The Comprehensive Clinical Trials Unit at UCL , University College London, United Kingdom
  9. Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene & Tropical Medicine, United Kingdom
  10. Clinical Microbiology, NHS Greater Glasgow and Clyde, United Kingdom
  11. Research Computing, University College London, United Kingdom
  12. Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences, University of Glasgow, United Kingdom
  13. Department of Infectious Diseases, Queen Elizabeth University Hospital, United Kingdom
  14. Division of Infection and Immunity, University College London, United Kingdom
Tools and Resources
  • Cited 5
  • Views 663
  • Annotations
Cite this article as: eLife 2021;10:e65828 doi: 10.7554/eLife.65828

Abstract

Background:

Rapid identification and investigation of healthcare-associated infections (HCAIs) is important for suppression of SARS-CoV-2, but the infection source for hospital onset COVID-19 infections (HOCIs) cannot always be readily identified based only on epidemiological data. Viral sequencing data provides additional information regarding potential transmission clusters, but the low mutation rate of SARS-CoV-2 can make interpretation using standard phylogenetic methods difficult.

Methods:

We developed a novel statistical method and sequence reporting tool (SRT) that combines epidemiological and sequence data in order to provide a rapid assessment of the probability of HCAI among HOCI cases (defined as first positive test >48 hr following admission) and to identify infections that could plausibly constitute outbreak events. The method is designed for prospective use, but was validated using retrospective datasets from hospitals in Glasgow and Sheffield collected February–May 2020.

Results:

We analysed data from 326 HOCIs. Among HOCIs with time from admission ≥8 days, the SRT algorithm identified close sequence matches from the same ward for 160/244 (65.6%) and in the remainder 68/84 (81.0%) had at least one similar sequence elsewhere in the hospital, resulting in high estimated probabilities of within-ward and within-hospital transmission. For HOCIs with time from admission 3–7 days, the SRT probability of healthcare acquisition was >0.5 in 33/82 (40.2%).

Conclusions:

The methodology developed can provide rapid feedback on HOCIs that could be useful for infection prevention and control teams, and warrants further prospective evaluation. The integration of epidemiological and sequence data is important given the low mutation rate of SARS-CoV-2 and its variable incubation period.

Funding:

COG-UK HOCI funded by COG-UK consortium, supported by funding from UK Research and Innovation, National Institute of Health Research and Wellcome Sanger Institute.

Introduction

Nosocomial transmission of SARS-CoV-2 presents a significant health risk to both vulnerable patients and to healthcare workers (HCWs) (Kursumovic et al., 2020; Wang et al., 2020; Leclerc et al., 2020; The DELVE Initiative, 2020; Shah et al., 2020). There is a variable incubation period, extending up to day 14 from exposure to the virus in symptomatic cases (Lauer et al., 2020). It is also known that transmission is possible from asymptomatic or presymptomatic carriers (He et al., 2020; Rivett et al., 2020; Oran and Topol, 2020; Lucey et al., 2021), complicating identification of hospital acquisition among hospital onset COVID-19 infections (HOCIs) and tracing of likely sources of infection.

There is now substantial evidence from retrospective studies that genome sequencing of epidemic viruses, together with standard infection prevention and control (IPC) practice, better excludes nosocomial transmissions and better identifies routes of transmission than IPC investigation alone (Brown et al., 2019; Houldcroft et al., 2018; Roy et al., 2019). The development of rapid sequencing methods capable of generating pathogen genomes within 24–48 hr has recently created the potential for clinical IPC decisions to be informed by genetic data in near-real time (Meredith et al., 2020). Although SARS-CoV-2 has a low mutation rate (Fauver et al., 2020), sufficient viral diversity exists for viral sequences to provide information regarding potential transmission clusters (van Dorp et al., 2020). However, phylogenetic methods alone cannot reliably identify linked infections, and the need for clinical teams to gather additional patient data presents challenges to the timely interpretation of SARS-CoV-2 sequence data.

To overcome these barriers, we have developed a sequence reporting tool (SRT) that integrates genomic and epidemiological data from HOCIs to rapidly identify closely matched sequences within the hospital and assign a probability estimate for nosocomial infection. The output report is designed for prospective use to reduce the delay from sequencing to impact on IPC practice. The work was conducted as part of the COVID-19 Genomics (COG) UK initiative, which sequences large numbers of SARS-CoV-2 viruses from hospitals and the community across the UK (COVID-19 Genomics UK (COG-UK), 2020). Here we describe the performance of the SRT using COG-UK sequence data for HOCI cases collected from Glasgow and Sheffield between February and May 2020 and explore how it may have provided additional useful information for IPC investigations.

Materials and methods

The SRT methodology is applied to HOCI cases, defined here as inpatients with first positive SARS-CoV-2 test or symptom onset >48 hr after admission, without suspicion of COVID-19 at admission. The SRT algorithm returns an estimate of the probability that each HOCI acquired their infection post-admission within the hospital, with information provided on closely matching viral sequences from the ward location at sampling and wider hospital. Results for individual HOCIs are evaluated in relation to the IPC classification system recommended by Public Health England (PHE), based on interval from admission to positive test: 3–7 days post admission = indeterminate healthcare-associated infection (HCAI); 8–14 days post admission = probable HCAI; >14 days post admission = definite HCAI (Public Health England, 2020). We also applied the PHE definition of healthcare-associated COVID-19 outbreaks (Public Health England, 2020) (i.e. ≥2 cases associated with specific ward, with at least one being a probable or definite HCAI) to ward-level data, and for each outbreak evaluated whether there was one or more distinct genetic cluster. This was determined by consecutive linkage of each HOCI into clusters using a two single-nucleotide polymorphism (SNP) threshold (with HOCIs assigned to a genetic cluster if a sequence match to any member). Sequences with <90% genomic coverage were excluded from all analyses.

Data collection and processing

Glasgow

Request a detailed protocol

During the first wave of SARS-CoV-2, the MRC-University of Glasgow Centre for Virus Research collected residual clinical samples from SARS-CoV-2-infected individuals following diagnosis at the West of Scotland Specialist Virology Centre. Samples were triaged for rapid sequencing using Oxford Nanopore Technologies (ONT) for suspected healthcare-related infections or Illumina sequencing in all other cases (details in Appendix 1).

Sheffield

Request a detailed protocol

Residual clinical samples from SARS-CoV-2-positive cases diagnosed at Sheffield Teaching Hospitals NHS Foundation Trust were sequenced at the University of Sheffield using ARTIC network protocol (ARTIC Network, 2020) and ONT. Throughout the epidemic, members of the IPC team were notified by the laboratory and by clinical teams of positive results and reviewed relevant areas to ensure optimisation of practice and appropriate management of patients. Electronic reports were created contemporaneously, including an assessment as to whether suspected linked cases were present based on ward-level epidemiology. As part of SRT validation, these reports were accessed retrospectively by a study team member blind to the sequencing data and each included HOCI case was defined as being thought unlinked to other cases, a presumed index case in an outbreak or a presumed secondary case.

HOCI classification algorithm

The sequence matching and probability score algorithm is run separately for each ‘focus sequence’ corresponding to a HOCI. We use associated metadata to assign other previously collected sequences to categories representing where the individual may be part of a SARS-COV-2 transmission network:

  • Unit reference set: individual could be involved with transmission on same unit (ward/ICU, etc.) as focus sequence (look-back interval: 3 weeks)

  • Institution reference set: individual could be involved with transmission in same institution/hospital as focus sequence (look-back interval: 3 weeks)

  • Community reference set: individual could be involved with transmission outside of focus sequence institution (look-back interval: 6 weeks)

It is possible for samples to be members of multiple reference sets. For example, an outpatient may be involved in SARS-CoV-2 transmission at the institution they attended and/or in community transmission.

For each run of the algorithm, pairwise comparisons are conducted between the focus sequence and each sequence within the unit reference set, institution reference set and community reference set. A reference set sequence is considered a close match to the focus sequence if there is a maximum of two SNP differences between them. This choice was based on reported healthcare-associated outbreak events (Meredith et al., 2020; Rockett et al., 2020) and the overall mutation rate of SARS-CoV-2 (details in Appendix 1).

Probability calculations

Request a detailed protocol

We use an expression of Bayes theorem to estimate probabilities for post-admission infection of each focus case divided by exposure on the unit, within the rest of the institution and from visitors (if allowed). An estimate of the prior probability (Pprior) of post-admission infection for each focus case is modified to a posterior probability according to the information provided by the sequence data. The algorithm is based on sound statistical principles, but involves heuristic approximations.

In symptomatic focus cases, we base Pprior on the time interval (t) from admission to date of symptom onset or first positive test (if date of symptom onset not recorded). We calculate Pprior = F(t), where F() is the cumulative distribution function of incubation times (Lauer et al., 2020) (derivation in Appendix 1).

In theory, it would be optimal to use all of the information in the exact sequences observed. However, with the goal of constructing a computationally simple algorithm, we base our calculations on the probability of observing a similar sequence (within two SNPs) to that actually observed for each focus case conditional on each potential infection source/location: infection in the community, current unit/ward or elsewhere in the hospital/institution, or from a visitor. For the unit and hospital, we estimate this probability using the observed sequence match proportion (on pairwise comparison to the focus sequence) in the unit reference set and institution reference set, respectively. For community- or visitor-acquired infection, we use a weighted proportion of matching sequences in the community reference set, with weightings determined by a calibration model that describes geographic clustering of similar sequences among community-acquired infections (described in Appendix 1). The geographic weighting model was fitted separately for each study site using sequences strongly thought to represent community-acquired infection: all community-sampled sequences and patients presenting to the Emergency Department with COVID-19, excluding those recorded as being HCWs.

Software

The analysis was conducted in R (v. 4.0.2, R Foundation, Vienna) using sequence processing and comparison functions from ape (v. 5.4) and geospatial functions in the PostcodesioR (v. 0.1.1) and gmt packages (v. 2.0). R code to run the algorithm is available (Stirrup, 2021), and it has also been implemented as a standalone SRT for prospective use (HOCI Sequence Reporting Tool working group, 2020) within COV-GLUE (Singer et al., 2020).

Results

Study populations

Glasgow

The Glasgow dataset included 1199 viral sequences (available as of 23 June 2020): 426 were derived from community sampling sites, 351 from patients presenting to Emergency Department or acute medical units, 398 from hospital inpatients and 24 from outpatients. Limited data were available regarding the total number of HCWs testing positive and their identification among community samples, but 15 sequences were recorded as being from HCWs. First positive test dates ranged from 3 March to 27 May 2020. All consensus sequences had genomic coverage >90%.

We applied the SRT algorithm to data from three hospitals with required metadata available, for which 128/246 inpatient cases with sequences were HOCIs. Two of these patients had been transferred from another hospital within 14 days prior to their positive test and were not processed as focus sequences. One inpatient without recorded sampling location was excluded, leaving 125 HOCIs for analysis. Population sequencing coverage was 536/1578 (34.0%) overall for patients at the three hospitals and 128/328 (39.0%) for HOCIs specifically (Appendix 1—figure 1).

Sheffield

The Sheffield dataset included 1630 viral sequences with accompanying metadata (available as of 10 October 2020): 714 were from inpatients, 117 were from outpatients and 799 were from HCWs. For this retrospective evaluation, 447/714 inpatient samples taken on date of admission were assumed to represent community-onset cases and used to calibrate the model. First positive test dates ranged from 23 February to 30 May 2020. One sequence with genome coverage <90% was dropped from further analysis (an inpatient on date of admission). 201 of the inpatients were HOCIs. Population sequencing coverage was 714/977 (73.1%) overall for inpatients, 201/261 (77.0%) for HOCIs specifically and 799/962 (83.1%) for HCWs.

Comparison to standard PHE classification

SRT algorithm results in comparison to standard PHE classifications are summarised in Figure 1 and Table 1. The majority of HOCI cases in Glasgow (78/125, 62.4%) and over a third in Sheffield (71/201, 35.3%) met the definition of a definite HCAI and so are known to have acquired the virus post-admission irrespective of sequencing results. The probable HCAI cases formed the next largest group at each site. Overall, the SRT algorithm identified close sequence matches from the same ward for 66.4% of definite and 64.2% of probable HCAIs, indicating likely within-ward transmission (examples in case studies). When one or more close sequence match was identified on the focus sequence’s ward, the SRT probability of infection on the ward was >0.5 in 185/189 cases (Figure 2). For indeterminate HCAIs, the SRT probability of HCAI was >0.5 in 33/82 (40.2%), and in 27/33 (81.8%) a close sequence match on the ward was present. Overall, 14/125 (11.2%) HOCIs in Glasgow and 175/201 (87.1%) in Sheffield had at least one close sequence match to a HCW sample, reflecting the much greater availability of sequences from HCWs in the Sheffield dataset.

Plots of posterior probability of healthcare-associated infection (HCAI) against prior probability of HCAI.

Plot of the posterior probability of healthcare-associated infection (HCAI) for (a) Glasgow and (b) Sheffield hospital onset COVID-19 infection cases from the sequence reporting tool algorithm against the prior probability of HCAI based only on time from admission to diagnosis, grouped by standard infection prevention and control classification recommended by Public Health England. Marginal histograms are displayed with bin-widths of 0.05.

Plot of the posterior probabilities of healthcare-associated infection (HCAI) estimated using the sequence reporting tool algorithm from a source on the current ward versus a source elsewhere in the hospital for (a) Glasgow and (b) Sheffield hospital onset COVID-19 infection cases grouped by standard Public Health England classification.

In cases where there are no close sequence matches in the dataset (including among community cases), the results returned are based solely on the priors and the metadata; this explains the fact that there are some cases with estimated posterior probability of infection on the ward greater than 0.5 for whom there were no sequence matches on the ward.

Table 1
Summary of sequence reporting tool outputs for the Glasgow and Sheffield datasets, according to standard IPC definitions recommended by Public Health England regarding likelihood of HCAI.
Glasgow dataSheffield data
IPC classificationIPC classification
Indeterminate HCAIProbable HCAIDefinite HCAIIndeterminate HCAIProbable HCAIDefinite HCAI
n HOCI cases202778626871
Time from admission to sample*, days4.5 (3–6)11 (9-13)48 (26-83)5 (4–6)9 (8–13)22 (17–31)
Summary of sequence matches returned for each HOCI case
Close sequence match on ward5 (25.0)15 (55.6)53 (68.0)24 (38.7)46 (67.6)46 (64.8)
No close sequence match on ward, but match within hospital8 (40.0)7 (25.9)19 (24.4)34 (54.8)21 (30.9)21 (29.6)
No close sequence match anywhere within hospital7 (35.0)5 (18.5)6 (7.7)4 (6.5)1 (1.5)4 (5.6)
Close sequence match to one or more HCW1 (5.0)0 (0)13 (16.7)55 (88.7)61 (89.7)59 (83.1)
No close sequence match anywhere within dataset2 (10.0)1 (3.7)4 (5.1)4 (6.5)1 (1.5)4 (5.6)
Probability calculations
Prior probability of HCAI†0.39 (0.11–0.66)0.97 (0.92–0.99)1.00 (1.00–1.00)0.49 (0.29–0.66)0.92 (0.86–0.99)1.00 (1.00–1.00)
Posterior probability of HCAI‡0.33 (0.02–0.67)0.98 (0.96–1.00)1.00 (1.00–1.00)0.40 (0.11–0.80)0.98 (0.93–1.00)1.00 (0.99–1.00)
Posterior probability of HCAI‡ category
Low (<30%)10 (50.0)4 (14.8)2 (2.6)25 (40.3)0 (0)0 (0)
Moderately low (≥30% and <50%)2 (10.0)0 (0)0 (0)12 (19.4)0 (0)0 (0)
Medium (≥50% and <70%)4 (20.0)0 (0)0 (0)4 (6.5)5 (7.4)3 (4.2)
High (≥70% and <85%)3 (15.0)0 (0)0 (0)8 (12.9)7 (10.3)2 (2.8)
Very high (≥85%)1 (5.0)23 (85.2)76 (97.4)13 (21.0)56 (82.4)66 (93.0)
  1. Data shown as median (interquartile range) or n (%).

    *Or first +ve test where known.

  2. †Based on time from admission.

    ‡From source on ward or within hospital.

  3. HCAI: healthcare-associated infection; HOCI: hospital onset COVID-19 infection; HCW: healthcare worker; IPC: infection prevention and control.

In 16/244 (6.6%) cases that met the probable or definite HCAI definitions, there was no sequence match within the hospital; this is likely due to incomplete sequence data from SARS-CoV-2 hospitalised cases and staff (with population sequencing coverage <40% patients and very limited for staff from Glasgow and ≈75% of patients and staff in Sheffield) and the presence of asymptomatic and/or undiagnosed carriers. To reflect this the SRT will report ‘This is a probable/definite HCAI based on admission date, but we have not found genetic evidence of transmission within the hospital’ in such situations. There were 26 HOCIs in the Sheffield dataset for whom it was recorded that visitors were allowed on the ward at time of sampling. In three of these, the estimated probability of infection from a visitor was between 0.4 and 0.5 (all had ≥18 days from admission and no ward close sequence matches).

Within the Sheffield dataset, we identified six wards with two genetically distinct outbreak clusters (of two or more patients) and three wards with three distinct outbreaks (see Case study 2). Standard IPC assessment had classified each as a single outbreak. We also identified 10 and 44 HOCIs in the Glasgow and Sheffield datasets, respectively, with no apparent genetic linkage to other HOCI cases on the ward but who met the PHE definition of inclusion within an outbreak event (Table 2).

Table 2
Summary of distinct outbreak events for the Glasgow and Sheffield datasets, according to standard PHE definition and with the addition of sequence data.
Glasgow dataSheffield data
n HOCI cases125201
n ward locations4438
Sequence matches per HOCI case
n sequence matches from same ward, median (IQR, range)1 (0–5, 0–12)1 (0–4, 0–18)
n sequence matches from rest of hospital, median (IQR, range)3 (1–8, 0–52)27 (5–52, 0–150)
Standard PHE definition of outbreak event
HOCI cases part of ward outbreak event, n (%)95 (76.0)184 (91.5)
n ward outbreak events1724
n HOCI cases per ward outbreak event, median (IQR, range)4 (2–8, 2–17)5 (3.5–10.5, 2–28)
Days from first to last case in outbreak, median (IQR, range)8 (6–15, 0–31)18 (13–34, 3–68)
n wards with more than one distinct outbreak event00
Outbreak events with sequence linkage
HOCI cases part of ward outbreak event, n (%)85 (68.0)140* (69.7)
n ward outbreak events1633
n HOCI cases per ward outbreak event, median (IQR, range)3.5 (2–8, 2–16)3 (2–4, 1–19)
Days from first to last case in outbreak, median (IQR, range)6 (4–9, 0–15)4 (2–8, 0–17)
n wards with more than one distinct outbreak event09†
  1. * Includes two HOCIs which each showed a close sequence match to another case on the same ward with interval from admission to sample date ≤2 days.

    †In three wards, there were three genetically distinct outbreak events.

  2. HOCI: hospital onset COVID-19 infection; IQR: interquartile range; PHE: Public Health England.

Comparison to local IPC conclusions in Sheffield

Contemporaneous notes by IPC teams in Sheffield classified 18/201 HOCIs as the index case in outbreaks. IPC staff defined an index case as the first detected in an environment regardless of prior inpatient stay and, correspondingly, of these 14/18 were the first sequence on their ward and one was the second (the first 1 day earlier from a different bay on the ward was also recorded as an index case, and IPC staff deemed a ward outbreak with unclear index or possibly two index cases). Of the 18 index cases, 11 showed at least one subsequent close sequence match on the same ward (the two index cases on a single ward were not genetically similar, and for 1/18 there were no subsequent sequences from the ward). The median SRT probability of HCAI was 0.70 (IQR 0.22–1.00, range 0.04–1, >0.5 in 12/18).

A further 144/201 HOCIs were classified as being part of local outbreaks, and among these the median SRT probability of HCAI was 0.98 (IQR 0.89–1.00, range 0.02–1.00, >0.5 in 129/144) with one or more close sequence match on the same ward in 104/144. The remaining 39/201 HOCIs, including 10 that were not recorded as HOCIs at the time, were classified by the IPC teams as not being part of local outbreaks. Among these the median SRT probability of HCAI was 0.74 (IQR 0.23–0.99, range 0.02–1.00, >0.5 in 23/39), with one or more close sequence matches on the same ward in 7/39.

Case study 1

Figure 3 shows a phylogenetic tree of eight HOCIs within a single ward at a Glasgow hospital (Hospital 5, Unit 93), alongside associated metadata and SRT probability outputs. The first HOCI detected (UID0032) was transferred from another hospital within the previous 2 weeks and so SRT output was not generated. All subsequent HOCIs return close sequence matches to at least one prior case on the ward, leading to SRT probability estimates of ward-acquired infection >0.9, even for UID0017 (an indeterminate HCAI). The phylogenetic tree indicates UID0032 has an SNP lacked by most of the cases identified on the ward, and therefore did not seed all of the cases in the outbreak cluster. Also shown is a single HOCI from a different ward in the same hospital (UID0025); this individual was an indeterminate HCAI, but a higher proportion of similar viral sequences within the hospital in comparison to their local community led to a SRT result of probable hospital-acquired infection.

Maximum-likelihood phylogeny of the sequences found in Hospital 5 Unit 93 and Unit 92 up until the 16th of May of the Glasgow dataset.

The black lines represent the time from admission to sampling. The values below the line are the posterior probability for unit infection + the posterior probability of hospital infection from the sequence reporting tool. The tip nodes are coloured according to the local authority area of the community surveillance sequences (circles) or of the patients (crosses).

Case study 2

Figure 4 shows phylogenetic trees relating to three distinct viral lineages identified on a single ward in the Sheffield dataset (classified by contemporaneous IPC investigation as a single outbreak). Two of these lineages also include sequences from inpatients sampled from other wards within the same hospital. Detailed ward movement data highlighted additional possible links between patients in the B.2.1 cluster. Both UID0149 and UID0157 were present at LOC0111 prior to their sample dates.

Maximum-likelihood phylogeny of the sequences found in location ‘0111’ in the Sheffield dataset, also including patients at several other ward locations.

The tree tip nodes are coloured according to ward locations. The black lines represent the time from admission to sampling. The values below the line are the posterior probability for unit infection + the posterior probability of hospital infection from the sequence reporting tool. The circle containing a number represents community sequences that are identical and at the base of this lineage (n = 36).

Discussion

We have developed a novel approach for identification and investigation of hospital-acquired SARS-CoV-2 infections combining epidemiological and sequencing data, designed to provide rapid and concise feedback to IPC teams working to prevent nosocomial transmission. Through retrospective application to clinical datasets, we have demonstrated that the methodology is able to provide confirmatory evidence for most PHE-defined definite and probable HCAIs and provide further information regarding indeterminate HCAIs. Thus the SRT may allow IPC teams to optimise their use of resources on areas with likely nosocomial acquisition events.

While the SRT is not likely to change IPC conclusions in cases meeting the definition of ‘definite’ or ‘probable’ HCAI based on interval from admission to symptom onset, in 91% of cases it did identify patients in the same ward or elsewhere in the hospital who could plausibly be linked to the HOCI within a single outbreak event. Those definite and probable HOCIs without close sequence matches are likely to reflect transmission from sources within the hospital that have either not been diagnosed or who were diagnosed without viral sequencing. In such cases, it is impossible to calculate a probability of transmission and the SRT will simply state that no sequence matches were found within the hospital.

For cases meeting the definition of ‘indeterminate healthcare associated’, the probability scores returned would be useful for IPC teams. These probabilities are dependent on comparison to sequences from cases of community-acquired infection obtained either from direct community sampling or from patients sampled at admission. The Sheffield dataset was lacking the former data source, but the SRT nonetheless classified a similar proportion of ‘indeterminate healthcare associated’ HOCIs as community-acquired infections to that found in the Glasgow dataset (approximately 60%).

Current PHE guidelines define healthcare-associated COVID-19 outbreaks as two or more cases associated with a specific setting (e.g. ward), with at least one case having illness onset after 8 days of admission (Public Health England, 2020). However, the guidelines note that ‘investigations of healthcare-associated SARS-CoV-2 infection should also take into account COVID-19 cases categorised as "indeterminate healthcare associated"’ (i.e. onset 3–7 days after admission), for which our SRT output would be useful. In most HOCIs meeting this definition of inclusion within an outbreak event, we found evidence of clusters of similar viral sequences located on the ward concerned, and the SRT results were in line with available local IPC classifications in the majority of cases. However, a substantial minority (54/279) of HOCIs, although assumed to be part of a ward outbreak, were, in fact, isolated cases for which the sequencing data refuted genetic linkage to other sequences from the ward. The SRT also provided evidence of wards where IPC-defined outbreak events comprised two or three clearly distinct viral lineages.

The retrospective datasets analysed in this study represent the first few months of the COVID-19 epidemic in the UK, and nosocomial transmission of the virus in the UK during this period has previously been reported at multiple sites (Meredith et al., 2020; Rickman et al., 2021; Carter et al., 2020). HCWs were at increased risk of infection and adverse health outcomes (Kursumovic et al., 2020; Wang et al., 2020; The DELVE Initiative, 2020; Shah et al., 2020; Houlihan et al., 2020) and could have been important drivers of nosocomial transmission (Rivett et al., 2020). Data were limited for Glasgow, but the Sheffield dataset contained a large number of sequences obtained from HCWs, with population sequencing coverage for this group >80%, and there was a close sequence match to at least one HCW observed for 87% of HOCIs. Our analysis has not evaluated direction of transmission to or from HCWs, but they were clearly linked into transmission networks within the hospital. A limitation of the current SRT approach and of the retrospective data available is that they do not include detailed information regarding work locations for HCWs. However, prospective use of the SRT would allow IPC teams to investigate linkage from a HOCI to any HCWs flagged as having a close sequence match.

While a phylogenetic approach is useful in excluding direct transmission between cases, it can be more problematic to confirm transmission source (Volz and Frost, 2013). Phylogenetic models can evaluate the full genetic information provided by viral sequence data, but there are challenges in incorporating and summarising associated patient metadata in a timely fashion (Villabona-Arenas et al., 2020). The challenge of timely collection and standardisation of patient metadata is also relevant for use of the SRT that we have developed, but it is possible to automate such processes through electronic patient record systems. There have been advances in recent years in the computational efficiency and workflow standardisation possible for phylogenetic analyses that have made it easier to use these methods for real-time investigation of outbreaks, for example, through the development of the Nextstrain project (Hadfield et al., 2018; Huddleston et al., 2021). However, there does not currently exist phylogenetic software for SARS-CoV-2 that produces reports or other outputs designed for direct and immediate use by IPC professionals. There will be cases in which phylogenetic analysis would provide information beyond that returned by the SRT, and the two approaches may be complementary to one another for outbreak investigation.

Comparison of SRT output to phylogenetic trees in a number of test cases suggested that some clusters of genetically similar cases identified within a specific ward likely represented more than one transmission event onto the ward from similar viral lineages circulating within the healthcare system. Whilst monophyletic clusters associated with a single location are easier to interpret, we consider the presence of viruses within a ward or hospital that are genetically similar to a HOCI as evidence for nosocomial infection even when they are not plausible transmission sources themselves, given the potential for asymptomatic transmission (He et al., 2020; Rivett et al., 2020; Oran and Topol, 2020; Lucey et al., 2021) and complex transmission networks (Meredith et al., 2020).

The SRT uses a number of heuristic approximations in order to provide an integrated summary of epidemiological and sequence data. However, this choice is associated with the limitation that it does not provide a full probabilistic model of potential transmission networks. Further development of the SRT would also aim to more fully incorporate patient movement data and shift locations for HCWs.

We believe that collaboration between methodologists, virologists, IPC clinicians and software engineers is essential in order to create workflows and reporting systems that will enable the routine use of pathogen sequence data for IPC. The SRT represents such a collaboration, and it has been designed to enable automation of the linkage and processing of viral sequence and patient metadata and subsequent feedback of relevant information to IPC staff. The automated feedback provided by the SRT is nonetheless dependent on timely sequencing of a high proportion of viral samples from cases within the hospital concerned, ideally in combination with sequences also available from community-sampled cases. In the UK this has been possible through the national COG-UK project (COVID-19 Genomics UK (COG-UK), 2020). Denmark has also implemented high population-coverage sequencing of SARS-CoV-2 (Bager et al., 2021), but this is not the case for most countries. The emergence and rapid dominance of lineage B.1.1.7 in the UK (Volz et al., 2021) has provided a case study for the impact of national-level genomic surveillance, but further evidence is required to determine whether rapid sequencing is worth the necessary investment for routine use within IPC practice. This judgement would also be dependent on the available health infrastructure and resources at both the local and national levels.

Prospective evaluation of the SRT is currently underway within a multicentre study in the UK (Blackstone et al., 2021). This study and its accompanying research programme will evaluate the impact of routine viral sequencing and use of the SRT on IPC knowledge, actions and outcomes, and will include quantitative, qualitative (Flowers et al., 2021) and health economic analyses to help guide the future development of pathogen genomics for IPC.

Our novel approach to the investigation of HOCIs has shown promising characteristics on retrospective application to two clinical datasets. The SRT described allows rapid feedback on HOCIs that integrates epidemiological and sequencing data to generate a simplified report at the time that sequence data become available. Prospective evaluation is required in order to recommend use of the SRT in clinical practice, and this work is ongoing. The methodology has been developed for hospital inpatients, but the principles may also be applicable to other settings.

Appendix 1

Methods

Details of sequencing protocols

Glasgow

Sequencing with ONT followed the protocols developed by the ARTIC network (v1 and v2) https://artic.network/ncov-2019. The reads were aligned to the reference strain (MN908947) using minimap2 (https://doi.org/10.1093/bioinformatics/bty191) and denoised using nanopolish (https://www.nature.com/articles/nmeth.3444) prior to primer trimming and consensus calling with iVar using a minimum depth of 20 reads (https://doi.org/10.1186/s13059-018-1618-7). Sequencing with Illumina also used the ARTIC network protocol for amplicon generation but was followed by a DNA KAPA library preparation kit (Roche) and indexing with NEBNext multiplex oligos (NEB) using seven PCR cycles. Libraries were pooled and loaded on a MiSeqV2 cartridge. Illumina reads were processed with the PrimalAlign pipeline (Orton, 2021). Briefly, reads were trimmed using trim_galore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/) aligned to the reference using BWA (10.1093/bioinformatics/btp698). Then, amplicon primers were removed and the consensus called with a read depth of 10 using iVar (https://doi.org/10.1186/s13059-018-1618-7). Metadata associated with each sample was collated in a redcap database (https://www.project-redcap.org/).

Sheffield

Sequencing with ONT followed the protocols developed by the ARTIC network (v1 and v2) (https://artic.network/ncov-2019). Following base calling, data were demultiplexed using ONT Guppy using a high-accuracy model. Reads were filtered based on quality and length (400–700 bp), then mapped to the Wuhan reference genome and primer sites trimmed. Reads were then downsampled to 200× coverage in each direction. Variants were called using nanopolish (Simpson, 2021) and used to determine changes from the reference. Consensus sequences were constructed using reference and variants called.

Further details of reference set definitions

Data sources for algorithm

There are two potential sources of data for the HOCI classification algorithm. Firstly, there are institution-sampled sequences: these include all viral sequences from samples obtained within the institution/hospital. These sequences are linked to metadata providing basic information regarding the patient concerned and details of the sample from which the sequence was obtained. Secondly, there are community-sampled sequences: these include all relevant sequences obtained from samples from testing within the local community. These sequences are associated with a more limited set of linked metadata describing date of sample, residential outer postcode of subject and place of work if they are recorded as being a HCW.

Unit reference set

This dataset comprises all institution sequences sampled on or ≤3 weeks prior to (or ≤2 days after for the prospective version of the SRT) the sample date of the focus sequence and for which both the institution and the unit are the same as that for the focus sequence.

Institution reference set

This dataset comprises firstly all institution-sampled sequences from HCWs, outpatients and inpatients diagnosed >48 hr after admission for which the institution matches that of the focus sequence sampled on or ≤3 weeks prior to (or ≤2 days after for the prospective version of the SRT) the sample date of the focus sequence and for which the unit is either not the same as that for the focus sequence or is missing. Secondly, the dataset includes all institution-sampled sequences from A and E patients or inpatients diagnosed ≤2 days after admission for which the institutionID matches that of the focus sequence sampled between (inclusively) 3 weeks and 3 days prior to the sample date of the focus sequence and for which the unit is either not the same as that for the focus sequence or is missing. Thirdly, this dataset also includes the subset of community-sampled sequences of HCWs at the same institution as the focus sequence.

Community reference set

This dataset comprises firstly all community-sampled sequences sampled on or ≤6 weeks prior to (or ≤2 days after for the prospective version of the SRT) the sample date of the focus sequence. This dataset also includes institution-sampled sequences sampled on or ≤6 weeks prior to (or ≤2 days after for the prospective version of the SRT) the sample date of the focus sequence from all non-inpatient samples, and those inpatients for whom sample date and symptom onset date (if recorded) are both ≤2 days after the admission date.

Note that some institution-sampled sequences will contribute to both the community reference set and either the unit reference set or the institution reference set (e.g. outpatients sampled within 3 weeks prior to the focus sequence would be included in both the community reference set and the institution reference set). HCWs recorded among the community-sampled sequences within ≤3 weeks prior to the sample date of the focus sequence will also be included in both the community reference set and the institution reference set if their workplace matches the institution of the focus sequence.

Formulae for probability calculations

Posterior of unit-acquired infection (UI) = 

PpriorPuP(seq±2SNPs|UI)PpriorPuP(seq±2SNPs|UI)+PpriorPvP(seq±2SNPs|VI)+Pprior(1PuPv)P(seq±2SNPs|II)+(1Pprior)P(seq±2SNPs|CI)

Posterior of institution-acquired infection (II) = 

Pprior(1PuPv)P(seq±2SNPs|II)PpriorPuP(seq±2SNPs|UI)+PpriorPvP(seq±2SNPs|VI)+Pprior(1PuPv)P(seq±2SNPs|II)+(1Pprior)P(seq±2SNPs|CI)

Posterior of visitor-acquired infection (VI) = 

PpriorPvP(seq±2SNPs|VI)PpriorPuP(seq±2SNPs|UI)+PpriorPvP(seq±2SNPs|VI)+Pprior(1PuPv)P(seq±2SNPs|II)+(1Pprior)P(seq±2SNPs|CI)

Posterior of community-acquired infection (CI) = 

(1Pprior)P(seq±2SNPs|CI)PpriorPuP(seq±2SNPs|UI)+PpriorPvP(seq±2SNPs|VI)+Pprior(1PuPv)P(seq±2SNPs|II)+(1Pprior)P(seq±2SNPs|CI)

With terms defined as follows:

  • Pprior: prior probability of post-admission infection for each focus case, based on time interval from admission to date of symptom onset or first positive test

  • Pu: prior probability of UI given post-admission infection (set based on expert opinion)

  • Pv: prior probability of VI given post-admission infection (set based on expert opinion)

  • P(seq±2SNPs|infectionsource/location): probability of observing a similar sequence (within two SNPs) to that actually observed for each focus case conditional on each potential infection source/location (estimated from sequence reference sets)

When there is a close sequence match found in any of the defined reference sets, the posterior probability estimates for UI, II, VI and CI will always sum to 1. However, when there is no close sequence match in any of the reference sets the posterior probability calculations are not valid and the algorithm will return the prior probabilities for each potential source/location of infection.

Further details regarding sequence matching process

The ±2 SNP threshold for a close sequence match was initially based on reports of healthcare-associated outbreak events for which this was the maximum pairwise difference within clusters (Meredith: DOI:10.1101/2020.05.08.20095687 and Rockett: DOI: 10.1101/2020.04.19.048751). The outbreak events described included sequences with up to around 3 weeks between first and last samples. This SNP threshold is also supported by calculations using the overall mutation rate of SARS-CoV-2. If we take the average mutation rate of the virus to be 24 SNPs/year (Nextstrain value 24 June, https://nextstrain.org/ncov/global?l=clock), then assuming independent (Poisson distributed) mutation events, ignoring the chance of mutations occurring at the same position in the genome and using a fixed generation time of 5 days, then there is an approximate:

  • 72% chance of no new SNPs per generation

  • 24% chance of one new SNP per generation

  • 4% chance of two new SNPs per generation

  • 0.4% chance of three new SNPs per generation

A two SNP threshold would therefore be expected to identify close sequence matches between direct transmission pairs in a large majority of cases. Ambiguous nucleotide positions will be considered to match if there is an overlap in the possible values for the two sequences. ‘N’ values recorded in either the focus sequence or comparison sequence will be considered to be a match at that position.

Further details of prior probability calculations for post-admission infection

We calculate Pprior = F(t), where F() is the cumulative distribution function of a published log-normal distribution for incubation times (Lauer et al: doi:10.7326/M20-0504; μ = 1.621, σ = 0.418). For symptomatic HOCI cases, the IPC classifications recommended by PHE translate into the following value ranges for Pprior:

  • indeterminate HCAI: 0.11 (onset 3 days post-admission) to 0.78 (onset 7 days post-admission)

  • probable HCAI: 0.86 (onset 8 days post-admission) to 0.99 (onset 14 days post-admission)

  • definite HCAI: Pprior ≥ 0.995

For asymptomatic focus cases, we define our prior on the basis that some proportion of the cases detected will never become symptomatic (Pa) with the remainder going on to develop symptoms within the next few days (1 - Pa). We then define our prior probability of post-admission infection in these cases as

Pprior=(1-Pa)*F(t+c)+Pa*F(t)

where t is the interval from admissionDate to sampleDate, and c is a constant reflecting the average interval within which we expect symptoms to appear (among those cases in which they do). Pa is set at 0.4 based on the findings of a published review article (Oran and Topol: 10.7326/M20-3012), and c is set to three based on a combination of expert opinion of the study PIs, the known distribution of time from infection to symptom onset and expert experience of asymptomatic screening.

Source given post-admission infection

The model requires prior values for the probability of UI and VI given post-admission infection: Pu and Pv, respectively. However, in specifying the model we define Pu as the probability of UI given post-admission infection when there are no visitors allowed on the ward, in which case the probability of VI is 0 and Pv’ = 0. If visitors are allowed on the ward for the focus case, then we set P= Pu’ × (1 – Pv).

Based on expert opinion of the clinical co-authors, Pu is set to different values according to the unit/ward type of the focus sequence with single-bed wards having a lower prior probability of unit post-admission infection than bay wards: 0.5 for single-bed wards and 0.7 for bay wards. We assumed a Pv of 0.2. The Pu values (when visitors are allowed) are therefore 0.4 for single-bed wards and 0.56 for bay wards. The largest of the three Glasgow hospitals included comprises single-room wards, whilst the other two and the Sheffield site comprise bay wards.

Derivation of prior probability for post-admission infection

If we assume a uniform individual-level hazard (λ) of infection from 1 February 2020 (t0), whether in hospital or not, then the probability density function (PDF) of infection at time tinf from this date is λe^(-λtinf). The PDF of infection at time tinf conditional on this occurring at any point prior to the date of symptom onset (tonset) is (λe^(-λtinf))/(1-e^(-λtpos)), which is approximately 1/tonset for small λ (taking the limit as λ->0). For HOCI cases, we are interested in whether tinf occurred before or after the time of admission to hospital (tadm). Also considering the evidence provided by the known incubation time of the disease (PDF f and CDF F), we integrate over the range of possible infection dates:

P(tadmtinf|tinftonset,Tonset=tonset)=[tadmtonsetf(tonsetx)/tonset.dx]/[0tonsetf(tonsetx)/tonset.dx][tadmtonsetf(tonsetx)/tonset.dx]/[tonsetf(tonsetx)/tonset.dx]=[tonsettadm0f(u)/tonset.du]/(1/tonset)=0tonsettadmf(u).du=F(tonsettadm)

Geographic weighting for community reference set

Geographic weighting function

The weight of each sequence within the community reference set is determined by geographic distance from the residential outer postcode of the focus case using a function of the form

weight=(1β)exp(τcommunityDistanceToIndex[i])+β,

where β takes a value between 0 and 1, and τ > 0. These parameters are set based on calibration to the available community reference set at each site. The rationale for this weighting is that there is likely to be geographic clustering of viral lineages, and so newly observed community transmissions of SARS-CoV-2 are more likely to show genetic similarity to past sequences from the local area of that individual’s home than to past sequences from regions that are further away. If postcode is missing for a case in the community reference set, then distance to the focus sequence is set to 100 km.

Statistical model for derivation of geographic weighting parameters

The statistical model for geographic weighting is fitted separately for each study site using sequences which are strongly thought to represent community-acquired infection: all community-sampled sequences and patients presenting to A and E with COVID-19, excluding those who are recorded as being HCWss or who do not have an available valid outer postcode. We will refer to these sequences as the ‘calibration set’.

A statistical model is constructed to find the optimal values of β and τ to maximise the estimated probability (Psim:i) of a newly observed community-acquired case having a similar sequence (±2 SNPs) to that observed for each sequence in the calibration set. The estimated probability in each case within the calibration set is calculated as a weighted sum of ‘close match’ indicator variables for all other sequences in the calibration set sample from 6 weeks prior up until the sample date of that case, with the weighting function defined in terms of geographic distance between residential outer postcodes and the β and τ parameters as described for the community reference set.

An overall log-likelihood function is defined using a Bernoulli distribution for each of the n sequences within the calibration set:

l=i=1nlog(Psim:i)

The values of β and τ that maximise l were obtained for each of the study sites using the ‘bbmle’ package for R, with logit-parameterisation of β and log-parameterisation of τ.

We assume that the probability of a sequence match conditional on infection from visitor on unit/ward can be calculated using the same weighting scheme as for the probability of a sequence match conditional on community-acquired infection (i.e. P(seq±2 SNPs|CI)==P(seq±2 SNPs|VI)).

Additional matching on ward location history

There is the potential for the algorithm described to return large numbers of close sequences matches with the hospital as a whole, which may make it difficult for IPC teams to use the output to direct their investigations when there are no potential sources of infection identified on the same ward as the focus case. We propose a location matching procedure in order to highlight the most relevant sequence matches for further investigation. This process does not currently form part of the statistical model, meaning that it can be treated as optional functionality for the SRT in the COG-UK HOCI study, and we have restricted the input data to a simplified format in order to minimise data management requirements.

For each inpatient sample in the input metadata for the algorithm, we specify a single string variable comprising the concatenated names of any ward locations in the ≤14 days prior to the sample date and a separate string variable with any ward locations in the ≤14 days after the sample date. For each focus case submitted to the algorithm, output is flagged if there is any match identified between the wards listed in each of these fields or the ward at time of sampling for a close sequence match in comparison to the prior and current ward locations for the focus sequence (excluding those cases where there is already matching ward location at time of sampling for each).

Details of phylogenetic methods

Phylogenies were produced by the grapevine pipeline (https://github.com/COG-UK/grapevine) as part of the COG-UK Consortium (https://www.cogconsortium.uk). Briefly, sequences from GISAID and those produced as part of the COG-UK Consortium are independently quality controlled and aligned to the Wuhan reference using minimap2 (https://doi.org/10.1093/bioinformatics/bty191). The two alignments are then combined, the homoplasy at site 11083 is masked and the tree is reconstructed using FastTreeMP (http://www.microbesonline.org/fasttree/). For each of the hospitals of interest, the tree is pruned to keep sequences from Scotland or Yorkshire (as relevant) and by date excluding sequences subsequent to the last ‘focus’ patient sample date on the ward.

Details of SRT report format

The SRT system for prospective use needs to provide useful and appropriate feedback in both low-incidence and high-incidence settings for new HOCI cases. This is planned through the generation of a concise one-page PDF summary report for each focus sequence. This summary report contains key focus sequence metadata, information regarding the estimated probabilities for infection source and details of up to 10 close sequence matches identified within the same unit/ward and/or elsewhere in the hospital.

Probability summary categories

The sequence matching and probability score algorithm generates probability estimates for the source of infection for the focus patient being from the current unit/ward, from elsewhere in the hospital, from the community (pre-admission) or from a visitor. These probability estimates always sum to 1. In the summary report, probability estimates for each source of infection are categorised using the following levels:

  • 0–30%: low

  • 30–50%: moderately low

  • 50–70%: probable

  • 70–85%: high

  • 85–100%: very high

For clarity of presentation and communication, probability categories will not always be displayed in the summary report for all four potential sources of infection (i.e. ward/unit, elsewhere in hospital, visitor or community). Special handling rules for specific situations are described below.

Close sequence matches within the same unit and/or hospital

The maximum number of close sequence matches that can be listed on the one-page summary report is 10 (for the combined sum of unit-level and institution-level matches). If the number of ward-level matches is n > 5 and the total number of close sequence matches is N > 10, then the number of ward-level matches is truncated at 5 + max((5 – (N – n)),0). If there are over 10 close sequence matches in total, then the following message is displayed ‘Over 10 close matches; see detailed report for further information’.

Within each set of unit-level and institution-level close sequence matches, ordering and priority for inclusion within the available slots is determined by the following set of criteria (in decreasing order of importance):

  1. Number of SNPs relative to Wuhan strain present in comparison sequence but absent in focus sequence (fewer = higher priority)

  2. Number of SNPs relative to Wuhan strain present in focus sequence but absent in comparison sequence (fewer = higher priority)

  3. Whether comparison sequence is from a HCW (HCWs listed first)

  4. HCAI status of comparison sequence (priority order: definite, probable, indeterminate, otherwise)

  5. Samples from the past before samples in future

  6. Samples from within the 2 weeks prior to focus sequence sample date before others

  7. Number of units overlapping with focus sample's units

Report messages for specific output combinations

No close sequence matches on unit/ward

If there are no close sequence matches to the focus sequence on their current unit/ward, then no probability category is reported for this potential infection source (the algorithm returns a zero probability in such cases, which could be misleading given uncertainty over screening and sequencing coverage). The message ‘No matches from within unit’ is displayed. The probability score category for infection from elsewhere in the hospital is provided in such cases.

No close sequence matches elsewhere in hospital

If there are no close sequence matches to the focus sequence elsewhere in the hospital, then no probability category is reported for this potential infection source. The message ‘No matches elsewhere in hospital’ is displayed.

No evidence of transmission within unit or hospital for probable or definite HCAI

If the estimated probability of community-acquired infection from the algorithm is >50%, but the interval from admission to symptom onset (if recorded) or sample date is ≥8 days, then the following message is displayed in place of the estimated probability of community-acquired infection: ‘This is a probable/definite HCAI based on admission date, but we have not found genetic evidence of transmission within the hospital’.

Probable unit- or hospital-acquired infection with source unclear

If the posterior probability of unit-acquired infection and the posterior probability of infection from a source elsewhere in the hospital are each estimated to be <50%, but the sum of these two posterior probabilities is ≥50%, then the following message is displayed: ‘Overall, this is a probable unit- or institution-acquired infection with source unclear’.

Timeline graph

The timeline graph provides a visual representation of available sequences from the same unit/ward and the same institution/hospital as the focus sequence in the period from 3 weeks prior to their sample date to 1 week after. The key indicates which sequences are close matches to the focus sequence, and the numbering corresponds to that in the tabular summary of most relevant close sequence matches.

Sequencing prioritisation for prospective use of the SRT

The SRT algorithm was initially designed for use with comprehensive sequencing of all SARS-CoV-2 cases within a hospital, in combination with representative sequencing of community-sampled cases. However, it may be difficult to achieve high population sequencing coverage in some situations, such as if there is a sudden surge in new admissions to the hospital and/or in new HOCI cases. In such scenarios, we have recommended the following prioritisation of samples (from highest to lowest) for sequencing within the prospective HOCI study (https://clinicaltrials.gov/ct2/show/NCT04405934):

  1. HOCI cases

  2. SARS-CoV-2 +ve patients on wards where there is a HOCI case

  3. HCWs with known contact with HOCI cases

  4. Other HCWs

  5. SARS-CoV-2 +ve patients admitted to any other wards

  6. SARS-CoV-2 +ve patients attending for acute care (e.g. Accident and Emergency) but not admitted

These prioritisation rules are guided by the following rationale:

Most probable and definite HCAIs (based on time from admission) show a close sequence match to at least one other case on the same ward, so sequencing of HOCI cases and any cases on the same ward would be enough to identify these links.

  • Links between ward outbreaks will be of particular importance to IPC investigations and would be identified with sequencing focused on HOCI cases.

  • The probability calculations within the SRT are most important for indeterminate HCAIs, and where there is no sequence match on the same ward the estimated probability of nosocomial infection is <50% in the majority of such cases (36/38 for the Sheffield dataset). The probability estimates for indeterminate HCAIs should be interpreted with caution where overall sequencing coverage is poor, but SRT results are unlikely to lead to inappropriate changes to standard IPC actions if groups ‘1’, ‘2’ and ‘3’ have been sequenced.

  • Where there is a complete lack of close sequence matches within the hospital for probable or definite HCAIs, the SRT returns the message that there is a lack of available genetic evidence for linkage (but not that nosocomial infection is unlikely).

Following from this reasoning, we feel that useful information would be returned by the SRT as long as high sequencing coverage is achieved for groups ‘1’, ‘2’ and ‘3’. High sequencing coverage of groups ‘4’, ‘5’ and ‘6’ would allow the SRT to identify potential links between cases that would likely be missed by standard IPC investigations.

For indeterminate HCAIs with no close sequence matches on the same ward, an inaccurate ‘zero’ posterior probability of post-admission infection will be returned if one or more similar sequences are found in the community reference set but no similar sequences are observed in the institution reference set with imperfect sequencing coverage. This is likely to be a more important issue in the setting of low SARS-CoV-2 incidence.

For example, if there are 40 cases that could be included in the institution reference set for a focus sequence and 2 of these (5%) would be a close sequence match, then we would need to sequence at least 31/40 (77.5%) in order to have ≥95% probability of observing at least one of the close sequence matches. However, if there are 200 cases that could be included in the institution reference set and 10 of these (5%) would be a close sequence match, then we would need to sequence at least 51/200 (25.5%) in order to have ≥95% probability of observing at least one of the close sequence matches.

A similar relationship would also be observed if we consider a rarer sequence type. If there are 40 cases that could be included in the institution reference set for a focus sequence and 1 of these (2.5%) would be a close sequence match, then we would need to sequence at least 38/40 (95%) in order to have ≥95% probability of observing the one close sequence match. However, if there are 200 cases that could be included in the institution reference set and 5 of these (2.5%) would be a close sequence match, then we would need to sequence at least 90/200 (45%) in order to have ≥95% probability of observing at least one of the close sequence matches.

On this basis, we believe that the goal of close to 100% sequencing coverage should be pursued in the setting of low incidence of SARS-CoV-2, but that overall sequencing coverage of 50% or more may be sufficient in the event that a high incidence of SARS-CoV-2 leads to too great a case load for available sequencing resources.

Results

Sequencing coverage in Glasgow dataset

Appendix 1—figure 1
Proportion of cases sequenced in Greater Glasgow and Clyde Health Board between 1 March and 27 May (with sequence available as of 23 June 2020) by location of test (A).

Also displayed are the proportion of sequenced cases in the three focus hospitals subdivided by assessment and inpatient locations (B), and the proportion of hospital onset COVID-19 infection (HOCI) cases sequenced at these hospitals (C).

Home residence locations and geographic model parameters

Appendix 1—figure 2
Home residence location of individuals in (a) the Glasgow dataset and (b) the Sheffield dataset, displayed by sample source (not including healthcare workers [HCWs]).

Locations are analysed using only the outer postcode, and as such random jitter (within longitude and latitude of 0.05) has been added to allow display without overlap of points. Plot created using ggmap for R with map obtained from Stamen maps. For Glasgow, 766 cases were included in the calibration set with estimates of τ = 0.15 and β = 0.0 for the geographic clustering model, whilst for Sheffield 446 cases were included in the calibration set with resulting estimates of τ = 0.84 and β = 0.16.

SNP distance distributions

For the Glasgow sequence dataset as a whole, the median pairwise SNP difference among all sequences was 9, and there were 1.3, 3.4, 6.4 and 10.1% of pairwise comparisons with 0, ≤1, ≤2 and ≤3 SNP differences, respectively. For the Sheffield dataset as a whole, the median pairwise SNP difference among all sequences was 8, and there were 1.2, 3.3, 6.5 and 10.8% of pairwise comparisons with 0, ≤1, ≤2 and ≤3 SNP differences, respectively.

Appendix 1—figure 3
Frequency plot of all pairwise SNP differences among (a) all 1199 sequences in the Glasgow dataset and (b) all 1629 analysed sequences in the Sheffield dataset.

Additional case study

Appendix 1—figure 4 shows a phylogenetic tree indicating complex transmission networks across multiple hospitals in the Glasgow area (with SRT outputs for Hospitals 2 and 4). A monophyletic cluster of HOCIs can be seen in Hospital 2 Unit 48, with the first detected case identified by the SRT as a hospital-acquired and the rest unit-acquired infections. A paraphyletic group of HOCIs was detected in Hospital 4 Unit 69. Patient 1 (UID0042) was screened for COVID in Unit 69 on 14 April 2020 after developing a cough and oxygen requirement. The patient was moved from the nightingale area to a single room on the ward on 14 April 2020 and was confirmed positive on 15 April 2020.

Appendix 1—figure 4
Maximum-likelihood tree for sequences found in Hospital 2 Unit 48 and Hospital 4 Unit 69 of the Glasgow dataset up until the 21st of April (inclusive).

The circles with numbers represent the number of community sequences that are identical and at the base of each lineage (n = 5, n = 35, n = 4). Tree tips with black circles represent further community sequences. The black lines represent the time from admission to sampling. The values below the line are the posterior probability for unit infection + the posterior probability of hospital infection from the sequence reporting tool.

On 20 April 20, a second patient on Unit 69 (not sequenced) was screened after developing a cough and pyrexia and confirmed positive on 21 April 2020. The patient was in a single room at the time of symptom onset; however, they had been in the main nightingale ward opposite patient 1 for 5 days. At this point 13 asymptomatic contacts in Unit 69 were screened, and 8 (UID0043, UID0073, UID0041, UID0095, UID0116, UID0094, UID0083, UID0121) were positive. These cases are all identified as hospital-acquired or unit-acquired infections and can be grouped into a genetically similar cluster with a maximum pairwise distance of two SNPs between each member and its nearest neighbour. However, this cluster clearly represents multiple introductions of SARS-CoV-2 onto the ward.

Examples of SRT reports

Appendix 1—figure 5
Example of sequence reporting tool output with estimated very highly probable infection within unit.
Appendix 1—figure 6
Example of sequence reporting tool output with estimated probable infection within hospital.

Data availability

The sequence data analysed are included within publicly available datasets (https://www.cogconsortium.uk/data/), and a list of the relevant sequence identification codes is provided (Supplementary File 1). Due to data governance restrictions related to individual patient data linked to genetic sequences it is not possible to publicly share the associated meta-data. Requests for access to the data can be made by submission of a research proposal to the COG-UK Steering Committee (contact@cogconsortium.uk).

The following previously published data sets were used
    1. The COVID-19 Genomics UK (COG-UK) consortium
    (2020) Trimmed and masked alignment
    ID Trimmed and masked alignment. UK SARS-CoV-2 sequence data.

References

    1. Bager P
    2. Wohlfahrt J
    3. Fonager J
    4. Albertsen M
    5. Yssing Michaelsen T
    6. Holten Møller C
    7. Ethelberg S
    8. Legarth R
    9. Button MSF
    10. Gubbels S
    11. Voldstedlund M
    12. Mølbak K
    13. Skov RL
    14. Fomsgaard A
    15. Krause TG
    (2021)
    Increased risk of hospitalisation associated with infection with SARS-CoV-2 lineage B. 1.1. 7 in Denmark
    The Lancet Infectious Diseases 1:00290-5.
    1. Houlihan CF
    2. Vora N
    3. Byrne T
    4. Lewer D
    5. Kelly G
    6. Heaney J
    7. Gandhi S
    8. Spyer MJ
    9. Beale R
    10. Cherepanov P
    11. Moore D
    12. Gilson R
    13. Gamblin S
    14. Kassiotis G
    15. McCoy LE
    16. Swanton C
    17. Hayward A
    18. Nastouli E
    19. Aitken J
    20. Allen Z
    21. Ambler R
    22. Ambrose K
    23. Ashton E
    24. Avola A
    25. Balakrishnan S
    26. Barns-Jenkins C
    27. Barr G
    28. Barrell S
    29. Basu S
    30. Beale R
    31. Beesley C
    32. Bhardwaj N
    33. Bibi S
    34. Bineva-Todd G
    35. Biswas D
    36. Blackman MJ
    37. Bonnet D
    38. Bowker F
    39. Broncel M
    40. Brooks C
    41. Buck MD
    42. Buckton A
    43. Budd T
    44. Burrell A
    45. Busby L
    46. Bussi C
    47. Butterworth S
    48. Byrne F
    49. Byrne R
    50. Caidan S
    51. Campbell J
    52. Canton J
    53. Cardoso A
    54. Carter N
    55. Carvalho L
    56. Carzaniga R
    57. Chandler N
    58. Chen Q
    59. Cherepanov P
    60. Churchward L
    61. Clark G
    62. Clayton B
    63. Cobolli Gigli C
    64. Collins Z
    65. Cottrell S
    66. Crawford M
    67. Cubitt L
    68. Cullup T
    69. Davies H
    70. Davis P
    71. Davison D
    72. D'Avola A
    73. Dearing V
    74. Debaisieux S
    75. Diaz-Romero M
    76. Dibbs A
    77. Diring J
    78. Driscoll PC
    79. Earl C
    80. Edwards A
    81. Ekin C
    82. Evangelopoulos D
    83. Faraway R
    84. Fearns A
    85. Ferron A
    86. Fidanis E
    87. Fitz D
    88. Fleming J
    89. Frederico B
    90. Gaiba A
    91. Gait A
    92. Gamblin S
    93. Gandhi S
    94. Gaul L
    95. Golding HM
    96. Goldman J
    97. Goldstone R
    98. Gomez Dominguez B
    99. Gong H
    100. Grant PR
    101. Greco M
    102. Grobler M
    103. Guedan A
    104. Gutierrez MG
    105. Hackett F
    106. Hall R
    107. Halldorsson S
    108. Harris S
    109. Hashim S
    110. Healy L
    111. Heaney J
    112. Herbst S
    113. Hewitt G
    114. Higgins T
    115. Hindmarsh S
    116. Hirani R
    117. Hope J
    118. Horton E
    119. Hoskins B
    120. Houlihan CF
    121. Howell M
    122. Howitt L
    123. Hoyle J
    124. Htun MR
    125. Hubank M
    126. Huerga Encabo H
    127. Hughes D
    128. Hughes J
    129. Huseynova A
    130. Hwang M-S
    131. Instrell R
    132. Jackson D
    133. Jamal-Hanjani M
    134. Jenkins L
    135. Jiang M
    136. Johnson M
    137. Jones L
    138. Kanu N
    139. Kassiotis G
    140. Kiely L
    141. King Spert Teixeira A
    142. Kirk S
    143. Kjaer S
    144. Knuepfer E
    145. Komarov N
    146. Kotzampaltiris P
    147. Kousis K
    148. Krylova T
    149. Kucharska A
    150. Labrum R
    151. Lambe C
    152. Lappin M
    153. Lee S-A
    154. Levett A
    155. Levett L
    156. Levi M
    157. Liu H-W
    158. Loughlin S
    159. Lu W-T
    160. MacRae JI
    161. Madoo A
    162. Marczak JA
    163. Martensson M
    164. Martinez T
    165. Marzook B
    166. Matthews J
    167. Matz JM
    168. McCall S
    169. McCoy LE
    170. McKay F
    171. McNamara EC
    172. Minutti CM
    173. Mistry G
    174. Molina-Arcas M
    175. Montaner B
    176. Montgomery K
    177. Moore C
    178. Moore D
    179. Moraiti A
    180. Moreira-Teixeira L
    181. Mukherjee J
    182. Naceur-Lombardelli C
    183. Nastouli E
    184. Nelson A
    185. Nicod J
    186. Nightingale L
    187. Nofal S
    188. Nurse P
    189. Nutan S
    190. Oedekoven C
    191. O'Garra A
    192. O'Leary JD
    193. Olsen J
    194. O'Neill O
    195. Ordonez Suarez P
    196. O'Reilly N
    197. Osborne N
    198. Pabari A
    199. Pajak A
    200. Papayannopoulos V
    201. Patel N
    202. Patel Y
    203. Paun O
    204. Peat N
    205. Peces-Barba Castano L
    206. Perez Caballero A
    207. Perez-Lloret J
    208. Perrault MS
    209. Perrin A
    210. Poh R
    211. Poirier EZ
    212. Polke JM
    213. Pollitt M
    214. Prieto-Godino L
    215. Proust A
    216. Shah Punatar R
    217. Puvirajasinghe C
    218. Queval C
    219. Ramachandran V
    220. Ramaprasad A
    221. Ratcliffe P
    222. Reed L
    223. Reis e Sousa C
    224. Richardson K
    225. Ridewood S
    226. Roberts R
    227. Rodgers A
    228. Romero Clavijo P
    229. Rosa A
    230. Rossi A
    231. Roustan C
    232. Rowan A
    233. Sahai E
    234. Sait A
    235. Sala K
    236. Sanderson T
    237. Santucci P
    238. Sardar F
    239. Sateriale A
    240. Saunders JA
    241. Sawyer C
    242. Schlott A
    243. Schweighoffer E
    244. Segura-Bayona S
    245. Shaw J
    246. Shin GY
    247. Silva Dos Santos M
    248. Silvestre M
    249. Singer M
    250. Snell DM
    251. Song O-R
    252. Spyer MJ
    253. Steel L
    254. Strange A
    255. Sullivan AE
    256. Swanton C
    257. Tan MSY
    258. Tautz-Davis ZH
    259. Taylor E
    260. Taylor G
    261. Taylor HB
    262. Taylor-Beadling A
    263. Teixeira Subtil F
    264. Terré Torras B
    265. Toolan-Kerr P
    266. Torelli F
    267. Toteva T
    268. Treeck M
    269. Trojer H
    270. Tsai M-HC
    271. Turner JMA
    272. Turner M
    273. Ule J
    274. Ulferts R
    275. Vanloo SP
    276. Veeriah S
    277. Venkatesan S
    278. Vousden K
    279. Wack A
    280. Walder C
    281. Walker PA
    282. Wang Y
    283. Ward S
    284. Wenman C
    285. Wiliams L
    286. Williams MJ
    287. Wong WK
    288. Wright J
    289. Wu M
    290. Wynne L
    291. Xiang Z
    292. Yap M
    293. Zagalak JA
    294. Zecchin D
    295. Zillwood R
    296. Matthews R
    297. Severn A
    298. Adam S
    299. Enfield L
    300. McBride A
    301. Gärtner K
    302. Edwards S
    303. Lorencatto F
    304. Michie S
    305. Manley E
    306. Shahmanesh M
    307. Lukha H
    308. Prymas P
    309. McBain H
    310. Shortman R
    311. Wood L
    312. Davies C
    313. Williams B
    314. Ng KW
    315. Cornish GH
    316. Faulkner N
    317. Riddell A
    318. Hobson P
    319. Agua-Doce A
    320. Bartolovic K
    321. Russell E
    322. Carr L
    323. Sanchez E
    324. Frampton D
    325. Byott M
    326. Paraskevopoulou SM
    327. Crayton E
    328. Meyer C
    329. Vora N
    330. Gkouleli T
    331. Stoltenberg A
    332. Ranieri V
    333. Byrne T
    334. Lewer D
    335. Hayward A
    336. Gilson R
    337. Kelly G
    338. Roberts F
    339. Hatipoglu E
    (2020) Pandemic peak SARS-CoV-2 infection and seroconversion rates in London frontline health-care workers
    The Lancet 396:e6–e7.
    https://doi.org/10.1016/S0140-6736(20)31484-7
    1. Lucey M
    2. Macori G
    3. Mullane N
    4. Sutton-Fitzpatrick U
    5. Gonzalez G
    6. Coughlan S
    7. Purcell A
    8. Fenelon L
    9. Fanning S
    10. Schaffer K
    (2021)
    Whole-genome Sequencing to Track Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) Transmission in Nosocomial Outbreaks
    Clinical Infectious Diseases 72:e727–e735.
    1. Rivett L
    2. Sridhar S
    3. Sparkes D
    4. Routledge M
    5. Jones NK
    6. Forrest S
    7. Young J
    8. Pereira-Dias J
    9. Hamilton WL
    10. Ferris M
    11. Torok ME
    12. Meredith L
    13. Gupta R
    14. Lyons PA
    15. Toshner M
    16. Warne B
    17. Bartholdson Scott J
    18. Cormie C
    19. Gill H
    20. Kean I
    21. Maes M
    22. Reynolds N
    23. Wantoch M
    24. Caddy S
    25. Caller L
    26. Feltwell T
    27. Hall G
    28. Hosmillo M
    29. Houldcroft C
    30. Jahun A
    31. Khokhar F
    32. Yakovleva A
    33. Butcher H
    34. Caputo D
    35. Clapham-Riley D
    36. Dolling H
    37. Furlong A
    38. Graves B
    39. Gresley EL
    40. Kingston N
    41. Papadia S
    42. Stark H
    43. Stirrups KE
    44. Webster J
    45. Calder J
    46. Harris J
    47. Hewitt S
    48. Kennet J
    49. Meadows A
    50. Rastall R
    51. Brien CO
    52. Price J
    53. Publico C
    54. Rowlands J
    55. Ruffolo V
    56. Tordesillas H
    57. Brookes K
    58. Canna L
    59. Cruz I
    60. Dempsey K
    61. Elmer A
    62. Escoffery N
    63. Jones H
    64. Ribeiro C
    65. Saunders C
    66. Wright A
    67. Nyagumbo R
    68. Roberts A
    69. Bucke A
    70. Hargreaves S
    71. Johnson D
    72. Narcorda A
    73. Read D
    74. Sparke C
    75. Warboys L
    76. Lagadu K
    77. Mactavous L
    78. Gould T
    79. Raine T
    80. Mather C
    81. Ramenatte N
    82. Vallier A-L
    83. Kasanicki M
    84. Eames P-J
    85. McNicholas C
    86. Thake L
    87. Bartholomew N
    88. Brown N
    89. Parmar S
    90. Zhang H
    91. Bowring A
    92. Martell G
    93. Quinnell N
    94. Wright J
    95. Murphy H
    96. Dunmore BJ
    97. Legchenko E
    98. Gräf S
    99. Huang C
    100. Hodgson J
    101. Hunter K
    102. Martin J
    103. Mescia F
    104. O'Donnell C
    105. Pointon L
    106. Shih J
    107. Sutcliffe R
    108. Tilly T
    109. Tong Z
    110. Treacy C
    111. Wood J
    112. Bergamaschi L
    113. Betancourt A
    114. Bowyer G
    115. De Sa A
    116. Epping M
    117. Hinch A
    118. Huhn O
    119. Jarvis I
    120. Lewis D
    121. Marsden J
    122. McCallum S
    123. Nice F
    124. Curran MD
    125. Fuller S
    126. Chaudhry A
    127. Shaw A
    128. Samworth RJ
    129. Bradley JR
    130. Dougan G
    131. Smith KGC
    132. Lehner PJ
    133. Matheson NJ
    134. Wright G
    135. Goodfellow IG
    136. Baker S
    137. Weekes MP
    (2020) Screening of healthcare workers for SARS-CoV-2 highlights the role of asymptomatic carriage in COVID-19 transmission
    eLife 9:e58728.
    https://doi.org/10.7554/eLife.58728

Decision letter

  1. Albert Osterhaus
    Reviewing Editor; University of Veterinary Medicine Hannover, Germany
  2. Jos WM van der Meer
    Senior Editor; Radboud University Medical Centre, Netherlands
  3. Albert Osterhaus
    Reviewer; University of Veterinary Medicine Hannover, Germany

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

This paper by Stirrup and colleagues describes a methodology that can be useful for for real-time investigation of COVID-19 outbreaks and SARS-CoV-2 infection prevention by infectious disease control teams in the hospital setting. It combines epidemiological and viral sequencing data – rather than only epidemiological data – to identify and investigate the infection source of hospital onset SARS-CoV-2 infections. The challenge for the use of this methodology is to secure the timely availability of the sequence data and requires close collaboration between methodologists, virologists, infectious disease clinicians and software engineers, to collectively create the appropriate workflows and reporting systems.

Decision letter after peer review:

Thank you for submitting your article "Rapid feedback on hospital onset SARS-CoV-2 infections combining epidemiological and sequencing data" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, including Albert Osterhaus as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Jos van der Meer as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

The method was validated retrospectively with data from two UK city hospital settings and is eventually meant to be used for future use for the identification, prevention and control of healthcare-associated SARS-CoV-2 infections in hospitals.

Indeed, there is an urgent need for tools that can synthesise multiple data streams to provide real time information to healthcare professionals. It is questionable to what extent the tool presented is generalisable to medical facilities outside of the specific data rich settings considered here, or if the tool is useful for prospective analyses.

Concerning the question to what extent the tool presented is generalisable to medical facilities. the papers' claims are largely supported by the data and may be useful for SARS-CoV-2 infection prevention and control teams, provided they effectively have access to, and can implement both elements of the system: evaluation of state-of-the art epidemiological data and of phylogenetic sequence data based on local and hospital SARS-CoV-2 genomes as determined within 24-48 hours by the COVID-19 Genomics (COG) UK initiative. Therefore this study would rather be of interest to specialists working in hospital infection prevention, with more limited further interest.

Concerning the prospective analysis of prospective use cases described in the discussion and given the high burden of SAR-CoV-2 in the UK in late 2020, there should already be ample data available for their inclusion.

As the Discussion section briefly touches on the motivation for using a 2 SNP threshold for identifying "similar" viruses instead of phylogeny-informed approaches, it dismisses the latter saying that they are too intensive to be used outside of research-intensive settings. Although phylogenetic approaches would be more resource intensive, it would seem that the amount of sequencing and shortness of turnaround time required to make this algorithm useful similarly limit its utility to research-intensive settings. With this in mind, the substantially increased information associated with phylogenetic approaches would seem to be worth the increased resource costs particularly since the capabilities and quality of phylogenetic tools has increased substantially since the publication of the 2013 paper cited on line 326. Although extending this particular algorithm to include phylogenetic data is outside the scope of the study, the manuscript is unfairly dismissive of such approaches, especially when a location specific build of nextstrain can be created with a programming burden similar to that of the algorithm described in this study.

Given these considerations, the paper would be acceptable for publication if the following points would be addressed:

1) Analysis of the prospective use cases described in the discussion. This would seem essential for validating the utility of the algorithm.

2) Address specifically the limitations of the study that are related to the overall applicability of the methodology in settings with more limited access to state-of-the art epidemiological and phylogenetic sequence data, the latter based on timely availability of local and hospital SARS-CoV-2 genomes.

3) Although extending this particular algorithm to include phylogenetic data is outside the scope of this study, it should address the value of such approaches, especially when a location specific build of nextstrain can be created with a programming burden similar to that of the algorithm described in this study.

Reviewer #1 (Recommendations for the authors):

The science is based on a practical comparison of the existing PHE system and the newly established SRT, which is based on epidemiological and digested viral sequenced data. The criteria chosen are largely arbitrary but justified by practical considerations. The presentation of data and conclusions is clear. More attention could be paid to the overall applicability of the methodology (with or without modifications) in other geographical or demographic settings with different public health infrastructures and their limitations.

Tables, figures and supplementary material are well designed and informative

Reviewer #3 (Recommendations for the authors):

Foremost, this study could be substantially strengthened through the analysis of the prospective use cases described in the discussion. Given the high burden of SAR-CoV-2 in the UK in late 2020, there should already be ample data available for inclusion. This would seem essential for validating the utility of the algorithm.

It would also be very helpful to include some estimates on the circumstances under which the algorithm is likely to provide useful information. The UK is highly unusual in terms of its virus sequencing efforts. What proportion of test-positive cases need to be sequenced in order for the algorithm to provide reliable estimates of ongoing transmission under different outbreak scenarios? On a related subject, the impact of turnaround time from sample collection to receipt of sequence data should be evaluated. How fast does sequence data need to be able available for the algorithm to provide actionable information?

The Discussion section briefly touches on the motivation for using a 2 SNP threshold for identifying "similar" viruses instead of phylogeny-informed approaches but dismisses the latter saying that they are too intensive to be used outside of research-intensive settings. I agree that phylogenetic approaches would be more resource intensive, but it would seem that the amount of sequencing and shortness of turnaround time required to make this algorithm useful similarly limit its utility to research-intensive settings. With this in mind, the substantially increased information associated with phylogenetic approaches would seem to be worth the increased resource costs particularly since the capabilities and quality of phylogenetic tools has increased substantially since the publication of the 2013 paper cited on line 326. I recognise that extending this particular algorithm to include phylogenetic data is outside the scope of work for this project, but the manuscript is unfairly dismissive of such approaches, especially when a location specific build of nextstrain can created with a programming burden similar to that of the algorithm described in this study.

https://doi.org/10.7554/eLife.65828.sa1

Author response

Essential revisions:

The method was validated retrospectively with data from two UK city hospital settings and is eventually meant to be used for future use for the identification, prevention and control of healthcare-associated SARS-CoV-2 infections in hospitals.

Indeed, there is an urgent need for tools that can synthesise multiple data streams to provide real time information to healthcare professionals. It is questionable to what extent the tool presented is generalisable to medical facilities outside of the specific data rich settings considered here, or if the tool is useful for prospective analyses.

Concerning the question to what extent the tool presented is generalisable to medical facilities. the papers' claims are largely supported by the data and may be useful for SARS-CoV-2 infection prevention and control teams, provided they effectively have access to, and can implement both elements of the system: evaluation of state-of-the art epidemiological data and of phylogenetic sequence data based on local and hospital SARS-CoV-2 genomes as determined within 24-48 hours by the COVID-19 Genomics (COG) UK initiative. Therefore this study would rather be of interest to specialists working in hospital infection prevention, with more limited further interest.

Concerning the prospective analysis of prospective use cases described in the discussion and given the high burden of SAR-CoV-2 in the UK in late 2020, there should already be ample data available for their inclusion.

We agree that there remain open questions regarding the feasibility and utility of routine near real-time pathogen sequencing for Infection Prevention and Control (IPC), and regarding the factors that might affect this in different healthcare settings. As noted in our cover letter for this revised manuscript, we are in the process of undertaking a large-scale prospective evaluation of viral sequencing and use of the tool for SARS-CoV-2 IPC as part of a broader research program that will include measurement of its impact on the knowledge and actions of IPC teams. The prospective trial includes 14 National Health Service Trusts from across the UK, representing a broad range of settings – from university hospitals to a district general hospital – and the research outputs will include quantitative, qualitative and health economic evaluations of impact and feasibility.

We believe that automated tools to provide feedback on genetic and epidemiological data are needed in order to maximise the feasibility and utility of routine pathogen sequencing across a range of healthcare settings.

As the Discussion section briefly touches on the motivation for using a 2 SNP threshold for identifying "similar" viruses instead of phylogeny-informed approaches, it dismisses the latter saying that they are too intensive to be used outside of research-intensive settings. Although phylogenetic approaches would be more resource intensive, it would seem that the amount of sequencing and shortness of turnaround time required to make this algorithm useful similarly limit its utility to research-intensive settings. With this in mind, the substantially increased information associated with phylogenetic approaches would seem to be worth the increased resource costs particularly since the capabilities and quality of phylogenetic tools has increased substantially since the publication of the 2013 paper cited on line 326. Although extending this particular algorithm to include phylogenetic data is outside the scope of the study, the manuscript is unfairly dismissive of such approaches, especially when a location specific build of nextstrain can be created with a programming burden similar to that of the algorithm described in this study.

Thank you for these comments, we have adjusted the text of the Discussion to better reflect the nuances raised regarding computational and staff resource demands for phylogenetic investigation in comparison to our approach.

Despite substantial advancements in phylogenetic methodology and software since 2013, we feel that the statement “While a phylogenetic approach is useful in excluding direct transmission between cases, it can be more problematic to confirm transmission source” remains true. For example HIV has a much higher mutation rate than SARS-CoV-2 which allows for use of within-host viral diversity as an additional source of information for inference regarding ‘who infected whom’, but the level of evidence obtained is still often inconclusive at the individual level (e.g. as reported by Ratmann et al., doi.org/10.1038/s41467-019-09139-4).

Given these considerations, the paper would be acceptable for publication if the following points would be addressed:

1) Analysis of the prospective use cases described in the discussion. This would seem essential for validating the utility of the algorithm.

As discussed with the Editors and noted in the cover letter, we have not included additional prospectively collected data in the resubmitted manuscript. We agree that there is a need for robust prospective evaluation of viral sequencing for IPC and of this tool specifically. However, the present work constitutes a distinct element within a broader project, which we believe will form an important body of work regarding the routine use of viral sequencing for IPC. This has been clarified in a new paragraph in the Discussion:

“Prospective evaluation of the SRT is currently underway within a multicentre study in the UK[32][Protocol REF]. This study and its accompanying research program will evaluate the impact of routine viral sequencing and use of the SRT on IPC knowledge, actions and outcomes, and will include quantitative, qualitative[33] and health economic analyses to help guide the future development of pathogen genomics for IPC.”

2) Address specifically the limitations of the study that are related to the overall applicability of the methodology in settings with more limited access to state-of-the art epidemiological and phylogenetic sequence data, the latter based on timely availability of local and hospital SARS-CoV-2 genomes.

We have expanded on these issues as suggested in the Discussion:

“The automated feedback provided by the SRT is nonetheless dependent on timely sequencing of a high proportion of viral samples from cases within the hospital concerned, ideally in combination with sequences also available from community-sampled cases. […] This judgement also would be dependent on the available health infrastructure and resources at both the local and national levels.”

3) Although extending this particular algorithm to include phylogenetic data is outside the scope of this study, it should address the value of such approaches, especially when a location specific build of nextstrain can be created with a programming burden similar to that of the algorithm described in this study.

We have adjusted the text of the Discussion to better reflect the value of phylogenetic investigation as suggested:

“There have been advances in recent years in the computational efficiency and workflow standardisation possible for phylogenetic analyses that have made it easier to use these methods for real-time investigation of outbreaks, for example through the development of the Nextstrain project[28, 29]. […] There will be cases in which phylogenetic analysis would provide information beyond that returned by the SRT, and the two approaches may be complementary to one another for outbreak investigation.”

The following text has been deleted from the Discussion:

“However, fully integrated epidemiological and phylogenetic analysis of hospital outbreaks is resource-intensive, presenting challenges in delivering the rapid turnaround and scale-up required to provide clear feedback to hospital IPC teams outside of research-intensive settings.”

Reviewer #1 (Recommendations for the authors):

The science is based on a practical comparison of the existing PHE system and the newly established SRT, which is based on epidemiological and digested viral sequenced data. The criteria chosen are largely arbitrary but justified by practical considerations. The presentation of data and conclusions is clear. More attention could be paid to the overall applicability of the methodology (with or without modifications) in other geographical or demographic settings with different public health infrastructures and their limitations.

Tables, figures and supplementary material are well designed and informative.

We thank the reviewer for their comments on our work. As suggested, we have expanded the Discussion in relation to the broader applicability and limitations of the methodology developed:

“…further evidence is required to determine whether rapid sequencing is worth the necessary investment for routine use within IPC practice. This judgement also would be dependent on the available health infrastructure and resources at both the local and national levels.”

Reviewer #3 (Recommendations for the authors):

Foremost, this study could be substantially strengthened through the analysis of the prospective use cases described in the discussion. Given the high burden of SAR-CoV-2 in the UK in late 2020, there should already be ample data available for inclusion. This would seem essential for validating the utility of the algorithm.

It would also be very helpful to include some estimates on the circumstances under which the algorithm is likely to provide useful information. The UK is highly unusual in terms of its virus sequencing efforts. What proportion of test-positive cases need to be sequenced in order for the algorithm to provide reliable estimates of ongoing transmission under different outbreak scenarios?

The algorithm was initially designed with the aim of close to 100% sequencing coverage of positive cases within the hospital. This reflects a potential future in which pathogen sequencing is a routine part of lab diagnostics, but as suggested this can present a substantial challenge at present – and one which is highly dependent on the local incidence of new cases. The retrospective evaluation that we have conducted indicates that performance may be acceptable with overall sequencing coverage at substantially lower levels (34% of known cases for Glasgow and 73% for Sheffield). However, the performance will also be influenced by which patients are being prioritised for sequencing and the incidence rates of admission of COVID-19 patients and of the occurrence of HOCI cases.

We have added a new section to the Appendices regarding this issue ‘Sequencing prioritisation for prospective use of the SRT’, and have included the sequencing prioritisation guidelines and associated rationale that have been developed in the course of running our prospective study.

On a related subject, the impact of turnaround time from sample collection to receipt of sequence data should be evaluated. How fast does sequence data need to be able available for the algorithm to provide actionable information?

We feel that this question requires prospective evaluation. As noted in our response to the Public Review, our ongoing prospective study includes intervention phases both with a ‘rapid’ target turnaround of 48 hours from sampling and with a ‘slow’ target turnaround of 5-10 days, and this will generate data on the relative utility for IPC of viral sequencing within these timeframes.

The Discussion section briefly touches on the motivation for using a 2 SNP threshold for identifying "similar" viruses instead of phylogeny-informed approaches but dismisses the latter saying that they are too intensive to be used outside of research-intensive settings. I agree that phylogenetic approaches would be more resource intensive, but it would seem that the amount of sequencing and shortness of turnaround time required to make this algorithm useful similarly limit its utility to research-intensive settings. With this in mind, the substantially increased information associated with phylogenetic approaches would seem to be worth the increased resource costs particularly since the capabilities and quality of phylogenetic tools has increased substantially since the publication of the 2013 paper cited on line 326. I recognise that extending this particular algorithm to include phylogenetic data is outside the scope of work for this project, but the manuscript is unfairly dismissive of such approaches, especially when a location specific build of nextstrain can created with a programming burden similar to that of the algorithm described in this study.

We have adjusted the Discussion to better reflect the comparative advantages and limitations with respect to phylogenetic investigation:

“There have been advances in recent years in the computational efficiency and workflow standardisation possible for phylogenetic analyses that have made it easier to use these methods for real-time investigation of outbreaks, for example through the development of the Nextstrain project[28, 29]. […] There will be cases in which phylogenetic analysis would provide information beyond that returned by the SRT, and the two approaches may be complementary to one another for outbreak investigation.”

https://doi.org/10.7554/eLife.65828.sa2

Article and author information

Author details

  1. Oliver Stirrup

    Institute for Global Health, University College London, London, United Kingdom
    Contribution
    Software, Formal analysis, Investigation, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    oliver.stirrup@ucl.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8705-3281
  2. Joseph Hughes

    MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
    Contribution
    Data curation, Formal analysis, Investigation, Visualization, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Matthew Parker

    1. Sheffield Bioinformatics Core, The University of Sheffield, Sheffield, United Kingdom
    2. Sheffield Institute for Translational Neuroscience, The University of Sheffield, Sheffield, United Kingdom
    3. Sheffield Biomedical Research Centre, The University of Sheffield, Sheffield, United Kingdom
    Contribution
    Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  4. David G Partridge

    1. Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, United Kingdom
    2. The Florey Institute for Host-Pathogen Interactions & Department of Infection, Immunity and Cardiovascular Disease, Medical School, University of Sheffield, Sheffield, United Kingdom
    Contribution
    Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0417-2016
  5. James G Shepherd

    MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
    Contribution
    Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  6. James Blackstone

    The Comprehensive Clinical Trials Unit at UCL , University College London, London, United Kingdom
    Contribution
    Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
  7. Francesc Coll

    Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene & Tropical Medicine, London, United Kingdom
    Contribution
    Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
  8. Alexander Keeley

    1. Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, United Kingdom
    2. The Florey Institute for Host-Pathogen Interactions & Department of Infection, Immunity and Cardiovascular Disease, Medical School, University of Sheffield, Sheffield, United Kingdom
    Contribution
    Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  9. Benjamin B Lindsey

    1. Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, United Kingdom
    2. The Florey Institute for Host-Pathogen Interactions & Department of Infection, Immunity and Cardiovascular Disease, Medical School, University of Sheffield, Sheffield, United Kingdom
    Contribution
    Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4227-2592
  10. Aleksandra Marek

    Clinical Microbiology, NHS Greater Glasgow and Clyde, Glasgow, United Kingdom
    Contribution
    Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  11. Christine Peters

    Clinical Microbiology, NHS Greater Glasgow and Clyde, Glasgow, United Kingdom
    Contribution
    Data curation, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  12. Joshua B Singer

    MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
    Contribution
    Software, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  13. The COVID-19 Genomics UK (COG-UK) consortium

  14. Asif Tamuri

    Research Computing, University College London, London, United Kingdom
    Contribution
    Software, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  15. Thushan I de Silva

    1. Sheffield Teaching Hospitals NHS Foundation Trust, Sheffield, United Kingdom
    2. The Florey Institute for Host-Pathogen Interactions & Department of Infection, Immunity and Cardiovascular Disease, Medical School, University of Sheffield, Sheffield, United Kingdom
    Contribution
    Conceptualization, Data curation, Supervision, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  16. Emma C Thomson

    1. MRC-University of Glasgow Centre for Virus Research, Glasgow, United Kingdom
    2. Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, United Kingdom
    3. Department of Infectious Diseases, Queen Elizabeth University Hospital, Glasgow, United Kingdom
    Contribution
    Conceptualization, Data curation, Supervision, Investigation, Writing - review and editing
    Competing interests
    No competing interests declared
  17. Judith Breuer

    Division of Infection and Immunity, University College London, London, United Kingdom
    Contribution
    Conceptualization, Supervision, Writing - original draft, Writing - review and editing
    For correspondence
    j.breuer@ucl.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8246-0534

Funding

Wellcome (201344/Z/16/Z)

  • Francesc Coll

NIHR

  • Judith Breuer

NIHR (IS-BRC-1215–20017)

  • Matthew Parker

Medical Research Council

  • The COVID-19 Genomics UK (COG-UK) consortium

NIHR

  • The COVID-19 Genomics UK (COG-UK) consortium

Wellcome

  • The COVID-19 Genomics UK (COG-UK) consortium

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

COG-UK HOCI is funded by the COG-UK consortium, which is supported by funding from the Medical Research Council (MRC) part of UK Research and Innovation (UKRI), the National Institute of Health Research (NIHR) and Genome Research Limited, operating as the Wellcome Sanger Institute. JBr receives funding from the NIHR ULC/UCLH Biomedical Research Centre. FC is funded by Wellcome (grant number: 201344/Z/16/Z). MDP is funded by the NIHR Sheffield Biomedical Research Centre (BRC – IS-BRC-1215–20017). We acknowledge the help of the UCL Comprehensive Clinical Trials Unit. We thank the NHS Greater Glasgow and Clyde and Sheffield Teaching Hospitals NHS Foundation Trust infection prevention and control teams for provision of data. We thank Michael Chapman for his assistance in the development of this project.

Ethics

Human subjects: Research Ethics for COG-UK Consortium and research undertaken under its auspices was granted by the PHE Research Ethics and Governance group as part of the emergency response to COVID-19 (24 April 2020, REF: R&D NR0195) and by the relevant Scottish biorepository authorities (16/WS/0207NHS and 10/S1402/33). This was a retrospective analysis on fully anonymized data, the collection of which did not involve any active research intervention. Consent therefore was neither required nor requested from individual patients.

Senior Editor

  1. Jos WM van der Meer, Radboud University Medical Centre, Netherlands

Reviewing Editor

  1. Albert Osterhaus, University of Veterinary Medicine Hannover, Germany

Reviewer

  1. Albert Osterhaus, University of Veterinary Medicine Hannover, Germany

Publication history

  1. Received: December 16, 2020
  2. Accepted: June 25, 2021
  3. Accepted Manuscript published: June 29, 2021 (version 1)
  4. Version of Record published: July 16, 2021 (version 2)

Copyright

© 2021, Stirrup et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 663
    Page views
  • 85
    Downloads
  • 5
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

  1. Further reading

Further reading

    1. Computational and Systems Biology
    2. Epidemiology and Global Health
    Olivier Thomine et al.
    Research Article

    Simulating nationwide realistic individual movements with a detailed geographical structure can help optimize public health policies. However, existing tools have limited resolution or can only account for a limited number of agents. We introduce Epidemap, a new framework that can capture the daily movement of more than 60 million people in a country at a building-level resolution in a realistic and computationally efficient way. By applying it to the case of an infectious disease spreading in France, we uncover hitherto neglected effects, such as the emergence of two distinct peaks in the daily number of cases or the importance of local density in the timing of arrival of the epidemic. Finally, we show that the importance of super-spreading events strongly varies over time.

    1. Epidemiology and Global Health
    2. Microbiology and Infectious Disease
    Paul Z Chen et al.
    Research Advance Updated

    Background:

    Previously, we conducted a systematic review and analyzed the respiratory kinetics of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (Chen et al., 2021). How age, sex, and coronavirus disease 2019 (COVID-19) severity interplay to influence the shedding dynamics of SARS-CoV-2, however, remains poorly understood.

    Methods:

    We updated our systematic dataset, collected individual case characteristics, and conducted stratified analyses of SARS-CoV-2 shedding dynamics in the upper (URT) and lower respiratory tract (LRT) across COVID-19 severity, sex, and age groups (aged 0–17 years, 18–59 years, and 60 years or older).

    Results:

    The systematic dataset included 1266 adults and 136 children with COVID-19. Our analyses indicated that high, persistent LRT shedding of SARS-CoV-2 characterized severe COVID-19 in adults. Severe cases tended to show slightly higher URT shedding post-symptom onset, but similar rates of viral clearance, when compared to nonsevere infections. After stratifying for disease severity, sex and age (including child vs. adult) were not predictive of respiratory shedding. The estimated accuracy for using LRT shedding as a prognostic indicator for COVID-19 severity was up to 81%, whereas it was up to 65% for URT shedding.

    Conclusions:

    Virological factors, especially in the LRT, facilitate the pathogenesis of severe COVID-19. Disease severity, rather than sex or age, predicts SARS-CoV-2 kinetics. LRT viral load may prognosticate COVID-19 severity in patients before the timing of deterioration and should do so more accurately than URT viral load.

    Funding:

    Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant, NSERC Senior Industrial Research Chair, and the Toronto COVID-19 Action Fund.