Taxonium, a web-based tool for exploring large phylogenetic trees

  1. Theo Sanderson (corresponding author)
  1. The Francis Crick Institute, United Kingdom

Abstract

The COVID-19 pandemic has resulted in a step change in the scale of sequencing data, with more genomes of SARS-CoV-2 having been sequenced than of any other organism on earth. These sequences reveal key insights when represented as a phylogenetic tree, which captures the evolutionary history of the virus and allows the identification of transmission events and the emergence of new variants. However, existing web-based tools for exploring phylogenies do not scale to the size of datasets now available for SARS-CoV-2. We have developed Taxonium, a new tool that uses WebGL to allow the exploration of trees with tens of millions of nodes in the browser for the first time. Taxonium links each node to associated metadata and supports mutation-annotated trees, which are able to capture all known genetic variation in a dataset. It can be run entirely locally in the browser, from a server-based backend, or as a desktop application. We describe insights that analysing a tree of five million sequences can provide into SARS-CoV-2 evolution, and provide a tool at cov2tree.org for exploring a public tree of more than five million SARS-CoV-2 sequences. Taxonium can be applied to any tree, and is available at taxonium.org, with source code at github.com/theosanderson/taxonium.

Data availability

All code is available on GitHub. Data was not generated as part of this study. Data sources are indicated in the manuscript and raw data is available in all cases, without the need for requests.

Article and author information

Author details

  1. Theo Sanderson

    The Francis Crick Institute, London, United Kingdom
    For correspondence
    theo.sanderson@crick.ac.uk
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0003-4177-2851

Funding

Wellcome Trust (210918/Z/18/Z)

  • Theo Sanderson

Wellcome Trust (FC001043)

  • Theo Sanderson

Cancer Research UK (FC001043)

  • Theo Sanderson

Medical Research Council (FC001043)

  • Theo Sanderson

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

© 2022, Sanderson

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Theo Sanderson (2022) Taxonium, a web-based tool for exploring large phylogenetic trees. eLife 11:e82392. https://doi.org/10.7554/eLife.82392
