Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations

  1. Department of Ecology and Evolutionary Biology, University of Arizona, Tucson AZ 85719, USA
  2. Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles CA, USA
  3. Embark Veterinary, Inc., Boston MA 02111, USA
  4. Section for Molecular Ecology and Evolution, Globe Institute, University of Copenhagen, Denmark
  5. Institute of Ecology and Evolution, University of Oregon, Eugene OR 97402, USA
  6. School of Mathematics and Statistics, University of Melbourne, Australia
  7. AncestryDNA, San Francisco CA 94107, USA
  8. 54Gene, Inc., Washington DC 20005, USA
  9. Université Paris-Saclay, CNRS, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, UMR 9015 Orsay, France
  10. School of Life Sciences, University of Glasgow, Glasgow, UK
  11. Department of Computational Biology, Cornell University, Ithaca NY, USA
  12. Department of Cell and Systems Biology, University of Toronto, Toronto ON, Canada
  13. Department of Biology, University of Toronto Mississauga, Mississauga ON, Canada
  14. Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria
  15. Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria
  16. Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
  17. Computer Technologies Laboratory, ITMO University, St Petersburg, Russia
  18. Agricultural Institute of Slovenia, Department of Animal Science, Ljubljana, Slovenia
  19. Entomology Department, The Ohio State University, Wooster OH, USA
  20. Department of Genetics, University of Cambridge, Cambridge, UK
  21. Department of Zoology, University of Cambridge, Cambridge, UK
  22. Department of Ecology, Evolution, and Organismal Biology, Brown University, Providence RI, USA
  23. Center for Computational Molecular Biology, Brown University, Providence RI, USA
  24. Department of Genetics and Evolution, Federal University of Sao Carlos, Sao Carlos 13565905, Brazil
  25. Department of Genetics, Stanford University School of Medicine, Stanford CA 94305, USA
  26. Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Husargatan 3, SE-752 37 Uppsala, Sweden
  27. Department of Integrative Biology, University of California, Berkeley, Berkeley CA, USA
  28. Department of Biostatistics, University of Washington, Seattle WA, USA
  29. Broad Institute of MIT and Harvard, Cambridge MA 02142, USA
  30. Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  31. Cluster of Excellence - Controlling Microbes to Fight Infections, Eberhard Karls Universität Tübingen, Tübingen, Baden-Württemberg, Germany
  32. School of Life Sciences and The Biodesign Institute, Arizona State University, Tempe AZ, USA
  33. The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK
  34. Department of Molecular and Cellular Biology, University of Arizona, Tucson AZ 85721, USA
  35. Department of Integrative Biology, University of Wisconsin-Madison, Madison WI, USA
  36. Department of Mathematics, University of Oregon, Eugene OR 97402, USA
  37. Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill NC 27599, USA
  38. Efi Arazi School of Computer Science, Reichman University, Herzliya, Israel

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Editors

  • Reviewing Editor
    Ziyue Gao
    University of Pennsylvania, Philadelphia, United States of America
  • Senior Editor
    Molly Przeworski
    Columbia University, New York, United States of America

Reviewer #1 (Public Review):

stdpopsim is an existing, community-driven resource to support population genetics simulations across multiple species. This paper describes improvements and extensions to this resource and discusses various considerations of relevance to chromosome-scale evolutionary simulations. As such, the paper does not analyse data or present new results but rather serves as a general and useful guide for anyone interested in using the stdpopsim resource or in population genetics simulations in general.

Two new features in stdpopsim are described, which expand the types of evolutionary processes that can be simulated. First, the authors describe the addition of the ability to simulate non-crossover recombination events, i.e. gene conversion, in addition to standard crossover recombination. This will allow for simulations that come closer to the actual recombination processes occurring in many species. Second, the authors mention how genome annotations can now be incorporated into the simulations, to allow different processes to apply to different parts of the genome - however, the authors note that this addition will be further detailed in a separate, future publication. These additions to stdpopsim will certainly be useful to many users and represent a step forward in the degree of ambition for realistic population genetics simulations.

The paper also describes the expansion of the community-curated catalog of pre-defined, ready-to-use simulation set-ups for various species, from the previous 6 to 21 species (though not all new species have demographic models implemented; some have only population genetic parameters such as mutation rates and generation times). For each species, an attempt was made to implement parameters and simulations that are as realistic as possible with respect to what's known about the evolutionary history of that species, using only information that can be traced to the published literature. The process by which this was done appears quite rigorous and includes a quality-control step involving two people. Two examples are given, for Anopheles gambiae and Bos taurus. The detailed discussion of how various population genetic and demographic parameters were extracted from the literature for these two species usefully highlights the numerous non-trivial steps involved and showcases the great deal of care that underlies the stdpopsim resource.

The paper is clearly written and well-referenced, and I have no technical or conceptual concerns. The paper will be useful to anyone interested in population genetics simulations, and will hopefully serve as an inspiration for the broader effort of making simulations increasingly more realistic and flexible, while at the same time trying to make them accessible not just to a small number of experts.

Reviewer #2 (Public Review):

Lauterbur et al. present a description of recent additions to the stdpopsim simulation software for generating whole-genome sequences under population genetic models, as well as detailed general guidelines and best practices for implementing realistic simulations within stdpopsim and other simulation software. Such realistic simulations are critical for understanding patterns in genetic variation expected under diverse processes for study organisms, training simulation-intensive models (e.g., machine learning and approximate Bayesian computation) to make predictions about factors shaping observed genetic variation, and for generating null distributions for testing hypotheses about evolutionary phenomena. However, realistic population genomic simulations can be challenging for those who have never implemented such models, particularly when different evolutionary parameters are taken from a variety of literature sources. Importantly, the goal of the authors is to expand the inclusivity of the field of population genomic simulation, by empowering investigators, regardless of model or non-model study system, to ultimately be able to effectively test hypotheses, make predictions, and learn about processes from simulated genomic variation. Continued expansion of the stdpopsim software is likely to have a significant impact on the evolutionary genomics community.

Strengths:

This work details an expansion from 6 to 21 species to gain a greater breadth of simulation capacity across the tree of life. Due to the nature of some of the species added, the authors implemented finite-site substitution models allowing for more than two allelic states at loci, permitting proper simulations of organisms with fast mutation rates, small genomes, or large effect sizes. Moreover, related to some of the newly added species, the authors incorporated a mechanism for simulating non-crossover recombination, such as gene conversion and horizontal gene transfer between individuals. The authors also added the ability to annotate and model coding genomic regions.

In addition to these added software features, the authors detail guidelines and best practices for implementing realistic population genetic simulations at the genome scale, including a discussion of the importance of code review, as well as highlighting the parameters sufficient for simulation: chromosome-level assembly, mean mutation rate, mean recombination rate or recombination map if available, effective size or more realistic demographic model if available, and mean generation time. Many of these best practices are commonly followed by population genetic modelers, but new researchers in the field seeking to simulate data under population genetic models may be unfamiliar with them, making their clear enumeration (as done in this work) highly valuable for a broad audience. Moreover, the mechanisms for dealing with issues of missing parameters discussed in this work are particularly useful, as more often than not, estimates of certain model parameters may not be readily available from the literature for a given study system.
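The parameter list above can be sketched as a simple checklist. The following is a minimal illustration, not part of stdpopsim: the class name, field names, and values are hypothetical, and the numbers are placeholders that describe no real species. It only shows how one might track which of the listed inputs still lack a literature-backed value.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class SpeciesParameters:
    """Hypothetical checklist of the inputs listed above (not a stdpopsim class)."""

    chromosome_lengths: Optional[Dict[str, int]] = None  # name -> length in bp
    mutation_rate: Optional[float] = None                # mean, per bp per generation
    recombination_rate: Optional[float] = None           # mean, per bp per generation
    effective_population_size: Optional[float] = None
    generation_time: Optional[float] = None              # in years


def missing_parameters(params: SpeciesParameters) -> List[str]:
    """Return the names of fields that have not been filled in yet."""
    return [f.name for f in fields(params) if getattr(params, f.name) is None]


# Placeholder values for illustration only; they describe no real species.
draft = SpeciesParameters(
    chromosome_lengths={"chr1": 1_000_000},
    mutation_rate=1e-8,
    generation_time=2.0,
)
print(missing_parameters(draft))
```

A list like this makes the gaps explicit before any simulation is run, which is where the strategies for handling missing parameters come into play.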

Weaknesses:

An important update to the stdpopsim software is the capacity for researchers to annotate coding regions of the genome, permitting distributions of fitness effects and linked selection to be modeled. However, although this novel feature expands the breadth of processes that can be evaluated and is applicable to all species within the stdpopsim framework, the authors do not provide significant detail regarding it, stating that they will provide more details in a forthcoming publication. Compared to this feature, the additions of extra species, finite-site substitution models, and non-crossover recombination are more specialized updates to the software.

When it comes to simulating realistic genomic data, the authors clearly lay out that parameters obtained from the literature must be compatible: for example, the same recombination and mutation rates used to infer a demographic history should also be used within stdpopsim when employing that demographic history for simulation. This is a highly important point, which is often overlooked. However, it is also important that readers understand that, depending on the method used to estimate the demographic history, different demographic models within stdpopsim may not reproduce certain patterns of genetic variation well. The authors do touch on this a bit, providing the example that a constant-size demographic history will be unable to capture variation expected from recent size changes (e.g., an excess of low-frequency alleles). However, depending on the data used to estimate a demographic history, certain types of variation may be unreliably modeled (Beichman et al. 2017; G3, 7:3605-3620). For example, if a site frequency spectrum method was used to estimate a demographic history, then simulations under this model from stdpopsim may not recapitulate the haplotype structure of the observed species well. Similarly, if a method such as PSMC applied to a single diploid genome was used to estimate a demographic history, then simulations under this model from stdpopsim may not recapitulate the site frequency spectrum of the observed species well. Though the authors indicate that citations are given for each demographic model and model parameter for each species, this may not be sufficient for a novice researcher in this field to understand what forms of genomic variation the models may be capable of reliably producing. A potential worry is that the inclusion of a species within stdpopsim may serve as an endorsement to users regarding the available simulation models (though I understand this is not the authors' intent), and it would be helpful if users and readers were guided on the type of variation the models should be able to reliably reproduce for each species and each demographic history available for that species.
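The compatibility point raised in this review can be illustrated with a small calculation. The sketch below assumes the standard neutral expectation for a diploid population of constant size, in which expected pairwise diversity is 4 * N_e * mu. The function name and all numbers are hypothetical and are not taken from any published model. The idea is that an effective size estimated under one assumed mutation rate, when simulated with a different rate, yields a different expected diversity from the one the model was fitted to.

```python
def expected_diversity(effective_size: float, mutation_rate: float) -> float:
    """Expected pairwise nucleotide diversity, 4 * N_e * mu, for a diploid,
    constant-size, neutrally evolving population."""
    return 4.0 * effective_size * mutation_rate


# Illustrative numbers only; not taken from any published model.
mu_inference = 1.0e-8   # rate assumed when the demographic model was fitted
mu_simulation = 2.0e-8  # a different rate taken from another source
n_e = 10_000            # effective size reported by the fitted model

fitted = expected_diversity(n_e, mu_inference)
simulated = expected_diversity(n_e, mu_simulation)
ratio = simulated / fitted

print(f"diversity implied by the fitted model: {fitted:.1e}")
print(f"diversity under the mismatched rate:   {simulated:.1e}")
print(f"ratio of simulated to fitted:          {ratio:.1f}")
```

In this toy setting the mismatch scales diversity by the ratio of the two mutation rates, so the simulated data would no longer resemble the data the model was inferred from.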

Reviewer #3 (Public Review):

Lauterbur et al. present an expansion of the whole-genome evolution simulation software "stdpopsim", which includes new features of the simulator itself, and 15 new species in their catalog of demographic models and genetic parameters (which previously had 6 species). The list of new species includes mostly animals (12), but also one species of plant, one of algae, and one of bacteria. While only five of the new animal species (and none of the other organisms) have a demographic model described in the catalog, those species showcase a variety of demographic models (e.g., extreme inbreeding of cattle). The authors describe in detail how to go about gathering genetic and demographic parameters from the literature, which is helpful for others aiming to add new species and demographic models to the stdpopsim catalog. This part of the paper is the most widely relevant, not only for stdpopsim users but for any researcher performing population genomic simulations. This work is a concrete contribution towards increasing the number of users of population genomic simulations and improving reproducibility in research that uses this type of simulation.

Author Response:

We are very glad that the reviewers found our paper of broad interest to the community of population, evolutionary, and ecological genetics. We thank them for their positive feedback and insightful comments and suggestions. We are preparing a revision of the preprint that will address these points.

One issue raised by the reviewers was that it is important to acknowledge possible limitations of the demographic model used in simulation in capturing different aspects of genomic variation. In particular, different demographic models inferred for the same species using different methods or sets of samples may have different strengths and weaknesses, and this should be considered when selecting a demographic model for simulation. This is an important point that we intend to discuss in the revised version of our manuscript. We also plan to expand the documentation of the stdpopsim catalog to include more information about the type of data used to fit every demographic model. Below we provide an outline of our thoughts on the topic.

First of all, it is important to acknowledge that demographic models inferred from genomic data cannot fully capture all aspects of the true demographic changes in the history of a species. As a result, these models capture some aspects of genetic variation well, but not all of them. Which aspects are captured is primarily determined by two factors: the method used for demographic inference, and the samples whose genomes were used in inference. Regardless of the method applied, the inferred demographic model can only reflect the genealogical ancestry of the sampled individuals, and this will typically make up a small portion of the complete genealogical ancestry of the species (although the genealogy of any set of sampled individuals includes many ancestors). Thus, demographic models inferred from larger sets of samples from diverse ancestry backgrounds may provide a more comprehensive depiction of genetic variation within a species, as long as a sufficiently realistic demographic model can be fit. That said, the choice of samples used for inference will mostly influence how recent changes in genetic variation are captured. This is because the genealogy of even a single individual consists of numerous ancestors in each generation in the deep past (which is the premise behind PSMC-style inference methods).

The computational method used for inference also affects the way genetic variation is reflected by the demographic model, because different methods derive their inference from different features of genomic variation. Some methods make use of the site frequency spectrum at unlinked single sites (e.g., dadi, Stairway plot), while other methods use haplotype structure (e.g., PSMC, MSMC, IBDNe). This, in turn, may influence the accuracy of different features in the inferred demography. For example, very recent demographic changes, such as recent admixture or bottlenecks, are difficult to infer from the site frequency spectrum, but are more easily inferred by examining shared long haplotypes (as demonstrated by the demographic model inferred for Bos taurus by MacLeod et al. (2013)). There have been several studies that compare different approaches to demographic inference (e.g., Beichman et al. (2017); Harris and Nielsen (2013)), but unfortunately, there is currently no succinct handbook that describes the relative strengths and weaknesses of different methods. Indeed, we hope that the standardized simulations provided by stdpopsim will facilitate systematic comparisons between methods, which will, in turn, provide valuable insights for researchers when selecting demographic models for simulation.
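The distinction between the two kinds of summaries can be made concrete with a toy example. The sketch below is not drawn from any of the methods named above: the function names are hypothetical, and the number of distinct haplotypes is used only as a crude stand-in for haplotype structure. It shows that permuting alleles among haplotypes independently at each site leaves the site frequency spectrum unchanged while destroying linkage between sites, which is one way to see why a model fitted only to the spectrum is not constrained by haplotype patterns.

```python
import random
from collections import Counter
from typing import List

Haplotypes = List[List[int]]  # rows = sampled haplotypes, columns = sites (0/1)


def site_frequency_spectrum(haps: Haplotypes) -> Counter:
    """Count polymorphic sites by their derived-allele count."""
    n = len(haps)
    counts = (sum(column) for column in zip(*haps))
    return Counter(c for c in counts if 0 < c < n)


def distinct_haplotypes(haps: Haplotypes) -> int:
    """Crude haplotype-structure summary: the number of distinct rows."""
    return len({tuple(row) for row in haps})


def shuffle_within_sites(haps: Haplotypes, seed: int) -> Haplotypes:
    """Permute alleles among haplotypes independently at every site.

    Per-site allele counts are preserved; associations between sites are not.
    """
    rng = random.Random(seed)
    columns = [list(column) for column in zip(*haps)]
    for column in columns:
        rng.shuffle(column)
    return [list(row) for row in zip(*columns)]


# Toy data: two perfectly linked haplotype classes across 20 sites.
original = [[1] * 20 for _ in range(3)] + [[0] * 20 for _ in range(3)]
shuffled = shuffle_within_sites(original, seed=1)

same_sfs = site_frequency_spectrum(original) == site_frequency_spectrum(shuffled)
print("same site frequency spectrum:", same_sfs)
print("distinct haplotypes before and after:",
      distinct_haplotypes(original), distinct_haplotypes(shuffled))
```

The two datasets are indistinguishable to a spectrum-based summary but differ sharply in their haplotypes, mirroring the concern that each inference method constrains only the features of variation it actually uses.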

It is important to note that inclusion of a demographic model in the stdpopsim catalog does not involve any judgment as to which aspects of genetic variation it captures. Any model that is a faithful implementation of a published model inferred from genomic data can be added to the stdpopsim catalog. Thus, potential users of stdpopsim should use the implemented models with the appropriate caution, keeping in mind the limitations discussed above. Scientists contributing a new model to the catalog are required to write a brief summary, which is added to the documentation page of the catalog: https://popsim-consortium.github.io/stdpopsim-docs/latest/catalog.html. This summary includes a graphical description of the model (such as the one shown for Anopheles gambiae in Fig. 2B of the paper), as well as a description of the data and method used for inference. We will mention this in the revised manuscript to help users of stdpopsim navigate through this resource.

  1. Howard Hughes Medical Institute
  2. Wellcome Trust
  3. Max-Planck-Gesellschaft
  4. Knut and Alice Wallenberg Foundation