Poseidon – A framework for archaeogenetic human genotype data management

  1. Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany
  2. International Max Planck Research School for the Science of Human History, Max Planck Institute for Geoanthropology, Jena, Germany
  3. Saarland University, Saarbrücken, Germany

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a response from the authors (if available).


Editors

  • Reviewing Editor
    Shai Carmi
    The Hebrew University of Jerusalem, Jerusalem, Israel
  • Senior Editor
    George Perry
    Pennsylvania State University, University Park, United States of America

Reviewer #1 (Public Review):

The authors describe a framework for working with genotype data and associated metadata, specifically geared towards ancient DNA. The Poseidon framework aims to address long-standing data coordination issues in ancient population genomics research. These issues can usefully be thought of as two primary, separate problems:

(1) The genotype merging problem. Often, genotype calls made by a new study are not made publicly available, or they are only made available in an ad-hoc fashion without consistency in formatting between studies. Other users will typically want to combine genotypes from many previously published studies with their own newly produced genotypes, but a lack of coordination and standards means that this is challenging and time-consuming.

(2) The metadata problem. All genomes need informative metadata to be usable in analyses, and this is even more true for ancient genomes which have temporal and often cultural dimensions to them. In the ancient DNA literature, metadata is often only made available in inconsistently formatted supplementary tables, such that reuse requires painstakingly digging through these to compile, curate and harmonise metadata across many studies.

Poseidon aims to solve both of these problems at the same time, and additionally provide a bit of population genetics analysis functionality. The framework is a quite impressive effort that clearly has taken a lot of work and thought. It displays a great deal of attention to important aspects of software engineering and reproducibility. How much usage it will receive beyond the authors themselves remains to be seen, as there is always a barrier to entry for any new sophisticated framework. But in any case, it clearly represents a useful contribution to the human ancient genomics community.

The paper is quite straightforward in that it mainly describes the various features of the framework, both the way in which data and metadata are organised, and the various little software tools provided to interact with the data. This is all well-described and should serve as a useful introduction for any users of the framework, and I have no concerns with the presentation of the paper. Perhaps it gets a bit too detailed for my taste at times, but it's up to the authors how they want to write the paper.

I thus have no serious concerns with the paper. I do have some thoughts and comments on the various choices made in the design of the framework, and how these fit into the broader ecosystem of genomics data. I wouldn't necessarily describe much of what follows as criticism of what the authors have done - the authors are of course free to design the framework and software that they want and think will be useful. And the authors clearly have done more than basically anyone else in the field to tackle these issues. But I still put forth the points below to provide some kind of wider discussion within the context of ancient genomics data management and its future.

* * *

The authors state that there is no existing archive for genotype data. This is not quite true. There is the European Variation Archive (EVA, https://www.ebi.ac.uk/eva/), which allows archiving of VCFs and is interlinked to raw data in the ENA/SRA/DDBJ. If appropriately used, the EVA and associated mainstream infrastructure could in principle be put to good use by the ancient genomics community. In practice, it's basically not used at all by the ancient genomics community, and partly this is because EVA doesn't quite provide exactly what's needed (in particular with regards to metadata fields). Poseidon aims to provide a much more custom-tailored solution for the most common use cases within the human ancient DNA field, but it could be argued that such a solution is only needed because the ancient genomics community has largely neglected the mainstream infrastructure. In some sense, by providing such a custom-tailored solution that is largely independent of the mainstream infrastructure, I feel like efforts such as Poseidon (and AADR) - while certainly very useful - might risk contributing to further misaligning the ancient genomics community from the rest of the genomics community, rather than bringing it closer. But the authors cannot really be blamed for that - they are simply providing a resource that will be useful to people given the current state of things.

The BioSamples database (https://www.ebi.ac.uk/biosamples/) is an attempt to provide universal sample IDs across the life sciences and is used by the archives for sequence reads (ENA/SRA/DDBJ). Essentially every published ancient sample already has a BioSample accession, because this is required for the submission of sequence reads to ENA/SRA/DDBJ. It would thus have seemed natural to make BioSamples IDs a central component of Poseidon metadata, so as to anchor Poseidon to the mainstream infrastructure, but this is not really done. There are some links being made to ENA in the .ssf "sequence source" files used by the Poseidon package, including sample accessions, but this linkage seems more ad hoc.
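For illustration only, such anchoring could be as light-weight as a dedicated, validated column in the tab-separated metadata mapping each individual to its BioSamples accession; the column name and values below are hypothetical and not part of the current Poseidon specification:

```
Poseidon_ID	BioSamples_Accession
IND001	SAMEA1234567
IND002	SAMEA7654321
```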

The package uses PLINK and EIGENSTRAT file formats to represent genotypes, which in my view are not particularly good formats for long-term and rigorous data management in genomics. These file formats cannot appropriately represent multiallelic loci or haplotype phase, nor store information on genotype qualities, coverage, etc. The standard in the rest of genomics is VCF, a much more robust and flexible format with better software built around it. Insisting on these arguably outdated formats is one way in which the ancient genomics community risks misaligning itself with the mainstream.
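To make the representational gap concrete, here is a minimal sketch (a hypothetical helper, not Poseidon or PLINK code) of collapsing a VCF GT field into EIGENSTRAT's single-character genotype encoding; phase, genotype quality, and any second alternate allele simply have nowhere to go:

```python
# Illustrative sketch: EIGENSTRAT stores only the count of reference
# alleles per sample (0, 1, 2) or 9 for missing data -- phase markers,
# quality annotations, and extra alleles are all discarded.

def to_eigenstrat(gt: str) -> str:
    """Collapse a VCF GT field (e.g. '0|1', '1/1', './.') to EIGENSTRAT."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "9"                      # missing genotype
    if any(int(a) > 1 for a in alleles):
        raise ValueError("multiallelic genotype cannot be represented")
    return str(alleles.count("0"))      # number of REF alleles

print(to_eigenstrat("0|1"))   # "1" -- the phase marker '|' is lost
print(to_eigenstrat("1/1"))   # "0"
print(to_eigenstrat("./."))   # "9"
# to_eigenstrat("1/2") would raise: a second alternate allele has no encoding
```

The conversion is lossy in one direction only, which is the reviewer's point: a VCF can always be degraded to EIGENSTRAT, but not recovered from it.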

I could not find any discussion of reference genomes: knowing the reference genome coordinate system is essential to using any genotype file. For comparison, in the EVA archive, every VCF dataset has a "Genome Assembly" metadata field specifying the accession number of the reference genome used. It would seem to me like a reference genome field should be part of a Poseidon package too. In practice, the authors likely use some variant of the hg19 / GRCh37 human reference, which is still widely used in ancient genomics despite being over a decade out of date. Insisting on using an outdated reference genome is one way in which the ancient genomics community is misaligning itself with the mainstream, and it complicates comparisons to data from other sub-fields of genomics.
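The suggestion could be realized as a single additional field in a package's POSEIDON.yml definition file. The snippet below is a hypothetical sketch: the `genomeAssembly` key does not exist in the current Poseidon specification, and the surrounding fields are abbreviated for illustration:

```yaml
# Abbreviated POSEIDON.yml sketch; "genomeAssembly" is a hypothetical
# field illustrating the reviewer's suggestion -- it is NOT part of the
# current Poseidon specification.
title: ExamplePackage
genotypeData:
  format: EIGENSTRAT
  genoFile: ExamplePackage.geno
  snpFile: ExamplePackage.snp
  indFile: ExamplePackage.ind
genomeAssembly: GCA_000001405.1   # GRCh37, mirroring the EVA's "Genome Assembly" field
```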

A fundamental issue contributing to the genotype merging problem, not unique to ancient DNA, is that genotype files are typically filtered to remove sites that are not polymorphic within the given study - this means that files from two different studies will often contain different and not fully overlapping sets of sites, greatly complicating systematic merging. I don't see any discussion of how Poseidon deals with this. In practice, it seems the authors are primarily concerned with data on the commonly used 1240k array set, such that the set of SNPs is always well-defined. But does Poseidon deal with the more general problem of non-overlapping sites between studies, or is this issue simply left to the user to worry about? This would be of relevance to whole-genome sequencing data, and there are certainly plenty of whole-genome datasets of great interest to the research community (including archaic human genomes, etc).
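The overlap problem can be sketched in a few lines (illustrative site IDs, not Poseidon's actual merge algorithm): each study ships genotypes only for its own retained sites, so a naive merge must choose between intersecting the site lists (losing data) or taking their union (introducing missingness):

```python
# Illustrative sketch of the site-overlap problem when merging two studies.
# The rsID sets below are arbitrary examples, not real study contents.

study_a = {"rs3094315", "rs12124819", "rs28765502"}
study_b = {"rs3094315", "rs28765502", "rs7419119"}

shared = study_a & study_b   # usable directly in a merged dataset
union = study_a | study_b    # requires handling missing genotypes
only_a = study_a - study_b   # ambiguous: untyped in B, or typed but monomorphic?

print(len(shared), len(union), len(only_a))   # 2 4 1
```

The ambiguity flagged in the last comment is what makes the choice non-mechanical: a site absent from a filtered file could mean "not assayed" or "assayed but invariant", and those call for different treatment when merging.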

In principle, it seems the framework could be species-agnostic and thus be useful more generally beyond humans (perhaps it would be enough to add just one more "species" metadata field?). It is of course up to the authors to decide how broadly they want to cater.

Reviewer #2 (Public Review):

Summary:

Schmid et al. provide details of their new data management tool Poseidon which is intended to standardise archaeogenetic genotype data and combine it with the associated standardised metadata, including bibliographic references, in a way that conforms to FAIR principles. Poseidon also includes tools to perform standard analyses of genotype files, and the authors pitch it as the potential first port of call for researchers who are planning on using archaeogenetic data in their research. In fact, Poseidon is already up and running and being used by researchers working in ancient human population genetics. To some extent, it is already on its way to becoming a fundamental resource.

Strengths:

A similar ancient genomics resource exists (the Allen Ancient DNA Resource, AADR), but Poseidon is several steps ahead in terms of integration and standardisation of metadata, its intrinsic analytical tools, its flexibility, and its ambitions towards being independent and entirely community-driven. It is clear that a lot of thought has gone into each aspect of what is a large and dynamic package of tools and overall it is systematic and well thought through.

Weaknesses:

The main weakness of the plans for Poseidon, which admirably the authors openly acknowledge, is in how to guarantee it is maintained and updated over the long term while also shifting to a fully independent model. The software is currently hosted by the MPI, although the authors do set out plans to move it to a more independent venue. However, the core team comprising the authors is funded by the MPI, and so the MPI is also the main funder of Poseidon. The authors do state their ambition to move towards a community-driven independent model, but the details of how this would happen are a bit vague. The authors imagine that the authors of archaeogenetic papers would upload data themselves, thereby making that broader authorship the voluntary community who would take on the responsibility of maintaining Poseidon. Archaeogeneticists generally are committed enough to their field that there is a good chance such a model would work, but it feels haphazard to rely on goodwill alone. Given there needs to be a core team involved in maintaining Poseidon beyond just updating the database, from the paper as it stands it is difficult to see how Poseidon might be weaned off MPI funding/primary involvement and what the alternative is. However, the same anxieties always surround these sorts of resources when they are first introduced. The main aim of the paper is to introduce and explain the resource rather than make explicit plans for its future, and so this is a minor weakness of the paper overall.

Author response:

We thank the editors and reviewers for their thorough engagement with the manuscript and their well-informed comments on the Poseidon framework. We are pleased to note that they consider Poseidon a promising and timely attempt to resolve important issues in the archaeogenetics community. We also agree with the main challenges they raise, specifically the lack of long-term, independent infrastructure funding at the time of writing, and various aspects of Poseidon that risk further entrenching a de facto alienation of the aDNA community from the wider field of genomics.

Poseidon is indeed dependent on the Department of Archaeogenetics at MPI-EVA. For the short- to medium-term future (3-5 years) we consider this dependency beneficial, providing a reliable anchor point and direct integration with one of the most proficient data-producing institutions in archaeogenetics. For the long term, as stated in the discussion section of the manuscript, we hope for a snowball effect in the dissemination and adoption of Poseidon to establish it as a valuable community resource that automatically attracts working time and infrastructure donations. To kickstart this process we have already intensified our active community outreach and teach Poseidon explicitly to (early career) practitioners in the field. We are aware of options to apply for independent infrastructure funding, for example through the German National Research Data Infrastructure (NFDI) initiative, and we plan to explore them further.

As the reviewers have noted, key decisions in Poseidon’s data storage mechanism have been influenced by the special path archaeogenetics has taken compared to other areas of genomics. The founding goal of the framework was to integrate immediately with established workflows in the field. Nevertheless, we appreciate the concrete suggestions on how to connect Poseidon better with the good practices that emerged elsewhere. We will explicitly address the European Variation Archive in a revised version of the manuscript, consider embedding the BioSamples ID of the INSDC databases more prominently in the .janno file, prioritise support for VCF alongside EIGENSTRAT and PLINK, and add an option to clearly document the relevant human reference genome on a per-sample level. In the revised version of the text we will also explain the treatment of non-overlapping SNPs between studies by trident’s forge algorithm, and how we imagine the interplay of different call sets in the Poseidon framework in general.

Beyond these bigger concerns, we will also consider and answer the various more detailed recommendations kindly shared by the reviewers, not least the question of how we imagine Poseidon being used by archaeologists and for archaeological data.
