Microbial plankton play a central role in marine biogeochemical cycles, but the timing in which abundant lineages diversified into ocean environments remains unclear. Here, we reconstructed the timeline in which major clades of bacteria and archaea colonized the ocean using a high-resolution benchmarked phylogenetic tree that allows for simultaneous and direct comparison of the ages of multiple divergent lineages. Our findings show that the diversification of the most prevalent marine clades spans throughout a period of 2.2 Ga, with most clades colonizing the ocean in the last 800 million years. The earliest clades - SAR202, SAR324, Marinimicrobia, and Marine Group II - diversified around the time of the Great Oxidation Event (GOE), during which oxygen concentration increased but remained at microaerophilic levels throughout the Mid-Proterozoic, and these groups remain prevalent in oxygen minimum zones today. We found the diversification of the prevalent heterotrophic marine clades, SAR11, SAR116, SAR92, SAR86, and Roseobacter as well as the Marine Group I, occurring near to the Neoproterozoic Oxygenation Event (0.8-0.4 Ga). The diversification of these clades is concomitant with an overall increase of oxygen and nutrients in the ocean at this time, as well as the diversification of eukaryotic algae consistent with the previous hypothesis that the diversification of heterotrophic bacteria is linked to the emergence of large eukaryotic phytoplankton. The youngest clades correspond to the widespread phototrophic clades Prochlorococcus, Synechococcus, and Crocosphaera, whose diversification happened after the Phanerozoic Oxidation Event (0.45-0.4 Ga) and the increase of modern oxygen concentrations in the atmosphere and the ocean. Our work clarifies the timing at which abundant lineages of bacteria and archaea colonized the ocean, thereby providing key insights into the evolutionary history of lineages that comprise the majority of prokaryotic biomass in the ocean.
This important paper addresses the challenging problem of dating the origin of several groups of marine microorganisms. However, while much of the analyses are solid, the lack of robustness analysis in molecular dating component such as using alternative time calibrations, clock models, and input gene sets makes the study incomplete. Despite some concerns, this work is a commendable attempt at an extremely difficult problem and will be of broad interest to microbiologists, geologists, and evolutionary biologists.
The ocean plays a central role in the flux and stability of Earth’s biogeochemistry (Dontsova et al., 2020; Falkowski et al., 1998; Field et al., 1998). Due to their abundance, diversity, and physiological versatility, microbes mediate the vast majority of organic matter transformations that underpin higher trophic levels (Brown et al., 2014; Mason et al., 2009). For example, marine microorganisms regulate a large fraction of the organic carbon pool (Ducklow and Doney, 2013), drive elemental cycling of nutrients like nitrogen (Zehr and Kudela, 2011), and participate in the ocean-atmosphere exchange of climatically important gasses (Vila-Costa et al., 2006). Starting in the 1980s, analysis of small-subunit ribosomal RNA genes began to reveal the identity of dominant clades of bacteria and archaea that were notable for their ubiquity and high abundance, and subsequent analyses highlighted their diverse physiological activities in the ocean (Giovannoni and Stingl, 2005). Phylogenetic studies showed that these clades are broadly distributed across the Tree of Life (ToL) and encompass a wide range of phylogenetic breadths (Giovannoni and Stingl, 2005). Cultivation-based studies and the large-scale generation of genomes from metagenomes have continued to make major progress in examining the genomic diversity and metabolism of these major marine clades, but we still lack a comprehensive understanding of the evolutionary events leading to their origin and diversification in the ocean.
Several independent studies have used molecular phylogenetic methods to date the diversification of some marine microbial lineages, such as the Ammonia Oxidizing Archaea of the order Nitrososphaerales (Marine Group I, MGI) (Ren et al., 2019; Yang et al., 2021; Zhang et al., 2021), picocyanobacteria of the genera Synechococcus and Prochlorococcus (Sánchez-Baracaldo, 2015; Sánchez-Baracaldo et al., 2019; Zhang et al., 2021), and marine alphaproteobacterial groups that included the SAR11 and Roseobacter clades (Luo et al., 2013). Differences in the methodological frameworks used in these studies often hinder comparisons between lineages, however, and results for individual clades often conflict (Ren et al., 2019; Sánchez-Baracaldo, 2015; Yang et al., 2021; Zhang et al., 2021). Moreover, it has been difficult to directly compare bacterial and archaeal clades due to the vast evolutionary distances between these domains. For these reasons it has remained challenging to evaluate the ages of different marine lineages and develop a comprehensive understanding of the timing of microbial diversification events in the ocean and their relationship with major geological events throughout Earth’s history.
To clarify the timing at which major lineages of bacteria and archaea diversified into the ocean, we developed an approach that leverages a multi-domain phylogenetic tree that allows for simultaneous dating of all major lineages. This method allows us to directly compare the ages of divergent lineages across the tree of life and subsequently reconstruct a timeline in which these groups evolved into the ocean. Moreover, we can also map the acquisition of different protein families onto this phylogeny and thereby infer the genes that were gained by these marine lineages at the time of their emergence. Altogether, this information provides a comprehensive framework that sheds light on watershed events in the history of life on Earth that have given rise to contemporary biodiversity and biogeochemical dynamics in the ocean.
Results and discussion
To begin analyzing the diversification of marine lineages of bacteria and archaea, we constructed a multi-domain phylogenetic tree that allowed us to directly compare the origin of 13 planktonic marine bacterial and archaeal clades that are notable for their abundance and major roles in marine biogeochemical cycles (Fig. 1). We based tree reconstruction on a benchmarked set of marker genes that we have previously shown to be congruent for inter-domain phylogenetic reconstruction (Coleman et al., 2021; Martinez-Gutierrez and Aylward, 2021) (details in Methods, Supplemental File 2). Our phylogenetic framework included non-marine clades for phylogenetic context, and overall it recapitulates known relationships across the ToL, such as the clear demarcation of the Gracilicutes and Terrabacteria bacterial superphyla and the basal placement of the Thermatogales (Coleman et al., 2021; Martinez-Gutierrez and Aylward, 2021) (Fig. 1). To gain insight into the geological landscape in which these major marine clades first diversified, we performed a Bayesian relaxed molecular dating analysis on our benchmarked ToL using several calibrated nodes (Fig. 1 and Table 1).
Due to the limited representation of microorganisms in the fossil record and the difficulties to associate fossils to taxonomic extant groups, we employed geochemical evidence as temporal calibrations (Fig. 1 and Table 1). Moreover, because of the uncertainty in the length of the branch linking bacteria and archaea, the crown node for each domain was calibrated independently. We used the age of the presence of liquid water as approximated through the dating of zircons (Valley et al., 2014), as well as the most ancient record of biogenic methane (Ueno et al., 2006) as maximum and minimum prior ages for bacteria and archaea (4400 and 3460 My, respectively, Fig. 1 and Table 1). For internal calibration, we used the recent identification of non-oxygenic Cyanobacteria to constrain the diversification node of oxygenic Cyanobacteria with a minimum age of 2320 My, the estimated age for the Great Oxidation Event (GOE) (Bekker et al., 2004; Holland, 2006, 2002). Similarly, we calibrated the crown node of aerobic Ammonia Oxidizing Archaea, aerobic Ca. Marinimicrobia, and the Nitrite Oxidizing Bacteria with a maximum age of 2320 My (GOE estimated age) due to their strict aerobic metabolism. Despite geological evidence pointing to the presence of oxygen before the GOE, our Bayesian estimates are consistent with the ancient origin of major bacterial and archaeal supergroups, such as Asgardarchaeota, Euryarchaeota, Firmicutes, Actinobacteria, and Aquificota (Fig. 2). Moreover, the date we found for oxygenic Cyanobacteria (2611 My, CI 95% = 2589-2632; Fig. 2) is in agreement with their diversification happening before the GOE (Ward et al., 2016). Please see the Methods for a detailed explanation of all calibration dates used, together with our rationale for including each one.
Our Bayesian estimates suggest that the lineages that emerged earliest include SAR202, aerobic Ca. Marinimicrobia, SAR324, and the Marine Group II of the phylum Euryarchaeota (MGII). The most ancient clade was the SAR202 (2479 My, 95% CI = 2465-2492 My), whose diversification took place near before the GOE (Fig. 2). Given the aerobic capabilities of SAR202, the diversification of this clade before the broad oxygenation of the atmosphere during the GOE suggests that SAR202 emerged during an oxygen oasis proposed to have existed in pre-GOE Earth (Anbar et al., 2007; Ossa Ossa et al., 2019; Reinhard and Planavsky, 2022). The ancient pre-GOE origin of SAR202 is consistent with a recent study that proposed that this clade played a role in the shift of the redox state of the atmosphere during the GOE by partially metabolizing organic matter through a flavin dependent Baeyer–Villiger monooxygenase, thereby enhancing the burial of organic matter and contributing to the net accumulation of oxygen in the atmosphere (Landry et al., 2017; Shang et al., 2022). After the GOE, we detected the diversification of aerobic Ca. Marinimicrobia (2196 My, 95% CI = 2173-2219 My), the SAR324 clade (1686 My, 95% CI = 1658-1715 My), and the MGII clade (1184 My, 95% CI = 1166-1202 My) (Fig. 2). Although these ancient clades may have first diversified in an oxic environment, the increase of oxygen during the GOE was followed by a relatively rapid drop in ocean and atmosphere oxygen levels (Alcott et al., 2019; Hodgskiss et al., 2019; Reinhard and Planavsky, 2022). It is therefore likely that these clades diversified in the microaerophilic and variable oxygen conditions that prevailed during this period (Bekker et al., 2004; Holland, 2006, 2002). Indeed, the oxygen landscape in which these marine clades first diversified is consistent with their current physiology. For example, these groups are capable of using oxygen as terminal electron acceptor (e.g., nitrate and sulfate), but several representatives are prevalent in modern marine oxygen minimum (OMZs) (Pajares et al., 2020; Sheik et al., 2014; Thrash et al., 2017; Ulloa et al., 2012). The facultative aerobic or microaerophilic metabolism in these clades is therefore potentially a vestige of the low oxygen environment of most of the Proterozoic Eon, and in this way OMZs can be considered to be modern-day refugia of these ancient ocean conditions. Of the clades that diversified as part of this early period, MGII and SAR324 show the youngest colonization dates, but we suspect that this may be due to the notably long branches that lead to the crown nodes of these lineages. These long branches are likely caused by the absence of basal-branching members of these clades ─ either due to extinction events or under-sampling of rare lineages in the available genome collection ─ that would have increased the age of these lineages if present in the tree.
According to our analysis the next clades to diversify in the ocean are SAR116 (959 My, 95% CI = 945-973 My), SAR11 (725 My, 95% CI = 715-734 My), SAR86 (503 My, 95% CI = 497-509 My), SAR92 (430 My, 95% CI = 423-437 My), and Roseobacter (332 My, 95% CI = 323-340 My) (Fig. 2). The relatively late appearance of these heterotrophic lineages that are abundant in the open ocean today was potentially due to the low productivity and oxygen concentrations in both shallow and deep waters that prevailed in the Mid-Proterozoic (1800-800 My), a period known as the “boring billion” (Anbar and Knoll, 2002; Crockford et al., 2018; Hodgskiss et al., 2019; Planavsky et al., 2014; Tang et al., 2016). The diversification of these clades may be indirectly associated with the Snowball event registered before the Neoproterozoic Oxidation Event (NOE, 800-540 My) (Anbar and Knoll, 2002; Hoffman et al., 1998; Shields-Zhou and Och, 2011), which increased the availability of oxygen and inorganic nutrients in the ocean (Anbar and Knoll, 2002; Butterfield, 2001; Porter, 2004; Shields-Zhou and Och, 2011; Vidal and Moczydłowska-Vidal, 1997), and is also coincident with the widespread diversification of large eukaryotic algae during the Neoproterozoic (Anbar and Knoll, 2002; Butterfield, 2001; Porter, 2004; Shields-Zhou and Och, 2011; Vidal and Moczydłowska-Vidal, 1997). It is therefore plausible that an increase in nutrients as well as the broad diversification of eukaryotic plankton enhanced the mobility of organic and inorganic nutrients beyond the coastal areas, and increased the burial of organic matter that consequently led to an increment in atmospheric oxygen concentrations (Knoll et al., 2006; Shields-Zhou and Och, 2011). The scenario in which heterotrophic marine clades diversified in part as a consequence of the new niches built by marine eukaryotes has been previously proposed to have driven the diversification of the Roseobacter clade (Luo et al., 2013; Luo and Moran, 2014). The diversification timing of Roseobacter and other heterotrophic clades support this phenomenon and suggest that the interaction with marine eukaryotes may have broadly influenced the diversification of prevalent lineages in the modern ocean. Similar to what we observed in MGII and SAR324, the Roseobacter clade shows a long branch leading to the crown node (Fig. 1), suggesting that the diversification of this clade occurred earlier.
We also report the diversification of the chemolithoautotroph archaeal lineage MGI into the ocean after the NOE (678 My, 95% CI = 668-688 My) (Fig 2), which is comparable with the age reported by another independent study (Yang et al., 2021). This is consistent with an increase in the oxygen concentrations of the ocean during this period (Reinhard and Planavsky, 2022), a necessary requisite for ammonia oxidation. Moreover, the widespread sulfidic conditions that likely prevailed in the Mid-Proterozoic ocean may have limited the availability of redox-sensitive metals such as copper, which is necessary for ammonia monooxygenases (Anbar and Knoll, 2002; Hatzenpichler, 2012). It is therefore possible that a low concentration of oxygen and limited inorganic nutrient availability before the NOE were limiting factors that delayed the colonization of AOA into the ocean.
The most recent lineages to emerge include the genera Synechococcus (243 My, 95% CI = 238-247 My), Prochlorococcus (230 My, 95% CI = 225-234 My), and the diazotroph Crocosphaera (228 My, 95% CI = 218-237 My). Our results agree with an independent study that points to a relatively late evolution of the marine cyanobacterial clades Prochlorococcus and Synechococcus (Sánchez-Baracaldo, 2015). Picocyanobacteria and Crocosphaera are critical components of phytoplanktonic communities in the modern open ocean due to their large contribution to carbon and nitrogen fixation, respectively (Flombaum et al., 2013; Montoya et al., 2004; Scanlan et al., 2009). For example, the nitrogen fixation activities of C. watsonii in the open ocean today support the demands of nitrogen-starved microbial food webs found in oligotrophic waters (Hewson et al., 2009). The relatively late diversification of these lineages suggests that the oligotrophic open ocean is a relatively modern ecosystem. Moreover, the oligotrophic ocean today is characterized by the rapid turnover of nutrients that depends on the efficient mobilization of essential elements through the ocean (Karl, 2002). Due to its distance from terrestrial nutrient inputs, productivity in the open ocean is therefore dependent on local nitrogen fixation, which was likely enhanced after the widespread oxygenation of the ocean that made molybdenum widely available due to its high solubility in oxic seawater (Canfield et al., 2007; Scott et al., 2008; Wei et al., 2021). Such widespread oxygenation was registered 430-390 My in an event referred to here as the Paleozoic Oxidation Event (Berner and Raiswell, 1983; Lenton et al., 2016; Sperling et al., 2015; Tostevin and Mills, 2020) (POE, Fig. 2). The increase of oxygen to present-day levels in the atmosphere and the ocean was the result of an increment of the burial of organic carbon in sedimentary rocks due to the diversification of the earliest land plants (Lenton et al., 2016; Planavsky et al., 2021; Reinhard and Planavsky, 2022), which has been also associated with increased phosphorus weathering rates (Bergman et al., 2004; Lenton et al., 2016), global impacts on the global element cycles (Dahl and Arens, 2020), and an increase in overall ocean productivity (Planavsky et al., 2021). The late diversification of oligotrophic-specialized clades after the POE therefore suggests that the establishment of the oligotrophic open ocean as we know it today would only have been possible once modern oxygen concentrations and biogeochemical dynamics were reached (Karl, 2002; Reinhard and Planavsky, 2022).
To shed further light on the drivers of the colonization of the ocean, we investigated whether the diversification of marine microbial clades was linked to the acquisition of novel metabolic capabilities. We broadly classified the different clades as Early-branching Clades, Late-branching Clades, and Late-branching Phototrophs based on the general timing of their diversification (Fig. 2). To identify the enrichment of gene functions at the crown node of each marine clade (Fig. 1), we performed a stochastic mapping analysis on each of the 112,248 protein families encoded in our genome dataset. We compared our results with a null hypothesis distribution in which a constant rate model was implemented unconditionally of observed data (see Methods). Statistical comparisons of the stochastic and the null distribution show that each diversification phase was associated with the enrichment of specific functional categories that were consistent with the geochemical context of their diversification (Fig. 3 and 4). For example, Early-branching Clades (EBC) gained a disproportionate number of genes involved in DNA repair, recombination, and glutathione metabolism, consistent with the hypothesis that the GOE led to a rise in reactive oxygen species that cause DNA damage (Burrows, 2010; Khademian and Imlay, 2021; Masip et al., 2006). Moreover, the EBC were enriched in proteins involved in ancient aerobic pathways, like oxidative phosphorylation and TCA cycle (Fig. 3), as well as genes implicated in the degradation of fatty acids under aerobic conditions, for example the enzyme alkane 1-monooxygenase in MGII (Supplemental File 6). We also detected genes for the adaptation to marine environments, for instance genes for the anabolism of taurine (e.g., cysteine dioxygenase in MGII, Supplemental File 6), an osmoprotectant commonly found in marine bacteria (McParland et al., 2021). Our findings suggest that the diversification of EBC in the ocean was linked to the evolution of aerobic metabolism, the acquisition of metabolic capabilities to exploit the newly created niches that followed the increase of oxygen, and the expansion of genes involved in the tolerance to oxidative and salinity stress.
The emergence of Late-branching Clades (LBC; Fig. 3 and 4), whose diversification occurred around the time of the NOE and the initial diversification of eukaryotic algae (Parfrey et al., 2011), was characterized by the enrichment of substantially different gene repertories compared to EBC (Fig. 3). For instance, the heterotrophic lineages Roseobacter, SAR116, and SAR92 show an enrichment of flagellar assembly and motility genes (Fig. 3), including genes for flagellar biosynthesis, flagellin, and the flagellar basal-body assembly (Supplemental File 6). Motile marine heterotrophs like Roseobacter species have been associated with the marine phycosphere, a region surrounding individual phytoplankton cells releasing carbon-rich nutrients (Mühlenbruch et al., 2018; Seymour et al., 2017). Although the phycosphere can also be found in prokaryotic phytoplankton (Croft et al., 2005; Seymour et al., 2017), given the late diversification of abundant marine prokaryotic phytoplankton (Fig. 2 and 4), it is plausible that the emergence of these clades was closely related to the establishment of ecological proximity with large eukaryotic algae, as suggested by our Bayesian estimates (Fig. 2 and 4). The potential diversification of heterotrophic LBC due to their ecological proximity with eukaryotic algae is further supported by the enrichment of genes involved in vitamin B6 metabolism and folate biosynthesis, which are key nutrients involved in phytoplankton-bacteria interactions (Seymour et al., 2017). LBC were also enriched in genes for the catabolism of taurine (e.g., taurine transport system permease in SAR11 and a taurine dioxygenase in SAR86 and SAR92), suggesting that LBC gained metabolic capabilities to utilize the taurine produced by other organisms as substrate (Clifford et al., 2019), instead of producing it as osmoprotectant. Furthermore, we identified the enrichment of genes involved in carotenoid biosynthesis, including spheroidene monooxygenase, carotenoid 1,2-hydratase, beta-carotene hydroxylase, and lycopene beta-cyclase (Supplemental File 6). The production of carotenoids is consistent with their use in proteorhodopsin, a light driven proton pump that is a hallmark feature of most marine heterotrophic bacteria, in particular those that inhabit energy-depleted areas of the ocean (de la Torre et al., 2003). The enrichment of genes potentially involved in bacteria-phytoplankton interactions suggest that the diversification of LBC was potentially linked to the establishment of ecological relationships with eukaryotes to exchange nutrients.
Late-branching phototrophs that diversified around the time of the POE (LBP; Fig. 2), showed the enrichment of transporters in Crocosphaera and Synechococcus (Fig. 3). In particular, the diversification of Crocosphaera was characterized by the acquisition of transporters for inorganic nutrients like cobalt, nickel, iron, phosphonate, phosphate, ammonium, and magnesium, along with organic nutrients including amino acids and polysaccharides (Supplemental File 6). The acquisition of a wide diversity of transporters by the Crocosphaera is consistent with their boom- and-bust lifestyle seen in the oligotrophic open ocean today (Hewson et al., 2009; Wilson et al., 2017), which requires a rapid and efficient use of the scarce nutrients available. We also identified genes involved in osmotic pressure tolerance, for example a Ca-activated chloride channel homolog, a magnesium exporter, and a fluoride exporter (Supplemental File 6). In contrast, our results show that Synechococcus only acquired transporters for inorganic nutrients (e.g., iron and sulfate, Supplemental File 6), whereas Prochlorococcus did not show an enrichment of transporters (Supplemental File 6). The absence of salt-tolerance related genes suggests that the ancestor of these picocyanobacterial clades inhabited a low-nutrient marine habitat. Similar to LBC, we identified the enrichment of taurine metabolism genes in Crocosphaera and Synechococcus, suggesting that its use as osmoprotectant and potential substrate is widespread among planktonic microorganisms (Clifford et al., 2019). Prochlorococcus exhibits enrichment in fewer categories than the rest of phototrophic clades diversifying during the same period, consistent with the streamlined nature of genomes from this lineage (Partensky and Garczarek, 2010). The genes acquired by this lineage are involved in photosynthesis, which supports previous findings that the diversification of this clade was accompanied by changes in the photosynthetic apparatus compared with Synechococcus, its sister group (Biller et al., 2015). Overall, the diversification of LBP was marked by the capacity to thrive in the oligotrophic ocean by exploiting organic and inorganic nutrients and modifying the photosynthetic apparatus as observed in Crocosphaera and Synechococcus, and Prochlorococcus, respectively.
The contemporary ocean is dominated by abundant clades of bacteria and archaea that drive global biogeochemical cycles and play a central role in shaping the redox state of the planet. Despite their importance, the timing, and the geological landscape at which these clades colonized the ocean have remained unclear due to a combination of the inherent difficulties of studying biological events that occurred in deep time and the lack of a fossil record for microbial life. Yet establishing a timeline of these events is critical because the colonization of major marine lineages led to the establishment of the biogeochemical cycles that govern the environmental health of the planet today. In this study we develop a novel phylogenomic method that allows us to infer a comprehensive timeline of the colonization of the ocean by abundant marine clades. Our study presents key foundational knowledge for understanding ongoing anthropogenic changes in the ocean. Importantly, climate change is predicted to lead to an expansion of both oxygen minimum zones, which our findings suggest are refugia that date back to the mid-Proterozoic ocean, and oligotrophic surface waters, which represent ecosystems that emerged relatively recently in the Phanerozoic. Thus, the impacts of current global change can manifest similarly in ecosystems that have emerged at dramatically different periods of Earth’s history. Knowledge of how and under what geochemical conditions dominant microbial constituents first diversified provides critical context for understanding the impact of drastic climatic changes on the marine biome more broadly and will help to clarify how continuing ecological shifts will impact marine biogeochemical cycles.
Material and methods
Genome sampling and phylogenetic reconstruction
To obtain a comprehensive understanding of the diversification of the main marine planktonic clades, we built a multi-domain phylogenetic tree that included a broad diversity of bacterial and archaeal genomes. We compiled a balanced genome dataset from the Genome Taxonomy Database (Chaumeil et al., 2019) (GTDB, v95), including marine representatives by using a genome sampling strategy reported previously (Martinez-Gutierrez and Aylward, 2021). In addition, we improved the representation of marine clades by subsampling genomes from the GORG database (Pachiadaki et al., 2019), which includes a wide range of genomes derived from single-cell sequencing, and adding several Thermoarchaeota genomes available on the JGI (Nordberg et al., 2014). We discarded genomes belonging to the DPANN superphylum due to the uncertainty of their placement within the archaea (Martinez-Gutierrez and Aylward, 2021). The list of genomes used is reported in Supplemental File 1.
We reconstructed a phylogenetic tree through the benchmarked MarkerFinder pipeline developed previously (Martinez-Gutierrez and Aylward, 2021), which resulted in an alignment of 27 ribosomal genes and three RNA polymerase genes (RNAP) (Martinez-Gutierrez and Aylward, 2021). The MarkerFinder pipeline consists of 1) the identification of ribosomal and RNAP genes using HMMER v3.2.1 with the reported model-specific cutoffs (Eddy, 2011; Sievers and Higgins, 2018), 2) alignment with ClustalOmega (Sievers and Higgins, 2018), and 3) concatenation of individual alignments. The resulting concatenated alignment was trimmed using trimAl (Capella-Gutiérrez et al., 2009) with the option -gt 0.1. Phylogenetic tree inference was carried out with IQ-TREE v1.6.12 (Nguyen et al., 2015) with the options -wbt, -bb 1000 (Minh et al., 2013), -m LG+R10 (substitution model previously selected with the option -m MFP according to the Bayesian Information Criterion (Kalyaanamoorthy et al., 2017)), and --runs 5 to select the tree with the highest likelihood. The tree with the highest likelihood was manually inspected to discard the presence of topological inconsistencies and artifacts on iTOL (Letunic and Bork, 2019) (Fig. 1). The raw phylogenetic tree is presented in Supplemental File 2. In a previous study, we assessed the effect of substitution model selection on the topology of a multi-domain phylogenetic tree (Martinez-Gutierrez and Aylward, 2021) however, we did not observe topological changes between the substitution models LG+C60 and LG+R10.
Assessment of Quality Tree
Due to the key importance of tree quality for the tree-dependent analysis performed in our study, we assessed the congruence of our prokaryotic ToL through the Tree Certainty metric (TC) (Martinez-Gutierrez and Aylward, 2021; Salichos et al., 2014), which has recently been shown to be a more accurate estimate for phylogenetic congruence that the traditional bootstrap. Our estimate based on 1,000 replicate trees (TC = 0.91) indicates high congruence in our phylogeny, indicating that the phylogenetic signal across our concatenated alignment of marker genes is consistent. We also evaluated whether the topology of our ToL agrees with a high-quality prokaryotic ToL reported previously (Martinez-Gutierrez and Aylward, 2021). In general, we observed consistency in the placement of all the phyla, as well as the bacterial superphyla (Terrabacteria and Gracilicutes) between both trees, except for the sisterhood of Actinobacteriota and Armatimonadota, which differs from the sisterhood of Actinobacteriota and Firmicutes in the reference tree (Martinez-Gutierrez and Aylward, 2021). Despite this discrepancy, we do not expect that it will substantially impact the results of the current study because none of the marine clades are within this region of the tree.
Estimating the age of the crown node of bacterial and archaeal marine clades
To investigate the timing of the diversification of the marine planktonic clades, the phylogenetic tree obtained above was used to perform a molecular dating analysis of the crown nodes leading to the diversification of the main marine microbial clades. We focused our analysis on clades of bacteria and archaea that are overwhelmingly marine, such that the evolutionary history of that clade could be clearly traced back to an ancestral colonization of the ocean. Some clades, such as marine Nitrospinae and Actinobacteria, were not considered because they included several non-marine members, and it was unclear whether these lineages colonized the ocean multiple times independently. Our analysis was performed through Phylobayes v4.1c (Lartillot et al., 2009) with the program pb on four independent chains. For each chain, the input consisted in the phylogenetic tree, the amino acids alignment, the calibrations, and an autocorrelated relaxed log normal model (-ln) (Thorne et al., 1998) with the molecular evolution model CAT-Poisson+G4. Convergence was tested every 5000 cycles using the program tracecomp with a burn-in of 250 cycles and sampling every 2 cycles. After 100,000 cycles, our chains reached convergence in 8 out of 12 parameters (Supplemental File 3). To assess the uncertainty derived from the parameters that did not reach convergence, we estimated the divergence ages for each of our four chains using the last 1000 cycles and a range of 10 cycles to have a sample of 100 age estimates using the program readdiv (Supplemental File 4). We report the confidence intervals and the median age of 100 replicates of chain three throughout the manuscript.
Selection of priors and assessment of priors’ impact on posterior distribution
To determine the impact of our priors (Fig. 1 and Table 1) on the age estimates of the calibrated nodes in our tree and assess the suitability of the ages used as priors for our analyses, we ran an independent MCMC chain without the amino acid alignment using the option -root on Phylobayes. Our prior-only analysis yielded a posterior age falling within the maximum and minimum priors used for the crown group of archaea and bacteria. For the internal calibrated nodes, we observed posterior estimates consistent with the priors used for each case except for aerobic ammonia oxidizing archaea (Supplemental File 5 and 6). Overall, this result suggests that the calibrations used as priors were adequate for our analyses.
Comparison among PhyloBayes chains and between TreePL age estimates
To evaluate the reproducibility of our Bayesian molecular dating analysis, we applied a second independent approach based on Penalized Likelihood (PL) using the TreePL program (Smith and O’Meara, 2012) on 1000 replicate bootstrap trees that had fixed topology but varying branch lengths. Replicate trees were generated with the program bsBranchLenghts available on RAxML v8.2.12 (Stamatakis, 2014). For each replicate run, we initially used the option “prime” to identify the optimization parameters and applied the parameters “through” to continue iterations until convergence in the parameters of each of the 1000 runs. Moreover, we estimated the optional smoothing value for each replicate tree and ran cross-validation with the options “cv” and “randomcv” (Smith and O’Meara, 2012). The divergence times resulting from the 1000 bootstrap trees were used to assess the age variation for each node of interest (Supplemental File 4).
Although some Bayesian parameters did not reach convergence after 100,000 cycles (Supplemental File 3), the estimated ages resulting from our four independent chains were similar when compared to each other (Supplemental File 4). Moreover, our Bayesian and Penalized likelihood approaches showed similar divergence times, strengthening the conclusions of our study. We only observed notable discrepancies between these approaches in Photosynthetic Cyanobacteria (PL showing more recent divergence during the GOE), and the marine Picocyanobacteria Synechococcus and Prochlorococcus (PL showing more ancient divergence during the POE). Despite these discrepancies, the differences observed between both approaches do not alter our main conclusions regarding the emergence of marine microbial clades.
Comparing Bayesian diversification estimates with previous studies
Two estimated divergence times shown in our study disagree with previously published analyses. Firstly, a recent molecular dating estimate suggested that the transition of AOA-Archaea from terrestrial environments into marine reals occurred before the NOE (Ren et al., 2019; Smith and O’Meara, 2012) during a period known as the “boring million” characterized by low productivity and minimum oxygen concentrations in the atmosphere (0.1% the present levels) (Anbar and Knoll, 2002; Hodgskiss et al., 2019; Holland, 2006; Reinhard and Planavsky, 2022). Our estimates point to a later diversification of this lineage during or after the NOE (678 Mya, 95% CI = 668-688 Mya) (Fig. 2), which is comparable with the age reported by another independent study (Yang et al., 2021). Secondly, another study reported the origin of the Picocyanobacterial clade Prochlorococcus to be 800 My, before the Snowball Earth period registered during the Cryogen12. However, our results agree with another independent study that points to a relatively late evolution of Prochlorococcus (Sánchez-Baracaldo, 2015).
Orthologous groups detection, stochastic mapping, and functional annotation
To investigate the genomic novelties associated with the diversification of the marine microbial lineages considered in our study, we identified enriched KEGG categories in the crown node of each clade. First, we predicted protein orthologous groups with ProteinOrtho v6 (Lechner et al., 2011) using the option “lastp” and protein files downloaded from the GTDB, GORG, and JGI databases. Furthermore, we performed a functional annotation using the KEGG database (Kanehisa, 2019; Kanehisa et al., 2021; Kanehisa and Goto, 2000) through HMMER3 with an e-value of 10-5 on all proteins. Proteins with multiple annotations were filtered to keep the best-scored annotation, and we predicted the function of each protein orthologous group by using the Majority Rule Principle. The presence/absence matrix resulting from the identification of orthologous groups was used together with the phylogenetic tree utilized for molecular dating to perform 100 replicate stochastic mapping analyses on each orthologous group with the make.simmap function implemented on Phytools (Bollback, 2006; Revell, 2012) with the model “all-rates-different” (ARD). To evaluate evidence of enrichment of KEGG categories, we created a null distribution for each protein cluster by simulating under the transition matrix estimated from our stochastic mapping analysis using the function sim.history, but without conditioning on the presence/absence data at the tips (i.e. simulating a constant rate null distribution of transitions across the tree). Since some of the protein clusters show a low exchange rate (identified because one of the rows in the Q-matrix was equal to zero), we manually changed the exchange rate from zero to 0.00001. For each distribution, we estimated the number of genes gained for each KEGG category at the crown node of the marine clades. Clusters without a known annotation on the KEGG database were discarded. The resulting KEGG categories distributions for our stochastic mapping and null analyses were statistically compared using a one-tailed Wilcox test (α=0.01, N=100 for each distribution). KEGG categories showing statistically more gains in our stochastic mapping distribution were considered enriched (Supplemental File 7).
We acknowledge the use of the Virginia Tech Advanced Research Computing Center for bioinformatic analyses performed in this study. This investigation was supported by grants from the Institute for Critical Technology and Applied Science and the National Science Foundation (IIBR-2141862), and a Simons Early Career Award in Marine Microbial Ecology and Evolution to F.O.A. We kindly thank members of the Aylward Lab for their insightful comments on an earlier version of this manuscript.
Supplemental File 1. Genomes dataset used for the molecular dating of the main marine microbial clades.
Supplemental File 2. Raw maximum likelihood phylogenetic tree used for molecular dating and stochastic mapping analyses.
Supplemental File 3. Assessment of parameters convergence of four independent chains used for Bayesian molecular dating analyses. Relative difference <0.3 is shown in bold letters and denotes parameters that reached convergence after 100,000 cycles using a burn-in of 250 and sampling every two cycles.
Supplemental File 4. Comparison of the age distribution of marine microbial clades and its relationship with the main Earth oxygenation events using a Bayesian and a Penalized Likelihood approach for molecular dating. Ridges represent the age of 100 and 1000 replicate age estimates for each Bayesian independent chains and Penalized Likelihood analyses, respectively (see Methods).
Supplemental File 5. Estimated ages for calibrated nodes showing their suitability as priors for Bayesian molecular dating. Values resulted from running an independent chain on the temporal calibrations without sequence data (-root option on Phylobayes). Bars represent the standard error of the cycles tested.
Supplemental File 6. KOs gained at the crown node of each marine microbial clade. A KO was considered as gained when found in 51 out of 100 stochastic mapping replicates.
Supplemental File 7. Enriched KEGG categories at the crown node of each marine microbial clade. Clades were classified based on the diversification timing shown in Fig. 2. Enriched categories were identified by statistically comparing a stochastic mapping distribution with an all-rates-different vs a null distribution with a constant rate model without conditioning on the presence/absence data at the tips. Each dot represents one replicate (See Methods). X-axis represents the number of KOs gained at each crown node for each KEGG category. Stochastic mapping and null distributions were sorted for visualization purposes.
- Stepwise Earth oxygenation is an inherent property of global biogeochemical cyclingScience 366:1333–1337
- A whiff of oxygen before the great oxidation event?Science 317:1903–1906
- Proterozoic ocean chemistry and evolution: a bioinorganic bridge?Science 297:1137–1142
- Dating the rise of atmospheric oxygenNature 427:117–120
- BiogeochemistryCOPSE: A new model of biogeochemical cycling over Phanerozoic time
- Burial of organic carbon and pyrite sulfur in sediments over phanerozoic time: a new theoryGeochimica et Cosmochimica Acta https://doi.org/10.1016/0016-7037(83)90151-5
- Prochlorococcus: the structure and function of collective diversityNat Rev Microbiol 13:13–27
- SIMMAP: stochastic character mapping of discrete traits on phylogeniesBMC Bioinformatics 7
- A billion years of environmental stability and the emergence of eukaryotes: new data from northern AustraliaGeology 26:555–558
- A trait based perspective on the biogeography of common and abundant marine bacterioplankton cladesMar Genomics 15:17–28
- Surviving an Oxygen Atmosphere: DNA Damage and RepairACS Symposium Series https://doi.org/10.1021/bk-2009-1025.ch008
- Paleobiology of the late Mesoproterozoic (ca. 1200 Ma) Hunting Formation, Somerset Island, arctic CanadaPrecambrian Research https://doi.org/10.1016/s0301-9268(01)00162-0
- Bangiomorpha pubescensn. gen., n. sp.: implications for the evolution of sex, multicellularity, and the Mesoproterozoic/Neoproterozoic radiation of eukaryotesPaleobiology https://doi.org/10.1666/0094-8373(2000)026<0386:bpngns>2.0.co;2
- Paleobiology of the Neoproterozoic Svanbergfjellet FormationSpitsbergen. Wiley-Blackwell
- Late-Neoproterozoic deep-ocean oxygenation and the rise of animal lifeScience 315:92–95
- trimAl: a tool for automated alignment trimming in large-scale phylogenetic analysesBioinformatics 25:1972–1973
- GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy DatabaseBioinformatics 36:1925–1927
- Taurine Is a Major Carbon and Energy Source for Marine Prokaryotes in the North Atlantic Ocean off the Iberian PeninsulaMicrobial Ecology https://doi.org/10.1007/s00248-019-01320-y
- A rooted phylogeny resolves early bacterial evolutionScience 372https://doi.org/10.1126/science.abe0511
- Triple oxygen isotope evidence for limited mid-Proterozoic primary productivityNature 559:613–616
- Algae acquire vitamin B12 through a symbiotic relationship with bacteriaNature 438:90–93
- The impacts of land plant evolution on Earth’s climate and oxygenation state – An interdisciplinary reviewChemical Geology https://doi.org/10.1016/j.chemgeo.2020.119665
- Proteorhodopsin genes are distributed among divergent marine bacterial taxaProc Natl Acad Sci U S A 100:12830–12835
- Biogeochemical Cycles: Ecological Drivers and Environmental Impact
- What Is the Metabolic State of the Oligotrophic Ocean? A DebateAnnual Review of Marine Science https://doi.org/10.1146/annurev-marine-121211-172331
- Accelerated Profile HMM SearchesPLoS Comput Biol 7
- Biogeochemical Controls and Feedbacks on Ocean Primary ProductionScience 281:200–207
- Primary production of the biosphere: integrating terrestrial and oceanic componentsScience 281:237–240
- Present and future global distributions of the marine Cyanobacteria Prochlorococcus and SynechococcusProc Natl Acad Sci U S A 110:9824–9829
- Molecular diversity and ecology of microbial planktonNature https://doi.org/10.1038/nature04158
- Diversity, physiology, and niche differentiation of ammonia-oxidizing archaeaAppl Environ Microbiol 78:7501–7510
- In situ transcriptomic analysis of the globally important keystone N2-fixing taxon Crocosphaera watsoniiISME J 3:618–631
- A productivity collapse to end Earth’s Great OxidationProceedings of the National Academy of Sciences https://doi.org/10.1073/pnas.1900325116
- A Neoproterozoic Snowball EarthScience https://doi.org/10.1126/science.281.5381.1342
- The oxygenation of the atmosphere and oceansPhilosophical Transactions of the Royal Society B: Biological Sciences https://doi.org/10.1098/rstb.2006.1838
- Volcanic gases, black smokers, and the great oxidation eventGeochimica et Cosmochimica Acta https://doi.org/10.1016/s0016-7037(02)00950-x
- ModelFinder: fast model selection for accurate phylogenetic estimatesNat Methods 14:587–589
- Toward understanding the origin and evolution of cellular organismsProtein Sci 28:1947–1951
- KEGG: integrating viruses and cellular organismsNucleic Acids Res 49:D545–D551
- KEGG: kyoto encyclopedia of genes and genomesNucleic Acids Res 28:27–30
- Nutrient dynamics in the deep blue seaTrends Microbiol 10:410–418
- How Microbes Evolved to Tolerate OxygenTrends Microbiol 29:428–440
- Eukaryotic organisms in Proterozoic oceansPhilos Trans R Soc Lond B Biol Sci 361:1023–1038
- SAR202 Genomes from the Dark Ocean Predict Pathways for the Oxidation of Recalcitrant Dissolved Organic MatterMBio 8https://doi.org/10.1128/mBio.00413-17
- PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular datingBioinformatics 25:2286–2288
- Proteinortho: detection of (co-)orthologs in large-scale analysisBMC Bioinformatics 12
- Earliest land plants created modern levels of atmospheric oxygenProc Natl Acad Sci U S A 113:9704–9709
- Interactive Tree Of Life (iTOL) v4: recent updates and new developmentsNucleic Acids Research https://doi.org/10.1093/nar/gkz239
- Evolution of divergent life history strategies in marine alphaproteobacteriaMBio 4https://doi.org/10.1128/mBio.00373-13
- Evolutionary ecology of the marine Roseobacter cladeMicrobiol Mol Biol Rev 78:573–587
- Phylogenetic Signal, Congruence, and Uncertainty across Bacteria and ArchaeaMol Biol Evol 38:5514–5527
- The many faces of glutathione in bacteriaAntioxid Redox Signal 8:753–762
- Prokaryotic diversity, distribution, and insights into their role in biogeochemical cycling in marine basaltsISME J 3:231–242
- The Osmolyte Ties That Bind: Genomic Insights Into Synthesis and Breakdown of Organic Osmolytes in Marine MicrobesFrontiers in Marine Science https://doi.org/10.3389/fmars.2021.689306
- Ultrafast approximation for phylogenetic bootstrapMol Biol Evol 30:1188–1195
- High rates of N2 fixation by unicellular diazotrophs in the oligotrophic Pacific OceanNature 430:1027–1032
- Mini-review: Phytoplankton-derived polysaccharides in the marine environment and their interactions with heterotrophic bacteriaEnviron Microbiol 20:2671–2685
- IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogeniesMol Biol Evol 32:268–274
- The genome portal of the Department of Energy Joint Genome Institute: 2014 updatesNucleic Acids Res 42:D26–31
- The Neoproterozoic oxygenation event: Environmental perturbations and biogeochemical cyclingEarth-Science Reviews https://doi.org/10.1016/j.earscirev.2011.09.004
- Limited oxygen production in the Mesoarchean oceanProc Natl Acad Sci U S A 116:6647–6652
- Charting the Complexity of the Marine Microbiome through Single-Cell GenomicsCell 179:1623–1635
- Spatial Distribution Patterns of Bacterioplankton in the Oxygen Minimum Zone of the Tropical Mexican PacificMicrob Ecol 80:519–536
- Estimating the timing of early eukaryotic diversification with multigene molecular clocksProc Natl Acad Sci U S A 108:13624–13629
- Prochlorococcus: advantages and limits of minimalismAnn Rev Mar Sci 2:305–331
- Evolution of the structure and impact of Earth’s biosphereNature Reviews Earth & Environment https://doi.org/10.1038/s43017-020-00116-w
- Earth history. Low mid-Proterozoic atmospheric oxygen levels and the delayed rise of animalsScience 346:635–638
- The fossil record of early eukaryotic diversificationThe Paleontological Society Papers https://doi.org/10.1017/s1089332600002321
- The History of Ocean OxygenationAnn Rev Mar Sci 14:331–353
- Phylogenomics suggests oxygen availability as a driving force in Thaumarchaeota evolutionISME J 13:2150–2161
- . phytools: an R package for phylogenetic comparative biology (and other things)Methods in Ecology and Evolution https://doi.org/10.1111/j.2041-210x.2011.00169.x
- Novel information theory-based measures for quantifying incongruence among phylogenetic treesMol Biol Evol 31:1261–1271
- Origin of marine planktonic cyanobacteriaSci Rep 5
- Insights Into the Evolution of Picocyanobacteria and Phycoerythrin Genes (mpeBA and cpeBA)Frontiers in Microbiology https://doi.org/10.3389/fmicb.2019.00045
- Ecological Genomics of Marine PicocyanobacteriaMicrobiology and Molecular Biology Reviews https://doi.org/10.1128/mmbr.00035-08
- Tracing the stepwise oxygenation of the Proterozoic oceanNature 452:456–459
- Zooming in on the phycosphere: the ecological interface for phytoplankton–bacteria relationshipsNature Microbiology https://doi.org/10.1038/nmicrobiol.2017.65
- Oxidative metabolisms catalyzed Earth’s oxygenationNat Commun 13
- Metabolic flexibility of enigmatic SAR324 revealed through metagenomics and metatranscriptomicsEnviron Microbiol 16:304–317
- The case for a Neoproterozoic Oxygenation Event: Geochemical evidence and biological consequencesGSA Today https://doi.org/10.1130/gsatg102a.1
- Clustal Omega for making accurate alignments of many protein sequencesProtein Sci 27:135–145
- treePL: divergence time estimation using penalized likelihood for large phylogeniesBioinformatics 28:2689–2690
- Statistical analysis of iron geochemical data suggests limited late Proterozoic oxygenationNature https://doi.org/10.1038/nature14589
- RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogeniesBioinformatics 30:1312–1313
- Extremely low oxygen concentration in mid-Proterozoic shallow seawatersPrecambrian Research https://doi.org/10.1016/j.precamres.2016.02.005
- Estimating the rate of evolution of the rate of molecular evolutionMolecular Biology and Evolution https://doi.org/10.1093/oxfordjournals.molbev.a025892
- Metabolic Roles of Uncultivated Bacterioplankton Lineages in the Northern Gulf of Mexico “Dead Zone.”MBio 8https://doi.org/10.1128/mBio.01017-17
- Reconciling proxy records and models of Earth’s oxygenation during the Neoproterozoic and PalaeozoicInterface Focus 10
- Evidence from fluid inclusions for microbial methanogenesis in the early Archaean eraNature 440:516–519
- Microbial oceanography of anoxic oxygen minimum zonesProc Natl Acad Sci U S A 109:15996–16003
- Hadean age for a post-magma-ocean zircon confirmed by atom-probe tomographyNature Geoscience https://doi.org/10.1038/ngeo2075
- Biodiversity, speciation, and extinction trends of Proterozoic and Cambrian phytoplanktonPaleobiology https://doi.org/10.1017/s0094837300016808
- Dimethylsulfoniopropionate uptake by marine phytoplanktonScience 314:652–654
- Timescales of Oxygenation Following the Evolution of Oxygenic PhotosynthesisOrig Life Evol Biosph 46:51–65
- Global marine redox evolution from the late Neoproterozoic to the early Paleozoic constrained by the integration of Mo and U isotope recordsEarth-Science Reviews https://doi.org/10.1016/j.earscirev.2021.103506
- Coordinated regulation of growth, activity and transcription in natural populations of the unicellular nitrogen-fixing cyanobacterium CrocosphaeraNat Microbiol 2
- The Evolution Pathway of Ammonia-Oxidizing Archaea Shaped by Major Geological EventsMol Biol Evol 38:3637–3648
- Nitrogen cycle of the open ocean: from genes to ecosystemsAnn Rev Mar Sci 3:197–225
- Snowball Earth, population bottleneck and evolutionProc Biol Sci 288