Yerba mate (Ilex paraguariensis) genome provides new insights into convergent evolution of caffeine biosynthesis
Abstract
Yerba mate (YM, Ilex paraguariensis) is an economically important crop marketed for the elaboration of mate, the third-most widely consumed caffeine-containing infusion worldwide. Here, we report the first genome assembly of this species, which has a total length of 1.06 Gb and contains 53,390 protein-coding genes. Comparative analyses revealed that the large YM genome size is partly due to a whole-genome duplication (Ip-α) during the early evolutionary history of Ilex, in addition to the hexaploidization event (γ) shared by core eudicots. Characterization of the genome allowed us to clone the genes encoding methyltransferase enzymes that catalyse multiple reactions required for caffeine production. To our surprise, this species has converged upon a different biochemical pathway compared to that of coffee and tea. In order to gain insight into the structural basis for the convergent enzyme activities, we obtained a crystal structure for the terminal enzyme in the pathway that forms caffeine. The structure reveals that convergent solutions have evolved for substrate positioning because different amino acid residues facilitate a different substrate orientation such that efficient methylation occurs in the independently evolved enzymes in YM and coffee. While our results show phylogenomic constraint limits the genes coopted for convergence of caffeine biosynthesis, the X-ray diffraction data suggest structural constraints are minimal for the convergent evolution of individual reactions.
Editor's evaluation
This very important study combines genomics, biochemistry, structural biology and ancestral sequence reconstruction to address the basis of caffeine biosynthesis in Yerba mate, a species that is phylogenetically unrelated to other plants, namely coffee and tea, in which this pathway has been studied before. The manuscript reports the first draft genome sequence for yerba mate and provides convincing evidence for the identity and characteristics of enzymes for caffeine biosynthesis. The authors are able to propose structural constraints for the convergent evolution of individual reactions. The work will be of interest to plant and evolutionary biologists and anyone studying natural product biosynthesis.
https://doi.org/10.7554/eLife.104759.sa0Introduction
In the genomic era, hundreds of plant species have had their nucleotide sequences determined to provide unprecedented insight into the genetic basis of many traits. One of the few species of economic importance for which no genomic data exist is Ilex paraguariensis var. paraguariensis A. St. Hilaire (Aquifoliaceae), colloquially known as yerba mate (YM), which is a caffeinated diploid tree-species (2n = 2x = 40) endemic to the subtropical rainforests of South America (Niklas, 1987). The dried leaves and twigs of this dioecious evergreen are used to prepare a traditional infusion named mate, or chimarrão, widely consumed around the world. Approximately 300,000 ha are cultivated with this tree crop, with Argentina responsible for 80% of worldwide production (Heck and de Mejia, 2007). The mate infusion has been shown to have numerous beneficial effects in humans including as an antioxidant (Sánchez Boado et al., 2015; Gugliucci, 1996; Vieira et al., 2010), antidiabetic (Kang et al., 2012; Ríos et al., 2015), as well as central nervous system stimulant (Santos et al., 2015), among others. Several bioactive compounds have been identified in YM that might be responsible for its effects, including terpenes, flavonoids, phenolics, and methylxanthines (Heck and de Mejia, 2007). Although its stimulant properties are best known and mostly related to caffeine content, little is known about the genetic and biochemical mechanisms of how YM synthesizes this, or any, of its important metabolites. Despite the recent release of three other Ilex genome sequences (Kong et al., 2022; Xu et al., 2022; Yao et al., 2022), none of the species produce caffeine, making the genetic basis for convergent evolution of this trait in YM unclear.
Convergent evolution has occurred throughout the tree of life and is particularly rampant in plants (Sackton and Clark, 2019) where examples of repeated origins of morphological (Thorogood et al., 2018), anatomical (Wan et al., 2018), physiological (Yang et al., 2017), and biochemical (Pichersky and Lewinsohn, 2011) traits have been documented. Caffeine (CF) is a xanthine alkaloid that has independently evolved no less than six times across angiosperms and has implications for pollination, insect defence, and allelopathy (Anaya et al., 2006; Stevenson et al., 2017). There are multiple biosynthetic routes to caffeine possible within the xanthine alkaloid network (Figure 1). Within the Rosid genera Theobroma, Paullinia, and Citrus, sequential methylation of xanthine (X), 1- and/or 3-methylxanthine (1X, 3X) and, finally, either theophylline (TP) or theobromine (TB) leads to caffeine (Huang et al., 2016). In contrast, the Asterids, Coffea and Camellia, appear to sequentially methylate xanthosine (XR), 7-methylxanthine (7X), and theobromine (TB) to yield caffeine (Ashihara et al., 1996; Suzuki and Takahashi, 1976; Figure 1). Regardless of which pathway is utilized, species differ in terms of which SABATH enzyme family members were convergently recruited to synthesize caffeine: xanthine methyltransferase (XMT) is used by Citrus and Coffea while the paralogous caffeine synthase (CS) is used by Camellia, Theobroma, and Paullinia (Huang et al., 2016; Kato et al., 1996; Uefuji et al., 2003; Figure 1). Convergence appears to also extend to the mutational level, since different amino acid replacements to homologous regions of CS and XMT enzymes appear to govern the evolution of substrate preference switches (O’Donnell et al., 2021). However, it remains unclear whether mutations lead to convergent three-dimensional protein structures to confer convergent substrate interactions and catalysis by the enzymes. Because XMT- and CS-type enzymes have been convergently recruited in both Rosids and Asterids to catalyse the same or different pathways, it suggests considerable evolutionary lability underlying caffeine production in plants (Huang et al., 2016). As a result, it is difficult to predict what sets of genes, biochemical reactions, and structural properties might lead to the evolution of caffeine biosynthesis in YM, or any plant, a priori.
Although some transcriptomic resources have been generated for YM (Acevedo et al., 2019; Debat et al., 2014; Fay et al., 2018), a complete genome sequence has the potential to advance our understanding of the metabolic potential of this important crop and facilitate improvement. Here, we describe the first draft genome of YM and report on its composition, organization, and evolution. The genomic sequence enabled us to uncover the genetic, biochemical, mutational, and structural bases for convergent evolution of caffeine in YM. Our comparative analyses of caffeine-producing enzymes across angiosperms reveal how convergence may be the result of constrained evolutionary genomic potential but relatively unconstrained structural potential.
Results and discussion
YM genome sequencing, assembly, and annotation
The YM genome was sequenced combining Illumina and PacBio sequencing technologies. With Illumina sequencing, we generated ~263.2 Gb of short reads from various DNA fragment sizes (350 bp, 550 bp, 3 kbp, 8 kbp, and 12 kbp), while with PacBio sequencing, we generated ~77.5 Gb of long reads. These reads represent ~158.5- and ~49.3-fold base-pair coverage of the genome, respectively (Table 1). The total assembly length was ~1.06 Gb and consisted of 10,611 scaffolds (≥1 kb) with an N50 length of ~510.8 kb (Table 2). To assess the completeness of the genome, we aligned the available YM transcriptome reads (Acevedo et al., 2019; Debat et al., 2014; Fay et al., 2018) and the YM genomic short reads generated in this study with the assembly: 99.3% of the former and 99.5% of the latter were mapped. The GC content of the genome assembly was 36.33% (Table 2), similar to that of other eudicots (33.70–38.20 GC%) (Singh et al., 2016) and almost identical to that of Ilex polyneura (36.08 GC%) (Yao et al., 2022), Ilex asprella (36.25 GC%) (Kong et al., 2022), and Ilex latifolia (36.44 GC%) (Xu et al., 2022), the only three Ilex species with sequenced genomes. About 64.63% of the genome assembly was composed of repetitive sequences, of which ~36.22% were retrotransposons, ~1.80% were DNA transposons, ~0.74% were simple repeats, and ~0.15% were low complexity regions. Long terminal-repeat retrotransposons of the Gypsy and Copia families were the most abundant transposable elements, as observed in many sequenced plant genomes (Galindo-González et al., 2017), followed by long interspersed nuclear elements (LINEs) and hobo-Activator transposons, among others (Table 3).
A total of 53,390 protein-coding genes were predicted in the genome, with a mean coding sequence length of 3062 bp and 4.23 exons per gene. Of these, 41,483 (~77.63%) could be annotated with GO terms, EC numbers or Pfam domains. In addition, we identified 4530 non-coding RNA genes, including 2670 small nucleolar RNAs, 815 transfer RNAs, 471 ribosomal RNAs, 348 small nuclear RNAs, and 226 micro RNAs (Appendix 1, Appendix 1—tables 1–3). To further assess the completeness of the assembly, we aligned the scaffolds with the KOG (Tatusov et al., 2003) and DEG (Luo et al., 2014) databases, determining that 98% of the core gene families from the KOG database and 97.5% of the Arabidopsis thaliana DEG subset were present. Then, we performed a Benchmarking Universal Single-Copy Orthologs (BUSCO) (Manni et al., 2021) assessment using the eudicot ODB10 database. Among 2326 conserved single-copy genes, ~96.20% were retrieved, of which ~78.80% were complete and single copies, ~17.40% were complete and in duplicates, ~3.10% were fragmented, and only ~0.70% were missing. These results suggest that the coding region of the assembly is nearly complete. The number of estimated genes for YM is higher than the ca. 39,000 reported from the genome sequences of other Ilex species (Kong et al., 2022; Xu et al., 2022; Yao et al., 2022). This could be at least partly due to the larger genome size of YM as estimated from flow cytometry relative to the other species (Gottlieb and Poggio, 2015).
Evolutionary analysis of YM genome provides evidence of whole-genome duplication in an early Ilex ancestor
Most plant lineages have experienced ancient polyploidization events followed by massive duplicate gene losses and genome rearrangements, which may have contributed to the evolution of developmental and metabolic complexity (Landis et al., 2018; Sankoff and Zheng, 2018). Recent transcriptome-based analyses (One Thousand Plant Transcriptomes Initiative, 2019; Zhang et al., 2020b) reported an ancient polyploidization event in the Ilex lineage around 60 Ma (Cretaceous–Paleogene boundary), based on phylogenomic and synonymous substitution rate (Ks) evidence. Evolutionary analyses of I. polyneura (Yao et al., 2022) and I. latifolia (Xu et al., 2022) genomes also provided evidence of a shared Ilex-specific whole-genome duplication (WGD). As YM is the first American holly to have its genome sequenced, we performed synteny-based analyses of its genome to deepen our understanding of Aquifoliales evolution (Figure 2, Figure 2—figure supplement 1). The Ks distribution of YM paralogues (Figure 2B) revealed a significant peak with a median Ks value of ~0.37, not shared with the rest of the eudicot genomes analysed (Figure 2B, Figure 2—figure supplement 1). This confirms the lineage-specific polyploidization event (Ip-α) previously reported in Ilex (One Thousand Plant Transcriptomes Initiative, 2019; Xu et al., 2022; Yao et al., 2022; Zhang et al., 2020b), in addition to the shared ancestral WGT-γ which is indicated by a median Ks value of ~1.4 (Figure 2B). A WGD in the common ancestor of Ilex species is further supported by 2:1 syntenic depth ratios between the YM genome and the coffee and grape genomes, which did not experience additional duplication events after the ancestral WGT-γ (Figure 2C). In order to determine the age of Ip-α, we used two different phylogenies (Figure 2A). The plastid genome phylogeny supports the monophyly of Aquifoliales as the first diverging clade of campanulids (Magallón et al., 2015); the alternative nuclear genome phylogeny supports Ilex in Aquifoliales I as an early branching lineage of lamiids (Zhang et al., 2020b). With the former phylogeny, we estimated the age of the WGD event between 48.75 and 69.63 Ma while, with the latter, divergence was estimated at 49.43 and 70.62 Ma (Figure 2A). Both estimates are consistent with that of Zhang et al., 2020a and validate the age of Ip-α near the origin of Ilex, which is estimated between 43 and 89 Ma (Yao et al., 2021).
Convergent evolution of caffeine biosynthesis in YM
In order to determine the genes and biochemical pathway responsible for caffeine biosynthesis in YM, we used bioinformatic analyses to identify SABATH enzyme family members in the genome (Huang et al., 2016; Kato et al., 1996; Uefuji et al., 2003). There appear to be 28 full-length SABATH genes in YM that encode members of the functionally diverse clades of the family, including SAMT (Ross et al., 1999) and JMT (Seo et al., 2001), among others (Figure 3A). Our phylogenetic analysis showed that although the YM genome does not appear to encode XMT-type caffeine-producing enzymes like Coffea and Citrus, it does contain three recently and tandemly duplicated genes that encode CS-type enzymes, IpCS1, IpCS2, and IpCS3 (Figure 3A, C, Appendix 2). The duplicated IpCS1–3 are 86–91% identical at the amino acid level and are expressed at highest levels in caffeine-accumulating tissue (Figure 3B). IpCS1–3 also appear to be of recent origin, since non-caffeine accumulating Ilex species only have a single gene or gene fragment in the syntenic region (Figure 3C). In Camellia, Theobroma, and Paullinia, recent duplications of the CS-type enzymes responsible for the successive steps of xanthine alkaloid methylation have also independently occurred (Figure 3A; Huang et al., 2016; O’Donnell et al., 2021). Two other YM genes encode IpCS4 and 5, but these are not syntenic with IpCS1–3 and are not highly expressed in any tissues studied (Figure 3B); therefore, we did not characterize them further.
To investigate the biochemical activities of the enzymes encoded by the three CS-type genes, we cloned them into bacterial expression vectors and determined heterologous protein functions. One enzyme, IpCS1, appears to primarily methylate X to catalyse the formation of 3X (Figure 4). A second enzyme, IpCS2, shows activity only with 3X to produce TB, while a third enzyme, IpCS3, exhibits a preference to methylate TB to form CF (Figure 4). Thus, collectively, these three enzymes appear capable of catalysing a complete pathway from xanthine to caffeine. The apparent KM for the preferred substrates of all three enzymes ranges from 85 to 197 μM, and the kcat/KM estimates are comparable to those determined for other caffeine biosynthetic enzymes (O’Donnell et al., 2021; Table 4, Figure 4—figure supplement 1). Further evidence for this biosynthetic pathway has been reported by 14C xanthine tracer studies in young leaf segments of I. paraguariensis that showed radioactivity in 3X and TB in addition to CF (Yin et al., 2015). A pathway from X→3X→TB→CF has also been reported for Theobroma and Paullinia using CS-type SABATH enzymes (Huang et al., 2016). Like Huang et al., 2016, this represents another departure from the long-assumed pathway to caffeine biosynthesis (XR→7X→TB→CF) as reported in coffee and tea (Figure 1). This instance in Ilex is particularly notable since YM is an Asterid, like coffee and tea. The fact that Ilex, Theobroma, and Paullinia convergently recruited CS genes that independently duplicated and evolved to encode enzymes with similar substrate preferences to catalyse a common pathway to caffeine, in spite of their divergence more than 100 Ma (Yang et al., 2020), is remarkable and suggests a high degree of genetic constraint governing the repeated origin of this trait.
While the substrate preferences shown in Figure 4 suggest pathway flux from X→3X→TB→CF, IpCS1 also shows secondary activity with 7X to produce TB and IpCS3 can catalyse the formation of CF from paraxanthine (PX) (Figure 4A). Thus, flux through other branches of the xanthine alkaloid biosynthetic network (Figure 1) cannot be excluded. However, it is not clear how 7X or PX would be produced in planta since none of the three enzymes studied here is capable of their formation; therefore, these secondary activities may not be physiologically relevant. In addition, it has been proposed that TP may also be a precursor to caffeine biosynthesis in I. paraguariensis based on radioisotopic feeding studies (Yin et al., 2015), although its levels in plant tissues are 30–160 times lower than TB (Negrin et al., 2019). Our in vitro enzyme assays provide no experimental evidence for that biosynthetic route; however, it is possible that additional MT enzymes from the SABATH (or other) gene family not characterized in this study may perform such reactions. Alternatively, if the exogenously supplied TP was first catabolized to 3X in YM tissues, then the caffeine detected previously (Yin et al., 2015) could have been synthesized via the route described above for IpCS2 + IpCS3 (Figure 4).
The caffeine biosynthetic pathway in YM evolved from ancestral networks with different inferred flux
Caffeine is produced within only one small lineage of Ilex that diverged and experienced CS gene duplication (Figure 3) within the last 11 million years (Negrin et al., 2019; Yao et al., 2021) which indicates that the pathway has only recently evolved. The nature by which novel multistep biochemical pathways evolve is a central question in biology (Noda-Garcia et al., 2018). To investigate the caffeine pathway origin in YM, we used Ancestral Sequence Reconstruction (Dean and Thornton, 2007; Thornton, 2004) to study AncIpCS1 and AncIpCS2, the ancestors of the three modern-day enzymes implicated in caffeine biosynthesis in YM (Figure 5, Figure 5—figure supplements 1–4). The ancestral enzyme, AncIpCS1, which gave rise to all three modern-day YM enzymes, exhibits highest relative activity with X, 3X, and 7X (Figure 5A). Methylation of 7X by AncIpCS1 occurred at the N3 position resulting in TB synthesis, whereas xanthine methylation occurred at either the N1 or N3 position to form 1X and 3X, respectively (Figure 5B, Figure 5—figure supplement 5A). AncIpCS1 was capable of methylation of 3X at N1 to produce TP, while methylation at the N7 position led to TB formation (Figure 5B, Figure 5—figure supplement 5A). These data demonstrate that, although AncIpCS1 could not produce caffeine, it could methylate xanthine alkaloids at 3 different positions of the planar heterocyclic ring structures and this combination of activities would have allowed for the ancestor of YM to produce a cocktail of 1X, 3X, TP, and TB by flux through multiple branches of the xanthine alkaloid biosynthetic network with a single enzyme (Figure 5B).
After gene duplication of AncIpCS1, one daughter enzyme ultimately gave rise to IpCS1, which exhibits preference to methylate xanthine to produce 3X (Figure 5A). The other daughter enzyme, AncIpCS2, appears to have maintained highest activity with X, 3X, and 7X like AncIpCS1 (Figure 5A). However, unlike its ancestor, AncIpCS1, AncIpCS2 evolved high relative activity with 7X to produce not just TB, but also PX by methylation at the N1 position (Figure 5A, Figure 5—figure supplement 5B). AncIpCS2 retained the ancestral activity of AncIpCS1 with xanthine to produce 1X, but also evolved the ability to methylate X at the N7 position (Figure 5B, Figure 5—figure supplement 5B). This enzyme also retained ancestral activity with 3X to produce only TB by N7 methylation but lost the ability to methylate 3X at the N1 position to form TP. These activities of AncIpCS2 would have allowed for ancestral flux to produce 1X, 7X, TB, and PX but not caffeine. Because a YM ancestor could have possessed both AncIpCS2 and a descendant of AncIpCS1, AncIpCS1’ (Figure 5B), additional pathway flux is possible. If AncIpCS1’ retained activities of its ancestor, AncIpCS1, then the ancestral Ilex species could have also produced 3X and TP making for an even more diverse array of xanthine alkaloids in its tissues (Figure 5B). It has been shown that the xanthine alkaloids, 1X, 3X, and TP, can bind to modern-day rat adenosine receptors (Daly et al., 1983). Therefore, if these molecules were to accumulate in ancestral Ilex tissues, they could have conferred a selective advantage which would likely result in retention of the ancient genes. Ultimately, once gene duplication led to the generation of the three modern-day CS-type enzymes in YM, pathway flux could be channelled away from intermediates like 1X and TP such that the modern-day pathway to caffeine evolved (Figure 5B). Not only did the modern-day CS enzymes of YM evolve to catalyse a pathway from X>3X>TB>CF from ancestral biosynthetic networks of different products, Theobroma and Paullinia also independently evolved enzymes with similar properties (Huang et al., 2016). And, they did so from ancestral pathways that, like YM, had alternative ancestral fluxes (O’Donnell et al., 2021). While it could be due to chance alone that all three lineages converged to catalyse a similar pathway from differing ancestral networks, it is also possible that it was advantageous to specialize for flux to TB via X and 3X because either it is more enzymatically favourable or these intermediates have greater adaptive value than other structural isomers.
Protein crystal structure of IpCS3 reveals convergent structural basis for methylation of theobromine to form caffeine
We successfully crystallized and determined the 2.7 Å resolution structure of IpCS3 (PDB ID: 8UZD), that converts TB into CF. This enzyme crystallizes as a holo-homodimer, bound to both of its reaction products: S-adenosyl-homocysteine (SAH) and caffeine (Figure 6A, Table 5). As is typical for the SABATH family of methyltransferases, IpCS3 exhibits a Rossman-like fold composed of seven β-strands surrounded by five α-helices which bind the methyl-donor S-adenosyl-L-methionine (SAM), as well as an α-helical cap which binds the methyl-acceptor theobromine (McCarthy and McCarthy, 2007; Petronikolou et al., 2018; Zhao et al., 2008; Zubieta et al., 2003). This structural information of the enzyme bound to both of its products, SAH and caffeine, facilitates an in-depth comparison of the active site structures of the caffeine-producing CS-type enzyme found in Ilex to the XMT-type enzyme in Coffea canephora (McCarthy and McCarthy, 2007) (CcDXMT) to determine the extent to which convergence of physicochemical properties of the active site has allowed for independent specialization for theobromine methylation by the paralogous SABATH enzymes. Although the IpCS3 structure was obtained in complex with its product, caffeine (Figure 6—figure supplement 1), it can be assumed that the binding mode is conserved for its precursor, theobromine. Indeed, our computational modelling of theobromine in the active site of IpCS3 predicts it to be oriented as we have discerned from the diffraction data (Figure 6—figure supplement 2). Thus, in the following comparisons, the atomic numbering for the theobromine precursor will be used to facilitate comparison to the CcDXMT structure.
In both CcDXMT and IpCS3, there are several conserved residues, shared by nearly all SABATH enzymes (Figure 6—figure supplements 3 and 4), that form the active site pocket and appear to play important roles in binding many different substrates (Petronikolou et al., 2018; Zubieta et al., 2003). His160 and Trp161 in CcDXMT are in the same relative positions as His155 and Trp156 in IpCS3 (Figure 6B, C). These residues are ca. 3 Å from TB and participate in H-bonding but to different atoms of the substrate. In CcDXMT, the NE2 of His160 and NE1 of Trp161 form hydrogen bonds to carbonyl O2 of TB when positioned for N1 methylation; yet, in the structure of IpCS3 these corresponding side chain groups form hydrogen bonds to O6 of TB. Despite these two residues being conserved for H-bonding, the substrates are rotated 180° along an axis going through N1 and C4. Thus, the conserved His and Trp residues interact with opposing carbonyls in TB/CF but still position the substrate for N1 methylation (Figure 6B, C).
On the other hand, there are residues that differ between the two enzymes but appear to provide for important substrate interactions. Specifically, in the structure of CcDXMT, the hydroxyl group of Ser237 allows specific hydrogen bonding with N9 to position TB for N1 methylation (Figure 6C). In IpCS3, His236 is found at the homologous position in the structure. Nevertheless, its involvement in H-bonding with N9 is uncertain as the distance between nitrogen atoms is ca. 4 Å. Tyr368 of CcDXMT is found to participate in π–π interactions with the ring structure of TB Lanzarotti et al., 2020; yet in IpCS3, Asn353 is found in the homologous position and the amine forms a hydrogen bond with N9 due to its proximity within 3.2 Å, which is additionally stabilized by Asn221 and His236 (Figure 6B). The caffeine-producing CS-type enzymes found in Camellia sinensis (CsTCS1) and Paullinia cupana (PcCS), may share the same interaction pattern observed with Asn353 in IpCS3 because the homologous Thr in CsTCS1 or Gln in PcCS could potentially form a hydrogen bond with N9 (Figure 6B, Figure 6—figure supplements 3 and 4). Because the residues in these positions of IpCS3 and CcDXMT differ yet contribute to TB binding, these independent replacements represent convergent structural solutions for N1 methylation of the substrate.
In order to experimentally test for the functional importance of the active site residues identified in the crystal structure of IpCS3 for the evolution of TB methylation preference, we performed site-directed mutagenesis. We chose to mutate five amino acid positions that appear to be important for governing xanthine alkaloid methylation in IpCS3 and other CS-type enzymes (Jin et al., 2016; O’Donnell et al., 2021; Wang et al., 2023; Yoneyama et al., 2006); these included A22G, R23C, T25S, H221N, and Y265C (Figure 6—figure supplements 3 and 4). When we mutated all five amino acid residues simultaneously in AncIpCS2, we found that activity with TB increased dramatically relative to 3X and all other xanthine alkaloid substrates (Figure 5A). Thus, these five sites appear to be crucial for the evolution of TB methylation in the history of the YM lineage and further indicate that convergence of caffeine biosynthesis in different species is a result of amino acid replacements at these sites. The homologous sites to H221N and Y265C in Theacrine synthase from Camellia assamica were also shown by mutagenesis to be important for the evolution of trimethyluric acid methylation (Zhang et al., 2020a) thereby providing further support for the functional significance of these positions for xanthine alkaloid binding.
Computational modelling of IpCS1 and IpCS2 active sites predict convergent substrate-binding residues for xanthine and 3-methylxanthine methylation
Previous studies used site-directed mutagenesis of two sequence regions in CS-type caffeine biosynthetic enzymes from Theobroma (TcCS1/2) and Paullinia (PcCS1/2) to uncover the mutational basis for the convergent evolution of substrate preference switches towards their preferred substrates, X and 3X (O’Donnell et al., 2021). In order to determine whether the same regions were convergently mutated in IpCS1 and IpCS2, and provide binding interactions with X and 3X, respectively, AlphaFold2 (Mirdita et al., 2022) models and subsequent docking studies were performed (Figure 7, Figure 7—figure supplement 1). Modelling of substrate binding in the predicted active sites of IpCS1 and IpCS2 (Figure 7A and B) shows that the preferred substrates have optimal binding orientations that would result in methylation to form the products that were experimentally detected in our assays shown in Figure 4. From our docking simulations, IpCS1 residues W156, N221, and Y265 are positioned for hydrogen bonding with the carbonyl moieties of xanthine to position N3 for methyl transfer from SAM (Figure 7A). Although Theobroma and Paullinia CS1 enzymes, as well as Citrus XMT1, specialized for xanthine methylation also possess W156 and Y265 in the homologous positions (Figure 6—figure supplements 3 and 4), these residues are highly conserved among nearly all angiosperm SABATH enzymes. On the other hand, the homologous position to N221 which is important for IpCS1 did not change concomitantly with the evolution of X preference in either Theobroma or Paullinia (Figure 6—figure supplements 3 and 4); instead, when Theobroma and Paullinia ‘region III’ was mutated, activity with X improved (O’Donnell et al., 2021). Because IpCS1 was not mutated in the homologous region III, there appear to be convergent solutions allowing for efficient positioning of X for 3X biosynthesis among these enzymes. In the case of IpCS2, two hydrogen bond donors, S24 and T25, appear to contribute to the positioning of 3X in the active site (Figure 7B). This homologous region was experimentally mutated in Theobroma and Paullinia CS2 enzymes and improved specialization for 3X methylation in both, although the actual substitutions are different in each case (O’Donnell et al., 2021). Thus, this may represent an additional instance where convergent mutations of the same region lead to specialization for 3X methylation. If crystal structures could be generated for all of these caffeine-producing enzymes in the future, even more detailed insights about active site architecture could be gleaned and would further enhance our understanding of these convergent activities.
Comparative phylogenomic analyses of caffeine biosynthetic genes reveal historical constraints to convergent gene co-option
Many nearly ubiquitous specialized metabolites involved in defence, development and floral scent are produced by SABATH enzyme family members that appear to be conserved across diverse angiosperm lineages, such as SAMT that methylates salicylic acid (Dubs et al., 2022) and IAMT that methylates indole-3-acetic acid (Zhao et al., 2008). However, caffeine is sporadically distributed among disparate angiosperm lineages and seems to have only recently evolved by convergence in a few distantly related orders (Huang et al., 2016). Our comparative evolutionary genomic analysis of the CS and XMT syntenic regions across angiosperm (Figure 8) indicates that predicting which SABATH locus a given lineage might co-opt for caffeine biosynthesis is more dependent upon the idiosyncratic history of gene loss than phylogenetic relatedness. For example, in the case of the CS syntenic region used for caffeine biosynthesis in YM and Theobroma, Coffea lacks a CS orthologue and none can be detected from its genome (Figure 8—figure supplement 1A). Thus, only XMT was historically available for recruitment in Coffea. Conversely, YM appears to have lost any vestiges of XMT orthologues known to be responsible for caffeine biosynthesis in Coffea and Citrus (Figure 8—figure supplement 1B–D). This lack of genomic potential may be seen as an evolutionary constraint to gene recruitment for caffeine biosynthesis in Coffea and YM.
A broader phylogenetic perspective on the XMT and CS syntenic regions provides further insight into genomic canalization and allows for predictions about the underlying genetic basis for caffeine biosynthesis in as-of-yet characterized lineages. As shown in Figure 8, several angiosperm lineages have neither XMT nor CS and this may explain why caffeine has apparently never evolved in the large and diverse orders Brassicales, Asterales, Solanales and Lamiales even though it has been shown to be advantageous in transgenic plants (Kim et al., 2011; Kim et al., 2006). In the case of Cola, a caffeine-producing genus from Africa (Niemenak et al., 2008), it is predicted to have co-opted CS genes for xanthine alkaloid methylation because the order Malvales to which it belongs appears to have lost XMT orthologues prior to its origin (Figure 8). Tests of this hypothesis await genomic sequences and functional studies of Cola enzymes. However, even with a functional XMT or CS enzyme, gene duplication and protein functional diversification appears to be required to assemble a complete pathway to caffeine as shown here for YM. Nonetheless, because molecular clock analyses indicate that the caffeine-producing Coffea, Camellia, Citrus, Paullinia, and Ilex lineages each originated within only the last 10–20 million years (Buerki et al., 2011; Hamon et al., 2017; Pfeil and Crisp, 2008; Yao et al., 2021; Zan et al., 2023), it suggests that the evolution of novel specialized metabolic pathways like that of caffeine can be relatively rapid.
Materials and methods
Plant materials
Request a detailed protocolI. paraguariensis A. St.-Hil. var. paraguariensis, cv CA 8/74 (INTA-EEA Cerro Azul, Misiones, Argentina) and cv SI-49 (Establecimiento Las Marías S.A.C.I.F.A., Corrientes, Argentina) were used in this study. High productivity, increased tolerance to drought, and ease of propagation with stem cuttings characterize these genotypes (Acevedo et al., 2019; Avico et al., 2023; Tarragó et al., 2012).
DNA extraction and sequencing
Request a detailed protocolTwo DNA extraction and sequencing approaches were combined to improve the accuracy of genome assembly. First, young leaves of cv CA 8/74, preserved in silica-gel, were used to isolate total genomic DNA with the DNeasy Plant Mini Kit (QIAGEN), following the manufacturer’s instructions. Paired-end libraries (with insert sizes of 350 and 550 bp) and mate-pair libraries (with insert sizes of 3, 8, and 12 kbp) were constructed using the Illumina TruSeq DNA Sample Preparation Kit and Illumina Nextera Mate Pair Library Preparation Kit following the kit’s instructions, respectively. The obtained libraries were sequenced on an Illumina HiSeq 1500 platform, generating ~263.2 Gb of raw data. Second, young leaves of cv SI-49, preserved in silica-gel, were used to purify high molecular weight DNA with the Quick-DNA HMW MagBead Kit (Zymo Research), according to the manufacturer’s instructions. Long reads libraries were prepared using Sequel Binding Kit 1.0 (Pacific Biosciences), following the manufacturer’s instructions. The obtained libraries were subsequently sequenced on PacBio Sequel I (Pacific Biosciences) using Sequel Sequencing Kit 1.0 (Pacific Biosciences) and SMRT Cell 1M (Pacific Biosciences), generating ~77.5 Gb of additional raw data.
Genome assembly and quality assessment
Request a detailed protocolWe opted for a pipeline that could take advantage of both short and long sequencing technologies. For the short reads, we applied Trimmomatic v.0.39 (Bolger et al., 2014) to remove adaptor contaminations and filter low-quality reads (reads with mean quality scores ≤25, reads where the quality of the bases at the head or tail was ≤10 and reads with a length ≤30 bp). The resulting clean reads were then corrected using Quake v.0.3 (Kelley et al., 2010). Contig assembly and scaffolding was done using the assembler SOAPdenovo v.2 (Luo et al., 2012) (55-mer size), with the mate-pair reads being used to link contigs into scaffolds. After the assembly, DeconSeq v.0.4.3 (Schmieder and Edwards, 2011) was used to detect and remove sequence contaminants. Contigs and scaffolds clearly belonging to the chloroplast and mitochondria genomes were also discarded. YM transcriptome sequences (Acevedo et al., 2019; Debat et al., 2014; Fay et al., 2018) and public databases KOG (Tatusov et al., 2003) and DEG (Luo et al., 2014) were used to validate the genome assembly. Canu v.2.2 (Koren et al., 2017) was used to perform the self-correction and assembly of the long reads, using the default parameters and stopOnLowCoverage = 20. For both short and long assemblies, we separated the assembly haplotypes (haplotigs) using PurgeHaplotigs (Roach et al., 2018) with the recommended parameter values. Then, we merged both SOAPdenovo v.2 and Canu v.2.2 curated assemblies using Quickmerge v.03 (Chakraborty et al., 2016), where only contigs with minimum overlap of 5000 bp (-ml 5000) were merged and only the contigs greater than 1000 bp (-l 1000) were retained. The resulting scaffolds and contigs were refined further with the gap-filling module in SOAPdenovo v.2 (GapCloser) and SSPACE v.2.1.1 (Boetzer et al., 2011).
Gene prediction and annotation
Request a detailed protocolFirst, we masked the genome assembly with RepeatMasker (http://repeatmasker.org/). Then, we predicted the protein- and non-coding genes using Funannotate v.1.8.13 (Palmer and Stajich, 2019) previously training it with the available I. paraguariensis RNA-Seq experiments (NCBI projects PRJNA315513, PRJNA375923, and PRJNA251985). Then, Infernal v.1.1.4 (Nawrocki and Eddy, 2013) was employed to improve the prediction of small RNAs and microRNAs, while tRNAScan-SE v.2.0 (Chan and Lowe, 2019) was used to improve the prediction of transfer RNAs. Ribosomal RNAs were predicted using RNAmmer v.1.2 (Lagesen et al., 2007). The TAPIR web server (Bonnet et al., 2010) (http://bioinformatics.psb.ugent.be/webtools/tapir) and the TargetFinder software v.1.7 (Fahlgren and Carrington, 2010) were used to identify miRNA targets. InterProScan v.5.55-88.0 (Jones et al., 2014) and eggNOG-mapper v.2.1.7 (Huerta-Cepas et al., 2019) were employed for the functional assignment of the predicted genes.
Repeat content estimation
Request a detailed protocolThe repeat content was estimated employing Dfam TE Tools v.1.5 (https://github.com/Dfam-consortium/TETools copy archived at Rosen and Gray, 2024). First, we used RepeatModeler v.2.0.3 (Flynn et al., 2020) to build a database with Ilex repeat families. Then, we merged that database with Dfam v3.6 (Hubley et al., 2016) and GIRI Repbase ver 20181026 (Jurka et al., 2005). Finally, we ran RepeatMasker on the assembly using the merged database to look for repeat sequences.
Genome duplication analysis
Request a detailed protocolRates of synonymous substitution (Ks) between paralogous genes and orthologous genes in Lactuca sativa, Daucus carota, I. paraguariensis, C. canephora, and Vitis vinifera were determined using CoGe’s tool SynMap (https://genomevolution.org/). Gaussian mixture models were fitted to the resulting Ks distributions with the mclust R package v.5.0 (Scrucca et al., 2016), and significant peaks were identified using the SiZer R package v.0.1-7 (Chaudhuri and Marron, 2000). To estimate the age of the lineage-specific polyploidization event (Ip-α) in Ilex, we considered two different phylogenies (a multiple nuclear genome phylogeny and a plastid genome phylogeny). With the median Ks value of YM-grape orthologues (~0.89) and the divergence date of the two species (125.64 Ma for the multiple nuclear genome phylogeny and 123.7 Ma for the plastid genome phylogeny), we calculated the number of substitution per synonymous site per year (r) for YM (divergence date = Ks/(2 × r)). Conforming to the multiple nuclear genome phylogeny, the YM r value is 3.54E−9; while for the plastid genome phylogeny, the YM r value is 3.59E−9. These r values and the SiZer Ks range of YM paralogues (~0.35–0.5) were then applied to estimate the age of Ip-α. Finally, to determine the syntenic depth ratio between I. paraguariensis and C. canephora and V. vinifera, we employed CoGe’s tool SynFind (https://genomevolution.org/), using a distance cutoff of 10 genes and requiring at least 5 gene pairs per synteny block.
Gene expression quantitation
Request a detailed protocolFirst, YM transcriptome reads (PRJNA315513) were mapped to IpCS1–5 transcripts, obtained from the de novo transcriptome assembly and annotation, using BWA (Li and Durbin, 2009). Then, with the number of mapped reads, the abundance of each transcript was calculated, normalized by transcript length and transcriptome size (quantification in RPKM, reads per kilobase per million mapped reads).
Cloning, mutagenesis, heterologous expression, and purification of enzymes
Request a detailed protocolTwo different approaches were used to clone IpCS genes: RT-PCR from leaf tissue and custom gene synthesis. For RT-PCR of IpCS2, cDNA was obtained from 1 µg of RNA from fresh YM leaves using standard procedures and cycling conditions with the following two primers: IpCS2F 5′-ATGGACGTGAAGGAAGCAC-3′ and IpCS2R 5′-CTATCCCATGGTCCTGCTAAG-3′. Following amplification, cDNA was cloned using the pTrcHis TOPO TA Expression Kit (Invitrogen, Carlsbad, CA). Ligation of cDNA into the pTrcHis vector and subsequent transformation into Top10 E. coli cells was carried out according to the manufacturer’s protocol. The transformation mixture was incubated overnight at 37°C on LB plates containing 50 µg/ml ampicillin. Colonies were screened by PCR to obtain full-length inserts that were subsequently verified for insert orientation by DNA sequencing. For IpCS1 and 3 and AncIpCS1 and 2, gene sequences were synthesized by GenScript with codons optimized for expression in E. coli. Synthesized genes were subcloned from the pUC57 cloning vector into the pET-15b (Novagen) expression vector using 1.5 μg of DNA and NdeI and BamHI in 30 μl reactions. Linear fragments corresponding to the expected sizes were gel purified using the QIAEX II Gel Extraction Kit (QIAGEN Corp) according to the manufacturer’s instructions. Purified DNA fragments were ligated into pET-15b using T4 DNA ligase from New England Biolabs. Ligation products were transformed into Top10 E. coli cells using 2 μl of the ligation reaction. Site-directed mutagenesis of AncIpCS2 was carried out using the Agilent QuikChange Lightning Kit (Agilent Technologies Inc, Santa Clara, CA) following the manufacturer’s protocol. Minipreps of positive transformants were obtained using a QIAprep Spin Miniprep Kit (QIAGEN Corp) and 10 ng of each plasmid was used to transform BL21 E. coli cells using standard plating and incubation methods.
Induction of His6-protein was achieved in 50 ml cell cultures of BL21 (DE3) with IpCS1 and 3 and AncIpCS1 and 2 in pET-15b or Top10 with IpCS2 in pTrcHis with the addition of 1 mM IPTG at 23°C for 6 hr. Purification of the His6-tagged protein was achieved by TALON spin columns (Takara Bio) following the manufacturer’s instructions. Bradford assays were used to determine purified protein concentration, and recombinant protein purity was evaluated on sodium dodecyl sulphate–polyacrylamide gel electrophoresis gels.
Enzyme assays
Request a detailed protocolAll enzymes were tested for activity with the eight xanthine alkaloid substrates shown in Figure 1. Radiochemical assays were performed at 24°C for 60 min in 50 µl reactions that included 50 mM Tris–HCl buffer, 0.01 µCi (0.5 µl) 14C-labelled SAM, 10–20 µl purified protein, and 1 mM methyl acceptor substrate dissolved in 0.5 M NaOH. Negative controls were composed of the same reagents, except that the methyl acceptor substrate was omitted and 1 µl of 0.5 M NaOH was added instead. Methylated products were extracted in 200 µl ethyl acetate and quantified using a liquid scintillation counter. The highest enzyme activity reached with a specific substrate was set to 1.0 and relative activities with remaining substrates were calculated. Each assay was run at least three times so that mean, plus standard deviation, could be calculated.
High-performance liquid chromatography
Request a detailed protocolProduct identity of enzyme assays was determined using high-performance liquid chromatography (HPLC) on 500 μl scaled-up reactions utilizing all the same reagents as described above except that non-radioactive SAM was used as the methyl donor and reactions were allowed to progress for 4 hr. Whole reactions were filtered through Vivaspin columns (Sartorius) to remove all protein prior to injection in the HPLC. Mixtures were separated by HPLC using a two-solvent system with a 250 mm × 4.6 mm Kinetex 5 μM EVO C18 column (Phenomenex). Solvent A was 99.9% water with 0.1% trifluoroacetic acid and Solvent B was 80% acetonitrile, 19.9% water and 0.1% TFA and a 0–20% gradient was generated over 16 min with a flow rate of 1.0 ml/min. Product identity was determined by comparing retention times and absorbance at 254 and 272 nm of authentic standards. Reactions were compared to negative controls in which no methyl acceptor substrates were added.
Phylogenetic analyses
Request a detailed protocolIn order to accurately determine the orthology of YM SABATH sequences encoded in the genome, we compared them to all previously functionally characterized gene family members in other species. We also included CS and XMT orthologues from the orders of caffeine-producing species (Malvales, Ericales, Gentianales, Sapindales) available in public databases (GenBank, OneKP) as shown in Figure 3. Accession numbers for all sequences are provided in . Alignment of amino acid sequences was achieved using MAFFT v.7.0 (Katoh and Standley, 2013) and employing the auto search strategy to maximize accuracy and speed. A phylogenetic estimate was obtained using FastTree v.2 (Price et al., 2010) assuming the Jones–Taylor–Thorton model of amino acid substitution with a CAT approximation using 20 rate categories. Reliability of individual nodes was estimated from local support values using the Shimodaira–Hasegawa test as implemented in FastTree.
Ancestral sequence resurrection
Request a detailed protocolIn order to obtain accurate ancestral CS protein estimates, we assembled two datasets to assess variation in terms of sampling. The first dataset included 154 sequences including all CS-type enzymes we could retrieve from GenBank and China National Gene Bank as well as representatives of all other functionally characterized clades of SABATH enzymes (Figure 5—figure supplement 1). In this dataset, the only Ilex sequences available were IpCS1–3. This dataset resulted in highly confident estimates for AncIpCS1 and AncIpCS2 (average site-specific posterior probabilities >0.99 in both cases). Subsequently, once additional Ilex genomes became available, we estimated a second set of ancestral sequences using 29 CS-type enzymes from asterids to assess uncertainty in our initial estimates (Figure 5—figure supplement 2). In this subsequent analysis, highly confident estimates for AncIpCS1 and AncIpCS2 were obtained with average site-specific posterior probabilities >0.99 in both cases (see Figure 5). MAFFT v.7.0 (Katoh and Standley, 2013) was used to align the protein sequences in both datasets using the auto search strategy to maximize accuracy and speed; subsequently, IQTree (Trifinopoulos et al., 2016) was used to obtain trees describing the relationships amongst the aligned sequences for both datasets. For the first set of ancestral sequence estimates, the Jones, Taylor, and Thorton matrix model for amino acid substitution and the Free rate model for among-site rate heterogeneity (Yang, 1995) was determined to be the best fit. For the second dataset, the Q matrix as estimated for plants (Ran et al., 2018) with a gamma model for rate heterogeneity was the preferred model. IQTree estimates ancestral sequences using the empirical Bayesian approach (Trifinopoulos et al., 2016). In order to determine ancestral protein lengths in regions with alignment gaps, we coded each gap for the number of amino acids possessed and used parsimony to determine ancestral residue numbers as in our previous studies (Huang et al., 2016). The estimated sequences were synthesized by Genscript Corp and had codons chosen for optimal protein expression in Escherichia coli and were cloned into pET15b for expression and purification using the His6 tag. Details of expression were the same as described above for the modern-day enzymes. Although the two separate ancestral sequence estimates are highly similar to one another (>95% identity in both cases), the two AncIpCS1 proteins differ at 10 positions and those for AncIpCS2 differ at seven positions (Figure 5—figure supplements 3 and 4).
Crystallization, data collection, phasing, and refinement of IpCS3
Request a detailed protocolInitial crystallization screening was performed using the IpCS3 methyltransferase at a concentration of 30 mg/ml incubated with 2 mM TB and 2 mM SAM. Sitting-drop for crystallization screening was set up by equal volume of precipitant and protein solution (0.25:0.25 µl) using a Crystal Gryphon robot (Art Robbins Instruments) and a reservoir volume of 45 µl. Trays were incubated at 9°C. Initial hits were further optimized using the hanging-drop method at 9°C, with 150 µl reservoir solution and 1:1 ratio of precipitant to protein and ligand solution in a 2-µl drop. Attempts to crystallize with SAH or uncleavable SAM analogs and TB to attain a pre methylation structure were unsuccessful given the poor diffraction of these crystals. Therefore, the latter was composed of 33 mg/ml IpCS3 protein concentration, 4 mM TB and 2 mM SAM, and the crystallization condition was optimized to 25% PEG 3350, 0.2 M NH4SO4, 0.1 M Bis-Tris methane pH 5.5. Square crystals grew over 10 days, but initial X-ray crystallography data revealed a poor electron density for SAH and an electron density in the active site for CF, the product, rather than for TB. Consequently, crystals were grown in the aforementioned condition and subsequently soaked for 4 hr at 9°C in the precipitant solution supplemented with 10 mM SAH and 10 mM TB. The idea was to supply an excess of the expended methyl source and additional TB to convert any existing SAM as we did not have access to caffeine as a reagent. Crystals were transiently soaked in the precipitant solution supplemented with 20% ethylene glycol immediately prior to vitrification by direct immersion into liquid nitrogen. Diffraction data were collected at the Advanced Photon Source (APS) at Argonne National Laboratory Sector-21 via the Life Sciences-Collaborative Access Team (LS-CAT) at beamline 21-ID-G. Diffraction data were indexed, integrated, and scaled using the autoPROC software package (Vonrhein et al., 2011). The structure was solved by molecular replacement using Phaser-MR included in the Phenix software package (Adams et al., 2010), using PDB ID 6LYH structure as the replacement model. The model was subject to rounds of manual building followed by refinement using REFMAC5 (Murshudov et al., 2011), and was manually built in COOT v.0.9.8.3 (Emsley et al., 2010). Crystallographic statistics are listed in Table 5.
Structure prediction and molecular docking
Request a detailed protocolProtein structures of IpCS enzymes were predicted using the ColabFold implementation of AlphaFold2 (Mirdita et al., 2022) with no template. Diagnostic plots depicting the MSA coverage, alignment error and LDDT are shown in the supplementary information (Figure 7—figure supplement 1). Structures of xanthine alkaloid ligands (X, 3X, and TB) were downloaded from the ChEMBL database Mendez et al., 2019; protonation states were checked by Chemicalize (Swain, 2012) and optimized using the VMD Molefacture plugin (Humphrey et al., 1996). The receptor structures were prepared following the standard AutoDock protocol (Morris et al., 2009) using the prepare_receptor4.py script from AutoDock Tools. All non-polar hydrogens were merged, and Gasteiger charges and atom types were added. The ligand PDBQT was prepared using the prepare_ligand4.py script. The grid size and position were chosen to contain the whole ligand-binding site (including all protein atoms closer than 5 Å from all ligands). For each system, 10 different docking runs were performed. Docking was performed using AutoDock Vina v.1.2.0 (Eberhardt et al., 2021). The docking results were further analysed by visual inspection. Images of the molecules were prepared using the PyMOL molecular graphics system (Schrodinger, 2015).
Synteny comparisons and phylogenetic distribution of CS and XMT
Request a detailed protocolThe presence or absence of CS and XMT genes was determined for orders of plants for which at least one genomic sequence exists, as shown in Figure 8. For those species which do not yet have an available assembly, we used BLAST (Altschul et al., 1990) analyses of GenBank (nr and TSA databases), Phytozome (Goodstein et al., 2012) as well as the OneKP dataset (One Thousand Plant Transcriptomes Initiative, 2019) in China National GeneBank. BLAST combined with subsequent phylogenetic analyses were also used to verify presence/absence of CS- or XMT-type sequences in cases where the syntenic regions did not appear to encode one or the other gene. Comparisons of the CS and XMT syntenic regions were performed using CoGe’s tool GEvo (https://genomevolution.org/).
Appendix 1
Non-coding RNAs in yerba mate
1.1 Transfer RNAs
Analysis of transfer RNA (tRNA) family members revealed the presence of 815 tRNA genes including 726 standard tRNAs, 76 pseudo tRNAs, 11 tRNAs with undetermined isotypes, and 2 possible suppressor tRNAs (Appendix 1—table 1). No selenocysteine tRNA was found in the yerba mate genome, which is consistent with the results of other tRNA analyses carried out in higher plants (Santesmasses et al., 2017). According to evolutionary analyses, selenoproteins and selenocysteine insertion sequence (SECIS) elements present in protozoans and animals evolved early, and were independently lost in higher plants and fungi through evolution (Novoselov et al., 2002). With regard to the nonsense suppressor tRNAs, only ochre and opal nonsense suppressor tRNA genes were found in the yerba mate genome, which suppress the phenotypes of ochre and opal mutations, respectively.
The length of the standard tRNA sequences ranged from 61 to 236 nucleotides, encoding 53 different anti-codons/isoacceptors in total, with tRNASer having the highest abundance of genes and tRNATyr having the lowest (Appendix 1—table 1). In our study, we also found that yerba mate tRNAs contain introns in 13 of the 55 tRNAMet, 9 of the 17 tRNATyr, 1 of the 65 tRNASer, and 1 of the 22 tRNAIle genes, providing additional demonstration that tRNAMet and tRNATyr are not the only intron containing tRNAs in the plant kingdom as it was previously believed (Michaud et al., 2011). The length of these introns varied from 6 to 163 nucleotides and all of them were found at the canonical position, one nucleotide 3′ to the anti-codon loop.
1.2 Ribosomal RNAs
Analysis of ribosomal RNA (rRNA) family members showed the presence of 425 5S rRNA, 23 18S rRNA, and 23 25S rRNA genes. The large variability in the copy number of the rRNA genes has been observed and studied in plants for decades (Rogers and Bendich, 1987). It is believed that a high copy number of these genes is important to ensure increased demand of proteosynthesis during plant development, but also to stabilize the cell nucleus (Garcia et al., 2012). With regard to its genomic organization, the 18S and 25S rRNA genes were clustered together, while the 5S rRNA genes were tandemly located elsewhere in the genome (S-type arrangement). However, some 5S rRNA genes were linked with the 18S and 25S rRNA genes as well (L-type arrangement). Given this observation, we should incorporate the yerba mate genome to the list of eukaryotic genomes with an L-arrangement of ribosomal DNA.
1.3 Small RNAs
Analysis of small RNA family members revealed the presence of 348 small nuclear RNA (snRNA) genes including 65 U1 snRNA, 46 U2 snRNA, 29 U4 snRNA, 33 U5 snRNA, and 146 U6 snRNA genes corresponding to the major spliceosome complex, and 19 U6atac snRNA, 7 U11 snRNA, and 1 U12 snRNA genes corresponding to the minor spliceosome complex. Furthermore, it showed the presence of 2670 small nucleolar RNA (snoRNA) genes, of which 2631 (~98.54%) were box C/D snoRNA genes and 39 (~1.46%) were box H/ACA snoRNA genes. Both groups of snoRNAs are involved in the cleavage of precursor ribosomal RNA (pre-rRNA) and determine site-specific modification in pre-rRNAs and snRNAs, though the box C/D snoRNAs are usually associated with 2′-O-ribose methylation, while the box H/ACA snoRNAs are normally associated with 2′-O-ribose pseudouridylation (Brown et al., 2003; Hari and Parthasarathy, 2019). The greater abundance of the box C/D snoRNAs in the yerba mate genome could be explained, first, by the fact that plants have higher numbers of 2′-O-ribose methylated nucleotides than archaea, yeast, and other higher eukaryotes, and second, by the fact that computer algorithms still find difficult to predict the relatively short conserved sequences of box H/ACA snoRNAs (Brown et al., 2003). It was remarkable the high copy number (2377) of snoRNA R71 in the yerba mate genome, which is a member of the Box C/D family. This snoRNA, which is thought to function as a 2′-O-ribose methylation guide for 18S rRNA, has been identified in multiple copies in most eudicot genomes (Kiss-László et al., 1996).
1.4 Micro RNAs
Analysis of micro RNA (miRNA) family members revealed the presence of 226 miRNA genes belonging to 30 families. The most abundant miRNAs usually involved in growth and development were miR156, miR166, and miR159, while the most abundant miRNAs normally involved in stress responses were miR169_2, miR167_1, miR399, and miR395 (Appendix 1—table 2). Nevertheless, the functions of miRNAs slightly differ among plants. Therefore, to better understand the regulatory effect of miRNAs in yerba mate, we used the TAPIR web server (Bonnet et al., 2010) and the TargetFinder softwareto identify yerba mate miRNA targets (Appendix 1—table 3). The results obtained allowed us to infer the role of 14 of the 30 miRNA families found in the yerba mate genome. Apparently, miR159, miR164, miR169_2, and miR169_5 regulate a variety of processes related to development and stress responses. On the one hand, both miR159 and miR164 regulate the expression of myb-like transcription factors, involved in auxin homeostasis, lateral root and leaf development, leaf senescence, response to abscisic acid, response to the absence of light and response to salt stress. miR164 also regulates the synthesis of vitamin B5 involved in embryo development. On the other hand, both miR169_2 and miR169_5 regulate the expression of a galactinol synthase involved in the response to cold stress, heat stress, oxidative stress, salt stress and water deprivation; a kinesin-like protein involved in pollen development and a mitogen-activated protein kinase involved in directing cellular responses to mitogens, osmotic stress, heat shock, and proinflammatory cytokines. miR167_1 is probably involved only in plant growth and development as it regulates the expression of an auxin response factor (arf). And last, miR171_1 and miR390 are likely involved only in stress responses. miR171_1 regulates the expression of an RNA-binding family protein and an endoglucanase involved in host defence, whereas miR390 regulates the expression of a rotamase FKBP 1 involved in the response to heat stress, osmotic stress, and wounding. It is important to mention that the functions of all the predicted targets were gathered from the Arabidopsis Information Resource (TAIR) (Rhee et al., 2003), and therefore the functional involvement of these miRNAs in yerba mate must be experimentally validated.
Data availability
The Illumina and PacBio raw sequence data, assembly and annotation were deposited in the European Nucleotide Archive (ENA) under BioProject No. PRJEB65927. An assembly obtained only with the Illumina data was also deposited in ENA under BioProject No. PRJEB36685. The plasmids used to produce proteins are freely available upon request. The atomic coordinates and structure factors have been deposited in the Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers University, New Brunswick, NJ (http://www.rscb.org) with the accession code 8UZD for the IpCS3 structure bound to caffeine and SAH.
-
ENAID PRJEB65927. Ilex paraguariensis var. paraguariensis genome (Illumina and PacBio).
-
ENAID PRJEB36685. Ilex paraguariensis var. paraguariensis genome (Illumina).
-
RCSB Protein Data BankID 8UZD. The structure of IpCS3, a theobromine methyltransferase from Yerba Mate.
-
ENAID PRJNA315513. Ilex paraguariensis multiple library de novo transcriptome assembly.
-
NCBI Sequence Read ArchiveID SRP043293. Yerba mate (Ilex paraguariensis St. Hil.) NGS DN Transcriptome assembly.
-
NCBI Sequence Read ArchiveID SRP110129. RNA-Seq of Ilex paraguariensis: roots and mature leaves.
References
-
PHENIX: a comprehensive python-based system for macromolecular structure solutionActa Crystallographica. Section D, Biological Crystallography 66:213–221.https://doi.org/10.1107/S0907444909052925
-
Basic local alignment search toolJournal of Molecular Biology 215:403–410.https://doi.org/10.1016/S0022-2836(05)80360-2
-
Metabolism and ecology of purine alkaloidsFrontiers in Bioscience 11:2354–2370.https://doi.org/10.2741/1975
-
Biosynthesis of caffeine in leaves of coffeePlant Physiology 111:747–753.https://doi.org/10.1104/pp.111.3.747
-
Scaffolding pre-assembled contigs using SSPACEBioinformatics 27:578–579.https://doi.org/10.1093/bioinformatics/btq683
-
Trimmomatic: a flexible trimmer for Illumina sequence dataBioinformatics 30:2114–2120.https://doi.org/10.1093/bioinformatics/btu170
-
tRNAscan-SE: Searching for tRNA Genes in Genomic SequencesMethods in Molecular Biology 1962:1–14.https://doi.org/10.1007/978-1-4939-9173-0_1
-
Scale space view of curve estimationThe Annals of Statistics 28:408–428.https://doi.org/10.1214/aos/1016218224
-
New insights on coffea mirnas: features and evolutionary conservationApplied Biochemistry and Biotechnology 177:879–908.https://doi.org/10.1007/s12010-015-1785-x
-
miRNA regulation in the early development of barley seedBMC Plant Biology 12:120.https://doi.org/10.1186/1471-2229-12-120
-
Subclasses of adenosine receptors in the central nervous system: interaction with caffeine and related methylxanthinesCellular and Molecular Neurobiology 3:69–80.https://doi.org/10.1007/BF00734999
-
Mechanistic approaches to the study of evolution: the functional synthesisNature Reviews. Genetics 8:675–688.https://doi.org/10.1038/nrg2160
-
Identification and functional analysis of new conserved microRNAs and their targets in potato (Solanum tuberosum L.)TURKISH JOURNAL OF BOTANY 38:1199–1213.https://doi.org/10.3906/bot-1405-105
-
AutoDock vina 1.2.0: new docking methods, expanded force field, and python bindingsJournal of Chemical Information and Modeling 61:3891–3898.https://doi.org/10.1021/acs.jcim.1c00203
-
Features and development of CootActa Crystallographica. Section D, Biological Crystallography 66:486–501.https://doi.org/10.1107/S0907444910007493
-
miRNA target prediction in plantsMethods in Molecular Biology 592:51–57.https://doi.org/10.1007/978-1-60327-005-2_4
-
Phytozome: a comparative platform for green plant genomicsNucleic Acids Research 40:D1178–D86.https://doi.org/10.1093/nar/gkr944
-
Quantitative and qualitative genomic characterization of cultivated Ilex L. speciesPlant Genetic Resources 13:142–152.https://doi.org/10.1017/S1479262114000756
-
Antioxidant effects of Ilex paraguariensis: induction of decreased oxidability of human LDL in vivoBiochemical and Biophysical Research Communications 224:338–344.https://doi.org/10.1006/bbrc.1996.1030
-
Genotyping-by-sequencing provides the first well-resolved phylogeny for coffee (Coffea) and insights into the evolution of caffeine content in its species: GBS coffee phylogeny and the evolution of caffeine contentMolecular Phylogenetics and Evolution 109:351–361.https://doi.org/10.1016/j.ympev.2017.02.009
-
Prediction of coding and non-coding RNAEncyclopedia of Bioinformatics and Computational Biology 01:e0099.https://doi.org/10.1016/B978-0-12-809633-8.20099-X
-
The Dfam database of repetitive DNA familiesNucleic Acids Research 44:D81–D9.https://doi.org/10.1093/nar/gkv1272
-
VMD: visual molecular dynamicsJournal of Molecular Graphics 14:33–38.https://doi.org/10.1016/0263-7855(96)00018-5
-
Natural allelic variations of TCS1 play a crucial role in caffeine biosynthesis of tea plant and its related speciesPlant Physiology and Biochemistry 100:18–26.https://doi.org/10.1016/j.plaphy.2015.12.020
-
InterProScan 5: genome-scale protein function classificationBioinformatics 30:1236–1240.https://doi.org/10.1093/bioinformatics/btu031
-
Repbase Update, a database of eukaryotic repetitive elementsCytogenetic and Genome Research 110:462–467.https://doi.org/10.1159/000084979
-
Identification of apple miRNAs and their potential role in fire blight resistanceTree Genetics & Genomes 11:e8123.https://doi.org/10.1007/s11295-014-0812-3
-
Identification of microRNA targets in tomato fruit development using high-throughput sequencing and degradome analysisJournal of Experimental Botany 64:1863–1878.https://doi.org/10.1093/jxb/ert049
-
Identification of miRNAs in sorghum by using bioinformatics approachPlant Signaling & Behavior 7:246–259.https://doi.org/10.4161/psb.18914
-
MAFFT multiple sequence alignment software version 7: improvements in performance and usabilityMolecular Biology and Evolution 30:772–780.https://doi.org/10.1093/molbev/mst010
-
Assessment of miRNA expression profile and differential expression pattern of target genes in cold-tolerant and cold-sensitive tomato cultivarsBiotechnology & Biotechnological Equipment 29:851–860.https://doi.org/10.1080/13102818.2015.1061447
-
RNAmmer: consistent and rapid annotation of ribosomal RNA genesNucleic Acids Research 35:3100–3108.https://doi.org/10.1093/nar/gkm160
-
Impact of whole-genome duplication events on diversification rates in angiospermsAmerican Journal of Botany 105:348–363.https://doi.org/10.1002/ajb2.1060
-
Aromatic clusters in protein-protein and protein-drug complexesJournal of Cheminformatics 12:30.https://doi.org/10.1186/s13321-020-00437-4
-
Stress-responsive microRNAs in PopulusThe Plant Journal 55:131–151.https://doi.org/10.1111/j.1365-313X.2008.03497.x
-
ChEMBL: towards direct deposition of bioassay dataNucleic Acids Research 47:D930–D940.https://doi.org/10.1093/nar/gky1075
-
A global picture of tRNA genes in plant genomesThe Plant Journal 66:80–93.https://doi.org/10.1111/j.1365-313X.2011.04490.x
-
ColabFold: making protein folding accessible to allNature Methods 19:679–682.https://doi.org/10.1038/s41592-022-01488-1
-
AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibilityJournal of Computational Chemistry 30:2785–2791.https://doi.org/10.1002/jcc.21256
-
REFMAC5 for the refinement of macromolecular crystal structuresActa Crystallographica. Section D, Biological Crystallography 67:355–367.https://doi.org/10.1107/S0907444911001314
-
Drought response in rice: the miRNA storyInternational Journal of Molecular Sciences 20:3766.https://doi.org/10.3390/ijms20153766
-
Infernal 1.1: 100-fold faster rna homology searchesBioinformatics 29:2933–2935.https://doi.org/10.1093/bioinformatics/btt509
-
LC-ms metabolomics and chemotaxonomy of caffeine-containing holly ( ilex) species and related taxa in the aquifoliaceaeJournal of Agricultural and Food Chemistry 67:5687–5699.https://doi.org/10.1021/acs.jafc.8b07168
-
Purine alkaloids and phenolic compounds in three Cola species and Garcinia kola grown in CameroonSouth African Journal of Botany 74:629–638.https://doi.org/10.1016/j.sajb.2008.03.003
-
Metabolite-enzyme coevolution: from single enzymes to metabolic pathways and networksAnnual Review of Biochemistry 87:187–216.https://doi.org/10.1146/annurev-biochem-062917-012023
-
The age and biogeography of Citrus and the orange subfamily (Rutaceae: Aurantioideae) in Australasia and New CaledoniaAmerican Journal of Botany 95:1621–1631.https://doi.org/10.3732/ajb.0800214
-
Convergent evolution in plant specialized metabolismAnnual Review of Plant Biology 62:549–566.https://doi.org/10.1146/annurev-arplant-042110-103814
-
Ribosomal RNA genes in plants: variability in copy number and in the intergenic spacerPlant Molecular Biology 9:509–520.https://doi.org/10.1007/BF00015882
-
Convergent evolution in the genomics era: new insights and directionsPhilosophical Transactions of the Royal Society B 374:20190102.https://doi.org/10.1098/rstb.2019.0102
-
Bioavailability and antioxidant effect of the Ilex Paraguariensis polyphenolsNutrition & Food Science 45:326–335.https://doi.org/10.1108/NFS-08-2014-0079
-
Whole genome duplication in plants: implications for evolutionary analysisMethods in Molecular Biology 1704:291–315.https://doi.org/10.1007/978-1-4939-7463-4_10
-
Computational identification of the selenocysteine tRNA (tRNASec) in genomesPLOS Computational Biology 13:e1005383.https://doi.org/10.1371/journal.pcbi.1005383
-
SoftwareThe pymol molecular graphics system, version Version 18Science and Education.
-
mclust 5: clustering, classification and density estimation using gaussian finite mixture modelsThe R Journal 8:289–317.
-
Identification and characterization of novel maize mirnas involved in different genetic backgroundInternational Journal of Biological Sciences 11:781–793.https://doi.org/10.7150/ijbs.11619
-
Comparative analysis of GC content variations in plant genomesTropical Plant Biology 9:136–149.https://doi.org/10.1007/s12042-016-9165-4
-
Role of microRNAs during flower and storage root development in sweet potatoPlant Molecular Biology Reporter 33:1731–1739.https://doi.org/10.1007/s11105-015-0869-7
-
Caffeine biosynthesis in Camellia sinensisPhytochemistry 15:1235–1239.https://doi.org/10.1016/0031-9422(76)85084-4
-
chemicalize.orgJournal of Chemical Information and Modeling 52:613–615.https://doi.org/10.1021/ci300046g
-
Influence of the irradiance on phenols content and rooting of Ilex paraguariensis cuttings collected from adult plantsActa Physiologiae Plantarum 34:2419–2424.https://doi.org/10.1007/s11738-012-1009-8
-
Resurrecting ancient genes: experimental analysis of extinct moleculesNature Reviews. Genetics 5:366–375.https://doi.org/10.1038/nrg1324
-
Convergent and divergent evolution in carnivorous pitcher plant trapsThe New Phytologist 217:1035–1041.https://doi.org/10.1111/nph.14879
-
W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysisNucleic Acids Research 44:W232–W235.https://doi.org/10.1093/nar/gkw256
-
Data processing and analysis with the autoPROC toolboxActa Crystallographica. Section D, Biological Crystallography 67:293–302.https://doi.org/10.1107/S0907444911007773
-
Expression of arabidopsis MIRNA genesPlant Physiology 138:2145–2154.https://doi.org/10.1104/pp.105.062943
-
Phylogeny and biogeography of the hollies ( Ilex L., Aquifoliaceae)Journal of Systematics and Evolution 59:73–82.https://doi.org/10.1111/jse.12567
-
Metabolism of purine alkaloids and xanthine in leaves of maté (Ilex paraguariensis)Natural Product Communications 10:707–712.
-
Phylogenomic analyses of Camellia support reticulate evolution among major cladesMolecular Phylogenetics and Evolution 182:107744.https://doi.org/10.1016/j.ympev.2023.107744
-
Small RNA and degradome deep sequencing reveals peanut MicroRNA roles in response to pathogen infectionPlant Molecular Biology Reporter 33:1013–1029.https://doi.org/10.1007/s11105-014-0806-1
-
Identification of miRNAs and their targets in tea (Camellia sinensis)Journal of Zhejiang University. Science. B 14:916–923.https://doi.org/10.1631/jzus.B1300006
Article and author information
Author details
Funding
Consejo Nacional de Investigaciones Científicas y Técnicas
- Adrian G Turjanski
Biotechnology and Biological Sciences Research Council (BB/P027849/1)
- Adrian G Turjanski
National Science Foundation (MCB-1120624)
- Todd J Barkman
European Molecular Biology Laboratory
- Federico A Vignale
Ministerio de Educación de la Nación (PRO.MATE.AR)
- Adrian G Turjanski
Lee Honors College, Western Michigan University
- Madeline N Smith
National Science Foundation (MCB-2325341)
- Todd J Barkman
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by Consejo Nacional de Investigaciones Científicas y Técnicas de Argentina (CONICET); PRO.MATE.AR project, funded by Secretaría de Políticas Universitarias del Ministerio de Educación de la Nación Argentina; CABANA project, funded by UKRI-BBSRC on behalf of the Global Challenges Research Fund (BB/P027849/1), the U.S. National Science Foundation (NSF) (grants MCB-1120624 and MCB-2325341 to T.J.B.), the Lee Honors College at WMU (to M.S. and T.J.B.), and the European Molecular Biology Laboratory (EMBL). We would like to express our gratitude to Centro de Cómputos de Alto Rendimiento (CeCAR) and UBA-FCEN-QB-Cluster for providing access to computational resources, which facilitated the majority of computational analyses in this work. Special thanks are extended to Kevin Blair and the WMU Department of Chemistry for facilitating our HPLC analyses.
Copyright
© 2025, Vignale, Hernandez Garcia et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 1
- views
-
- 3
- downloads
-
- 0
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Biochemistry and Chemical Biology
- Microbiology and Infectious Disease
Paramyxovirus membrane fusion requires an attachment protein for receptor binding and a fusion protein for membrane fusion triggering. Nipah virus (NiV) attachment protein (G) binds to ephrinB2 or -B3 receptors, and fusion protein (F) mediates membrane fusion. NiV-F is a class I fusion protein and is activated by endosomal cleavage. The crystal structure of a soluble GCN4-decorated NiV-F shows a hexamer-of-trimer assembly. Here, we used single-molecule localization microscopy to quantify the NiV-F distribution and organization on cell and virus-like particle membranes at a nanometer precision. We found that NiV-F on biological membranes forms distinctive clusters that are independent of endosomal cleavage or expression levels. The sequestration of NiV-F into dense clusters favors membrane fusion triggering. The nano-distribution and organization of NiV-F are susceptible to mutations at the hexamer-of-trimer interface, and the putative oligomerization motif on the transmembrane domain. We also show that NiV-F nanoclusters are maintained by NiV-F–AP-2 interactions and the clathrin coat assembly. We propose that the organization of NiV-F into nanoclusters facilitates membrane fusion triggering by a mixed population of NiV-F molecules with varied degrees of cleavage and opportunities for interacting with the NiV-G/receptor complex. These observations provide insights into the in situ organization and activation mechanisms of the NiV fusion machinery.
-
- Biochemistry and Chemical Biology
Identifying target proteins for bioactive molecules is essential for understanding their mechanisms, developing improved derivatives, and minimizing off-target effects. Despite advances in target identification (target-ID) technologies, significant challenges remain, impeding drug development. Most target-ID methods use cell lysates, but maintaining an intact cellular context is vital for capturing specific drug–protein interactions, such as those with transient protein complexes and membrane-associated proteins. To address these limitations, we developed POST-IT (Pup-On-target for Small molecule Target Identification Technology), a non-diffusive proximity tagging system for live cells, orthogonal to the eukaryotic system. POST-IT utilizes an engineered fusion of proteasomal accessory factor A and HaloTag to transfer Pup to proximal proteins upon directly binding to the small molecule. After significant optimization to eliminate self-pupylation and polypupylation, minimize depupylation, and optimize chemical linkers, POST-IT successfully identified known targets and discovered a new binder, SEPHS2, for dasatinib, and VPS37C as a new target for hydroxychloroquine, enhancing our understanding these drugs’ mechanisms of action. Furthermore, we demonstrated the application of POST-IT in live zebrafish embryos, highlighting its potential for broad biological research and drug development.