Comprehensive re-analysis of hairpin small RNAs in fungi reveals loci with conserved links

  1. Nathan R Johnson
  2. Luis F Larrondo
  3. José M Álvarez  Is a corresponding author
  4. Elena A Vidal  Is a corresponding author
  1. Millennium Science Initiative - Millennium Institute for Integrative Biology (iBio), Chile
  2. Centro de Genómica y Bioinformática, Facultad de Ciencias, Ingeniería y Tecnología, Universidad Mayor, Chile
  3. Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Chile
  4. Centro de Biotecnología Vegetal, Facultad de Ciencias, Universidad Andrés Bello, Chile
  5. Escuela de Biotecnología, Facultad de Ciencias, Ingeniería y Tecnología, Universidad Mayor, Chile
5 figures and 9 additional files


Figure 1 with 3 supplements
Publications referring to microRNAs (miRNAs) and/or miRNA-like (mi/milRNA) RNAs in fungi were identified, classified, and assessed for reported loci.

(A) Generalized depiction of canonical miRNA synthesis pathway. (B) Breakdown of important metrics in fungal mi/milRNA loci identified in publications (Supplementary file 1). 1. Number of published studies reporting sRNA-seq-based annotations. 2. Number of publications where genomic coordinates are given for each mi/milRNA precursor gene. 3. Number of studies that have two or more library replicates for a condition. 4. Number of studies that make the sRNA-seq data available through a public repository. 5. Number of studies that support the synthesis pathway for the mi/milRNA through knock-out or over-expression lines. 6. Number of studies that give evidence for the function of an mi/milRNA through target-site manipulation or molecular evidence of cleavage (evidence quality designations described in Materials and methods). 7. Number of studies using a given tool/pipeline to discover mi/milRNA loci. (C) Dendrogram constructed from 18S rRNA sequences for the species with reported loci in A1 (MUSCLE alignment, RAxML maximum likelihood tree). Colored tips represent the taxonomic class for each species identified by SILVA. Names are given as a five-letter abbreviation (full names in Supplementary file 2). Number of loci identified is shown by colored bars. (D) Assessment of the quality of loci reporting for published mi/milRNA loci. Loci are classified by whether they have a reported sequence and genomic coordinates for the entire precursor hairpin (coordinates, green), only reported sequence for the precursor hairpin without genomic coordinates (hairpin, olive green), or have insufficient or no information regarding the hairpin locus (fail/lacking, gray). Loci in this final category were subjected to a recovery pipeline to characterize their genomic coordinates, as explained in Materials and methods. (E) An assessment of hairpins relative to their source evidence as explained in C. Valid hairpins are defined as sequences containing the proposed most-abundant sequence, with no secondary structures or more than 20 unpaired bases in the duplex pairing region.

Figure 1—figure supplement 1
Analysis of locus count by citation.

We show the number of loci reported by species in each publication. Abbreviations are provided in Supplementary file 2 and publications are provided in Supplementary file 1. Colors are respective to their taxonomic class shown in Figure 1C.

Figure 1—figure supplement 2
Accuracy for recovery pipeline.

(A) Cumulative density function of the number of map locations for mature miRNAs and miRNA-like (mi/milRNAs) sequences in their corresponding genome assemblies. Sensitivity is displayed for the cutoff of 10 mature mi/milRNA alignments in the genome. Precision is modeled based on a single mature mi/milRNA alignment, randomly chosen from possible alignments. (B) Actual precision and sensitivity values computed for loci with a known source location as positive. Recoveries are done using only mature sequences for loci with known loci.

Figure 1—figure supplement 3
Flow diagram of confirmation pipeline for publications, miRNAs and miRNA-like (mi/milRNAs), sRNA-seq confirmation, and tier classification (produced using

Abbreviations: Most-abundant sequence (MAS), protein-coding gene (PCG), coding sequence (CDS), transposable element (TE).

Figure 2 with 3 supplements
Independent evaluation of miRNAs and miRNA-like (mi/milRNA) annotations using sRNA-seq data.

(A) Number of hairpin loci in which the reported mature mi/milRNA sequence is the most-abundant sequence (MAS) (orange), another sequence in the locus is the MAS (pink), the cited sequence is not present in the locus (light gray), or the reported locus has no aligned reads (dark gray). (B) Abundance (aligned reads per million) and strandedness for mi/milRNA loci. Each dot represents a locus, and the colors represent whether the locus passes the 0.5 reads per million cutoff of expression (pass rpm, yellow), has at least 80% of the aligned reads stranded (pass frac, blue), fulfills both criteria (pass both, green), or does not fulfill any of the criteria (none, gray) (C) Proportion of loci passing the three minimal locus profile rules (fn1, fn2, and fn3, described in table below). Arabidopsis thaliana (Artha) and Drosophila melanogaster (Drmel), low (LC) and high-confidence (HC) miRNAs from miRbase tested in a single library are included as a reference. (D) Number of loci without available source sRNA-seq libraries (light gray), loci failing one or more rules indicated in B and C (red), loci passing all rules (yellow), and loci passing all rules in two or more independent libraries (blue-green). For confirmed loci, we show in magenta those passing all the rules for annotating miRNAs in plants described in the ShortStack (Axtell, 2013b) and Axtell-2018 (Axtell and Meyers, 2018) rule sets. (E) Locus count and proportion of loci as in (D), separated by the parent species and including loci for which a valid hairpin could not be identified (dark gray).

Figure 2—figure supplement 1
Evaluation of reported miRNAs and miRNA-like (mi/milRNA) loci using published rules for plant and animal miRNAs (description of rules is provided in Supplementary file 6).

Libraries from Arabidopsis thaliana (Artha) and Drosophila melanogaster (Drmel) were used to test low- and high-confidence (LC, HC) miRNAs reported in miRbase for each species. Fungi mi/milRNAs are tested using their source libraries (where available) for all valid hairpins.

Figure 2—figure supplement 2
Cumulative density function of raw library depths for source libraries reporting miRNAs and miRNA-like (mi/milRNAs).

We show the proportion of libraries with a given raw read depth. 94.1% of the libraries are 5 million reads or more in depth.

Figure 2—figure supplement 3
Loci count in the context of their confirmation by sRNA-seq.

(A) Proportion and loci count for loci reported from each citation, (B) the annotation tool/approach, and (C) the length of the most-abundant sequence. Colors represent loci which are invalid (dark gray), failed to confirm by sRNA-seq (red), have no available sRNA-seq data (light gray), are confirmed in one library (yellow), or are confirmed in multiple libraries.

Figure 3 with 1 supplement
mi/milRNA loci were evaluated for homology and overlap with other genomic features.

(A) Number of loci matching Rfam structural RNA families, identified by blastn (bitscore ≥ 50). Categories are merged from multiple Rfam entries (Supplementary file 5). (B) Proportions of loci originating from transposable elements (TEs, blue) (Muszewska et al., 2019), near TEs (light blue), near protein-coding genes (PCGs) (yellow), PCG introns (green), PCG untranslated regions (red), and coding sequences of PCGs (orange). Near classifications refer to distances closer than 100 nucleotides while more distant features were called as intergenic (dark gray). Proportion bars are composed of loci which are not found in Rfam, have a homologous sequence in an annotated genome, and excluding those which failed the transcriptional rules laid out in Figure 2. Numbers to the right of bars indicate the total locus count for a division. Diamonds indicate species which have no available TE annotations (yellow) or no TE/gene annotations (blue) available. (C) Genomic distances to the nearest feature (nucleotides) for loci in panel B. Those more distant than 1000 nucleotides are aligned to the right. Range for considering an near intersection (100 nucleotides) is shown with an orange box with those outside considered intergenic and counted to the right. (D) Total proportions of feature intersections for loci, divided by their transcriptional support (excluding values <1%).

Figure 3—figure supplement 1
Repetitiveness of unfiltered loci in related genomes.

(A) Cumulative density function of the number of blast hits in included genomes for hairpins from fungal miRNAs and miRNA-like (mi/milRNAs). Top hits are labeled with the target genome species identifier and the mi/milRNA which has a blast hit. (B) Expanded view of the count of species where bba-milR4 is found, separated by kingdom.

Figure 4 with 1 supplement
Hairpin RNAs were compared between species to identify possible homologous genes.

(A) 18S rRNA trees (Quast et al., 2013) (MUSCLE alignment, RAxML maximum likelihood tree) showing selected species with published miRNAs and miRNA-like (mi/milRNA) loci (circles) and closely related species with available genomes (diamonds). Connecting edges indicate mi/milRNA loci with highly conserved hairpin sequences in other species (nhmmer, >0.4 bits/basepair). Edge colors indicate the degree of transcriptional support for the source loci in the context of Figure 2: green – confirmed and replicated loci, yellow – loci confirmed in a single library, gray – loci with no available sequencing libraries. HC-miRNA loci from Arabidopsis thaliana (Artha) and Oryza sativa (Orsat) are shown in blue. Edge line width is scaled to indicate the number of connections between two species, with separate scales for each clade. Edge line width indicates the number of connecting edges between two species. (B) Example of a conserved locus from Cordyceps militaris (Comil) with other Sordariomycetes. Bases are colored by percent identity. The hairpin structure for Comil and Trichoderma reesei (Trree) are shown in dot-bracket format with colored circles showing scaled depth (RPM) for all positions across available libraries. Most-abundant sequence for these species are highlighted with a red box.

Figure 4—figure supplement 1
Multiple sequence alignment for homologs of miR159 across plant species.

Positions are colored by their column-wise percent identity. Fold and most-abundant sequences (MASs) are shown for Arabidopsis thaliana (Artha).

Supporting evidence was used to categorize hairpin RNAs into tier classifications.

(A) Tier classifications of loci based on evidence quality. Colors indicate the support for a locus in the context of Figures 24, with the bars showing the count of loci. Dark yellow labels in the legend indicate exclusion criteria (for assessment as tier 4). Tiers 1 and 2 were analyzed further, looking at (B) their species (showing loci count) and (C) the length of the most-abundant small regulatory RNA (sRNA) for a locus. Blue bars indicate size ranges common to sRNAs in other organisms. For B and C, bar outlines indicate tier of loci (red – tier 1, black – tier 2). (D) Hairpin lengths of tier 1 and tier 2 miRNAs and miRNA-like (mi/milRNAs) and of HC-miRNAs for plants and animals from miRbase. Length is defined as the duplex-to-duplex distance (including the duplex). (E) Specificity of sRNA dicing for the same groups, measured by the ratio of the most-abundant sRNA to all sRNAs matching the hairpin strand.

Additional files

Supplementary file 1

List of publications reporting miRNAs and miRNA-like (mi/milRNA) from fungi referred in this work.

We show the publication identifier used in this work, the organism it pertains to, the publication title, journal, URL, whether the miRNA/milRNA reported were obtained by sRNA-seq analysis (Y:yes; N:no; other context), details of why a study was excluded, whether sRNA-seq data is available (Y: yes, N:no), whether mi/milRNA loci coordinates are provided (Y:yes, N:no), whether biological replication of results has been performed (Y:yes, N:no), whether genetic evidence is provided (Y:yes; N:no), type of targeting evidence provided (prediction, degradome, 5’ RACE, in vitro support for targeting, and detection of target gene modulation through indirect approaches knock-outs of biosynthetic genes and direct approaches which modify the expression of the functional sRNA), and bioinformatic tool used for mi/milRNA identification. This also gives information on the native assembly used for the primary identification of the mi/milRNA loci cited in a publication, including assembly sources, IDs, names, and URLs where applicable.
Supplementary file 2

List of species considered in this work.

We show species abbreviations (abbv), full names (species), taxon identifiers (taxid), secondary names (synonyms), kingdom, phylum, class, order, family, genus, whether they are a species reported miRNAs and miRNA-like (mi/milRNAs) analyzed in this work (1: yes; 0: no), relevance to humans, hosts for pathogens/parasites, corresponding genome assemblies used for genomic context and homolog identification (ncbi_assembly), SILVA-db accessions, and whether a genus member was used as a SILVA-db substitute in the case of missing species data.
Supplementary file 3

The sRNA-seq libraries used for re-assessment of reported miRNAs and miRNA-like (mi/milRNA) loci.

We show the publication citations used in this work, SRR accession of the libraries, adapter sequence used for trimming, read length of the library, number of reads in the raw library (raw_depth), and the percentage of raw reads in the following categories: trimmed with adapters (have_adap), reads passing cutadapt filters (pass_filter), uniquely mapping reads, multi-mapping reads, percent non-mapping reads, and aligned reads resulting from unique and multi-mapper placement (aligned). Where available, this also gives the alignment rates given by source publications and a manually curated note.
Supplementary file 4

Rules testing data for miRNAs and fungal libraries.

We show identifiers for the miRNAs and miRNA-like (mi/milRNA) locus, citation, and source sRNA-seq library. Locus abundance, hairpin strand abundance, and percent stranding are shown. Other features include the abundance and sequence for the most-abundant sequence determined in this work (MAS), the inferred star for this MAS (MAS*), and the MAS cited in the source publication. Locus and cited MAS designation shown in accordance with Figure 2A. Rule passing is shown as a string, with the rule citation shown with a two-letter prefix (from Supplementary file 6), with the subsequent string pointing to rules passing (1) and failing (0) in numerical order.
Supplementary file 5

Table defining the confirmation status for a miRNAs and miRNA-like (mi/milRNA) for all reported loci.

We show identifiers for the locus, publication, native and NCBI assemblies used in its analysis, hairpin coordinates used for each assembly, and the most abundant sequence, indicating whether it has been corrected by our sequencing re-analysis. For loci with strong support for a targeting relationship, the type of evidence is shown. Locus details like the hairpin sequence, folding, minimum free energy, and boundary coordinates of the duplex within the hairpin are provided. We report identification details, including the level of coordinate and locus detail provided by the source publication and a locus’s validation status. Confirmation details include the rule set confirmation status, interactions with Rfam (including Rfam descriptions), intersections with genomic features, the status as an intergenic locus, the intergenomic conservation status, and the tier classification of all reported loci. Relevant figures are indicated where available for data columns. Dashes (‘-’) indicate fields that are not evaluated due to an ‘invalid’ hairpin.
Supplementary file 6

The miRNA annotation rules used in this work.

Rule identifiers use the prefixes ‘ax’ (Axtell and Meyers, 2018), ‘ku’ (Kuang et al., 2019), ‘mb’ (Kozomara and Griffiths-Jones, 2014), and ‘ss’ (Axtell, 2013b). General rule category and more detailed rule descriptions are presented for each rule.
MDAR checklist
Source data 1

Text file of the sRNA abundance profiles for all miRNAs and miRNA-like (mi/milRNA) with available sRNA-seq data.

Here is given the locus and library identifiers, abundance of the locus and all sRNAs aligned to the hairpin strand (l=length, a=abundance). Most-abundant sequence (MAS) and cited MASs are shown for each locus. Lower case nucleotides indicate mismatches with the reference.
Source data 2

Annotations of miRNAs and miRNA-like (mi/milRNAs) (gffs).

Includes annotations for mi/milRNAs in native genomes (where they were first annotated) and best-blast-hits (bbh) in NCBI-sourced assemblies. Annotations also contain two versions, one with all mi/milRNA loci and a second with only top tier loci (tiers 1 and 2, denoted ‘best’).

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Nathan R Johnson
  2. Luis F Larrondo
  3. José M Álvarez
  4. Elena A Vidal
Comprehensive re-analysis of hairpin small RNAs in fungi reveals loci with conserved links
eLife 11:e83691.