1. Chromosomes and Gene Expression
  2. Computational and Systems Biology
Download icon

Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification

  1. Matthew T Parker
  2. Katarzyna Knop
  3. Anna V Sherwood
  4. Nicholas J Schurch
  5. Katarzyna Mackinnon
  6. Peter D Gould
  7. Anthony JW Hall
  8. Geoffrey J Barton  Is a corresponding author
  9. Gordon G Simpson  Is a corresponding author
  1. University of Dundee, United Kingdom
  2. University of Liverpool, United Kingdom
  3. Norwich Research Park, United Kingdom
  4. James Hutton Institute, United Kingdom
Research Article
  • Cited 20
  • Views 9,097
  • Annotations
Cite this article as: eLife 2020;9:e49658 doi: 10.7554/eLife.49658

Abstract

Understanding genome organization and gene regulation requires insight into RNA transcription, processing and modification. We adapted nanopore direct RNA sequencing to examine RNA from a wild-type accession of the model plant Arabidopsis thaliana and a mutant defective in mRNA methylation (m6A). Here we show that m6A can be mapped in full-length mRNAs transcriptome-wide and reveal the combinatorial diversity of cap-associated transcription start sites, splicing events, poly(A) site choice and poly(A) tail length. Loss of m6A from 3’ untranslated regions is associated with decreased relative transcript abundance and defective RNA 3′ end formation. A functional consequence of disrupted m6A is a lengthening of the circadian period. We conclude that nanopore direct RNA sequencing can reveal the complexity of mRNA processing and modification in full-length single molecule reads. These findings can refine Arabidopsis genome annotation. Further, applying this approach to less well-studied species could transform our understanding of what their genomes encode.

Introduction

Patterns of pre-mRNA processing and base modifications determine eukaryotic mRNA coding potential and fate. Alternative transcripts produced from the same gene can differ in the position of the start site, the site of cleavage and polyadenylation, and the combination of exons spliced into the mature mRNA. Collectively termed the epitranscriptome, RNA modifications play crucial context-specific roles in gene expression (Meyer and Jaffrey, 2017; Roundtree et al., 2017). The most abundant internal modification of mRNA is methylation of adenosine at the N6 position (m6A). Knowledge of RNA modifications and processing combinations is essential to understand gene expression and what genomes really encode.

RNA sequencing (RNAseq) is used to dissect transcriptome complexity: it involves copying RNA into complementary DNA (cDNA) with reverse transcriptases (RTs) and then sequencing the subsequent DNA copies. RNAseq reveals diverse features of transcriptomes, but limitations can include misidentification of 3′ ends through internal priming (Jan et al., 2011), spurious antisense and splicing events produced by RT template switching (Houseley and Tollervey, 2010; Mourão et al., 2019), and the inability to detect all base modifications in the copying process (Helm and Motorin, 2017). The fragmentation of RNA prior to short-read sequencing makes it difficult to interpret the combination of authentic RNA processing events and remains an unsolved problem (Steijger et al., 2013).

We investigated whether long-read direct RNA sequencing (DRS) with nanopores (Garalde et al., 2018) could reveal the complexity of Arabidopsis mRNA processing and modifications. In nanopore DRS, the protein pore (nanopore) sits in a membrane through which an electrical current is passed, and intact RNA is fed through the nanopore by a motor protein (Garalde et al., 2018). Each RNA sequence within the nanopore (five bases) can be identified by the magnitude of signal it produces. Arabidopsis is a pathfinder model in plant biology, and its genome annotation strongly influences the annotation and our understanding of what other plant genomes encode. We applied nanopore DRS and Illumina RNAseq to wild-type Arabidopsis (Col-0) and mutants defective in m6A (Růžička et al., 2017) and exosome-mediated RNA decay (Lange et al., 2014). We reveal m6A and combinations of RNA processing events (alternative patterns of 5′ capped transcription start sites, splicing, 3′ polyadenylation and poly(A) tail length) in full-length Arabidopsis mRNAs transcriptome-wide.

Results

Nanopore DRS detects long, complex mRNAs and short, structured non-coding RNAs

We purified poly (A)+ RNA from four biological replicates of 14-day-old Arabidopsis Col-0 seedlings. We incorporated synthetic External RNA Controls Consortium (ERCC) RNA spike-in mixes into all replicates (Jiang et al., 2011; Reid and External RNA Controls Consortium, 2005) and carried out nanopore DRS (Supplementary file 1). Illumina RNAseq was performed in parallel on similar material (Supplementary file 2). Using Guppy base-calling (Oxford Nanopore Technologies) and minimap2 alignment software (Li, 2018), we identified around 1 million reads per sample (Supplementary file 1). The longest read alignments were 12.7 kb for mRNA transcribed from AT1G48090, spanning 63 exons, (Figure 1A), and 12.8 kb for mRNA transcribed from AT1G67120, spanning 58 exons (Figure 1—figure supplement 1A). These represent some of the longest contiguous mRNAs sequenced from Arabidopsis. Among the shortest read alignments were those spanning genes encoding highly structured non-coding RNAs such as UsnRNAs and snoRNAs, for example U3B (Figure 1B).

Figure 1 with 1 supplement see all
Diverse Arabidopsis RNAs are detected by nanopore DRS.

(A) Nanopore DRS 12.7 kb read alignment at AT1G48090, comprising 63 exons. Blue, nanopore DRS read alignments; black, Araport11 annotation. (B) Nanopore DRS read alignments at the snoRNA gene U3B (AT5G53902). Blue, sample of nanopore DRS read alignments; black, Araport11 annotation. (C) PIN4 (AT2G01420) antisense RNAs detected using nanopore DRS. Blue, Col-0 PIN4 sense Illumina RNAseq coverage and sample of nanopore PIN4 sense read alignments; orange, Col-0 PIN4 antisense Illumina RNAseq coverage and nanopore PIN4 antisense read alignments; green, hen2-2 mutant PIN4 sense Illumina RNAseq coverage; purple, hen2-2 mutant PIN4 antisense Illumina RNAseq coverage; black, PIN4 sense RNA isoforms found in Araport11; grey, PIN4 antisense differentially expressed regions detected in Illumina RNAseq with DERfinder.

Base-calling errors in nanopore DRS are non-random

We used ERCC RNA spike-ins (Jiang et al., 2011) as internal controls to monitor the properties of the sequencing reads. The spike-ins were detected in a quantitative manner (Figure 1—figure supplement 1B), consistent with the suggestion that nanopore sequencing is quantitative (Garalde et al., 2018). For the portion of reads that align to the reference, sequence identity was 92% when measured against the ERCC RNA spike-ins (Figure 1—figure supplement 1C). The errors showed evidence of base specificity (Figure 1—figure supplement 1D,E). For example, guanine was under-represented and uracil over-represented in indels and substitutions relative to the reference nucleotide (nt) distribution. In some situations, this bias could impact the utility of interpreting nanopore sequence errors. We used the proovread software tool (Hackl et al., 2014) and parallel Illumina RNAseq data to correct base-calling errors in the nanopore reads (Depledge et al., 2019).

Artefactual splitting of raw signal affects transcript interpretation

We detected artefacts caused by the MinKNOW software splitting raw signal from single molecules into two or more reads. As a result, alignments comprising apparently novel 3′ ends were mapped as adjacent to alignments with apparently novel 5′ ends (Figure 1—figure supplement 1F). A related phenomenon called over-splitting was recently reported in nanopore DNA sequencing (Payne et al., 2019). Over-splitting can be detected when two reads sequenced consecutively through the same pore are mapped to adjacent loci in the genome (Payne et al., 2019). Over-splitting in nanopore DRS generally occurs at low frequency (< 2% of reads). However, RNAs originating from specific gene loci, such as RH3 (AT5G26742), appear to be more susceptible, with up to 20% of reads affected across multiple sequencing experiments (Figure 1—figure supplement 1F).

Spurious antisense reads are rare or absent in nanopore DRS

Two out of 9,445 (0.02%) reads mapped antisense to the ERCC RNA spike-in collection (Jiang et al., 2011). None of 19,665 reads mapped antisense to the highly expressed gene RUBISCO ACTIVASE (RCA) (AT2G39730), consistent with the lack of antisense RNAs annotated at this locus. Consequently, we conclude that spurious antisense is rare or absent from nanopore DRS data (Wyman, 2019; Mourão et al., 2019). This simplifies the interpretation of authentic antisense RNAs, which is important in Arabidopsis because the distinction between RT-dependent template switching and authentic antisense RNAs produced by RNA-dependent RNA polymerases that copy mRNA is not straightforward (Matsui et al., 2017). For example, by nanopore DRS, we could identify Arabidopsis long non-coding antisense RNAs, such as those at the auxin efflux carriers PIN4 and PIN7 (Figure 1C, Figure 1—figure supplement 1G). The existence of these previously unannotated antisense RNAs was supported by Illumina RNAseq of wild-type Col-0 and the exosome mutant hen2–2 (Figure 1C, Figure 1—figure supplement 1G) (Supplementary file 2A), the latter of which had a 13-fold increase in abundance of these antisense RNAs. Consequently, the low level of steady-state accumulation of some antisense RNAs may explain why they are currently unannotated.

Nanopore DRS confirms sites of RNA 3′ end formation and estimates poly(A) tail length

Ligation of the motor protein adapter to RNA 3′ ends results in nanopores sequencing mRNA poly(A) tails first. We used the nanopolish-polyA software tool to estimate poly(A) tail lengths for individual reads (Workman et al., 2019). Nanopolish-polyA segments the raw signal into regions corresponding to the mRNA and poly(A) tail and uses the length of these regions to estimate poly(A) tail length. Nanopolish-polyA was developed using in vitro transcribed RNA appended with poly(A) tails of defined lengths ranging from 10 to 100 nt (Workman et al., 2019). We first examined the ability of this tool to estimate the known lengths of poly(A) tails in the ERCC spike-ins within our own datasets (Figure 2—figure supplement 1A). ERCC-00002 has an expected poly(A) tail length of 24 nt, and the median nanopolish polyA estimate was 23.3 nt (95% confidence intervals [CIs] [5.7, 102]; n = 1,485). ERCC-00074 also has an expected poly(A) tail length of 24 nt, and the median nanopolish-polyA estimate was 27.4 nt (95% CIs [8.9, 118]; n = 1,439). We conclude that the median nanopolish-polyA estimates of poly(A) tail length are accurate within a few bases, but individual read measurements have a wide margin of error. Application of this approach to reads mapping to the Arabidopsis genome indicated a median length estimate of 68 nt for Arabidopsis mRNA poly(A) tails, but with a wide range in estimated lengths for individual mRNAs (95% were in the 13–200 nt range; Figure 2A). The generally shorter poly(A) tails of chloroplast- (median length 13 nt, 95% CIs [3.1, 56]) and mitochondria-encoded (median length 20 nt, 95% CIs [4.1, 73] transcripts, which are a feature of RNA decay in these organelles, were also detectable. We find that poly(A) tail length correlates negatively (Spearman's ρ = −0.34, p < 1 × 10−16, 95% CIs [−0.36,–0.32]) with gene expression in Arabidopsis (Figure 2—figure supplement 1B), consistent with other species analysed by short-read TAILseq (Lima et al., 2017).

Figure 2 with 1 supplement see all
Nanopore DRS reveals poly(A) tail length and maps 3′ cleavage and polyadenylation sites.

(A) Normalized histogram showing poly(A) tail length of RNAs encoded by different genomes in Arabidopsis. Blue, nuclear (n = 2,348,869 reads); orange, mitochondrial (n = 2,490 reads); green, chloroplast (n = 1,848 reads). (B) Distance between the RNA 3′ end positions in nanopore DRS read alignments and the nearest polyadenylation sites identified by Helicos data. (C) Nanopore DRS identified 3′ polyadenylation sites in RNAs transcribed from FPA (AT2G43410). Blue track, coverage of nanopore DRS read alignments; blue, collapsed read alignments representing putative transcript annotations detected by nanopore DRS; black, Araport11 annotation; pPAS, proximal polyadenylation site; dPAS, distal polyadenylation sites.

We previously mapped Arabidopsis mRNA 3′ ends transcriptome-wide using Helicos direct RNA sequencing (DRS) (Sherstnev et al., 2012 ) We compared the position of 3′ ends of nanopore DRS read alignments and Helicos DRS data genome-wide. The median genomic distance between nanopore DRS and Helicos DRS 3′ ends was 0 ± 13 nt (one standard deviation) demonstrating close agreement between these orthogonal technologies (Figure 2B). Likewise, the overall distribution of the 3′ ends of aligned nanopore DRS reads resembles the pattern we previously reported with Helicos data (Sherstnev et al., 2012). For example, 97% of nanopore DRS 3′ ends (4,152,800 reads at 639,178 unique sites, 93% of all unique sites) mapped to either annotated 3′ untranslated regions (UTRs) or downstream of the current annotation. Mapping of 3′ ends to coding sequences or 5′UTRs was rare (2.8%, 119,524 reads at 39,610 unique sites, 5.8% of all unique sites), and mapping to introns even rarer (0.29%, 12,554 reads at 7791 unique sites, 1.1% of unique sites). Even so, examples of the latter included sites of alternative polyadenylation with well-established regulatory roles, such as in mRNA encoding the RNA-binding protein FPA, which controls flowering time (Hornyik et al., 2010) (Figure 2C), and in mRNA encoding the histone H3K9 demethylase IBM1, which controls levels of genic DNA methylation (Rigal et al., 2012) (Figure 2—figure supplement 1C). In addition, we identified novel examples of intronic alternative polyadenylation predicted to influence gene function. For example, we found frequent cleavage and polyadenylation at mRNA encoding the PTM homeodomain transcription factor (AT5G35210) (Figure 2—figure supplement 1D). PTM mediates retrograde signalling upon cleavage of a C-terminal transmembrane domain that sequesters it to the chloroplast (Feng et al., 2016). Cleavage and polyadenylation within PTM intron 10 (supported by our Helicos DRS data) terminates transcripts prior to sequence encoding the transmembrane domain, hence bypassing established retrograde control.

Since RT-dependent internal priming can result in the misinterpretation of authentic cleavage and polyadenylation sites (Jan et al., 2011), we next determined whether nanopore DRS was compromised in this way. To address this issue, we examined whether the 3′ ends of nanopore DRS reads mapped to potential internal priming substrates comprised of six consecutive adenosines within a transcribed coding sequence (according to the Arabidopsis Information Portal Col-0 genome annotation, Araport11 Cheng et al., 2017). Of the 10,116 such oligo(A)6 sequences, only four have read alignments terminating within 13 nt in all four datasets. Of these, two were not detectable after error correction with proovread (suggesting that they resulted from alignment errors) and the other two mapped to the terminal exon of coding sequence annotation, indicating that they may be authentic 3′ ends. Hence, internal priming is rare or absent in nanopore DRS data. It is possible that a standard nanopore DRS approach may be compromised in the quantitative detection of RNAs with very short poly(A) tails or tails incorporating uridylation for decay. However, overall, we conclude that nanopore DRS can identify multiple authentic features of RNA 3′ end processing.

Cap-dependent 5′ RNA detection by nanopore DRS

Nanopore DRS reads are frequently truncated prior to annotated transcription start sites, resulting in a 3′ bias of genomic alignments (Figure 3A) (Depledge et al., 2019). Consequently, it is impossible to determine which, if any, aligned reads correspond to full-length transcripts. To address this issue, we used cap-dependent ligation of a biotinylated 5′ adapter RNA to purify capped mRNAs (a related approach has recently been used by others Jiang et al., 2019). We then re-sequenced two biological replicates of Arabidopsis Col-0 incorporating 5′ adapter ligation (Supplementary file 1) and filtered the reads for 5′ adapter RNA sequences using the sequence alignment tool BLASTN and specific criteria (Supplementary file 3). We then used high confidence examples of sequences that passed or failed these criteria to train a convolutional neural network to detect the 5′ adapter RNA in the raw signal (Figure 3—figure supplement 1A–C). Hence, we improved 5′ adapter-ligated RNA detection without requiring base-calling or genome alignment, and demonstrated enrichment of full-length, cap-dependent mRNA sequences (Figure 3A,B). This procedure reduced the median 3′ bias of nanopore read alignments per gene (as measured by quartile coefficient of variation of per base coverage) from 0.45 (95% CIs [0.43,0.47]) to 0.08 (95% CIs [0.07,0.09]; Figure 3B).

Figure 3 with 1 supplement see all
Cap-dependent ligation of an adapter enables detection of authentic RNA 5′ ends.

(A) 5′ adapter RNA ligation reduces 3′ bias in nanopore DRS data at RCA (AT2G39730) from 0.92 to 0.03. Blue line, exonic read coverage at RCA for reads without (left) and with adapter (right); orange line, median coverage; orange shaded area, interquartile range (IQR). Change in 3′ bias can be measured using the IQR/median = quartile coefficient of variation (QCoV). (B) 5′ adapter RNA ligation reduces 3′ bias in nanopore DRS data. Histogram showing the QCoV in per base coverage for each gene in the Araport11 annotation, for reads with a 5′ adapter RNA (orange), and reads without a 5′ adapter RNA (blue). (C) Cap-dependent adapter ligation allows identification of authentic 5′ ends using nanopore DRS. The cumulative distribution function shows the distance to the nearest Transcription Start Site (TSS) identified from full-length transcripts cloned as part of the RIKEN RAFL project (left), or 5’ tag identified from nanoPARE data (right), for reads with a 5′ adapter RNA (orange) compared with reads without a 5′ adapter RNA (blue). (D) Cap-dependent adapter ligation enabled resolution of an additional 11 nt of sequence at the RNA 5′ end. Histogram showing the nucleotide shift in the largest peak of 5′ coverage for each gene in data obtained using protocols with vs without a 5′ adapter RNA ligation. (E) For RCA (AT2G39730), the 5′ end identified using cap-dependent 5′ adapter RNA ligation protocol was consistent with Illumina RNAseq and full-length cDNA start site data but differed from the 5′ ends in the Araport11 and AtRTD2 annotations. Grey, Illumina RNAseq coverage; blue, nanopore DRS 5′ end coverage generated without a cap-dependent ligation protocol; green/pink, nanopore DRS 5′ end coverage for read alignments generated using the cap-dependent ligation protocol with (green) and without (pink) 5′ adapter RNA; orange, 5’ coverage of nanoPARE data; blue, TSSs identified from full-length transcripts cloned as part of the RIKEN RAFL project; black, Araport11 and AtRTD2 annotations (with duplicated 5’ positions removed).

In order to determine whether the 5′ ends we detected reflect full-length mRNAs, we compared them against transcription start sites in datasets derived from two orthogonal approaches. First, we compared the nanopore cap-capture 5’ ends with full-length Arabidopsis cDNA clones (Seki et al., 2002). We found that 41% of adapter-ligated nanopore DRS reads mapped within 5 nt of transcription start sites and 60% mapped within 13 nt (Figure 3C). Next, we compared the nanopore cap-captured 5’ ends to Arabidopsis data obtained using an Illumina-based high throughput 5’ mapping approach called nanoPARE (parallel analysis of RNA ends) (Schon et al., 2018). We found that 75% of nanopore DRS cap-capture 5’ ends mapped within 5 nt of a transcription start site detected using single base resolution peak-calling from nanoPARE reads, and 82% mapped with 13 nt (Figure 3C). This closer agreement is presumably due to the increased depth of nanoPARE data leading to greater coverage of alternative transcription start sites. We also detected recently defined examples of alternative 5′ transcription start sites (Ushijima et al., 2017) at specific Arabidopsis genes, which are also supported by nanoPARE data (Figure 3—figure supplement 1D). We therefore conclude that this approach is effective in detecting authentic mRNA 5′ ends.

Reads with adapters had, on average, 11 nt more at their 5′ ends that could be aligned to the genome compared with the most common 5′ alignment position of reads lacking the 5′ adapter RNA (Figure 3D). This difference may be explained by loss of processive control by the motor protein when the end of an RNA molecule enters the pore. As a result, the 5′ end of RNA is not correctly sequenced. Consistent with these Arabidopsis transcriptome-wide nanopore DRS data, reads mapping to the synthetic ERCC RNA Spike-Ins and in vitro transcribed RNAs also lacked ~11 nt of authentic 5′ sequence (Figure 3—figure supplement 1E,F). However, the precise length of 5′ sequence missing from all of these RNAs varied, suggesting that sequence- or context-specific effects on sequence accuracy are associated with the passage of 5′ RNA through the pore (Figure 3D, Figure 3—figure supplement 1E,F).

Despite the close agreement between nanopore DRS, Illumina RNAseq and full-length cDNA data (Seki et al., 2002) at RCA, the start site annotated in Araport11 and the Arabidopsis thaliana Reference Transcript Dataset 2 (AtRTD2) (Zhang et al., 2017) is quite different (Figure 3E). The apparent overestimation of 5′UTR length is widespread in Araport11 annotation (Figure 3—figure supplement 1G), consistent with the assessment of capped Arabidopsis 5′ ends detected by nanoPARE sequencing (Schon et al., 2018). Consequently, with appropriate modification to the current protocol, such as we describe here, nanopore DRS data can be used to revise Arabidopsis transcription start site annotations.

We next asked whether the truncation of reads prior to predicted TSSs in conventional nanopore DRS data might inform features of biological RNA decay. We examined if there was enrichment of 5’ nanopore DRS reads terminating in proximity to miRNA cleavage sites predicted in protein coding and non-coding RNAs. We identified peaks of downstream cleavage products at mRNA encoding, for example, HAM1, which is targeted by miR170 and miR171 (Figure 3—figure supplement 1H), NAC1 (targeted by miR164), ARF proteins (targeted by miR160) and SPL domain-containing proteins (targeted by miR156/7) (Supplementary file 4). In addition, we could detect cleavage of the non-coding TAS RNAs that are targeted by miR173 (Supplementary file 4). Therefore, nanopore DRS can reveal regions of internal mRNA and ncRNA cleavage. Since nanopore DRS does not sequence the 5’ end of RNA accurately, precision in this approach would be improved by ligation of an adapter to uncapped RNAs (Schon et al., 2018).

Nanopore DRS reveals the complexity of splicing events

In single reads, nanopore sequencing revealed some of the most complex splicing combinations so far identified in the Arabidopsis transcriptome. For example, analysis of error corrected reads revealed the splicing pattern of a 12.7 kb read alignment, comprised of 63 exons, agreeing exactly with the AT1G48090.4 isoform annotated in Araport11 (Figure 1A). Mutually exclusive alternative splicing of FLM (AT1G77080) that mediates the thermosensitive response controlling flowering time (Posé et al., 2013) was also detected (Figure 4A). However, a combination of base-calling and alignment errors contributed to the misidentification of splicing events for uncorrected DRS data: 58% (170,702) of the unique splice junctions detected in the combined set of replicate data were absent from Araport11 and AtRTD2 annotations and were unsupported by Illumina RNAseq (Figure 4B, Supplementary file 5). We applied proovread (Hackl et al., 2014; Depledge et al., 2019) error correction with the parallel Illumina RNAseq data and then re-analysed the corrected and uncorrected nanopore DRS data. After error correction, only 13% (39,061) of unique splice junctions were unsupported by an orthogonal dataset, consistent with an improvement in alignment accuracy. The four nanopore DRS datasets for Col-0 biological replicates captured 75% (102,486) and 69% (104,686) of Araport11 and AtRTD2 splice site annotations, respectively. Most of the canonical GU/AG splicing events (100,450; 81%) detected in the error-corrected nanopore data were found in both annotations and were supported by Illumina RNAseq (Figure 4B, Supplementary file 5). A total of 3,234 unique canonical splicing events in the error-corrected nanopore DRS data were supported by Illumina RNAseq but absent from both Araport11 and AtRTD2 annotations, highlighting potential gaps in our understanding of the complexity of Arabidopsis splicing annotation (Figure 4B, Supplementary file 5). Consistent with this, we validated three of these newly identified splicing events using RT-PCR (reverse transcription polymerase chain reaction) followed by cloning and Sanger sequencing (Figure 4—figure supplement 1A). In order to examine the features of these unannotated splicing events, we applied previously determined splice site position weight matrices of the flanking sequences to categorize U2 or U12 class splice sites (Sheth et al., 2006). Of the 3,234 novel GU/AG events found in error-corrected data and supported by Illumina alignments, 74% were classified as canonical U2 or U12 splice sites, suggesting that they are authentic (Figure 4—figure supplement 1B).

Figure 4 with 1 supplement see all
Nanopore DRS reveals the complexity of alternative splicing.

(A) Nanopore DRS identified the mutually exclusive alternative splicing of the second and third exons of FLOWERING LOCUS M (FLM, AT1G77080) and several novel isoforms. Black arrows indicate mutually exclusive exons. Orange, isoforms present in the AtRTD2 annotation but not identified using nanopore DRS; blue, isoforms common to both AtRTD2 and nanopore DRS; green, novel isoforms identified in nanopore DRS. (B) Comparison of splicing events identified in error-corrected and non-error-corrected nanopore DRS, Illumina RNA sequencing, and Araport11 and AtRTD2 annotations. Bar size represents the number of unique splicing events common to the set intersection highlighted using circles (see Supplementary file 5 for the exact values). GU/AG splicing events are shown on the top and non-GU/AG on the bottom of the plot: yellow, splicing events common to all five datasets; green, events common to both error-corrected and non-error-corrected nanopore DRS with support in orthogonal datasets; blue, events common to both nanopore DRS datasets without orthogonal support; orange, events found in uncorrected nanopore DRS (but not error corrected) with orthogonal support; pink, events found in error-corrected nanopore DRS (but not uncorrected) with orthogonal support. (C) Comparison of RNA isoforms (defined as sets of co-spliced introns) identified in error-corrected full-length nanopore DRS, Araport11 and AtRTD2 annotations. Bar size represents the number of splicing events common to a group highlighted using circles below (see Supplementary file 5 for the exact values): yellow, unique splicing patterns found in nanopore DRS and both reference annotations; blue, novel isoforms found only in nanopore DRS.

In addition to previously unannotated splicing events, we identified unannotated combinations of previously established splice sites. For example, we identified 19 FLM splicing patterns that adhered to known splice junction sites (Figure 4A). However, 11 of these transcript isoforms were not previously annotated. In order to investigate this phenomenon transcriptome-wide, we analysed the 5′ cap-dependent nanopore DRS datasets of full-length mRNAs (Supplementary file 1). Unique sets of co-splicing events were extracted from error-corrected reads (so as to focus on splicing, we did not consider single exon reads or 5′ and 3′ positions). In total, 13,064 unique splicing patterns were detected that matched annotations in Araport11, AtRTD2 or both (Figure 4C). Another 8,659 unique splicing patterns were identified that were not present in either annotation (Figure 4C, Supplementary file 5). Of these, 50% (4,293) used only splice donor and acceptor pairs that were already annotated in either Araport11 or AtRTD2. Hence, this approach defines splicing patterns (including retained introns) produced from alternative combinations of known splice sites.

We next asked whether these full-length nanopore DRS reads could reveal alternative splicing events with a predicted functional impact, such as the introduction of premature termination codons (PTCs) that can target mRNAs for nonsense-mediated decay (NMD). We could identify established cases of alternative splicing that lead to NMD (Kalyna et al., 2012) (Supplementary file 6). In addition, by highlighting reads overlapping annotated ORFs, with splicing events that would shift the annotated stop codon out of frame, we identified novel events that could lead to NMD. For example, exon skipping at mRNA encoding a KH domain protein (AT5G56140) and the use of alternative acceptor sites at mRNA encoding XRCC (AT1G80420) introduce PTCs. In both cases, the splicing events detected with nanopore DRS were supported by reads from our Illumina RNAseq data.

Overall, we conclude that nanopore DRS can reveal a greater complexity of splicing in the context of full-length mRNAs compared with short-read data. However, accurate splice pattern detection benefits from error correction with, for example, high-accuracy orthogonal short-read sequencing data. However, even with error-free sequences, accurate splice detection can be confounded by the existence of equivalent alternative junctions (Dehghannasiri et al., 2019). Therefore, improved computational tools are required, not only for basecalling or error correction but also for splicing-aware long-read alignment.

Differential error site analysis reveals m6A modifications transcriptome-wide

mRNA modifications have emerged recently as a crucial, but relatively neglected, layer of gene regulation (Meyer and Jaffrey, 2017; Roundtree et al., 2017). m6A has been mapped transcriptome-wide using approaches based on antibodies that recognize this mark (Helm and Motorin, 2017; Linder et al., 2015). However, in principle, m6A can be detected by nanopore DRS (Garalde et al., 2018). Since m6A is not included in the training data for nanopore base-calling software, we asked whether its incorrect interpretation could be used to identify m6A in RNA. To test this possibility, we examined whether we could detect m6A in a matched set of synthetic RNAs that differ only in the presence or absence of N6 methylation at a single adenosine. We selected a 60 nt portion of the human MALAT1 lncRNA with a previously reported m6A site (Liu et al., 2013). The m6A modified and unmodified RNAs were chemically synthesised, complete with a 50 nt poly(A) tail and sequenced using nanopore DRS. Using read alignments to the known sequence of the synthetic RNA, we applied a G-test to detect differences in the basecall profile of alignments at each reference position. Using this differential error rate approach, we could correctly map the position of m6A in these synthetic RNAs (Figure 5—figure supplement 1A).

We next asked, if this same approach could be used to map Arabidopsis m6A transcriptome-wide. For this, we applied nanopore DRS to four biological replicates of an Arabidopsis mutant defective in the function of VIRILIZER (vir-1), a conserved m6A writer complex component, and four biological replicates of a line expressing VIR fused to Green Fluorescent Protein (GFP) (VIR complemented; VIRc) that restores VIR activity in the vir-1 mutant background (Růžička et al., 2017) (Figure 5—figure supplement 1B). In parallel, we sequenced six biological replicates of each genotype with Illumina RNAseq (Supplementary file 2B). We used the differential error rate analysis approach to compare the mutant (defective m6A) and VIR-complemented lines. We identified 17,491 sites with a more than two-fold higher error rate (compared with the TAIR10 reference base) in the VIR-complemented line with restored m6A (Figure 5A). No VIR-dependent error sites mapped to either chloroplast or mitochondrial-encoded RNAs. In all, 99.4% of the differential error sites mapped within Araport11 annotated protein-coding genes. Motif analysis of these error sites revealed the DRAYH (D = G or U or A, R = G or A, Y = C or U, H = A or C or U) sequence (E value < 1 × 10−16), which closely resembles the established m6A target consensus (Meyer and Jaffrey, 2017; Roundtree et al., 2017) (Figure 5B). The most frequently detected motifs in this analysis, AAm6ACU and AAm6ACA (Figure 5—figure supplement 1C), match those most frequently detected in Arabidopsis using the orthogonal Me-RIP approach (Luo et al., 2014). Notably, we also detect 187 sites (4.8%) associated with the motif AGAUU (Figure 5—figure supplement 1C), raising the possibility that a C following m6A is not an invariant feature of the Arabidopsis m6A code. In addition, like the established location of m6A sites in mRNAs (Meyer and Jaffrey, 2017; Roundtree et al., 2017), the error sites were preferentially found in 3′UTRs (Figure 5C). The relatively low resolution of Me-RIP analysis is associated with an interpretation of enrichment of m6A over stop codons. However, consistent with the development of other more accurate methods of mapping m6A (Ke et al., 2015), nanopore DRS analysis does not support this (Figure 5D). Since approximately five nt contribute to the observed current at a given time point in nanopore sequencing (Garalde et al., 2018), the presence of a methylated adenosine could affect the accuracy of base-calling for the surrounding nucleotides. Consistent with this, we identified 3,871 unique sequences (False Discovery Rate [FDR] < 0.1) matching the motif discovered at error sites (Figure 5—figure supplement 1C), with a median of one error site per motif (95% CIs [1, 5]). Overall, these results agree with the established and conserved properties of authentic m6A sites (Meyer and Jaffrey, 2017; Roundtree et al., 2017), suggesting that differential error sites can be used to identify thousands of m6A modifications in nanopore DRS datasets.

Figure 5 with 1 supplement see all
Differential error rate analysis identifies sites of VIR-dependent m6A modifications transcriptome-wide.

(A) Loss of VIR function is associated with reduced error rate in nanopore DRS. Histogram showing the log2 fold change in the ratio of errors to reference matches at bases with a significant change in error profile in vir-1 mutant compared with the VIR-complemented line. Orange and green shaded regions indicate sites with increased and reduced errors in vir-1, respectively. (B) The motif at error rate sites matches the consensus m6A target sequence. The sequence logo is for the motif enriched at sites with reduced error rate in the vir-1 mutant. (C) Differential error rate sites are primarily found in 3′ UTRs. Bar plot showing the number of differential error rate sites per kb for different genic feature types of 48,149 protein coding transcript loci in the nuclear genome of the Araport11 reference. Upstream and downstream regions were defined as 200 nt regions upstream of the annotated transcription start sites and downstream of the annotated transcription termination sites, respectively. (D) Differential error rate sites and miCLIP peaks are similarly distributed within the 3′ UTR, without accumulation at the stop codon. Metagene plot centred on stop codons from 48,149 protein coding transcript loci, showing the frequency of nanopore DRS error sites (blue) and miCLIP peaks (orange). (E) The locations of differential error rate sites are in good agreement with the locations of sites identified by miCLIP. Histogram showing the distribution of distances to the nearest miCLIP peak for each site of reduced error. Most error sites (66%) are within 5 nt of a miCLIP peak. (F) Nanopore DRS differential error site analysis and miCLIP identify m6A sites in the 3′ UTR of PRPL34 RNA. Blue, miclip 5′ end coverage; purple, nanopore DRS differential error sites; black, RNA isoform from Araport11 annotation. Expanded region: blue, miCLIP coverage; purple, nanopore DRS differential error sites; black, motif scores using the m6A consensus position weight matrix (Figure 5B). A higher positive score denotes a higher likelihood of a match to the consensus sequence.

In order to examine the validity of m6A sites identified by the differential error site analysis, we used an orthogonal technique to map m6A. Previous maps of Arabidopsis m6A are based on Me-RIP (Shen et al., 2016a; Anderson et al., 2018) and limited by a resolution of around 200 nt (Ke et al., 2015). Therefore, to examine Arabidopsis m6A sites with a more accurate method, we used miCLIP (Linder et al., 2015) analysis of two biological replicates of Arabidopsis Col-0. We found that, like the differential error sites uncovered in the nanopore DRS analysis, the Arabidopsis miCLIP reads were enriched in 3′ UTRs but with no enrichment over stop codons (Figure 5D, Figure 5—figure supplement 1D). In all, 66% of the called nanopore DRS differential error sites fell within 5 nt of an miCLIP peak (Figure 5E,F). We therefore conclude that our analysis of nanopore DRS data can detect authentic VIR-dependent m6A sites transcriptome-wide.

We next examined how the power of the differential error rate analysis approach to detect m6A depends upon sequencing depth. To do this we used a random sub-sampling approach at different read depths. Using the single replicate of MALAT1 synthetic RNA nanopore DRS data we were able to detect m6A at the correct position in more than 95% of random sub-samples with 60 reads per condition (Figure 5—figure supplement 1E). Since the stoichiometry of m6A varies in nature, we replaced some of the sample of methylated reads with unmethylated reads in silico. When 50% of reads in the methylation positive condition are sampled from the unmethylated condition, more than 200 reads per condition are required to reliably detect m6A in > 95% bootstraps, indicating that stoichiometry is an important factor in the power to detect differentially methylated sites (Figure 5—figure supplement 1E). We next performed a sub-sampling analysis for a selection of m6A sites using the Arabidopsis nanopore DRS datasets. We found that the number of reads required to detect m6A varies between Arabidopsis genes and between methylated positions in transcripts from the same gene. Presumably, these distinctions reflect differences in stoichiometry (or possibly sequence context) (Figure 5—figure supplement 1F). For some positions, we can reliably detect m6A with only 10 reads per replicate (in a four replicate per condition experiment).

Defective m6A perturbs gene expression patterns and lengthens the circadian period

The combination of transcript processing and modification data obtained using nanopore DRS enabled us to investigate the impact of m6A on Arabidopsis gene expression. We found a global reduction in protein-coding gene expression in vir-1 (using either nanopore DRS or Illumina RNAseq data), corresponding to transcripts that were methylated in the VIR-complemented line (Figure 6A, Figure 6—figure supplement 1A). These findings are consistent with the recent discovery that m6A predominantly protects Arabidopsis mRNAs from endonucleolytic cleavage (Anderson et al., 2018). Therefore, although the m6A writer complex comprises conserved components that target a conserved consensus sequence and distribution of m6A is enriched in the last exon, it appears that this modification predominantly promotes mRNA decay in human cells (Wang et al., 2014), but mRNA stability in Arabidopsis (Anderson et al., 2018).

Figure 6 with 1 supplement see all
Reduction in m6A RNA modification leads to disruption of the circadian clock.

(A) Genes with differential error rate sites have lower detectable RNA levels. Histogram showing the log2 fold change in protein coding gene expression based on counts from nanopore DRS reads in the vir-1 mutant compared to the VIR-complemented line. Blue, genes with differential error rate sites (n = 5083 genes); orange, genes without differential error rate sites (n = 14,675 genes). (B) The circadian period is lengthened in the vir-1 mutant. Mean delayed fluorescence measurements in arbitrary units: blue, Col-0 (n = 61 technical reps); green, the VIR-complemented line (VIRc; n = 29 technical reps); orange, vir-1 (n = 24 technical reps). Shaded areas show bootstrapped 95% confidence intervals for the mean. (C) Boxplot showing the period lengths calculated from delayed fluorescence measurements shown in (b): blue, Col-0; green, VIRc; orange, vir-1 (orange). (D) Poly(A) tail length is altered in the vir-1 mutant. Histogram showing the poly(A) tail length distribution of CAB1 (AT1G29930): blue, Col-0 (n = 40,841 reads); green, VIRc (n = 65,810 reads); orange, vir-1 (n = 68,068 reads). Arrows indicate phased peaks of poly(A) length at approximately 20, 40 and 60 nt. vir-1 distribution is significantly different from both Col-0 and VIRc, according to the Kolmogorov–Smirnov test (p < 1 × 10−16, p < 1 × 10−16 respectively).

The changes in gene expression in vir-1 were wide ranging (Figure 6—figure supplement 1B,C,D). For example, we found that the abundance of mRNAs encoding components of the interlocking transcriptional feedback loops that comprise the Arabidopsis circadian oscillator (McClung, 2011), such as CCA1, were altered in vir-1 (Figure 6—figure supplement 1B,C). This distinction was associated with a biological consequence in that the vir-1 mutant had a lengthened clock period (Figure 6B,C). Notably, m6A is also required for normal circadian rhythms in mammalian cells (Fustin et al., 2013). We detected m6A at mRNAs encoding the clock components PRR7, GI and LNK1/2 in both the nanopore DRS and miCLIP data (Figure 6—figure supplement 1B). We also detected shifts in the poly(A) tail length distributions of mRNAs transcribed from genes previously shown to be under circadian control (Supplementary dataset 17). At CAB1 mRNAs, for example, we detected poly(A) tail length estimates that peaked at approximately 20, 40 and 60 nt (Figure 6D). vir-1 mutants had reduced abundance of CAB1 mRNAs with 20 and 40 nt poly(A) tails, and an increased abundance of poly(A) tails of 60 nt in length (Figure 6D). Therefore, nanopore DRS may uncover the circadian control of poly(A) tail length, as previously reported for specific Arabidopsis genes (Lidder et al., 2005). An output of the Arabidopsis circadian clock is the control of flowering time, and we found that not only were photoperiod pathway components differentially expressed but so too were other flowering time genes (Supplementary file 7). Notably, FLOWERING LOCUS C (FLC) expression was reduced by more than 40-fold compared with the wild type (Figure 6—figure supplement 1D). Consequently, the proper control of circadian rhythms, flowering time and the regulatory module at FLC ultimately requires the m6A writer complex component, VIR.

Defective m6A is associated with disrupted RNA 3′ end formation

In addition to measuring RNA expression, we examined the impact of m6A loss on pre-mRNA processing. Detectable disruptions to splicing in vir-1 were modest. For example, using the DEX-Seq software tool (Anders et al., 2012) for analysis of annotated splice sites, we found only weak effect-size changes to cassette exons, retained introns or alternative donor/acceptor sites compared with the VIR-complemented line in our Illumina RNAseq data (Figure 7—figure supplement 1A). In contrast, a clear defect in RNA 3′ end formation in vir-1 was apparent. Using a Kolmogorov–Smirnov test, we identified 3,579 genes (FDR < 0.05, absolute change in position of > 13 nt) with an altered nanopore DRS 3′ position profile in the vir-1 mutant compared with the VIR-complemented line (Figure 7A). Of these, 3,008 displayed a shift to usage of more proximal poly(A) sites in vir-1: 59% of these genes also contain m6A sites detectable by nanopore DRS (compared with 32% of all expressed genes, p < 1 × 10−16; 61% were detectable by miCLIP compared with 35% of all expressed genes, p < 1 × 10−16) and correspond to locations of increased cleavage downstream of m6A sites in the vir-1 mutant (Figure 7B). A total of 571 genes showed increased transcriptional readthrough beyond the 3′ end in vir-1 (Figure 7A): 72% of these loci also contained nanopore-mapped m6A sites (p < 1 × 10−16; 71% were detectable by miCLIP, p < 1 × 10−16).

Figure 7 with 1 supplement see all
RNA methylation is closely linked to poly(A) site selection.

(A) Global shift to proximal 3′ end usage observed in the vir-1 mutant compared with the VIRc line. Histogram showing the distance in nucleotides between the most reduced and most increased 3′ end positions for genes in which the 3′ end profile is altered in vir-1 (detected with the Kolmogorov–Smirnov test, FDR < 0.05). A threshold of 13 nt was used to detect changes in the 3′ end position. (B) A shift to the use of more proximal poly(A) sites is observed in m6A containing transcripts in the vir-1 mutant. Histogram showing distance from the error site (n = 17,491 error sites) to upstream and downstream 3′ ends: orange, vir-1; green, VIRc. (C) Poly(A) site choice is phased with m6A at PRPL34 (AT1G29070). PRPL34 has four distinct poly(A) site clusters identified from VIRc reads. Analysis of the error profiles of these clusters at the closest upstream m6A motif indicates that proximally polyadenylated isoforms are less likely to be methylated. Green, VIRc 3’ coverage and poly(A) clusters; purple, nanopore differential error site tested; black, Araport11 annotation. Zoomed view shows stacked barplots of VIRc error profiles for the tested nanopore differential error site. Pairs of clusters with significantly different profiles (G-test, FDR < 0.05) are indicated with asterisks. (D) PRPL34, which is methylated in at least two positions in the 3’UTR (Figure 5F), also displays an increase in alternative polyadenylation at a proximal poly(A) site in the vir-1 mutant. The proximal poly(A) site is approximately 17 nt downstream of the closest methylated position. Purple, nanopore differential error sites; green, VIRc coverage; orange, vir-1 coverage; black, Araport11 annotation. (E) RNA methylation sites overlap with canonical and degenerate poly(A) site motifs. Line plot showing the enrichment of poly(A) site motifs (with an edit distance of one) around m6A motifs detected de novo under nanopore differential error sites: blue, the canonical PAS motif AAUAAA; grey, other similar motifs (excluding AAAAAA); orange, the UAUAAA motif; green, the AACAAA motif; yellow, the AAGAAA motif; black, the m6A motif (includes all sequences shown in Figure 5—figure supplement 1E). (F) Readthough events and chimeric RNAs are detected in vir-1. Green, nanopore DRS and Illumina RNAseq data for the VIRc line; orange, nanopore DRS and Illumina RNAseq data for the vir-1 mutant; grey and white, differentially expressed regions between vir-1 and VIRc detected using Illumina RNAseq data with DERfinder that are upregulated and downregulated, respectively; black, Araport11 annotation. Intergenic readthrough regions are highlighted by the purple dashed rectangle. (G) ALG3-GH3.9 chimeric RNAs transcribed in the vir-1 mutant. Chimeric readthrough RNAs were validated using RT-PCR. RT-PCR products were separated on agarose gel and stained with SYBR Safe. RT+ and RT-, cDNA produced in the presence and absence of reverse transcriptase, respectively; ladder, DNA size ladder; Water, no template control. TUBULIN was used as a control.

In order to examine the impact of m6A on 3’ end formation in an orthogonal manner, we analysed our parallel Illumina RNAseq datasets using the DaPars algorithm which estimates shifts in patterns of alternative polyadenylation (Xia et al., 2014). DaPars-mediated analysis also revealed a clear directional shift to proximal poly(A) site selection at thousands of loci in the vir-1 background, with 3408 genes displaying a decrease in distal poly(A) site usage, compared to only 166 genes displaying an increase in distal poly(A) site usage (Figure 7—figure supplement 1B). Consequently, DaPars analysis is consistent with our nanopore DRS study, indicating that the predominant disruption in 3’ end formation in the vir-1 mutant is a shift to proximal poly(A) site selection.

To address the relationship between m6A and 3’ end formation further, we asked if differential error sites were phased (associated) with specific poly(A) sites. We clustered reads from VIRc samples by their 3’ position using kernel density estimation. We then compared the error profiles of reads from different 3’ clusters using a G-test to identify whether 3’ end positions were associated with m6A levels (only positions which had differential error in the vir-1 vs VIRc comparison were tested). We identified 327 error sites which were phased with 3’ end position, overlapping 207 annotated genes. 149 (72%) of these also display a statistically detectable change in 3’ end position in the vir-1 mutant compared to VIRc. For example, at PRPL34 (AT1G29070), selection of the distal poly(A) sites pA2 and pA3 is associated with increased mismatches to the reference base (presumably resulting from higher levels of m6A) at an upstream error site, compared to the proximal poly(A) site pA1 (Figure 7C). Consistent with this, there is a quantitative shift to selection of the proximal poly(A) site pA1 in the vir-1 mutant (Figure 7D). These findings indicate that changes in 3’ end position in the vir-1 mutant may result directly from the loss of m6A.

In the vir-1 mutant, there is an increase in the relative abundance of 3’ end positions in the region approximately 0-20 nt downstream of differential error sites (Figure 7B). Notably, the distribution of the canonical poly(A) signal AAUAAA peaks 19 nt upstream of the cleavage site in Arabidopsis mRNA (related hexamers, differing at only a single position, show similar distribution patterns) (Sherstnev et al., 2012). We therefore compared the distribution of the poly(A) signal AAUAAA (and related hexamers differing at only a single position, except AAAAAA) with the motifs underlying differential error sites. This analysis reveals a close overlap of the peak of AAUAAA-like poly(A) signals with the m6A motif (Figure 7E). This finding may reflect the relative similarity of the most common Arabidopsis m6A motifs (such as AAACU and AAACA) to the poly(A) signal, and indicates that adenosine residues directly adjacent to, or within, the poly(A) signal itself may be methylated. Since loss of m6A activates the selection of upstream poly(A) signals, these findings are consistent with m6A at (or in the vicinity of) AAUAAA inhibiting recognition of the poly(A) signal for subsequent cleavage and polyadenylation.

We next investigated those events where downstream readthrough was detected in vir-1. The impacts of altered 3′ processing can be complex and have the potential to change the relative abundance of transcripts processed from the same gene but with different coding potential. For example, we detected increased readthrough of an intronic cleavage site in the Symplekin-like gene TANG1 (AT1G27595; Figure 7—figure supplement 1C) and increased readthrough and cryptic splicing at ALG3 (AT2G47760) that also results in chimeric RNA formation with the downstream gene GH3.9 (AT2G4G7750; Figure 7F). The existence of the ALG3-GH3.9 chimeric RNAs was supported by Illumina RNAseq (Figure 7F) and confirmed by RT-PCR, cloning and sequencing (Figure 7G). We detected 523 loci with increased levels of chimeric RNAs in vir-1 resulting from unterminated transcription proceeding into downstream genes on the same strand (Supplementary dataset 24). Chimeric RNAs were recently detected in mutants affecting other components of the Arabidopsis m6A writer complex, MTA and FIP37 (Pontier et al., 2019). However, only 33% of upstream genes forming the chimeric RNAs had detectable m6A sites in the VIR-complemented line with restored VIR activity. Consequently, these findings might be explained either by an m6A-independent role for VIR (or the writer complex) in 3′ end formation or an indirect effect on factors required for 3′ processing. m6A independent roles for the human m6A methyltransferases METTL3 (Lin et al., 2016) and METTL16 (Pendleton et al., 2017) have been found previously, and a role for the writer complex in controlling Arabidopsis RNA processing independent of m6A cannot be overlooked (Pendleton et al., 2017).

Overall, we conclude that loss of VIRILIZER-dependent m6A disrupts Arabidopsis mRNA 3’ formation. Although this includes readthrough at poly(A) sites with the formation of chimeric RNAs, the predominant impact is loss of suppression of proximal poly(A) sites.

Discussion

We used nanopore DRS to reveal the Arabidopsis transcriptome. Using orthogonal datasets as validation in each case, we could demonstrate the utility of nanopore DRS in revealing full-length mRNAs, mapping 5’ cap position, the position and estimated length of the poly(A) tail, patterns of alternative splicing, sites of internal RNA cleavage, and m6A modification of RNA. As a result, we could not only confirm previously established complexity in Arabidopsis RNA processing, but uncover novel events too. We found that spurious antisense signal was rare or absent, as were internal priming artefacts that complicate interpretation of authentic antisense RNAs and poly(A) sites respectively. Improvements in the utility of nanopore DRS likely depend upon more accurate base-calling, refined splitting of signal, increased read-depth and the development of software for splice-aware alignment of long sequencing reads.

The key technological question we addressed in this study, was whether nanopore DRS could be used to map m6A transcriptome-wide. We show that a simple and transparent statistical test of basecalling error rate can identify m6A. Available basecalling software depend upon recurrent neural networks trained with signal produced from sequencing RNA where modified bases are unlabelled and present in relatively low abundance. As a result, basecallers cannot accurately assign m6A. Since 5 RNA bases occupy a nanopore at any one time, m6A can affect the accuracy of basecalling of surrounding nucleotides. Formally then, an error-rate based approach has an approximate resolution of 9 nt. However, the methylated base can be interpreted using a position weight matrix of the consensus motif associated with error sites. Nanopore DRS technology and basecalling software continues to develop. In our experience, so long as the basecallers are not trained to directly identify m6A, and sequencing and analysis is performed on a common iteration of the nanopore DRS platform (the same generation of sequencing kit, pore and basecalling software), then an error rate approach is still able to map m6A. Since this is a statistical test, it depends upon an aggregation of reads and does not map m6A directly in single molecules. Consequently, the power to detect m6A will depend upon read depth.

Until recently, high throughput m6A mapping has depended on antibodies that recognise this modification. However, antibody-based approaches such as MeRIP/m6A-seq and miCLIP are limited in some respects. For example, they cannot readily define stoichiometry, do not necessarily detect m6A only, have unclear sensitivity, and the fragmentation of RNA required for library preparation may result in the underrepresentation of sites close to RNA 3’ ends. As a result, antibody-independent approaches to mapping m6A have been sought (Garcia-Campos et al., 2019; Liu et al., 2019; Meyer, 2019). The enzyme MaZF cleaves RNA at ACA in a methylation-dependent manner and is used in MAZTER-Seq to give single base resolution and insight into m6A stoichiometry (Garcia-Campos et al., 2019). Since MAZTER-Seq is limited to the detection of m6As in an ACA context (and spatial constraints) the majority of m6A sites in human or yeast cells cannot be detected (Garcia-Campos et al., 2019). Another recently developed nanopore DRS-based approach to map m6A, called EpiNano, also has context-specific limitations (Liu et al., 2019); nanopore DRS of in vitro transcribed RNAs was used to train a support vector machine model to detect m6A (Liu et al., 2019). However, EpiNano cannot currently call m6A in a k-mer with more than one adenosine, and hence 92% of m6A sites we identify in Arabidopsis are undetectable with this approach (Figure 5—figure supplement 1E) (Liu et al., 2019). Since nanopore DRS is a relatively recent development in RNA sequencing, we anticipate that alternative approaches will refine our ability to identify RNA modifications in single molecule reads (Lorenz et al., 2019).

The key biological question we set out to address, was whether an unbiased transcriptome-wide sequencing approach (that mapped m6A in the context of full-length mRNAs) could reveal the functional impacts of m6A on Arabidopsis gene expression. We found that disrupting m6A was associated with altered patterns of mRNA cleavage and polyadenylation. Although we detected readthrough at some poly(A) sites, the predominant effect was the activation of proximal poly(A) sites in the absence of m6A, indicating that m6A can suppress proximal poly(A) site selection. We found that the distribution of m6A sites and the canonical poly(A) signal AAUAAA (and related hexamers with one mismatch position) closely overlapped, indicating that adenosines within the AAUAAA signal itself could be methylated. In mammals, recognition of AAUAAA involves direct binding by the CPSF30 and WDR33 proteins of the cleavage and polyadenylation machinery (Chan et al., 2014; Schönemann et al., 2014). The Arabidopsis homologs (CPSF30 and FY respectively) show conservation of amino acid residues involved in recognizing AAUAAA (Clerici et al., 2018; Sun et al., 2018). In Arabidopsis, the conserved, constitutive role for CPSF30 and FY in recognising the poly(A) signal is associated with the evolution of features for specific gene regulation. For example, FY has C- terminal proline-containing motifs that mediate an interaction with the plant-specific RNA binding protein FCA, conferring upon FY a specialised function in controlling the timing of Arabidopsis flower development (Simpson et al., 2003). Meanwhile, Arabidopsis (like other plants and the apicomplexa) expresses a CPSF30 isoform (regulated by alternative polyadenylation in Arabidopsis) that encodes a YT521-B homology (YTH) domain with the potential to bind and read m6A (Stevens et al., 2018). We speculate that m6A at AAUAAA either directly inhibits binding of CPSF30 and FY to the poly(A) signal, or that binding of m6A by YTH domain containing proteins (possibly CPSF30-YTH itself) blocks canonical recognition by CPSF30 and FY, and hence assembly of an active cleavage and polyadenylation complex. Future experiments will dissect the mechanisms involved.

The impact of individual m6A sites on Arabidopsis biology is likely to be context-specific. However, it is clear that m6A is almost exclusively found in mRNA 3’UTRs (in steady-state analyses). Consequently, the effect of m6A is likely to be mediated mostly through mRNA alternative polyadenylation, stability, translatability and the formation of specific protein complexes (Mayr, 2019). We (and others) find that transcripts modified with m6A are predominantly stabilised in Arabidopsis (Anderson et al., 2018). An explanation for this phenomenon is that m6A protects against upstream endonucleolytic mRNA cleavage (Anderson et al., 2018). We report here that m6A predominantly inhibits the selection of downstream proximal cleavage and polyadenylation sites. These distinct phenotypes may reflect temporal differences in how m6A affects mRNA fate. During co-transcriptional processing, m6A may influence the site of cleavage and polyadenylation, but subsequently affect mRNA phase separation and fate in the cytoplasm (Ries et al., 2019). Arabidopsis has 13 YTH domain containing proteins with the potential to bind and read m6A (Reichel et al., 2019). Understanding the complexity of the roles that they play will help explain the context-specific functions of m6A. The combination of m6A mapping and RNA complexity revealed by nanopore DRS, provides an additional approach to address these questions.

Concluding remarks

We have shown that nanopore DRS has the potential to refine multiple features of Arabidopsis genome annotation and to generate the clearest map to date of m6A locations, despite the genome sequence being available since 2000 (Kaul and Arabidopsis Genome Initiative, 2000). Modern agriculture is dominated by a handful of intensely researched crops (United Nations, Department of Economic and Social Affairs, Population Division, 2017), but to diversify global food supply, enhance agricultural productivity and tackle malnutrition there is a need to focus on crops utilized in rural societies that have received little attention for crop improvement (Chang, 2018). Based on our experience with Arabidopsis, we anticipate that the combination of nanopore DRS and other sequencing approaches will improve genome annotation. Consistent with this, we recently applied nanopore DRS to refine the annotation of water yam (Dioscorea alata), an African orphan crop. Indeed, we are moving into an era where thousands of genome sequences are available and programmes such as the Earth BioGenome Project aim to sequence all eukaryotic life on Earth (Lewin et al., 2018). However, genome sequences provide only part of the puzzle: annotating what they encode will be essential for us to fully utilize this information.

Materials and methods

Plants

Plant material and growth conditions

Request a detailed protocol

Wild-type A. thaliana accession Col-0 was obtained from Nottingham Arabidopsis Stock Centre. The vir-1 and VIR-complemented (VIR::GFP-VIR) lines (Růžička et al., 2017) were provided by K. Růžička, Brno. The hen2–2 (Gabi_774HO7) mutant (Lange et al., 2014) was provided by D. Gagliardi, Strasbourg. Seeds were sown on MS10 medium plates, stratified at 4°C for 2 days, germinated in a controlled environment at 22°C under 16 hr light/8 hr dark conditions and harvested 14 days after transfer to 22°C.

Clock phenotype analysis

Request a detailed protocol

Clock phenotype experiments were performed as previously described by measuring delayed fluorescence as a circadian output (Gould et al., 2009). Briefly, plants were grown in 12‐h light/12‐h dark cycles at 22°C and 80 μmol m−2 sec−1 light for 9 days. Next, delayed fluorescence measurements were recorded every hour for 6 days at constant temperature (22°C) and constant light (20 µmol m−2 sec−1 red light and 20 µmol m−2 sec−1 blue light mix). Fast Fourier Transform (FFT) non-linear least-squares fitting to estimate period length was conducted using Biodare (Moore et al., 2014).

RNA

RNA isolation

Request a detailed protocol

Total RNA was isolated using RNeasy Plant Mini kit (Qiagen) and treated with TURBO DNase (Thermo Fisher Scientific). The total RNA concentration was measured using a Qubit 1.0 Fluorometer and Qubit RNA BR Assay Kit (Thermo Fisher Scientific), and RNA quality and integrity were assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific) and Agilent 2200 TapeStation System (Agilent).

mGFP in vitro transcription

Request a detailed protocol

The mGFP coding sequence was amplified using CloneAmp HiFi PCR Premix (Clontech) and a forward primer containing the T7 promoter sequence (Merck; Supplementary file 8). The PCR product was purified using GeneJET Gel Extraction (Thermo Fisher Scientific) and DNA Cleanup Micro (Thermo Fisher Scientific) kits, according to the manufacturer’s instructions. mGFP was in vitro transcribed using a mMESSAGE mMACHINE T7 ULTRA Transcription Kit (Thermo Fisher Scientific) with and without the anti-reverse cap analogue, according to the manufacturer’s instructions. mGFP transcripts were treated with TURBO DNase, polyA-tailed using Escherichia coli poly(A) Polymerase (E-PAP) and ATP (Thermo Fisher Scientific) and recovered using a MEGAclear kit (Thermo Fisher Scientific) according to the manufacturer’s instructions. The quantity of mGFP mRNAs was measured using a Qubit 1.0 Fluorometer (as described above), and the quality and integrity was checked using the NanoDrop 2000 spectrophotometer and agarose-gel electrophoresis. In vitro capped and non-capped mGFP mRNAs were used to prepare the libraries for DRS using nanopores.

MALAT1 synthetic RNA

Request a detailed protocol

A matched set of synthetic RNAs differing only in a single m6A site were chemically synthesized by Dharmacon (Horizon Discovery). The sequence corresponds to a 60-nt portion of a human lncRNA MALAT1 5’-AAGUAAUUCAAGAUCAAGAGUAAUUACCAACUUAAUGUUUUUGCAUUGGACUUUGAGUUA(A50)−3’ (Supplementary file 8) with a previously mapped m6A (highlighted in bold) (Liu et al., 2013). The 2’-ACE protected synthetic RNAs were deprotected using 100 mM acetic acid adjusted to a pH of 3.4–3.8 using TEMED, according to the manufacturer’s instructions. The quantity of deprotected synthetic RNAs was checked using the NanoDrop 2000 spectrophotometer. 1 μg of the MALAT1 m6A and MALAT1 control synthetic RNAs were used to prepare the libraries for DRS using nanopores.

RT-PCR and RT-qPCR

Request a detailed protocol

Total RNA was reverse transcribed using SuperScript III polymerase or SuperScript IV VILO Master Mix (Thermo Fisher Scientific) according to the manufacturer’s protocol. For RT-PCR, reactions were performed using the Advantage 2 Polymerase Mix (Clontech) using the primers (Merck) listed in Supplementary file 8. PCR products were gel purified using GeneJET Gel Extraction and DNA Cleanup Micro kits (Thermo Fisher Scientific), cloned into the pGEM T-Easy vector (Promega; according to the manufacturer’s instruction) and sequenced. For RT-qPCR, reactions were carried out using the SYBR Green I (Qiagen) mix with the primers (Merck) listed in Supplementary file 8, following manufacturer’s instructions.

Illumina RNA sequencing

Preparation of libraries for Illumina RNA sequencing

Request a detailed protocol

Illumina RNA sequencing libraries from purified mRNA were prepared and sequenced by the Centre for Genomic Research at University of Liverpool using the NEBNext Ultra Directional RNA Library Prep Kit for Illumina (New England Biolabs). Paired-end sequencing with a read length of 150 bp was carried out on an Illumina HiSeq 4000. Illumina RNA libraries from ribosome-depleted RNA were prepared using the TruSeq Stranded Total RNA with Ribo-Zero Plant kit (Illumina). Paired-end sequencing with a read length of 100 bp was carried out on an Illumina HiSeq 2000 at the Genomic Sequencing Unit of the University of Dundee. ERCC RNA Spike-In mixes (Thermo Fisher Scientific) (Jiang et al., 2011; Lee et al., 2016) were included in each of the libraries using concentrations advised by the manufacturer.

Mapping of Illumina RNA sequencing data

Request a detailed protocol

Reads were mapped to the TAIR10 reference genome with Araport11 reference annotation using STAR version 2.6.1 (Dobin et al., 2013), a maximum multimapping rate of 5, a minimum splice junction overhang of 8 nt (three nt for junctions in the Araport11 reference), a maximum of five mismatches per read and intron length boundaries of 60–10,000 nt.

Differential gene expression analysis using Illumina RNA sequencing data

Request a detailed protocol

Transcript-level counts for Illumina RNA sequencing reads were estimated by pseudoalignment with Salmon version 0.11.2 (Patro et al., 2017). Counts were aggregated to gene level using tximport (Soneson et al., 2016) and differential gene expression analyses for vir-1 mutant vs wild type and vir-1 mutant vs the VIR-complemented line were conducted in R version 3.5 using edgeR version 3.24.3 (Robinson et al., 2010).

Differentially expressed region analysis using Illumina RNA sequencing data

Request a detailed protocol

Mapped read pairs originating from the forward and reverse strands were separated and coverage tracks were generated using samtools version 1.9 (Li et al., 2009). Coverage tracks were then used as input for DERfinder version 1.16.1 (Collado-Torres et al., 2017). Expressed regions were identified using a minimum coverage of 10 reads, and differential expression between the vir-1 and VIR-complemented lines was assessed using the analyseChr method with 50 permutations.

Differential exon usage analysis using Illumina RNA sequencing data

Request a detailed protocol

Annotated gene models from Araport11 were divided into transcript chunks (i.e. contiguous regions within which each base is present in the same set of transcript models). Read counts for each chunk were generated using bedtools version 2.27.1 (Quinlan and Hall, 2010) intersect in count mode. Chunk counts were then processed using DEXseq version 1.28.3 (Anders et al., 2012) to identify differentially expressed chunks between vir-1 and VIR-complemented lines, using an absolute log-fold-change threshold of 1 and an FDR threshold of 0.05. Chunks were annotated as 5′ variation if they included a start site of any transcript and as 3′ variation if they contained a termination site. Chunks representing overhangs from alternative donor or acceptor sites were also classified separately. Internal exons were subclassified as a cassette exon if they could be wholly contained within any intron.

Differential polyadenylation analysis using DaPars and illumina RNA sequencing data

Request a detailed protocol

We used the DaPars (Xia et al., 2014) algorithm to estimate shifts in patterns of alternative polyadenylation with Illumina RNAseq data. Annotated gene models from Araport11 were flattened to give one model per gene. Exonic annotation was given priority over intronic or intergenic annotation and CDS annotation was given priority over UTR annotation. Unstranded bigwig coverage files were prepared for each Illumina RNAseq sample using bedtools genomecov 2.27.1 (Quinlan and Hall, 2010), as DaPars does not support stranded RNAseq data. DaPars version 0.9.1 (Xia et al., 2014) was then run using the flattened Araport11 annotation as a reference. Transcripts from genes which were differentially polyadenylated in the vir-1 condition were identified using an FDR threshold of 0.05.

Nanopore DRS

Preparation of libraries for direct RNA sequencing using nanopores

Request a detailed protocol

mRNA was isolated from approximately 75 μg of total RNA using the Dynabeads mRNA purification kit (Thermo Fisher Scientific) following the manufacturer’s instructions. The quality and quantity of mRNA was assessed using the NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific). Nanopore libraries were prepared from 1 μg poly(A)+ RNA combined with 1 μl undiluted ERCC RNA Spike-In mix (Thermo Fisher Scientific) using the nanopore DRS Kit (SQK-RNA001, Oxford Nanopore Technologies) according to manufacturer’s instructions. The poly(T) adapter was ligated to the mRNA using T4 DNA ligase (New England Biolabs) in the Quick Ligase reaction buffer (New England Biolabs) for 15 min at room temperature. First-strand cDNA was synthesized by SuperScript III Reverse Transcriptase (Thermo Fisher Scientific) using the oligo(dT) adapter. The RNA–cDNA hybrid was then purified using Agencourt RNAClean XP magnetic beads (Beckman Coulter). The sequencing adapter was ligated to the mRNA using T4 DNA ligase (New England Biolabs) in the Quick Ligase reaction buffer (New England Biolabs) for 15 min at room temperature followed by a second purification step using Agencourt beads (as described above). Libraries were loaded onto R9.4 SpotON Flow Cells (Oxford Nanopore Technologies) and sequenced using a 48 hr run time.

To incorporate cap-dependent ligation of a biotinylated 5′ adapter RNA, the following modifications were introduced into the library preparation protocol. First, 4 µg mRNA was de-phosphorylated by Calf Intestinal Alkaline Phosphatase (Thermo Fisher Scientific) and the 5′ cap was removed by Cap-Clip Acid Pyrophosphatase (Cambio) according to the manufacturer’s instructions. Next, the 5′ adapter RNA (5’-CGACUGGAGCACGAGGACACUGACAUGGACUGAAGGAGUAGAAA-3’) biotinylated at the 5′ end (Integrated DNA Technologies; Supplementary file 8) was ligated to dephosphorylated, de-capped mRNA using T4 RNA ligase I (New England Biolabs) and mRNA was purified using Dynabeads MyOne Streptavidin C1 beads (Thermo Fisher Scientific) according to the manufacturer’s instructions. mRNA was assessed for quality and quantity using the NanoDrop 2000 spectrophotometer and used for nanopore DRS library preparation (as described above).

Processing of nanopore DRS data

Request a detailed protocol

Reads were basecalled with Guppy version 2.3.1 (Oxford Nanopore Technologies) using default RNA parameters and converted from RNA to DNA fastq using seqkit version 0.10.0 (Shen et al., 2016b). Reads were aligned to the TAIR10 A. thaliana genome (Kaul and Arabidopsis Genome Initiative, 2000) and ERCC RNA Spike-In sequences (Jiang et al., 2011; Lee et al., 2016) using minimap2 version 2.8 (Li, 2018) in spliced mapping mode using a kmer size of 14 and a maximum intron size of 10,000 nt. Sequence Alignment/Map (SAM) and BAM file manipulations were performed using samtools version 1.9 (Li et al., 2009). Variation in mapping rates is likely caused by differences in the amount of RNA loaded onto the flowcell. For example, the cap-capture procedure, resulted in less input RNA for library preparation. This has two consequences: first, it results in lower pore occupancy during sequencing, resulting in many apparently low quality ‘reads’ which are actually noise signal from empty pores (Mojarro et al., 2018; Pontefract et al., 2018); second, an increased number of reads map to Enolase II, the internal standard control RNA (Oxford Nanopore Technologies control strand) added to each library. For MALAT1 synthetic RNAs, which were only 110 nt in length including a 50 nt poly(A) tail, minimap2 parameters were relaxed to allow lower mapping scores: kmer size (-k) was reduced to 10; the minimum number of minimisers per chain (-n), to 2; the minimum chaining score (-m), to 12; and the minimum dynamic programming alignment score (-s), to 8.

Proovread version 2.14.1 (Hackl et al., 2014) was used to correct errors in the nanopore DRS reads (Depledge et al., 2019). Each nanopore DRS replicate was split into 200 chunks for parallel processing. Each chunk was corrected using four samples of Illumina poly(A) RNAseq data selected randomly from the 36 Illumina files (six biological replicates sequenced across six lanes). Illumina reads 1 and 2 were merged into fragments using FLASh version 1.2.11 (Magoč and Salzberg, 2011). Unjoined pairs were discarded. Error correction with proovread was conducted in sampling-free mode using a minimum nanopore read length of 50 nt. Since both the Illumina RNAseq and nanopore DRS datasets were strand specific, proovread was modified to prevent opposite strand mapping between the datasets. Corrected reads were then mapped to the Araport11 reference genome using minimap2 (as described above). All figures showing gene tracks with nanopore DRS reads use error-corrected reads, visualized using the Integrative Genomics Viewer (Robinson et al., 2011).

Error rate analysis using nanopore DRS data

Request a detailed protocol

Error rate analysis of aligned reads was conducted on ERCC RNA Spike-In mix controls using pysam version 0.15.2 (Heger et al., 2014) for BAM file parsing. Matches, mismatches, insertions and deletions relative to the reference were extracted from the cs tag (a more informative version of CIGAR string, output by minimap2) and normalised by the aligned length of the read. Reference bases and mismatch bases per position were also recorded and used to assess the frequency of each substitution and indel type by reference base.

Over-splitting analysis of nanopore DRS data

Request a detailed protocol

To identify read pairs resulting from over-splitting of the signal originating from a single RNA molecule, the sequencing summary files produced by Guppy were parsed for sequencing time and channel identifier and then used to identify pairs of consecutively sequenced reads. Genomic locations of reads were parsed from minimap2 mappings, and consecutively sequenced reads with adjacent alignment within a genomic distance of –10 nt to 1000 nt were identified. Samples sequenced before or during May 2018 had very low levels of over-splitting (between 0.01% and 0.05% of reads) compared with those sequenced in September 2018 onwards (between 0.8% and 1.5% of reads).

Analysis of the potential for internal priming in nanopore DRS data

Request a detailed protocol

To determine whether internal priming caused by the RT step can occur in nanopore data, the location of oligo(A) hexamers within Arabidopsis coding sequence (CDS) regions (Araport11) was determined and reads that terminated within a 20 nt window of each hexamer were counted. Of the 10,116 CDS oligo(A) runs, 160 (1.58%) had at least one supporting read in one Col-0 nanopore dataset. Of these, 137 were supported by only one replicate, and only four were supported by all four biological replicates. In total, 66 (41%) occurred in Araport11-annotated terminal exons, suggesting that they may be genuine sites of 3’ end formation.

Poly (A) length estimation using nanopore DRS data

Request a detailed protocol

Poly (A) tail length estimations were produced using nanopolish version 0.11.0 (Workman et al., 2019) and added as tags to bam files using pysam version 0.15.2 (Heger et al., 2014). Per gene length distributions were then produced by identifying reads overlapping each gene in the Araport11 annotation by > 20% of their aligned length, and genes with significant changes in length distribution in the vir-1 mutant compared with the VIR-complemented line were identified using a Kolmogorov–Smirnov test. p-values were adjusted for multiple testing using Benjamini-Hochberg correction.

3′ end analyses of nanopore and Helicos reads

Request a detailed protocol

Helicos data were prepared as previously described (Duc et al., 2013; Schurch et al., 2014). Positions with three or more supporting reads were considered to be peaks of nanopore or Helicos 3′ ends. The distance between each nanopore peak and the nearest Helicos peak was then determined. In all, 37% of nanopore peaks occurred at the same position as a Helicos peak, and the standard deviation in distance was 12.5 nt. To determine the percentage of nanopore DRS 3′ ends mapping within annotated genic features, transcripts were first flattened into a single record per gene. Exonic annotation was given priority over intronic or intergenic annotation and CDS annotation was given priority over UTR annotation. Reads were assigned to genes if they overlapped them by > 20% of their aligned length, and the annotated feature type of the 3′ end position was determined. Counts were generated both for all reads and for unique positions per gene.

Isoform collapsing of nanopore DRS data

Request a detailed protocol

Error-corrected full-length alignments were collapsed into clusters of reads with identical sets of introns. These clusters were then subdivided by 3′ end location by using a Gaussian kernel with sigma of 100 to find local minima between read ends, which were used as cut points to separate clusters. The read with the longest aligned length in each cluster was used as the representative in the figure.

Splicing analysis of nanopore DRS and Illumina RNAseq data

Request a detailed protocol

Splice junction locations, their flanking sequences and the read counts supporting them were extracted from Illumina RNA sequencing, nanopore DRS and nanopore error-corrected DRS reads using pysam version 0.14 (Heger et al., 2014), as well as from Araport11 (Cheng et al., 2017) and AtRTD2 (Zhang et al., 2017) reference annotations. Splice junctions at the same position but on the opposite strand were counted independently. Junctions were classified by their most likely snRNP machinery using Biopython version 1.71 (Cock et al., 2009), with position weight matrices as previously calculated (Sheth et al., 2006). Position weight matrices were scored against the sequence –three nt to +10 nt of the donor site and –14 nt to +3 nt of the acceptor site. Position weight matrix scores greater than zero indicate a match to the motif, while scores of around zero, or negative scores, indicate background frequencies or deviation from the motif. Positive scores were normalized into the range 50–100 as done previously (Sheth et al., 2006). Junctions with U12 donor scores of > 75 and U12 acceptor scores of > 65 were classified as U12 junctions, while junctions with U2 donor and acceptor scores of > 60 were classified as U2, as done previously (Zhang et al., 2017). Junctions were further categorized as canonical or non-canonical based on the presence or absence, respectively, of GT/AG intron border sequences. For isoform analysis, linked splices from the same read were extracted from full-length nanopore error-corrected reads and counted to create unique sets of splice junctions. Intronless reads were not counted. UpSet plots were generated in Python 3.6 using the package upsetplot (Nothman, 2018).

Validation of novel splice sites

Request a detailed protocol

To validate novel splice junctions detected in nanopore DRS, five splice sites out of the 20 most highly expressed splice sites were selected for further validation; three of the five selected splice sites were successfully amplified in RT-PCR followed by Sanger sequencing (described above).

Identification of potential novel NMD targets

Request a detailed protocol

Error corrected reads from Col-0 samples were filtered to retain only those that overlapped both annotated starts and termination codons from the Araport11 reference annotation (Cheng et al., 2017). These reads were then assessed to identify those where the length of the open reading frame was not divisible by three, indicating a splicing event causing a frameshift compared to the reference. These reads were then further filtered to remove examples without splicing support from Col-0 poly(A) Illumina RNAseq datasets. Novel examples were selected by filtering out splice patterns already present in the Araport11 or AtRTD2 reference annotations.

5′ adapter detection analyses using nanopore DRS data

Request a detailed protocol

To produce positive and negative examples of 5′ adapter-containing sequences, 5′ soft-clipped regions were extracted from aligned reads for the Col-0 replicate one datasets (with and without adapter ligation) using pysam (Heger et al., 2014). These soft-clipped sequences were then searched for the presence of the adapter sequence using BLASTN version 2.7.1 (Camacho et al., 2009). Two rules were initially applied to filter BLASTN results: a match of 10 nt or more to the 44 nt adapter, and an E value of < 100. Reads from the adapter-containing dataset that failed one or both criteria were used as negative training examples. A final rule requiring the match to the adapter sequence to occur directly adjacent to the aligned read was also applied. Reads from the adapter-containing dataset that passed all three rules were used as the positive training set. When comparing the ratio of positive to negative examples between datasets containing the adapter and those generated from the same tissue but without the adapter, we found that these three rules gave a signal-to-noise ratio of > 5000 (Supplementary file 3).

In all, 72,083 positive and 123,739 negative training examples from Col-0 tissue replicate one were collected to train the neural network. We then estimated the amount of raw signal from the 5′ end of the squiggle that was required on average to capture the 5′ adapter. To do this, we used nanopolish eventalign version 0.11.0 (Loman et al., 2015) to identify the interval in the raw read corresponding to the mRNA alignment to the reference in the positive examples of 5′ adapter-containing sequences. Since the adapter can be identified immediately adjacent to the alignment in sequence space for these reads, the signal after the event alignment should correspond to the signal originating from the adapter. The median length of these signals was 1441 points, and 96% of the signals were < 3000 points. Therefore, we used a window size of 3000 to make predictions.

The model was trained in Python 3.6 using Keras version 2.2.4 with the Tensorflow version 1.10.0 backend (Chollet, 2018; Abadi, 2016). A ResNet-style architecture was used (Kaiming He et al., 2015), composed of eight residual blocks containing two convolutional layers of kernel size five and a shortcut convolution with kernel size 1. Down-sampling using maximum pooling layers with a stride of 2 was used between each residual block. A penultimate densely connected layer of size 16 was used, with training dropout of 0.5. Input signals were standardized by median absolute deviation scaling across the whole read before the final 3000 points were taken, and the negative samples were augmented by addition of random internal signals from reads and pure Gaussian, multi-Gaussian and perlin noise signals (Wick et al., 2018). The whole dataset was also augmented on the fly during training by the addition of Gaussian noise with a standard deviation of 0.1. Models were trained for a maximum of 100 epochs (batch size of 128, 100 batches per epoch, positive and negative examples sampled in a 1:1 ratio) using the RMSprop optimizer with an initial learning rate of 0.001, which was reduced by a factor of 10 after three epochs with no reduction in validation loss. Early stopping was used after five epochs with no reduction in validation loss. We found that a number of negative training examples from the ends of reads, but not from internal positions, were likely to be incorrectly labelled by the BLASTN method, because the model predicted them to contain adapters. We therefore filtered these to clean the training data, before repeating the training process. Model performance was evaluated using five-fold cross validation and by testing on independently generated datasets from Col-0 replicate 2, produced with and without the adapter ligation protocol (Figure 3—figure supplement 1B,C) (Chollet, 2018; Abadi, 2016).

To evaluate the reduction in 3′ bias of adapter-ligated datasets, we used Araport11 exon annotations to produce per base coverage for each gene in the Col-0 replicate two dataset. Coverage was generated separately for reads predicted to contain adapters and for those predicted not to contain adapters. Leading zeros at the extreme 5′ and 3′ ends of genes were assumed to be caused by mis-annotation of UTRs and so were trimmed. The quartile coefficient of variation (interquartile range/median) was then used as a robust measure of variation in coverage across each gene.

Two orthogonal approaches were used to validate the 5′ ends of adapter-ligated reads. First, full-length cDNA clone sequences were downloaded from RIKEN RAFL (Arabidopsis full-length cDNA clones) (Seki et al., 2002). These were mapped with minimap2 (Li, 2018) in spliced mode. The distance from each nanopore alignment 5′ end to the nearest RIKEN RAFL alignment (Seki et al., 2002) 5′ end was calculated using bedtools (Quinlan and Hall, 2010). The amount of 5′ end sequence that is rescued when 5′ adapters are used was estimated by identifying the largest peak in 5′ end locations per gene in the absence of adapter, and then measuring how this peak shifted using reads predicted to contain adapters. Second, Illumina RNAseq reads derived from 5’ tags using nanoPARE were downloaded from GSE112869 (Schon et al., 2018). nanoPARE reads and matched control RNAseq reads were mapped to TAIR10 with Araport11 reference annotation (Cheng et al., 2017) using STAR version 2.6.1 (Dobin et al., 2013), a maximum multimapping rate of 5, a minimum splice junction overhang of 8 nt (three nt for junctions in the Araport11 reference), a maximum of five mismatches per read, and intron length boundaries of 60–10,000 nt. Mapped reads were filtered into those representing ‘capped’ and ‘uncapped’ 5’ ends by the presence or absence of soft-clipped untemplated Gs at the 5’ end, which are added by reverse transcriptase in a cap dependent manner (Schon et al., 2018). 5′ coverage and matched RNAseq 5′ coverage tracks were generated using bedtools version 2.27.1 (Quinlan and Hall, 2010). nanoPARE capped read peaks were called at single nucleotide resolution with Piranha version 1.2.1 (Uren et al., 2012) and relaxed p-value thresholds of 0.5. Reproducible peaks across pairwise combinations of the three replicates were identified by irreproducible discovery rate (IDR) analysis using Python package idr version 2.0.3 with an IDR threshold of 0.05 (Li et al., 2011). The final set of peaks was identified by pooling the three replicates, re-analysing using Piranha, ranking the peaks by FDR and selecting the top N peaks, where N is the smallest number of reproducible peaks discovered by pairwise comparisons of the three replicates. This yielded 47,986 unique single nucleotide resolution nanoPARE peaks. The distance from each nanopore alignment 5′ end to the nearest nanoPARE peak was calculated using bedtools (Quinlan and Hall, 2010).

Identification of miRNA cleavage products in nanopore DRS data

Request a detailed protocol

Nanopore miRNA cleavage product identification was adapted from the EndCut method used previously (Schon et al., 2018). Known miRNA and siRNA sequences from Arabidopsis thaliana were downloaded from https://github.com/Gregor-Mendel-Institute/nanoPARE. miRNA target sites were predicted using GSTAr (https://github.com/MikeAxtell/GSTAr). miRNAs were also shuffled 1000 times and targets were predicted for shuffled sequences to model random chance predictions. P values for Allen scores of miRNA-target interactions were produced by permutation test, using the Allen scores of targets predicted from randomly shuffled versions of the miRNA as a null distribution. P values for the enrichment of nanopore DRS 5’ ends at the target cleavage sites were also produced by permutation test, comparing the log ratio of coverage in a 20 nt window upstream and downstream of each predicted cleavage site for an miRNA, to the coverage upstream and downstream of predicted cleavage sites of the shuffled miRNA. A 20 nt window was used instead of measuring 5' ends precisely at the cleavage site, because as we demonstrate in this study, approximately 11 nt are lost from the 5' end of RNA sequencing reads in standard nanopore DRS. Predictions which fell within 5’UTRs or across splice junctions were filtered out to reduce false positives. The P values derived from nanopore coverage and Allen scores were combined using the fisher method and corrected for multiple testing using the Benjamini-Hochberg method.

Differential error site analysis using nanopore DRS data

Request a detailed protocol

To detect sites of VIRILIZER-dependent m6A RNA modifications, we developed scripts to test changes in per base error profiles of aligned reads. Pileup columns for each position with coverage of > 10 reads were generated using pysam (Heger et al., 2014) and reads in each column were categorized as either A, C, G, U or indel. The relative proportions of each category were counted. Counts from replicates of the same experimental condition were aggregated and a 2 × 5 contingency table was produced for each base comparing vir-1 and VIR-complemented lines. A G-test was performed to identify bases with significantly altered error profiles. For bases with a p-value of < 0.05, G-tests for homogeneity between replicates of the same condition were then performed. Bases where the sum of the G statistic for homogeneity tests was greater than the G statistic for the vir-1 and VIR-complemented line comparison were filtered. Multiple testing correction was carried out using the Benjamini-Hochberg method, and an FDR threshold of 0.05 was used. The log2 fold change in mismatch to match ratio (compared with the reference base) between the vir-1 and VIR-complemented lines was calculated using Haldane correction for zero counts. Bases that had a log fold change of > 1 were considered to have a reduced error rate in the vir-1 mutant.

To identify motifs enriched at sites with a reduced error rate, reduced error rate sites were increased in size by five nt in each direction and overlapping sites were merged using bedtools version 2.27.1 (Quinlan and Hall, 2010). Sequences corresponding to these sites were extracted from the TAIR10 reference and over-represented motifs were detected in the sequences using MEME version 5.0.2 (Bailey et al., 2009), run in zero or one occurrence mode with a motif size range of 5–7 and a minimum requirement of 100 sequence matches. The presence of these motifs at error sites was then detected using FIMO version 5.0.2 (Grant et al., 2011). A relaxed FDR threshold of 0.1 was used with FIMO to capture more degenerate motifs matching the m6A consensus.

To determine the effect of sequencing depth on the ability to call nanopore differential error sites, we performed a bootstrapped sub-sampling analysis. We conducted this analysis using MALAT1 synthetic RNA sequencing reads, and four differential error sites derived from three genes in the Arabidopsis data. 500 bootstrapped subsamples, at depths between 10 and 250 reads per condition, were generated for each gene and each base in the gene was then tested using differential error rate analysis as described above. Detection was considered successful if any of the differential error sites at the m6A motif which could be detected with all reads, was significant in the subsampled reads (FDR threshold 0.05). For the MALAT1 dataset, we also performed analyses on the effect of m6A stoichiometry on the ability to detect sites using differential error. For these analyses, reads from the m6A positive sample were replaced with reads from the m6A negative sample at rates of 0%, 25%, 50% and 75%, to simulate methylation levels of 100%, 75%, 50% and 25% respectively.

Analysis of phasing of nanopore DRS error sites with poly(A) site choice

Request a detailed protocol

To identify error sites which were phased with 3’ ends, we used error sites identified in the vir-1 vs VIRc comparison. We identified reads overlapping each error site in the VIRc samples using pysam version 0.15.2 (Heger et al., 2014), and clustered them by 3’ end position using local minima detected after gaussian smoothing with a sigma of 13 nt. Clusters containing fewer than 10 reads in total were discarded. Error sites where there was only one cluster of reads after filtering were not assessed further. Basecall profiles for each 3’ cluster at the overlapping error site were then produced and tabulated into an N × 5 contingency table, where N is the number of 3’ clusters, and five is the basecall type, that is A, C, G, U, or indel. Error sites with differences in basecall profile in different 3’ clusters were detected using a G-test. P values were corrected for multiple testing using the Benjamini-Hochberg method. For PRPL34 (AT1G29070), which is featured as an example in Figure 7D, pairwise comparisons of poly(A) cluster error profiles were performed with G-tests and corrected for multiple testing using the Benjamini-Hochberg method.

Differential gene expression analysis using nanopore DRS data

Request a detailed protocol

Gene level counts were produced for each nanopore DRS replicate using featureCounts version 1.6.3 (Liao et al., 2013) in long-read mode with strand-specific counting. Differential expression analysis between the vir-1 and VIR-complemented lines was then performed in R version 3.5 using edgeR version 3.24.3 (Robinson et al., 2010).

Identification of alternative 3′ end positions and chimeric RNA using nanopore DRS data

Request a detailed protocol

Genes with differential 3′ end usage were identified by producing 3′ profiles of reads which overlapped with each annotated gene locus by > 20%. These profiles were then compared between the vir-1 and VIR-complemented lines using a Kolmogorov–Smirnov test to identify changes. Multiple testing correction was performed using the Benjamini-Hochberg method. To approximately identify the direction and distance of the change, the normalized single base level histograms of the VIR-complemented line profile was subtracted from that of the mutant profile, and the minimum and maximum points in the difference profile were identified. These represent the sites of most reduced and increased relative usage, respectively. Results were filtered for an FDR of < 0.05 and absolute change of site > 13 nt (the measured error range of nanopore DRS 3′ end alignment). The presence of m6A modifications at genes with differential 3′ end usage was assessed using bedtools intersect (Quinlan and Hall, 2010), and significant enrichment of m6A at these genes was tested using a hypergeometric test (using all expressed genes as the background population).

To identify genes with a significant increase in chimeras in the vir-1 mutant, we used Araport11 annotation (Cheng et al., 2017) to identify reads that overlapped with multiple adjacent gene loci by more than 20% of their aligned length (i.e. chimeric reads) and those originating from a single locus (i.e. non-chimeric reads). To reduce false positives caused by reads mapping across tandem duplicated loci, reads mapping to genes annotated in PTGBase (Yu et al., 2015) were filtered out. Chimeric reads were considered to originate from the most upstream gene with which they overlapped. We pooled reads from replicates for each experimental condition and used 50 bootstrapped samples (75% of the total data without replacement) to estimate the ratio of chimeric to non-chimeric reads at each gene in each condition. Haldane correction for zero counts was applied. The distributions of chimeric to non-chimeric ratios in the vir-1 and VIR-complemented lines were tested using a Kolmogorov–Smirnov test to detect loci with altered chimera production. All possible pairwise combinations of VIR-complemented and vir-1 bootstraps were then compared to produce a distribution of estimates of change in the chimeric to non-chimeric ratio in the vir-1 mutant. Loci that had more than one chimeric read in vir-1, demonstrated at least a two-fold increase in the chimeric read ratio in > 50% of bootstrap comparisons and were significantly changed at a FDR of < 0.05 were considered to be sites of increased chimeric RNA formation in the vir-1 mutant.

Poly(A) site motif enrichment at nanopore DRS differential error motifs

Request a detailed protocol

To assess the relationship between m6A and the poly(A) signal motif, we counted the occurrence of the sequence AAUAAA and related sequences (with an edit distance of one, excluding AAAAAA) in a 100 nt window centred around m6A motifs detected de novo underneath nanopore error sites. To assess enrichment at these sites, we also shuffled the sequences derived from these windows 1000 times, maintaining the dinucleotide content using ushuffle version 1.1.0 (Jiang et al., 2008), and counted the occurrences of the same motifs each time. The log enrichment score was then derived from the ratio of the observed count over the median expected count from shuffled sequences.

miCLIP

Preparation of miCLIP libraries

Request a detailed protocol

Total RNA for miCLIP was isolated from 7.5 mg of 14 day old Arabidopsis Col-0 seedlings as previously described (Quesada et al., 2003). mRNA was isolated from ~1 mg total RNA using oligo(dT) and streptavidin paramagnetic beads (PolyATtract mRNA Isolation Systems, Promega) according to the manufacturer’s instructions. miCLIP was carried out as previously described (Grozhik et al., 2017) in three biological replicates, using 15 µg mRNA per each replicate, and an antibody against N6-methyladenosine (#202 003 Synaptic Systems), with minor modifications. No-antibody controls were processed throughout the experiment. RNA-antibody complexes were separated by 4–12% Bis-Tris gel electrophoresis at 180 V for 50 min and transferred to nitrocellulose membranes (Protran BA85 0.45 µm, GE Healthcare) at 30 V for 60 min. Membranes were then exposed to Medical X-Ray Film Blue (Agfa) at −80°C overnight. Reverse transcription was carried out using barcoded RT primers: RT41, RT48, RT49 and RT50 (Integrated DNA Technologies; Supplementary file 8). After reverse transcription, cDNA fraction corresponding to 70–200 nt was gel purified after 6% TBE-urea gel electrophoresis (Thermo Fisher Scientific). After the final PCR step, all libraries were pooled together, purified using Agencourt Ampure XP magnetic beads (Beckman Coulter) and eluted in nuclease-free water. Paired-end sequencing with a read length of 100 bp was carried out on an Illumina MiSeq v2 at Edinburgh Genomics, University of Edinburgh. Input sample libraries were prepared using NEBNext Ultra Directional RNA Library Prep Kit for Illumina (New England Biolabs) and sequenced on an Illumina HiSeq2000 at the Tayside Centre for Genomics Analysis, University of Dundee, with a pair-end read length of 75 bp.

Processing of miCLIP sequencing data

Request a detailed protocol

miCLIP data were assessed for quality using FastQC version 0.11.8 (Andrews, 2010) and MultiQC version 1.7 (Ewels et al., 2016). Only the forward read was used for analysis because the miCLIP site is located at the 5′ position of the forward read. 3′ adapter and poly(A) sequences were trimmed using cutadapt version 1.18 (Martin, 2011) and unique molecular identifiers were extracted from the 5′ end of the reads using UMI-tools version 0.5.5 (Smith et al., 2017). Immunoprecipitation and no-antibody controls were demultiplexed and multiplexing barcodes were trimmed using seqkit version 0.10.0 (Shen et al., 2016b). Reads were mapped to the TAIR10 reference genome with Araport11 reference annotation (Cheng et al., 2017) using STAR version 2.6.1 (Dobin et al., 2013), a maximum multimapping rate of 5, a minimum splice junction overhang of 8 nt (three nt for junctions in the Araport11 reference), a maximum of five mismatches per read, and intron length boundaries of 60–10,000 nt. SAM and BAM file manipulations were performed using samtools version 1.9 (Li et al., 2009). Removal of PCR duplicates was then performed using UMI-tools in directional mode (Smith et al., 2017). miCLIP 5′ coverage and matched input 5′ coverage tracks were generated using bedtools version 2.27.1 (Quinlan and Hall, 2010) and these were used to call miCLIP peaks at single nucleotide resolution with Piranha version 1.2.1 (Uren et al., 2012) and relaxed p-value thresholds of 0.5. Reproducible peaks across pairwise combinations of the three replicates were identified by irreproducible discovery rate (IDR) analysis using Python package idr version 2.0.3 with an IDR threshold of 0.05 (Li et al., 2011). miCLIP replicate three was found to correlate less well, with fewer reproducible sites (Spearman's rho 0.54 for replicates 1 vs. 2, 93,046 reproducible sites, Spearman's rho 0.42 for replicates 1 vs. 3, 12,777 reproducible sites, Spearman's rho 0.47 for replicates 2 vs. 3, 19,948 reproducible sites). We therefore chose to use the reproducible sites detected in the comparison of replicates 1 and two as the final set of miCLIP sites for downstream analysis.

m6A liquid chromatography–mass spectroscopy analysis

Request a detailed protocol

m6A content analysis using liquid chromatography–mass spectroscopy (LC-MS) was performed as previously described (Huang et al., 2018). Chromatography was carried out by the FingerPrints Proteomics facility, University of Dundee.

Code availability

Request a detailed protocol

All scripts, pipelines and notebooks used for this study are available on GitHub at https://github.com/bartongroup/Simpson_Barton_Nanopore_1 (Parker and Schurch, 2019; copy archived at https://github.com/elifesciences-publications/Simpson_Barton_Nanopore_1).

References

  1. 1
  2. 2
  3. 3
  4. 4
    FastQC: a quality control tool for high throughput sequence data
    1. S Andrews
    (2010)
    FastQC: a quality control tool for high throughput sequence data.
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
    Keras
    1. F Chollet
    (2018)
    Keras, https://github.com/fchollet/keras.
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
    pysam: Python interface for the SAM/BAM sequence alignment and mapping format
    1. A Heger
    2. TG Belgrad
    3. M Goodson
    4. K Jacobs
    (2014)
    pysam: Python interface for the SAM/BAM sequence alignment and mapping format.
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
    What are 3' UTRs doing?
    1. C Mayr
    (2019)
    Cold Spring Harbor Perspectives in Biology 11:a034728.
    https://doi.org/10.1101/cshperspect.a034728
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
    Plant Circadian Networks: Methods and Protocols
    1. A Moore
    2. T Zielinski
    3. AJ Millar
    (2014)
    Springer.
  65. 65
  66. 66
    upsetplot
    1. J Nothman
    (2018)
    upsetplot, https://github.com/jnothman/UpSetPlot.
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
  101. 101
  102. 102
  103. 103
  104. 104
  105. 105
  106. 106
  107. 107

Decision letter

  1. Yue Wan
    Reviewing Editor; A*STAR, Singapore
  2. Christian S Hardtke
    Senior Editor; University of Lausanne, Switzerland

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Studying the full transcriptome complexity and processing is key to understanding the biology of any organism. While the advent of RNA sequencing using short-read sequencing has greatly enriched our understanding of the transcriptome, challenges due to short-read sequencing include difficulties in mapping uniquely to individual isoforms, and assembling full transcripts, especially in repeat regions. In addition, the need to convert RNA into cDNA libraries for sequencing can result in artifacts in transcript annotation due to template switching of reverse transcriptase enzymes, and cDNA sequencing limited in its ability to detect RNA modifications. Using nanopore direct RNA sequencing, the authors updated the annotation of the Arabidopsis transcriptome by identifying new antisense transcripts, new splicing patterns, 3' end usage, polyA tail lengths and full length transcripts using 5'cap capture. In addition, using a mutant of the m6A writer in Arabidopsis (vir-1 mutant), and vir-1 mutant reconstituted with vir-1, the authors used increased error rates in direct RNA sequencing to detect m6A modifications transcriptome-wide. The authors showed that m6A modifications impact RNA stability and circadian cycles are associated with 3' end formation in arabidopsis. The paper is very informative in the utility and limitations of using direct RNA sequencing to interrogate transcriptomes and the strategies described in the manuscript is widely applicable to not only Arabidopsis but also to any biologist interested in their transcriptome of choice.

Decision letter after peer review:

Thank you for submitting your article "Nanopore direct RNA sequencing maps an Arabidopsis N6 methyladenosine epitranscriptome" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Christian Hardtke as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This is a timely manuscript to demonstrate the utility of direct RNA sequencing in discovering transcriptomic features in the model organism Arabidopsis thaliana. In addition to studying the primary sequences of the transcriptome, including 5' ends, splicing and polyA+ tail length, the authors also utilized direct RNA sequencing to identify m6A modifications in vir-1 mutants and in vir-1 mutants with restored VIR activity. They found that m6A modifications is associated with 3' end processing in Arabidopsis thaliana. Overall the manuscript is well written and interesting.

We have suggestions to improve the manuscript to increase the impact of the novelty and utility of using direct RNA sequencing, as well as the technical strength of the manuscript by adding controls and performing validations to determine the accuracy of the new signals that they find.

Essential revisions:

1) For polyA tail length determination- could the authors show accuracy of their polyA tail determination on a set of standards with known poly(A) tail lengths?

2) 75% of the author's adapter ligated library failed to align to the transcriptome. The authors should look into the failed reads to explain why this is so.

3) For transcripts with new 5' ends identified by 5'end capture, the authors should validate of the new 5'ends using 5'end RACE.

4) A set of RNAs with known m6A sites should be generated to determine the accuracy and sensitivity of the author's method in detecting m6A to other available softwares, such as Tombo from Nanopore, as well as other published methods including (https://www.biorxiv.org/content/10.1101/525741v1). The authors should also comment on the robustness of their method with regards to new software changes.

5) The authors should show data on the correlation between sequencing depth and the ability to detect m6A modifications. They should also do deeper analysis to look at the nucleotide composition of the called m6A modifications that they capture, as well as whether there is potential bias towards capturing modifications on specific transcripts.

6) Direct RNA sequencing benefits from the aspect that it is long read, single molecule sequencing. As such, there is the possibility of "phasing" modifications that cannot be performed using short-read sequencing. The authors should show some examples of RNAs with more or less modifications than expected and attempt to associate that with RNA processing to demonstrate the utility of direct RNA sequencing.

Reviewer #1:

The possibility to perform RNA sequencing in its native form, without the requirement of cDNA synthesis, represents a major significant advance in RNA biology. Within this context, in this manuscript Matthew Parker and colleagues applied the Oxford Nanopore direct RNA sequencing to characterize the profiling of RNA transcripts in the model plant Arabidopsis thaliana. In the first part of the manuscript, the aim of this research is to: 1) prove that the Nanopore direct RNA sequencing has the ability to detect long, complex mRNAs under an acceptable error rate and to resolve a great complexity of splicing events; 2) prove that the over-splitting and spurious antisense reads generally occurs at low frequency in Nanopore direct RNA sequencing; 3) prove that Nanopore direct RNA sequencing can be used in estimation of poly-A tail length; 4) prove that Nanopore direct RNA sequencing using Cap-dependent capturing is effective in detecting authentic mRNA 5′ ends by using a convolutional neural network classifier of raw signals. In the second part, the authors present a direct RNA sequencing-based approach to detection of m6A methylation in data produced from both the vir-1 mutants defective in m6A and the vir-1 mutants restoring VIR activity. Furthermore, taking advantage of long read sequencing, the authors found that VIR activity is associated with the maintenance of 3' end formation of mRNAs. Overall, this manuscript describes a scientifically sound investigation of an important scientific area, that of epitranscriptomics. This manuscript will no doubt be a valuable resource to the growing field of direct RNA sequencing. I recommend this manuscript for publication in eLife.

I have some specific points detailed below.

1) In the supplementary Table 1, only 25% and 27% of sequencings reads in the 5' adapter ligation library could be mapped to the TAIR10 genome. The authors should check the sequencing reads to explain in detail the reason for the failed alignment in a large majority of sequencing reads.

2) Subsection “Differential error site analysis reveals the m6A epitranscriptome”, 17,491 sites with a more than two-fold higher error rate in the VIR-complemented line with restored m6A. Are these error sites exclusively located in adenine positions? A nucleotide composition summary is required to improve clarity.

3) Subsection “Spurious antisense reads are rare or absent in nanopore DRS”, the absence of reads mapping to antisense to RCA suggests that spurious antisense is rare or absent from Nanopore direct RNA sequencing data. Is it a generally accepted that the genomic loci of RCA do not generate antisense transcripts? If so, this information should be included.

4) The webpage (https://github.com/bartongroup/Simpson_Barton_Nanopore_1) which is used to deposit scripts and pipelines on GitHub is not accessible. The code availability will represent a valuable addition to the Nanopore RNA sequencing community and can be used a guide for direct RNA methylation analysis.

5) This is a technical paper to assess the performance of Nanopore direct RNA sequencing in a pioneer manner. Since m6A RNA modification detection and full-length RNA transcript sequencing using Cap-dependent capturing constitutes the major contribution of this paper, the recent advances relating to these topics should be included in the Introduction section.

Liu H, et al. Accurate detection of m6A RNA modifications in native RNA sequences. bioRxiv 2019:525741.

Jiang F, et al. Long-read direct RNA sequencing by 5'-Cap capturing reveals the impact of Piwi on the widespread exonization of transposable elements in locusts. RNA Biol 2019:1-10.

Reviewer #2:

In this manuscript, Parker et al. demonstrated several interesting applications of nanopore direct RNA sequencing (DRS) to advance the current understanding of Arabidopsis transcriptome. They have assessed the primary performance factors of DRS that are related to the basecalling error profile, poly(A) length distribution, full-length splicing profile, and the reliability of antisense reads. Also, they have developed two new techniques for DRS to find the 5′-end positions of capped RNAs and the m6A -modified positions in mRNAs. Finally, they applied an error-based m6A -detection method to find the association between the m6A modifications of 3′ UTR and RNA 3′-end processing in Arabidopsis thaliana.

This manuscript lists many different types of RNA processing and modification that can be detected by DRS. This is a nice demonstration of the potential applications and strengths of DRS. But most parts do not connect to each other, and the conceptual novelty is limited. I have some concerns that will need to be addressed before publication.

1) This technique is conceptually similar to the work by Liu et al. (https://www.biorxiv.org/content/10.1101/525741v1). Please clarify the similarities/differences between Liu et al. and the method described in this manuscript.

2) Base-calling error profiles can vary depending on the basecaller version, the model in the basecaller, the processing/filtering parameters used for running the basecaller, the pore protein version, and the motor protein version. Recently, there was a huge upgrade for the basecalling model (high-accuracy model, a.k.a. flip-flop basecaller) introduced in Guppy 3.2. Moreover, according to the Oxford Nanopore Technologies (ONT), a faster motor protein and the new R10 pore will be introduced to their DRS kits very soon. The authors need to clarify and inform the readers that the error profile presented here may change depending on the basecaller software, a model, a pore, and a motor protein.

The method described in this manuscript can work in the specific combinations that the authors used, but it does not guarantee its performance in the other settings. Considering the change in the flagship basecaller, Guppy, with higher accuracy after the version used in the manuscript, the authors need to check if this method works with comparable accuracy with the newer Guppy.

3) In addition, Tombo from the ONT has a mode called "level_sample_compare" that enables the modified base detection by comparing control and experiment groups. Can you discuss a bit about the benefits and drawbacks of your method in comparison with Tombo?

4) Subsection “Nanopore DRS confirms sites of RNA 3′ end formation and estimates poly(A) tail length”: The standard nanopore DRS library preparation uses the double-stranded RNA-DNA ligation assisted by an oligo(dT) splint. It inevitably introduces a substantial underrepresentation of short poly(A) tails. Even 10 nt oligo(dT) splint often results in a strong bias for > 20 nt poly(A) tails against the shorter tails. Moreover, short poly(A) tails often carry additional U tails which interfere with the ligation to the adapter. In addition, the means of the single-read "poly(A) length measurements" may not be a suitable summarization to estimate the means of "poly(A) lengths." Nanopolish gives the best approximations of poly(A) lengths at a single-read level. However, as the poly(A) dwell time roughly follows a gamma distribution with a long tail, the mean of the best approximations at a single-read level systematically overestimates the mean of poly(A) length.

5) Subsection “Cap-dependent 5′ RNA detection by nanopore DRS”: The authors used RNA ligase 1 to mark the 5′ ends of RNAs. Due to the substrate specificity of this enzyme, circularized mRNAs and mRNA-mRNA concatamers might have been produced. Supplementary Table 1 shows that the libraries using the standard protocol yielded ~90% of mappable reads while the libraries with cap-dependent ligation yielded only ~25%. Please describe what the other 75% are. Also, it would be helpful if the sequence information of the 5′ adapter were presented in the manuscript.

6) Subsection “Differential error site analysis reveals the m6A epitranscriptome”: How many of the miCLIP peaks could be detected with nanopore DRS? The authors only show the analyses using the population-level detection of m6A -modified sites. A real benefit of nanopore DRS lies in the associative analyses at a single-read level. Is it possible to use the DRS method to call m6A -modified sites within single molecules and analyze to see if a modified RNA has different polyadenylation status or 3′ end position from those in an unmodified RNA?

I feel that the current discoveries related to m6A modification in this manuscript can be better done using miCLIP than nanopore DRS.

Reviewer #3:

In this manuscript, "Nanopore direct RNA sequencing maps an Arabidopsis N6 methyladenosine epitranscriptome" the authors use Nanopore direct-read RNA sequencing (DRS) to characterize the Arabidopsis transcriptome, including features that can't easily be determined by short read sequencing, such as cap-associated transcription start sites, splicing events, poly(A) site choice and poly(A) tail length. The authors also use this approach to map sites of m6A modification and use a vir-1 mutant strain to validate m6A sites. The identification of modified transcripts allowed the authors to identify a role of m6A in regulating length of circadian period. Overall the paper is well written and is easy to follow.

The authors use cap-dependent ligation to enrich for capped mRNAs, reducing the 3' end bias observed in DRS. It would be of interest to discuss if the transcripts with early 5' ends result from technical artifacts (for example, degradation during RNA preparation) or represent transcripts present in the cell that lack a cap structure that is compatible with the ligation protocol. For example, the authors demonstrate that nanopore DRS can identify rare anti-sense transcripts. Could transcripts with early 5' ends, which are selected against in the cap dependent libraries, represent rare transcripts or degradation intermediary products?

Additionally, the authors leverage DRS to expand the set of annotated transcripts and splicing isoforms. Can the authors use this data to describe new endogenous targets of NMD, poison exons, or find examples of proteins with new functional domains?

One feature of the transcriptome the authors focus on is the localization and effect of m6A modification on RNA metabolism. It would be informative if in addition to the ERCC controls, a set of RNAs with known m6A sites were also included.

One aspect of m6A biology that can't easily be studied with current methodologies is the stoichiometry of the modification at each position in each transcript. Can the authors comment at all on stoichiometry of m6A from DRS? Furthermore, can the authors comment on how many reads per transcript are necessary to detect a modification? Is there a bias towards more abundant transcripts? In mammalian cells it has been shown that modified RNAs tend to have shorter 3' UTRs (Molinie et al., 2016)(PMID: 27376769). Can this observation be tested in DRS, and if so, does the same phenomena occur in Arabidopsis? Information on stoichiometry can be validated for ACA sites with transcript specific assays using the MazF nuclease.

Lastly, can the authors comment on the ability of detecting modifications other than m6A through DRS?

https://doi.org/10.7554/eLife.49658.sa1

Author response

Essential revisions:

1) For polyA tail length determination- could the authors show accuracy of their polyA tail determination on a set of standards with known poly(A) tail lengths?

We used the software tool nanopolish (polyA), developed by Jared Simpson’s lab, to estimate poly(A) tail length. Details on the development of the tool are incorporated within the Workman et al., pre-print (https://www.biorxiv.org/content/10.1101/459529v1). Importantly, in this context, the tool was tested on a set of RNA substrates with poly(A) tails of known lengths. Workman et al. state:

“These datasets consisted of ionic current traces for synthetic S. cerevisiaeS. cerevisiae enolase transcripts appended with 3′ poly(A) tails of 10, 15, 30, 60, 80 or 100 nucleotides”

The analysis of this set of standards with known poly(A) tail lengths in the Workman et al. study reports:

“Median estimates fell within 4 nucleotides of the expected tail length for the 10-to-80 poly(A) datasets; for the 100 nt dataset, the median estimate was 109 nt.”

To examine this question independently, using our own nanopore DRS data, we asked how well nanopolish (polyA) performed in identifying the poly(A) tail of the ERCC spike-ins included in our experiments. According to the manufacturers, most of the spike-ins have a terminal poly(A) stretch of 24 nt, with all spike-in poly(A) tails in the range 20-26 nt. For example, ERCC- 00002 has an expected poly(A) tail length of 24 nt, and the median nanopolish polyA estimate was 23.3 nt (95% confidence intervals [CIs] [5.7, 102], n = 1,485). ERCC-00074 also has an expected poly(A) tail length of 24 nt, and the median nanopore polyA estimate was 27.4 nt (95% CIs [8.9, 118], n = 1,439). We conclude that the median estimated poly(A) tail length by nanopolish (polyA) is similarly accurate to that reported by Workman et al., but with a broad margin of error.

Text relating to these findings is now included in Results section and these data are now included as Figure 2—figure supplement 1A.

2) 75% of the author's adapter ligated library failed to align to the transcriptome. The authors should look into the failed reads to explain why this is so.

This finding is explained by the relatively low amount of input RNA used in the corresponding sequencing runs. In our standard nanopore DRS experiments we used 1,000ng of poly(A+) RNA for input libraries loaded onto the MinION flowcell. However, the recovery of RNA after cap capture was less than 250ng. This reduction in sample RNA has two consequences:

First, relatively low amounts of input RNA reduces the ratio of the sample RNA to ONT’s internal standard RNA (yeast Enolase II [ENOII]). As a result, an order of magnitude more reads (11-20%) map to Enolase II in the adapter ligated samples compared to our standard nanopore DRS samples (Author response image 1).

Second, low input RNA results in low pore occupancy and “noise reads” consisting of signal from empty pores1,2. Consistent with this, the remaining unmapped reads are very short (75% of unmapped reads are 100 nt or less in length) and of low quality (98% of unmapped reads which are 100 nt or less in length have a median per-base quality score of 7, compared to only 1% of mapped reads) as illustrated in Author response image 2. Consequently, these unmappable “reads” do not equate to signals from RNA molecules but from empty pore “noise”1,2.

Author response image 1
Low amounts of input RNA increase the number of Enolase II reads and short noise reads.

Read length distribution of mapped (blue) and unmapped (orange) reads originating from a standard DRS experiment (left panel) and an adapter ligation experiment (right panel).

Author response image 2
Noise reads have low quality scores.

Boxplot showing distribution of median per-read quality scores for mapped and unmapped reads from an adapter ligation DRS experiment.

Text has been inserted into the computational analysis methods section and the legend to Table 1, of the revised manuscript to explain this phenomenon.

3) For transcripts with new 5' ends identified by 5'end capture, the authors should validate of the new 5'ends using 5'end RACE.

Our 5’ cap capture method is essentially a cap-dependent 5’ RACE experiment: We ligate a biotinylated synthetic RNA oligonucleotide matching the oligo sequence used in the RLM GeneRacer kit designed for cap-dependent 5’ RACE (Thermo Fisher Scientific). Consequently, stronger validation of the new 5’ ends we identify could derive from a more orthogonal approach such as the Illumina-based high throughput 5’ end capture method called nanoPARE published at the end of 2018 by Michael Nodine’s group using Arabidopsis RNA3.

NanoPARE is able to capture 5’ ends using reverse transcription of RNA with a template switching oligonucleotide (TSO), followed by tagmentation and PCR using TSO and transposase specific primers. PCR products are then sequenced using the IIlumina platform. Template switching using the TSO occurs most often at the 5’ end of the mRNA regardless of whether there is an m7G cap or not. Reads originating from the 5’ end of capped mRNAs can be distinguished from reads from uncapped 5’ ends and internal template switching by the addition of untemplated guanosine at the 5’ end of the read.

To address whether novel 5’ ends identified by nanopore can be validated with another approach, we downloaded and reanalysed nanoPARE data, focussing on those reads with untemplated guanosine that originate from capped mRNAs. The 5’ ends identified by nanopore DRS at RCA (AT2G39730), which agree with both our Illumina RNAseq coverage and the RIKEN RAFL clones, also agree well with the coverage from nanoPARE data (Figure 3E in the revised version of manuscript). Furthermore, the alternative transcriptional start sites identified at AT1G17050 and AT5G18650 using nanopore DRS are also clearly visible in the nanoPARE data (Figure 3—figure supplement 1D in the revised version of manuscript).

We reported in our original submission that 60% of nanopore 5’ ends sequenced using the cap capture protocol were within 13 nt of a transcription start site (TSS) identified using RIKEN RAFL cDNA clones (Figure 3C, left panel, in the revised version of manuscript). We can now also report that 75% of nanopore DRS 5' ends sequenced using cap capture are within 5 nt of a TSS detected using single base resolution peak-calling from nanoPARE reads, and 82% of nanopore DRS 5’ ends are within 13 nt of a TSS (Figure 3C, right panel, in the revised version of manuscript). This closer agreement is presumably due to the higher depth of the nanoPARE compared to RIKEN RAFL dataset, leading to the greater coverage of alternate TSSs.

This information is now reported in the Results section of the manuscript and the data analysis has been incorporated into the Materials and mmethods section.

4) A set of RNAs with known m6A sites should be generated to determine the accuracy and sensitivity of the author's method in detecting m6A to other available softwares, such as Tombo from Nanopore, as well as other published methods including (https://www.biorxiv.org/content/10.1101/525741v1). The authors should also comment on the robustness of their method with regards to new software changes.

To address the first of the points raised by the reviewers, we examined whether we could detect m6A in a matched set of synthetic RNAs that differ only in the presence/absence of N6 methylation at a single adenosine. We selected a 60 nt portion of a human lncRNA (MALAT1) that has been reported to be modified by m6A, and for which base-specific validation of the modification had been obtained using the orthogonal technique, SCARLET4. The m6A -modified and unmodified RNAs were chemically synthesized by Dharmacon (Horizon Discovery), complete with a poly(A) tail 50 nt in length. We prepared and sequenced a library of each RNA using the nanopore DRS procedure.

Following the application of our differential error rate analysis we could successfully identify the methylated adenosine.

Text describing this approach has been added to the Results section, with an accompanying Figure 5—figure supplement 1A. In addition, details of the associated methodology involved have been added to the Materials and methods section.

The relative evaluation of our nanopore differential error analysis against other approaches detecting m6A is not straightforward. No comparable assessment of the accuracy or sensitivity of any of the established m6A mapping techniques Me-RIP/ m6A -Seq5, mi-CLIP6, MAZTER-Seq7 or EpiNano8 exists. This is partly because no absolute “ground truth” for m6A sites exists, aside from individual synthesised RNAs. However, we have addressed this question in the following ways:

Since our manuscript was reviewed, the EpiNano bioRxiv preprint has been published in Nature Communications8. A distinction from the preprint is that the subsequent publication includes an assessment of EpiNano on biological data i.e. nanopore DRS of Saccharomyces cerevisiae strain SK1 defective in the S. cerevisiae m6A methyltransferase, IME4. S. cerevisiaeS. cerevisiae mRNAs are methylated during meiosis. Hence, synchronisation of cells is required to have appropriate starting material to detect m6A. Data corresponding to 3 biological replicates of the “wild-type” strain (Say841 comprising a deletion of NDT80 to facilitate meiosis synchronization) and 3 biological replicates of the ime4 mutant strain (Say966 in which both NDT80 and IME4 were deleted) were produced.

Liu et al.8 compared the relative performance of EpiNano against TOMBO using m6A sites identified by Me-RIP/ m6A -seq5 as the “ground truth”. However, while our manuscript was under review, another study was published in Cell that used an antibody-independent method called MAZTER-seq7 to map m6A. MAZTER-seq uses the enzyme MazF, which cleaves RNA in an ACA sequence context, unless the proximal A is methylated7. The MAZTER-seq study reported a minimal estimate of false positives and false negatives (validated in some cases using SCARLET) in S. Cerevisiae Me-RIP/ m6A -seq data5,9. Consequently, Me-RIP/ m6A -seq data does not define an absolute “ground truth”. However, MAZTER-seq does not define an absolute “ground truth” either because only a fraction of S. Cerevisiae m6A sites are in an ACA context

The nanopore DRS dataset for S. cerevisiae8 is smaller than that which we report here for Arabidopsis (3 biological replicates, 2,215,099 read alignments compared to our 4 Arabidopsis biological replicates with 10,364,150 read alignments). Consequently, this smaller dataset likely results in reduced power to detect m6A. Furthermore, only basecalled data appears to be publicly available and it is not certain whether data is basecalled using MinKNOW (on which basecaller versions can vary due to auto-updates) or independently using a fixed version of the basecaller. Possibly for these reasons, we detect a large number of false positives in the differential error site approach when using an FDR threshold of 0.05 and log fold change threshold of 1 during analysis of these data. We therefore filtered the data with a more stringent FDR threshold of 10-6, resulting in the detection of 2,294 differential error sites (Author response image 3A). A de novode novo motif search using MEME at these sites returned the motif RGACWWT (Author response image 3B), which resembles very closely the motif identified by Schwartz et al. using MeRIP/ m6A -seq5. Application of the identified motif to the sequences underlying error sites using FIMO (FDR < 0.1) identified 555 unique m6A motif sites. The most common motifs detected with nanopore DRS are shown in Author response image 3C and included GGACA (34.1%) and AGACA (22.3%). Notably, these two motifs were also the most frequently detected with the MeRIP/ m6A -seq approach (GGACA [29.4%], AGACA [17.3%])5.

Author response image 3
Differential error rate analysis of S. cerevisiae SK1 WT vs ime4.

(a) Histogram showing the log2 fold change in the ratio of errors to reference matches at bases with a significant change in error profile in ime4 mutant compared with wild type S. cerevisiae SK1. Orange and green shading indicates sites with increased and reduced errors in ime4, respectively. (b) Sequence logo showing the motif enriched at sites with reduced error rate in ime4. (c) Frequency of m6A motifs detected at ime4 reduced error sites, as detected by FIMO using the motif detected de novode novo by MEME and an FDR threshold of 0.1.

We conclude that the output of analysis of these S. cerevisiae nanopore DRS datasets with a differential error rate approach is consistent with well-established features of m6A in this organism.

We assessed the performance of our nanopore DRS differential error rate approach by using the nanopore DRS data published in the Liu et al. EpiNano study8 and Me-RIP/ m6A -seq sites5 and MAZTER-seq sites7 as “ground truths”. However, it is important to bear in mind the limitations of such “ground truths” and hence the interpretation of “false positives” or “false negatives”. As a result, we refer here to “ground truth" and “predicted” sites that were “supported” or “unsupported” by either the Me-RIP/ m6A -seq or MAZTER-seq data.

We calculate three metrics for each set of predictions compared to the ground truth: (i) precision, (ii) recall and (iii) F1 score.

Precision describes the fraction or percentage of predicted sites which have support in the ground truth dataset.

Recall describes the fraction or percentage of ground truth sites which are detected in the predicted dataset.

The F1-score an overall measure of accuracy and is defined as the harmonic mean of the precision and recall.

We first used Me-RIP/ m6A -seq data as an m6A “ground truth” dataset (Author response table 1). Schwartz et al 5 identified 1,308 Me-RIP/ m6A -seq peaks, of which 1,100 have coverage in the nanopore DRS data. Me-RIP/ m6A -seq peak intervals were increased in length to 50 nt centred around the peak summits downloaded from GEO accession GSE51583. Using the motifs predicted de-novo underneath differential error sites, 11.3% of Me-RIP/ m6A -seq sites were recovered with a precision of 22.4% for an F1-score of 0.15. EpiNano recovered 21.2% of Me-RIP/ m6A -seq sites with a precision of 2.9%, for an F1-score of 0.05. MAZTER-seq recovered 11.5% of Me-RIP/ m6A -seq peaks with a precision of 10%, with an F1-score of 0.11. Analysed in this way, our nanopore DRS differential error site approach performs similarly to MAZTER-seq. Although EpiNano recovered more sites, it also called many more sites (32,309) for which there is no supporting Me-RIP/ m6A -seq data. Indeed, using EpiNano sites with a threshold of zero (p >= 0.0, Author response table 1) gave an F1-score of 0.03, suggesting that much of the predictive power of EpiNano can be achieved by simply using a DRACH motif search in regions with nanopore DRS read coverage.

Author response table 1
Performance of MAZTER-seq, EpiNano and Nanopore differential error rate methods using S. cerevisiae Me-RIP/m6A-seq peaks as the “ground truth”.
MAZTER
seq
EpiNano
p >= 0.0
EpiNano
p >= 0.5
EpiNano
p >= 0.9
Nanopore error sitesNanopore error site
motifs
Total ground truth
sites
1,1001,1001,1001,1001,1001,100
Total prediction
sites
1,25732,9208,0121692,294555
Ground truth sites
supported
12748923314138124
Unsupported
prediction sites
1,12632,3097,7671552,087430
Unsupported
ground truth sites
9736118671,086962976
Precision0.1010.0150.0290.0830.0620.224
Recall0.1150.4450.2120.0130.1250.113
F1 score0.1080.0290.0510.0220.0830.150
Fisher p value1.7E-1246.9E-2051.5E-1124.9E-133.9E-983.9E-161

We next used MAZTER-seq datasets as a "ground truth” dataset (Author response table 2). Garcia-Campos et al. identified 1,341 MAZTER-seq m6A sites in ACA contexts7, of which 1,257 have coverage in the nanopore data. Because MAZTER-seq only maps m6A in an ACA context7, and EpiNano is unable to map m6A in 5mers with more than one A, there was no overlap between the datasets, meaning EpiNano scored a precision, recall, and F1 of zero. Me-RIP/m6A-seq peaks data recovered 10.4% of sites with a precision of 11.9%, for an F1-score of 0.11. When we performed de novo motif detection under MeRIP/m6A-seq peaks (motif identified was GGACWWT) and filtered out m6A sites which were not in ACA contexts, we recovered only 7.5% of MAZTER-seq sites, but with an improved precision of 26.2%, for an F1 score of 0.12. Motifs detected de novo under nanopore DRS differential error sites recovered 9.2% of MAZTER-seq sites with a precision of 33.0%, for an F1-score of 0.14, when filtered to ACA contexts. Consequently, when analysed in this way, nanopore DRS differential error site analysis performed marginally better than Me-RIP/m6A-seq (with and without motif detection) and outperforms EpiNano.

Author response table 2
Performance of Me-RIP/m6A seq, EpiNano and Nanopore differential error rate methods using S.cerevisiae MAZTER-seq sites as the “ground truth”.
Me-RIP
seq
Me-RIP
seq motifs
Me-RIP
seq
motifs (ACA)
EpiNano
p >= 0.5
Nanopore error sitesNanopore error site motifsNanopore error site motifs
(ACA)
Total ground truth
sites
1,2571,2571,2571,2571,2571,2571,257
Total prediction
sites
1,1006693618,0122,294555348
Ground truth sites
supported
13195950123116116
Unsupported
prediction sites
9735752678,0122,130442235
Unsupported
ground truth sites
1,1261,1621,1621,2571,1341,1411,141
Precision0.1190.1420.26200.0550.2080.330
Recall0.1040.0760.07600.0980.0920.092
F1 score0.1110.0990.11700.0700.1280.144
Fisher p value1.7E-1242.6E-1551.9E-18318.5E-1436.93E-2114.26E-238

Overall, we conclude that nanopore DRS differential error analysis can perform similarly to Me- RIP/m6A-seq and MAZTER-seq, but outperforms EpiNano.

The strengths of MAZTER-seq are that single nucleotide resolution and stoichiometric data are obtained7. However, since MAZTER-seq only detects m6A in an ACA context (and with spatial resolution constraints from other ACA sites), the majority of mRNA m6A sites are undetectable with this approach7.

The limitations of EpiNano may derive from the training data used: RNAs were transcribed in vitro in either the presence or absence of m6A. As a result, not only are RNAs with methylation profiles that have not been reported in nature used in the training set, but also the most frequently occurring sequence contexts found in nature e.g. AAm6ACA are missing from the training data. In order to address these limitations, EpiNano does not incorporate sites with more than one A in the k-mer, and so like MAZTER-seq, the majority of m6A sites found in mRNA are undetectable using EpiNano.

When we were characterising the features of the differential error sites we identified in S. cerevisiae nanopore DRS data, we asked if these sites were enriched toward the 3’ end of mRNAs as has been previously reported5. We found that this was the case (Author response image 4). However, when we plotted the “ground truth” MeRIP/m6A-seq and MAZTER-seq sites in the same way, we found the profiles to be quite different (Author response image 4). Strikingly, MeRIP/m6A-seq data was relatively depleted in sites that map closest to RNA 3’ ends. These distinctions may be explained either by different degrees of synchronisation in meiosis in the different studies, analysis-based definition of peaks in Me- RIP/m6A-seq data or from the well-established limitations of Illumina-based fragmentation procedures providing sequence coverage at the 5’ and 3’ extremities of RNA molecules. Whatever the explanation, it reveals the limitations in the field to defining m6A ground truths in biological data and hence caveats to the analysis that we could perform.

Author response image 4
Smoothed distribution of m6A sites detected by different approaches around stop codons in S.cerevisiae SK1.

In order to address the final point regarding software changes, we have found that comparing samples that have been sequenced with different generations of ONT sequencing kits or pores, or have been basecalled/mapped with different software can result in false positives. Consequently, it is necessary to restrict analysis to samples that were sequenced with comparable kits and used the same pipeline, tools, and parameters for basecalling and mapping of all samples.

We have included a new Discussion section where we comment on the impact of software (and hardware/wetware) changes to nanopore DRS.

5) The authors should show data on the correlation between sequencing depth and the ability to detect m6A modifications. They should also do deeper analysis to look at the nucleotide composition of the called m6A modifications that they capture, as well as whether there is potential bias towards capturing modifications on specific transcripts.

In order to examine the power of our differential error rate analysis approach at detecting m6A sites, we performed a bootstrapped subsampling approach using the nanopore DRS data derived from the synthetic MALAT1 RNAs (described in response to point 4), and example methylated positions in the Arabidopsis transcriptome. This analysis reveals that the ability to detect m6A from errors is dependent upon sequencing depth, as well as the fraction of reads which are methylated at the position in the methylation positive example.

Using the single replicate MALAT1 data, we can reliably detect m6A at the correct position in more than 95% of bootstraps using only 60 reads per condition. However, this represents an idealised scenario, as the synthetic reads are 100% unmethylated in one condition and 100% methylated in the other. We therefore also performed an in silico "dilution" experiment, where we diluted the bootstrapped sample of methylated reads with unmethylated reads and determined whether m6A could still be detected. Here we found that when 50% of reads in the methylation positive condition are sampled from the unmethylated condition, > 200 reads per condition are required to reliably detect m6A in > 95% of bootstraps, suggesting that the percentage of reads which are methylated at a position is an important factor in the power to detect differentially methylated sites.

When performing subsampling experiments using example methylated positions from the Arabidopsis nanopore DRS dataset, we find that the number of reads required to detect m6A varies between genes, and between methylated positions in the same gene. This is likely due to differences in the percentage of transcripts that are methylated at each position. We also cannot rule out that sequence context-specific error rates affect the sequencing depth required to detect m6A. However, we find that for some positions m6A can be reliably detected with as little as 10 reads per replicate (in a 4 replicate per condition experiment) suggesting that it is possible to detect m6A robustly at the threshold of a median coverage of 10 reads per replicate used in our original analysis.

These data have now been incorporated into the Results section of the revised manuscript and included as Figure 5—figure supplement 1E, F.

In order to address the second point of “a deeper analysis at the nucleotide composition of the called m6A modifications”, we examined the reference bases at differential error sites. We find that the error sites are not exclusively located at adenosine positions. This is explained by the fact that 5 RNA bases contribute to the nanopore signal at a particular moment in time. Consequently, the presence of a methylated adenosine in the mRNA can cause basecaller errors anywhere within approximately 5 nt of the methylated adenosine. Hence, detection of m6A using differential error rates has a resolution of around 9 nt, and multiple bases surrounding a single modified adenosine may be detected as having a differential error rate. Adenosines account for most error sites (61.3%). However, since m6A tends to occur in A rich sub-sequence (the most commonly detected methylated motifs in our dataset were AAm6ACA and AAm6ACU) it is also possible that this represents a mixture of methylated positions and adenosines surrounding methylated positions.

We previously characterised the motif that underpinned the differential error sites (Figure 5B in our original submitted manuscript) and detailed the frequency of different sequences matching the motif (Supplementary Figure 5B of our originally submitted manuscript). Notably, the two most frequently occurring motifs that we reported (AAACA and AAACU) match the two most enriched motifs detected in Arabidopsis Me-RIP peaks analysis reported by Chuan He and collaborators10 and also by Wan et al.11. In other words, the most frequently occurring m6A motifs we identify with nanopore DRS match those identified for Arabidopsis using orthogonal Me-RIP/m6A seq data produced and analysed by other labs.

We have included text on this further support to the validity of the approach we used into the revised manuscript.

6) Direct RNA sequencing benefits from the aspect that it is long read, single molecule sequencing. As such, there is the possibility of "phasing" modifications that cannot be performed using short-read sequencing. The authors should show some examples of RNAs with more or less modifications than expected and attempt to associate that with RNA processing to demonstrate the utility of direct RNA sequencing.

Indeed, the ultimate goal would be to distinguish modifications on a single read originating from an individual transcript molecule. We continue to work on this, with the aim of identifying relationships between m6A modification and other RNA processing events. However, at this stage, our method for detecting m6A is aggregated across all reads, meaning that true phasing is not possible.

To address the question posed by the reviewers, we developed an approach to test the relationship between mRNA 3' end position and the error profile of the nanopore DRS reads at detected differential error sites using only the methylated (VIRc) samples. Reads from VIRc samples, which overlapped error sites with identifiable m6A motifs detected in the vir-1 vs VIRc comparison, were clustered by their 3’ position using a kernel density estimate with a σ of 13. Clusters with fewer than 10 reads were discarded. We then compared the error profiles of reads from different clusters at the error site, using a G-test approach, to identify whether different 3’ end positions had different levels of methylation. We identified 327 error sites which were phased with 3’ end position, overlapping 207 annotated genes. 149 (72%) of these genes also display a statistically detectable change in 3’ end position in the vir-1 mutant compared to VIRc, indicating that the changes in 3’ end position in the vir-1 mutant may result directly from the loss of m6A methylation. See PRPL34 (AT1G29070) as an example.

We have included these new analyses in to the Results section and as Figure 7C of the revised manuscript.

Reviewer #1:

[…] I have some specific points detailed below.

1) In the supplementary Table 1, only 25% and 27% of sequencings reads in the 5' adapter ligation library could be mapped to the TAIR10 genome. The authors should check the sequencing reads to explain in detail the reason for the failed alignment in a large majority of sequencing reads.

Our response to this point is covered in the Essential Revisions section, Point 2.

2) Subsection “Differential error site analysis reveals the m6A epitranscriptome”, 17,491 sites with a more than two-fold higher error rate in the VIR-complemented line with restored m6A. Are these error sites exclusively located in adenine positions? A nucleotide composition summary is required to improve clarity.

Our response to this point is covered in the Essential Revisions section, Point 5.

3) Subsection “Spurious antisense reads are rare or absent in nanopore DRS”, the absence of reads mapping to antisense to RCA suggests that spurious antisense is rare or absent from Nanopore direct RNA sequencing data. Is it a generally accepted that the genomic loci of RCA do not generate antisense transcripts? If so, this information should be included.

Antisense transcripts are not annotated at RCA in either the Araport 11 or TAIR10 annotations, so it is not generally accepted that antisense RNAs are found at this locus.

Nanopore DRS indicates antisense RNAs are rare or absent at RCA. To investigate this question with an orthogonal dataset, we consulted our Illumina RNA-seq dataset, in which we sequenced Col-0 and a hen2-2 exosome mutant, prepared using ribosomal RNA depletion. Illumina RNAseq datasets have been shown to suffer from spurious antisense reads produced during library construction which complicate the detection of genuine antisense RNAs. Spurious antisense signals can be identified in count data using ERCC spike-in controls, which should not have antisense RNAs. The ratio of sense to antisense reads mapping to the ERCC spike-in controls can be used as an estimate of the level of spurious antisense signal. In our datasets we find that the ratios of sense to antisense reads at RCA are similar to those found in the ERCC spike-ins, and do not change in the hen2-2 mutant (ind t-test p = 0.88), suggesting that the antisense reads detected at RCA in Illumina RNA-Seq data may be spurious and not from genuine antisense RNAs (Author response image 5).

Author response image 5
Antisense Illumina RNA-seq reads at RCA are likely to be spurious.

(a) Scatter plot showing correlation of sense and antisense read abundances derived from Illumina RNAseq data mapping to ERCC spike-in controls, shown in blue. Each point represents a single spike in transcript detected in a single RNAseq replicate (both Col-0 and hen2-2). Since ERCC spike-ins should not have antisense RNAs, this indicates the level of spurious antisense signal in the dataset. RCA, shown in the orange, does not display antisense RNA levels greater than the ERCC spike-ins, indicating that there are not likely to be genuine antisense RNAs at this locus. FLC, which is known to have antisense RNAs, is shown in green as a positive control. (b) Strip plot showing the sense/antisense read ratio at RCA does not change in the hen2-2 mutant (T test p=0.88). Many genuine antisense RNAs, e.g. those at FLC, are unstable and are therefore upregulated in the hen2-2 mutant relative to sense transcripts.

4) The webpage (https://github.com/bartongroup/Simpson_Barton_Nanopore_1) which is used to deposit scripts and pipelines on GitHub is not accessible. The code availability will represent a valuable addition to the Nanopore RNA sequencing community and can be used a guide for direct RNA methylation analysis.

We have prepared a GitHub site and ENA sites for sequence data (detailed in the submitted manuscript). Access to these resources will be made available upon publication. Indeed, our rationale for submitting this work to eLife, was so that all source data and code would be freely available, open access and directly linked to the figures in which the analysis is presented.

5) This is a technical paper to assess the performance of Nanopore direct RNA sequencing in a pioneer manner. Since m6A RNA modification detection and full-length RNA transcript sequencing using Cap-dependent capturing constitutes the major contribution of this paper, the recent advances relating to these topics should be included in the Introduction section.

Liu H, et al. Accurate detection of m6A RNA modifications in native RNA sequences. bioRxiv 2019:525741.

Jiang F, et al. Long-read direct RNA sequencing by 5'-Cap capturing reveals the impact of Piwi on the widespread exonization of transposable elements in locusts. RNA Biol 2019:1-10.

A manuscript corresponding to this pre-print has now been published in Nature Communications and is now cited in our revised manuscript.

Reviewer #2:

[…] 1) This technique is conceptually similar to the work by Liu et al. (https://www.biorxiv.org/content/10.1101/525741v1). Please clarify the similarities/differences between Liu et al. and the method described in this manuscript.

The general principle of using nanopore Direct RNA Sequencing to identify RNA modifications is shared with that used in the pre-print Liu et al. (https://www.biorxiv.org/content/10.1101/525741v1), which is now published in Nature Communications8. Liu et al. used aggregated measurements of insertion, deletion, and mismatch frequencies at DRACH (where D=A or G or U, R=A or G, H=A or C or U) k-mers, from RNAs transcribed in vitro using either ATP or m6ATP, to train a support vector machine model to classify methylated and unmethylated positions. There are several important similarities and distinctions between this methodology and our own:

  • The models produced by Liu et al.8, once trained on positive and negative examples, should be able to predict m6A in a normal sample without controls. In contrast, our method relies upon comparison between samples with methylation and reduced methylation controls.

  • Like our method, the model produced by Liu et al.8 makes predictions for reference positions and cannot identify the presence or absence of m6A in individual reads.

  • The models produced by Liu et al.8 should detect methylated positions in the analysed transcriptome independently of the writer complex or enzyme which catalyses the methylation. By contrast, our method, when using a mutant with reduced methylation such as vir-1, will only detect methylation sites which are lost in the mutant, e.g. VIR-dependent m6A sites.

  • The models produced by Liu et al.8 rely on the prior knowledge of the DRACH motif to detect m6A, meaning that any modifications in other contexts will not be detected. In contrast, our method takes an unbiased statistical comparison approach to detecting m6A and is therefore sequence agnostic. We find that 4.8% of detected Arabidopsis m6A motifs are in the context AGAUU (Supplementary Figure 5B in our originally submitted manuscript), and 9.55% of detected yeast m6A motifs are in the context DGACG (see Author response image 3) indicating that the prior assumption of the DRACH motif will lead to false negatives.

  • The data used to train the model from Liu et al.8 is produced entirely from in vitro transcribed RNAs. This imposes limitations to this approach:

    • The in vitro transcribed RNAs will not contain other known RNA modifications (see https://mods.rna.albany.edu for a comprehensive list). These modifications are also likely to cause basecalling errors, which the machine learning model has not been exposed to. This means that when the model is applied to biological samples and encounters errors caused by alternative modifications, it may behave unexpectedly, e.g. by calling them as m6A. Our method is unlikely to incorrectly flag other RNA modifications as m6A, unless the mutant used as a negative control alters the pattern of these modifications in RNA.

Because the in vitro transcribed RNAs contain 100% methylated or 100% unmethylated adenosines only, biologically important k-mers with mixtures of A and m6A, such as AAm6ACA or AAm6ACU, are not sampled. Liu et al.8 acknowledge this and therefore do not make predictions in 5mer contexts with more than one A. However, by our estimates this would lead to the failure to detect 92% of m6A sites we detected in Arabidopsis, and 76.6% of m6A sites we detected in S. cerevisiae SK1. The two most frequently occurring motifs that we report in Arabidopsis (AAm6ACA and AAm6ACU) match the two most enriched motifs detected in Arabidopsis Me-RIP peaks analysis reported by Chuan He and collaborators10 and also by Wan et al.11, providing orthogonal support for these estimates.

2) Base-calling error profiles can vary depending on the basecaller version, the model in the basecaller, the processing/filtering parameters used for running the basecaller, the pore protein version, and the motor protein version. Recently, there was a huge upgrade for the basecalling model (high-accuracy model, a.k.a. flip-flop basecaller) introduced in Guppy 3.2. Moreover, according to the Oxford Nanopore Technologies (ONT), a faster motor protein and the new R10 pore will be introduced to their DRS kits very soon. The authors need to clarify and inform the readers that the error profile presented here may change depending on the basecaller software, a model, a pore, and a motor protein.

The method described in this manuscript can work in the specific combinations that the authors used, but it does not guarantee its performance in the other settings. Considering the change in the flagship basecaller, Guppy, with higher accuracy after the version used in the manuscript, the authors need to check if this method works with comparable accuracy with the newer Guppy.

The reviewer raises an important point, namely that there are many variables controlling the error profiles of nanopore DRS reads. Indeed, we have found that comparing samples that have been sequenced with different kits or pores, or have been basecalled/mapped with different software can result in large numbers of false positives. We have therefore restricted analysis to samples that were sequenced with comparable kits, pores and used the same pipeline, tools, and parameters for basecalling and mapping of all samples.

In order to address the reviewers comment that the newest versions of ONT's basecaller, Guppy, may not be able to be used in conjunction with error rate analysis to detect m6A, we re-basecalled the data derived from the sequencing of the synthetic MALAT1 RNA using Guppy version 3.2.4. This version uses the so-called "flip-flop" approach to achieve higher accuracy. Using this basecaller, we found that there were still statistically detectable changes in error profile at the synthetically modified m6A site (Author response image 6).

ONT sequencing and data analysis continues to evolve. Indeed, in the time since the reviewer made this comment ONT have announced another improvement to the basecaller using run length encoding, which may replace the flip-flop models implemented in Guppy in the near future.

However, our experience with machine learning models leads us to believe that so long as ONT do not explicitly train their models to identify m6A, it is likely to remain a source of errors in base- calling. This is because machine learning models typically do not generalise well outside the bounds of the data they are trained on. Even if the training datasets do include unlabelled examples of m6A, due to the extreme imbalance of m6A and A, they are unlikely to be well modelled by the basecaller.

We have included text to reflect this in a new Discussion section of the manuscript.

Author response image 6
Comparison of differential error rate analysis in MALAT1 data using older (3.1.2) and newer (3.2.4) Guppy releases.

Gene track showing the change in the per- reference base test statistic attained when comparing methylated and unmethylated synthetic MALAT1 fragments basecalled using Guppy 3.1.2, which uses recurrent neural network basecalling, and Guppy 3.2.4, which uses recurrent neural network basecalling with a flip-flop architecture.

3) In addition, Tombo from the ONT has a mode called "level_sample_compare" that enables the modified base detection by comparing control and experiment groups. Can you discuss a bit about the benefits and drawbacks of your method in comparison with Tombo?

The performance of Tombo has been evaluated by Liu et al.7; using yeast Me-RIP/m6A-seq data as a ground truth. The authors suggested that EpiNano had an increased precision over Tombo, but recovered fewer ground truth sites7. We understand that ONT is no longer supporting the further development of Tombo and are instead focussing on the development of a new tool called Megalodon for modification detection. However, this tool does not yet have support for RNA.

4) Subsection “Nanopore DRS confirms sites of RNA 3′ end formation and estimates poly(A) tail length”: The standard nanopore DRS library preparation uses the double-stranded RNA-DNA ligation assisted by an oligo(dT) splint. It inevitably introduces a substantial underrepresentation of short poly(A) tails. Even 10 nt oligo(dT) splint often results in a strong bias for > 20 nt poly(A) tails against the shorter tails. Moreover, short poly(A) tails often carry additional U tails which interfere with the ligation to the adapter. In addition, the means of the single-read "poly(A) length measurements" may not be a suitable summarization to estimate the means of "poly(A) lengths." Nanopolish gives the best approximations of poly(A) lengths at a single-read level. However, as the poly(A) dwell time roughly follows a gamma distribution with a long tail, the mean of the best approximations at a single-read level systematically overestimates the mean of poly(A) length.

The reviewer makes very good points and we thank them for their informed insight. We have revised our wording in the manuscript text to refer to the poly(A) length of the sequenced reads rather than Arabidopsis poly(A) tail length. In addition, we have changed our data analysis and edited our text to use the median rather than the mean in these contexts. In addition we have included mention of the impact of short poly(A) tails and uridylated tails might have on quantitative gene expression analysis using the current standard nanopore DRS procedure.

5) Subsection “Cap-dependent 5′ RNA detection by nanopore DRS”: The authors used RNA ligase 1 to mark the 5′ ends of RNAs. Due to the substrate specificity of this enzyme, circularized mRNAs and mRNA-mRNA concatamers might have been produced.

We first investigated whether mRNA-mRNA concatemers were produced in the cap-capture datasets. In our submitted manuscript, we indicated the presence of “over-splitting” of read signal in our DRS datasets, caused by the misinterpretation of signal originating from a single RNA molecule as two molecules. Given that over-splitting can occur, we suggest that “under-splitting” of signal that originates from two unique molecules sequenced consecutively through the same pore may also occur. These will generally be less problematic than over-splitting events, as the chances of under- splitting occurring with two transcripts that have been transcribed from adjacent genomic loci is very low, meaning that they will seldom lead to misidentification of chimeric RNAs. Once base- called, however, these under-split reads would be indistinguishable from physically concatemerized mRNAs catalysed by RNA ligase 1. To identify whether under-splitting occurs in all of our datasets, and whether RNA ligase 1-catalysed concatemers occur in our adapter ligated datasets, we extracted soft-clipped regions from the 5’ and 3’ ends of read alignments to the TAIR10 reference, and remapped them to determine what percentage were alignable to the genome. We found that the percentages of alignable softclipped regions were consistently low in both adapter ligation negative and positive datasets (between 0.04 and 0.27% overall for all experiments), suggesting that the low levels of spurious concatemers are likely to be in silico under-splitting errors (Author response image 7).

Author response image 7
Prevalence of concatemerised reads in nanopore DRS data.

Barplot showing the percentage of read alignments in nanopore DRS data where softclipped regions (i.e. unmapped sequence at 5’ and 3’ ends of alignments) which is mappable elsewhere in the TAIR10 genome, possibly indicating undersplitting of read signal by the MinKNOW software. Adapter ligated datasets do not show increased levels, suggesting that RNA ligase I does not cause a detectable increase in the number of concatemers/chimeric reads.

Supplementary Table 1 shows that the libraries using the standard protocol yielded ~90% of mappable reads while the libraries with cap-dependent ligation yielded only ~25%. Please describe what the other 75% are.

Our response to the reviewer’s comment on the percentage of mappable reads in the adapter ligated datasets is given in the Essential Revisions section Point 2.

Also, it would be helpful if the sequence information of the 5′ adapter is presented in the manuscript.

The sequence of the 5’ adapter RNA has now been added to the Materials and methods section.

6) Subsection “Differential error site analysis reveals the m6A epitranscriptome”: How many of the miCLIP peaks could be detected with nanopore DRS? The authors only show the analyses using the population-level detection of m6A -modified sites. A real benefit of nanopore DRS lies in the associative analyses at a single-read level. Is it possible to use the DRS method to call m6A -modified sites within single molecules and analyze to see if a modified RNA has different polyadenylation status or 3′ end position from those in an unmodified RNA?

Whilst approaching this question we discovered an error in our miCLIP analysis pipeline that meant that the results from the irreproducible rate analysis were not thresholded correctly. Upon correcting this mistake we found that because replicate 3 of our miCLIP experiment was less well correlated with replicates 1 and 2 (spearman's rho 0.54 for replicates 1 vs. 2, 93,046 reproducible sites, spearman's rho 0.42 for replicates 1 vs. 3, 12,777 reproducible sites, spearman's rho 0.47 for replicates 2 vs. 3, 19,948 reproducible sites), it was reducing the quality of the results, and decided to exclude it from the downstream analysis. This has altered our analysis and some results slightly. The Materials and methods and Results sections of the manuscript have been updated accordingly.

We now find that 65.7% of nanopore error sites are within 5 nt of an miCLIP site detected in two replicates at IDR < 0.05 (Figure 5E, 5F in the revised version of manuscript). Conversely, 31.2% of miCLIP peaks are within 5 nt of an error site detected with nanopore DRS. This is likely because miCLIP is a more sensitive but less precise technique (i.e. has antibody-dependent limitations) than nanopore DRS for mapping m6A: we detected 93 thousand significant miCLIP sites compared to only 17 thousand nanopore differential error sites. Furthermore, miCLIP can detect a wider range of methylated adenosines in the transcriptome. Here we report a differential error rate method capable of detecting m6A sites dependent on VIR, in poly(A+) RNA. m6A sites in poly- or oligo- adenylated RNAs, which are detectable by miCLIP, but which are methylated by other enzymes or writer complexes are not detected by our experimental approach.

We address the issues regarding single molecule detection and phasing in the Essential Revisions section Point 6.

I feel that the current discoveries related to m6A modification in this manuscript can be better done using miCLIP than nanopore DRS.

To clarify, we carried out 3 biological replicates of miCLIP in this study using wild type Arabidopsis thaliana Col-0 poly(A)+ RNA. Consequently, in addition to our nanopore DRS datasets, we provide here the first miCLIP-based map of m6A in any plant species. Compared to the read depth of our current nanopore DRS datasets, our miCLIP data does indeed detect more significant sites using the statistical approach described.

However, antibody-based approaches to mapping m6A, such as miCLIP, are limited in some respects because (i) they cannot readily define stoichiometry (ii) they do not necessarily detect m6A only (iii) they have unclear sensitivity. These limitations account, in part, for the recent high-profile publication of the MAZTER-Seq procedure (published while our manuscript was in review) as an antibody-independent approach to mapping m6A that addresses these issues (albeit for a sub-set of sequence contexts)7.

The overlap of miCLIP sites mapped in different studies is low. Examination of 81,519 sites identified across six different published miCLIP experiments in human cell lines revealed a median of only 1,500 sites shared between any two datasets7. This appears to be explained by sites modified at low stoichiometries that are stochastically identified in one miCLIP experiment, but not another, due to the low efficiency of UV cross-linking7.

We used miCLIP as an orthogonal technique here, using only wild-type material. Although we detect enrichment of miCLIP sites in 3’UTRs (Figure 5D and Figure 5—figure supplement 1D in the revised version of manuscript), the consensus sequence associated with miCLIP is oligo(A) – indicating either the identification of sites distinct from those written by the mRNA m6A writer complex, or cross- hybridisation of the antibody to epitopes other than m6A. We used the Synaptic Systems anti-m6A antibody for miCLIP studies because it is widely used in the field and because the errors induced in miCLIP sequencing libraries had already been established6. However, it has also previously been shown that this antibody cross-reacts with non-methylated purine-rich RNA sequences5, highlighting limitations to this approach.

We use nanopore DRS datasets to link m6A to changes in RNA abundance and processing (specifically in relation to 3’ end processing), in the same experiment. For an miCLIP-only approach to provide the same insight, further analysis would be required, such as (i) control procedures for mapping m6A sites in a methylation defective mutant that are not yet routinely established or accepted, (ii) different datasets of RNAseq to investigate changes in RNA levels, (iii) further datasets of specialised RNAseq to map 3’ ends (eg 3P-Seq12), (iv) since we demonstrate that some of the changes in pre-mRNA processing in the absence of VIRILIZER-dependent m6A (chimeric RNAs, for example) are distinct from events annotated in TAIR10, Araport11 and AtRTD2, long read sequencing of the vir-1 transcriptome with PacBio, for example, would be necessary to establish an appropriate Reference Transcriptome to be used to quantify the different de novoRNA processing events identified in this study. Overall, nanopore DRS can give m6A positions in the context of full- length RNA reads which can be used for other analyses, but miCLIP cannot.

Reviewer #3:

[…] The authors use cap-dependent ligation to enrich for capped mRNAs, reducing the 3' end bias observed in DRS. It would be of interest to discuss if the transcripts with early 5' ends result from technical artifacts (for example, degradation during RNA preparation) or represent transcripts present in the cell that lack a cap structure that is compatible with the ligation protocol. For example, the authors demonstrate that nanopore DRS can identify rare anti-sense transcripts. Could transcripts with early 5' ends, which are selected against in the cap dependent libraries, represent rare transcripts or degradation intermediary products?

We thank the reviewer for their insightful comment. We had considered it difficult to clearly distinguish what might be biology from what might be experimental induced RNA decay aside from our RNA quality control measures. Consequently, we focused instead on capturing RNAs that did have a 5’ cap to establish which RNAs were full-length, by at least this criterion. However, in response to the reviewer's question, we looked at this in more detail.

To address what might be biology-dependent decay in the datasets that we had, we asked if there was an enrichment of 5’ nanopore DRS reads terminating in proximity to established miRNA cleavage sites in our datasets prepared without cap capture. We approached this question in a similar way to that used by Michael Nodine’s recent study using Illumina nanoPARE data3. Target sites for known Arabidopsis small (s)RNAs (miRNAs, tasiRNAs) in protein coding transcripts were predicted using GSTAr. To prevent false positives caused by poor alignments and full-length mRNAs, target sites spanning introns and in 5'UTRs were filtered out. For each putative target site, the number of nanopore DRS reads terminating in a 20 nt region upstream and downstream of the cleavage site was calculated and used to produce an enrichment score for reads terminating immediately downstream of the site. This method was used instead of measuring 5' ends precisely at the cleavage site, because as we demonstrated in our study, approximately 11 nt are lost from the 5' end of RNA reads in standard nanopore DRS. Enrichment scores and Allen scores for sRNA-target alignments were tested for significance using the same permutation testing method as in Schon et al.3 sRNA sequences were shuffled 1,000 times and target sites for shuffled sequences were predicted. Null distributions for Allen scores and enrichments scores were then produced and used to generate p values for significance. Enrichment and Allen score p values were combined using Fisher's method and multiple testing correction was applied.

Using this approach, we were able to identify peaks of downstream cleavage products in a number of well-established target mRNA-miRNA interactions. For example, at mRNA encoding HAM1 (Figure 3—figure supplement 1H in the revised version of manuscript) and HAM2 (miR170/miR171); NAC1 (miR164); ARF proteins (miR160); and SPL domain containing proteins (miR156/7). Nanopore DRS reads aligning to the non-protein coding TAS genes also reveal clear evidence of established cleavage products (Author response image 8). The full list of identified miRNA targets identified in nanopore is now included in the manuscript as Supplementary file 4.

Author response image 8
Nanopore DRS captures downstream products of miRNA cleavage at transcripts encoded by TAS genes.

Orange, 5’ coverage from capped nanoPARE reads; purple, 5’ coverage from uncapped nanoPARE reads; blue, nanopore DRS 5’ coverage; black miRNA target site alignment; black, Araport11 annotation.

We conclude that, in principle, nanopore DRS can be used to reveal sites of internal mRNA (and TAS ncRNA) cleavage and we thank the reviewer for making this suggestion. Since we have shown that nanopore DRS does not sequence the extreme 5’ end of an RNA accurately, the ideal way to address this question in the future would be to compare adapter ligated libraries that were cap-dependent with those that were cap-independent in a conceptually similar approach to that used in Michael Nodine’s nanoPARE study3. In addition, the use of mutants defective in the Arabidopsis cytoplasmic 5’-3’ exonuclease XRN4 would be useful in revealing degradation intermediary products that are otherwise too unstable to be readily detected.

Additionally, the authors leverage DRS to expand the set of annotated transcripts and splicing isoforms. Can the authors use this data to describe new endogenous targets of NMD, poison exons, or find examples of proteins with new functional domains?

Indeed, our study suggests that nanopore DRS can provide a useful complementary approach to revealing new details of the Arabidopsis transcriptome. Consequently, this approach can be used in the analysis of the sequence features listed by the reviewer and possibly others too (e.g. identifying uORFs).

We can, for example, identify well-established examples of cassette exons, and alternative donor and acceptor sites which are known to be targets of the NMD pathway as indicated by upregulation in upf1 mutants13 (Author response table 3). This includes, for example, mRNA encoding the splicing factor SR34a, in which inclusion of a cassette exon in the first intron has been suggested to introduce a destabilising uORF13 (Author response table 3). We also identified further examples of unannotated alternative splice isoforms which may introduce frameshifts, such as an exon skipping event at mRNA encoding the KH domain- containing protein AT5G56140 (Figure 4—figure supplement 1C in the revised version of manuscript), and an alternative acceptor site at XRCC mRNA (Figure 4—figure supplement 1D in the revised version of manuscript), establishing the principle that nanopore DRS can be used to identify novel alternative splicing products which may be targets of NMD. We suggest that the best way to study these phenomena in the future would be to use biological material with genotypes that disrupt the processes of interest in order to provide a means to validate the events under study. For example, sequencing an NMD mutant such as upf1, as was previously done with Illumina RNAseq, would validate the discovery of potential new NMD substrates.

Author response table 3
Known upf1 sensitive alternative splicing events which are supported by nanopore DRS read alignments.

Table showing examples of NMD sensitive alternative splicing events taken from Table 1 of Kalyna et al.13 which are supported by nanopore DRS reads. Duplicate genes and unsupported examples have been removed from the table.

Gene IDGene NameAlternative splicing class
AT1G55310At-SCL33Exon 4 alternative acceptor
AT2G37340At-RS2Z33Exon 3 alternative acceptor
AT4G16845VRN2Intron 2 retention
AT4G39260GRP8/CCR1Exon 1 alternative donor
AT1G77080MAF1Exon 4 alternative acceptor
AT2G02960Zinc finger (C3HC4) proteinExon 2 alternative acceptor
AT2G46790APRR9/TL1Exon 2 alternative donor
AT2G38880NF-YB1/HAP3aExon 6 alternative donor
AT3G49430At-SR34aIntron 1 cassette exon
AT3G01150At-PTB2aIntron 2 Cassette exon, exon 8 alternative donor
AT3G13570At-SCL30aIntron 3 cassette exon
AT3G53500At-RS2Z32Exon 3 alternative acceptor
AT3G61860At-RS31Intron 2 cassette exon
AT2G21660GRP7/CCR2Exon 1 alternative donor
AT5G53180At-PTB2bIntron 3 cassette exon
AT4G25500At-RS40Intron 2 cassette exon
AT3G55460At-SCL30Intron 3 cassette exon
AT2G29210Splicing factor PWI proteinExon 6 alternative acceptor
AT1G07830RPL29 familyExon 1 alternative donor
AT1G02090FUS5/CSN7/COP15Exon 8 alternative acceptor
AT1G72560PSD/Exportin-tExon 13 alternative donor
AT4G33060CYP57Intron 5 cassette exon
AT5G65060MAF3Exon 4 alternative acceptor
AT5G35410SOS2/CIPK24Exon 9 alternative acceptor
AT4G36960RRM-containing proteinIntron 1 retention

These findings are now added to the Results section relating to splicing analysis.

In addition to the detection of novel alternative splicing events, nanopore DRS is also able to identify novel alternative cleavage and polyadenylation (APA) sites. For example, PTM is a plant homeodomain transcription factor with C-terminal transmembrane domains that sequester it to the chloroplast. PTM is cleaved from the membrane in high light conditions and the N-terminal fragment translocates to the nucleus, where it acts as a retrograde transcription factor positively regulating flowering through the control of FLC14. We find that alternative polyadenylation in intron 10 of PTM mRNA is common and produces transcripts with a truncated open reading frame which removes the C-terminal transmembrane domains (Figure 2—figure supplement 1D in the revised version of manuscript). Translation of these APA transcripts would be predicted to produce transcription factors which have similar activity to the cleaved N-terminal fragment generated from full length PTM. Consequently, regulation of this APA site may provide an alternative mechanism for the production of active PTM that bypasses the retrograde signalling pathway.

This finding is now included in the Results section related to RNA 3’ end detection.

One feature of the transcriptome the authors focus on is the localization and effect of m6A modification on RNA metabolism. It would be informative if in addition to the ERCC controls, a set of RNAs with known m6A sites were also included.

This point is addressed in response to the Essential Revisions Point 4.

One aspect of m6A biology that can't easily be studied with current methodologies is the stoichiometry of the modification at each position in each transcript. Can the authors comment at all on stoichiometry of m6A from DRS? Furthermore, can the authors comment on how many reads per transcript are necessary to detect a modification? Is there a bias towards more abundant transcripts?

These points are addressed in response to Essential Revisions Point 5.

In mammalian cells it has been shown that modified RNAs tend to have shorter 3' UTRs (Molinie et al., 2016)(PMID: 27376769). Can this observation be tested in DRS, and if so, does the same phenomena occur in Arabidopsis?

When considering the impact of loss of m6A in the vir-1 m6A defective mutant on 3’ end formation, we found a clear directional shift to proximal poly(A) site selection. These findings were reported in Figure 6E of our originally submitted manuscript. In addition, our analysis of phasing, described in Essential Revisions section Point 6, revealed a clear directional shift to more proximal poly(A) site selection in transcript isoforms detected from the same gene in m6A defective mutant backgrounds compared to complemented line VIRc (see Figure 7C in the revised version of manuscript).

We have now also addressed this question in an orthogonal manner. We used the DaPars15 algorithm which is designed to detect shifts in RNA 3’ end formation in Illumina RNAseq data. We used our Illumina RNAseq datasets that we carried out in parallel to nanopore DRS, comparing the vir-1 mutant with the VIR complemented line. We find that DaPars analysis also identifies a global shortening of 3'UTRs in the vir-1 mutant (Figure 7—figure supplement 1B in the revised version of manuscript).

We have incorporated the DaPars analysis (and the phasing analysis) into our revised manuscript.

Our interpretation is that there is not yet a clear consensus on all the possible interrelationships between m6A and 3’ end formation in published mammalian cell studies. For example, Ke et al.16, reported that dense methylation was associated with an inhibition of proximal poly(A) site selection and the use of more distal poly(A) sites. Therefore, our findings are consistent with the suggestion made by Ke et al.16 that m6A could inhibit the use of proximal sites and hence favour longer 3’UTR transcript isoforms of the same gene. Moline et al. state the following in the Discussion of their publication:

“A recent study reported that genes with longer last exons have a higher density of m6A peaks. Consistent with this observation, we found that m6A levels of genes are positively correlated with maximum 3′-UTR lengths as measured in input steady-state RNAseq data (Supplementary Figure 5E). Detailed analyses showed that genes with higher proximal APA usage in the m6A-positive fraction (P/D; m6A+ > m6A−) had significantly longer 3′UTRs (Supplementary Figure 5F) and lower ratios of proximal versus distal signals (Supplementary Figure 5G,H) in the input RNAseq data as compared to genes with higher distal APA usage in the m6A-positive fraction (P/D; m6A+ < m6A−) or no change in APA usage between fractions (P/D; m6A+ = m6A−).”

The further analysis of 3’ end formation prompted by this reviewer’s question and the phasing analysis requested in Essential revisions point 6 led us to look more closely at the shift in 3’ formation. When we plotted the position of the most-frequently occurring motifs at differential error sites and the poly(A) signal AAUAAA (and single nucleotide variants) we detected a close overlap in position (Figure 7E in the revised version of manuscript). We have included this new finding into the Results section of the revised manuscript.

Lastly, can the authors comment on the ability of detecting modifications other than m6A through DRS?

We have not yet examined other RNA modifications. A crucial tool in such analyses is the availability of a genotype that disrupts the modification. However, the ability to use nanopore to map other modifications in DNA and indeed nucleotide analogues17 has already been demonstrated.

Consequently, we would anticipate that other RNA modifications will be detectable by nanopore DRS.

References

1 Pontefract, A., Hachey, J., Zuber, M. T., Ruvkun, G. and Carr, C. E. Sequencing nothing: Exploring failure modes of nanopore sensing and implications for life detection. Life Sci Space Res (Amst) 18, 80-86, doi:10.1016/j.lssr.2018.05.004 (2018).

2 Mojarro, A., Hachey, J., Ruvkun, G., Zuber, M. T. and Carr, C. E. CarrierSeq: a sequence analysis workflow for low-input nanopore sequencing. BMC Bioinformatics 19, 108, doi:10.1186/s12859-018-2124-3 (2018).

3 Schon, M. A., Kellner, M. J., Plotnikova, A., Hofmann, F. and Nodine, M. D. NanoPARE: parallel analysis of RNA 5' ends from low-input RNA. Genome Res 28, 1931-1942, doi:10.1101/gr.239202.118 (2018).

4 Liu, N. et al. Probing N6-methyladenosine RNA modification status at single nucleotide resolution in mRNA and long noncoding RNA. RNA 19, 1848-1856, doi:10.1261/rna.041178.113 (2013).

5 Schwartz, S. et al. High-resolution mapping reveals a conserved, widespread, dynamic mRNA methylation program in yeast meiosis. Cell 155, 1409-1421, doi:10.1016/j.cell.2013.10.047 (2013).

6 Linder, B. et al. Single-nucleotide-resolution mapping of m6A and m6Am throughout the transcriptome. Nat Methods 12, 767-772, doi:10.1038/nmeth.3453 (2015).

7 Garcia-Campos, M. A. et al. Deciphering the "m6A Code" via antibody-independent quantitative profiling. Cell 178, 731-747 e716, doi:10.1016/j.cell.2019.06.013 (2019).

8 Liu, H. et al. Accurate detection of m(6)A RNA modifications in native RNA sequences. Nat Commun 10, 4079, doi:10.1038/s41467-019-11713-9 (2019).

9 Garcia-Campos, M. A. et al. Deciphering the "m(6)A Code" via Antibody-Independent Quantitative Profiling. Cell 178, 731-747 e716, doi:10.1016/j.cell.2019.06.013 (2019).

10 Luo, G. Z. et al. Unique features of the m6A methylome in Arabidopsis thaliana. Nat Commun 5, 5630, doi:10.1038/ncomms6630 (2014).

11 Wan, Y. et al. Transcriptome-wide high-throughput deep m(6)A-seq reveals unique differential m(6)A methylation patterns between three organs in Arabidopsis thaliana. Genome Biol 16, 272, doi:10.1186/s13059-015-0839-2 (2015).

12 Jan, C. H., Friedman, R. C., Ruby, J. G. and Bartel, D. P. Formation, regulation and evolution of Caenorhabditis elegans 3 ' UTRs. Nature 469, 97-101, doi:10.1038/nature09616 (2011).

13 Kalyna, M. et al. Alternative splicing and nonsense-mediated decay modulate expression of important regulatory genes in Arabidopsis. Nucleic Acids Res 40, 2454- 2469, doi:10.1093/nar/gkr932 (2012).

14 Feng, P. Q. et al. Chloroplast retrograde signal regulates flowering. P Natl Acad Sci USA 113, 10708-10713, doi:10.1073/pnas.1521599113 (2016).

15 Xia, Z. et al. Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3'-UTR landscape across seven tumour types. Nat Commun 5, 5274, doi:10.1038/ncomms6274 (2014).

16 Ke, S. et al. A majority of m6A residues are in the last exons, allowing the potential for 3' UTR regulation. Genes Dev 29, 2037-2053, doi:10.1101/gad.269415.115 (2015).

17 Muller, C. A. et al. Capturing the dynamics of genome replication on individual ultra- long nanopore sequence reads. Nat Methods 16, 429-436, doi:10.1038/s41592-019- 0394-y (2019).

18 Gierlinski, M. et al. Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. Bioinformatics 31, 3625-3630, doi:10.1093/bioinformatics/btv425 (2015).

19 Schurch, N. J. et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA 22, 839-851, doi:10.1261/rna.053959.115 (2016).

https://doi.org/10.7554/eLife.49658.sa2

Article and author information

Author details

  1. Matthew T Parker

    School of Life Sciences, University of Dundee, Dundee, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Contributed equally with
    Katarzyna Knop, Anna V Sherwood and Nicholas J Schurch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0891-8495
  2. Katarzyna Knop

    School of Life Sciences, University of Dundee, Dundee, United Kingdom
    Contribution
    Conceptualization, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Contributed equally with
    Matthew T Parker, Anna V Sherwood and Nicholas J Schurch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2636-9450
  3. Anna V Sherwood

    School of Life Sciences, University of Dundee, Dundee, United Kingdom
    Present address
    Department of Biology, University of Copenhagen, Copenhagen, Denmark
    Contribution
    Conceptualization, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Contributed equally with
    Matthew T Parker, Katarzyna Knop and Nicholas J Schurch
    Competing interests
    No competing interests declared
  4. Nicholas J Schurch

    School of Life Sciences, University of Dundee, Dundee, United Kingdom
    Present address
    Biomathematics and Statistics Scotland, The James Hutton Institute, Aberdeen, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Contributed equally with
    Matthew T Parker, Katarzyna Knop and Anna V Sherwood
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9068-9654
  5. Katarzyna Mackinnon

    School of Life Sciences, University of Dundee, Dundee, United Kingdom
    Contribution
    Validation, Investigation, Visualization, Writing—review and editing
    Competing interests
    No competing interests declared
  6. Peter D Gould

    Institute of Integrative Biology, University of Liverpool, Liverpool, United Kingdom
    Contribution
    Formal analysis, Investigation, Writing—review and editing
    Competing interests
    No competing interests declared
  7. Anthony JW Hall

    Earlham Institute, Norwich Research Park, Norwich, United Kingdom
    Contribution
    Resources, Supervision, Writing—review and editing
    Competing interests
    No competing interests declared
  8. Geoffrey J Barton

    School of Life Sciences, University of Dundee, Dundee, United Kingdom
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing—review and editing
    For correspondence
    g.j.barton@dundee.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9014-5355
  9. Gordon G Simpson

    1. School of Life Sciences, University of Dundee, Dundee, United Kingdom
    2. James Hutton Institute, Invergowrie, United Kingdom
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    g.g.simpson@dundee.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6744-5889

Funding

Biotechnology and Biological Sciences Research Council (BB/J00247X/1)

  • Geoffrey J Barton
  • Gordon G Simpson

Biotechnology and Biological Sciences Research Council (BB/M004155/1)

  • Geoffrey J Barton
  • Gordon G Simpson

Biotechnology and Biological Sciences Research Council (BB/M010066/1)

  • Geoffrey J Barton
  • Gordon G Simpson

H2020 Marie Skłodowska-Curie Actions (799300)

  • Katarzyna Knop

University of Dundee (GCRF Challenge Fund)

  • Geoffrey J Barton
  • Gordon G Simpson

Wellcome (097945/B/11/Z)

  • Geoffrey J Barton
  • Gordon G Simpson

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was funded by BBSRC grants BB/J00247X/1, BB/M004155/1, BB/M010066/1, the University of Dundee Global Challenges Research Fund and the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 799300. The m6A LC-MS analysis was carried out by Abdel Atrih of the FingerPrints Proteomics Facility, University of Dundee, which is supported by a Wellcome Trust Technology Platform award (097945/B/11/Z). We thank Kasper Rasmussen and Csaba Hornyik for their comments on the manuscript.

Senior Editor

  1. Christian S Hardtke, University of Lausanne, Switzerland

Reviewing Editor

  1. Yue Wan, A*STAR, Singapore

Publication history

  1. Received: June 26, 2019
  2. Accepted: December 5, 2019
  3. Version of Record published: January 14, 2020 (version 1)

Copyright

© 2020, Parker et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 9,097
    Page views
  • 961
    Downloads
  • 20
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Cancer Biology
    2. Chromosomes and Gene Expression
    Xiao Ling Li et al.
    Research Article

    Long noncoding RNAs (lncRNAs) are often associated with polysomes, indicating coding potential. However, only a handful of endogenous proteins encoded by putative lncRNAs have been identified and assigned a function. Here, we report the discovery of a putative gastrointestinal tract-specific lncRNA (LINC00675) that is regulated by the pioneer transcription factor FOXA1 and encodes a conserved small protein of 79 amino acids which we termed FORCP (FOXA1-Regulated Conserved Small Protein). FORCP transcript is undetectable in most cell types but is abundant in well-differentiated colorectal cancer (CRC) cells where it functions to inhibit proliferation, clonogenicity and tumorigenesis. The epitope-tagged and endogenous FORCP protein predominantly localizes to the endoplasmic reticulum (ER). In response to ER stress, FORCP depletion results in decreased apoptosis. Our findings on the initial characterization of FORCP demonstrate that FORCP is a novel, conserved small protein encoded by a mis-annotated lncRNA that regulates apoptosis and tumorigenicity in well-differentiated CRC cells.

    1. Chromosomes and Gene Expression
    Qinyu Hao et al.
    Research Article

    Cell cycle is a cellular process that is subject to stringent control. In contrast to the wealth of knowledge of proteins controlling the cell cycle, very little is known about the molecular role of lncRNAs (long noncoding RNAs) in cell-cycle progression. By performing genome-wide transcriptome analyses in cell-cycle-synchronized cells, we observed cell-cycle phase-specific induction of >2000 lncRNAs. Further, we demonstrate that an S-phase-upregulated lncRNA, SUNO1, facilitates cell-cycle progression by promoting YAP1-mediated gene expression. SUNO1 facilitates the cell-cycle-specific transcription of WTIP, a positive regulator of YAP1, by promoting the co-activator, DDX5-mediated stabilization of RNA polymerase II on chromatin. Finally, elevated SUNO1 levels are associated with poor cancer prognosis and tumorigenicity, implying its pro-survival role. Thus, we demonstrate the role of a S-phase up-regulated lncRNA in cell-cycle progression via modulating the expression of genes controlling cell proliferation.