Computational and Systems Biology

Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data

Guangzhao Cheng
Aki Vehtari
Lu Cheng

Department of Computer Science, Aalto University, Espoo, Finland
Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland

https://doi.org/10.7554/eLife.104618.1

Open access

Figures and data

SegPore workflow.
(A) General workflow. The workflow consists of five steps: (1) Raw current signals are basecalled and mapped using Guppy and Minimap2, and the raw current signal fragments are paired with the corresponding reference RNA sequence fragments. (2) The raw current signal of each read is segmented using a hierarchical hidden Markov model (HHMM), which provides an estimated mean (μ_i) for each segment. (3) These segments are aligned with the 5mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5mer parameter table. For example, A_j denotes the base “A” at the j-th position on the reference. (4) All signals aligned to the same 5mer across different genomic locations are pooled together, and a two-component Gaussian mixture model (GMM) is used to predict the modification at the site level or the single-molecule level. One component of the GMM represents the unmodified state, while the other represents the modified state. (5) The GMM is used to re-estimate the 5mer parameter table.
(B) Hierarchical hidden Markov model (HHMM). The outer HMM segments the current signal into alternating base and transition blocks. The inner HMM approximates the emission probability of a base block by considering neighboring 5mers. A linear model is used to approximate the emission probability of a transition block.
(C) Full/partial alignment algorithms. Rows represent the estimated means of base blocks from the HHMM, and columns represent the 5mers of the reference sequence. Each 5mer can be aligned with multiple estimated means from the current signal.
(D) Gaussian mixture model (GMM) for estimating modification states. The GMM consists of two components: the green component models the unmodified state of a 5mer, and the blue component models the modified state. Each component is described by three parameters: mean (μ), standard deviation (σ), and weight (ω).
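
To make the role of the three GMM parameters (μ, σ, ω) concrete, the following is a minimal sketch of fitting a two-component, one-dimensional Gaussian mixture with plain expectation-maximisation. It is not the SegPore implementation: the function names, the quartile-based initialisation, the simulated current values, and the assumption that the modified component has the lower mean are all illustrative choices made here.

```python
import numpy as np


def _weighted_densities(x, mu, sigma, w):
    """Return an (n, 2) array of w_k * N(x_i | mu_k, sigma_k)."""
    z = (x[:, None] - mu) / sigma
    return w * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def fit_two_component_gmm(x, n_iter=500, tol=1e-8):
    """Fit a 1-D two-component GMM by EM; returns (mu, sigma, w) sorted by mean."""
    x = np.asarray(x, dtype=float)
    # Illustrative initialisation (an assumption): means at the quartiles.
    mu = np.percentile(x, [25, 75])
    sigma = np.full(2, x.std() + 1e-6)
    w = np.array([0.5, 0.5])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each signal value.
        dens = _weighted_densities(x, mu, sigma, w)
        total = dens.sum(axis=1, keepdims=True) + 1e-300
        resp = dens / total
        # M-step: update weight, mean, and standard deviation per component.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        w = nk / x.size
        ll = np.log(total).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    order = np.argsort(mu)
    return mu[order], sigma[order], w[order]


def posterior_of_component(x, mu, sigma, w, k):
    """Posterior probability that each value in x came from component k."""
    x = np.asarray(x, dtype=float)
    dens = _weighted_densities(x, mu, sigma, w)
    return dens[:, k] / (dens.sum(axis=1) + 1e-300)


# Simulated pooled signals for one 5mer (hypothetical values, in pA):
# 700 unmodified values around 123 and 300 modified values around 117.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(123.0, 2.0, 700), rng.normal(117.0, 2.5, 300)])

mu_hat, sigma_hat, w_hat = fit_two_component_gmm(x)
# Assumption for this demo only: the lower-mean component (index 0) is "modified".
p_mod = posterior_of_component(x, mu_hat, sigma_hat, w_hat, k=0)
site_level_rate = w_hat[0]  # mixture weight read as a site-level modification rate
```

In this sketch, the per-value posterior plays the part of a single-molecule prediction, and the weight of the modified component plays the part of a site-level estimate. Which component corresponds to the modified state has to come from outside the mixture fit, for example from a reference parameter table.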

RNA translocation hypothesis.
(A) Jiggling RNA translocation hypothesis. The top panel shows the raw current signal of Nanopore direct RNA sequencing, with gray areas representing SegPore-estimated transition blocks. We focus on three neighboring 5mers, considering the central 5mer (CTACG) as the current 5mer. The RNA molecule may briefly move forward or backward during the translocation of the current 5mer. If the RNA molecule is pulled backward, the previous 5mer is placed in the pore, and the current signal (“prev” state, red dots) resembles the previous 5mer’s baseline (mean and standard deviation highlighted by red lines and shades). If the RNA is pushed forward, the current signal (“next” state, blue dots) is similar to the next 5mer’s baseline.
(B) Example raw current signals supporting the jiggling hypothesis. The dashed rectangles highlight base blocks, with red and blue points representing measurements corresponding to the previous and next 5mer, respectively. Red points align closely with the previous 5mer’s baseline, and blue points match the next 5mer’s baseline, reinforcing the hypothesis that the RNA molecule jiggles between neighboring 5mers. The raw current signals were extracted from mESC WT samples of the training data in the m6A benchmark experiment.
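
The colouring of points in these panels can be illustrated with a small sketch that labels each measurement in a base block as belonging to the previous, current, or next 5mer, according to which Gaussian baseline gives it the highest likelihood. This is a simplification for illustration: it classifies each point independently rather than using the inner HMM described above, and the baseline means and standard deviations are hypothetical values, not entries from the 5mer parameter table.

```python
import numpy as np


def assign_jiggle_states(signal, baselines):
    """Label each measurement with the most likely 5mer baseline.

    signal: 1-D sequence of current measurements within one base block.
    baselines: dict mapping a state name to (mean, std) of that 5mer's baseline.
    """
    names = list(baselines)
    mu = np.array([baselines[n][0] for n in names], dtype=float)
    sd = np.array([baselines[n][1] for n in names], dtype=float)
    x = np.asarray(signal, dtype=float)[:, None]
    # Gaussian log-likelihood up to an additive constant shared by all states.
    loglik = -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)
    return [names[i] for i in loglik.argmax(axis=1)]


# Hypothetical baselines (mean, std) in pA for three neighbouring 5mers.
baselines = {"prev": (95.0, 2.0), "curr": (110.0, 2.0), "next": (125.0, 2.0)}
block = [109.5, 111.2, 96.1, 110.4, 124.3, 108.9]

states = assign_jiggle_states(block, baselines)
# states == ["curr", "curr", "prev", "curr", "next", "curr"]
```

With equal standard deviations, as in this example, the rule reduces to picking the nearest baseline mean. Unequal standard deviations shift the decision boundaries toward the narrower baseline.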

Segmentation benchmark

m6A identification at the site level.
(A) Histogram of current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right).
(B) Histogram of current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right).
(C) Site-level benchmark results for m6A identification across all DRACH motifs, showing performance comparisons between SegPore+m6Anet and other methods.
(D) Benchmark results for m6A identification on six selected motifs at the site level, comparing SegPore and other baseline methods.

m6A identification at the single-molecule level.
(A) Benchmark results for single-molecule m6A identification on IVT data. SegPore outperforms CHEUI in both PR-AUC and ROC-AUC.
(B) Comparison of “eventalign” results from SegPore and Nanopolish for five consecutive kmers. Note that DRS is sequenced from 3’ to 5’, so the kmers enter the pore from right to left. A total of 100 reads were randomly sampled from transcript location A1 (positions 711-719) in both the IVT_normalA and IVT_m6A samples (SRA: SRP166020). Each line represents an individual read, and the y-axis shows the raw signal intensity in picoamperes (pA). Pink lines represent the IVT_m6A sample, and gray lines represent the IVT_normalA sample. The kmers “GCGGA,” “CGGAC,” “GGACT,” “GACTT,” and “ACTTT” all contain N6-methyladenosine (m6A) in the IVT_m6A sample. SegPore’s results show clearer separation between m6A and normal adenosine, especially for “CGGAC” and “GGACT,” compared to Nanopolish.
(C) The upper panel shows the modification rate for selected genomic locations in the example gene ENSMUSG00000003153. The lower panel displays the modification states of all reads mapped to this gene. The black borders in the heatmap highlight the biclustering results, showing distinct modification patterns across different read clusters.
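
For readers unfamiliar with the two metrics cited in panel (A), the sketch below computes ROC-AUC and a PR-AUC estimate from per-read labels and prediction scores. It is an illustration only, not the benchmark code: ROC-AUC is computed as the fraction of correctly ordered positive/negative pairs, PR-AUC is approximated by average precision (one common estimator, assuming no tied scores), and the labels and scores are made-up values.

```python
import numpy as np


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outscores a negative."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels][:, None]
    neg = scores[~labels][None, :]
    # Ties between a positive and a negative count as half a correct pair.
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())


def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    ranked = labels[order]
    hits = np.cumsum(ranked)
    ranks = np.arange(1, ranked.size + 1)
    return float((hits[ranked] / ranks[ranked]).mean())


# Made-up example: 1 = modified read, 0 = unmodified read.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

auc = roc_auc(labels, scores)           # 12 of 16 pairs ordered correctly: 0.75
ap = average_precision(labels, scores)  # (1 + 1 + 3/4 + 4/7) / 4
```

The pairwise ROC-AUC formulation is quadratic in the number of reads, so it suits small examples. A rank-based formulation gives the same value more efficiently on large benchmarks.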
