Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data

  1. Guangzhao Cheng
  2. Aki Vehtari
  3. Lu Cheng  Is a corresponding author
  1. Department of Computer Science, Aalto University, Finland
  2. Institute of Biomedicine, University of Eastern Finland, Finland
12 figures, 9 tables and 1 additional file

Figures

SegPore workflow.

(A) General workflow. The workflow consists of five steps: (1) First, raw current signals are basecalled and mapped using Guppy and Minimap2. The raw current signal fragments are paired with the corresponding reference RNA sequence fragments using Nanopolish. (2) Next, the raw current signal of each read is segmented using a hierarchical hidden Markov model (HHMM), which provides an estimated mean (μi) for each segment. (3) These segments are then aligned with the 5-mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5-mer parameter table. For example, Aj denotes the base ‘A‘ at the j-th position on the reference. In this example, A1 and A2 refer to the first and second occurrences of ‘A’ in the reference sequence, respectively. Accordingly, μ1 and μ2 are aligned to A1, while μ3 is aligned to A2 (4). All signals aligned to the same 5-mer across different genomic locations are pooled together, and a two-component Gaussian Mixture Model (GMM) is used to predict the modification at the site-level or single-molecule level. One component of the GMM represents the unmodified state, while the other represents the modified state. (5) GMM is used to re-estimate the 5-mer parameter table. (B) Hierarchical hidden Markov model (HHMM). The outer HMM segments the current signal into alternating base and transition blocks. The inner HMM approximates the emission probability of a base block by considering neighboring 5-mers. A linear model is used to approximate the emission probability of a transition block. (C) Full/partial alignment algorithms. Rows represent the estimated means of base blocks from the HHMM, and columns represent the 5-mers of the reference sequence. Each 5-mer can be aligned with multiple estimated means from the current signal. (D) Gaussian mixture model (GMM) for estimating modification states. The GMM consists of two components: the green component models the unmodified state of a 5-mer, and the blue component models the modified state. Each component is described by three parameters: mean (μ), standard deviation (σ), and weight (ω).

Figure 2 with 1 supplement
RNA translocation hypothesis.

(A) Jiggling RNA translocation hypothesis. The top panel shows the raw current signal of Nanopore direct RNA sequencing, with gray areas representing SegPore-estimated transition blocks. We focus on three neighboring 5-mers, considering the central 5-mer (CTACG) as the current 5-mer. The RNA molecule may briefly move forward or backward during the translocation of the current 5-mer. If the RNA molecule is pulled backward, the previous 5-mer is placed in the pore, and the current signal (“prev” state, red dots) resembles the previous 5-mer’s baseline (mean and standard deviation highlighted by red lines and shades). If the RNA is pushed forward, the current signal (“next” state, blue dots) is similar to the next 5-mer’s baseline. (B) Example raw current signals supporting the jiggling hypothesis. The dashed rectangles highlight base blocks, with red and blue points representing measurements corresponding to the previous and next 5-mer, respectively. Red points align closely with the previous 5-mer’s baseline, and blue points match the next 5-mer’s baseline, reinforcing the hypothesis that the RNA molecule jiggles between neighboring 5-mers. The raw current signals were extracted from mESC WT samples of the training data in the m6A benchmark experiment.

Figure 2—figure supplement 1
RNA translocation hypothesis illustrated on RNA004 data.

(A) Example from the S. cerevisiae sample (SRA: SRP527927). (B) Example from the curlcake_unmod sample (PRJEB82528). (C) Example from the curlcake_m6A sample (PRJEB82528). In all panels, the x-axis shows time, and the y-axis shows raw current signal intensity (pA, transformed using f5c scale parameters). Vertical lines mark the boundaries between segments (transition blocks). Horizontal shaded areas indicate the estimated standard deviation for each base block, while the red horizontal line shows the mean signal level for that block. Each base block is annotated with four possible states: “prev” (red dot), “next” (blue dot), “curr,” and “noise” (black triangle).

Figure 3 with 3 supplements
m6A identification at the site level.

(A) Histogram of the estimated mean from current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (B) Histogram of the estimated mean from current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (C) Site-level benchmark results for m6A identification across all DRACH motifs, showing performance comparisons between SegPore+m6Anet and other methods. (D) Benchmark results for m6A identification on six selected motifs at the site level, comparing SegPore and other baseline methods.

Figure 3—figure supplement 1
m6A k-mer statistics differences between the human HEK293T (training data) and the mouse mES_WT (test data).

(A) Number of modification sites for each selected 5-mer (in total 10 motif 5-mers) based on the ground truth of the human data (ENA: PRJEB40872). (B) Number of modification sites for all 5-mers (in total 11 motif 5-mers) based on the ground truth of the mouse data (SRA: SRP166020). The ground truth is obtained by taking the union of MeRIP-seq, miCLIP, and miCLIP2. (C) Venn diagram of the two 5-mer motif sets (training data and test data).

Figure 3—figure supplement 2
Comparison of raw signal segmentation results between SegPore and Nanopolish on RNA002 data.

The same raw signal clip is used for both SegPore (top panel) and Nanopolish (bottom panel). Two example raw signal clips (A, B) are taken from the mES_WT sample (SRA: SRP166020) in the m6A identification section. The x-axis represents time, and the y-axis represents the raw current signal intensity (pA). The vertical bars or lines indicate the borders between segments (transition blocks). The horizontal shaded areas represent the estimated standard deviation of each base block, and the red horizontal line indicates the mean of each base block. There are four states in the base blocks: ‘prev’ (red dot), ‘next’ (blue dot), ‘curr’, and ‘noise’ (black triangle).

Figure 3—figure supplement 3
Comparison of eventalign of SegPore and Nanopolish on consecutive 5-mers.

The figures show the distribution of estimated segment means for each k-mer in transcript locations A2 (position 473–477) in IVT_m6A and IVT_normalA samples (SRA: SRP166020). All reads aligned to these 5-mers are used to generate the marginal distribution. The third row is the overlay of the distributions of IVT_m6A and IVT_normalA samples. m6A is located at different positions of the given 5-mer. It can be seen that the peaks are relatively well separated for the third and fourth 5-mers in SegPore’s result, but not so clear in Nanopolish’s result.

Figure 4 with 2 supplements
m6A identification at the single-molecule level.

(A) Benchmark results for single-molecule m6A identification on IVT data. SegPore shows better performance compared to CHEUI in both PR-AUC and ROC-AUC. (B) Comparison of ‘eventalign’ results from SegPore and Nanopolish for five consecutive k-mers. Note that DRS is sequenced from 3’ to 5’, so the k-mers enter the pore from right to left. A total of 100 reads were randomly sampled from transcript locations A1 (positions 711–719) in both the IVT_normalA and IVT_m6A samples (SRA: SRP166020). Each line represents an individual read, and the y-axis shows the raw signal intensity in picoampere (pA). Pink lines represent the IVT_m6A sample, and gray lines represent the IVT_normalA sample. The k-mers ‘GCGGA’, ‘CGGAC’, ‘GGACT’, ‘GACTT’, and ‘ACTTT’ all contain N6-Methyladenosine (m6A) in the IVT_m6A sample. SegPore’s results show clearer separation between m6A and adenosine, especially for ‘CGGAC’ and ‘GGACT’, compared to Nanopolish. (C) The upper panel shows the modification rate for selected genomic locations in the example gene ENSMUSG00000003153. The lower panel displays the modification states of all reads mapped to this gene. The black borders in the heatmap highlight the biclustering results, showing distinct modification patterns across different read clusters labeled C1 through C6.

Figure 4—figure supplement 1
Illustration of Nanopolish raw signal segmentation with eventalign results.

This figure corresponds to Figure 4B (Nanopolish Eventalign) in the manuscript. (A) Visualization without color differentiation: IVT_m6A and IVT_normalA samples are shown in the same color. (B) Visualization with color differentiation: IVT_m6A and IVT_normalA samples are shown in distinct colors for clarity. In both panels, the top section displays five consecutive k-mers, while the bottom section provides a zoomed-in view of the central GGACT k-mer. Without color contrast, baseline differences in the GGACT signals between m6A and normal A samples are not easily detectable, as the baselines appear closely overlapping. Furthermore, identifying other subtle signal features that differentiate the two GGACT states—features leveraged by deep neural networks—remains challenging for field researchers.

Figure 4—figure supplement 2
Running time of SegPore on datasets of varying sizes.

The figure shows the total duration required for the segmentation task using a single NVIDIA DGX-1 V100 GPU and one CPU core with 32 GB system memory allocated.

Appendix 1—figure 1
SegPore workflow with a fixed 5-mer parameter table Tkmer.

The input is the raw current signal yi and reference sequence ri for ith read. The output is the paired current signal segments µi, 5-mer list si and modification states ui.

Appendix 1—figure 2
Example of HHMM where gi{B",T"}, hi{curr",prev",next",noise"} and ci{B",T"}.

In this figure, ‘c’, ‘p’, ‘n’, and ‘o’ represent ‘curr’, ‘prev’, ‘next’, and ‘noise’ respectively in h. |y|=|g|=|h|and|c|=11.

Appendix 1—figure 3
Alignment of signal segments with reference sequence.
Appendix 1—figure 4
Estimation of 5-mer parameters.

Tkmer is initialized using Tref. In training, our algorithm infers the parameters iteratively. After training, Tref is fixed in testing.

Appendix 1—figure 5
Illustration of segmentation and event alignment.
Appendix 2—figure 1
Illustration of enumeration of hidden states of outer HMM.

(a) Raw signal and initialized to atomic segmentations. (b) Examples of the full configuration set G=(g1,g2,,gi,,gM).

Appendix 2—figure 2
Illustration of GPU-accelerated parameter inference.
Author response image 1

Tables

Table 1
Segmentation benchmark on RNA002 data.
TestDatasetAvg. std (σ^) ↓Avg. log p (L^) ↑
NanopolishTomboSegPoreNanopolishTomboSegPore
HEK293T(WT)3.0734.1872.736–2.871–3.749–2.778
HEK293T(KO)2.9484.2042.670–2.856–3.749–2.745
HCT1163.1674.0762.872–2.872–3.704–2.746
Table 2
Segmentation benchmark on the RNA002 data excluding boundaries.
TestDatasetAvg. std (σ^) ↓Avg. log p (L^) ↑
NanopolishTomboSegPoreNanopolishTomboSegPore
HEK293T(WT)2.8624.0902.736–2.838–3.772–2.778
HEK293T(KO)2.8564.1202.670–2.878–3.729–2.745
HCT1163.0584.0532.872–2.869–3.708–2.746
Table 3
Segmentation benchmark on RNA004 data.
TestDatasetAvg. std (σ^) ↓Avg. log p (L^) ↑
f5cUncalled4SegPoref5cUncalled4SegPore
S. cerevisiae3.1293.9703.125–2.541–3.835–2.489
curlcake(IVT)3.4244.7293.359–2.716–2.904–2.515
curlcake(m6A)3.5204.9713.392–3.118–3.524–2.599
Appendix 1—table 1
Example of Tkmer (1024×4).

‘-’ represents ‘NA’.

modle_kmerun_muun_sigmamod_mumod_sigma
AAAAA108.92.68--
AAAAC107.752.68--
AAAAG101.722.68--
AAAAT112.772.68--
AAACA99.382.4195.282.44
AAACC1001.3296.822.42
AAACG101.013.45--
AAACT106.912.18101.982.29
AAAGA110.544.06--
AAAGC107.694.06--
AAAGG108.294.06--
AAAGT108.734.06--
AAATA114.113.11--
...............
TTTTT80.781.97--
Appendix 1—table 2
Hidden states of inner HMM for the bi th block.
StateEventEmission distributionParameter estimation
‘curr’current 5-mer resides in poreN(ϕ(bi),(σ(bi))2)Yes
‘prev’previous 5-mer resides in poreN(ϕ(bi1),(σ(bi1))2)Yes
‘next’next 5-mer resides in poreN(ϕ(bi+1),(σ(bi+1))2)Yes
‘noise’unknown noiseUnif(lb,ub)No
Appendix 1—table 3
SegPore eventalign output example.
read_idxread_namecontigposstrandref_kmermodel_kmermeanstdvstart_idxend_idxevent_len
80741003655f4-ff77-409d-a29e-20dc778d1c47A157+GGTGTGGTGT79.3392.414472964732118
80741003655f4-ff77-409d-a29e-20dc778d1c47A158+GTGTCGTGTC89.8093.579472764729214
80741003655f4-ff77-409d-a29e-20dc778d1c47A159+TGTCTTGTCT117.0176.287472514727218
80741003655f4-ff77-409d-a29e-20dc778d1c47A160+GTCTTGTCTT76.4481.038471484724787
80741003655f4-ff77-409d-a29e-20dc778d1c47A161+TCTTATCTTA76.0361.044471174714425
80741003655f4-ff77-409d-a29e-20dc778d1c47A162+CTTAGCTTAG78.8861.127470924711321
80741003655f4-ff77-409d-a29e-20dc778d1c47A163+TTAGTTTAGT92.9712.307470074708873
80741003655f4-ff77-409d-a29e-20dc778d1c47A164+TAGTGTAGTG121.1294.2468824698390
80741003655f4-ff77-409d-a29e-20dc778d1c47A165+AGTGTAGTGT90.224.554468224687833
80741003655f4-ff77-409d-a29e-20dc778d1c47A167+TGTGCTGTGC104.5662.56346798468188
80741003655f4-ff77-409d-a29e-20dc778d1c47A168+GTGCTGTGCT91.5132.51467784679415
80741003655f4-ff77-409d-a29e-20dc778d1c47A169+TGCTTTGCTT111.4854.001467544677410
Appendix 2—table 1
Transition parameters of the outer HMM (Touter).
‘B’‘T’
‘B’0.990.01
‘T’0.100.90
Appendix 2—table 2
Transition parameters of the inner HMM (Tinner).
‘curr’‘prev’‘next’‘noise’
‘curr’0.9250.0250.0250.025
‘prev’0.3000.5000.1000.100
‘next’0.3000.1000.5000.100
‘noise’0.3000.1000.1000.500
Appendix 2—table 3
Parameters of the inner HMM for the bi th block.
ParameterFixed parameterDescription
TinnerYesTransition probabilities between hidden states
(‘curr’, ‘prev’, ‘next’, ‘noise’), specified in Appendix 2—table 2
πkinnerYesprobability of the first hidden state, πkinner=0.25
, wherek{1,2,3,4}
lbYeslower bound of uniform distribution,
lb=50
ubYesupper bound of uniform distribution,ub=130
ϕ(bi)NoGaussian mean ofith base block
σ(bi)NoGaussian std ofi th base block
ϕ(bi1)NoGaussian mean of i1 th base block
σ(bi1)NoGaussian std of i1 th base block
ϕ(bi+1)NoGaussian mean of i+1 th base block
σ(bi+1)NoGaussian std of i+1 th base block

Additional files

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Guangzhao Cheng
  2. Aki Vehtari
  3. Lu Cheng
(2026)
Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data
eLife 14:RP104618.
https://doi.org/10.7554/eLife.104618.4