Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data
Figures
SegPore workflow.
(A) General workflow. The workflow consists of five steps: (1) First, raw current signals are basecalled and mapped using Guppy and Minimap2. The raw current signal fragments are paired with the corresponding reference RNA sequence fragments using Nanopolish. (2) Next, the raw current signal of each read is segmented using a hierarchical hidden Markov model (HHMM), which provides an estimated mean () for each segment. (3) These segments are then aligned with the 5-mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5-mer parameter table. For example, denotes the base ‘‘ at the -th position on the reference. In this example, and refer to the first and second occurrences of ‘’ in the reference sequence, respectively. Accordingly, and are aligned to , while is aligned to (4). All signals aligned to the same 5-mer across different genomic locations are pooled together, and a two-component Gaussian Mixture Model (GMM) is used to predict the modification at the site-level or single-molecule level. One component of the GMM represents the unmodified state, while the other represents the modified state. (5) GMM is used to re-estimate the 5-mer parameter table. (B) Hierarchical hidden Markov model (HHMM). The outer HMM segments the current signal into alternating base and transition blocks. The inner HMM approximates the emission probability of a base block by considering neighboring 5-mers. A linear model is used to approximate the emission probability of a transition block. (C) Full/partial alignment algorithms. Rows represent the estimated means of base blocks from the HHMM, and columns represent the 5-mers of the reference sequence. Each 5-mer can be aligned with multiple estimated means from the current signal. (D) Gaussian mixture model (GMM) for estimating modification states. The GMM consists of two components: the green component models the unmodified state of a 5-mer, and the blue component models the modified state. Each component is described by three parameters: mean (μ), standard deviation (), and weight ().
RNA translocation hypothesis.
(A) Jiggling RNA translocation hypothesis. The top panel shows the raw current signal of Nanopore direct RNA sequencing, with gray areas representing SegPore-estimated transition blocks. We focus on three neighboring 5-mers, considering the central 5-mer (CTACG) as the current 5-mer. The RNA molecule may briefly move forward or backward during the translocation of the current 5-mer. If the RNA molecule is pulled backward, the previous 5-mer is placed in the pore, and the current signal (“prev” state, red dots) resembles the previous 5-mer’s baseline (mean and standard deviation highlighted by red lines and shades). If the RNA is pushed forward, the current signal (“next” state, blue dots) is similar to the next 5-mer’s baseline. (B) Example raw current signals supporting the jiggling hypothesis. The dashed rectangles highlight base blocks, with red and blue points representing measurements corresponding to the previous and next 5-mer, respectively. Red points align closely with the previous 5-mer’s baseline, and blue points match the next 5-mer’s baseline, reinforcing the hypothesis that the RNA molecule jiggles between neighboring 5-mers. The raw current signals were extracted from mESC WT samples of the training data in the m6A benchmark experiment.
RNA translocation hypothesis illustrated on RNA004 data.
(A) Example from the S. cerevisiae sample (SRA: SRP527927). (B) Example from the curlcake_unmod sample (PRJEB82528). (C) Example from the curlcake_m6A sample (PRJEB82528). In all panels, the x-axis shows time, and the y-axis shows raw current signal intensity (pA, transformed using f5c scale parameters). Vertical lines mark the boundaries between segments (transition blocks). Horizontal shaded areas indicate the estimated standard deviation for each base block, while the red horizontal line shows the mean signal level for that block. Each base block is annotated with four possible states: “prev” (red dot), “next” (blue dot), “curr,” and “noise” (black triangle).
m6A identification at the site level.
(A) Histogram of the estimated mean from current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (B) Histogram of the estimated mean from current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (C) Site-level benchmark results for m6A identification across all DRACH motifs, showing performance comparisons between SegPore+m6Anet and other methods. (D) Benchmark results for m6A identification on six selected motifs at the site level, comparing SegPore and other baseline methods.
-
Figure 3—source data 1
site_mod_rate_mES_WT_all_motifs.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-data1-v1.csv
-
Figure 3—source data 2
site_mod_rate_mES_WT_selected_motifs.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-data2-v1.csv
m6A k-mer statistics differences between the human HEK293T (training data) and the mouse mES_WT (test data).
(A) Number of modification sites for each selected 5-mer (in total 10 motif 5-mers) based on the ground truth of the human data (ENA: PRJEB40872). (B) Number of modification sites for all 5-mers (in total 11 motif 5-mers) based on the ground truth of the mouse data (SRA: SRP166020). The ground truth is obtained by taking the union of MeRIP-seq, miCLIP, and miCLIP2. (C) Venn diagram of the two 5-mer motif sets (training data and test data).
-
Figure 3—figure supplement 1—source data 1
xpore kmer statistics.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-figsupp1-data1-v1.csv
-
Figure 3—figure supplement 1—source data 2
mes kmer statistics.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-figsupp1-data2-v1.csv
Comparison of raw signal segmentation results between SegPore and Nanopolish on RNA002 data.
The same raw signal clip is used for both SegPore (top panel) and Nanopolish (bottom panel). Two example raw signal clips (A, B) are taken from the mES_WT sample (SRA: SRP166020) in the m6A identification section. The x-axis represents time, and the y-axis represents the raw current signal intensity (pA). The vertical bars or lines indicate the borders between segments (transition blocks). The horizontal shaded areas represent the estimated standard deviation of each base block, and the red horizontal line indicates the mean of each base block. There are four states in the base blocks: ‘prev’ (red dot), ‘next’ (blue dot), ‘curr’, and ‘noise’ (black triangle).
Comparison of eventalign of SegPore and Nanopolish on consecutive 5-mers.
The figures show the distribution of estimated segment means for each k-mer in transcript locations A2 (position 473–477) in IVT_m6A and IVT_normalA samples (SRA: SRP166020). All reads aligned to these 5-mers are used to generate the marginal distribution. The third row is the overlay of the distributions of IVT_m6A and IVT_normalA samples. m6A is located at different positions of the given 5-mer. It can be seen that the peaks are relatively well separated for the third and fourth 5-mers in SegPore’s result, but not so clear in Nanopolish’s result.
m6A identification at the single-molecule level.
(A) Benchmark results for single-molecule m6A identification on IVT data. SegPore shows better performance compared to CHEUI in both PR-AUC and ROC-AUC. (B) Comparison of ‘eventalign’ results from SegPore and Nanopolish for five consecutive k-mers. Note that DRS is sequenced from 3’ to 5’, so the k-mers enter the pore from right to left. A total of 100 reads were randomly sampled from transcript locations A1 (positions 711–719) in both the IVT_normalA and IVT_m6A samples (SRA: SRP166020). Each line represents an individual read, and the y-axis shows the raw signal intensity in picoampere (pA). Pink lines represent the IVT_m6A sample, and gray lines represent the IVT_normalA sample. The k-mers ‘GCGGA’, ‘CGGAC’, ‘GGACT’, ‘GACTT’, and ‘ACTTT’ all contain N6-Methyladenosine (m6A) in the IVT_m6A sample. SegPore’s results show clearer separation between m6A and adenosine, especially for ‘CGGAC’ and ‘GGACT’, compared to Nanopolish. (C) The upper panel shows the modification rate for selected genomic locations in the example gene ENSMUSG00000003153. The lower panel displays the modification states of all reads mapped to this gene. The black borders in the heatmap highlight the biclustering results, showing distinct modification patterns across different read clusters labeled C1 through C6.
-
Figure 4—source data 1
ENSMUSG00000003153.10.info.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig4-data1-v1.csv
-
Figure 4—source data 2
ENSMUSG00000003153.10.reads.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig4-data2-v1.csv
Illustration of Nanopolish raw signal segmentation with eventalign results.
This figure corresponds to Figure 4B (Nanopolish Eventalign) in the manuscript. (A) Visualization without color differentiation: IVT_m6A and IVT_normalA samples are shown in the same color. (B) Visualization with color differentiation: IVT_m6A and IVT_normalA samples are shown in distinct colors for clarity. In both panels, the top section displays five consecutive k-mers, while the bottom section provides a zoomed-in view of the central GGACT k-mer. Without color contrast, baseline differences in the GGACT signals between m6A and normal A samples are not easily detectable, as the baselines appear closely overlapping. Furthermore, identifying other subtle signal features that differentiate the two GGACT states—features leveraged by deep neural networks—remains challenging for field researchers.
Running time of SegPore on datasets of varying sizes.
The figure shows the total duration required for the segmentation task using a single NVIDIA DGX-1 V100 GPU and one CPU core with 32 GB system memory allocated.
SegPore workflow with a fixed 5-mer parameter table .
The input is the raw current signal and reference sequence for ith read. The output is the paired current signal segments µi, 5-mer list and modification states .
Example of HHMM where , and .
In this figure, ‘c’, ‘p’, ‘n’, and ‘o’ represent ‘curr’, ‘prev’, ‘next’, and ‘noise’ respectively in . .
Estimation of 5-mer parameters.
is initialized using . In training, our algorithm infers the parameters iteratively. After training, is fixed in testing.
Illustration of enumeration of hidden states of outer HMM.
(a) Raw signal and initialized to atomic segmentations. (b) Examples of the full configuration set .
Tables
Segmentation benchmark on RNA002 data.
| TestDataset | Avg. std () ↓ | Avg. log p () ↑ | ||||
|---|---|---|---|---|---|---|
| Nanopolish | Tombo | SegPore | Nanopolish | Tombo | SegPore | |
| HEK293T(WT) | 3.073 | 4.187 | 2.736 | –2.871 | –3.749 | –2.778 |
| HEK293T(KO) | 2.948 | 4.204 | 2.670 | –2.856 | –3.749 | –2.745 |
| HCT116 | 3.167 | 4.076 | 2.872 | –2.872 | –3.704 | –2.746 |
Segmentation benchmark on the RNA002 data excluding boundaries.
| TestDataset | Avg. std () ↓ | Avg. log p () ↑ | ||||
|---|---|---|---|---|---|---|
| Nanopolish | Tombo | SegPore | Nanopolish | Tombo | SegPore | |
| HEK293T(WT) | 2.862 | 4.090 | 2.736 | –2.838 | –3.772 | –2.778 |
| HEK293T(KO) | 2.856 | 4.120 | 2.670 | –2.878 | –3.729 | –2.745 |
| HCT116 | 3.058 | 4.053 | 2.872 | –2.869 | –3.708 | –2.746 |
Segmentation benchmark on RNA004 data.
| TestDataset | Avg. std () ↓ | Avg. log p () ↑ | ||||
|---|---|---|---|---|---|---|
| f5c | Uncalled4 | SegPore | f5c | Uncalled4 | SegPore | |
| S. cerevisiae | 3.129 | 3.970 | 3.125 | –2.541 | –3.835 | –2.489 |
| curlcake(IVT) | 3.424 | 4.729 | 3.359 | –2.716 | –2.904 | –2.515 |
| curlcake(m6A) | 3.520 | 4.971 | 3.392 | –3.118 | –3.524 | –2.599 |
Example of Tkmer (1024×4).
‘-’ represents ‘NA’.
| modle_kmer | un_mu | un_sigma | mod_mu | mod_sigma |
|---|---|---|---|---|
| AAAAA | 108.9 | 2.68 | - | - |
| AAAAC | 107.75 | 2.68 | - | - |
| AAAAG | 101.72 | 2.68 | - | - |
| AAAAT | 112.77 | 2.68 | - | - |
| AAACA | 99.38 | 2.41 | 95.28 | 2.44 |
| AAACC | 100 | 1.32 | 96.82 | 2.42 |
| AAACG | 101.01 | 3.45 | - | - |
| AAACT | 106.91 | 2.18 | 101.98 | 2.29 |
| AAAGA | 110.54 | 4.06 | - | - |
| AAAGC | 107.69 | 4.06 | - | - |
| AAAGG | 108.29 | 4.06 | - | - |
| AAAGT | 108.73 | 4.06 | - | - |
| AAATA | 114.11 | 3.11 | - | - |
| ... | ... | ... | ... | ... |
| TTTTT | 80.78 | 1.97 | - | - |
Hidden states of inner HMM for the th block.
| State | Event | Emission distribution | Parameter estimation |
|---|---|---|---|
| ‘curr’ | current 5-mer resides in pore | Yes | |
| ‘prev’ | previous 5-mer resides in pore | Yes | |
| ‘next’ | next 5-mer resides in pore | Yes | |
| ‘noise’ | unknown noise | No |
SegPore eventalign output example.
| read_idx | read_name | contig | pos | strand | ref_kmer | model_kmer | mean | stdv | start_idx | end_idx | event_len |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 57 | + | GGTGT | GGTGT | 79.339 | 2.414 | 47296 | 47321 | 18 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 58 | + | GTGTC | GTGTC | 89.809 | 3.579 | 47276 | 47292 | 14 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 59 | + | TGTCT | TGTCT | 117.017 | 6.287 | 47251 | 47272 | 18 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 60 | + | GTCTT | GTCTT | 76.448 | 1.038 | 47148 | 47247 | 87 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 61 | + | TCTTA | TCTTA | 76.036 | 1.044 | 47117 | 47144 | 25 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 62 | + | CTTAG | CTTAG | 78.886 | 1.127 | 47092 | 47113 | 21 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 63 | + | TTAGT | TTAGT | 92.971 | 2.307 | 47007 | 47088 | 73 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 64 | + | TAGTG | TAGTG | 121.129 | 4.2 | 46882 | 46983 | 90 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 65 | + | AGTGT | AGTGT | 90.22 | 4.554 | 46822 | 46878 | 33 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 67 | + | TGTGC | TGTGC | 104.566 | 2.563 | 46798 | 46818 | 8 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 68 | + | GTGCT | GTGCT | 91.513 | 2.51 | 46778 | 46794 | 15 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 69 | + | TGCTT | TGCTT | 111.485 | 4.001 | 46754 | 46774 | 10 |
Transition parameters of the outer HMM ().
| ‘B’ | ‘T’ | |
|---|---|---|
| ‘B’ | 0.99 | 0.01 |
| ‘T’ | 0.10 | 0.90 |
Transition parameters of the inner HMM ().
| ‘curr’ | ‘prev’ | ‘next’ | ‘noise’ | |
|---|---|---|---|---|
| ‘curr’ | 0.925 | 0.025 | 0.025 | 0.025 |
| ‘prev’ | 0.300 | 0.500 | 0.100 | 0.100 |
| ‘next’ | 0.300 | 0.100 | 0.500 | 0.100 |
| ‘noise’ | 0.300 | 0.100 | 0.100 | 0.500 |
Parameters of the inner HMM for the th block.
| Parameter | Fixed parameter | Description |
|---|---|---|
| Yes | Transition probabilities between hidden states (‘curr’, ‘prev’, ‘next’, ‘noise’), specified in Appendix 2—table 2 | |
| Yes | probability of the first hidden state, , where | |
| Yes | lower bound of uniform distribution, | |
| Yes | upper bound of uniform distribution, | |
| No | Gaussian mean ofith base block | |
| No | Gaussian std ofi th base block | |
| No | Gaussian mean of th base block | |
| No | Gaussian std of th base block | |
| No | Gaussian mean of th base block | |
| No | Gaussian std of th base block |