Figures and data in Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data

Figures
Tables
Additional files

12 figures, 9 tables and 1 additional file

Figures

Figure 1

Download asset Open asset

SegPore workflow.

(A) General workflow. The workflow consists of five steps: (1) First, raw current signals are basecalled and mapped using Guppy and Minimap2. The raw current signal fragments are paired with the corresponding reference RNA sequence fragments using Nanopolish. (2) Next, the raw current signal of each read is segmented using a hierarchical hidden Markov model (HHMM), which provides an estimated mean ( $μ_{i}$ ) for each segment. (3) These segments are then aligned with the 5-mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5-mer parameter table. For example, $A_{j}$ denotes the base ‘ $A$ ‘ at the $j$ -th position on the reference. In this example, $A_{1}$ and $A_{2}$ refer to the first and second occurrences of ‘ $A$ ’ in the reference sequence, respectively. Accordingly, $μ_{1}$ and $μ_{2}$ are aligned to $A_{1}$ , while $μ_{3}$ is aligned to $A_{2}$ (4). All signals aligned to the same 5-mer across different genomic locations are pooled together, and a two-component Gaussian Mixture Model (GMM) is used to predict the modification at the site-level or single-molecule level. One component of the GMM represents the unmodified state, while the other represents the modified state. (5) GMM is used to re-estimate the 5-mer parameter table. (B) Hierarchical hidden Markov model (HHMM). The outer HMM segments the current signal into alternating base and transition blocks. The inner HMM approximates the emission probability of a base block by considering neighboring 5-mers. A linear model is used to approximate the emission probability of a transition block. (C) Full/partial alignment algorithms. Rows represent the estimated means of base blocks from the HHMM, and columns represent the 5-mers of the reference sequence. Each 5-mer can be aligned with multiple estimated means from the current signal. (D) Gaussian mixture model (GMM) for estimating modification states. The GMM consists of two components: the green component models the unmodified state of a 5-mer, and the blue component models the modified state. Each component is described by three parameters: mean (μ), standard deviation ( $σ$ ), and weight ( $ω$ ).

Figure 2 with 1 supplement

Download asset Open asset

RNA translocation hypothesis.

(A) Jiggling RNA translocation hypothesis. The top panel shows the raw current signal of Nanopore direct RNA sequencing, with gray areas representing SegPore-estimated transition blocks. We focus on three neighboring 5-mers, considering the central 5-mer (CTACG) as the current 5-mer. The RNA molecule may briefly move forward or backward during the translocation of the current 5-mer. If the RNA molecule is pulled backward, the previous 5-mer is placed in the pore, and the current signal (“prev” state, red dots) resembles the previous 5-mer’s baseline (mean and standard deviation highlighted by red lines and shades). If the RNA is pushed forward, the current signal (“next” state, blue dots) is similar to the next 5-mer’s baseline. (B) Example raw current signals supporting the jiggling hypothesis. The dashed rectangles highlight base blocks, with red and blue points representing measurements corresponding to the previous and next 5-mer, respectively. Red points align closely with the previous 5-mer’s baseline, and blue points match the next 5-mer’s baseline, reinforcing the hypothesis that the RNA molecule jiggles between neighboring 5-mers. The raw current signals were extracted from mESC WT samples of the training data in the m6A benchmark experiment.

Figure 2—figure supplement 1

Download asset Open asset

RNA translocation hypothesis illustrated on RNA004 data.

(A) Example from the *S. cerevisiae* sample (SRA: SRP527927). (B) Example from the curlcake_unmod sample (PRJEB82528). (C) Example from the curlcake_m6A sample (PRJEB82528). In all panels, the x-axis shows time, and the y-axis shows raw current signal intensity (pA, transformed using f5c scale parameters). Vertical lines mark the boundaries between segments (transition blocks). Horizontal shaded areas indicate the estimated standard deviation for each base block, while the red horizontal line shows the mean signal level for that block. Each base block is annotated with four possible states: “prev” (red dot), “next” (blue dot), “curr,” and “noise” (black triangle).

Figure 3 with 3 supplements

Download asset Open asset

m6A identification at the site level.

(A) Histogram of the estimated mean from current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (B) Histogram of the estimated mean from current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (C) Site-level benchmark results for m6A identification across all DRACH motifs, showing performance comparisons between SegPore+m6Anet and other methods. (D) Benchmark results for m6A identification on six selected motifs at the site level, comparing SegPore and other baseline methods.

Figure 3—source data 1 site_mod_rate_mES_WT_all_motifs.: https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-data1-v1.csv
Download elife-104618-fig3-data1-v1.csv
Figure 3—source data 2 site_mod_rate_mES_WT_selected_motifs.: https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-data2-v1.csv
Download elife-104618-fig3-data2-v1.csv

Figure 3—figure supplement 1

Download asset Open asset

m6A k-mer statistics differences between the human HEK293T (training data) and the mouse mES_WT (test data).

(A) Number of modification sites for each selected 5-mer (in total 10 motif 5-mers) based on the ground truth of the human data (ENA: PRJEB40872). (B) Number of modification sites for all 5-mers (in total 11 motif 5-mers) based on the ground truth of the mouse data (SRA: SRP166020). The ground truth is obtained by taking the union of MeRIP-seq, miCLIP, and miCLIP2. (C) Venn diagram of the two 5-mer motif sets (training data and test data).

Figure 3—figure supplement 1—source data 1 xpore kmer statistics.: https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-figsupp1-data1-v1.csv
Download elife-104618-fig3-figsupp1-data1-v1.csv
Figure 3—figure supplement 1—source data 2 mes kmer statistics.: https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-figsupp1-data2-v1.csv
Download elife-104618-fig3-figsupp1-data2-v1.csv

Figure 3—figure supplement 2

Download asset Open asset

Comparison of raw signal segmentation results between SegPore and Nanopolish on RNA002 data.

The same raw signal clip is used for both SegPore (top panel) and Nanopolish (bottom panel). Two example raw signal clips (**A, B**) are taken from the mES_WT sample (SRA: SRP166020) in the m6A identification section. The x-axis represents time, and the y-axis represents the raw current signal intensity (pA). The vertical bars or lines indicate the borders between segments (transition blocks). The horizontal shaded areas represent the estimated standard deviation of each base block, and the red horizontal line indicates the mean of each base block. There are four states in the base blocks: ‘prev’ (red dot), ‘next’ (blue dot), ‘curr’, and ‘noise’ (black triangle).

Figure 3—figure supplement 3

Download asset Open asset

Comparison of eventalign of SegPore and Nanopolish on consecutive 5-mers.

The figures show the distribution of estimated segment means for each k-mer in transcript locations A2 (position 473–477) in IVT_m6A and IVT_normalA samples (SRA: SRP166020). All reads aligned to these 5-mers are used to generate the marginal distribution. The third row is the overlay of the distributions of IVT_m6A and IVT_normalA samples. m6A is located at different positions of the given 5-mer. It can be seen that the peaks are relatively well separated for the third and fourth 5-mers in SegPore’s result, but not so clear in Nanopolish’s result.

Figure 4 with 2 supplements

Download asset Open asset

m6A identification at the single-molecule level.

(A) Benchmark results for single-molecule m6A identification on IVT data. SegPore shows better performance compared to CHEUI in both PR-AUC and ROC-AUC. (B) Comparison of ‘eventalign’ results from SegPore and Nanopolish for five consecutive k-mers. Note that DRS is sequenced from 3’ to 5’, so the k-mers enter the pore from right to left. A total of 100 reads were randomly sampled from transcript locations A1 (positions 711–719) in both the IVT_normalA and IVT_m6A samples (SRA: SRP166020). Each line represents an individual read, and the y-axis shows the raw signal intensity in picoampere (pA). Pink lines represent the IVT_m6A sample, and gray lines represent the IVT_normalA sample. The k-mers ‘GCGGA’, ‘CGGAC’, ‘GGACT’, ‘GACTT’, and ‘ACTTT’ all contain N6-Methyladenosine (m6A) in the IVT_m6A sample. SegPore’s results show clearer separation between m6A and adenosine, especially for ‘CGGAC’ and ‘GGACT’, compared to Nanopolish. (C) The upper panel shows the modification rate for selected genomic locations in the example gene ENSMUSG00000003153. The lower panel displays the modification states of all reads mapped to this gene. The black borders in the heatmap highlight the biclustering results, showing distinct modification patterns across different read clusters labeled C1 through C6.

Figure 4—source data 1 ENSMUSG00000003153.10.info.: https://cdn.elifesciences.org/articles/104618/elife-104618-fig4-data1-v1.csv
Download elife-104618-fig4-data1-v1.csv
Figure 4—source data 2 ENSMUSG00000003153.10.reads.: https://cdn.elifesciences.org/articles/104618/elife-104618-fig4-data2-v1.csv
Download elife-104618-fig4-data2-v1.csv

Figure 4—figure supplement 1

Download asset Open asset

Illustration of Nanopolish raw signal segmentation with eventalign results.

This figure corresponds to Figure 4B (Nanopolish Eventalign) in the manuscript. (A) Visualization without color differentiation: IVT_m6A and IVT_normalA samples are shown in the same color. (B) Visualization with color differentiation: IVT_m6A and IVT_normalA samples are shown in distinct colors for clarity. In both panels, the top section displays five consecutive k-mers, while the bottom section provides a zoomed-in view of the central GGACT k-mer. Without color contrast, baseline differences in the GGACT signals between m6A and normal A samples are not easily detectable, as the baselines appear closely overlapping. Furthermore, identifying other subtle signal features that differentiate the two GGACT states—features leveraged by deep neural networks—remains challenging for field researchers.

Figure 4—figure supplement 2

Download asset Open asset

Running time of SegPore on datasets of varying sizes.

The figure shows the total duration required for the segmentation task using a single NVIDIA DGX-1 V100 GPU and one CPU core with 32 GB system memory allocated.

Appendix 1—figure 1

Download asset Open asset

SegPore workflow with a fixed 5-mer parameter table $T^{k m e r}$ .

The input is the raw current signal $y_{i}$ and reference sequence $r_{i}$ for ith read. The output is the paired current signal segments µ_i, 5-mer list $s_{i}$ and modification states $u_{i}$ .

Appendix 1—figure 2

Download asset Open asset

Example of HHMM where $g_{i} \in {‘ ‘ B ", ‘ ‘ T "}$ , $h_{i} \in {‘ ‘ c u r r ", ‘ ‘ p r e v ", ‘ ‘ n e x t ", ‘ ‘ n o i s e "}$ and $c_{i} \in {‘ ‘ B ", ‘ ‘ T "}$ .

In this figure, ‘c’, ‘p’, ‘n’, and ‘o’ represent ‘curr’, ‘prev’, ‘next’, and ‘noise’ respectively in $h$ . $| y | = | g | = | h | a n d | c | = 11$ .

Appendix 1—figure 3

Download asset Open asset

Alignment of signal segments with reference sequence.

Appendix 1—figure 4

Download asset Open asset

Estimation of 5-mer parameters.

$T^{k m e r}$ is initialized using $T^{r e f}$ . In training, our algorithm infers the parameters iteratively. After training, $T^{r e f}$ is fixed in testing.

Appendix 1—figure 5

Download asset Open asset

Illustration of segmentation and event alignment.

Appendix 2—figure 1

Download asset Open asset

Illustration of enumeration of hidden states of outer HMM.

(a) Raw signal and initialized to atomic segmentations. (b) Examples of the full configuration set $G = (g_{1}, g_{2}, \dots, g_{i}, \dots, g_{M})$ .

Appendix 2—figure 2

Download asset Open asset

Illustration of GPU-accelerated parameter inference.

Author response image 1

Download asset Open asset

Tables

Table 1

Segmentation benchmark on RNA002 data.

TestDataset	Avg. std ( $\hat{σ}$ ) ↓			Avg. log p ( $\hat{L}$ ) ↑
TestDataset	Nanopolish	Tombo	SegPore	Nanopolish	Tombo	SegPore
HEK293T(WT)	3.073	4.187	2.736	–2.871	–3.749	–2.778
HEK293T(KO)	2.948	4.204	2.670	–2.856	–3.749	–2.745
HCT116	3.167	4.076	2.872	–2.872	–3.704	–2.746

Table 2

Segmentation benchmark on the RNA002 data excluding boundaries.

TestDataset	Avg. std ( $\hat{σ}$ ) ↓			Avg. log p ( $\hat{L}$ ) ↑
TestDataset	Nanopolish	Tombo	SegPore	Nanopolish	Tombo	SegPore
HEK293T(WT)	2.862	4.090	2.736	–2.838	–3.772	–2.778
HEK293T(KO)	2.856	4.120	2.670	–2.878	–3.729	–2.745
HCT116	3.058	4.053	2.872	–2.869	–3.708	–2.746

Table 3

Segmentation benchmark on RNA004 data.

TestDataset	Avg. std ( $\hat{σ}$ ) ↓			Avg. log p ( $\hat{L}$ ) ↑
TestDataset	f5c	Uncalled4	SegPore	f5c	Uncalled4	SegPore
S. cerevisiae	3.129	3.970	3.125	–2.541	–3.835	–2.489
curlcake(IVT)	3.424	4.729	3.359	–2.716	–2.904	–2.515
curlcake(m6A)	3.520	4.971	3.392	–3.118	–3.524	–2.599

Appendix 1—table 1

Example of T^kmer (1024×4).

‘-’ represents ‘NA’.

modle_kmer	un_mu	un_sigma	mod_mu	mod_sigma
AAAAA	108.9	2.68	-	-
AAAAC	107.75	2.68	-	-
AAAAG	101.72	2.68	-	-
AAAAT	112.77	2.68	-	-
AAACA	99.38	2.41	95.28	2.44
AAACC	100	1.32	96.82	2.42
AAACG	101.01	3.45	-	-
AAACT	106.91	2.18	101.98	2.29
AAAGA	110.54	4.06	-	-
AAAGC	107.69	4.06	-	-
AAAGG	108.29	4.06	-	-
AAAGT	108.73	4.06	-	-
AAATA	114.11	3.11	-	-
...	...	...	...	...
TTTTT	80.78	1.97	-	-

Appendix 1—table 2

Hidden states of inner HMM for the $b_{i}$ th block.

State	Event	Emission distribution	Parameter estimation
‘curr’	current 5-mer resides in pore	$N (ϕ^{(b_{i})}, (σ^{(b_{i})})^{2})$	Yes
‘prev’	previous 5-mer resides in pore	$N (ϕ^{(b_{i - 1})}, (σ^{(b_{i - 1})})^{2})$	Yes
‘next’	next 5-mer resides in pore	$N (ϕ^{(b_{i + 1})}, (σ^{(b_{i + 1})})^{2})$	Yes
‘noise’	unknown noise	$U n i f (l b, u b)$	No

Appendix 1—table 3

SegPore eventalign output example.

read_idx	read_name	contig	pos	strand	ref_kmer	model_kmer	mean	stdv	start_idx	end_idx	event_len
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	57	+	GGTGT	GGTGT	79.339	2.414	47296	47321	18
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	58	+	GTGTC	GTGTC	89.809	3.579	47276	47292	14
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	59	+	TGTCT	TGTCT	117.017	6.287	47251	47272	18
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	60	+	GTCTT	GTCTT	76.448	1.038	47148	47247	87
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	61	+	TCTTA	TCTTA	76.036	1.044	47117	47144	25
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	62	+	CTTAG	CTTAG	78.886	1.127	47092	47113	21
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	63	+	TTAGT	TTAGT	92.971	2.307	47007	47088	73
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	64	+	TAGTG	TAGTG	121.129	4.2	46882	46983	90
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	65	+	AGTGT	AGTGT	90.22	4.554	46822	46878	33
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	67	+	TGTGC	TGTGC	104.566	2.563	46798	46818	8
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	68	+	GTGCT	GTGCT	91.513	2.51	46778	46794	15
80741	003655f4-ff77-409d-a29e-20dc778d1c47	A1	69	+	TGCTT	TGCTT	111.485	4.001	46754	46774	10

Appendix 2—table 1

Transition parameters of the outer HMM ( $T^{o u t e r}$ ).

	‘B’	‘T’
‘B’	0.99	0.01
‘T’	0.10	0.90

Appendix 2—table 2

Transition parameters of the inner HMM ( $T^{i n n e r}$ ).

	‘curr’	‘prev’	‘next’	‘noise’
‘curr’	0.925	0.025	0.025	0.025
‘prev’	0.300	0.500	0.100	0.100
‘next’	0.300	0.100	0.500	0.100
‘noise’	0.300	0.100	0.100	0.500

Appendix 2—table 3

Parameters of the inner HMM for the $b_{i}$ th block.

Parameter	Fixed parameter	Description
$T^{i n n e r}$	Yes	Transition probabilities between hidden states (‘curr’, ‘prev’, ‘next’, ‘noise’), specified in Appendix 2—table 2
$π_{k}^{i n n e r}$	Yes	probability of the first hidden state, $π_{k}^{i n n e r} = 0.25$ , where $k \in {1, 2, 3, 4}$
$l b$	Yes	lower bound of uniform distribution, $l b = 50$
$u b$	Yes	upper bound of uniform distribution, $u b = 130$
$ϕ^{(b_{i})}$	No	Gaussian mean ofith base block
$σ^{(b_{i})}$	No	Gaussian std ofi th base block
$ϕ^{(b_{i - 1})}$	No	Gaussian mean of $i - 1$ th base block
$σ^{(b_{i - 1})}$	No	Gaussian std of $i - 1$ th base block
$ϕ^{(b_{i + 1})}$	No	Gaussian mean of $i + 1$ th base block
$σ^{(b_{i + 1})}$	No	Gaussian std of $i + 1$ th base block