Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data

  1. Guangzhao Cheng
  2. Aki Vehtari
  3. Lu Cheng (corresponding author)
  1. Department of Computer Science, Aalto University, Finland
  2. Institute of Biomedicine, University of Eastern Finland, Finland

eLife Assessment

This study presents SegPore, a valuable new method for processing direct RNA nanopore sequencing data, which improves the segmentation of raw signals into individual bases and boosts the accuracy of modified base detection. The evidence presented to benchmark SegPore is solid, and the authors provide a fully documented implementation of the method. SegPore will be of particular interest to researchers studying RNA modifications.

https://doi.org/10.7554/eLife.104618.4.sa0

Abstract

Estimating RNA modifications from Nanopore direct RNA sequencing data is a critical task for the RNA research community. However, current computational methods often fail to deliver satisfactory results due to inaccurate segmentation of the raw signal. We have developed a new method, SegPore, which leverages a molecular jiggling translocation hypothesis to improve raw signal segmentation. SegPore is a pure white-box model with enhanced interpretability, significantly reducing structured noise in the raw signal. We demonstrate that SegPore outperforms state-of-the-art methods, such as Nanopolish and Tombo, in raw signal segmentation across three large benchmark datasets. Moreover, the improved signal segmentation achieved by SegPore enables SegPore+m6Anet to deliver state-of-the-art performance in site-level m6A identification. Additionally, SegPore surpasses baseline methods like CHEUI in single-molecule level m6A identification.

Introduction

RNA modifications play important roles in different diseases, such as Acute Myeloid Leukemia (Yankova et al., 2021) and Fragile X Syndrome (Prieto et al., 2020), as well as fundamental biological processes like cell differentiation (Bellodi et al., 2013; Lee et al., 2019) and immune response (Quin et al., 2021). To date, researchers have identified over 150 different types of RNA modifications (Boccaletto et al., 2022; Zimna et al., 2023; Ohira and Suzuki, 2024; Chen et al., 2023), highlighting the complexity and diversity of RNA regulation. These modifications are essential for proper RNA function, including maintaining secondary structures and facilitating accurate protein synthesis, such as the role of tRNA modifications in decoding mRNA codons (Agris et al., 2017).

The practical applications of RNA modifications are far-reaching. For instance, N1-methylpseudouridine (m1Ψ) has been used to enhance the efficacy of COVID-19 mRNA vaccines, highlighting their therapeutic potential (Nance and Meier, 2021). However, identifying and characterizing RNA modifications presents significant challenges due to the limitations of existing experimental techniques.

Traditional methods for RNA modification detection, such as MeRIP-Seq (Meyer et al., 2012), miCLIP (Linder et al., 2015), and m6ACE-Seq (Koh et al., 2019), rely on immunoprecipitation techniques that use modification-specific antibodies. While effective, these methods have several drawbacks. First, each method requires a separate antibody for each RNA modification, making it difficult to study multiple modifications simultaneously. Additionally, these techniques can only infer modification locations from next-generation sequencing (NGS) data, rather than measuring the modifications directly. As a result, they struggle to provide single-molecule resolution, limiting their accuracy and scope.

Recent advancements in direct RNA sequencing (DRS) by Oxford Nanopore Technologies (ONT) offer a promising alternative. DRS allows for the direct measurement of electrical currents as RNA molecules translocate through a nanopore, providing a potential avenue for detecting RNA modifications at the single-molecule level. Two versions of the direct RNA sequencing (DRS) kits are available: RNA002 and RNA004. Unless otherwise specified, this study focuses on RNA002 data. In this technique, five nucleotides (5-mers) reside in the nanopore at a time, and each 5-mer generates a characteristic current signal based on its unique sequence and chemical properties (Loman et al., 2015). By analyzing these signals, it is possible to infer the original RNA sequence and detect modifications like N6-methyladenosine (m6A).

The general workflow of Nanopore direct RNA sequencing (DRS) data analysis is as follows. First, the raw electrical signal from a read is basecalled using tools such as Guppy or Dorado (blawrence-ont and malton-ont, 2026), which produce the nucleotide sequence of the RNA molecule. However, these base-called sequences do not include the precise start and end positions of each ribonucleotide (or k-mer) in the signal. Because basecalling errors are common, the sequences are typically mapped to a reference genome or transcriptome using minimap2 (Li, 2018) to recover the correct reference sequence. Next, tools such as Nanopolish (Loman et al., 2015; Simpson et al., 2017) and Tombo (Stoiber et al., 2017) align the raw signal to the reference sequence to determine which portion of the signal corresponds to each k-mer. We define this process as the segmentation and alignment task (abbreviated as the segmentation task), which is referred to as ‘eventalign’ in Nanopolish. Based on this alignment, Nanopolish extracts various features—such as the start and end positions, mean, and standard deviation of the signal segment corresponding to a k-mer. This signal segment or its derived features is referred to as an ‘event’ in Nanopolish. The resulting events serve as input for downstream RNA modification detection tools such as m6Anet (Hendra et al., 2022) and CHEUI (Acera Mateos et al., 2024).

However, significant computational challenges remain. Segmenting the raw current signal into distinct 5-mers and distinguishing between normal nucleotides and their modified counterparts is a complex task. Current methods, such as Nanopolish, employ change-point detection methods to segment the signal and use dynamic programming methods and Hidden Markov Models (HMM) to align the derived segments to the reference sequence, but they are prone to noise and inaccuracies, which degrade performance in downstream tasks like RNA modification prediction. The root of this issue lies in the fact that these methods do not accurately model the physical process of Nanopore sequencing, particularly the dynamics of the motor protein that drives RNA through the pore. As a result, the segmentation process is not well-aligned with the actual translocation mechanics, leading to signal noise that hinders precise modification detection.

SegPore is a novel tool for direct RNA sequencing (DRS) signal segmentation and alignment, designed to overcome key limitations of existing approaches. By explicitly modeling motor protein dynamics during RNA translocation with a Hierarchical Hidden Markov Model (HHMM), SegPore segments the raw signal into small, biologically meaningful fragments, each corresponding to a k-mer sub-state, which substantially reduces noise and improves segmentation accuracy. After segmentation, these fragments are aligned to the reference sequence and concatenated into larger events, analogous to Nanopolish’s ‘eventalign’ output, which serve as the foundation for downstream analyses. Moreover, the ‘eventalign’ results produced by SegPore enhance interpretability in RNA modification estimation. While deep learning–based tools such as m6Anet classify RNA modifications using complex, non-transparent features (see Figure 4—figure supplement 1), SegPore employs a simple Gaussian Mixture Model (GMM) to distinguish modified from unmodified nucleotides based on baseline current levels. This transparent modeling approach improves confidence in the predictions and makes SegPore particularly well-suited for biological applications where interpretability is essential.

By introducing SegPore, we bridge the gaps left by traditional tools. SegPore utilizes a hierarchical hidden Markov model (HHMM) for more precise segmentation and combines it with signal alignment and a GMM for RNA modification prediction, offering both greater accuracy and interpretability. This integrated approach enables robust m6A modification detection at the single-molecule level, facilitating reliable, transparent predictions for both site-specific and molecule-wide m6A modifications.

Materials and methods

SegPore workflow

Workflow overview


An overview of SegPore is illustrated in Figure 1A, which outlines its five-step process. The output of Step 3 is the set of ‘events’, which are analogous to the output generated by the Nanopolish (v0.14.0) ‘eventalign’ command and can be used as input for downstream models such as m6Anet. An ‘event’ refers to a segment of the raw signal that is aligned to a specific k-mer on a read, along with its associated features such as start and end positions, mean current, standard deviation, and other relevant statistics. Step 4 allows for direct modification prediction at both the site and single-molecule levels. Notably, a key feature of SegPore is the k-mer parameter table, which defines the mean and standard deviation for each k-mer in either an unmodified or modified state. During the training phase, Steps 3–5 are iterated multiple times to stabilize the parameters in the k-mer table, which are subsequently fixed and applied for modification prediction on the test data. A detailed description of the model is provided in Appendix 1. Unless otherwise noted, the following analysis focuses on RNA002 data, using 5-mers rather than generic k-mers.

SegPore workflow.

(A) General workflow. The workflow consists of five steps: (1) First, raw current signals are basecalled and mapped using Guppy and Minimap2. The raw current signal fragments are paired with the corresponding reference RNA sequence fragments using Nanopolish. (2) Next, the raw current signal of each read is segmented using a hierarchical hidden Markov model (HHMM), which provides an estimated mean (μi) for each segment. (3) These segments are then aligned with the 5-mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5-mer parameter table. For example, Aj denotes the base ‘A’ at the j-th position on the reference. In this example, A1 and A2 refer to the first and second occurrences of ‘A’ in the reference sequence, respectively. Accordingly, μ1 and μ2 are aligned to A1, while μ3 is aligned to A2. (4) All signals aligned to the same 5-mer across different genomic locations are pooled together, and a two-component Gaussian Mixture Model (GMM) is used to predict the modification at the site level or single-molecule level. One component of the GMM represents the unmodified state, while the other represents the modified state. (5) The GMM is used to re-estimate the 5-mer parameter table. (B) Hierarchical hidden Markov model (HHMM). The outer HMM segments the current signal into alternating base and transition blocks. The inner HMM approximates the emission probability of a base block by considering neighboring 5-mers. A linear model is used to approximate the emission probability of a transition block. (C) Full/partial alignment algorithms. Rows represent the estimated means of base blocks from the HHMM, and columns represent the 5-mers of the reference sequence. Each 5-mer can be aligned with multiple estimated means from the current signal. (D) Gaussian mixture model (GMM) for estimating modification states.
The GMM consists of two components: the green component models the unmodified state of a 5-mer, and the blue component models the modified state. Each component is described by three parameters: mean (μ), standard deviation (σ), and weight (ω).

Preprocessing


We begin by performing basecalling on the input fast5 file using Guppy (v6.0.1), which converts the raw signal data into ribonucleotide sequences. Next, we align the basecalled sequences to the reference genome using Minimap2 (v2.28-r1209) (Li, 2018), generating a mapping between the reads and the reference sequences. Nanopolish provides two independent commands: ‘polya’ and ‘eventalign’. The ‘polya’ command identifies the adapter, poly(A) tail, and transcript region in the raw signal, which we refer to as the poly(A) detection results. The raw signal segment corresponding to the poly(A) tail is used to standardize the raw signal for each read. The ‘eventalign’ command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference. It also computes summary statistics (e.g. mean, standard deviation) from the signal segment for each k-mer. Each k-mer together with its corresponding signal features is termed an event. These event features are then passed into downstream tools such as m6Anet and CHEUI for RNA modification detection. For full transcriptome analysis (Figure 3), we extract the aligned raw signal segment and reference sequence segment from Nanopolish’s events for each read by using the first and last events as start and end points. For in vitro transcription (IVT) data with a known reference sequence (Figure 4), we extract the raw signal segment corresponding to the transcript region for each input read based on Nanopolish’s poly(A) detection results.

Due to inherent variability between nanopores in the sequencing device, the baseline levels and standard deviations of k-mer signals can differ across reads, even for the same transcript. To standardize the signal for downstream analyses, we extract the raw current signal segments corresponding to the poly(A) tail of each read. Since the poly(A) tail provides a stable reference, we standardize the raw current signals for each read, ensuring that the mean and standard deviation are consistent across the poly(A) tail region. This step is crucial for reducing variability between different reads and ensuring more accurate signal segmentation and modification prediction in subsequent steps. See Section 3 of Appendix 1 for more details.
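The poly(A)-based standardization described above amounts to an affine rescaling of each read's raw signal. A minimal sketch of the idea (the target baseline values, default arguments, and function name are illustrative assumptions, not SegPore's actual constants; see Appendix 1, Section 3 for the real procedure):

```python
import math

def standardize_read(signal, polya_segment, target_mean=108.9, target_std=2.7):
    """Rescale a read's raw current so that its poly(A) segment matches a
    common baseline.  The target values here are illustrative placeholders."""
    n = len(polya_segment)
    mu = sum(polya_segment) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in polya_segment) / n)
    # Affine transform: the poly(A) region maps to (target_mean, target_std)
    return [(x - mu) / sd * target_std + target_mean for x in signal]
```

In practice the same target mean and standard deviation would be shared across all reads, so that k-mer baselines become directly comparable between reads and pores.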

Signal segmentation via hierarchical Hidden Markov model


The RNA translocation hypothesis (see details in the first section of Results) naturally leads to the use of a hierarchical Hidden Markov Model (HHMM) to segment the raw current signal. As shown in Figure 1B, the HHMM consists of two layers. The outer HMM divides the raw signal into alternating base and transition blocks, represented by hidden states ‘B’ (base) and ‘T’ (transition). Within each base block, the inner HMM models the current signal at a more granular level.

The inner HMM includes four hidden states: ‘prev’, ‘next’, ‘curr’, and ‘noise’. These correspond to the previous, next, and current 5-mer in the pore, while ‘noise’ refers to random fluctuations. Each raw current measurement is emitted from one of these hidden states, providing a detailed model of the signal within the base blocks. A linear model with a large absolute slope is used to represent sharp changes in the transition blocks.

To segment the signal, we first model the likelihood of the HHMM. Given the raw current signal $y$ of a read, the hidden states of the outer HMM are denoted by $g$. Both $y$ and $g$ are divided into $2K+1$ blocks with labels $c = (c_1, c_2, \ldots, c_k, \ldots, c_{2K+1})$, $c_k \in \{B, T\}$, where $y^{(k)}$ and $g^{(k)}$ correspond to the $k$th block. Blocks with odd indices ($k = 1, 3, 5, \ldots, 2K+1$) are base blocks, while those with even indices ($k = 2, 4, \ldots, 2K$) are transition blocks. The likelihood of the HHMM is given by

(1) $p(y, g) = p(y \mid g)\, p(g)$
(2) $= p(y \mid g) \left\{ \pi^{\mathrm{outer}}_{g_1} \prod_{i=2}^{N} T^{\mathrm{outer}}_{g_{i-1} g_i} \right\}$
(3) $= \left\{ \prod_{i=1}^{N} p(y_i \mid g_i) \right\} \left\{ \pi^{\mathrm{outer}}_{g_1} \prod_{i=2}^{N} T^{\mathrm{outer}}_{g_{i-1} g_i} \right\}$
(4) $= \left\{ \prod_{k=0}^{K} p(y^{(2k+1)} \mid c_{2k+1} = B) \right\} \left\{ \prod_{k=1}^{K} p(y^{(2k)} \mid c_{2k} = T) \right\} \left\{ \pi^{\mathrm{outer}}_{g_1} \prod_{i=2}^{N} T^{\mathrm{outer}}_{g_{i-1} g_i} \right\}$

where $T^{\mathrm{outer}}_{g_{i-1} g_i}$ is the transition matrix of the outer HMM and $\pi^{\mathrm{outer}}_{g_1}$ is the probability of the first hidden state. The first part of Equation 2 gives the emission probabilities, and the second part the transition probabilities. It is not possible to directly compute the emission probabilities of the outer HMM (Equation 3), since there exist dependencies between the current signal measurements within a base or transition block. Therefore, we use the inner HMM and linear model (Equation 4) to handle the dependencies and approximate the emission probabilities.

The inner HMM models transitions between the ‘prev’, ‘next’, ‘curr’, and ‘noise’ states. For the ‘prev’, ‘next’, and ‘curr’ states, a Gaussian distribution is used to model the emission probabilities, while a uniform distribution models the ‘noise’ state. The forward-backward algorithm is used to compute the marginal likelihood of the inner HMM, which approximates the emission probability of a base block. For transition blocks, a standard linear model computes the emission probabilities.
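As a sketch of how the forward pass of such an inner HMM could compute a base block's marginal likelihood, consider the following; the transition/initial probabilities and the constant density for the uniform ‘noise’ state are illustrative assumptions, not SegPore's fitted parameters:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def inner_hmm_loglik(block, kmers, trans, init, noise_density=0.005):
    """Scaled forward algorithm over the inner-HMM states
    ('prev', 'curr', 'next', 'noise').  `kmers` maps the first three states
    to the (mean, std) baselines of the previous/current/next 5-mer; the
    'noise' state uses a uniform density over the plausible current range."""
    states = ['prev', 'curr', 'next', 'noise']
    def emit(s, x):
        if s == 'noise':
            return noise_density
        mu, sd = kmers[s]
        return gauss_pdf(x, mu, sd)
    # Initialisation, then one scaled recursion step per raw measurement
    alpha = {s: init[s] * emit(s, block[0]) for s in states}
    c = sum(alpha.values())
    loglik = math.log(c)
    alpha = {s: v / c for s, v in alpha.items()}
    for x in block[1:]:
        new = {s: sum(alpha[t] * trans[t][s] for t in states) * emit(s, x)
               for s in states}
        c = sum(new.values())
        loglik += math.log(c)
        alpha = {s: v / c for s, v in new.items()}
    return loglik
```

A block whose measurements sit near one of the three 5-mer baselines receives a much higher marginal likelihood than one far from all of them, which is what lets the outer HMM distinguish base blocks from noise-dominated segmentations.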

For any given g, we can calculate the joint likelihood (Equation 1). We enumerate different configurations and select the one with the highest likelihood. This process segments the raw current signal into alternating base and transition blocks, where one or more base blocks may correspond to a single 5-mer. Each base block is characterized by its mean and standard deviation, which are used as input for downstream alignment tasks.

Due to the large size of fast5 files (which can reach terabytes), parameter inference for this model is computationally intensive. To address this, we developed a GPU-based inference algorithm that significantly accelerates the process. More details on this algorithm can be found in Appendix 2.

In summary, the HHMM allows us to accurately segment the raw current signal, providing estimates of the mean and variance for each base block, which are crucial for downstream analyses such as RNA modification prediction.

5-mer and 9-mer parameter table


We downloaded the k-mer model ‘r9.4_180mv_70bps_5mer_RNA’ for RNA002 data from the ONT GitHub repository (Brennen, 2023). The columns labeled ‘level_mean’ and ‘level_stdv’ in this model were used as the mean and standard deviation values for the unmodified 5-mers. These values serve as the initial parameters in the 5-mer parameter table for SegPore, which we refer to as the ONT 5-mer parameter table. In the RNA004 data analysis, we obtained the 9-mer parameter table from the source code of f5c (version 1.5). Specifically, we used the array named ‘rna004_130bps_u_to_t_rna_9mer_template_model_builtin_data’ from the f5c source file (accessed on 17 October 2025).

The initialization of the k-mer parameter table is a critical step in SegPore’s workflow. By leveraging ONT’s established k-mer models, we ensure that the initial estimates for unmodified k-mers are grounded in empirical data. These initial estimates are refined through SegPore’s iterative parameter estimation process, enabling the model to accurately differentiate between modified and unmodified k-mers during segmentation and modification prediction tasks. The refined k-mer parameters also provide a foundation for downstream analysis, such as alignment and m6A identification.

Alignment of raw signal segment with reference sequence


After segmenting the raw current signal of a read into base and transition blocks using HHMM, we align the means of base blocks with the 5-mer list of the reference sequence.

The alignment process involves three main cases:

  1. Base block matching with a 5-mer: In this case, a base block aligns with a 5-mer, which is modeled by a Gaussian distribution. The 5-mer may exist in either an unmodified or modified state, and the corresponding Gaussian parameters are retrieved from the 5-mer parameter table.

  2. Base block matching with an insertion: Here, the base block aligns with an indel (‘-’), indicating an inserted nucleotide in the read.

  3. Deletion in the read: In this case, an indel (0.0) matches with a 5-mer, representing a deleted nucleotide in the read.

The alignment score function models these matching cases as follows. For the first case (a base block matching a 5-mer), we calculate the probability that the base block mean is sampled from either the unmodified or modified 5-mer’s Gaussian distribution. The larger of the two probabilities is used as the match score. For the second and third cases, the alignment is treated as noise, and a fixed uniform distribution is used to calculate the match score.

An important distinction from classical global alignment algorithms (e.g. the Needleman–Wunsch algorithm) is that one or multiple base blocks may align with a single 5-mer. Given the base block means $\mu = (\mu_1, \mu_2, \ldots, \mu_i, \ldots, \mu_m)$ and the 5-mer list $s = (s_1, s_2, \ldots, s_j, \ldots, s_n)$, we define the $(m+1) \times (n+1)$ score matrix $M$ (Figure 1C). The first row and column of the matrix are reserved for indels ("0.0" and "-"), representing insertions or deletions in the base blocks or 5-mers, respectively.

We denote the score function by f. The recursion formula of the dynamic programming alignment algorithm is given by

(5) $M(i,j) = \max \begin{cases} M(i-1, j-1) + f(\mu_i, s_j) \\ M(i, j-1) + f(0.0, s_j) \\ M(i-1, j) + f(\mu_i, \text{"-"}) \\ M(i-1, j) + f(\mu_i, s_j) \end{cases}$

Note that we can still align $\mu_i$ with $s_j$ after we have aligned $\mu_{i-1}$ with $s_j$ (the fourth case), which accommodates the special consideration that one or multiple base blocks might be aligned with one 5-mer.
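Recursion (5) can be sketched in Python as follows. The use of log densities as the score function f, the fixed indel score, and the Gaussian parameters are illustrative assumptions rather than SegPore's exact choices:

```python
import math

def log_gauss(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

LOG_INDEL = math.log(1e-3)  # fixed uniform density for indel cases (illustrative)

def score(mu_i, kmer_params):
    """f(mu_i, s_j): the larger of the unmodified / modified log densities."""
    (m_un, s_un), (m_mod, s_mod) = kmer_params
    return max(log_gauss(mu_i, m_un, s_un), log_gauss(mu_i, m_mod, s_mod))

def full_alignment(mus, kmer_list):
    """Fill the (m+1) x (n+1) score matrix of recursion (5).
    kmer_list[j] holds ((mu_un, sd_un), (mu_mod, sd_mod)) for 5-mer s_j."""
    m, n = len(mus), len(kmer_list)
    NEG = float('-inf')
    M = [[NEG] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + LOG_INDEL          # base block vs '-'
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + LOG_INDEL          # '0.0' vs 5-mer
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f_ij = score(mus[i - 1], kmer_list[j - 1])
            M[i][j] = max(M[i - 1][j - 1] + f_ij,   # move to a new 5-mer
                          M[i][j - 1] + LOG_INDEL,  # 5-mer skipped (deletion)
                          M[i - 1][j] + LOG_INDEL,  # extra base block (insertion)
                          M[i - 1][j] + f_ij)       # same 5-mer matched again
    return M
```

The fourth candidate in the inner `max` is what allows consecutive base blocks to be absorbed by one 5-mer; a traceback from the bottom-right corner (or from the maximum of the last row, for the partial variant) recovers the alignment itself.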

We implement two types of alignment algorithms (Figure 1C) based on the score matrix M:

  1. Full Alignment Algorithm: This algorithm aligns the full list of base block means with the full list of 5-mers, similar to classical global alignment. It traces back from the (m+1,n+1) position in the score matrix.

  2. Partial Alignment Algorithm: This aligns the full list of base blocks with the initial part of the 5-mer list, with no indels allowed in the base block means or the 5-mer list. The trace back starts from the maximum value in the last row of the score matrix.

A detailed description of both alignment algorithms is provided in Appendix 1. The output of the alignment algorithm is an alignment that pairs the base blocks with the 5-mers from the reference sequence for each read (Figure 1C). Base blocks aligned to the same 5-mer are concatenated into a single raw signal segment (referred to as an ‘event’), from which various features—such as start and end positions, mean current, and standard deviation—are extracted. A detailed derivation of the mean and standard deviation is provided in Section 5.3 in Appendix 1. In the remainder of this paper, we refer to these resulting events as the output of eventalign analysis, which also represents the final output of the segmentation and alignment task.
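For the mean and standard deviation of a concatenated event, pooling the base blocks aligned to one 5-mer reduces to standard pooled-moment formulas. A hedged sketch, assuming each block summary gives population (not sample) moments; the exact derivation is in Appendix 1, Section 5.3:

```python
import math

def merge_blocks(blocks):
    """Pool base blocks aligned to one 5-mer into a single event.
    Each entry of `blocks` is (n_k, mean_k, std_k) for one base block."""
    N = sum(n for n, _, _ in blocks)
    mean = sum(n * m for n, m, _ in blocks) / N
    # Pooled second moment E[x^2] from per-block variances and means
    ex2 = sum(n * (s * s + m * m) for n, m, s in blocks) / N
    return N, mean, math.sqrt(ex2 - mean * mean)
```

This reproduces exactly the mean and standard deviation one would get from the concatenated raw measurements, without re-reading the signal itself.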

Modification prediction


After obtaining the eventalign results, we estimate the modification state of each motif using the 5-mer parameter table. Specifically, for each 5-mer, we compare the probability that its mean is sampled from either the modified or unmodified 5-mer’s Gaussian distribution. If the probability under the modified 5-mer distribution is higher, the 5-mer is predicted to be in the modification state, and vice versa for the unmodified state.

To estimate the overall modification rate at a specific genomic location, we pool all reads that map to that location on the reference sequence. The modification rate is calculated as the proportion of reads that are predicted to be in the modification state at that location. A detailed description of the modification state prediction process can be found in Appendix 1.
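The two-Gaussian comparison and the site-level pooling described above can be sketched as follows; the parameter layout and function names are illustrative, and Appendix 1 gives the full procedure:

```python
import math

def gauss_logpdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def is_modified(event_mean, params):
    """Predict the state of one event by comparing the two Gaussians from
    the 5-mer parameter table; params = (mu_un, sd_un, mu_mod, sd_mod)."""
    mu_un, sd_un, mu_mod, sd_mod = params
    return gauss_logpdf(event_mean, mu_mod, sd_mod) > gauss_logpdf(event_mean, mu_un, sd_un)

def site_mod_rate(event_means, params):
    """Site-level modification rate: the fraction of reads covering this
    genomic location that are predicted to be in the modified state."""
    calls = [is_modified(x, params) for x in event_means]
    return sum(calls) / len(calls)
```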

GMM for 5-mer parameter table re-estimation


To improve alignment accuracy and enhance modification predictions, we use a Gaussian Mixture Model (GMM) to iteratively fine-tune the 5-mer parameter table. As illustrated in Figure 1A, the rows of the 5-mer parameter table represent the 5-mers, while the columns provide the mean and standard deviation for both the unmodified and modified states. We denote a 5-mer by $s$, with its relevant parameters as $\mu_{s,un}$, $\delta_{s,un}$, $\mu_{s,mod}$, $\delta_{s,mod}$.

Using the alignment results from all reads, we collect the base block means aligned to the same 5-mer (denoted by $s$) across different reads and genomic locations, particularly those with high modification rates. A two-component GMM is then fit to these base block means. In this process, the mean of the first component is fixed to $\mu_{s,un}$. From the GMM, we update the parameters $\delta_{s,un}$, $\omega_{s,un}$, $\mu_{s,mod}$, $\delta_{s,mod}$, and $\omega_{s,mod}$, where $\omega_{s,un}$ and $\omega_{s,mod}$ represent the weights of the unmodified and modified components, respectively. Afterward, we manually adjust the 5-mer parameter table using heuristics to ensure that the modified 5-mer distribution is significantly distinct from the unmodified distribution (see details in Section 7 of Appendix 1).

This re-estimation process is only performed on the training data. The initial 5-mer parameter table is based on the table provided by ONT. After each iteration of updating the table, we rerun the SegPore workflow from the alignment step onward. Typically, the process is repeated three to five times until the 5-mer parameter table stabilizes (when the average change of mean values of all 5-mers is less than 5e-3). Once a stabilized 5-mer parameter table is estimated from the training data, it is used for RNA modification estimation in the test data without further updates. A more detailed description of the GMM re-estimation process is provided in Section 6 of Appendix 1.
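A minimal EM sketch of fitting such a constrained two-component GMM, with the unmodified component's mean held fixed; the initial values are illustrative, and SegPore's manual adjustment heuristics (Appendix 1, Section 7) are omitted:

```python
import math

def gauss_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_constrained_gmm(xs, mu_un, n_iter=50):
    """EM for a two-component GMM in which the unmodified component's mean
    is fixed to mu_un.  Returns (mu, sd, weight) for each component."""
    mu_mod = mu_un + 10.0          # offset start for the modified component
    sd_un = sd_mod = 3.0
    w_un = w_mod = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of the modified component for each point
        r = []
        for x in xs:
            a = w_un * gauss_pdf(x, mu_un, sd_un)
            b = w_mod * gauss_pdf(x, mu_mod, sd_mod)
            r.append(b / (a + b))
        n_mod = sum(r)
        n_un = len(xs) - n_mod
        # M-step: mu_un stays fixed, everything else is re-estimated
        w_mod, w_un = n_mod / len(xs), n_un / len(xs)
        mu_mod = sum(ri * x for ri, x in zip(r, xs)) / n_mod
        sd_mod = math.sqrt(sum(ri * (x - mu_mod) ** 2 for ri, x in zip(r, xs)) / n_mod)
        sd_un = math.sqrt(sum((1 - ri) * (x - mu_un) ** 2 for ri, x in zip(r, xs)) / n_un)
    return (mu_un, sd_un, w_un), (mu_mod, sd_mod, w_mod)
```

Fixing the first component's mean anchors the unmodified state to the k-mer table, so the second component is free to drift toward the signal shift caused by the modification.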

m6A site level benchmark


The HEK293T wild-type (WT) samples were downloaded from the ENA database (accession number PRJEB40872), while the HCT116 samples were obtained from ENA under accession PRJEB44348. The reference sequence (Homo_sapiens.GRCh38.cdna.ncrna_wtChrIs_modified.fa) was downloaded from https://doi.org/10.5281/zenodo.4587661. Ground truth data were sourced from Supplementary Data 1 of Pratanwanich et al., 2021. Fast5 files for the test dataset (mESC WT samples, mESCs_Mettl3_WT_fast5.tar.gz; Jenjaroenpun et al., 2021) were retrieved from the NCBI Sequence Read Archive (SRA) under accession SRP166020.

During training, we initialized the 5-mer parameter table using ONT’s data. The standard SegPore workflow was executed on the training data (HEK293T WT samples), using the full alignment algorithm. The 5-mer parameter table estimation was iterated five times. For mapping, reads were first aligned to the cDNA and subsequently converted to genomic locations using Ensembl GTF file (GRCh38, v9). The same 5-mer across different genomic locations was pooled together. A 5-mer was considered significantly modified if its read coverage exceeded 1500 and the distance between the means of the two Gaussian components in the GMM was greater than 5 picoamperes (pA). As a result, modification parameters were specified for ten significant 5-mers, as illustrated in Figure 3—figure supplement 1A.

With the estimated 5-mer parameter table from the training data, we then ran the SegPore workflow on the test data. The transcript sequences from GENCODE release version M18 were used as the reference sequence for mapping, with the corresponding GTF file (gencode.vM18.chr_patch_hapl_scaff.annotation.gtf) downloaded from GENCODE to convert transcript locations into genomic coordinates. It is important to note that the 5-mer parameter table was not re-estimated for the test data. Instead, modification states for each read were directly estimated using the fixed 5-mer parameter table. Due to the differences between human (Figure 3—figure supplement 1A) and mouse (Figure 3—figure supplement 1B), only six 5-mers were found to have m6A annotations in the test data’s ground truth (Figure 3—figure supplement 1C). For a genomic location to be identified as a true m6A modification site, it had to correspond to one of these six common 5-mers and have a read coverage of greater than 20. SegPore derived the ROC and PR curves for benchmarking based on the modification rate at each genomic location.
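Deriving an ROC summary from per-site modification rates reduces to ranking sites by their rate against the binary ground truth. A small rank-based AUC sketch (equivalent to the Mann–Whitney U statistic; not the exact evaluation code used in the paper):

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a true m6A site receives a higher
    modification rate than a non-modified site; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```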

In the SegPore+m6Anet analysis, we fine-tuned the m6Anet model using SegPore’s eventalign results to demonstrate improved m6A identification. We started with the pre-trained m6Anet model (Hendra, 2025; model version: HCT116_RNA002) and fine-tuned it using the eventalign results from HCT116 samples. SegPore’s eventalign output provided the pairing between each genomic location and its corresponding raw signal segment, which allowed us to extract normalized features such as the normalized mean $\mu_i$, standard deviation $\sigma_i$, and dwell time $l_i$ (number of data points in the event). For genomic location $i$, m6Anet uses the feature vector $x_i = \{\mu_{i-1}, \sigma_{i-1}, l_{i-1}, \mu_i, \sigma_i, l_i, \mu_{i+1}, \sigma_{i+1}, l_{i+1}\}$ as input. Feature vectors from a randomly selected 80% of the genomic locations were used for training, while the remaining 20% were set aside for validation. We ran 100 epochs during fine-tuning, selecting the model that performed best on the validation set.
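Assembling this nine-dimensional input from SegPore's per-position event summaries can be sketched as below (the data layout and the function name are illustrative, not m6Anet's actual dataprep code):

```python
def build_feature_vector(events, i):
    """Build the feature vector for position i from the previous, current,
    and next positions; events[j] = (norm_mean, norm_std, dwell_time)."""
    x = []
    for j in (i - 1, i, i + 1):
        mu, sd, dwell = events[j]
        x.extend([mu, sd, dwell])
    return x
```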

Ground truth data and the performance of other methods (Tombo v1.5.1, Nanom6A v2.0, m6Anet v1.0, and Epinano v1.2.0) on the mESC dataset were provided through personal communications with Prof. Luo Guanzheng, the corresponding author of the benchmark study referenced (Zhong et al., 2023).

m6A single molecule level benchmark


The benchmark IVT data for single-molecule m6A identification was downloaded from NCBI-SRA under accession number SRP166020. CHEUI was used as the baseline for the benchmark. For the ground truth, every A in every read of the IVT_m6A sample is treated as an m6A modification, and every A in every read of the IVT_normalA sample is treated as a normal A. Detailed methods for calculating the modification probability at the single-molecule level are provided in Appendix 1, ‘Modification state estimation on single molecule level’ and ‘Modification probability estimation on single molecule level’.

Results

RNA translocation hypothesis

Accurate segmentation of raw current signals in direct RNA sequencing remains a major challenge, largely because the precise dynamics of RNA translocation through the pore are not fully understood. To address this, we propose a hypothesis that better reflects the physical movement of the RNA molecule. Traditional basecalling algorithms such as Guppy and Albacore implicitly assume that the RNA molecule is translocated through the pore by the motor protein in a monotonic fashion, that is, the RNA is pulled through the pore unidirectionally. In the DNN training process of Guppy and Albacore, the current signal is aligned with the reference RNA sequence, and this unidirectional alignment is the source of the implicit monotonic translocation assumption.

Contrary to the conventional assumption of monotonic translocation, the raw current data suggest that the motor protein drives the RNA both forward and backward during sequencing. Figure 2B illustrates this with several example fragments of DRS raw current signal (Pratanwanich et al., 2021), where each fragment roughly corresponds to three neighboring 5-mers. In several instances, the measured current intensities match both the previous and the next 5-mer's baseline, and such patterns are widely observed across multiple DRS datasets, providing empirical evidence that the RNA molecule moves forward and backward while passing through the pore. This observation is also supported by previous reports (Craig et al., 2017; Caldwell and Spies, 2017), in which the helicase (the motor protein) translocates the DNA strand through the nanopore in a back-and-forth manner. Depending on ATP or ADP binding, the motor protein may translocate the DNA/RNA forward or backward by 0.5–1 nucleotides.

Figure 2 with 1 supplement see all
RNA translocation hypothesis.

(A) Jiggling RNA translocation hypothesis. The top panel shows the raw current signal of Nanopore direct RNA sequencing, with gray areas representing SegPore-estimated transition blocks. We focus on three neighboring 5-mers, considering the central 5-mer (CTACG) as the current 5-mer. The RNA molecule may briefly move forward or backward during the translocation of the current 5-mer. If the RNA molecule is pulled backward, the previous 5-mer is placed in the pore, and the current signal (“prev” state, red dots) resembles the previous 5-mer’s baseline (mean and standard deviation highlighted by red lines and shades). If the RNA is pushed forward, the current signal (“next” state, blue dots) is similar to the next 5-mer’s baseline. (B) Example raw current signals supporting the jiggling hypothesis. The dashed rectangles highlight base blocks, with red and blue points representing measurements corresponding to the previous and next 5-mer, respectively. Red points align closely with the previous 5-mer’s baseline, and blue points match the next 5-mer’s baseline, reinforcing the hypothesis that the RNA molecule jiggles between neighboring 5-mers. The raw current signals were extracted from mESC WT samples of the training data in the m6A benchmark experiment.

Based on the reported kinetic model (Craig et al., 2017), we hypothesize that the RNA is translocated through the pore in a jiggling manner. This “jiggling” hypothesis is a significant departure from the traditionally accepted view of unidirectional RNA translocation and aligns better with recent evidence from studies on DNA translocation (Craig et al., 2017; Caldwell and Spies, 2017). Incorporating this hypothesis into segmentation algorithms could lead to more accurate predictions of RNA modifications by accounting for the dynamic nature of translocation. On average, the motor protein sequentially translocates the 5-mers of the RNA strand forward, and each 5-mer resides in the pore for a short time. During this short period, the motor protein may swiftly drive the RNA molecule forward and backward by 0.5–1 nucleotides (Figure 2A), which makes the measured current intensity occasionally similar to that of the previous or the next 5-mer. When the motor protein does not move the RNA molecule, we hypothesize that the RNA undergoes slight thermal fluctuations, oscillating slightly within the pore and producing a stable current close to its baseline. In contrast, sharp changes in current intensity between consecutive 5-mers define transition blocks, where one 5-mer is replaced by the next.
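As a concrete illustration of the hypothesis, the signal within a single base block can be simulated with a toy model. The baselines, jiggling probability, and noise level below are made-up values, not estimates from data:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_base_block(mu_prev, mu_curr, mu_next, n=60, p_jiggle=0.1, sigma=1.5):
    """Simulate the current within one base block: most points sit at the
    current 5-mer's baseline, with brief excursions to the previous/next
    5-mer's baseline (the 'prev'/'next' inner states of the hypothesis)."""
    states = rng.choice(["prev", "curr", "next"], size=n,
                        p=[p_jiggle, 1 - 2 * p_jiggle, p_jiggle])
    baselines = {"prev": mu_prev, "curr": mu_curr, "next": mu_next}
    signal = np.array([rng.normal(baselines[s], sigma) for s in states])
    return signal, states

# Made-up baselines (in pA) for the previous, current, and next 5-mer
signal, states = simulate_base_block(95.0, 105.0, 88.0)
```

Plotting such a simulated trace reproduces the qualitative pattern of Figure 2B: a flat stretch at the current baseline interrupted by short excursions toward the neighboring 5-mers' baselines.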

We also assume that the raw current signal of a read can be segmented into a series of alternating base and transition blocks. In the ideal case, a base block corresponds to the base state, in which the 5-mer resides in the pore and jiggles between neighboring 5-mers, that is, the current 5-mer can transiently jump to the previous or the next 5-mer. A transition block corresponds to the transition state between two consecutive base states, in which one 5-mer is replaced by the next in the pore. The current signal should be relatively flat in the base blocks, whereas a sharp change is expected in the transition blocks. One challenge we encountered was the overestimation of the number of transition blocks. This may occur because a base block can correspond to a sub-state of a single 5-mer rather than a full 5-mer, inflating the transition counts. To address this issue, SegPore’s alignment algorithm merges multiple base blocks (which may represent sub-states of the same 5-mer) into a single 5-mer, thereby facilitating further analysis.
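The merging step can be illustrated with a small dynamic-programming alignment in which each 5-mer absorbs one or more consecutive base blocks, scoring each block mean against that 5-mer's Gaussian baseline. This is a simplified sketch with made-up parameters, not SegPore's actual alignment algorithm (which also handles partial alignments and jiggling states):

```python
import numpy as np

def align_blocks_to_kmers(block_means, kmer_params):
    """Assign each base block to one 5-mer, in order, such that every
    5-mer gets at least one block. kmer_params: list of (mu, sigma) per
    5-mer in sequence order. Returns the 5-mer index of each block."""
    n, m = len(block_means), len(kmer_params)

    def score(b, k):
        mu, s = kmer_params[k]
        return -0.5 * ((block_means[b] - mu) / s) ** 2  # Gaussian log-density (up to a constant)

    dp = np.full((n + 1, m + 1), -np.inf)
    back = np.zeros((n + 1, m + 1), dtype=int)  # 0: merge into current 5-mer, 1: start a new 5-mer
    dp[0, 0] = 0.0
    for b in range(1, n + 1):
        for k in range(1, m + 1):
            stay = dp[b - 1, k] + score(b - 1, k - 1)
            move = dp[b - 1, k - 1] + score(b - 1, k - 1)
            dp[b, k], back[b, k] = max((stay, 0), (move, 1))
    # Trace back the block-to-5-mer assignment
    assign, b, k = [], n, m
    while b > 0:
        assign.append(k - 1)
        if back[b, k] == 1:
            k -= 1
        b -= 1
    return assign[::-1]

# Toy example: five base blocks, three 5-mers with baselines 100/90/80 pA
blocks = [100.0, 100.2, 90.0, 80.0, 80.1]
params = [(100.0, 1.0), (90.0, 1.0), (80.0, 1.0)]
assignment = align_blocks_to_kmers(blocks, params)  # -> [0, 0, 1, 2, 2]
```

In this toy run, the first two blocks (both near 100 pA) are merged onto the first 5-mer, mimicking how sub-state blocks of the same 5-mer are collapsed.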

SegPore Workflow

The SegPore workflow (Figure 1) consists of five key steps: (1) Preprocess fast5 files to pair raw current signal segments with corresponding RNA sequence fragments; (2) Segment each raw current signal using a hierarchical hidden Markov model (HHMM) into base and transition blocks; (3) Align the derived base blocks with the paired RNA sequence; (4) Fit a two-component GMM to estimate the modification state at the single-molecule level or the modification rate at the site level; (5) Use the results from Step (4) to update relevant parameters. Steps (3) to (5) are iterative and continue until the estimated parameters stabilize.

The final outputs of SegPore are the events and modification state predictions. SegPore’s events are similar to the outputs of Nanopolish’s ‘eventalign’ command, in that they pair raw current signal segments with the corresponding RNA reference 5-mers. Each 5-mer is associated with various features, such as start and end positions, mean current, and standard deviation, derived from the paired signal segment. For 5-mers that exhibit one clearly unmodified component and one clearly modified component, SegPore reports the modification rate at each site, as well as the modification state of that site on individual reads.

A key component of SegPore is the 5-mer parameter table, which specifies the mean and standard deviation for each 5-mer in both modified and unmodified states (Figure 1A). Since the peaks (representing modified and unmodified states) are separable for only a subset of 5-mers, SegPore can provide modification parameters for these specific 5-mers. For other 5-mers, modification state predictions are unavailable.

Segmentation benchmark

To evaluate SegPore’s performance in raw signal segmentation, we compared SegPore with Nanopolish (v0.14.0) and Tombo (v1.5.1) using three Nanopore direct RNA sequencing (DRS) datasets: two HEK293T datasets (wild type and Mettl3 knockout) (Pratanwanich et al., 2021) and the HCT116 dataset (Chen et al., 2025). Nanopolish and SegPore employed the ‘eventalign’ method to align 5-mers on each read with their corresponding raw signals, producing the mean and standard deviation (std) of the aligned signal segments. Tombo used the ‘resquiggle’ method to segment the raw signals, but the resulting signals are not reported on the absolute pA scale. To ensure a fair comparison with SegPore, we standardized the segments using the poly(A) tail in the same way as SegPore (See preprocessing section in Materials and methods).

To benchmark segmentation performance, we used two key metrics (details provided in Appendix 1, Section 8): (1) the log-likelihood of the segment mean under ONT’s 5-mer parameter table (used as ground truth), where a correctly aligned segment has a mean close to ONT’s reference and thus a high log-likelihood; and (2) the standard deviation (std) of the segment, where a lower std indicates reduced noise and better segmentation quality.
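A sketch of how such metrics can be computed, assuming the log-likelihood is the Gaussian log-density of each segment mean under the reference 5-mer parameters (the exact formulas are given in Appendix 1, Section 8; the reference entry below is synthetic, not ONT's real table):

```python
import numpy as np

def gaussian_loglik(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def benchmark_metrics(events, ref_table):
    """events: iterable of (kmer, segment_mean, segment_std).
    ref_table: kmer -> (mu, sigma) reference parameters.
    Returns (average log-likelihood of segment means, average segment std)."""
    logps, stds = zip(*[
        (gaussian_loglik(m, *ref_table[k]), s) for k, m, s in events
    ])
    return float(np.mean(logps)), float(np.mean(stds))

# Synthetic example with a made-up reference entry
ref = {"GGACT": (105.0, 2.5)}
well_aligned = [("GGACT", 105.2, 2.1)]   # segment mean close to reference
misaligned   = [("GGACT", 112.0, 4.0)]   # segment mean far from reference
L_good, s_good = benchmark_metrics(well_aligned, ref)
L_bad, s_bad = benchmark_metrics(misaligned, ref)
```

A well-aligned, low-noise segmentation thus scores higher on metric (1) and lower on metric (2), which is the direction of the arrows in Tables 1–3.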

As shown in Table 1, SegPore consistently achieved the best performance averaged over all 5-mers across all datasets, with the highest log-likelihood and the lowest std values. These results suggest that SegPore provides a more accurate segmentation of the raw signal with significantly reduced noise compared to Nanopolish and Tombo. It is worth noting that the data points corresponding to the transition state between two consecutive 5-mers are not included in the calculation of the standard deviation in SegPore’s results in Table 1; however, their exclusion does not affect the overall conclusion. Because SegPore assigns on average ~6 data points per event to the transition state, we similarly removed the first and last three data points of each event for Nanopolish and Tombo and recalculated the metrics. The updated results are presented in Table 2.

Table 1
Segmentation benchmark on RNA002 data.
| Test dataset | Nanopolish (avg. σ̂ ↓) | Tombo (avg. σ̂ ↓) | SegPore (avg. σ̂ ↓) | Nanopolish (avg. log p L̂ ↑) | Tombo (avg. log p L̂ ↑) | SegPore (avg. log p L̂ ↑) |
| --- | --- | --- | --- | --- | --- | --- |
| HEK293T (WT) | 3.073 | 4.187 | 2.736 | –2.871 | –3.749 | –2.778 |
| HEK293T (KO) | 2.948 | 4.204 | 2.670 | –2.856 | –3.749 | –2.745 |
| HCT116 | 3.167 | 4.076 | 2.872 | –2.872 | –3.704 | –2.746 |
Table 2
Segmentation benchmark on the RNA002 data excluding boundaries.
| Test dataset | Nanopolish (avg. σ̂ ↓) | Tombo (avg. σ̂ ↓) | SegPore (avg. σ̂ ↓) | Nanopolish (avg. log p L̂ ↑) | Tombo (avg. log p L̂ ↑) | SegPore (avg. log p L̂ ↑) |
| --- | --- | --- | --- | --- | --- | --- |
| HEK293T (WT) | 2.862 | 4.090 | 2.736 | –2.838 | –3.772 | –2.778 |
| HEK293T (KO) | 2.856 | 4.120 | 2.670 | –2.878 | –3.729 | –2.745 |
| HCT116 | 3.058 | 4.053 | 2.872 | –2.869 | –3.708 | –2.746 |

To provide a more intuitive comparison, the segmentation results for two example raw signal clips are illustrated in Figure 3—figure supplement 2. These examples demonstrate the clearer, more precise segmentation achieved by SegPore compared to Nanopolish.

To evaluate SegPore’s performance on RNA004 data, we compared it with f5c (v1.5) (Gamaarachchi et al., 2020) and Uncalled4 (v4.1.0) (Kovaka et al., 2025) using three public DRS datasets: the S. cerevisiae dataset (Watson et al., 2025) and the curlcake IVT and m6A datasets (Cruciani et al., 2025). The RNA002 chemistry provides reference current levels for 5-mers, whereas RNA004 provides reference values for 9-mers, which Uncalled4 normalizes to approximately zero mean and unit variance. As there are currently no established poly(A) detection methods for RNA004, we used f5c to standardize the raw signals of each read prior to segmentation, and SegPore was then applied to the standardized signals. We computed the same benchmarking metrics (average log-likelihood and standard deviation) from the standardized raw signals and the segmentation results of f5c, Uncalled4, and SegPore, using the 9-mer parameter table on the pA scale for RNA004 data provided by f5c (see Materials and methods). As shown in Table 3, SegPore achieved the best overall performance across all datasets, indicating its robustness and suitability for RNA004 data. Moreover, the jiggling hypothesis remains valid for RNA004, as illustrated in Figure 2—figure supplement 1.

Table 3
Segmentation benchmark on RNA004 data.
| Test dataset | f5c (avg. σ̂ ↓) | Uncalled4 (avg. σ̂ ↓) | SegPore (avg. σ̂ ↓) | f5c (avg. log p L̂ ↑) | Uncalled4 (avg. log p L̂ ↑) | SegPore (avg. log p L̂ ↑) |
| --- | --- | --- | --- | --- | --- | --- |
| S. cerevisiae | 3.129 | 3.970 | 3.125 | –2.541 | –3.835 | –2.489 |
| curlcake (IVT) | 3.424 | 4.729 | 3.359 | –2.716 | –2.904 | –2.515 |
| curlcake (m6A) | 3.520 | 4.971 | 3.392 | –3.118 | –3.524 | –2.599 |

m6A identification at the site level

We evaluated SegPore’s performance in raw signal segmentation and m6A identification using independent public datasets as both training and test data. Since m6A modifications typically occur at DRACH motifs (where D denotes A, G, or U, and H denotes A, C, or U) (Linder et al., 2015), this study focuses on estimating m6A modifications on these motifs.

To begin, we estimated the 5-mer parameter table for m6A modifications using Nanopore direct RNA sequencing (DRS) data from three wild-type human HEK293T cell samples (Pratanwanich et al., 2021). The fast5 files from all samples were concatenated, and the full SegPore workflow was run to obtain the 5-mer parameter table. The parameter estimation process was iterated five times to ensure stabilization, refining the modification parameters for ten 5-mers where the modification state distribution significantly differs from the unmodified state.

Next, we applied SegPore’s segmentation and m6A identification to test data from wild-type mouse embryonic stem cells (mESCs; Zhong et al., 2023). Given the comparable methods and input data requirements, we benchmarked SegPore against several baseline tools, including Tombo, MINES (Lorenz et al., 2020), Nanom6A (Gao et al., 2021), m6Anet, Epinano (Liu et al., 2019), and CHEUI (Acera Mateos et al., 2024). By default, MINES and Nanom6A use eventalign results generated by Tombo, while m6Anet, Epinano, and CHEUI rely on eventalign results produced by Nanopolish. In Figure 3C, ‘Nanopolish+m6Anet’ refers to the default m6Anet pipeline, whereas ‘SegPore+m6Anet’ denotes a configuration in which Nanopolish’s eventalign results are replaced with those from SegPore. Based on the output of SegPore eventalign, we fine-tuned m6Anet using the HCT116 data at all DRACH motifs, aiming to demonstrate that SegPore’s segmentation benefits downstream models. Additionally, because the availability of ground truth labels for specific k-mer motifs differs between human and mouse (Figure 3—figure supplement 1), six shared 5-mers were selected to demonstrate SegPore’s performance in modification prediction directly. Utilizing the 5-mer parameter table derived from the training data, SegPore employs a two-component GMM to calculate the modification rates at the selected m6A sites.

Figure 3 with 3 supplements see all
m6A identification at the site level.

(A) Histogram of the estimated mean from current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (B) Histogram of the estimated mean from current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (C) Site-level benchmark results for m6A identification across all DRACH motifs, showing performance comparisons between SegPore+m6Anet and other methods. (D) Benchmark results for m6A identification on six selected motifs at the site level, comparing SegPore and other baseline methods.

SegPore demonstrated improved segmentation compared to Nanopolish. Figure 3A shows the eventalign results at an example genomic location with m6A modifications. SegPore’s results show a more pronounced bimodal distribution in the raw signal segment mean, indicating clearer separation of modified and unmodified signals. Furthermore, when pooling all reads mapped to m6A-modified locations at the GGACT motif, SegPore exhibited more distinct peaks (Figure 3B), indicating reduced noise and potentially enabling more reliable modification detection.

We evaluated m6A predictions using two approaches: (1) SegPore’s segmentation results were fed into m6Anet, referred to as SegPore+m6Anet, which works for all DRACH motifs and (2) direct m6A predictions from SegPore’s Gaussian Mixture Model (GMM), which is limited to the six selected 5-mers shown in Figure 3—figure supplement 1C that exhibit clearly separable modified and unmodified components in the GMM (see Materials and methods for details).

In terms of m6A identification, SegPore performed strongly on the test data. Using miCLIP2 (Körtel et al., 2021) data as the ground truth, we calculated the area under the curve (AUC) for both the receiver operating characteristic (ROC) and precision-recall (PR) curves. SegPore+m6Anet achieved the best performance, with an ROC AUC of 89.0% and a PR AUC of 43.8% (Figure 3C). On the six selected m6A motifs, SegPore achieved an ROC AUC of 82.7% and a PR AUC of 38.7%, ranking third behind the deep learning methods m6Anet and CHEUI (Figure 3D). It is noteworthy that SegPore’s GMM for m6A estimation is a very simple model with far fewer parameters than DNN-based methods, so achieving this level of performance is notable. These results highlight SegPore’s robust performance in m6A identification. For practical applications, we recommend taking the intersection of m6A sites predicted by SegPore and m6Anet to obtain high-confidence modification sites, while still benefiting from the interpretability provided by SegPore’s predictions.

m6A identification at the single molecule level

SegPore naturally identifies m6A modifications at the single-molecule level, which is crucial for understanding the heterogeneity of RNA modifications across individual transcripts. We benchmarked SegPore against CHEUI using an in vitro transcription (IVT) dataset containing two samples—one transcribed with m6A and the other with adenine (Jenjaroenpun et al., 2021). This dataset provides clear ground truth for m6A modifications at the single-molecule level, with all adenosine positions replaced by m6A in the ivt_m6A sample, and all adenosines unmodified in the ivt_normalA sample. We used 60% of the data for training and 40% for testing, with both SegPore and CHEUI estimating the m6A modification probability at each adenosine site on each read. Based on these probabilities, we calculated the ROC-AUC and PR-AUC by varying the modification probability threshold.

As shown in Figure 4A, SegPore outperformed CHEUI on this benchmark dataset, achieving better performance in both PR-AUC (94.7% vs 90.3%) and ROC-AUC (93% vs 89.9%). These results clearly demonstrate SegPore’s accuracy and robustness in detecting single-molecule m6A modifications.

Figure 4 with 2 supplements see all
m6A identification at the single-molecule level.

(A) Benchmark results for single-molecule m6A identification on IVT data. SegPore shows better performance compared to CHEUI in both PR-AUC and ROC-AUC. (B) Comparison of ‘eventalign’ results from SegPore and Nanopolish for five consecutive k-mers. Note that DRS is sequenced from 3’ to 5’, so the k-mers enter the pore from right to left. A total of 100 reads were randomly sampled from transcript locations A1 (positions 711–719) in both the IVT_normalA and IVT_m6A samples (SRA: SRP166020). Each line represents an individual read, and the y-axis shows the raw signal intensity in picoampere (pA). Pink lines represent the IVT_m6A sample, and gray lines represent the IVT_normalA sample. The k-mers ‘GCGGA’, ‘CGGAC’, ‘GGACT’, ‘GACTT’, and ‘ACTTT’ all contain N6-Methyladenosine (m6A) in the IVT_m6A sample. SegPore’s results show clearer separation between m6A and adenosine, especially for ‘CGGAC’ and ‘GGACT’, compared to Nanopolish. (C) The upper panel shows the modification rate for selected genomic locations in the example gene ENSMUSG00000003153. The lower panel displays the modification states of all reads mapped to this gene. The black borders in the heatmap highlight the biclustering results, showing distinct modification patterns across different read clusters labeled C1 through C6.

Next, we demonstrate SegPore’s interpretability through an example comparison of raw signal clips from both ivt_m6A and ivt_normalA samples (Figure 4B). The raw signal segments (means) are aligned to a randomly selected m6A site as well as its neighboring sites in the reference sequence. Each line represents an individual read, with pink lines from the ivt_m6A sample and gray lines from the ivt_normalA sample. SegPore’s segmentation clearly distinguishes between m6A and adenosine at the single-molecule level, while Nanopolish’s segmentation shows less distinction.

Interestingly, we observed that the position of m6A within a 5-mer can affect the signal intensity. For instance, a clear difference between m6A and adenosine is evident when m6A occupies the fourth position in the 5-mer (e.g. ‘CGGAC’), but this difference is less pronounced when m6A is in the second position (e.g. ‘GACTT’).

To illustrate the benefits of single-molecule m6A estimation, we present an example gene (ENSMUSG00000003153) from the mESC dataset used in site-level m6A identification. Figure 4C shows a heatmap of highly modified genomic locations (modification rate >0.1), where rows represent reads and columns represent genomic locations. Biclustering reveals six clusters of reads and three clusters of genomic locations.

The heatmap in Figure 4C suggests heterogeneity in m6A modification patterns across different reads of the same gene. Biclustering reveals that modifications at g6 are specific to cluster C4, g7 to cluster C5, and g8 to cluster C6, while the first five genomic locations (g1 to g5) show similar modification patterns across all reads. Additionally, high modification rates are observed at the third and fifth positions across the majority of reads. These results suggest that m6A modification patterns can vary significantly even within a single gene, highlighting the complexity of RNA modification regulation and underscoring the importance of single-molecule resolution for understanding RNA function at a finer scale.
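As a toy illustration of this kind of read-level heterogeneity, reads sharing an identical binary modification pattern can be grouped directly; this is a much simpler stand-in for the biclustering used here:

```python
import numpy as np

# Toy binary matrix: rows = reads, columns = genomic locations
# (1 = modified, 0 = unmodified). Reads in the same cluster share a pattern.
M = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
])

# Group reads with identical modification patterns; `labels` assigns each
# read to one of the unique patterns.
patterns, labels = np.unique(M, axis=0, return_inverse=True)
```

In this toy matrix the first column is modified in all reads (analogous to g1–g5 above), while the remaining columns distinguish three read clusters.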

Discussion

One of the main computational challenges in direct RNA sequencing (DRS) lies in accurately segmenting the raw current signal. We developed a segmentation algorithm that models the jiggling property of the physical process of DRS, resulting in cleaner current signals for m6A identification at both the site and single-molecule levels. Our results demonstrate that SegPore’s segmentation enables clear differentiation between m6A-modified and unmodified adenosines. We believe that the de-noised current signals will be beneficial for other downstream tasks, such as the estimation of m5C, pseudouridine, and other RNA modifications. Nevertheless, several open questions remain for future research. In SegPore, we assume a drastic change between two consecutive 5-mers, which may hold for 5-mers with large differences in their current baselines but not for those with small differences. As with other RNA modification estimation methods, SegPore can be affected by misalignment errors, particularly when the baseline signals of adjacent k-mers are similar. Such cases may lead to spurious bimodal signal distributions and require careful interpretation. Another key question concerns the physical interpretation of the derived base blocks. Ideally, one base block would correspond to a single 5-mer, but in practice, multiple base blocks often align with one 5-mer. We hypothesize that the HHMM may segment a 5-mer’s current signal into multiple base blocks when the 5-mer oscillates between different sub-states, each characterized by a distinct baseline.

Currently, SegPore models only the modification state of the central nucleotide within the 5-mer. However, modifications at other positions may also affect the signal, as shown in Figure 4B. Therefore, introducing multiple states for a 5-mer to account for modifications at multiple positions within the same 5-mer could help to improve the performance of the model. This approach, however, would significantly increase model complexity: introducing two states per position results in 2^5 = 32 modification states per 5-mer, a challenge for simple GMMs. It remains to be investigated how to model multiple modification states properly to improve the performance of RNA modification estimation. While the current two-state assumption simplifies the model, it has proven effective, as demonstrated by improved performance on m6A benchmarks at both site and single-molecule levels.
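The count follows from enumerating two states at each of the five positions:

```python
from itertools import product

# Two states ('un'/'mod') at each of the 5 positions of a 5-mer
# gives 2**5 = 32 joint modification states.
states = list(product(["un", "mod"], repeat=5))
print(len(states))  # 32
```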

A key advantage of SegPore is its interpretability, which sets it apart from DNN-based methods. SegPore provides a clearer understanding of RNA modification predictions. A limitation of DNN-based approaches is that they often struggle to differentiate between m6A and adenosine signals, as their intensity levels are quite similar (Figure 4B, Figure 3—figure supplement 3). This poses a challenge when applying DNN-based methods to new datasets without short read sequencing-based ground truth. In such cases, it is difficult for researchers to confidently determine whether a predicted m6A modification is genuine. SegPore encodes current intensity levels for different 5-mers in a parameter table, where unmodified and modified 5-mers are modeled using two Gaussian distributions. One can generally observe a clear difference in the intensity levels between 5-mers with an m6A and those with adenosine, which makes it easier for a researcher to interpret whether a predicted m6A site is genuine (see Figure 4—figure supplement 1). A challenge is the relatively small number of 5-mers that show significant changes in their modification states. To improve accuracy, larger training datasets and expanding the scope to 7-mers or 9-mers could help capture more context, potentially revealing more significant baseline changes.

The limited improvement of SegPore combined with m6Anet over Nanopolish+m6Anet in bulk in vivo analysis (Figure 3) may be explained by several factors: potential alignment inaccuracies due to pseudogenes or transcript isoforms, the complexity of in vivo datasets containing additional RNA modifications (e.g. m5C, m7G) affecting signal baselines, and the fact that m6Anet is specifically trained on events produced by Nanopolish rather than SegPore. Additionally, the lack of a modification-free control (in vitro transcribed) sample in the benchmark dataset makes it difficult to establish true baselines for each k-mer. Despite these limitations, SegPore demonstrates clear improvement in single-molecule m6A identification in IVT data (Figure 4), suggesting it is particularly well suited for in vitro transcription data analysis.

Although SegPore provides clear interpretability, there is potential to explore DNN-based models that can directly leverage SegPore’s segmentation results. Currently, m6Anet computes features (e.g., mean and standard deviation) from raw signal segments, which are then fed into a neural network for m6A prediction. However, a more direct approach—where raw signal segments are used as input to a DNN—could allow for the extraction of more intricate features that may exist within the signal but are currently underexplored. These high-order features might capture subtle aspects of the raw signal, leading to improved m6A estimation. Since SegPore currently models only the mean and standard deviation, further work involving advanced DNNs could extend beyond this, uncovering finer patterns in the signal that traditional statistical models might miss.

Computation speed is also a concern when processing fast5 files. We addressed this by implementing a GPU-accelerated inference algorithm in SegPore, resulting in a 10- to 20-fold speedup compared to the CPU-based version. We believe that the GPU implementation will unlock the full potential of SegPore for a wider range of downstream tasks and larger datasets. More details are provided in Appendix 2. SegPore’s running times on datasets of varying sizes, using a single NVIDIA DGX-1 V100 GPU and one CPU core, are provided in Figure 4—figure supplement 2.

In summary, we developed SegPore, a novel software tool that accounts for the conformational changes of the motor protein when segmenting the raw current signal of Nanopore direct RNA sequencing. SegPore effectively masks out noise in the raw signal, leading to improved m6A identification at both the site and single-molecule levels.

Appendix 1

Notation

|·|: The length of the enclosed vector.
Y: All reads in the input multi-fast5 file, Y = (y_1, y_2, …, y_i, …), where y_i is the raw current measurements of the i-th read. Assuming there are N reads, |Y| = N.
y_i: The current signal of the i-th read, y_i = (y_i1, y_i2, …, y_ij, …, y_i,l_i), where l_i is the length of y_i, i.e. |y_i| = l_i.
y_i^(k): The k-th mapped current signal fragment of the i-th read y_i, given by eventalign using Nanopolish.
R: Reference sequences for each read, R = (r_1, r_2, …, r_i, …).
r_i: The reference sequence corresponding to the i-th read y_i, given by basecalling and minimap2, r_i = (r_i1, r_i2, …, r_ij, …, r_i,m_i), where r_ij ∈ {A, C, G, T} and m_i is the length of r_i, i.e. |r_i| = m_i.
r_i^(k): The k-th reference sequence fragment of r_i corresponding to y_i^(k). Note that |y_i^(k)| is larger than |r_i^(k)| in general.
s_i: The 5-mer list corresponding to r_i. We have s_i = (s_i1, s_i2, …, s_ij, …, s_i,m_i), where s_ij = r_i,j−2 r_i,j−1 r_i,j r_i,j+1 r_i,j+2 and s_ij ∈ {AAAAA, AAAAC, AAAAG, AAAAT, …, TTTTT}. Note that |s_i| = |r_i| = m_i.
s_i^(k): The 5-mer list fragment corresponding to the k-th reference sequence fragment r_i^(k), with |s_i^(k)| = |r_i^(k)|.
Z: The estimated states of all reads given by the hierarchical hidden Markov model (HHMM), Z = (z_1, z_2, …, z_i, …).
z_i: The hidden state of each current measurement of the i-th read, z_i = (z_i1, z_i2, …, z_ij, …, z_i,l_i), where z_ij ∈ {0, 1, 2, 3, 4} corresponds to y_ij and |z_i| = |y_i| = l_i; z_ij = 0 denotes the transition state and z_ij ∈ {1, 2, 3, 4} denote base states.
z_i^(k): The k-th fragment of z_i, corresponding to y_i^(k). Note that |z_i^(k)| = |y_i^(k)|.
μ_i: The mean values of the base segments given by the HHMM, estimated from y_i and z_i, μ_i = (μ_i1, μ_i2, …, μ_ij, …, μ_i,n_i), where j = 1, …, n_i indexes the base segments of the i-th read. In general, l_i > n_i > m_i.
μ_i^(k): The k-th fragment of μ_i, corresponding to y_i^(k) and z_i^(k), μ_i^(k) = (μ_i1^(k), μ_i2^(k), …). In general, |y_i^(k)| > |μ_i^(k)| > |s_i^(k)|.
u_i^(k): The modification state of each 5-mer in s_i^(k), u_i^(k) = (u_i1^(k), u_i2^(k), …, u_ij^(k), …), where u_ij^(k) ∈ {"un", "mod"} indicates whether s_ij^(k) is in the unmodified or modified state; |u_i^(k)| = |s_i^(k)|.
T_ref: The reference 5-mer parameter table provided by Oxford Nanopore Technologies (ONT), T_ref = {(μ_s,u, σ_s,u) : s ∈ {AAAAA, AAAAC, …, TTTTT}, u = "un"}.
T_kmer: The parameter table for all 5-mers, T_kmer = {(μ_s,u, σ_s,u) : s ∈ {AAAAA, AAAAC, …, TTTTT}, u ∈ {"un", "mod"}}.

Introduction

Given the parameter table T_kmer for all k-mers, the overall workflow of SegPore consists of the following major steps: (1) basecalling and mapping, (2) preprocessing of the nanopore raw signal, (3) segmenting the nanopore raw signal using a hierarchical hidden Markov model (HHMM), and (4) aligning raw signal segments with the corresponding k-mer list. The input of SegPore is the raw current signal of Nanopore direct RNA sequencing. The output of the SegPore workflow is the pairing of raw signal segments with the corresponding k-mers and their inferred states (unmodified or modified). Unless otherwise noted, the following analysis focuses on RNA002 data, using 5-mers rather than generic k-mers.

As illustrated in Appendix 1—figure 1, we denote the input FAST5/POD5 file as a set of raw current signal reads Y = (y_1, y_2, …, y_i, …). We use |Y| to denote the number of reads and |y_i| to denote the number of measurements (data points) of y_i. We then perform basecalling using Guppy and map the basecalled sequence to the reference sequences (‘Basecalling, mapping and preprocessing’), yielding a pairing between the i-th read y_i and its corresponding reference sequence r_i. We assume there are k matched fragments [(y_i^(1), r_i^(1)), (y_i^(2), r_i^(2)), …, (y_i^(k), r_i^(k))]. For the j-th fragment y_i^(j), we use the HHMM (‘Hierarchical hidden Markov model’) to estimate the hidden state of each measurement, z_i^(j) = (z_i1^(j), z_i2^(j), …), where z_im^(j) ∈ {0, 1, 2, 3, 4} and |y_i^(j)| = |z_i^(j)|. Based on y_i^(j) and z_i^(j), we convert the signal into a list of low-variation segments with means μ_i^(j) = (μ_i1^(j), μ_i2^(j), …). As a result, the number of measurements in y_i^(j) is larger than the number of segments in μ_i^(j), that is, |y_i^(j)| = |z_i^(j)| > |μ_i^(j)|. The reference fragment r_i^(j) is converted into a list of 5-mers s_i^(j) = (s_i1^(j), s_i2^(j), …). Based on μ_i^(j) and s_i^(j), we perform full or partial alignment (‘Alignment of signal segments with reference sequence’) to match each 5-mer with one or multiple current signal segments, as well as estimating its modification state u_i^(j) = (u_i1^(j), u_i2^(j), …), where u_im^(j) ∈ {"un", "mod"}.

Appendix 1—figure 1
SegPore workflow with a fixed 5-mer parameter table Tkmer.

The input is the raw current signal yi and reference sequence ri for ith read. The output is the paired current signal segments µi, 5-mer list si and modification states ui.

However, T_kmer is not fully known beforehand (Appendix 1—table 1), especially the parameters for the modification states of 5-mers. We have to estimate these parameters from the training data through an iterative process (‘Estimation of k-mer parameters table’). After obtaining the estimated T_kmer, we derive the modification states of new test data using the SegPore workflow with fixed T_kmer.

Appendix 1—table 1
Example of Tkmer (1024×4).

‘-’ represents ‘NA’.

| model_kmer | un_mu | un_sigma | mod_mu | mod_sigma |
| --- | --- | --- | --- | --- |
| AAAAA | 108.9 | 2.68 | - | - |
| AAAAC | 107.75 | 2.68 | - | - |
| AAAAG | 101.72 | 2.68 | - | - |
| AAAAT | 112.77 | 2.68 | - | - |
| AAACA | 99.38 | 2.41 | 95.28 | 2.44 |
| AAACC | 100 | 1.32 | 96.82 | 2.42 |
| AAACG | 101.01 | 3.45 | - | - |
| AAACT | 106.91 | 2.18 | 101.98 | 2.29 |
| AAAGA | 110.54 | 4.06 | - | - |
| AAAGC | 107.69 | 4.06 | - | - |
| AAAGG | 108.29 | 4.06 | - | - |
| AAAGT | 108.73 | 4.06 | - | - |
| AAATA | 114.11 | 3.11 | - | - |
| … | … | … | … | … |
| TTTTT | 80.78 | 1.97 | - | - |

Basecalling, mapping, and preprocessing

This section describes the basecalling and mapping process, as well as the preprocessing used to standardize the raw signal of all reads.

First, we obtain the matching between raw current signals and reference sequences. The raw multi-fast5 reads (input file) are basecalled using Guppy (v6.0.1), which results in a FASTQ file. We then align the resulting FASTQ file to the reference sequences (we use the cDNA reference) using minimap2 (v2.24-r1122) (Li, 2021), which provides a BAM file that matches the basecalled reads with the reference sequence. After that, we feed the raw fast5 file, basecalled FASTQ file, BAM file, and reference sequence to Nanopolish (v0.13.2) to get (1) matching fragments between the raw current signal yi(k) and the reference sequence ri(k) (as well as si(k)) and (2) the current signal fragment for the poly(A) part of the mRNA molecule.

Then, we standardize the raw current signal based on the poly(A) tail. For technical reasons, the current signal of the same sequence fragment varies across different reads, so standardization is necessary. Since all reads have a poly(A) tail in Nanopore direct RNA sequencing, we can standardize the reads such that the mean and std of the poly(A) part are the same for all reads. The standardization is given by

(1) standardized_signal = (raw_signal − μpolyA) / σpolyA × σstandPolyA + μstandPolyA

where μstandPolyA=108.9 and σstandPolyA=1.67 (RNA002). Note that yi refers to the standardized signal throughout this document.
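As a minimal sketch, the standardization of Equation 1 can be written as follows (function and variable names are ours; the constants are the RNA002 values quoted above):

```python
import numpy as np

# RNA002 constants from Equation 1
MU_STAND_POLYA = 108.9
SIGMA_STAND_POLYA = 1.67

def standardize_read(raw_signal, polya_signal,
                     mu_stand=MU_STAND_POLYA, sigma_stand=SIGMA_STAND_POLYA):
    """Standardize a read so its poly(A) segment has a fixed mean and std (Equation 1)."""
    mu_polya = float(np.mean(polya_signal))
    sigma_polya = float(np.std(polya_signal))
    return (np.asarray(raw_signal) - mu_polya) / sigma_polya * sigma_stand + mu_stand
```

After this transformation, the poly(A) segment of every read has mean 108.9 and std 1.67, so signal levels become comparable across reads.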

Hierarchical hidden Markov model

This section describes the hierarchical hidden Markov model (HHMM) to segment the raw current signal of a read. The input of HHMM is yi and the output is zi. For notation simplicity, we use y and z to represent yi and zi in this section, respectively.

Model assumptions

Here we make several assumptions about the raw signal. We first assume there is a sharp change in the baseline level of the current signal between two consecutive 5-mers, that is, a large change of the current signal within a short period of time. If we fit a line to the corresponding signals, we expect to see a large slope. The second assumption is that the 5-mer residing in the Nanopore oscillates forward and backward during a 5-mer event, while staying in the pore most of the time. Based on these two assumptions, we can divide the raw current signal into two states (hidden states of the outer HMM): the base state for 5-mer events and the transition state for the transition between two consecutive 5-mer events. We assume that the current signal always starts with a base state, then switches to a transition state, after which base and transition states alternate, and in the end, the signal ends in a base state. For the base state, we use an inner HMM to model the oscillations, which has four hidden states: ‘curr’, ‘prev’, ‘next’, and ‘noise’. For the transition state, we use a linear regression model to model the abrupt change of the current signal.

Outer HMM

We define the following notations for the HHMM. As illustrated in Appendix 1—figure 2, we assume there are N time points of the current signal, that is, y=(y1,y2,…,yi,…,yN). We use g=(g1,g2,…,gi,…,gN), gi∈{"B","T"} to indicate the hidden states of the outer HMM, where B and T stand for ‘base’ and ‘transition’, respectively. Similarly, we use h=(h1,h2,…,hi,…,hN), hi∈{"curr","prev","next","noise"} to indicate the hidden states of the inner HMM. We then integrate g and h into a single vector z=(z1,z2,…,zi,…,zN), where zi=g~i·h~i. Here g~i ("B"→1, "T"→0) and h~i (‘curr’→1, ‘prev’→2, ‘next’→3, ‘noise’→4) are the numerical mappings of gi and hi. Based on g, we can derive that y is divided into 2K+1 blocks, defined by c=(c1,c2,…,ci,…,c2K+1), ci∈{"B","T"}. Note that ci="B" for i=1,3,…,2K+1 and ci="T" for i=2,4,…,2K. As a result, we can also represent y=(y(1),y(2),…,y(i),…,y(2K+1)), where y(i) represents the ith block. The base blocks are given by (y(b1),y(b2),…,y(bi),…,y(bK+1)), where bi=2i−1 and i∈{1,2,…,K+1}. The transition blocks are given by (y(t1),y(t2),…,y(ti),…,y(tK)), where ti=2i and i∈{1,2,…,K}. Similarly, we have g=(g(1),g(2),…,g(i),…,g(2K+1)) and h=(h(1),h(2),…,h(i),…,h(2K+1)).

Appendix 1—figure 2
Example of HHMM where gi{B",T"}, hi{curr",prev",next",noise"} and ci{B",T"}.

In this figure, ‘c’, ‘p’, ‘n’, and ‘o’ represent ‘curr’, ‘prev’, ‘next’, and ‘noise’, respectively, in h. |y| = |g| = |h| and |c| = 11.

We use Touter={TBBouter,TBTouter,TTBouter,TTTouter} to denote the transition probabilities between the base state and the transition state in the outer HMM. We use Tinner= {Thihjinnerhi,hj{curr",prev",next",noise"}} to denote the transition probabilities between four hidden states of the inner HMM. For base blocks, we have inner HMM parameters ϕ=(ϕ(b1),ϕ(b2),,ϕ(bK+1)) and σ=(σ(b1),σ(b2),,σ(bK+1)). For transition blocks, we have linear coefficients β0=(β0(t1),β0(t2),,β0(tK)), β1=(β1(t1),β1(t2),,β1(tK)) and noise std σϵ=(σϵ(t1),σϵ(t2),,σϵ(tK)).

The joint likelihood of the HHMM is given by

(2) p(y,g) = p(y|g) p(g) = p(y|g) πg1outer ∏_{i=2}^{N} Tgi−1,giouter = ∏_{i=1}^{N} p(yi|gi) · πg1outer ∏_{i=2}^{N} Tgi−1,giouter,

where πg1outer=1 is the initial probability of the hidden state of y1 (always the ‘B’ state). However, it is not possible to directly derive p(yi|gi), since the yi are not independent given gi. Instead, we use a different model for each block ck to calculate the joint probability p(y(k)|ck), which approximates ∏_{i=1}^{N} p(yi|gi). Therefore, we have

(3) ∏_{i=1}^{N} p(yi|gi) = ∏_{j=1}^{2K+1} p(y(j)|cj) = ∏_{j=0}^{K} p(y(2j+1)|c2j+1="B") × ∏_{j=1}^{K} p(y(2j)|c2j="T")

The first part is the marginal likelihood for base blocks, while the second part is the marginal likelihood for transition blocks. The specific likelihood formulas are given in the following sections.

For the outer HMM, the transition probabilities Touter are pre-specified and the emission probabilities are obtained from the inner HMM (‘Inner HMM for base state’) and linear regression model (‘Linear model for transition state’). The initial probability for the first hidden state πg1outer is set to 1, since the first state should always be “B”.

Inner HMM for base state

This section provides the derivations for calculating the likelihood of the ith base block bi, where i{1,2,,K+1}. We assume the data y(bi)=(y1(bi),y2(bi),,yNbi(bi)) is generated by an HMM with four states: ‘curr’, ‘prev’, ‘next’, and ‘noise’. The descriptions of the states are given by Appendix 1—table 2, where parameters for the noise state are given and parameters for other states need to be estimated from the data.

Appendix 1—table 2
Hidden states of inner HMM for the bi th block.
State     Event                           Emission distribution     Parameter estimation
‘curr’    current 5-mer resides in pore   N(ϕ(bi), (σ(bi))²)        Yes
‘prev’    previous 5-mer resides in pore  N(ϕ(bi−1), (σ(bi−1))²)    Yes
‘next’    next 5-mer resides in pore      N(ϕ(bi+1), (σ(bi+1))²)    Yes
‘noise’   unknown noise                   Unif(lb, ub)              No

The likelihood for the kth block is given by

(4) p(y(bi)|cbi="B") = Σ_{h(bi)} p(y(bi), h(bi)|cbi="B") = Σ_{h(bi)} p(y(bi)|ϕ, σ, h(bi), cbi="B") p(h(bi)|cbi="B") = Σ_{h(bi)} [∏_{j=1}^{Nbi} p(yj(bi)|hj(bi), ϕ, σ) · πh1(bi)inner ∏_{j=2}^{Nbi} Thj−1(bi),hj(bi)inner] = Σ_{h(bi)} [∏_{j=1}^{Nbi} {N(yj(bi)|ϕ(bi), (σ(bi))²)^I(hj(bi)="curr") · N(yj(bi)|ϕ(bi−1), (σ(bi−1))²)^I(hj(bi)="prev") · N(yj(bi)|ϕ(bi+1), (σ(bi+1))²)^I(hj(bi)="next") · Unif(yj(bi)|lb, ub)^I(hj(bi)="noise")} · πh1(bi)inner ∏_{j=2}^{Nbi} Thj−1(bi),hj(bi)inner]

where πh1(bi)inner=0.25 is the initial probability of the first hidden state and I(·) is the indicator function.

For the inner HMM, the transition probabilities Tinner are pre-specified and the initial probability for the first hidden state is set to πh1(bi)inner=0.25. The other parameters (see Appendix 1—table 2) need to be estimated.

By comparing the emission probabilities of the outer HMM (Equation 3) and the marginal likelihood of the inner HMM (Equation 4), we have

(5) p(y(bi)|cbi="B") = ∏_{j=1}^{|y(bi)|} p(yj(bi)|gj(bi))

To approximate p(yj(bi)|gj(bi)), we assume that p(yj(bi)|gj(bi)) takes the same value for all j, which gives

(6) p(y(bi)|cbi="B") = ∏_{j=1}^{|y(bi)|} p(yj(bi)|gj(bi)) = [p^(yj(bi)|gj(bi))]^|y(bi)|

The approximated emission probability of each data point in the ith base block is given by

(7) p^(yj(bi)|gj(bi)) = [p^(y(bi)|cbi="B")]^(1/|y(bi)|),

where p^(y(bi)|cbi=B") can be obtained by our inference algorithm.

Note that the hidden states of the inner HMM h can be provided as a by-product of our inference algorithm. We will use it to derive the final composite hidden state z of the HHMM (‘Parameter inference’), which is utilized by downstream alignment tasks (‘Post-processing of alignment path’).

Linear model for transition state

This section provides the derivations for calculating the likelihood of the ith transition block ti, where i{1,2,,K}. We denote the input data by y(ti)=(y1(ti),y2(ti),,yNti(ti)) and its corresponding index by x(ti)=(1,2,,Nti). We assume a linear model for data points in the transition state, given by

(8) y(ti)=β1(ti)x(ti)+β0(ti)+ϵ,

where β1(ti) is the slope, β0(ti) is the intercept and ϵN(0,(σϵ(ti))2) is the Gaussian noise.

The likelihood of the linear model is given by

(9) p(y(ti)|cti="T") = p(y(ti)|x(ti), β0(ti), β1(ti), σϵ(ti)) = ∏_{j=1}^{Nti} p(yj(ti)|xj(ti), β0(ti), β1(ti), σϵ(ti)) = ∏_{j=1}^{Nti} N(yj(ti) − β1(ti)xj(ti) − β0(ti) | 0, (σϵ(ti))²)
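For illustration, the least-squares fit and Gaussian log-likelihood of Equations 8–9 can be sketched with NumPy (function name is ours; the actual SegPore implementation may differ):

```python
import numpy as np

def transition_block_fit(y_block):
    """Fit y = beta1 * x + beta0 + eps to one transition block and return
    the coefficients, the noise std, and the Gaussian log-likelihood (Eq. 9)."""
    y = np.asarray(y_block, dtype=float)
    x = np.arange(1, len(y) + 1)              # index 1..N_ti, as in the text
    beta1, beta0 = np.polyfit(x, y, deg=1)    # ordinary least squares
    resid = y - (beta1 * x + beta0)
    sigma = max(float(np.std(resid)), 1e-6)   # MLE of the noise std, floored
    loglik = float(np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                          - resid ** 2 / (2 * sigma ** 2)))
    return beta0, beta1, sigma, loglik
```

A transition block with a steep, nearly linear current change yields a large fitted slope and a high likelihood under this model, matching the first model assumption.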

Similar to the inner HMM, the approximated emission probability for each data point in ith transition block is given by

(10) p^(yj(ti)|gj(ti)) = [p^(y(ti)|cti="T")]^(1/|y(ti)|)

Parameter inference

Due to the complexity of GPU-accelerated inference algorithm, the details are provided in Appendix 2.

The inference algorithm infers the following parameters: (1) the hidden states g of the outer HMM; (2) the hidden states h of the inner HMM; (3) the relevant emission parameters (Appendix 1—table 2) of each current signal fragment of the (2k+1)th block (base block) y(2k+1), where k∈{0,1,…,K} and c2k+1="B"; and (4) the linear coefficients and noise std of the (2k)th block (transition block) y(2k), where k∈{1,2,…,K} and c2k="T".

We then integrate g and h into a single vector z=(z1,z2,,zi,,zN), where zi=g~ih~i and g~i,h~i are the numerical mappings of gi,hi. z is the final output of HHMM.

Alignment of signal segments with reference sequence

This section describes the alignment algorithms for aligning an input fragment of signal segments μi(k) with the corresponding 5-mer list si(k). After post-processing of the alignment path, we can get the modification states ui(k). For notation simplicity, we drop the indices and denote μi(k), si(k) and ui(k) by μ, s and u, respectively. In general, one 5-mer sj is aligned with k signal segments (μi,μi+1,…,μi+k−1), where k≥1.

Here we discuss the different scenarios of aligning sj with μi. The first scenario (scenario 1) is that μi is generated by the unmodified 5-mer sj, for which we compute the likelihood using the Normal distribution N(μsjun, (σsjun)²), where (μsjun,σsjun)=Tsj,"un"kmer. The second scenario (scenario 2) is that μi is generated by the modified 5-mer sj, that is, N(μsjmod, (σsjmod)²), where (μsjmod,σsjmod)=Tsj,"mod"kmer. The third scenario (scenario 3) is that μi is generated by an unknown nucleotide insertion event, whose probability is given by a pre-defined uniform distribution punif. The fourth scenario (scenario 4) is that the 5-mer sj represents a deletion event on the reference sequence, whose probability is given by the same uniform distribution punif. We define the scoring function as

(11) f(sj, μi) = max{log N(μi|μsjun, (σsjun)²), log N(μi|μsjmod, (σsjmod)²)}, for scenarios 1 and 2; f(sj, μi) = log(punif), if sj="−" or μi=0.0 (scenarios 3 and 4).

We can modify the classical global alignment algorithm to align μ with s. The log probability of aligning sj with μi (Equation 11) is used as the matching score in the standard global alignment algorithm. Since one 5-mer sj can be aligned with multiple signal segments (μi,μi+1,…,μi+k−1), we modify the recursion such that μi+1 can still be aligned with sj even if μi is already aligned with sj. Therefore, we have the following recursions in our alignment algorithm (Appendix 1—figure 3a). Here we denote the scoring matrix by M (m×n) and the traceback matrix by T, where |μ|=m−1 and |s|=n−1.

Appendix 1—figure 3
Alignment of signal segments with reference sequence.
(12) M(i,j) = max{v1, v2, v3, v4}, where
v1 = M(i−1, j−1) + f(μi, sj) (scenarios 1, 2), T(i,j)=1;
v2 = M(i, j−1) + f(0, sj) (scenario 4, reference deletion), T(i,j)=2;
v3 = M(i−1, j) + f(μi, "−") (scenario 3, insertion), T(i,j)=3;
v4 = M(i−1, j) + f(μi, sj) (scenarios 1, 2), T(i,j)=4,

where both μi1 and μi are aligned with sj when T(i,j)=4. This is the only difference between our recursion and the recursion of the standard global alignment algorithm.

Based on the proposed recursion, we developed two alignment algorithms designed for different problems: the full alignment algorithm (‘Full alignment algorithm’) allows insertions and deletions in the reference sequence, while the partial alignment algorithm (‘Partial alignment algorithm’) does not allow any insertions or deletions in the partial reference sequence. The full alignment algorithm tries to align all the current signal segments to the whole reference sequence, while the partial alignment algorithm tries to align all the current signal segments to a consecutive sub-sequence of the 5-mer list from the beginning.

The output of the alignment algorithms is a matching path between the current signal segments μ and the reference 5-mer list s. We denote it by γ=(μp,sq), where μp=(μp1,μp2,,μpN) and sq=(sq1,sq2,,sqN). The length of the alignment path is given by |γ|=N. p=(p1,p2,,pN) are the indexes of the current signal segment μ in the alignment path γ, while q=(q1,q2,,qN) are the indexes of the 5-mer list s. Note that there are duplicated values in p and q to allow one 5-mer to match with multiple current signal segments. Appendix 1—figure 3b provides an illustration of the alignment result.

To obtain the modification states u, we will first process the optimal alignment path γ into γ^ (‘Post-processing of alignment path’), from which we can derive the modification states u (‘Modification state estimation on single molecule level’).

Full alignment algorithm

The full alignment algorithm is used when the read may possess differences (insertions or deletions) compared with the reference sequence, which is the general case. The algorithm is described in Appendix 1–algorithm 1.

Appendix 1–algorithm 1. Full alignment algorithm.
Input: μ, s, Tkmer
Output: Alignment path γ
 // Add special symbol before μ and s
μ ={0.0, μ}, s ={“–”, s}
 // Get the length of μ and s
m = |μ|, n = |s|
 // Initialize the score matrix M and traceback matrix T
Initialize a m × n matrix M, with all elements set to 0.0
Initialize a m × n matrix T, with all elements set to 0
 // Initialize the first row of M and T
for i ∈ {1, 2, … , n-1} do
   M(0, i) ← M(0, i - 1) + f (0.0, si)
   T(0, i)←2
end
// Initialize the first column of M and T
for i ∈ {1, 2, · · · , m − 1} do
    M(i, 0) ← M(i − 1, 0) + f(µi , “ − ”)
    T(i, 0) ← 3
end
// Main loop over the matrix
for i ∈ {1, 2,···,m − 1} do
for j ∈ {1, 2,···, n − 1} do
     v1M(i − 1, j − 1) + f(μi, sj)
     v2M(i, j − 1) + f(0.0, sj)
     v3M(i − 1, j) + f(μi, “ − ”)
     v4M(i − 1, j) + f(μi, sj)
     M(i, j) ← max{v1, v2, v3, v4}
       T(i, j) ← argmax{v1, v2, v3, v4} // index of the maximum
   end
end
// Traceback
Traceback from position (m − 1, n − 1) to position (0, 0) using T to generate the optimal
alignment path γ.
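As a concrete illustration, the recursion of Equation 12 and Appendix 1–algorithm 1 can be sketched in Python as follows. The scoring function f is passed in by the caller, so a toy f can stand in for Equation 11; all names are ours and this is a sketch, not the SegPore implementation itself:

```python
import numpy as np

def full_align(mu, s, f):
    """Modified Needleman-Wunsch of Appendix 1-algorithm 1: the extra T(i,j)=4
    move lets one 5-mer absorb several consecutive signal segments.
    f(seg, kmer) plays the role of Equation 11; seg=0.0 encodes a skipped
    5-mer and kmer="-" encodes an unmatched signal segment."""
    m, n = len(mu) + 1, len(s) + 1
    M = np.zeros((m, n))
    T = np.zeros((m, n), dtype=int)
    for j in range(1, n):                          # first row: 5-mers with no segment
        M[0, j] = M[0, j - 1] + f(0.0, s[j - 1])
        T[0, j] = 2
    for i in range(1, m):                          # first column: segments with no 5-mer
        M[i, 0] = M[i - 1, 0] + f(mu[i - 1], "-")
        T[i, 0] = 3
    for i in range(1, m):
        for j in range(1, n):
            moves = [M[i - 1, j - 1] + f(mu[i - 1], s[j - 1]),  # start a new 5-mer
                     M[i, j - 1] + f(0.0, s[j - 1]),            # skip the 5-mer
                     M[i - 1, j] + f(mu[i - 1], "-"),           # skip the segment
                     M[i - 1, j] + f(mu[i - 1], s[j - 1])]      # extend the same 5-mer
            T[i, j] = int(np.argmax(moves)) + 1
            M[i, j] = moves[T[i, j] - 1]
    path, i, j = [], m - 1, n - 1                  # traceback to (0, 0)
    while i > 0 or j > 0:
        t = T[i, j]
        if t == 1:
            path.append((mu[i - 1], s[j - 1])); i -= 1; j -= 1
        elif t == 2:
            path.append((0.0, s[j - 1])); j -= 1
        elif t == 3:
            path.append((mu[i - 1], "-")); i -= 1
        else:
            path.append((mu[i - 1], s[j - 1])); i -= 1
    return list(reversed(path))
```

A toy scoring function only needs to penalize gaps (seg = 0.0 or kmer = "−") and reward segments close to a 5-mer's expected level; the T(i,j)=4 move is what allows several consecutive segments to map to the same 5-mer.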

Partial alignment algorithm

The partial alignment algorithm is used when the read must agree with the reference sequence (no insertions or deletions). If the sequencing data are generated by in vitro transcription from a single template, all reads should agree with the reference sequence. However, a read may only match the reference sequence from the start to a middle point due to breakage of the RNA molecule during reverse transcription.

The partial alignment algorithm is a modification of the full alignment algorithm that handles this special case. Since the alignment may not be full, we traceback from the maximum value in the last row of the scoring matrix M. This means all current signal segments must be aligned, while only the first part of the 5-mer list needs to be aligned. Appendix 1—figure 3c provides an illustration of the alignment result.

The partial alignment algorithm is described in Appendix 1–algorithm 2. Compared with the full alignment algorithm, scenarios 3 and 4 have been removed, that is, T(i,j) never takes the values 2 or 3. In other words, there are no insertions or deletions in the 5-mer list s.

Appendix 1–algorithm 2. Partial alignment algorithm.
Input: μ, s, Tkmer
Output: Alignment path γ
 // Add special symbol before μ and s
μ = {0.0, μ}, s = {“ − ”, s}
 // Get the length of μ and s
m = |μ|, n = |s|
 // Initialize the score matrix M and traceback matrix T
Initialize a m × n matrix M, with all elements set to 0.0
Initialize a m × n matrix T, with all elements set to 0
 // Initialize the first row of M and T
for i ∈ {1, 2,···, n − 1} do
    M(0, i) ← M(0, i − 1) + f(0.0, si)
    T(0, i) ← 2
end
 // Initialize the first column of M and T
for i ∈ {1, 2,···,m − 1} do
  M(i, 0) ← M(i − 1, 0) + f(µi , “ − ”)
  T(i, 0) ← 3
end
 // Main loop over the matrix
for i ∈ {1, 2, · · · ,m − 1} do
   for j ∈ {1, 2, · · · , n − 1} do
       v1M(i − 1, j − 1) + f(μi, sj)
       v4M(i − 1, j) + f(μi, sj)
       M(i, j) ← max{v1, v4}
        if v1 ≥ v4 then
         T(i, j) ← 1;
       else
         T(i, j) ← 4;
       end
   end
end
 // Traceback
Find the maximum value M(m − 1, k) in the last row of the scoring matrix M.
Traceback from position (m − 1, k) to position (0, 0) using T to generate the optimal
 alignment path γ.

Post-processing of alignment path

In the alignment path γ, one k-mer might be aligned to noise (scenario 4) or one current signal segment might be aligned with ‘−’ (scenario 3). First, we remove such pairs from the alignment path. Then, we re-compute a weighted mean for each remaining k-mer, since one k-mer may be aligned to multiple current signal segments. Here, we combine these segments and calculate the weighted mean over the corresponding data points with hidden state 1, that is, zi=1.

Let us assume there are m segments aligned to the same k-mer si, where nj denotes the number of data points assigned to the ‘curr’ state (zi=1) in the jth segment and μj denotes the mean of the jth segment. The weighted mean corresponding to the k-mer si, denoted by μ^i, is defined as follows:

(13) μ^i = Σ_{j=1}^{m} [nj / (n1 + n2 + ⋯ + nm)] μj.

In the end, we obtain a one-to-one correspondence between each k-mer and its newly derived weighted mean, denoted by μ^=(μ^1,μ^2,). Note that the length of μ^ and s is the same, that is |μ^|=|s|. We denote the post-processed alignment path by γ^=(μ^,s).
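Equation 13 reduces to a few lines (function name is ours):

```python
def weighted_segment_mean(segment_means, curr_counts):
    """Weighted mean of the m segments aligned to one k-mer (Equation 13);
    each segment mean is weighted by its number of 'curr'-state data points."""
    total = sum(curr_counts)
    return sum(n * mu for n, mu in zip(curr_counts, segment_means)) / total
```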

Modification estimation

Modification state estimation on single molecule level

We are interested in estimating the modification state of each 5-mer after alignment. Tkmer provides the mean and std for each 5-mer, as well as for its modifications. Note that Tkmer may contain modification parameters for only a subset of 5-mers; for 5-mers without modification parameters, the modified mean and std are set to ‘NA’. We can evaluate the probability density of a given 5-mer under the unmodified distribution or the modified distribution, and by comparing the likelihoods, we can infer the modification state. Given the post-processed alignment path γ^=(μ^,s), we estimate the modification state of the jth 5-mer sj by

(14) uj = "un", if μsjmod="NA" or σsjmod="NA"; uj = "un", if N(μ^j|μsjun, (σsjun)²) ≥ N(μ^j|μsjmod, (σsjmod)²); uj = "mod", if N(μ^j|μsjun, (σsjun)²) < N(μ^j|μsjmod, (σsjmod)²),

where (μsjun,σsjun)=Tsj,unkmer and (μsjmod,σsjmod)=Tsj,modkmer.

In the end, we get the modification state vector u=(u1,u2,…,ui,…,u|γ^|) for each 5-mer in the post-processed alignment path γ^.
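The per-5-mer decision amounts to a likelihood comparison; a minimal sketch of Equation 14 (helper names are ours, with None encoding the ‘NA’ case) is:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def modification_state(mu_hat, un_params, mod_params):
    """'un' vs 'mod' call for one 5-mer by comparing Gaussian likelihoods
    (Equation 14); mod_params is None when Tkmer holds no modification
    parameters for this 5-mer, in which case the call defaults to 'un'."""
    if mod_params is None:
        return "un"
    p_un = gaussian_pdf(mu_hat, *un_params)
    p_mod = gaussian_pdf(mu_hat, *mod_params)
    return "un" if p_un >= p_mod else "mod"
```

With the AAACA parameters from Appendix 1—table 1, a weighted mean near 95.3 is called ‘mod’ and one near 99.4 is called ‘un’.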

Modification probability estimation on single molecule level

In addition to estimating the modification state, it is also crucial to determine the modification probability of each 5-mer on each read, which is essential for accurately benchmarking the performance at the single-molecule level.

Given the post-processed alignment path γ^=(μ^,s), we estimate the modification probability of jth 5-mer sj by

(15) pj = "NA", if μsjmod="NA" or σsjmod="NA"; pj = p(μ^j|Tsj,"mod"kmer) / [p(μ^j|Tsj,"mod"kmer) + p(μ^j|Tsj,"un"kmer)], otherwise,

where (μsjun, σsjun) = Tsj,"un"kmer and (μsjmod, σsjmod) = Tsj,"mod"kmer. Indeed, pj is the posterior probability of the modified state:

(16) pj = p(Tsj,"mod"kmer|μ^j) = p(μ^j|Tsj,"mod"kmer) p(Tsj,"mod"kmer) / [p(μ^j|Tsj,"mod"kmer) p(Tsj,"mod"kmer) + p(μ^j|Tsj,"un"kmer) p(Tsj,"un"kmer)]

where we assume that the prior probabilities are equal, p(Tsj,modkmer)=p(Tsj,unkmer).

In the end, we get the modification probability vector p=(p1,p2,…,pi,…,p|γ^|) for each 5-mer in the post-processed alignment path γ^.
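Equations 15–16 can be sketched with scipy's Gaussian density (function name is ours; equal priors are assumed, as in the text):

```python
from scipy.stats import norm

def modification_probability(mu_hat, un_params, mod_params):
    """Posterior probability of the modified state under equal priors
    (Equations 15-16); returns None ('NA') when the 5-mer has no
    modification parameters in Tkmer."""
    if mod_params is None:
        return None
    p_un = norm.pdf(mu_hat, loc=un_params[0], scale=un_params[1])
    p_mod = norm.pdf(mu_hat, loc=mod_params[0], scale=mod_params[1])
    return p_mod / (p_mod + p_un)
```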

Modification rate estimation on site level

In real data analysis, we need to determine the modification rate of a genomic location, which quantifies the probability that a read mapped to this location is in the modified state. For a genomic location, we denote the 5-mers from reads mapped to this location as (s1,s2,…,sN). Note that all si, i=1,2,…,N, are identical 5-mers on different reads. The modification states of these 5-mers on the different reads are denoted by (u1,u2,…,uN). The modification rate is given by

(17) η = [Σ_{i=1}^{N} I(ui="mod")] / N

Estimation of k-mer parameters table

This section provides the inference algorithms for 5-mer parameter table Tkmer. This is necessary for training data, as we do not have accurate prior parameters for Tkmer.

Given input data Y=(y1,y2,…,yi,…,yN), we first perform basecalling and mapping to get S=(s1,s2,…,si,…,sN), then run HHMM for all reads to get M=(μ1,μ2,…,μi,…,μN). We assume there are ni fragments for the ith read, yi=(yi(1),yi(2),…,yi(ni)). Similarly, we have si=(si(1),si(2),…,si(ni)) and μi=(μi(1),μi(2),…,μi(ni)).

The inference algorithm (Appendix 1–algorithm 3) then estimates Tkmer given S, M and Tref (https://github.com/nanoporetech/kmer_models; Brennen, 2023). As illustrated in Appendix 1—figure 4, the inference algorithm infers the parameters iteratively. With a given Tkmer, we can obtain the pairing between raw current signal segments and 5-mers, as well as the modification states, through the alignment algorithm (‘Alignment of signal segments with reference sequence’). Given this pairing, we extract all current signal segments for a specific 5-mer, which may correspond to different genomic locations and different reads. As there are two states for this 5-mer (unmodified and modified), we expect to observe two peaks in the density of the current signal segments. Therefore, we can fit a two-component Gaussian mixture model to the current signal segments to estimate the relevant parameters of the given 5-mer in Tkmer.

Appendix 1—figure 4
Estimation of 5-mer parameters.

Tkmer is initialized using Tref. In training, our algorithm infers the parameters iteratively. After training, Tkmer is fixed and used for testing.

Appendix 1–algorithm 3. Inference algorithm of k-mer parameter table.
Input: S, M, Tref
Output:Tkmer
 // Initialize Tkmer, ∀s ∈ {AAAAA,AAAAC, · · · , TTTTT}, ∀u ∈ {“un”, “mod”}
for (μs,u, σs,u) ∈ Tkmer do
   (μs,u, σs,u) ← Ts,"un"ref
end
 // iterate T times to update Tkmer
for round = 1, 2 · · · T do
  // align all fragments of all reads
  Perform full alignment algorithm (Appendix 1–algorithm 1) for μi(k) and si(k) to get the alignment path γi(k) ,
   where i = 1 · · · N and k = 1 · · · ni.
  Post-process (‘Post-processing of alignment path’) the alignment path γi(k) into γ^i(k) = [(μ^1(k), s1(k)), (μ^2(k), s2(k)), …].
  Concatenate all γ^i(k) and re-index into one vector Γ^ = [(μ^1, s1), (μ^2, s2), …].
   // parameter estimation
   for s0 ∈ {AAAAA,AAAAC, · · · , TTTTT} do
     Collect all μ^j in Γ^ where sj = s0 into one vector x = {μ^j : (μ^j, sj) ∈ Γ^, sj = s0}.
     Fit x using a two-component Gaussian mixture model with the first component
     mean fixed to μ0, where (μ0, σ0) = Ts0,"un"ref, and obtain the std of the first
     component σ1 and the mean and std of the second component μ2, σ2.
     Update the parameters of 5-mer s0 in Tkmer: Ts0,"un"kmer ← (μ0, σ1) and
     Ts0,"mod"kmer ← (μ2, σ2)
   end
   Update Tkmer manually based on several heuristic criteria.
end

In practice, the following heuristic criteria are used: (1) for a given 5-mer, perform GMM fitting for each genomic location with the same 5-mer and only pool those sites with a high modification rate; (2) there must be two peaks in the density plot of a given 5-mer; (3) the means of the two peaks must be far apart; (4) the weight of the second component (modified state) must be large, e.g. >0.1; (5) the weight of the first component (unmodified state) should be larger than that of the second component (modified state), as the majority of nucleotides are unmodified; (6) the std of both components in the GMM is limited (σ1<5, σ2<5) to comply with the physical settings.
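Off-the-shelf GMM implementations do not support fixing one component mean, so the constrained fit in Appendix 1–algorithm 3 can be sketched as a small EM loop (function name and initialization heuristics are ours):

```python
import numpy as np

def fit_fixed_mean_gmm(x, mu1, n_iter=200):
    """EM for a two-component 1-D Gaussian mixture with the first component
    mean fixed to mu1 (the unmodified mean from Tref), as in
    Appendix 1-algorithm 3. Returns (sigma1, mu2, sigma2, w2), where w2 is
    the weight of the second (modified) component."""
    x = np.asarray(x, dtype=float)
    sigma1 = sigma2 = float(np.std(x))
    mu2 = mu1 - 4.0                  # our initial guess: modified peak shifted left
    w2 = 0.3
    for _ in range(n_iter):
        # E-step: responsibilities of the modified component
        p1 = (1 - w2) * np.exp(-0.5 * ((x - mu1) / sigma1) ** 2) / sigma1
        p2 = w2 * np.exp(-0.5 * ((x - mu2) / sigma2) ** 2) / sigma2
        r2 = p2 / (p1 + p2)
        # M-step: mu1 stays fixed, everything else is re-estimated
        w2 = float(r2.mean())
        sigma1 = float(np.sqrt(np.sum((1 - r2) * (x - mu1) ** 2) / np.sum(1 - r2)))
        mu2 = float(np.sum(r2 * x) / np.sum(r2))
        sigma2 = float(np.sqrt(np.sum(r2 * (x - mu2) ** 2) / np.sum(r2)))
    return sigma1, mu2, sigma2, w2
```

The heuristic criteria above can then be applied to the returned parameters, e.g. rejecting fits with w2 ≤ 0.1 or σ1, σ2 ≥ 5.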

In the end, we will get an updated Tkmer from the training data, which contains the modification parameters for the strongest 5-mers. As illustrated in Appendix 1—table 1, we do have parameters for the modification state for certain k-mers such as ‘AAACA’, ‘AAACT’ etc.

Evaluation of segmentation and event alignment

The output of SegPore eventalign is the alignment path γ = (μ, s), which represents a one-to-one correspondence between the mean list μ = (μ1, μ2, ...) and the k-mer list s = (s1, s2, ...). In addition, the corresponding standard deviation list σ = (σ1, σ2, ...) is also obtained along the alignment path. Appendix 1—table 3 is an example of the SegPore eventalign output, which illustrates one alignment path. The column ref_kmer corresponds to s, the column mean corresponds to μ, and the column stdv corresponds to σ.

Appendix 1—table 3
SegPore eventalign output example.
read_idx  read_name                             contig  pos  strand  ref_kmer  model_kmer  mean     stdv   start_idx  end_idx  event_len
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       157  +       GGTGT     GGTGT       79.339   2.414  47296      47321    18
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       158  +       GTGTC     GTGTC       89.809   3.579  47276      47292    14
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       159  +       TGTCT     TGTCT       117.017  6.287  47251      47272    18
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       160  +       GTCTT     GTCTT       76.448   1.038  47148      47247    87
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       161  +       TCTTA     TCTTA       76.036   1.044  47117      47144    25
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       162  +       CTTAG     CTTAG       78.886   1.127  47092      47113    21
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       163  +       TTAGT     TTAGT       92.971   2.307  47007      47088    73
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       164  +       TAGTG     TAGTG       121.129  4.2    46882      46983    90
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       165  +       AGTGT     AGTGT       90.22    4.554  46822      46878    33
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       167  +       TGTGC     TGTGC       104.566  2.563  46798      46818    8
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       168  +       GTGCT     GTGCT       91.513   2.51   46778      46794    15
80741     003655f4-ff77-409d-a29e-20dc778d1c47  A       169  +       TGCTT     TGCTT       111.485  4.001  46754      46774    10

Given the eventalign results, we use two metrics to evaluate the performance of the segmentation and event alignment: (1) the average std σ^, and (2) the average log-likelihood L^. Assuming there are N reads in total (each read has one path), and the nth read has Kn events (|γn|=Kn). The average std σ^ is defined as

(18) σ^ = (1/N) Σ_{n=1}^{N} [(1/Kn) Σ_{k=1}^{Kn} σk],

and the average log-likelihood L^ is defined as

(19) L^ = (1/N) Σ_{n=1}^{N} [(1/Kn) Σ_{k=1}^{Kn} log N(μk|μskref, (σskref)²)]

where (μskref,σskref)=Tskref.
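A minimal sketch of computing both metrics from eventalign-style output (the data layout and names are ours, not SegPore's file format):

```python
import numpy as np

def eventalign_metrics(paths, t_ref):
    """Average event std (Equation 18) and average log-likelihood under the
    reference model (Equation 19). `paths` is a list of per-read event lists
    [(kmer, mean, std), ...] and `t_ref` maps kmer -> (mu_ref, sigma_ref)."""
    per_read_sigma, per_read_ll = [], []
    for events in paths:
        stds, lls = [], []
        for kmer, mean, std in events:
            mu_r, sd_r = t_ref[kmer]
            stds.append(std)
            lls.append(-0.5 * np.log(2 * np.pi * sd_r ** 2)
                       - (mean - mu_r) ** 2 / (2 * sd_r ** 2))
        per_read_sigma.append(np.mean(stds))    # inner average over the read's events
        per_read_ll.append(np.mean(lls))
    return float(np.mean(per_read_sigma)), float(np.mean(per_read_ll))
```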

As shown in Appendix 1—figure 5, the red line represents the event mean μk and the shaded area represents the std σskest for event k. A poorly segmented raw signal corresponding to an event will exhibit a large standard deviation. Therefore, a smaller σ^ indicates lower variations within the raw signal segment, signifying better performance.

Appendix 1—figure 5
Illustration of segmentation and event alignment.

If an event is aligned to the correct reference k-mer, the mean μk will be close to the reference μskref and the log-likelihood L^ will be large. So a higher L^ means results more similar to ONT’s estimates and better performance.

Appendix 2

Introduction

We introduced the hierarchical hidden Markov model (HHMM) in Appendix 1. The HHMM is a two-layer HMM, consisting of an outer HMM and an inner HMM. The outer HMM specifies the base blocks and transition blocks. An inner HMM models the signal within each base block, while linear models are used for the transition blocks. However, we have not yet provided the details of the parameter inference. This document provides a comprehensive introduction to the inference algorithm.

As we know, the raw fast5 files are very large, containing millions of reads, and each read has around 20,000 measurements in the raw current signal. This poses a significant computational burden on the HHMM inference algorithm. Therefore, we use a GPU to accelerate the inference algorithm.

We will first introduce the general inference algorithm, then describe the implementation details.

For simplicity, we assume the input is a raw current signal of a single read with N measurements y=(y1,y2,,yN). The output is the hidden states of the outer HMM g=(g1,g2,,gN) and the hidden states of the inner HMM h=(h1,h2,,hN), where gi{B",T"} and hi{curr",prev",next",noise"}.

Given a hidden state configuration of the outer HMM g, it partitions the raw current signal y into alternating base blocks and transition blocks c=(c1,c2,…,ci,…,c2K+1), where ci∈{"B","T"} denotes the label of the ith block. Note that ci="B" for i=1,3,…,2K+1 and ci="T" for i=2,4,…,2K. Now we can denote y=(y(1),y(2),…,y(2K+1)).

The general idea of the inference algorithm is that we will calculate the likelihood for all possible configurations of the hidden states of the outer HMM g and choose the one with the largest likelihood as our final estimation. Due to the special constraints of g, we can enumerate all possible configurations (‘Enumeration of hidden states of outer HMM’). Here we denote the full configuration set of g by G. The general inference algorithm is discussed in detail in ‘General inference algorithm’.

The general inference algorithm provides the analytic forms for calculating the exact likelihood. However, a direct implementation of the algorithm (Python and C++) requires huge amounts of computation resources and time to handle a typical fast5 file. Therefore, we implement a GPU version of the inference algorithm (‘Implementation details’), in which base blocks from different configurations are simultaneously inferred on different GPU cores.

Enumeration of hidden states of outer HMM

This section describes how to get the full configuration set G=(g1,g2,…,gi,…,gM) given the raw signal y. Obviously, it is not possible to enumerate all hidden state configurations gi=(gi1,gi2,…,gij,…,giN) (where gij∈{"B","T"}), because there are 2^N configurations and N can be larger than 20,000. However, the raw signal y has special properties: (1) the signal consists of alternating base and transition blocks; (2) the hidden state gi does not change within a block; (3) the average length of base blocks is larger than that of transition blocks; (4) the variation of base blocks is smaller than that of transition blocks. Based on these properties, we can partition the raw signal y into atomic-level base and transition blocks (‘Atomic block initialization’). Multiple consecutive atomic-level base and transition blocks can be merged into a single larger base block. As shown in Appendix 2—figure 1a, we partition the raw signal y into 15 atomic-level blocks (B1,T1,B2,T2,…,B7,T7,B8). We can merge (B1,T1,B2) into a new base block, or (B5,T5,B6,T6,B7) into a new base block (Appendix 2—figure 1b).

Given a pre-defined partition g0 of the raw signal y, we can derive the 2K+1 atomic blocks (B1,T1,B2,T2,…,BK,TK,BK+1). To facilitate the description of the enumeration algorithm, we define seg(i,j)=(Bi+1,Ti+1,Bi+2,…,Tj−1,Bj) to be the block obtained by merging all atomic blocks from Bi+1 to Bj, where 0≤i<j≤K+1. A path is then a list of non-overlapping and consecutive blocks. For example, the path corresponding to gi in Appendix 2—figure 1b is given by [seg(0,2),T2,B3,T3,B4,T4,seg(4,7),T7,B8]. To reduce the number of paths, we set the maximum number of base blocks in any merged block to Nmax. The enumeration algorithm is given by Appendix 2–algorithm 1.

Appendix 2–algorithm 1. Enumeration of hidden states of outer HMM.
Input: g0, Nmax, K
Output: G
Derive (B1, T1, B2, T2,··· ,BK, TK, BK+1) from g0
G{}
for path in recur_enum(0) do
  Convert path to g
  Append g to G
end
 // Recursive function to enumerate all paths starting from position st
Function recur_enum(st):
 res = []
if st = K then
  return [BK+1]
end
 // “[” is inclusive and “)” is exclusive
for i ∈ [st, K + 1) do
   if ist > Nmax then
    return res
   end
   tmp_seg = seg(st, i)
   all_paths_from_i = recur_enum(i)
     for path in all_ paths_ from_i do
     tmp_path = concatenate tmp_seg with path
     Append tmp_path to res
     end
end
return res;
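As a concrete illustration, here is a minimal Python sketch of the enumeration (our own naming, not the SegPore implementation). It represents the merged block seg(i, j) as the token ('seg', i, j) and makes the kept transition blocks Ti explicit as ('T', i):

```python
def enumerate_paths(K, n_max):
    """Sketch of Appendix 2-algorithm 1: enumerate outer-HMM paths.

    Tokens: ('seg', i, j) is the merged block seg(i, j) =
    (B_{i+1}, T_{i+1}, ..., B_j); ('T', i) is the atomic transition
    block T_i kept between two merged blocks.
    """
    def recur(st):
        # All paths covering B_{st+1}, T_{st+1}, ..., B_{K+1}.
        res = []
        for i in range(st + 1, K + 2):
            if i - st > n_max:        # merged block would exceed N_max
                break
            head = [('seg', st, i)]
            if i == K + 1:            # merged block reaches the read end
                res.append(head)
            else:                     # keep T_i, recurse on the remainder
                res.extend(head + [('T', i)] + tail for tail in recur(i))
        return res
    return recur(0)
```

With a sufficiently large Nmax, each of the K atomic transition blocks is independently either kept or merged away, so the sketch returns 2^K paths.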

General inference algorithm

Given all possible hidden state configurations of the outer HMM G, we want to calculate the likelihood for all gG. In order to calculate the likelihood, we need to specify the parameters for the emission distributions.

Given g, we can derive that y is divided into 2K+1 alternating base and transition blocks. We denote all blocks by y=(y(1),y(2),…,y(i),…,y(2K+1)), where y(i) represents the ith block. The base blocks are given by (y(b1),y(b2),…,y(bi),…,y(bK+1)), where bi=2i−1 and i∈{1,2,…,K+1}. The transition blocks are given by (y(t1),y(t2),…,y(ti),…,y(tK)), where ti=2i and i∈{1,2,…,K}.

We pre-specify the transition probabilities Touter={TBBouter,TBTouter,TTBouter,TTTouter} between the base state and the transition state in the outer HMM, as shown in Appendix 2—table 1. The pre-specification is based on three considerations: (1) the probability of a transition from B to T after 10 consecutive B states is about 0.1; (2) the probability of a transition from T to B after 10 consecutive T states is larger than 0.9; (3) fixing these values reduces the computational load of the parameter inference. We pre-specify the initial probability of the first hidden state to 1 for the ‘B’ state, since g always starts with a ‘B’ state.

Appendix 2—table 1
Transition parameters of the outer HMM (Touter).

        ‘B’     ‘T’
‘B’     0.99    0.01
‘T’     0.10    0.90

For the inner HMM, we pre-specify the transition probabilities Tinner shown in Appendix 2—table 2, based on the following considerations: (1) a large proportion of the data should come from the ‘curr’ state, that is, it is easy to transition from other states to the ‘curr’ state; (2) it is unlikely to transition from ‘prev’ to ‘next’, and vice versa; (3) it is unlikely to transition from other states to the ‘noise’ state. We pre-specify the initial probability of each of the four hidden states to 0.25, that is, equal probabilities for all four states. We pre-specify the parameters of the ‘noise’ state (Unif(lb,ub)) to lb=50 and ub=130, which are estimated from the current signals across different datasets to represent a reasonable range.

Appendix 2—table 2
Transition parameters of the inner HMM (Tinner).

          ‘curr’   ‘prev’   ‘next’   ‘noise’
‘curr’    0.925    0.025    0.025    0.025
‘prev’    0.300    0.500    0.100    0.100
‘next’    0.300    0.100    0.500    0.100
‘noise’   0.300    0.100    0.100    0.500
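For reference, the two pre-specified transition matrices can be written down directly. A small sketch (variable names are ours, not from the SegPore code) that also checks each row is a valid probability distribution:

```python
import numpy as np

# Pre-specified transition matrices from Appendix 2-tables 1 and 2.
T_outer = np.array([[0.99, 0.01],    # rows/cols: 'B', 'T'
                    [0.10, 0.90]])

T_inner = np.array([                 # rows/cols: 'curr', 'prev', 'next', 'noise'
    [0.925, 0.025, 0.025, 0.025],
    [0.300, 0.500, 0.100, 0.100],
    [0.300, 0.100, 0.500, 0.100],
    [0.300, 0.100, 0.100, 0.500]])

pi_inner = np.full(4, 0.25)          # equal initial probabilities

# Each row is a probability distribution and must sum to one.
assert np.allclose(T_outer.sum(axis=1), 1.0)
assert np.allclose(T_inner.sum(axis=1), 1.0)
```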

The remaining parameters, mainly associated with emission distributions, need to be estimated. For base blocks, we have inner HMM parameters ϕ=(ϕ(b1),ϕ(b2),,ϕ(bK+1)) and σ=(σ(b1),σ(b2),,σ(bK+1)). For transition blocks, we have linear coefficients β0=(β0(t1),β0(t2),,β0(tK)), β1=(β1(t1),β1(t2),,β1(tK)) and noise std σϵ=(σϵ(t1),σϵ(t2),,σϵ(tK)).

The general idea of the inference algorithm is to calculate the joint likelihood of the outer HMM for any g. However, it is not possible to calculate the emission probability of each measurement independently, as there exist dependencies among the measurements within a base or transition block. We therefore first calculate the joint likelihood of a base block (inner HMM) or a transition block (linear model), then spread the joint likelihood evenly over the block’s measurements and use the result as a substitute for the emission probability of each measurement in the outer HMM. We use ω=(ω1,ω2,…,ωN) to denote the approximated emission probability of each measurement. Note that ωi is derived either from the inner HMM or from the linear model, and |ω|=|y|=|g|=|h|.

The general inference algorithm is given by Appendix 2–algorithm 2.

Appendix 2–algorithm 2. General inference algorithm for HHMM.
Input: y, G, Touter, πouter, Tinner, πinner
Output: best path g~ of outer HMM, best path h~ of inner HMM, ϕ, σ, β0, β1, σϵ
 // Initialization
Set all elements of ω, g~ and h~ to 0, where |ω|=|g~|=|h~|=|y|
H ← {}
Γ ← {}
 // Calculate likelihood for each g
for g ∈ G do
  h ← h~
  Divide y into base and transition blocks (y(1), y(2), …, y(2K+1)) based on g
   // transition state parameter estimation
  Estimate ω^(ti), β^0(ti), β^1(ti), σ^ϵ(ti) using the linear model (Appendix 2–algorithm 3), where ti = 2i and i ∈ {1, 2, …, K}
  Update the relevant part of ω using ω^(ti)
   // base state parameter estimation
  Initialize ϕ(bi) and σ(bi) using the empirical mean and std of y(bi), where bi = 2i − 1 and i ∈ {1, 2, …, K + 1}
  ϕnew ← ϕ
  σnew ← σ
  for round = 1, 2, …, R do
    for i ∈ (1, 2, …, K + 1) do
      bi ← 2i − 1
       // inner HMM parameter estimation
      Estimate ω^(bi), h^(bi), ϕ^(bi), σ^(bi) using the Forward-backward algorithm (Appendix 2–algorithm 4) given y(bi), ϕ(bi), σ(bi), ϕ(bi−1), σ(bi−1), ϕ(bi+1), σ(bi+1), Tinner, πinner
      Update the corresponding elements of ϕnew and σnew using ϕ^(bi), σ^(bi)
      Update the relevant parts of h and ω using h^(bi) and ω^(bi)
    end
    ϕ ← ϕnew
    σ ← σnew
  end
  Calculate the joint likelihood of the outer HMM using g, ω, Touter, πouter:
        \gamma=\pi^{\mathrm{outer}}_{g_1}\prod_{i=2}^{N}T^{\mathrm{outer}}_{g_{i-1}g_i}\prod_{j=1}^{N}\omega_j
  Append h and γ to H and Γ, respectively.
end
Choose the largest likelihood from Γ and retrieve the corresponding g and h from G and H, respectively.
g~ ← g
h~ ← h
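The likelihood γ in Appendix 2–algorithm 2 is a product of many probabilities, which underflows in floating point for long reads, so in practice it is computed in log space. A minimal sketch (our own naming; states are encoded as 0 = ‘B’, 1 = ‘T’):

```python
import numpy as np

def log_joint_likelihood(g, omega, T_outer, pi_outer):
    """Log of gamma = pi_{g1} * prod_i T_{g_{i-1} g_i} * prod_j omega_j.

    g      : sequence of outer-HMM state indices (0='B', 1='T'), length N
    omega  : per-measurement approximate emission probabilities, length N
    """
    ll = np.log(pi_outer[g[0]])               # initial state probability
    for prev, curr in zip(g[:-1], g[1:]):     # transition terms
        ll += np.log(T_outer[prev, curr])
    ll += np.sum(np.log(omega))               # approximate emission terms
    return ll
```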

Implementation details

Atomic block initialization

As mentioned in ‘Enumeration of hidden states of outer HMM’, we need to partition the raw signal y into 2K+1 atomic blocks

(B1,T1,B2,T2,,BK,TK,BK+1). If we can identify all atomic transition blocks (T1,T2,,TK), then the atomic base blocks are automatically determined.

We assume atomic transition blocks (T1,T2,,TK) have higher variations than atomic base blocks. If we fit a line to a transition block, the absolute value of the slope of the line should be large. Based on this idea, we can obtain the atomic transition blocks as follows:

  1. For each measurement i, fit a line to (yi−1,yi,yi+1) and obtain the slope β1, fit a line to (yi−2,yi−1,yi,yi+1,yi+2) and obtain the slope β2, and use βi=|0.5(β1+β2)| as the absolute slope for yi.

  2. Smooth the obtained slopes β=(β1,β2,…,βN) by taking the mean over a sliding window. This gives the smoothed slopes β¯.

  3. Find peaks in β¯ using the find_peaks() function from the Python scipy.signal package, with the minimum distance between detected peaks set to 10. This gives a list of peak positions θ=(θ1,θ2,…,θK).

  4. Expand θk to the atomic transition block Tk. We fit a 3-segment spline to

     (yθk−8,…,yθk−l,…,yθk,…,yθk+r,…,yθk+8), where 0<l,r<8. The slopes of the first segment (yθk−8,…,yθk−l−1) and the third segment (yθk+r+1,…,yθk+8) are set to 0, while a standard linear model is fit to the second segment (yθk−l,…,yθk,…,yθk+r); in the fitting, the slope of the second segment should be larger than 5.0. After fitting the spline, we take the residuals and calculate the joint likelihood, assuming each residual follows a Gaussian distribution N(0,4). By enumerating all l,r, we pick the l^,r^ with the largest joint likelihood. The final kth transition block Tk then corresponds to the interval [θk−l^,θk+r^].
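Steps 1–3 can be sketched in a few lines of Python with scipy. This is a sketch under stated assumptions, not the SegPore implementation: the smoothing window size is not specified in the text, so 5 is an assumed value, and the height threshold is a pragmatic noise floor we add to discard numerically tiny peaks; step 4’s spline expansion is omitted.

```python
import numpy as np
from scipy.signal import find_peaks

def candidate_transition_peaks(y, window=5, min_dist=10):
    """Steps 1-3 of atomic block initialization (sketch only)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    beta = np.zeros(n)
    for i in range(2, n - 2):
        # step 1: slopes of lines through 3 and 5 neighboring points
        b1 = np.polyfit(np.arange(3), y[i - 1:i + 2], 1)[0]
        b2 = np.polyfit(np.arange(5), y[i - 2:i + 3], 1)[0]
        beta[i] = abs(0.5 * (b1 + b2))
    # step 2: smooth with a sliding-window mean (window size is assumed)
    beta_bar = np.convolve(beta, np.ones(window) / window, mode='same')
    # step 3: peaks of the smoothed slope mark candidate transitions;
    # the height threshold is our addition to drop numerically tiny peaks
    peaks, _ = find_peaks(beta_bar, distance=min_dist,
                          height=0.1 * beta_bar.max())
    return peaks
```

For a step-like signal, a single peak is reported near the jump, which would then be expanded into an atomic transition block by step 4.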

Emission probabilities of linear model

For the ith transition block, the measurements are given by y(ti)=(y1(ti),y2(ti),…,yNti(ti)), where ti=2i and i∈{1,2,…,K}. We denote the corresponding indices of y(ti) by x(ti)=(1,2,…,Nti). Here we want to obtain the emission probabilities ω^(ti) for each measurement, which can be used by the outer HMM. Detailed steps are provided in Appendix 2–algorithm 3.

Appendix 2–algorithm 3. Emission probabilities of linear model.
Input: y(ti), x(ti)
Output: ω^(ti), β0(ti), β1(ti), σϵ(ti)
Initialize ω^(ti) to a 1 × Nti all-zero vector.
Fit a linear model of y(ti) versus x(ti); obtain the maximum likelihood estimates β0(ti), β1(ti) and σϵ(ti)
Calculate the maximum log joint likelihood log p(y(ti) | x(ti), β0(ti), β1(ti), σϵ(ti)) = log N(y(ti) − β1(ti) x(ti) − β0(ti) | 0, (σϵ(ti))2)
Set each element of ω^(ti) to exp((1/Nti) log p(y(ti) | x(ti), β0(ti), β1(ti), σϵ(ti)))
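A minimal Python sketch of Appendix 2–algorithm 3 (our own naming; ordinary least squares is used for the fit, whose estimates coincide with maximum likelihood under Gaussian noise):

```python
import numpy as np
from scipy.stats import norm

def linear_emission(y_t):
    """Per-measurement emission probabilities for one transition block.

    Fits y ~ beta0 + beta1 * x by least squares and spreads the joint
    log-likelihood evenly over the block's measurements.
    """
    y_t = np.asarray(y_t, dtype=float)
    x = np.arange(1, len(y_t) + 1)
    beta1, beta0 = np.polyfit(x, y_t, 1)
    resid = y_t - (beta0 + beta1 * x)
    # ML estimate of the noise std (no n-1 correction)
    sigma = np.sqrt(np.mean(resid ** 2))
    log_lik = norm.logpdf(resid, loc=0.0, scale=sigma).sum()
    omega = np.full(len(y_t), np.exp(log_lik / len(y_t)))
    return omega, beta0, beta1, sigma
```

Every element of ω^(ti) receives the same value exp(log-likelihood / Nti), matching the averaging step in the pseudocode.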

Emission probabilities of inner HMM

This section describes how to obtain the approximated emission probabilities of the outer HMM for the ith base block y(bi)=(y1(bi),y2(bi),…,yNbi(bi)), which is modeled using an inner HMM. Here bi=2i−1 and i∈{1,2,…,K+1}.

For the inner HMM, we list all the parameters in Appendix 2—table 3.

Appendix 2—table 3
Parameters of the inner HMM for the bi th block.

Parameter   Fixed parameter   Description
Tinner      Yes               Transition probabilities between hidden states (‘curr’, ‘prev’, ‘next’, ‘noise’), specified in Appendix 2—table 2
πkinner     Yes               Probability of the first hidden state, πkinner=0.25, where k∈{1,2,3,4}
lb          Yes               Lower bound of uniform distribution, lb=50
ub          Yes               Upper bound of uniform distribution, ub=130
ϕ(bi)       No                Gaussian mean of the ith base block
σ(bi)       No                Gaussian std of the ith base block
ϕ(bi−1)     No                Gaussian mean of the (i−1)th base block
σ(bi−1)     No                Gaussian std of the (i−1)th base block
ϕ(bi+1)     No                Gaussian mean of the (i+1)th base block
σ(bi+1)     No                Gaussian std of the (i+1)th base block

We want to obtain the approximated emission probabilities ω^(bi), hidden states h^(bi), estimated parameters ϕ^(bi) and σ^(bi) for the ith base block y(bi). To achieve this, we follow the Forward-backward algorithm in Bishop and Nasrabadi, 2006 (Chapter 13.2). The general idea of the Forward-backward algorithm is to use a Q-function to approximate the marginal likelihood, then use an expectation-maximization (EM) algorithm for the parameter estimation.

We introduce the following notation for the Forward-backward algorithm. We use Z=(z1,z2,…,zNbi) to denote the hidden states, where Z is an Nbi×4 matrix, zj=(zj1,zj2,zj3,zj4) and zjk∈{0,1}. If zjk=1, then hj corresponds to the kth hidden state. We use α (Nbi×4) to denote the forward probabilities and β (Nbi×4) to denote the backward probabilities. γ (Nbi×4) denotes the posterior distribution calculated from α and β. ξ ((Nbi−1)×4×4) denotes the joint distribution of two successive hidden states. We use θ=(ϕ(bi),σ(bi)) to denote all parameters to be estimated.

(1) \(Q(\theta,\theta^{\mathrm{old}})=\sum_{Z}p(Z\mid y^{(b_i)},\theta^{\mathrm{old}})\log p(y^{(b_i)},Z\mid\theta)=\sum_{k=1}^{4}\gamma_{1k}\log\pi_k^{\mathrm{inner}}+\sum_{n=2}^{N_{b_i}}\sum_{j=1}^{4}\sum_{k=1}^{4}\xi_{n-1,j,k}\log T_{jk}^{\mathrm{inner}}+\sum_{n=1}^{N_{b_i}}\big\{\gamma_{n1}\log p(y_n^{(b_i)}\mid\phi^{(b_i)},\sigma^{(b_i)})+\gamma_{n2}\log p(y_n^{(b_i)}\mid\phi^{(b_{i-1})},\sigma^{(b_{i-1})})+\gamma_{n3}\log p(y_n^{(b_i)}\mid\phi^{(b_{i+1})},\sigma^{(b_{i+1})})+\gamma_{n4}\log p(y_n^{(b_i)}\mid lb,ub)\big\}\)

Here we only need to update α, β, γ, ξ, ϕ(bi) and σ(bi) in the inference process of the inner HMM for the ith base block. Our inference algorithm is a modification of the standard Forward-backward algorithm based on the defined Q-function, as described in Appendix 2–algorithm 4.

Appendix 2–algorithm 4. Inference algorithm of inner HMM.
Input: y(bi),ϕ(bi),σ(bi),ϕ(bi1),σ(bi1),ϕ(bi+1),σ(bi+1),Tinner,πinner,lb,ub
Output: ω^(bi),h^(bi),ϕ^(bi),σ^(bi)
Initialize h^(bi) and ω^(bi) to the 1×Nbi all zero vector.
Initialize α, β, γ, ξ
 // EM algorithm
for round = 1, 2, · · · , R do
  E-step: update α, β, γ, and ξ given y(bi),ϕ(bi),σ(bi),ϕ(bi1),σ(bi1),ϕ(bi+1),σ(bi+1),Tinner,πinner,lb,ub
   M-step: calculate ϕ^(bi) and σ^(bi) by Equation 2 and Equation 3
  ϕ(bi)ϕ^(bi)
  σ(bi)σ^(bi)
  Calculate the Q-function (log marginal likelihood) q^ given all estimated parameters
end
Derive h^(bi) from γ by taking the state with highest probability for each data point, i.e.
hn(bi)=argmax([γn1,γn2,γn3,γn4])
Set each element of ω^(bi) to exp(q^/Nbi).
(2) \(\phi^{(b_i)}=\dfrac{\sum_{n=1}^{N_{b_i}}\gamma_{n1}\,y_n^{(b_i)}}{\sum_{n=1}^{N_{b_i}}\gamma_{n1}}\)
(3) \(\sigma^{(b_i)}=\sqrt{\dfrac{\sum_{n=1}^{N_{b_i}}\gamma_{n1}\,\big(y_n^{(b_i)}-\phi^{(b_i)}\big)^2}{\sum_{n=1}^{N_{b_i}}\gamma_{n1}}}\)
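The E- and M-steps above can be sketched with a scaled forward-backward pass. This is a simplified stand-in (our own naming), not the SegPore implementation: it re-estimates only the ‘curr’ Gaussian as in Equations 2 and 3, and it uses the log marginal likelihood accumulated from the scaling constants in place of the Q-function value.

```python
import numpy as np
from scipy.stats import norm

def inner_hmm_em(y, phi, sig, phi_prev, sig_prev, phi_next, sig_next,
                 T, pi, lb=50.0, ub=130.0, rounds=5):
    """Sketch of Appendix 2-algorithm 4 (scaled forward-backward + EM).

    States: 0='curr', 1='prev', 2='next', 3='noise'.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    for _ in range(rounds):
        # emission matrix B[t, k]: Gaussians for curr/prev/next, uniform noise
        B = np.column_stack([
            norm.pdf(y, phi, sig),
            norm.pdf(y, phi_prev, sig_prev),
            norm.pdf(y, phi_next, sig_next),
            np.full(n, 1.0 / (ub - lb)),
        ])
        # scaled forward pass
        alpha = np.zeros((n, 4))
        c = np.zeros(n)
        alpha[0] = pi * B[0]
        c[0] = alpha[0].sum()
        alpha[0] /= c[0]
        for t in range(1, n):
            alpha[t] = B[t] * (alpha[t - 1] @ T)
            c[t] = alpha[t].sum()
            alpha[t] /= c[t]
        # scaled backward pass
        beta = np.ones((n, 4))
        for t in range(n - 2, -1, -1):
            beta[t] = (T @ (B[t + 1] * beta[t + 1])) / c[t + 1]
        gamma = alpha * beta                  # posterior state probabilities
        # M-step: Equations 2 and 3, updating the 'curr' Gaussian only
        w = gamma[:, 0]
        phi = np.sum(w * y) / np.sum(w)
        sig = np.sqrt(np.sum(w * (y - phi) ** 2) / np.sum(w))
    h = gamma.argmax(axis=1)                  # hard state assignment
    log_lik = np.sum(np.log(c))               # log marginal likelihood
    omega = np.full(n, np.exp(log_lik / n))   # averaged emission probs
    return omega, h, phi, sig
```

On a block whose signal mostly sits near the ‘curr’ level, with a few leading points near the ‘prev’ level, the estimated ϕ and σ converge to the ‘curr’ level while the leading points are explained by the ‘prev’ state.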

GPU-accelerated parameter inference

Given the general inference algorithm (‘General inference algorithm’) and the algorithms for calculating the emission probabilities of the inner HMM (Appendix 2–algorithm 4) and the linear model (Appendix 2–algorithm 3), we have the full algorithm for parameter inference. The actual inference, however, can be very slow because the input fast5 data are large. Here we use heuristics and GPU parallelization to accelerate the parameter inference.

We divide the raw current signal y of each read into overlapping regions, each of which starts and ends with an atomic base block. As shown in Appendix 2—figure 2, we process the rth region y(r) independently to get its segmentation result z(r). This is based on the following heuristic: the segmentation result of one region does not affect another region when they are far away from each other. The inference is performed in two rounds. In the first round, we obtain the segmentation results of the non-overlapping regions. The center part of each segmentation is not affected by the flanking parts, which may be affected by neighboring regions. In the second round, we merge the relevant parts of neighboring regions near their touch point into a new region, based on the segmentation results of the first round. We then perform the inference on the new regions to get the final segmentation of the whole read.

We use a GPU to parallelize the inference algorithm of the inner HMM (Appendix 2–algorithm 4). For each region, we first enumerate all possible paths and calculate the likelihood of each candidate path (‘Enumeration of hidden states of outer HMM’). As shown in Appendix 2—figure 2, we need to perform inner HMM parameter inference for different paths of different regions of one read. We parallelize the computation over the merged blocks seg(i,j) appearing in different paths of different regions. We first collect all merged blocks, sort them by size, and keep only the unique blocks, which are identified by the combinatorial key (region index, start atomic block index, end atomic block index). We then run the inner HMM inference (Appendix 2–algorithm 4) for each merged block on a GPU core. Note that all GPU cores perform the same computation step of Appendix 2–algorithm 4 simultaneously, but on different merged blocks.
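The deduplication step can be sketched as follows (our own naming; the actual inner HMM inference on each block is omitted). Merged blocks are keyed by (region index, start atomic block, end atomic block) and sorted by size so that similar-sized blocks can be batched together:

```python
def collect_unique_blocks(regions):
    """Deduplicate merged blocks before batched inference (sketch).

    regions: dict mapping region index -> list of paths, where each
    path is a list of tokens like ('seg', start, end) or ('T', i).
    Returns the unique merged-block keys
    (region index, start atomic block, end atomic block),
    sorted by block size (number of merged base blocks).
    """
    unique = {}
    for r, paths in regions.items():
        for path in paths:
            for tok in path:
                if tok[0] == 'seg':
                    # identical blocks from different paths collapse here
                    unique[(r, tok[1], tok[2])] = tok[2] - tok[1]
    return sorted(unique, key=unique.get)
```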

Appendix 2—figure 1
Illustration of enumeration of hidden states of outer HMM.

(a) Raw signal and the initialized atomic segmentation. (b) Examples of the full configuration set G=(g1,g2,…,gi,…,gM).

Appendix 2—figure 2
Illustration of GPU-accelerated parameter inference.

Data availability

The data utilized in this study are obtained from publicly available repositories. Details regarding the accession numbers and data processing can be found in Methods. The resulting data and the source code are hosted on GitHub (https://github.com/guangzhaocs/SegPore; copy archived at Cheng, 2025).

The following previously published data sets were used
    1. Pratanwanich PN
    (2021) NCBI BioProject
    ID PRJEB40872. Differential RNA modifications from dRNA-seq.
    1. Chen Y
    2. Davidson NM
    3. Wan YK
    (2021) NCBI BioProject
    ID PRJEB44348. SGNEx: A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines.

References

Article and author information

Author details

  1. Guangzhao Cheng

    Department of Computer Science, Aalto University, Espoo, Finland
    Contribution
    Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0009-0008-6160-3637
  2. Aki Vehtari

    Department of Computer Science, Aalto University, Espoo, Finland
    Contribution
    Resources, Supervision, Writing – original draft
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0003-2164-9469
  3. Lu Cheng

    1. Department of Computer Science, Aalto University, Espoo, Finland
    2. Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Supervision, Funding acquisition, Investigation, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    lu.cheng.ac@gmail.com
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-6391-2360

Funding

Research Council of Finland (335858)

  • Guangzhao Cheng
  • Lu Cheng

Research Council of Finland (358086)

  • Guangzhao Cheng
  • Lu Cheng

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We would like to thank Prof. Zhijie Tan from Wuhan University for a useful discussion about the molecule dynamics of Nanopore sequencing, Dr. Dan Zhang from Sichuan University for helpful tutorials about Nanopore analysis workflows. We also thank Prof. Luo Guanzheng for sharing the m6A benchmark baseline results. GC and LC acknowledge the computational resources provided by the Aalto Science-IT project. This work was supported by Research Council of Finland grants [335858, 358086 to GC and LC].

Version history

  1. Preprint posted:
  2. Sent for peer review:
  3. Reviewed Preprint version 1:
  4. Reviewed Preprint version 2:
  5. Reviewed Preprint version 3:
  6. Version of Record published:

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.104618. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2025, Cheng et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.



  1. Guangzhao Cheng
  2. Aki Vehtari
  3. Lu Cheng
(2026)
Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data
eLife 14:RP104618.
https://doi.org/10.7554/eLife.104618.4
