Raw signal segmentation for estimating RNA modification from Nanopore direct RNA sequencing data
eLife Assessment
This study presents SegPore, a valuable new method for processing direct RNA nanopore sequencing data, which improves the segmentation of raw signals into individual bases and boosts the accuracy of modified base detection. The evidence presented to benchmark SegPore is solid, and the authors provide a fully documented implementation of the method. SegPore will be of particular interest to researchers studying RNA modifications.
https://doi.org/10.7554/eLife.104618.4.sa0Valuable: Findings that have theoretical or practical implications for a subfield
- Landmark
- Fundamental
- Important
- Valuable
- Useful
Solid: Methods, data and analyses broadly support the claims with only minor weaknesses
- Exceptional
- Compelling
- Convincing
- Solid
- Incomplete
- Inadequate
During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife Assessments
Abstract
Estimating RNA modifications from Nanopore direct RNA sequencing data is a critical task for the RNA research community. However, current computational methods often fail to deliver satisfactory results due to inaccurate segmentation of the raw signal. We have developed a new method, SegPore, which leverages a molecular jiggling translocation hypothesis to improve raw signal segmentation. SegPore is a pure white-box model with enhanced interpretability, significantly reducing structured noise in the raw signal. We demonstrate that SegPore outperforms state-of-the-art methods, such as Nanopolish and Tombo, in raw signal segmentation across three large benchmark datasets. Moreover, the improved signal segmentation achieved by SegPore enables SegPore+m6Anet to deliver state-of-the-art performance in site-level m6A identification. Additionally, SegPore surpasses baseline methods like CHEUI in single-molecule level m6A identification.
Introduction
RNA modifications play important roles in different diseases, such as Acute Myeloid Leukemia (Yankova et al., 2021) and Fragile X Syndrome (Prieto et al., 2020), as well as fundamental biological processes like cell differentiation (Bellodi et al., 2013; Lee et al., 2019) and immune response (Quin et al., 2021). To date, researchers have identified over 150 different types of RNA modifications (Boccaletto et al., 2022; Zimna et al., 2023; Ohira and Suzuki, 2024; Chen et al., 2023), highlighting the complexity and diversity of RNA regulation. These modifications are essential for proper RNA function, including maintaining secondary structures and facilitating accurate protein synthesis, such as the role of tRNA modifications in decoding mRNA codons (Agris et al., 2017).
The practical applications of RNA modifications are far-reaching. For instance, N1-methylpseudouridine (m1Ψ) has been used to enhance the efficacy of COVID-19 mRNA vaccines, highlighting their therapeutic potential (Nance and Meier, 2021). However, identifying and characterizing RNA modifications presents significant challenges due to the limitations of existing experimental techniques.
Traditional methods for RNA modification detection, such as MeRIP-Seq (Meyer et al., 2012), miCLIP (Linder et al., 2015), and m6ACE-Seq (Koh et al., 2019), rely on immunoprecipitation techniques that use modification-specific antibodies. While effective, these methods have several drawbacks. First, each method requires a separate antibody for each RNA modification, making it difficult to study multiple modifications simultaneously. Additionally, these techniques can only infer modification locations from next-generation sequencing (NGS) data, rather than measuring the modifications directly. As a result, they struggle to provide single-molecule resolution, limiting their accuracy and scope.
Recent advancements in direct RNA sequencing (DRS) by Oxford Nanopore Technologies (ONT) offer a promising alternative. DRS allows for the direct measurement of electrical currents as RNA molecules translocate through a nanopore, providing a potential avenue for detecting RNA modifications at the single-molecule level. Two versions of the direct RNA sequencing (DRS) kits are available: RNA002 and RNA004. Unless otherwise specified, this study focuses on RNA002 data. In this technique, five nucleotides (5-mers) reside in the nanopore at a time, and each 5-mer generates a characteristic current signal based on its unique sequence and chemical properties (Loman et al., 2015). By analyzing these signals, it is possible to infer the original RNA sequence and detect modifications like N6-methyladenosine (m6A).
The general workflow of Nanopore direct RNA sequencing (DRS) data analysis is as follows. First, the raw electrical signal from a read is basecalled using tools such as Guppy or Dorado (blawrence-ont and malton-ont, 2026), which produce the nucleotide sequence of the RNA molecule. However, these base-called sequences do not include the precise start and end positions of each ribonucleotide (or k-mer) in the signal. Because basecalling errors are common, the sequences are typically mapped to a reference genome or transcriptome using minimap2 (Li, 2018) to recover the correct reference sequence. Next, tools such as Nanopolish (Loman et al., 2015; Simpson et al., 2017) and Tombo (Stoiber et al., 2017) align the raw signal to the reference sequence to determine which portion of the signal corresponds to each k-mer. We define this process as the segmentation and alignment task (abbreviated as the segmentation task), which is referred to as ‘eventalign’ in Nanopolish. Based on this alignment, Nanopolish extracts various features—such as the start and end positions, mean, and standard deviation of the signal segment corresponding to a k-mer. This signal segment or its derived features is referred to as an ‘event’ in Nanopolish. The resulting events serve as input for downstream RNA modification detection tools such as m6Anet (Hendra et al., 2022) and CHEUI (Acera Mateos et al., 2024).
However, significant computational challenges remain. Segmenting the raw current signal into distinct 5-mers and distinguishing between normal nucleotides and their modified counterparts is a complex task. Current methods, such as Nanopolish, employ change-point detection methods to segment the signal and use dynamic programming methods and Hidden Markov Models (HMM) to align the derived segments to the reference sequence, but they are prone to noise and inaccuracies, which degrade performance in downstream tasks like RNA modification prediction. The root of this issue lies in the fact that these methods do not accurately model the physical process of Nanopore sequencing, particularly the dynamics of the motor protein that drives RNA through the pore. As a result, the segmentation process is not well-aligned with the actual translocation mechanics, leading to signal noise that hinders precise modification detection.
SegPore is a novel tool for direct RNA sequencing (DRS) signal segmentation and alignment, designed to overcome key limitations of existing approaches. By explicitly modeling motor protein dynamics during RNA translocation with a Hierarchical Hidden Markov Model (HHMM), SegPore segments the raw signal into small, biologically meaningful fragments, each corresponding to a k-mer sub-state, which substantially reduces noise and improves segmentation accuracy. After segmentation, these fragments are aligned to the reference sequence and concatenated into larger events, analogous to Nanopolish’s ‘eventalign’ output, which serve as the foundation for downstream analyses. Moreover, the ‘eventalign’ results produced by SegPore enhance interpretability in RNA modification estimation. While deep learning–based tools such as m6Anet classify RNA modifications using complex, non-transparent features (see Figure 4—figure supplement 1), SegPore employs a simple Gaussian Mixture Model (GMM) to distinguish modified from unmodified nucleotides based on baseline current levels. This transparent modeling approach improves confidence in the predictions and makes SegPore particularly well-suited for biological applications where interpretability is essential.
By introducing SegPore, we bridge the gaps left by traditional tools. SegPore utilizes a hierarchical hidden Markov model (HHMM) for more precise segmentation and combines it with signal alignment and a GMM for RNA modification prediction, offering both greater accuracy and interpretability. This integrated approach enables robust m6A modification detection at the single-molecule level, facilitating reliable, transparent predictions for both site-specific and molecule-wide m6A modifications.
Materials and methods
SegPore workflow
Workflow overview
Request a detailed protocolAn overview of SegPore is illustrated in Figure 1A, which outlines its five-step process. The output of Step 3 is the ‘events’, which is analogous to the output generated by the Nanopolish (v0.14.0) ‘eventalign’ command and can be used as input for downstream models such as m6Anet. An ‘event’ refers to a segment of the raw signal that is aligned to a specific k-mer on a read, along with its associated features such as start and end positions, mean current, standard deviation, and other relevant statistics. Step 4 allows for direct modification prediction at both the site and single-molecule levels. Notably, a key feature of SegPore is the k-mer parameter table, which defines the mean and standard deviation for each k-mer in either an unmodified or modified state. During the training phase, Steps 3~5 are iterated multiple times to stabilize the parameters in the k-mer table, which are subsequently fixed and applied for modification prediction on the test data. A detailed description of the model is provided in Appendix 1. Unless otherwise noted, the following analysis focuses on RNA002 data, using 5-mers rather than k-mers.
SegPore workflow.
(A) General workflow. The workflow consists of five steps: (1) First, raw current signals are basecalled and mapped using Guppy and Minimap2. The raw current signal fragments are paired with the corresponding reference RNA sequence fragments using Nanopolish. (2) Next, the raw current signal of each read is segmented using a hierarchical hidden Markov model (HHMM), which provides an estimated mean () for each segment. (3) These segments are then aligned with the 5-mer list of the reference sequence fragment using a full/partial alignment algorithm, based on a 5-mer parameter table. For example, denotes the base ‘‘ at the -th position on the reference. In this example, and refer to the first and second occurrences of ‘’ in the reference sequence, respectively. Accordingly, and are aligned to , while is aligned to (4). All signals aligned to the same 5-mer across different genomic locations are pooled together, and a two-component Gaussian Mixture Model (GMM) is used to predict the modification at the site-level or single-molecule level. One component of the GMM represents the unmodified state, while the other represents the modified state. (5) GMM is used to re-estimate the 5-mer parameter table. (B) Hierarchical hidden Markov model (HHMM). The outer HMM segments the current signal into alternating base and transition blocks. The inner HMM approximates the emission probability of a base block by considering neighboring 5-mers. A linear model is used to approximate the emission probability of a transition block. (C) Full/partial alignment algorithms. Rows represent the estimated means of base blocks from the HHMM, and columns represent the 5-mers of the reference sequence. Each 5-mer can be aligned with multiple estimated means from the current signal. (D) Gaussian mixture model (GMM) for estimating modification states. The GMM consists of two components: the green component models the unmodified state of a 5-mer, and the blue component models the modified state. Each component is described by three parameters: mean (μ), standard deviation (), and weight ().
Preprocessing
Request a detailed protocolWe begin by performing basecalling on the input fast5 file using Guppy (v6.0.1), which converts the raw signal data into ribonucleotide sequences. Next, we align the basecalled sequences to the reference genome using Minimap2 (v2.28-r1209) (Li, 2018), generating a mapping between the reads and the reference sequences. Nanopolish provides two independent commands: ‘polya’ and ‘eventalign’. The ‘polya’ command identifies the adapter, poly(A) tail, and transcript region in the raw signal, which we refer to as the poly(A) detection results. The raw signal segment corresponding to the poly(A) tail is used to standardize the raw signal for each read. The ‘eventalign’ command aligns the raw signal to a reference sequence, assigning a signal segment to individual k-mers in the reference. It also computes summary statistics (e.g. mean, standard deviation) from the signal segment for each k-mer. Each k-mer together with its corresponding signal features is termed an event. These event features are then passed into downstream tools such as m6Anet and CHEUI for RNA modification detection. For full transcriptome analysis (Figure 3), we extract the aligned raw signal segment and reference sequence segment from Nanopolish’s events for each read by using the first and last events as start and end points. For in vitro transcription (IVT) data with a known reference sequence (Figure 4), we extract the raw signal segment corresponding to the transcript region for each input read based on Nanopolish’s poly(A) detection results.
Due to inherent variability between nanopores in the sequencing device, the baseline levels and standard deviations of k-mer signals can differ across reads, even for the same transcript. To standardize the signal for downstream analyses, we extract the raw current signal segments corresponding to the poly(A) tail of each read. Since the poly(A) tail provides a stable reference, we standardize the raw current signals for each read, ensuring that the mean and standard deviation are consistent across the poly(A) tail region. This step is crucial for reducing variability between different reads and ensuring more accurate signal segmentation and modification prediction in subsequent steps. See Section 3 of Appendix 1 for more details.
Signal segmentation via hierarchical Hidden Markov model
Request a detailed protocolThe RNA translocation hypothesis (see details in the first section of Results) naturally leads to the use of a hierarchical Hidden Markov Model (HHMM) to segment the raw current signal. As shown in Figure 1B, the HHMM consists of two layers. The outer HMM divides the raw signal into alternating base and transition blocks, represented by hidden states ‘B’ (base) and ‘T’ (transition). Within each base block, the inner HMM models the current signal at a more granular level.
The inner HMM includes four hidden states: ‘prev’, ‘next’, ‘curr’, and ‘noise’. These correspond to the previous, next, and current 5-mer in the pore, while ‘noise’ refers to random fluctuations. Each raw current measurement is emitted from one of these hidden states, providing a detailed model of the signal within the base blocks. A linear model with a large absolute slope is used to represent sharp changes in the transition blocks.
To segment the signal, we first model the likelihood of the HHMM. Given the raw current signal of a read, the hidden states of the outer hidden HMM are denoted by . and are divided into blocks , where , correspond to kth block and , . Blocks with odd indices are base blocks, while those with even indices are transition blocks. The likelihood of the HHMM is given by
where is the transition matrix of the outer HMM and is the probability for the first hidden state. The first part of Equation 2 is emission probabilities, and the second part is the transition probabilities. It is not possible to directly compute the emission probabilities of the outer HMM (Equation 3) since there exist dependencies for the current signal measurements within a base or transition block. Therefore, we use the inner HMM and linear model (Equation 4) to handle the dependencies and approximate emission probabilities.
The inner HMM models transition between the ‘prev’, ‘next’, ‘curr’, and ‘noise’ states. For the ‘prev’, ‘next’, and ‘curr’ states, a Gaussian distribution is used to model the emission probabilities, while a uniform distribution models the ‘noise’ state. The forward-backward algorithm is used to compute the marginal likelihood of the inner HMM, which approximates the emission probabilities for base blocks. For transition blocks, a standard linear model computes the emission probabilities.
For any given , we can calculate the joint likelihood (Equation 1). We enumerate different configurations and select the one with the highest likelihood. This process segments the raw current signal into alternating base and transition blocks, where one or more base blocks may correspond to a single 5-mer. Each base block is characterized by its mean and standard deviation, which are used as input for downstream alignment tasks.
Due to the large size of fast5 files (which can reach terabytes), parameter inference for this model is computationally intensive. To address this, we developed a GPU-based inference algorithm that significantly accelerates the process. More details on this algorithm can be found in Appendix 2.
In summary, the HHMM allows us to accurately segment the raw current signal, providing estimates of the mean and variance for each base block, which are crucial for downstream analyses such as RNA modification prediction.
5-mer and 9-mer parameter table
Request a detailed protocolWe downloaded the k-mer models ‘r9.4_180 mv_70bps_5-mer_RNA’ from the ONT GitHub repository (here; Brennen, 2023) for RNA002 data. The columns labeled ‘level_mean’ and ‘level_stdv’ in these models were used as the mean and standard deviation values for the unmodified 5-mers. These values serve as the initial parameters in the 5-mer parameter table for SegPore, which we refer to as the ONT 5-mer parameter table. In the RNA004 data analysis, we obtained the 9-mer parameter table from the source code of f5c (version 1.5). Specifically, we used the array named ‘rna004_130bps_u_to_t_rna_9mer_template_model_builtin_data’ from the file here (accessed on 17 October 2025).
The initialization of the k-mer parameter table is a critical step in SegPore’s workflow. By leveraging ONT’s established k-mer models, we ensure that the initial estimates for unmodified k-mers are grounded in empirical data. These initial estimates are refined through SegPore’s iterative parameter estimation process, enabling the model to accurately differentiate between modified and unmodified k-mers during segmentation and modification prediction tasks. The refined k-mer parameters also provide a foundation for downstream analysis, such as alignment and m6A identification.
Alignment of raw signal segment with reference sequence
Request a detailed protocolAfter segmenting the raw current signal of a read into base and transition blocks using HHMM, we align the means of base blocks with the 5-mer list of the reference sequence.
The alignment process involves three main cases:
Base block matching with a 5-mer: In this case, a base block aligns with a 5-mer, which is modeled by a Gaussian distribution. The 5-mer may exist in either an unmodified or modified state, and the corresponding Gaussian parameters are retrieved from the 5-mer parameter table.
Base block matching with an insertion: Here, the base block aligns with an indel (‘-’), indicating an inserted nucleotide in the read.
Deletion in the read: In this case, an indel (0.0) matches with a 5-mer, representing a deleted nucleotide in the read.
The alignment score function models these matching cases as follows. For the first case (a base block matching a 5-mer), we calculate the probability that the base block mean is sampled from either the unmodified or modified 5-mer’s Gaussian distribution. The larger of the two probabilities is used as the match score. For the second and third cases, the alignment is treated as noise, and a fixed uniform distribution is used to calculate the match score.
An important distinction from classical global alignment algorithms (Needleman–Wunsch algorithm) is that one or multiple base blocks may align with a single 5-mer. Given the base block means and 5-mer list , we define the score matrix as (Figure 1C). The first row and column of the matrix are reserved for indels ("0.0" and "-"), representing insertions or deletions in the base blocks or 5-mers, respectively.
We denote the score function by . The recursion formula of the dynamic programming alignment algorithm is given by
where is the score function. It can be seen that we can still align with after we have aligned with , which fulfills the special consideration that one or multiple base blocks might be aligned with one 5-mer.
We implement two types of alignment algorithms (Figure 1C) based on the score matrix :
Full Alignment Algorithm: This algorithm aligns the full list of base block means with the full list of 5-mers, similar to classical global alignment. It traces back from the position in the score matrix.
Partial Alignment Algorithm: This aligns the full list of base blocks with the initial part of the 5-mer list, with no indels allowed in the base block means or the 5-mer list. The trace back starts from the maximum value in the last row of the score matrix.
A detailed description of both alignment algorithms is provided in Appendix 1. The output of the alignment algorithm is an alignment that pairs the base blocks with the 5-mers from the reference sequence for each read (Figure 1C). Base blocks aligned to the same 5-mer are concatenated into a single raw signal segment (referred to as an ‘event’), from which various features—such as start and end positions, mean current, and standard deviation—are extracted. A detailed derivation of the mean and standard deviation is provided in Section 5.3 in Appendix 1. In the remainder of this paper, we refer to these resulting events as the output of eventalign analysis, which also represents the final output of the segmentation and alignment task.
Modification prediction
Request a detailed protocolAfter obtaining the eventalign results, we estimate the modification state of each motif using the 5-mer parameter table. Specifically, for each 5-mer, we compare the probability that its mean is sampled from either the modified or unmodified 5-mer’s Gaussian distribution. If the probability under the modified 5-mer distribution is higher, the 5-mer is predicted to be in the modification state, and vice versa for the unmodified state.
To estimate the overall modification rate at a specific genomic location, we pool all reads that map to that location on the reference sequence. The modification rate is calculated as the proportion of reads that are predicted to be in the modification state at that location. A detailed description of the modification state prediction process can be found in Appendix 1.
GMM for 5-mer parameter table re-estimation
Request a detailed protocolTo improve alignment accuracy and enhance modification predictions, we use a Gaussian Mixture Model (GMM) to iteratively fine-tune the 5-mer parameter table. As illustrated in Figure 1A, the rows of the 5-mer parameter table represent the 5-mers, while the columns provide the mean and standard deviation for both the unmodified and modified states. We denote a 5-mer by , with its relevant parameters as , , , .
Using the alignment results from all reads, we collect the base block means aligned to the same 5-mer (denoted by ) across different reads and genomic locations, particularly those with high modification rates. A two-component GMM is then fit to these base blocks. In this process, the mean of the first component is fixed to . From the GMM, we update the parameters , , , , and , where , represent the weights for the unmodified and modified components, respectively. Afterward, we manually adjust the 5-mer parameter table using heuristics to ensure that the modified 5-mer distribution is significantly distinct from the unmodified distribution (see details in Section 7 of Appendix 1).
This re-estimation process is only performed on the training data. The initial 5-mer parameter table is based on the table provided by ONT. After each iteration of updating the table, we rerun the SegPore workflow from the alignment step onward. Typically, the process is repeated three to five times until the 5-mer parameter table stabilizes (when the average change of mean values of all 5-mers is less than 5e-3). Once a stabilized 5-mer parameter table is estimated from the training data, it is used for RNA modification estimation in the test data without further updates. A more detailed description of the GMM re-estimation process is provided in Section 6 of Appendix 1.
m6A site level benchmark
Request a detailed protocolThe HEK293T wild-type (WT) samples were downloaded from the ENA database (accession number PRJEB40872), while the HCT116 samples were obtained from ENA under accession PRJEB44348. The reference sequence (Homo_sapiens.GRCh38.cdna.ncrna_wtChrIs_modified.fa) was downloaded from https://doi.org/10.5281/zenodo.4587661. Ground truth data were sourced from Supplementary Data 1 of Pratanwanich et al., 2021. Fast5 files for the test dataset (mESC WT samples, mESCs_Mettl3_WT_fast5.tar.gz; Jenjaroenpun et al., 2021) were retrieved from the NCBI Sequence Read Archive (SRA) under accession SRP166020.
During training, we initialized the 5-mer parameter table using ONT’s data. The standard SegPore workflow was executed on the training data (HEK293T WT samples), using the full alignment algorithm. The 5-mer parameter table estimation was iterated five times. For mapping, reads were first aligned to the cDNA and subsequently converted to genomic locations using Ensembl GTF file (GRCh38, v9). The same 5-mer across different genomic locations was pooled together. A 5-mer was considered significantly modified if its read coverage exceeded 1500 and the distance between the means of the two Gaussian components in the GMM was greater than 5 picoamperes (pA). As a result, modification parameters were specified for ten significant 5-mers, as illustrated in Figure 3—figure supplement 1A.
With the estimated 5-mer parameter table from the training data, we then ran the SegPore workflow on the test data. The transcript sequences from GENCODE release version M18 were used as the reference sequence for mapping, with the corresponding GTF file (gencode.vM18.chr_patch_hapl_scaff.annotation.gtf) downloaded from GENCODE to convert transcript locations into genomic coordinates. It is important to note that the 5-mer parameter table was not re-estimated for the test data. Instead, modification states for each read were directly estimated using the fixed 5-mer parameter table. Due to the differences between human (Figure 3—figure supplement 1A) and mouse (Figure 3—figure supplement 1B), only six 5-mers were found to have m6A annotations in the test data’s ground truth (Figure 3—figure supplement 1C). For a genomic location to be identified as a true m6A modification site, it had to correspond to one of these six common 5-mers and have a read coverage of greater than 20. SegPore derived the ROC and PR curves for benchmarking based on the modification rate at each genomic location.
In the SegPore+m6Anet analysis, we fine-tuned the m6Anet model using SegPore’s eventalign results to demonstrate improved m6A identification. We started with the pre-trained m6Anet model (available at here; Hendra, 2025, model version: HCT116_RNA002) and fine-tuned it using the eventalign results from HCT116 samples. SegPore’s eventalign output provided the pairing between each genomic location and its corresponding raw signal segment, which allowed us to extract normalized features such as the normalized mean , standard deviation , dwell time (number of data points in the event). For genomic location , m6Anet extracts a feature vector , which was used as input for m6Anet. Feature vectors from a randomly selected 80% of the genomic locations were used for training, while the remaining 20% were set aside for validation. We ran 100 epochs during fine-tuning, selecting the model that performed best on the validation set.
Ground truth data and the performance of other methods (Tombo v1.5.1, Nanom6A v2.0, m6Anet v1.0, and Epinano v1.2.0) on the mESC dataset were provided through personal communications with Prof. Luo Guanzheng, the corresponding author of the benchmark study referenced (Zhong et al., 2023).
m6A single molecule level benchmark
Request a detailed protocolThe benchmark IVT data for single molecule m6A identification was downloaded from NCBI-SRA with accession number SRP166020. CHEUI was used for the benchmark. For the ground truth, every A in every read of the IVT_m6A sample is treated as a m6A modification and every A in every read of the IVT_normalA is treated as a normal A. Detailed methods for calculating modification probability at single molecule level were provided in Appendix 1, ‘Modification state estimation on single molecule level’ and ‘Modification probability estimation on single molecule level’.
Results
RNA translocation hypothesis
Accurate segmentation of raw current signals in direct RNA sequencing remains a major challenge, largely because the precise dynamics of RNA translocation through the pore are not fully understood. To address this, we propose a hypothesis that better reflects the physical movement of the RNA molecule. In traditional basecalling algorithms such as Guppy and Albacore, we implicitly assume that the RNA molecule is translocated through the pore by the motor protein in a monotonic fashion, that is the RNA is pulled through the pore unidirectionally. In the DNN training process of Guppy and Albacore, we try to align the current signal with the reference RNA sequence. The alignment is unidirectional, which is the source of the implicit monotonic translocating assumption.
Contrary to the conventional assumption of monotonic translocation, the raw current data suggests that the motor protein drives RNA both forward and backward during sequencing. Figure 2B illustrates this with several example fragments of DRS raw current signal (Pratanwanich et al., 2021), where each fragment roughly corresponds to three neighboring 5-mers. The raw current signals, as shown in Figure 2B, strongly support this hypothesis, with several instances of measured current intensities matching both the previous and next 5-mer’s baseline. These repeated patterns, observed across multiple DRS datasets, provide empirical evidence of the jiggling translocation. Similar patterns are widely observed across the whole data. This suggests that the RNA molecule may move forward and backward while passing through the pore. This observation is also supported by previous reports (Craig et al., 2017; Caldwell and Spies, 2017), in which the helicase (the motor protein) translocates the DNA strand through the nanopore in a back-and-forth manner. Depending on ATP or ADP binding, the motor protein may translocate the DNA/RNA forward or backward by 0.5~1 nucleotides.
RNA translocation hypothesis.
(A) Jiggling RNA translocation hypothesis. The top panel shows the raw current signal of Nanopore direct RNA sequencing, with gray areas representing SegPore-estimated transition blocks. We focus on three neighboring 5-mers, considering the central 5-mer (CTACG) as the current 5-mer. The RNA molecule may briefly move forward or backward during the translocation of the current 5-mer. If the RNA molecule is pulled backward, the previous 5-mer is placed in the pore, and the current signal (“prev” state, red dots) resembles the previous 5-mer’s baseline (mean and standard deviation highlighted by red lines and shades). If the RNA is pushed forward, the current signal (“next” state, blue dots) is similar to the next 5-mer’s baseline. (B) Example raw current signals supporting the jiggling hypothesis. The dashed rectangles highlight base blocks, with red and blue points representing measurements corresponding to the previous and next 5-mer, respectively. Red points align closely with the previous 5-mer’s baseline, and blue points match the next 5-mer’s baseline, reinforcing the hypothesis that the RNA molecule jiggles between neighboring 5-mers. The raw current signals were extracted from mESC WT samples of the training data in the m6A benchmark experiment.
Based on the reported kinetic model (Craig et al., 2017), we hypothesize that the RNA is translocated through the pore in a jiggling manner. This “jiggling” hypothesis presents a significant departure from the traditionally accepted view of unidirectional RNA translocation and aligns better with recent evidence from studies on DNA translocation (Craig et al., 2017; Caldwell and Spies, 2017). Incorporating this hypothesis into segmentation algorithms could lead to more accurate predictions of RNA modifications by accounting for the dynamic nature of translocation. On average, the motor protein sequentially translocates 5-mers on the RNA strand forward, and each 5-mer resides in the pore for a short time. During the short period of a single 5-mer, the motor protein may swiftly drive the RNA molecule forward and backward by 0.5~1 nucleotide in the translocation process of the current 5-mer (Figure 2A), which makes the measured current intensity occasionally similar to the previous or the next 5-mer. When the motor protein does not move the RNA molecule, we hypothesize that the RNA molecule undergoes slight thermal fluctuations, causing it to oscillate slightly within the pore and produce a stable current close to its baseline. In contrast, sharp changes in current intensity between consecutive 5-mers define transition blocks, where one 5-mer is replaced by the next.
We also assume that the raw current signal of a read can be segmented into a series of alternating base and transition blocks. In the ideal case, a base block corresponds to the base state where the 5-mer resides in the pore and jiggles between neighboring 5-mers, that is the current 5-mer can transiently jump to the previous or the next 5-mer. A transition block corresponds to the transition state between two consecutive base states where one 5-mer translocates to the next 5-mer in the pore. The current signal should be relatively flat in the base blocks, while a sharp change is expected in the transition blocks. One challenge we encountered was the overestimation of transition blocks. This may be due to a base block actually corresponding to a sub-state of a single 5-mer, rather than each base block corresponding to a full 5-mer, leading to inflated transition counts. To address this issue, SegPore’s alignment algorithm was refined to merge multiple base blocks (which may represent sub-states of the same 5-mer) into a single 5-mer, thereby facilitating further analysis.
SegPore Workflow
The SegPore workflow (Figure 1) consists of five key steps: (1) Preprocess fast5 files to pair raw current signal segments with corresponding RNA sequence fragments; (2) Segment each raw current signal using a hierarchical hidden Markov model (HHMM) into base and transition blocks; (3) Align the derived base blocks with the paired RNA sequence; (4) Fit a two-component GMM to estimate the modification state at the single-molecule level or the modification rate at the site level; (5) Use the results from Step (4) to update relevant parameters. Steps (3) to (5) are iterative and continue until the estimated parameters stabilize.
The final outputs of SegPore are the events and modification state predictions. SegPore’s events are similar to the outputs of Nanopolish’s ‘eventalign’ command, in that they pair raw current signal segments with the corresponding RNA reference 5-mers. Each 5-mer is associated with various features, such as start and end positions, mean current, and standard deviation, derived from the paired signal segment. For 5-mers that exhibit one clearly unmodified component and one clearly modified component, SegPore reports the modification rate at each site, as well as the modification state of that site on individual reads.
A key component of SegPore is the 5-mer parameter table, which specifies the mean and standard deviation for each 5-mer in both modified and unmodified states (Figure 1A). Since the peaks (representing modified and unmodified states) are separable for only a subset of 5-mers, SegPore can provide modification parameters for these specific 5-mers. For other 5-mers, modification state predictions are unavailable.
Segmentation benchmark
To evaluate SegPore’s performance in raw signal segmentation, we compared SegPore with Nanopolish (v0.14.0) and Tombo (v1.5.1) using three Nanopore direct RNA sequencing (DRS) datasets: two HEK293T datasets (wild type and Mettl3 knockout) (Pratanwanich et al., 2021) and the HCT116 dataset (Chen et al., 2025). Nanopolish and SegPore employed the ‘eventalign’ method to align 5-mers on each read with their corresponding raw signals, producing the mean and standard deviation (std) of the aligned signal segments. Tombo used the ‘resquiggle’ method to segment the raw signals, but the resulting signals are not reported on the absolute pA scale. To ensure a fair comparison with SegPore, we standardized the segments using the poly(A) tail in the same way as SegPore (See preprocessing section in Materials and methods).
To benchmark segmentation performance, we used two key metrics (details provided in Appendix 1, Section 8): (1) the log-likelihood of the segment mean, which measures how closely the segment matches ONT’s 5-mer parameter table (used as ground truth), and (2) the standard deviation (std) of the segment, where a lower std indicates reduced noise and better segmentation quality. If the raw signal segment aligns correctly with the corresponding 5-mer, its mean should closely match ONT’s reference, yielding a high log-likelihood. A lower std of the segment reflects less noise and better performance overall.
As shown in Table 1, SegPore consistently achieved the best performance averaged on all 5-mers across all datasets, with the highest log-likelihood and the lowest std values. These results suggest that SegPore provides a more accurate segmentation of the raw signal with significantly reduced noise compared to Nanopolish and Tombo. It is worth noting that the data points corresponding to the transition state between two consecutive 5-mers are not included in the calculation of the standard deviation in SegPore’s results in Table 1. However, their exclusion does not affect the overall conclusion. Because SegPore contains on average ~6 points per event within the transition state, we similarly removed the first and last three data points of each event for Nanopolish and Tombo prior to recalculating the metrics. The updated results are presented in Table 2.
Segmentation benchmark on RNA002 data.
| TestDataset | Avg. std () ↓ | Avg. log p () ↑ | ||||
|---|---|---|---|---|---|---|
| Nanopolish | Tombo | SegPore | Nanopolish | Tombo | SegPore | |
| HEK293T(WT) | 3.073 | 4.187 | 2.736 | –2.871 | –3.749 | –2.778 |
| HEK293T(KO) | 2.948 | 4.204 | 2.670 | –2.856 | –3.749 | –2.745 |
| HCT116 | 3.167 | 4.076 | 2.872 | –2.872 | –3.704 | –2.746 |
Segmentation benchmark on the RNA002 data excluding boundaries.
| TestDataset | Avg. std () ↓ | Avg. log p () ↑ | ||||
|---|---|---|---|---|---|---|
| Nanopolish | Tombo | SegPore | Nanopolish | Tombo | SegPore | |
| HEK293T(WT) | 2.862 | 4.090 | 2.736 | –2.838 | –3.772 | –2.778 |
| HEK293T(KO) | 2.856 | 4.120 | 2.670 | –2.878 | –3.729 | –2.745 |
| HCT116 | 3.058 | 4.053 | 2.872 | –2.869 | –3.708 | –2.746 |
To provide a more intuitive comparison, the segmentation results for two example raw signal clips are illustrated in Figure 3—figure supplement 2. These examples demonstrate the clearer, more precise segmentation achieved by SegPore compared to Nanopolish.
To evaluate SegPore’s performance on RNA004 data, we compared it with f5c (v1.5) (Gamaarachchi et al., 2020) and Uncalled4 (Kovaka et al., 2025) (v4.1.0) using three public DRS datasets: the S. cerevisiae dataset (Watson et al., 2025), the curlcake IVT and m6A datasets (Cruciani et al., 2025). The RNA002 data provides reference current levels for 5-mers, whereas RNA004 provides reference values for 9-mers, with Uncalled4 normalizing them to approximately zero mean and unit variance. As there are currently no established poly(A) detection methods available for RNA004, we used f5c to standardize the raw signals of each read prior to segmentation. SegPore was then applied to perform segmentation on the standardized signals. We computed the same benchmarking metrics—average log-likelihood and standard deviation—using the standardized raw signals and the segmentation results from f5c, Uncalled4, and SegPore. The 9-mer parameter table in pA scale for RNA004 data provided by f5c (see Materials and methods) was used in the analysis. As shown in Table 3, SegPore achieved the best overall performance across all datasets, indicating its robustness and suitability for RNA004 data. Moreover, we find that the jiggling hypothesis remains valid for RNA004, as illustrated in Figure 2—figure supplement 1.
Segmentation benchmark on RNA004 data.
| TestDataset | Avg. std () ↓ | Avg. log p () ↑ | ||||
|---|---|---|---|---|---|---|
| f5c | Uncalled4 | SegPore | f5c | Uncalled4 | SegPore | |
| S. cerevisiae | 3.129 | 3.970 | 3.125 | –2.541 | –3.835 | –2.489 |
| curlcake(IVT) | 3.424 | 4.729 | 3.359 | –2.716 | –2.904 | –2.515 |
| curlcake(m6A) | 3.520 | 4.971 | 3.392 | –3.118 | –3.524 | –2.599 |
m6A identification at the site level
We evaluated SegPore’s performance in raw signal segmentation and m6A identification using independent public datasets as both training and test data. Since m6A modifications typically occur at DRACH motifs (where D denotes A, G, or U, and H denotes A, C, or U) (Linder et al., 2015), this study focuses on estimating m6A modifications on these motifs.
To begin, we estimated the 5-mer parameter table for m6A modifications using Nanopore direct RNA sequencing (DRS) data from three wild-type human HEK293T cell samples (Pratanwanich et al., 2021). The fast5 files from all samples were concatenated, and the full SegPore workflow was run to obtain the 5-mer parameter table. The parameter estimation process was iterated five times to ensure stabilization, refining the modification parameters for ten 5-mers where the modification state distribution significantly differs from the unmodified state.
Next, we applied SegPore’s segmentation and m6A identification to test data from wild-type mouse embryonic stem cells (mESCs; Zhong et al., 2023). Given the comparable methods and input data requirements, we benchmarked SegPore against several baseline tools, including Tombo, MINES (Lorenz et al., 2020), Nanom6A (Gao et al., 2021), m6Anet, Epinano (Liu et al., 2019), and CHEUI (Acera Mateos et al., 2024). By default, MINES and Nanom6A use eventalign results generated by Tombo, while m6Anet, Epinano, and CHEUI rely on eventalign results produced by Nanopolish. In Figure 3C, ‘Nanopolish +m6Anet” refers to the default m6Anet pipeline, whereas ‘SegPore +m6Anet’ denotes a configuration in which Nanopolish’s eventalign results are replaced with those from SegPore. Based on the output of SegPore eventalign, we fine-tune m6Anet using the HCT116 data at all DRACH motifs, aiming to demonstrate that the performance of SegPore benefits downstream models. Additionally, due to the differences in the availability of ground truth labels for specific k-mer motifs between human and mouse (Figure 3—figure supplement 1), six shared 5-mers were selected to demonstrate SegPore’s performance in modification prediction directly. By utilizing the 5-mer parameter table derived from the training data, SegPore employs a two-component GMM to calculate the modification rates at the selected m6A sites.
m6A identification at the site level.
(A) Histogram of the estimated mean from current signals mapped to an example m6A-modified genomic location (chr10:128548315, GGACT) across all reads in the training data, comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (B) Histogram of the estimated mean from current signals mapped to the GGACT motif at all annotated m6A-modified genomic locations in the training data, again comparing Nanopolish (left) and SegPore (right). The x-axis represents current in picoamperes (pA). (C) Site-level benchmark results for m6A identification across all DRACH motifs, showing performance comparisons between SegPore+m6Anet and other methods. (D) Benchmark results for m6A identification on six selected motifs at the site level, comparing SegPore and other baseline methods.
-
Figure 3—source data 1
site_mod_rate_mES_WT_all_motifs.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-data1-v1.csv
-
Figure 3—source data 2
site_mod_rate_mES_WT_selected_motifs.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig3-data2-v1.csv
SegPore demonstrated improved segmentation compared to Nanopolish. Figure 3A shows the eventalign results at an example genomic location with m6A modifications. SegPore’s results show a more pronounced bimodal distribution in the raw signal segment mean, indicating clearer separation of modified and unmodified signals. Furthermore, when pooling all reads mapped to m6A-modified locations at the GGACT motif, SegPore exhibited more distinct peaks (Figure 3B), indicating reduced noise and potentially enabling more reliable modification detection.
We evaluated m6A predictions using two approaches: (1) SegPore’s segmentation results were fed into m6Anet, referred to as SegPore+m6Anet, which works for all DRACH motifs and (2) direct m6A predictions from SegPore’s Gaussian Mixture Model (GMM), which is limited to the six selected 5-mers shown in Figure 3—figure supplement 1C that exhibit clearly separable modified and unmodified components in the GMM (see Materials and methods for details).
In terms of m6A identification, SegPore performed strongly on the test data. Using miCLIP2 (Körtel et al., 2021) data as the ground truth, we calculated the area under the curve (AUC) for both the receiver operating characteristic (ROC) and precision-recall (PR) curves. SegPore+m6Anet achieved the best performance with an ROC AUC of 89.0% and a PR AUC of 43.8% (Figure 3C). For six selected m6A motifs, SegPore achieved an ROC AUC of 82.7% and a PR AUC of 38.7%, earning the third best performance compared with deep learning methods m6Anet and CHEUI (Figure 3D). It is noteworthy that SegPore’s GMM for m6A estimation is a very simple model, utilizing far fewer parameters than DNN-based methods. Achieving a decent performance with such a simple model is a significant accomplishment. These results highlight SegPore’s robust performance in m6A identification. For practical applications, we recommend taking the intersection of m6A sites predicted by SegPore and m6Anet to obtain high-confidence modification sites, while still benefiting from the interpretability provided by SegPore’s predictions.
m6A identification at the single molecule level
SegPore naturally identifies m6A modifications at the single-molecule level, which is crucial for understanding the heterogeneity of RNA modifications across individual transcripts. We benchmarked SegPore against CHEUI using an in vitro transcription (IVT) dataset containing two samples—one transcribed with m6A and the other with adenine (Jenjaroenpun et al., 2021). This dataset provides clear ground truth for m6A modifications at the single-molecule level, with all adenosine positions replaced by m6A in the ivt_m6A sample, and all adenosines unmodified in the ivt_normalA sample. We used 60% of the data for training and 40% for testing, with both SegPore and CHEUI estimating the m6A modification probability at each adenosine site on each read. Based on these probabilities, we calculated the ROC-AUC and PR-AUC by varying the modification probability threshold.
As shown in Figure 4A, SegPore outperformed CHEUI on this benchmark dataset, achieving better performance in both PR-AUC (94.7% vs 90.3%) and ROC-AUC (93% vs 89.9%). These results clearly demonstrate SegPore’s accuracy and robustness in detecting single-molecule m6A modifications.
m6A identification at the single-molecule level.
(A) Benchmark results for single-molecule m6A identification on IVT data. SegPore shows better performance compared to CHEUI in both PR-AUC and ROC-AUC. (B) Comparison of ‘eventalign’ results from SegPore and Nanopolish for five consecutive k-mers. Note that DRS is sequenced from 3’ to 5’, so the k-mers enter the pore from right to left. A total of 100 reads were randomly sampled from transcript locations A1 (positions 711–719) in both the IVT_normalA and IVT_m6A samples (SRA: SRP166020). Each line represents an individual read, and the y-axis shows the raw signal intensity in picoampere (pA). Pink lines represent the IVT_m6A sample, and gray lines represent the IVT_normalA sample. The k-mers ‘GCGGA’, ‘CGGAC’, ‘GGACT’, ‘GACTT’, and ‘ACTTT’ all contain N6-Methyladenosine (m6A) in the IVT_m6A sample. SegPore’s results show clearer separation between m6A and adenosine, especially for ‘CGGAC’ and ‘GGACT’, compared to Nanopolish. (C) The upper panel shows the modification rate for selected genomic locations in the example gene ENSMUSG00000003153. The lower panel displays the modification states of all reads mapped to this gene. The black borders in the heatmap highlight the biclustering results, showing distinct modification patterns across different read clusters labeled C1 through C6.
-
Figure 4—source data 1
ENSMUSG00000003153.10.info.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig4-data1-v1.csv
-
Figure 4—source data 2
ENSMUSG00000003153.10.reads.
- https://cdn.elifesciences.org/articles/104618/elife-104618-fig4-data2-v1.csv
Next, we demonstrate SegPore’s interpretability through an example comparison of raw signal clips from both ivt_m6A and ivt_normalA samples (Figure 4B). The raw signal segments (means) are aligned to a randomly selected m6A site as well as its neighboring sites in the reference sequence. Each line represents an individual read, with pink lines from the ivt_m6A sample and gray lines from the ivt_normalA sample. SegPore’s segmentation clearly distinguishes between m6A and adenosine at the single-molecule level, while Nanopolish’s segmentation shows less distinction.
Interestingly, we observed that the position of m6A within a 5-mer can affect the signal intensity. For instance, a clear difference between m6A and adenosine is evident when m6A occupies the fourth position in the 5-mer (e.g. ‘CGGAC’), but this difference is less pronounced when m6A is in the second position (e.g. ‘GACTT’).
To illustrate the benefits of single-molecule m6A estimation, we present an example gene (ENSMUSG00000003153) from the mESC dataset used in site-level m6A identification. Figure 4C shows a heatmap of highly modified genomic locations (modification rate >0.1), where rows represent reads and columns represent genomic locations. Biclustering reveals six clusters of reads and three clusters of genomic locations.
The heatmap in Figure 4C suggests heterogeneity in m6A modification patterns across different reads of the same gene. Biclustering reveals that modifications at g6 are specific to cluster C4, g7 to cluster C5, and g8 to cluster C6, while the first five genomic locations (g1 to g5) show similar modification patterns across all reads. Additionally, high modification rates are observed at the 3rd and 5th positions across the majority of reads. These results suggest that m6A modification patterns can vary significantly even within a single gene. This observation highlights the complexity of RNA modification regulation and underscores the importance of single-molecule resolution for understanding RNA function at a finer scale.
Discussion
One of the main computational challenges in direct RNA sequencing (DRS) lies in accurately segmenting the raw current signal. We developed a segmentation algorithm that models the jiggling property in the physical process of DRS, resulting in cleaner current signals for m6A identification at both the site and single-molecule levels. Our results demonstrate that SegPore’s segmentation enables clear differentiation between m6A-modified and unmodified adenosines. We believe that the de-noised current signals will be beneficial for other downstream tasks, such as the estimation of m5C, pseudouridine, and other RNA modifications. Nevertheless, several open questions remain for future research. In SegPore, we assume a drastic change between two consecutive 5-mers, which may hold for 5-mers with large difference in their current baselines but may not hold for those with small difference. As with other RNA modification estimation methods, SegPore can be affected by misalignment errors, particularly when the baseline signals of adjacent k-mers are similar. These cases may lead to spurious bimodal signal distributions and require careful interpretation. Another key question concerns the physical interpretation of the derived base blocks. Ideally, one base block would correspond to a single 5-mer, but in practice, multiple base blocks often align with one 5-mer. We hypothesize that the HHMM may segment a 5-mer’s current signal into multiple base blocks, where the 5-mer oscillates between different sub-states, each characterized by distinct baselines.
Currently, SegPore models only the modification state of the central nucleotide within the 5-mer. However, modifications at other positions may also affect the signal, as shown in Figure 4B. Therefore, introducing multiple states for a 5-mer to account for modifications at multiple positions within the same 5-mer could help to improve the performance of the model. This approach, however, would significantly increase model complexity—introducing two states per location results in modification states per 5-mer, a challenge for simple GMMs. It remains to be investigated how to model multiple modification states properly to improve the performance in RNA modification estimation. While the current two-state assumption simplifies the model, it has proven effective, as demonstrated by improved performance on m6A benchmarks at both site and single-molecule levels.
A key advantage of SegPore is its interpretability, which sets it apart from DNN-based methods. SegPore provides a clearer understanding of RNA modification predictions. A limitation of DNN-based approaches is that they often struggle to differentiate between m6A and adenosine signals, as their intensity levels are quite similar (Figure 4B, Figure 3—figure supplement 3). This poses a challenge when applying DNN-based methods to new datasets without short read sequencing-based ground truth. In such cases, it is difficult for researchers to confidently determine whether a predicted m6A modification is genuine. SegPore encodes current intensity levels for different 5-mers in a parameter table, where unmodified and modified 5-mers are modeled using two Gaussian distributions. One can generally observe a clear difference in the intensity levels between 5-mers with an m6A and those with adenosine, which makes it easier for a researcher to interpret whether a predicted m6A site is genuine (see Figure 4—figure supplement 1). A challenge is the relatively small number of 5-mers that show significant changes in their modification states. To improve accuracy, larger training datasets and expanding the scope to 7-mers or 9-mers could help capture more context, potentially revealing more significant baseline changes.
The limited improvement of SegPore combined with m6Anet over Nanopolish+m6Anet in bulk in vivo analysis (Figure 3) may be explained by several factors: potential alignment inaccuracies due to pseudogenes or transcript isoforms, the complexity of in vivo datasets containing additional RNA modifications (e.g. m5C, m7G) affecting signal baselines, and the fact that m6Anet is specifically trained on events produced by Nanopolish rather than SegPore. Additionally, the lack of a modification-free control (in vitro transcribed) sample in the benchmark dataset makes it difficult to establish true baselines for each k-mer. Despite these limitations, SegPore demonstrates clear improvement in single-molecule m6A identification in IVT data (Figure 4), suggesting it is particularly well suited for in vitro transcription data analysis.
Although SegPore provides clear interpretability, there is potential to explore DNN-based models that can directly leverage SegPore’s segmentation results. Currently, m6Anet computes features (e.g., mean and standard deviation) from raw signal segments, which are then fed into a neural network for m6A prediction. However, a more direct approach—where raw signal segments are used as input to a DNN—could allow for the extraction of more intricate features that may exist within the signal but are currently underexplored. These high-order features might capture subtle aspects of the raw signal, leading to improved m6A estimation. Since SegPore currently models only the mean and standard deviation, further work involving advanced DNNs could extend beyond this, uncovering finer patterns in the signal that traditional statistical models might miss.
Computation speed is also a concern when processing fast5 files. We addressed this by implementing a GPU-accelerated inference algorithm in SegPore, resulting in a 10- to 20-fold speedup compared to the CPU-based version. We believe that the GPU implementation will unlock the full potential of SegPore for a wider range of downstream tasks and larger datasets. More details are provided in Appendix 2. SegPore’s running times on datasets of varying sizes, using a single NVIDIA DGX-1 V100 GPU and one CPU core, are provided in Figure 4—figure supplement 2.
In summary, we developed a novel software SegPore that considered the conformation changes of motor protein to segment raw current signal of Nanopore direct RNA sequencing. SegPore effectively masks out noise in the raw signal, leading to improved m6A identification at both site and single-molecule levels.
Appendix 1
Notation
| Take the length of the inside vector. | |
| All reads in the input multi-fast5 file. . are the raw current measurements of ith read. Assuming there are reads, i.e. . | |
| The current signal of ith read. where is the length of i.e. . | |
| The kth mapped current signal fragment of ith read , given by eventalign using Nanopolish. | |
| Reference sequences for each read. . | |
| The reference sequence corresponding to ith read , given by basecalling and minimap2. , where and is the length of i.e. . | |
| The kth reference sequence fragment of corresponding to . Note that is the larger than in general. | |
| The 5mer list corresponds to . We have , where and . Note that . | |
| The k-mer list fragment corresponds to the kth reference sequence fragment . . | |
| The estimated states of all reads given by the Hierarchical Hidden Markov Model (HHMM). . | |
| The hidden states for each current measure of ith read. , where corresponds to and , where means the transition state and means base states. | |
| The kth fragment in , which corresponds to . Note that . | |
| The mean values of base segments given by HHMM, which are estimated from and . , where indexes the base segments of ith read. In general, we have . | |
| The kth fragment of which corresponds to and . . In general we have . | |
| The modification state of each 5mer in . , where indicates if is in unmodified or modified state. . | |
| The reference 5mer parameter table provided by Oxford Nanopore Technology (ONT). . | |
| The parameter table for all 5mers. . |
Introduction
Given the parameter table for all -mers, the overall workflow of SegPore consists of the following major steps: (1) Base-calling and mapping, (2) preprocessing of nanopore raw signal, (3) segmenting nanopore raw signal using hierarchical hidden Markov model (HHMM), and (4) aligning raw signal segments with corresponding -mer list. The input of Segpore is the raw current signal of Nanopore direct RNA sequencing. The output of the Segpore workflow is the pairing of raw signal segments with corresponding -mers with inferred state (unmodified state or modified state). Unless otherwise noted, the following analysis focuses on RNA002 data, using 5-mers rather than -mers.
As illustrated in Appendix 1—figure 1, we denote the input FAST5/POD5 file as a set of raw current signal reads . We use to denote the number of reads and to denote the number of measurements / data points of . We then perform basecalling using Guppy and map basecalled sequence to the reference sequences (‘Basecalling, mapping and preprocessing’), yielding a pairing between the ith read and its corresponding reference sequence . We assume there are matched fragments . For the jth fragment , we use HHMM (‘Hierarchical hidden Markov model’) to estimate the hidden states for each measurement , where and . Based on and , we can convert it into a list of low variation segments, with their means denoted by . As a result, the number of measurements in is larger than the number of segments in , that is . After that, we perform full or partial alignment between the raw signal segments and reference sequence , which is converted into a list of 5-mers and . Based on and , we perform full or partial alignment (‘Alignment of signal segments with reference sequence’) to match each 5-mer with one or multiple current signal segments, as well as estimating its modification state , where .
SegPore workflow with a fixed 5-mer parameter table .
The input is the raw current signal and reference sequence for ith read. The output is the paired current signal segments µi, 5-mer list and modification states .
However, is not fully known beforehand (Appendix 1—table 1), especially the parameters for the modification states of 5-mers. We have to estimate the parameters from the training data through an iterative process (‘Estimation of k-mer parameters table’). After obtaining the estimated , we derive the modification states from new test data using SegPore workflow with fixed .
Example of Tkmer (1024×4).
‘-’ represents ‘NA’.
| modle_kmer | un_mu | un_sigma | mod_mu | mod_sigma |
|---|---|---|---|---|
| AAAAA | 108.9 | 2.68 | - | - |
| AAAAC | 107.75 | 2.68 | - | - |
| AAAAG | 101.72 | 2.68 | - | - |
| AAAAT | 112.77 | 2.68 | - | - |
| AAACA | 99.38 | 2.41 | 95.28 | 2.44 |
| AAACC | 100 | 1.32 | 96.82 | 2.42 |
| AAACG | 101.01 | 3.45 | - | - |
| AAACT | 106.91 | 2.18 | 101.98 | 2.29 |
| AAAGA | 110.54 | 4.06 | - | - |
| AAAGC | 107.69 | 4.06 | - | - |
| AAAGG | 108.29 | 4.06 | - | - |
| AAAGT | 108.73 | 4.06 | - | - |
| AAATA | 114.11 | 3.11 | - | - |
| ... | ... | ... | ... | ... |
| TTTTT | 80.78 | 1.97 | - | - |
Basecalling, mapping, and preprocessing
This section describes the details about the basecalling and mapping process, as well as raw data preprocessing for standardizing the raw signal of all reads.
First, we obtain the matching between raw current signals and reference sequences. The raw multi-fast5 reads (input file) were basecalled using Guppy (v6.0.1), which results in a FASTQ file. Then we align the yielded FASTQ file to the reference sequences (we use cDNA reference) using minimap2 (v2.24-r1122)(Li, 2021), which provides a bam file that matches the basecalled reads with the reference sequence. After that, we feed the raw fast5 file, basecalled FASTQ file, bam file, and reference sequence to Nanopolish (v0.13.2) to get (1) matching fragments between the raw current signal and the reference sequence (as well as ) and (2) the current signal fragment for poly(A) part of the mRNA molecule.
Then, we standardize the raw current signal based on the poly(A) tail. Due to technical reasons, there exist variations in the current signal of the same sequence fragment across different reads. It is necessary to perform a standardization. Since all reads have a poly(A) tail in Nanopore direct RNA sequencing, we could standardize the reads such that the mean and std of poly(A) part is the same for all reads. The standardization is given by
where and (RNA002). Note that refers to the standardized signal throughout this document.
Hierarchical hidden Markov model
This section describes the hierarchical hidden Markov model (HHMM) to segment the raw current signal of a read. The input of HHMM is and the output is . For notation simplicity, we use and to represent and in this section, respectively.
Model assumptions
Here we have several assumptions regarding the raw signal. We first assume there is a sharp change of the baseline levels of the current signal between two consecutive 5-mers. This means there is a large change of current signal within a short period of time. If we fit a line to the corresponding signals, we expect to see a large slope. The second assumption is that the 5-mer resides in the Nanopore, oscillates forward and backward during a 5-mer event, while staying in the pore most of the time. Based on these two assumptions, we can divide the raw current signal into two states (hidden states of the outer HMM): the base states for 5-mer event and the transition state for the transition between two consecutive 5-mer events. We assume that the current signal always starts with a base state, then switches to a transition state, after that switches back to a base state or transition state alternately, and in the end, the signal should end in a base state. For the base state, we use an inner HMM to model the oscillations, which has four hidden states: ‘curr’, ‘prev’, ‘next’, and ‘noise’. For the transition state, we use a linear regression model to model the abrupt change of the current signal.
Outer HMM
We define the following notations for the HHMM. As illustrated in Appendix 1—figure 2, we assume there are time points of the current signal, that is . We use , to indicate the hidden states of the outer HMM, where and stand for ‘base’ and ‘transition’, respectively. Similarly, we use , , respectively. We then integrate and into a single vector , where . Here and (‘curr’→1, ‘prev’→2, ‘next’→3, ‘noise’→4) are the numerical mappings of and . Based on , we can derive that is divided into blocks, which is defined by , . Note that and . As a result, we could also represent , where represent the ith block. The base blocks are given by , where and . The transition blocks are given by , where and . Similarly, we have and .
Example of HHMM where , and .
In this figure, ‘c’, ‘p’, ‘n’, and ‘o’ represent ‘curr’, ‘prev’, ‘next’, and ‘noise’ respectively in . .
We use to denote the transition probabilities between the base state and the transition state in the outer HMM. We use to denote the transition probabilities between four hidden states of the inner HMM. For base blocks, we have inner HMM parameters and . For transition blocks, we have linear coefficients , and noise std .
The joint likelihood of the HHMM is given by
where is the initial probability of hidden state of the (always ‘B’ state). However, it is not possible to directly derive , since are not independent given . Here we use different models for each block to calculate the joint probability to approximate . Therefore, we have
The first part is the marginal likelihood for base blocks, while the second part is the marginal likelihood for transition blocks. The specific likelihood formulas are given in the following sections.
For the outer HMM, the transition probabilities are pre-specified and the emission probabilities are obtained from the inner HMM (‘Inner HMM for base state’) and linear regression model (‘Linear model for transition state’). The initial probability for the first hidden state is set to 1, since the first state should always be “B”.
Inner HMM for base state
This section provides the derivations for calculating the likelihood of the ith base block , where . We assume the data is generated by an HMM with four states: ‘curr’, ‘prev’, ‘next’, and ‘noise’. The descriptions of the states are given by Appendix 1—table 2, where parameters for the noise state are given and parameters for other states need to be estimated from the data.
Hidden states of inner HMM for the th block.
| State | Event | Emission distribution | Parameter estimation |
|---|---|---|---|
| ‘curr’ | current 5-mer resides in pore | Yes | |
| ‘prev’ | previous 5-mer resides in pore | Yes | |
| ‘next’ | next 5-mer resides in pore | Yes | |
| ‘noise’ | unknown noise | No |
The likelihood for the kth block is given by
where is the probability of the first hidden state and is the indicator function
For the inner HMM, the transition probabilities are pre-specified and the initial probability for the first hidden state is set to 0.25. Other parameters (see Appendix 1—table 2) need to be estimated.
By comparing the emission probabilities of the outer HMM (Equation 3) and the marginal likelihood of the inner HMM (Equation 4), we have
To approximate , we assume all values of are the same for any given , we have
The approximated emission probability of each data point in the ith base block is given by
where can be obtained by our inference algorithm.
Note that the hidden states of the inner HMM can be provided as a by-product of our inference algorithm. We will use it to derive the final composite hidden state of the HHMM (‘Parameter inference’), which is utilized by downstream alignment tasks (‘Post-processing of alignment path’).
Linear model for transition state
This section provides the derivations for calculating the likelihood of the ith transition block , where . We denote the input data by and its corresponding index by . We assume a linear model for data points in the transition state, given by
where is the slope, is the intercept and is the Gaussian noise.
The likelihood of the linear model is given by
Similar to the inner HMM, the approximated emission probability for each data point in ith transition block is given by
Parameter Inference
Due to the complexity of GPU-accelerated inference algorithm, the details are provided in Appendix 2.
The inference algorithm will infer the following parameters: (1) the hidden states of the outer HMM; (2) the hidden states of the inner HMM ; (3) relevant parameters for emissions (Appendix 1—table 2) of each current signal fragment of the th block (base block) , where and ; and (4) the linear coefficients and noise std of the block (transition block) , where and .
We then integrate and into a single vector , where and are the numerical mappings of . is the final output of HHMM.
Alignment of signal segments with reference sequence
This section describes the alignment algorithms for aligning an input fragment of signal segments and the corresponding 5-mer list . After post-processing of the alignment path, we could get the modification states . For notation simplicity, we drop the indices and denote , and by μ, and , respectively. In general, one 5-mer is aligned with signal segments , where .
Here we discuss different scenarios of aligning with . The first scenario (scenario 1) is that is generated by unmodified 5-mer , for which we could compute the likelihood using the Normal distribution defined by . The second scenario (scenario 2) is the is generated by modified 5-mer , that is , where . The third scenario (scenario 3) is that is generated by an unknown nucleotide insertion event, whose probability is given by a pre-defined uniform distribution . The fourth scenario (scenario 4) is that the 5-mer represents a deletion event on the reference sequence, whose probability is given by the same uniform distribution . We define the scoring function as
We can modify the classical global alignment algorithm to align μ with . The log probability of aligning with could be used as the matching score (Equation 11) in the standard global alignment algorithm. Since one 5-mer could be aligned with multiple signal segments , we need to modify the recursion such that could still be aligned with even if is already aligned with . Therefore, we have the following recursions in our alignment algorithm (Appendix 1—figure 3a). Here we denote the scoring matrix by () and the traceback matrix by , where and .
where both and are aligned with when . This is the only difference between our recursion and the recursion of the standard global alignment algorithm.
Based on the proposed recursion, we developed two alignment algorithms designed for different problems: the full alignment algorithm (‘Full alignment algorithm’) allows insertions and deletions in the reference sequence, while the partial alignment algorithm (‘Partial alignment algorithm’) does not allow any insertions or deletions in the partial reference sequence. The full alignment algorithm tries to align all the current signal segments to the whole reference sequence, while the partial alignment algorithm tries to align all the current signal segments to a consecutive sub-sequence of the 5-mer list from the beginning.
The output of the alignment algorithms is a matching path between the current signal segments μ and the reference 5-mer list . We denote it by , where and . The length of the alignment path is given by . are the indexes of the current signal segment μ in the alignment path , while are the indexes of the 5-mer list . Note that there are duplicated values in and to allow one 5-mer to match with multiple current signal segments. Appendix 1—figure 3b provides an illustration of the alignment result.
To obtain the modification states , we will first process the optimal alignment path into (‘Post-processing of alignment path’), from which we can derive the modification states (‘Modification state estimation on single molecule level’).
Full alignment algorithm
The full alignment algorithm is used when the read may possess differences (insertions or deletions) compared with the reference sequence, which is the general case. The algorithm is described in Appendix 1–algorithm 1.
| Appendix 1–algorithm 1. Full alignment algorithm. |
|---|
| Input: μ, s, Tkmer Output: Alignment path // Add special symbol before μ and s μ ={0.0, μ}, s ={“–”, s} // Get the length of and μ and s m = |μ|, n = |s| // Initialize the score matrix M and traceback matrix T Initialize a m × n matrix M, with all elements set to 0.0 Initialize a m × n matrix T, with all elements set to 0 // Initialize the first row of M and T for i ∈ {1, 2, … , n-1} do M(0, i) ← M(0, i - 1) + f (0.0, si) T(0, i)←2 end // Intialize the first column of M and T for i ∈ {1, 2, · · · , m − 1} do M(i, 0) ← M(i − 1, 0) + f(µi , “ − ”) T(i, 0) ← 3 end // Loop for all matrix for i ∈ {1, 2,···,m − 1} do for j ∈ {1, 2,···, n − 1} do v1 ← M(i − 1, j − 1) + f(μi, sj) v2 ← M(i, j − 1) + f(0.0, sj) v3 ← M(i − 1, j) + f(μi, “ − ”) v4 ← M(i − 1, j) + f(μi, sj) M(i, j) ← max{v1, v2, v3, v4} T(i, j) ← argmax{v1, v2, v3, v4} # index of maximum end end // Traceback Traceback from position (m − 1, n − 1) to position (0, 0) using T to generate the optimal alignment path γ. |
Partial alignment algorithm
The partial alignment algorithm is used when the read must agree with the reference sequence (no insertions or deletions). If the sequencing data is generated by in vitro transcription from a single template, all reads should agree with the reference sequence. However, the read may only match the reference sequence from the start to a middle point due to the break of the RNA molecule in the reverse transcription.
The partial alignment algorithm is a modification of the full alignment algorithm by considering this special case. Since the alignment may not be full, we traceback from the maximum value in the last row of the scoring matrix μ. This means all current signal segments must be aligned, while only the first part of the 5-mer list needs to be aligned. Appendix 1—figure 3c provides an illustration of the alignment result.
The partial alignment algorithm is described in Algorithm 1–algorithm 2. We can see that scenarios 3 and 4 have been removed from the algorithm compared with the full alignment algorithm, that is . In other words, there should be no insertions or deletions in the 5-mer list .
| Appendix 1–algorithm 2. Partial alignment algorithm. |
|---|
| Input: μ, s, Tkmer Output: Alignment path γ // Add special symbol before μ and s μ = {0.0, μ}, s = {“ − ”, s} // Get the length of μ and s m = |μ|, n = |s| // Initialize the score matrix M and traceback matrix T Initialize a m × n matrix M, with all elements set to 0.0 Initialize a m × n matrix T, with all elements set to 0 // Initialize the first row of M and T for i ∈ {1, 2,···, n − 1} do M(0, i) ← M(0, i − 1) + f(0.0, si) T(0, i) ← 2 end // Initialize the first column of M and T for i ∈ {1, 2,···,m − 1} do M(i, 0) ← M(i − 1, 0) + f(µi , “ − ”) T(i, 0) ← 3 end // Loop for all matrix for i ∈ {1, 2, · · · ,m − 1} do for j ∈ {1, 2, · · · , n − 1} do v1 ← M(i − 1, j − 1) + f(μi, sj) v4 ← M(i − 1, j) + f(μi, sj) M(i, j) ← max{v1, v4} if v1 ≥ v4 then T(i, j) ← 1; else T(i, j) ← 4; end end end // Traceback Find the maximum value M(m − 1, k) in the last row of the scoring matrix M. Traceback from position (m − 1, k) to position (0, 0) using T to generate the optimal alignment path γ. |
Post-processing of alignment path
In the alignment path , one -mer might be aligned to noise (scenario 4) or one current signal segment may be aligned with ‘-’ (scenario 3). Firstly, we remove such pairs from the alignment path. Then, we re-compute a weighted mean for each remaining -mer, since one -mer may be aligned to multiple current signal segments. Here, we combine these segments and calculate the weighted mean over corresponding data points with hidden state 1, that is .
Let us assume there are segments aligned to the same -mer , with denotes the number of data points assigned to ‘curr’ state () in the j-th segment and denotes the mean of the j-th segment. The weighted mean, denoted by , corresponding to the -mer , is defined as follows:
In the end, we obtain a one-to-one correspondence between each -mer and its newly derived weighted mean, denoted by . Note that the length of and is the same, that is . We denote the post-processed alignment path by .
Modification estimation
Modification state estimation on single molecule level
We are interested in estimating the modification state for 5-mer after alignment. has provided the mean and std for each 5-mer, as well as its modifications. Note that may only contain modification parameters for only a subset of 5-mers. For those 5-mers without modification parameters, their mean and std are set to ‘NA’. We could evaluate the probability density of a given 5-mer under the unmodified distribution or the modified distribution. By comparing the likelihoods, we could infer the modification state. Given the post-processed alignment path , we estimate the modification state of jth 5-mer by
where and .
In the end, we get the modification state vector for each 5-mer in the post-processed alignment path .
Modification probability estimation on single molecule level
In addition to estimating the modification state, it is also crucial to determine the modification probability of each 5-mer on each read, which is essential for accurately benchmarking the performance at the single-molecule level.
Given the post-processed alignment path , we estimate the modification probability of jth 5-mer by
where and . Indeed, the is the posterior probability given the distribution of modified state:
where we assume that the prior probabilities are equal, .
In the end, we get the modification probability vector for each 5-mer in the post-processed alignment path .
Modification rate estimation on site level
In real data analysis, we need to decide the modification rate of a genomic location, which quantifies the probability of a read mapped to this location to be in the modified state. For a genomic location, we denote the 5-mer from reads mapped to this location as . Note that all are identical 5-mers on different reads. The modification state of these 5-mers on different reads is denoted by . The modification rate is given by
Estimation of k-mer parameters table
This section provides the inference algorithms for 5-mer parameter table . This is necessary for training data, as we do not have accurate prior parameters for .
Given input data , we first perform base calling and mapping to get run HHMM for all reads such we get . We assume there are fragments for the ith read . Similarly, we have and .
The inference algorithm (Appendix 1–algorithm 3) then estimate given , and (https://github.com/nanoporetech/kmer_models; Brennen, 2023). As illustrated in Appendix 1—figure 4, the inference algorithm infers the parameters iteratively. With a given , we can use it to obtain the pairing between raw current signal segments and 5-mers, as well as modification states, through the alignment algorithm (‘Alignment of signal segments with reference sequence’). Given this pairing, we extract all current signal segments for a specific 5-mer, which may correspond to different genomic locations and different reads. As there are two states for this 5-mer (unmodified and modified), we expect to observe two peaks in the density of the current signal segments. Therefore, we could fit a two-component Gaussian mixture model to the current signal segments to estimate relevant parameters of the given 5-mer in .
Estimation of 5-mer parameters.
is initialized using . In training, our algorithm infers the parameters iteratively. After training, is fixed in testing.
| Appendix 1–algorithm 3. Inference algorithm of k-mer parameter table. |
|---|
| Input: S, M, Tref Output:Tkmer // Initialize Tkmer, ∀s ∈ {AAAAA,AAAAC, · · · , TTTTT}, ∀u ∈ {“un”, “mod”} for (μs,u, σs,u) ∈ Tkmer do (μs,u, σs,u) = Trefs,“un” end // iterate T times to update Tkmer for round = 1, 2 · · · T do // align all fragments of all reads Perform full alignment algorithm (Appendix 1–algorithm 1) for μi(k) and si(k) to get the alignment path γi(k) , where i = 1 · · · N and k = 1 · · · ni. Post-process (Post-processing of alignment path) the alignment path γi(k) into = . Concatenate all and re-index into one vector . // parameter estimation for s0 ∈ {AAAAA,AAAAC, · · · , TTTTT} do Collect all in where sj = s0 into one vector . Fit x using a two-component Gaussian Mixture Model with the first component mean fixed to , where (, ) =, and obtain the std of the first component σ1, the mean and std of the second component μ2, σ2. Update the parameters of 5-mer s0 in : and end Update Tkmer manually based on several heuristic criteria. end |
In practice, the following heuristic criteria are used: (1) for a given 5-mer, perform GMM for each genomic location with the same 5-mer and only pool those sites with high modification rate (2) there must be two peaks in the density plot of a given 5-mer; (3) the mean of the two peaks are far apart; (4) the weight of second component (modified state) must be large, e.g. >0.1; (5) the weight of the first component (unmodified state) should be larger than the second component (modified state), as majority nucleotides are unmodified; (6) limit the size of std for both components in the GMM (, ) to comply with the physical settings.
In the end, we will get an updated from the training data, which contains the modification parameters for the strongest 5-mers. As illustrated in Appendix 1—table 1, we do have parameters for the modification state for certain -mers such as ‘AAACA’, ‘AAACT’ etc.
Evaluation of segmentation and event alignment
The output of SegPore eventalign is the alignment path γ = (μ, s), which represents a one-to-one correspondence between the mean list μ = (μ1, μ2,...) and the k-mer lists = (s1, s2, ...). In addition, the corresponding standard deviation list σ = (σ1, σ2, ...) is also obtained along the alignment path. Appendix 1—table 3 is an example of the SegPore eventalign output, which illustrates one alignment path. The column ref_kmer corresponds to s, the column mean corresponds to μ, and the column stdv corresponds to σ.
SegPore eventalign output example.
| read_idx | read_name | contig | pos | strand | ref_kmer | model_kmer | mean | stdv | start_idx | end_idx | event_len |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 57 | + | GGTGT | GGTGT | 79.339 | 2.414 | 47296 | 47321 | 18 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 58 | + | GTGTC | GTGTC | 89.809 | 3.579 | 47276 | 47292 | 14 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 59 | + | TGTCT | TGTCT | 117.017 | 6.287 | 47251 | 47272 | 18 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 60 | + | GTCTT | GTCTT | 76.448 | 1.038 | 47148 | 47247 | 87 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 61 | + | TCTTA | TCTTA | 76.036 | 1.044 | 47117 | 47144 | 25 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 62 | + | CTTAG | CTTAG | 78.886 | 1.127 | 47092 | 47113 | 21 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 63 | + | TTAGT | TTAGT | 92.971 | 2.307 | 47007 | 47088 | 73 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 64 | + | TAGTG | TAGTG | 121.129 | 4.2 | 46882 | 46983 | 90 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 65 | + | AGTGT | AGTGT | 90.22 | 4.554 | 46822 | 46878 | 33 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 67 | + | TGTGC | TGTGC | 104.566 | 2.563 | 46798 | 46818 | 8 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 68 | + | GTGCT | GTGCT | 91.513 | 2.51 | 46778 | 46794 | 15 |
| 80741 | 003655f4-ff77-409d-a29e-20dc778d1c47 | A1 | 69 | + | TGCTT | TGCTT | 111.485 | 4.001 | 46754 | 46774 | 10 |
Given the eventalign results, we use two metrics to evaluate the performance of the segmentation and event alignment: (1) the average std , and (2) the average log-likelihood . Assuming there are reads in total (each read has one path), and the nth read has events (). The average std is defined as
and the average log-likelihood is defined as
where .
As shown in Appendix 1—figure 5, the red line represents the event mean and the shaded area represents the std for event . A poorly segmented raw signal corresponding to an event will exhibit a large standard deviation. Therefore, a smaller indicates lower variations within the raw signal segment, signifying better performance.
If an event is aligned to the correct reference -mer, the mean will be close to the reference and the log-likelihood will be large. So higher means more similar results to ONT’s estimates and better performances.
Appendix 2
Introduction
We have introduced the hierarchical hidden Markov model (HHMM) in Appendix 1. The HHMM is a two-layer HMM, consisting of the outer HMM and the inner HMM. The outer HMM specifies the base blocks and transition blocks. The inner HMM specifies the base blocks. Linear models are used for the transition blocks. However, we have not provided the details for the parameter inference. This document aims to provide a comprehensive introduction to the inference algorithm.
As we know, the raw fast5 files are very large, which contain millions of reads. Each read has 20,000 measurements in the raw current signal. This poses a significant computational burden to the HHMM inference algorithm. Therefore, we used GPU to accelerate the inference algorithm.
We will first introduce the general inference algorithm, then describe the implementation details.
For simplicity, we assume the input is a raw current signal of a single read with measurements . The output is the hidden states of the outer HMM and the hidden states of the inner HMM , where and .
Given a hidden state configuration of the outer HMM , it partitions the raw current signal into alternating base blocks and transition blocks , where denotes the label of ith block. Note that and . Now we can denote .
The general idea of the inference algorithm is that we will calculate the likelihood for all possible configurations of the hidden states of the outer HMM and choose the one with the largest likelihood as our final estimation. Due to the special constraints of , we can enumerate all possible configurations (‘Enumeration of hidden states of outer HMM’). Here we denote the full configuration set of by . The general inference algorithm is discussed in detail in ‘General inference algorithm’.
The general inference algorithm provides the analytic forms for calculating the exact likelihood. However, a direct implementation of the algorithm (Python and C++) requires huge amounts of computation resources and time to handle a typical fast5 file. Therefore, we implement a GPU version of the inference algorithm (‘Implementation details’), in which base blocks from different configurations are simultaneously inferred on different GPU cores.
Enumeration of hidden states of outer HMM
This section describes how to get the full configuration set given the raw signal . Obviously, it is not possible to enumerate all hidden state (where ) because there are configurations and can be larger than 20,000. However, there exist special properties of the raw signal : (1) the signal should be in alternating base and transition blocks (2) the hidden state does not change within a block (3) the average length of base blocks is larger than that of the transition blocks (4) the variation of base blocks is smaller than that of transition blocks. Based on these data properties, we could partition the raw signal into atomic level base and transition blocks (‘Atomic block initialization’). Multiple consecutive atomic level base and transition blocks can be merged into a single larger base block. As shown in Appendix 2—figure 1a, we partition the raw signal into 15 atomic-level blocks . We can merge into a new base block or into a new base block (Appendix 2—figure 1b).
Given a pre-defined partition of raw signal , we can derive the atomic blocks
. To facilitate the description of the enumeration algorithm, we define to be the merged block of all blocks in , where . A path is then a list of non-overlapping and consecutive blocks. For example, the path corresponding to in Appendix 2—figure 1b is given by . To reduce the number of paths, we set the maximum number of base blocks in any merged block to . The enumeration algorithm is given by Appendix 2–algorithm 1.
| Appendix 2–algorithm 1. Enumeration of hidden states of outer HMM. |
|---|
| Input: g0, Nmax, K Output: Derive (B1, T1, B2, T2,··· ,BK, TK, BK+1)) from g0 for path in recur_enum(0) do Convert path to g Append g to end // Recursive function to enumerate all paths starting from position st Function recur_enum(st): res = [] if st = K then return [BK+1] end // “[” is inclusive and “)” is exclusive for i ∈ [st, K + 1) do if i − st > Nmax then return res end tmp_seg = seg(st, i) all_paths_from_i = recur_enum(i) for path in all_ paths_ from_i do tmp_path = concatenate tmp_seg with path Append tmp_path to res end end return res; |
General inference algorithm
Given all possible hidden state configurations of the outer HMM , we want to calculate the likelihood for all . In order to calculate the likelihood, we need to specify the parameters for the emission distributions.
Given , we can derive that is divided into alternating base and transition blocks. We denote all blocks by , where represent the ith block. The base blocks are given by , where and . The transition blocks are given by , where and .
We pre-specify the transition probabilities between the base state and the transition state in the outer HMM, as shown in Appendix 2—table 1. The pre-specification has two considerations: (1) the probability of a transition from B to T after 10 consecutive B states is 0.1; (2) the probability of a transition from T to B after 10 consecutive T states is larger than 0.9; (3) to reduce computation load in the parameter inference. We pre-specify the initial probability of the first hidden state to 1, since always starts with a ‘B’ state.
Transition parameters of the outer HMM ().
| ‘B’ | ‘T’ | |
|---|---|---|
| ‘B’ | 0.99 | 0.01 |
| ‘T’ | 0.10 | 0.90 |
For inner HMM, we pre-specify the transition probabilities shown in Appendix 2—table 2, which is based on the following consideration (1) large proportion of the data should come from the ‘curr’ state, that is it is easy to transit from other states to ‘curr’ state; (2) it is unlikely to transit from ‘prev’ to ‘next’, and vice versa; (3) it is unlikely to transit from other states to the ‘noise’ state. We pre-specify the initial probability of the first hidden state to 0.25, which gives equal probabilities for all four states. We pre-specify the parameters of the ‘noise’ state () to and , which are estimated from the current signals across different datasets to represent a reasonable range.
Transition parameters of the inner HMM ().
| ‘curr’ | ‘prev’ | ‘next’ | ‘noise’ | |
|---|---|---|---|---|
| ‘curr’ | 0.925 | 0.025 | 0.025 | 0.025 |
| ‘prev’ | 0.300 | 0.500 | 0.100 | 0.100 |
| ‘next’ | 0.300 | 0.100 | 0.500 | 0.100 |
| ‘noise’ | 0.300 | 0.100 | 0.100 | 0.500 |
The remaining parameters, mainly associated with emission distributions, need to be estimated. For base blocks, we have inner HMM parameters and . For transition blocks, we have linear coefficients , and noise std .
The general idea of the inference algorithm is to calculate the joint likelihood of the outer HMM given for any . However, it is not possible to calculate the emission probability for each measurement as there exist dependencies with a base or transition block. Here we first calculate the joint likelihood for a base block (inner HMM) or a transition block (linear model), then average the joint likelihood to each measurement to use it as substitute for the emission probability of each measurement in the outer HMM. Here we use to denote the approximated emission probability of each measurement. Note that is derived either based on the inner HMM or the linear model, and .
The general inference algorithm is given by Appendix 2–algorithm 2.
| Appendix 2–algorithm 2. General inference algorithm for HHMM. |
|---|
| Input: y, , Touter, πouter, Tinner, πinner, Output: best path of outer HMM, best path of inner HMM, ϕ, σ, β0, β1, σϵ // Initialization set all elements of to 0, and Γ ← {} // Calculate likelihood for each g for g ∈ do Divide y into base and transition blocks (y(1), y(2), · · · , y(2K+1)) based on g // transition state parameter estimation Estimate using linear model (Appendix 2–algorithm 3), where ti = 2i and i ∈ {1, 2,···, K} Update relevant part of ω using . // base state parameter estimation Initialize ϕ(bi) and σ(bi) use the empirical mean and std of y(bi), where bi = 2i − 1 and i ∈ {1, 2,···, K + 1} ϕnew ← ϕ σnew ← σ for round = 1, 2,···, R do for i ∈ (1, 2,···, K + 1) do bi ← 2i − 1 // inner HMM parameter estimation Estimate using Forward-backward algorithm (Appendix 2–algorithm 4) given Update corresponding elements of ϕnew and σnew using Update relevant part of h and ω using and end ϕ ← ϕnew σ ← σnew end Calculate the joint likelihood of the outer HMM using Append h and γ to and Γ, respectively. end Choose the largest likelihood from Γ and retrieve the corresponding g and h from and , respectively. |
Implementation details
Atomic block initialization
As mentioned in ‘Enumeration of hidden states of outer HMM’, we need to partition the raw signal into atomic blocks
. If we can identify all atomic transition blocks , then the atomic base blocks are automatically determined.
We assume atomic transition blocks have higher variations than atomic base blocks. If we fit a line to a transition block, the absolute value of the slope of the line should be large. Based on this idea, we can obtain the atomic transition blocks as follows:
For each measurement , fit another line to and obtain the slope , fit a line to and obtain the slope , use as the absolute slope for .
Smooth the obtain slopes by taking the mean of a sliding window of size. Now we get the smoothed
Find peaks from using the find_peaks() function in the Python Scipy.signal package. Here, set the minimum distance of detected peaks to 10. Now we get a list of peak positions .
Expand to the atomic transition block . We fit a 3-segment spline to
, where . The slopes of the first segment and the third segment are set to 0, while a standard linear model is fit to the second segment . In the fitting, the slope of the second segment should be larger than 5.0. After the fitting of the spline, we take the residuals and calculate the joint likelihood assuming each residual follows a Gaussian distribution . By enumerating all , we pick the with the largest joint likelihood. Therefore, the final kth transition block corresponding to the interval .
Emission probabilities of linear model
For the ith transition block, the measurements are given by , where and . We denote the corresponding index of by . Here we want to get the emission probabilities for each measurement , which can be used by the outer HMM. Detailed steps are provided in the following Appendix 2–algorithm 3.
| Appendix 2–algorithm 3. Emission probabilities of linear model. |
|---|
| Input: , Output: , , , initialize to a 1 × Nti all zero vector. Fit a linear model of y(ti) versus x(ti), obtain the maximum likelihood estimates and Calculate the maximum log joint likelihood = Set each element of to exp |
Emission probabilities of inner HMM
This section describes how to obtain the approximated emission probabilities of the outer HMM for a the ith base block , which is modeled using an inner HMM. Here and .
For the inner HMM, we list all the parameters in Appendix 2—table 3.
Parameters of the inner HMM for the th block.
| Parameter | Fixed parameter | Description |
|---|---|---|
| Yes | Transition probabilities between hidden states (‘curr’, ‘prev’, ‘next’, ‘noise’), specified in Appendix 2—table 2 | |
| Yes | probability of the first hidden state, , where | |
| Yes | lower bound of uniform distribution, | |
| Yes | upper bound of uniform distribution, | |
| No | Gaussian mean ofith base block | |
| No | Gaussian std ofi th base block | |
| No | Gaussian mean of th base block | |
| No | Gaussian std of th base block | |
| No | Gaussian mean of th base block | |
| No | Gaussian std of th base block |
We want to obtain the approximated emission probabilities , hidden states , estimated parameters and for the ith base block . To achieve this, we follow the Forward-backward algorithm in Bishop and Nasrabadi, 2006 (Chapter 13.2). The general idea of the Forward-backward algorithm is to use a -function to approximate the marginal likelihood, then use an expectation-maximization (EM) algorithm for the parameter estimation.
We introduce the following notations for the Forward-backward algorithm. We use to denote the hidden states, where is a matrix, and . If is 1, then it means corresponds to the kth hidden state. We use () to denote the forward probabilities and () to denote the backward probabilities. () denotes the posterior distribution calculated from and . () denotes the joint distribution of two successive hidden states. We use to denote all parameters to be estimated.
Here we only need to update , , , , and in the inference process of inner HMM for ith base block. Our inference algorithm is a modification of the standard Forward-backward algorithm based on the defined -function, as described in Appendix 2–algorithm 4.
| Appendix 2–algorithm 4. Inference algorithm of inner HMM. |
|---|
| Input: Output: Initialize and to the all zero vector. Initialize α, β, γ, ξ // EM algorithm for round = 1, 2, · · · , R do E-step: update α, β, γ, and ξ given M-step: calculate and by Equation 2 and Equation 3 Calculate the Q-function (log marginal likelihood) given all estimated parameters end Derive from by taking the state with highest probability for each data point, i.e. Set each element of to . |
GPU-accelerated parameter inference
Given the general inference algorithm (‘General inference algorithm’) and algorithms for calculating the emission probabilities for the inner HMM (Appendix 2–algorithm 4) and linear model (Appendix 2–algorithm 3), we have the full algorithm for the parameter inference. The actual inference, however, can be really slow since the input fast5 data are huge. Here we use heuristics and GPU parallelization to accelerate the parameter inference.
We divide the raw current signal of each read into overlapping regions, which start at an atomic base block and also end with an atomic base block. As shown in Appendix 2—figure 2, we process the rth region independently to get the segmentation results . This process is based on the following heuristics: the segmentation result of one region will not affect another region when they are far away from each other. The inference is performed in two rounds. In the first round, we obtain the segmentation results for non-overlapping regions. The center part of the segmentation is not affected by the flank parts, which may be affected by neighboring regions. In the second round, we merge relevant parts of neighboring regions near the touch point as a new region based on segmentation results of the first round. After that, we perform the inference for the new regions to get the final segmentations for the whole read.
We use GPU to parallelize the inference algorithm of inner HMM (Alg. 4). For each region, we first enumerate all possible paths and calculate the likelihood for each candidate path (‘Enumeration of hidden states of outer HMM’). As shown in Appendix 2—figure 2, we need to perform inner HMM parameter inference for different paths of different regions of one read. We consider parallelizing the computation of different merged blocks in different paths of different regions. We first collect all merged blocks and sort them by size and only keep unique blocks, which can be identified by the combinatorial key of region index, start atomic block index, end atomic block index. Then we run the inner HMM inference (Alg. 4) of each merged block on a GPU core. Note that all GPU cores perform the same computation step of Alg. 4 simultaneously, but for different merged blocks.
Illustration of enumeration of hidden states of outer HMM.
(a) Raw signal and initialized to atomic segmentations. (b) Examples of the full configuration set .
Data availability
The data utilized in this study are obtained from publicly available repositories. Details regarding the accession number and data processing can be found in Methods. The resulting data and the source code are hosted on GitHub (https://github.com/guangzhaocs/SegPore copy archived at Cheng, 2025).
-
NCBI BioProjectID PRJEB40872. Differential RNA modifications from dRNA-seq.
-
NCBI BioProjectID PRJEB44348. SGNEx: A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines.
-
NCBI Sequence Read ArchiveID SRP166020. Decoding Epitranscriptional Landscapes form Native RNA Sequences.
References
-
MODOMICS: a database of RNA modification pathways. 2021 updateNucleic Acids Research 50:D231–D235.https://doi.org/10.1093/nar/gkab1083
-
Helicase SPRNTing through the nanoporePNAS 114:11809–11811.https://doi.org/10.1073/pnas.1716866114
-
Coordination of RNA modifications in the brain and beyondMolecular Psychiatry 28:2737–2749.https://doi.org/10.1038/s41380-023-02083-2
-
Decoding the epitranscriptional landscape from native RNA sequencesNucleic Acids Research 49:e7.https://doi.org/10.1093/nar/gkaa620
-
Atlas of quantitative single-base-resolution N6-methyl-adenine methylomesNature Communications 10:5636.https://doi.org/10.1038/s41467-019-13561-z
-
Minimap2: pairwise alignment for nucleotide sequencesBioinformatics 34:3094–3100.https://doi.org/10.1093/bioinformatics/bty191
-
Accurate detection of m6A RNA modifications in native RNA sequencesNature Communications 10:4079.https://doi.org/10.1038/s41467-019-11713-9
-
ADAR RNA modifications, the epitranscriptome and innate immunityTrends in Biochemical Sciences 46:758–771.https://doi.org/10.1016/j.tibs.2021.02.002
-
Detecting DNA cytosine methylation using nanopore sequencingNature Methods 14:407–410.https://doi.org/10.1038/nmeth.4184
-
Duplexed direct RNA sequencing protocol using polyadenylation and polyuridylationMicrobiology Resource Announcements 14:e0104124.https://doi.org/10.1128/mra.01041-24
-
The N6-methyladenosine RNA modification in acute myeloid leukemiaCurrent Opinion in Hematology 28:80–85.https://doi.org/10.1097/MOH.0000000000000636
-
The expanding role of RNA modifications in plant RNA polymerase II transcripts: highlights and perspectivesJournal of Experimental Botany 74:3975–3986.https://doi.org/10.1093/jxb/erad136
Article and author information
Author details
Funding
Research Council of Finland (335858)
- Guangzhao Cheng
- Lu Cheng
Research Council of Finland (358086)
- Guangzhao Cheng
- Lu Cheng
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We would like to thank Prof. Zhijie Tan from Wuhan University for a useful discussion about the molecule dynamics of Nanopore sequencing, Dr. Dan Zhang from Sichuan University for helpful tutorials about Nanopore analysis workflows. We also thank Prof. Luo Guanzheng for sharing the m6A benchmark baseline results. GC and LC acknowledge the computational resources provided by the Aalto Science-IT project. This work was supported by Research Council of Finland grants [335858, 358086 to GC and LC].
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
- Reviewed Preprint version 3:
- Version of Record published:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.104618. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Cheng et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 1,776
- views
-
- 86
- downloads
-
- 0
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.