Bayesian inference of kinetic schemes for ion channels by Kalman filtering
Abstract
Inferring adequate kinetic schemes for ion channel gating from ensemble currents is a daunting task due to limited information in the data. We address this problem by using a parallelized Bayesian filter to specify hidden Markov models for current and fluorescence data. We demonstrate the flexibility of this algorithm by including different noise distributions. Our generalized Kalman filter outperforms both a classical Kalman filter and a rate equation approach when applied to patchclamp data exhibiting realistic openchannel noise. The derived generalization also enables inclusion of orthogonal fluorescence data, making unidentifiable parameters identifiable and increasing the accuracy of the parameter estimates by an order of magnitude. By using Bayesian highest credibility volumes, we found that our approach, in contrast to the rate equation approach, yields a realistic uncertainty quantification. Furthermore, the Bayesian filter delivers negligibly biased estimates for a wider range of data quality. For some data sets, it identifies more parameters than the rate equation approach. These results also demonstrate the power of assessing the validity of algorithms by Bayesian credibility volumes in general. Finally, we show that our Bayesian filter is more robust against errors induced by either analog filtering before analogtodigital conversion or by limited time resolution of fluorescence data than a rate equation approach.
Editor's evaluation
The authors develop a Bayesian approach to modeling signals arising from ensembles of ion channels that can incorporate multiple simultaneously recorded signals such as fluorescence and ionic current. For simulated data from a simple ion channel model where ligand binding drives pore opening, they show that their approach enhances parameter identifiability and/or estimates of parameter uncertainty over more traditional approaches. The developed approach provides a valuable tool for modeling macroscopic time series data including data with multiple observation channels.
https://doi.org/10.7554/eLife.62714.sa0Introduction
Ion channels are essential proteins for the homeostasis of an organism. Disturbance of their function by mutations often causes severe diseases, such as epilepsy (Oyrer et al., 2018; GoldschenOhm et al., 2010), sudden cardiac death (Clancy and Rudy, 2001), or sick sinus syndrome (Verkerk and Wilders, 2014) indicating a medical need (GoldschenOhm et al., 2010) to gain further insight into the biophysics of ion channels. The gating of ion channels is usually interpreted by kinetic schemes which are inferred either from macroscopic currents with rate equations (REs) (Colquhoun and Hawkes, 1995b; Celentano and Hawkes, 2004; Milescu et al., 2005; Stepanyuk et al., 2011; Wang et al., 2012) or from singlechannel currents using dwell time distributions (Neher and Sakmann, 1976; Colquhoun et al., 1997a; Horn and Lange, 1983; Qin et al., 1996; Epstein et al., 2016; Siekmann et al., 2016) or hidden Markov models (HMMs) (Chung et al., 1990; Fredkin and Rice, 1992; Qin et al., 2000; Venkataramanan and Sigworth, 2002). A HMM consists of a discrete set of metastable states. Changes of their occupation occur as random events over time. Each state is characterized by transition probabilities, related to transition rates, and a probability distribution of the observed signal (Rabiner, 1989). It is becoming increasingly clear that the use of Bayesian statistics in HMM estimation constitutes a major advantage (Ball et al., 1999; De Gunst et al., 2001; Rosales et al., 2001; Rosales, 2004; Gin et al., 2009; Siekmann et al., 2012; Siekmann et al., 2011; Hines et al., 2015; Sgouralis and Pressé, 2017b; Sgouralis and Pressé, 2017a; KinzThompson and Gonzalez, 2018). In ensemble patches, simultaneous orthogonal fluorescence measurement of either conformational changes (Zheng and Zagotta, 2000; Taraska and Zagotta, 2007; Taraska et al., 2009; BrueningWright et al., 2007; Kalstrup and Blunck, 2013; Kalstrup and Blunck, 2018; Wulf and Pless, 2018) or ligand binding itself (Biskup et al., 2007; Kusch et al., 2010; Kusch et al., 2011; Wu et al., 2011) has increased insight into the complexity of channel activation.
Currently, a Bayesian estimator that can collect information from crosscorrelations and time correlations inherent in multidimensional signals of ensembles of ion channels is still missing. Traditionally, macroscopic currents are analyzed with solutions of REs which yield a point estimate of the rate matrix or its eigenvalues (Colquhoun et al., 1997a; Sakmann and Neher, 2013; d’Alcantara et al., 2002; Milescu et al., 2005; Wang et al., 2012) if they are fitted to the data. The RE approach is based on a deterministic differential equation derived by averaging the chemical master equation (CME) for the underlying kinetic scheme (Kurtz, 1972; Van Kampen, 1992; Jahnke and Huisinga, 2007). Its accuracy can be improved by processing the information contained in the intrinsic noise (stochastic gating and binding) (Milescu et al., 2005; Munsky et al., 2009). Nevertheless, all deterministic approaches do not use the information of the time and crosscorrelations of the intrinsic noise. These deterministic approaches are asymptotically valid for an infinite number of channels. Thus, a time trace with a finite number of channels contains, strictly speaking, only one independent data point. Previous rigorous attempts to incorporate the autocorrelation of the intrinsic noise of current data into the estimation (Celentano and Hawkes, 2004) suffer from cubic computational complexity (Stepanyuk et al., 2011) in the amount of data points, rendering the algorithm nonoptimal or even impractical for a Bayesian analysis of larger data set. To understand this, note, that a maximum likelihood optimization (ML) usually takes several orders of magnitude fewer likelihood evaluations to converge compared to the number of posterior evaluations when one samples the posterior. One Monte Carlo iteration (Betancourt, 2017) evaluates the posterior distribution and its derivatives many times to propose one sample from the posterior. Stepanyuk suggested an algorithm (Stepanyuk et al., 2011; Stepanyuk et al., 2014) which derives from the algorithm of Celentano and Hawkes, 2004 but evaluates the likelihood quicker. Under certain conditions, Stepanyuk’s algorithm can be faster than the Kalman filter (Moffatt, 2007). The algorithm by Milescu et al., 2005 achieves its superior computation time efficiency at the cost of ignoring the time correlations of the fluctuations. A further argument for our approach, independent of the Bayesian context, is investigated in this paper: The KF is the minimal variance filter (Anderson and Moore, 2012). Instead of strong analog filtering of currents to reduce the noise, but with the inevitable signal distortions (Silberberg and Magleby, 1993), we suggest to apply the KF with higher analyzing frequency on minimally filtered data.
Phenomenological difference between an RE approach and our Bayesian filter
Two major problems for parameter inference for the dynamics of the ion channel ensemble $\mathbf{n}(t)$ are: (I) that currents are only lowdimensional observations (e.g. one dimension for patchclamp or two for cPCF) of a highdimensional process (dimension being the number of model states) blurred by noise and (II) the fluctuations due to the stochastic gating and binding process cause autocorrelation in the signal. Traditional analyses for macroscopic PC data (and also for related fluorescence data) by the RE approach, e.g. Milescu et al., 2005 ignores the longlasting autocorrelations of the deviations (Box 1—figure 1a) blue and orange curves from the mean time trace (black) that occur in real data measured from a finite ensemble. To account for the autocorrelation in the signal, an optimal prediction (Box 1—figure 1b) of the future signal distribution $\mathbb{P}(y({t}_{2}))$ should use the measurement $y({t}_{1})$ from the current time step t_{1} to update the belief about the underlying $\mathbf{n}({t}_{1})$. Based on stochastic modelling of the time evolution of the channel ensemble, it then predicts $\mathbb{P}(y({t}_{2}))$.
To demonstrate the difference how the two algorithms analyze the data, we compute the autocorrelation of the residuals of the data. After the analysis with either the RE approach or the KF, we can construct from the model with the mean predicted signal $\mathbf{H}\mathbb{E}[\mathbf{n}({t}_{i})]$ (see Eq. 4 for the definition of $\mathbf{H}$) and the predicted standard deviation $\sqrt{\mathrm{v}\mathrm{a}\mathrm{r}[y({t}_{i})]}$ the normalized residual time trace of the data which are defined as
Filtering (fitting) with the KF (given the true kinetic scheme) one expects to find a whitenoise process for the residuals. Plots of the autocorrelation function of both signal components (Box 1—figure 2a) confirms our expectation. The estimated autocorrelation vanishes after one multiple of the lag time (the interval between sampling points), which means that the residuals are indeed a whitenoise process. In contrast, the residuals derived from the RE approach (Box 1—figure 2b) display long lasting periodic autocorrelations.
On the one hand, a complete HMM analysis (forward algorithm) would deliver the most exact likelihood of macroscopic data. On the other hand, the computational complexity of the forward algorithm limits this type of analysis in ensemble patches to no more than a few hundred channels per time trace (Moffatt, 2007). To tame the computational complexity (Jahnke and Huisinga, 2007), we approximate the solution of the CME with a Kalman filter (KF), thereby remaining in a stochastic framework Kalman, 1960. This allows us to explicitly model the time evolution of the first two moments (mean value and covariance matrix) of the probability distribution of the hidden channel states. Notably, for linear (first or pseudo) Gaussian system dynamics, the KF is optimal in producing a minimal prediction error for the mean state. KFs have been used previously in several protein expression studies which also demonstrate the connection of the KF to the linear noise approximation (Komorowski et al., 2009; Finkenstädt et al., 2013; Fearnhead et al., 2014; Folia and Rattray, 2018; Calderazzo et al., 2019; Gopalakrishnan et al., 2011).
Our approach generalizes the work of Moffatt, 2007 by including statedependent fluctuations such as openchannel noise and Poisson noise in additional fluorescence data. A central technical difficulty which we solved is that due to the statedependent noise the central Bayesian update equation loses its analytical solution. We derived an approximation which is correct for the first two moments of the probability distributions. Stochastic rather than deterministic modeling is generally preferable for small systems or nonlinear dynamics (Van Kampen, 1992; Gillespie and Golightly, 2012). However, even with simulated data of unrealistic high numbers of channels per patch (more than several thousands within one patch), the KF outperforms the deterministic approach in estimating the model parameters. Moffatt, 2007 already demonstrated the advantage of the KF to learn absolute rates from time traces at equilibrium. Like all algorithms that estimate the variance and the mean (Milescu et al., 2005) the KF can infer the number of channels ${N}_{\mathrm{ch}}$ for each time trace, the singlechannel current $i$ and analogous in optical recordings the mean number ${\lambda}_{\text{b}}$ of photons from bound ligands per recorded frame. To select models and to identify parameters, stochastic models are formulated within the framework of Bayesian statistics where parameters are assigned uncertainties by treating them as random variables (Hines, 2015; Ball, 2016). In contrast, previous work on ensemble currents combined the KF only with ML estimation (Moffatt, 2007). Difficulties in treating simple stochastic models by ML approaches in combination with the KF (AugerMéthé et al., 2016), especially with nonobservable dynamics, justify the computational burden of Bayesian statistics. Bayesian statistics has an intuitive way to incorporate soft or hard constrains from diverse sources of prior information. Those sources include mathematical prerequisites, other experiments, simulations or theoretical assumptions. They are applied as additional model assumptions by a prior probability distribution over the possible parameter space. Hence, knowledge of the model parameters prior to the experiment are correctly accounted for in the analyzes of the new data. Alternatively, some of these benefits of prior knowledge can be incorporated by penalized maximum likelihood (Salari et al., 2018; Navarro et al., 2018). Bayesian inference provides outmatching tools for modeling over point estimates: First, the Bayesian approach is still applicable in situations where parameters are not identifiable (Hines et al., 2014; Middendorf and Aldrich, 2017b) or posteriors are nonGaussian, whereas ML fitting ceases to be valid (Calderhead et al., 2013; Watanabe, 2007). Second, a Bayesian approach provides superior model selection tools for singular models such as HMMs or KFs Gelman et al. (2014). Third, Bayesian statistics has a correct uncertainty quantification (Gillespie and Golightly, 2012) based on the data and the prior for the statistical problem. In contrast, ML or maximum posterior approaches lack uncertainty quantification based on one data set (Joshi et al., 2006). Only under optimal conditions their uncertainty quantification becomes equivalent to Bayesian credibility volumes (Jaynes and Kempthorne, 1976). This study focuses on the effects on the posterior due to formulating the likelihood via a KF instead of an RE approach and the benefits of adding a second dimension of observation. We consider the performance of our algorithm against the gold standards in four different aspects: (I) The relative distance of the posterior to the true values, (II) the uncertainty quantification, here in the form of the shape of the posterior, (III) parameter identifiability, and (IV) robustness against typical misspecifications of the likelihood (such as ignoring that currents are filtered or that the integration time of fluorescence data points is finite) of real experimental data.
Results and discussion
Simulation of ligandgated ionchannel data
Here we treat an exemplary ligandgated channel with two ligand binding steps and one openclosed isomerization described by an HMM (see Figure 1a). For this model, confocal patchclamp fluorometry (cPCF) data were simulated: time courses of ligand binding and channel current upon concentration jumps were generated (see Appendix 5 and Materials and methods section). Idealized example data with added white noise are shown in Figure 1b–d. We added realistic instrumental noise to the simulated data (see Appendix 5). A qualitative description of the statistical problem that needs to be addressed when modeling time series data such as the simulated is outlined in Box. 1.
Kalman filter derived from a Bayesian filter
Here and in the Materials and methods section, we derive the mathematical tools to account correctly for the stochastic Markov dynamics of single molecules in the fluctuations of macroscopic signals. The KF is a Bayesian filter (see Materials and methods), that is a continuous state HMM with a multivariate normal transition probability Ghahramani, 1997 (Figure 2a). We define the hidden ensemble state vector
which counts the number of channels in each state $\mathbf{s}$ (see Methods). To make use of the KF, we assume the following general form of the dynamic model: The evolution of $\mathbf{n}(t)$ is determined by a linear model that is parametrized by the state evolution matrix $\mathrm{T}$
where ∼ means sampled from and $\mathcal{N}(\cdot \mathit{\mu},\mathbf{\Sigma})$ is a shorthand for the multivariate normal distribution, with the mean μ and the variancecovariance matrix $\mathbf{\Sigma}$. The state evolution matrix (transition matrix) is related to the rate matrix $\mathbf{K}$ by the matrix exponential $\mathbf{T}=\mathrm{exp}(\mathbf{K}\mathrm{\Delta}t)$. The mean of the hidden state evolves according to the equation $\mathbb{E}[{\mathbf{n}}_{t+1}{\mathbf{n}}_{t}]=\mathbf{T}{\mathbf{n}}_{t}$. It is perturbed by normally distributed white process noise $\mathit{\omega}$ with the following properties: The mean value of the noise fulfills $\mathbb{E}[{\mathit{\omega}}_{t}]=0$ and the variancecovariance matrix of the noise is $\mathrm{cov}[{\mathit{\omega}}_{t},{\mathit{\omega}}_{t}]=\mathbf{Q}(\mathbf{T},{\mathbf{n}}_{t})$ (see Materials and methods Equation 38d, Ball, 2016). In short, Equation 3 defines a Gaussian Markov process.
The observations $\mathbf{y}}_{t$ depend linearly on the hidden state $\mathbf{n}}_{t$. The linear map is determined by an observation matrix $\mathbf{H}$.
The noise of the measurement setup (Appendix 5 and Equation 43) is modeled as a random perturbation of the mean observation vector. The noise fulfills $\mathbb{E}[\mathit{\nu}]=0$ and $\mathrm{cov}[{\mathit{\nu}}_{t},{\mathit{\nu}}_{t}]={\mathbf{\Sigma}}_{t}$. Equation 4 defines the stateconditioned observation distribution $\mathbb{O}$ (Figure 2a). If the system strictly obeys Equation 3 and Equation 4 then the KF is optimal in the sense that it is the minimum variance filter of that system Anderson and Moore, 2012. If the distributions of ν and ω are not normal, the KF is still the minimum variance filter in the class of all linear filters but there might be better nonlinear filters. In case of colored noise ν and ω the filtering equations (see Materials and methods) can be reformulated by state augmentation or measurementtimedifference approach techniques Chang, 2014. For each element in a sequence of hidden states $\{{\mathbf{n}}_{t}:0<t<T\}$ and for a fixed set of parameters $\mathit{\theta}$, an algorithm based on a Bayesian filter (Figure 2a), explicitly exploits the conditional dependencies of the assumed Markov process. A Bayesian filter recursively predicts prior distributions for the next $\mathbf{n}}_{t$
given what is known about $\mathbf{n}}_{t1$ due to $y}_{t1$. The KF as a special Bayesian filter assumes that the transition probability is multivariate normal according to Equation 3
Note, that Equation 6 is a central approximation of the KF. While the exact transition distribution of an ensemble of ion channels is the generalizedmultinomial distribution (Methods Equation 32), the quality of normal approximations to multinomial Milescu et al., 2005 or generalizedmultinomial Moffatt, 2007 distributions depends on the number of ion channels ${N}_{\mathrm{ch}}$ in the patch and on the position of the probability vector in the simplex space. The difference between the loglikelihoods of the true generalizedmultinomial dynamics and Equation 6 type approximation scales as $1/{N}_{\mathrm{c}\mathrm{h}}$ Moffatt, 2007. As a rule of thumb one should be careful with both algorithms for time traces with ${N}_{\mathrm{c}\mathrm{h}}\in \left[{10}^{1},{10}^{2}\right]$. Below or even inside this interval there are more qualified concepts such as the forward algorithm or even particle filters (Golightly and Wilkinson, 2011; Gillespie and Golightly, 2012) which avoid the normal approximation.
Each prediction of $\mathbf{n}}_{t$ (Equation 6) is followed by a correction step,
that allows to incorporate the current data point into the estimate, based on the Bayesian theorem (Chen, 2003). Additionally, the KF assumes (Anderson and Moore, 2012; Moffatt, 2007) a multivariate normal observation distribution
If the initial prior distribution is multivariate normal then due to the mathematical properties of the normal distributions the prior and posterior $\mathbb{P}(\cdot )$ in Equation 8 become multivariate normal Chen, 2003 for each time step. In this case, one can derive algebraic equations for the prediction (Materials and methods Equation 37, Equation 38d) and correction (Materials and methods Equation 58 and Equation 58) of the mean and covariance. The algebraic equations originate from the fact that a normal prior is the conjugated prior for the mean value of a normal likelihood. Due to the recursiveness of its equations, the KF has a time complexity that is linear in the number of data points, allowing a fast algorithm. The denominator of Equation 8 is the normal distributed marginal likelihood $\mathbb{L}({\mathbf{y}}_{t}{\mathcal{Y}}_{t1},\mathit{\theta})$ for each data point, which constructs by
a product marginal likelihood of normal distributions of the whole time trace $\mathcal{Y}}_{\mathrm{T}}=\{{\mathbf{y}}_{1},\dots ,{\mathbf{y}}_{{N}_{\mathrm{T}}}\$ of length ${N}_{\mathrm{T}}$ for the KF. For the derivation of $\mathbf{P}}_{t$ and $\mathbf{\Sigma}}_{t$ see Materials and methods (Equation 38d) and Equation 43. $\mathbf{P}}_{t$ is the covariance of the prior distribution over $\mathbf{n}(t)$ before the KF took $\mathbf{y}(t)$ into account. The likelihood for the data allows to ascribe a probability to the parameters $\mathit{\theta}$, given the observed data (Methods Equation 20). An illustration for the operation of the KF on the observation space is given in Figure 2b. The predicted mean signal $\mathbf{H}\mathbb{E}[\mathbf{n}(t)]$ corresponds to binding degree $B(t)=\frac{\mathbf{H}\mathbb{E}[\mathbf{n}(t){]}_{1}}{{N}_{\mathrm{c}\mathrm{h}}}$ and open probability $P}_{O}(t)=\frac{\mathbf{H}\mathbb{E}[\mathbf{n}(t){]}_{2}}{{N}_{\mathrm{c}\mathrm{h}}$. These values are plotted as vector trajectories.
The standard KF (Moffatt, 2007; Anderson and Moore, 2012; Chen, 2003) has additive constant noise ${\mathrm{\Sigma}}_{t}=const$ in the observation model. Thus, in this case a constant variance term $\mathrm{\Sigma}$ is added, in Equation 9 to the aleatory variance $\mathbf{H}{\mathbf{P}}_{t}{\mathbf{H}}^{\mathrm{\top}}$ which, as mentioned above, originates (Equation 38d) from the the fact that we do not know the true system state $\mathbf{n}(t)$. For signals with Poissondistributed photon counting or openchannel noise, we need to generalize the noise model to account for additional whitenoise fluctuations with $\mathbf{n}(t)$dependent variance. For instance, in singlechannel currents additional noise is often observed whose variance is referred to by $\sigma}_{\mathrm{o}\mathrm{p}}^{2$. In macroscopic currents this additional noise can be modeled by a term ${\sigma}_{\mathrm{o}\mathrm{p}}^{2}{n}_{4}(t)$, causing statedependency of our noise model.
The second noise term ${\nu}_{\mathrm{op}}$ is defined in terms of the first two moments $\mathbb{E}({\nu}_{\mathrm{o}\mathrm{p}})=0$ and $\mathrm{v}\mathrm{a}\mathrm{r}({\nu}_{\mathrm{o}\mathrm{p}})=\mathbb{E}({\nu}_{\mathrm{o}\mathrm{p}}^{2})={\sigma}_{\mathrm{o}\mathrm{p}}^{2}{n}_{4}(t)$. To the best of our knowledge such a statedependent noise makes the integration of the denominator of Equation 8 (which is also the incremental likelihood) intractable
This is because the state distribution $\mathcal{N}(\mathbf{n}\overline{\mathbf{n}}(t),\mathbf{P}(t))$ as the prior also influences the variance parameter of the likelihood which means that the conjugacy property is lost. While a normal distribution is the conjugated prior of the mean of a normal likelihood, it is not the conjugated prior for the variance. However, by applying the theorem of total variance decomposition Equation 46a we deduce a normal approximation to Equation 8 and to the related problem of Poissondistributed noise in fluorescence Equation 57, Equation 55a data. By computing the mean and the variance or covariance matrix of the signal, we can reformulate the noise model to fit the form of the traditional KF framework. Note, that the derived equations for the covariance matrix are still exact for the more general noise model. Mean and covariance just do not form a set of sufficient statistics anymore.
Our derivation is not limited to ligandgated ion channels. For example, when investigating voltagegated channels, the corresponding noise model can be easily adapted. This holds also when using the P/n protocol for which the noise model resembles that of the additional variance in the fluorescence signal. The additional variance is induced because the mean signal from the ligands swimming in the bulk (Materials amd methods Equation 43 Appendix 5) is eliminated by subtracting scaled mean reference signal which itself has an error. This manipulation adds additional variance to the resulting signal comparable to P/n protocol. Other experimental challenges, as for example series resistance compensation promoting oscillatory behavior of the amplifier, deserve certainly advanced treatment. Nevertheless, for voltageclamp experiments with a rate equation approach it also becomes clear (Lei et al., 2020) that modeling of the actual experimental limitations, including series resistance, membrane and pipette capacitance, voltage offsets, imperfect compensations by the amplifier, and leak currents are necessary for consistent kinetic scheme inference.
The Bayesian posterior distribution
encodes all information from model assumptions and experimental data used during model training (see Materials and methods). A full Bayesian inference is usually not an optimization (finding the global maximum or mode of the posterior or likelihood) but calculates all sorts of quantities derived from the posterior distribution such as mean values of any function $f$ including the mean value or covariance matrix of the parameters themselves or even the likelihood of the data.
Besides the covariance matrix of the parameter to express parameter uncertainty, the posterior allows to calculate a credibility volume. The smallest volume $V}_{P$ that encloses a probability mass $P$ of
is called the Highest Density Credibility Volume/Interval (HDCV/HDCI). Those credibility volumes should not be confused with confidence volumes although under certain conditions they can become equivalent. Given that our model sufficiently captures the true process, the true values $\mathit{\theta}}_{\mathrm{t}\mathrm{r}\mathrm{u}\mathrm{e}$ will be inside that volume with a probability $P$. Unfortunately, typically there is no analytical solution to Equation 12 . However, it can be solved numerically with Monte Carlo techniques, enabling to calculate all quantities related to Equation 13 and Equation 14 . Our algorithm uses automatic differentiation of the statistical model to sample from the posterior (Appendix 1—figure 1a) via Hamiltonian Monte Carlo (HMC) (Betancourt, 2017), see Appendix 7 , as provided by the Stan software (Hoffman and Gelman, 2014; Gelman et al., 2015).
Benchmark for PC data against the gold standard algorithms
We compare the posterior distribution (Figure 3) of our algorithm against Bayesian versions of the deterministic (Milescu et al., 2005) and stochastic (Moffatt, 2007) algorithms, which we consider as the gold standard algorithms for macroscopic patchclamp data. Simulated currents of a patch with $N}_{ch}=5\cdot {10}^{3$ are shown in (Figure 3d). The resulting posteriors (Figure 3a) show that both former algorithms are further away from the true parameter values with their maxima or mean values (Figure 3a). E.g., the relative error of the maximum of the posterior are $\mathrm{\Delta}{k}_{21}\approx 200\mathrm{\%}$ for Milescu et al., 2005 and $\mathrm{\Delta}{k}_{32}\approx 240\mathrm{\%}$ for Moffatt, 2007 . The four other parameters including the three equilibrium constants behave less problematic as judged by their relative error. Additionally, if one does not only judge the performance by the relative distance of maximum (or some other significant point) of the posterior but considers the spread of the posterior as well, it becomes apparent, that the marginal posterior of both former algorithms fail to cover the true values within at least the reasonable parts of their tails. Accordingly, for maximum likelihood inferences the true value would be far outside the estimated confidence interval. For the RE approach only the marginal posterior of $\stackrel{~}{K}}_{21$ is nicely centered over the true values and the marginal of $\stackrel{~}{K}}_{32$ could be considered to cover within a reasonable part of the distribution the true value. Uncertainty quantification is investigated in more detail further down (Figures 4—9). Note that in Figure 3a, parameter unidentifiability by heavy tails/ multiple maxima of the posterior distribution or (anti) correlation is easily visible as non axial symmetric patterns.

Figure 4—source data 1
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data1v2.zip

Figure 4—source data 2
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data2v2.zip

Figure 4—source data 3
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data3v2.zip

Figure 4—source data 4
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data4v2.zip

Figure 4—source data 5
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data5v2.zip

Figure 4—source data 6
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data6v2.zip

Figure 4—source data 7
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data7v2.zip

Figure 4—source data 8
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data8v2.zip

Figure 4—source data 9
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data9v2.zip

Figure 4—source data 10
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data10v2.zip

Figure 4—source data 11
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data11v2.zip

Figure 4—source data 12
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data12v2.zip
To assess the location of the posterior conditioned on ${N}_{\mathrm{ch}}$, we select the median vector $\mathit{\theta}$ of the marginal posteriors and calculate its Euclidean distance to the true values by:
This defines a single value to judge the overall accuracy of the posterior. Varying ${\sigma}_{\text{op}}/i$ reveals the range of the validity (Figure 3b) of the algorithm (red) from Moffatt, 2007 . While both stochastic approaches are nearly equivalent for low openchannel noise, the RE (blue) performs consistently poorer. It may seem surprising that even for In fact, the KF method beha the two stochastic algorithms start to produce different results. But considering the scaling (Materials and methods Equation 46a) of the total openchannel noise (top axis) from currents of an ensemble patch $\propto {({N}_{\mathrm{ch}}{P}_{\mathrm{open},\mathrm{max}}0.5)}^{0.5}{\sigma}_{\mathrm{open}}$ one sees that if $\propto {({N}_{\mathrm{ch}}{P}_{\mathrm{open},\mathrm{max}}0.5)}^{0.5}{\sigma}_{\mathrm{open}}$ approaches σ the traditional KF suffers from ignoring state dependent noise contributions. The lower scale changes with experiments (e.g. ${N}_{\mathrm{ch}}$ and ${\sigma}_{op}$). In contrast, the upper scale is largely independent of the particular measurements. The two different normalizations indicate an experimental intuition: “ Why should I consider the extra noise from the open state of the single channel if only ${\sigma}_{\mathrm{op}}/i={\sigma}_{\mathrm{op}}/\sigma \approx 0.01$” is misleading. The small advantage of our algorithm for small ${\sigma}_{\mathrm{op}}/i$ over Moffatt, 2007 is due to the fact that we could apply an informative prior in the formulation of the inference problem on ${\sigma}_{\mathrm{exp}}\sim \mathrm{normal}({\sigma}_{\mathrm{exp},\mathrm{true}}^{2},{\sigma}_{\mathrm{exp},\mathrm{true}}^{2}\cdot 0.01)$ by taking advantage of our generalization (Equation 46a) Bayesian filter. Further, Figure 3b indicates the importance that the functional form of the likelihood is flexible enough to capture the second order statistics of the noise of the data sufficiently.
For an increasing data quality, which in our benchmark is an increasing ${N}_{\mathrm{ch}}$ per trace, we show (Figure 3c) that the deterministic RE and our Bayesian filter are consistent estimators, that is they converge in distribution to the true parameter values with their posterior maxima or median for increasing data quality. The scaling of the RE approach (blue) and our Bayesian filter (green) vs. ${N}_{\mathrm{ch}}$ shows that for large ${N}_{\mathrm{ch}}$ both algorithms seem to have a constant error ratio relative to each other. They are both well described by $\mathrm{error}({N}_{\mathrm{ch}})\propto a/\sqrt{{N}_{\mathrm{ch}}}$ with an error ratio computed from the fit of 4.4. Thus, although our statistical model is singular (meaning that the fisher information matrix is singular Watanabe, 2007), its asymptotic learning behaviour is similar to a regular model (Figure 4c) which, however, means that the euclidean error from both algorithms stays different also for large ${N}_{\mathrm{ch}}$. For data with $N}_{\mathrm{c}\mathrm{h}}<2\cdot {10}^{3$ the samples from the posterior typically indicate that the posterior is improper which is defined as
We consider this as the case of unidentified parameters. This datadriven definition is in so far different from structural and practical identifiability definitions (Middendorf and Aldrich, 2017a; Middendorf and Aldrich, 2017b) as the two latter cases are not distinguished. Still the practical consequence of structural or practical unidentifiability, which is usually an improper posterior, is captured. Cases of structural or practical unidentifiability which lead to a confined region of constant posterior density will be considered identified as the posterior is still normalizable thus the uncertainty quantification will still be correct, even when this finding is not sufficient to answer the research question at hand.
Benchmarking for cPCF data against the gold standard algorithm
For the simulated time traces with an optimistically high signaltonoise assumption, the posterior of the KF (from hereon KF denotes our Bayesian Filter) and a RE (Milescu et al., 2005) approach are compared for cPCF data (Figure 4a–d). For a brief introduction of the RE approach, see Appendix 8 . The failure to analyze PC data with moderate openchannel noise (Moffatt, 2007; Figure 3a) disqualifies the classical KF with its constant noise variance also as a useful algorithm for fluorescence data, because here the Poisson distribution of the signal generates an even stronger state dependency of the signal variance.
By “high signaltonoise assumption” , we refer to an experimental situation with a standard deviation of the current recordings ${\sigma}_{\mathrm{ex}}/i=0.5$, a low additional ${\sigma}_{\mathrm{op}}/i=0.05$, and a high mean photon rate per bound ligand and frame ${\lambda}_{\mathrm{b}}=5$. Additionally, we assume vanishing fluorescence background noise generated by the ligands in the bulk. The benefit of the high signaltonoise is that the limitations of the two different approximations to the stochastic process of binding and gating can be investigated without running into the risk of being compensated or obscured by the noise from the experimental setup. For these experimental settings (Figure 4a), we calculate the Euclidean distance of the median (Equation 15) for different ${N}_{\mathrm{ch}}$. For ${N}_{\mathrm{c}\mathrm{h}}<500$ (gray shaded area in Figure 4a), the Euclidean error of both algorithms is roughly the same. On the single parameter level (Figure 4b), this can be seen as an onset of correlated deviations from the true value for both algorithms. Each marginal posterior has for each ${N}_{\mathrm{ch}}$ a similar deviation in magnitude and direction. That is in particular true for ${\stackrel{~}{k}}_{32}$ and ${\stackrel{~}{K}}_{32}$ which dominate Equation 15 . In spite of the correlation in direction of the errors of ${\stackrel{~}{k}}_{21}$ and ${\stackrel{~}{K}}_{21}$ their magnitude is still smaller for the KF. In summary, this indicates that in this regime the approximations to the involved multinomial distributions fail in a similar manner for both algorithms. That implies that treating the autocorrelation of the gating and binding becomes similar important compared to the error induced by normal approximations (which are used by the KF and the RE approach). For larger ${N}_{\mathrm{ch}}$, the Euclidean error of the RE is on average 1.6 times larger than the corresponding error of the posterior mode of the KF, which we deduce by fitting the function $\mathrm{error}({N}_{\mathrm{ch}})=\frac{a}{\sqrt{{N}_{\mathrm{ch}}}}$. On the one hand, both algorithms are better in approaching the true values than with patchclamp data alone. On the other hand, the smaller error ratio means, that adding a second observable constrains the posterior, such that much of the overfitting is prevented for the RE approach. By overfitting, we define the adaptation of any inference algorithm to the specific details of the used data set due to experimental and intrinsic noise which is aggravated if too complex kinetic schemes are used. Similarly, (Milescu et al., 2005) showed that the over fitting tendency of the RE can be reduced if the autocorrelation of the data is eliminated. The dotted green curve derives from PC data. The Euclidean error is roughly an order of magnitude larger for ${N}_{\mathrm{c}\mathrm{h}}>2000$. Thus, in this regime the cPCF data set is equivalent to 10^{2} fold more time traces or 10^{2} more ${N}_{\mathrm{ch}}$ in a similar PC data set. For ${N}_{\mathrm{c}\mathrm{h}}<2000$ only cPCF establishes parameter identifiability (given a data set of 10 ligand concentrations and no other prior information). In Figure 4b(16), we demonstrate the 0.95HDCI (Equation 14) of all parameters and their modes vs. ${N}_{\mathrm{ch}}$. Even though the Bayesian filter and the RE approach are both consistent estimators, the RE approach covers the true values with its 0.95HDCI only occasionally. The modeling assumption of the RE approach of treating each data point as if it does not come from a Markov process but from an individual draw from a multinomial distribution with deterministically evolving mean and variance makes the parameter estimates overly confident (Figure 4b(16)) . A likely explanation can be found by analyzing the extreme case where data points are sampled at high frequency relative to the time scales of the channel dynamics. The RE approach treats each data point as a new draw from Equation 67 while in reality the ion channel ensemble had no time to evolve into a new state. In contrast, the KF updates its information about the ensemble state after incorporating the current data point and then predicts from this updated information the generalised multinomial distribution of the next data point. For ${N}_{\mathrm{c}\mathrm{h}}>200$, the marginal posterior of the KF usually contains the true value. Nevertheless, one might depict a bias in both algorithms, in particular (Figure 4b 2,4) for ${\stackrel{~}{k}}_{32}$ and ${\stackrel{~}{K}}_{32}$ for $N}_{\text{ch}}<2\cdot {10}^{3$, similar to the findings of Moffatt, 2007 . A proper investigation of bias can be found in Figure 11 and 12 and in the Appendix. Notably, with the more realistic higher experimental noise level, in those tests the bias is hardly or not all detectable (consider the unfiltered or infinitely fast integrated data). A plausible explanation is that the bias only occurs (Figure 4 2,4) because the data are that perfect that the discrete nature of the ensemble dynamics is almost visually detectable, thus deviating from to the modeling assumption of multivariate normal distributions.
To investigate the six onedimensional 0.95HDCIs simultaneously, we declare the analysis of a data set as successful if all 0.95HDCIs include the true values. Otherwise we define it as a failure. This enables to determine the true probability at which the probability mass of the KF and the RE approach covers the true values in a binomial setting. The left blue vertical line in Figure 4c indicates $p={0.95}^{6}\approx 0.735$ which is the lower limit and which would be the true success probability for an ideal model whose six 0.95HDCIs are drawn from $y\sim \mathrm{binomial}(0.95,6)$. This is the probability of getting 6 successes in 6 trials. The right blue vertical line equals $p=0.95$, signifying the upper limit obtained by treating the six $0.95$ HDCIs as being drawn from $y\sim \mathrm{binomial}(0.95,1)$ each, which is a rather loose approximation. All marginal distributions are computed from the same highdimensional posterior which is formed by one data set for each trial. Thus, the six $0.95$ HDCIs $y\sim \mathrm{binomial}(0.95,1)$ must have success rates between those two extremes if the algorithm creates an accurate posterior. We next combine the binomial likelihood with the conjugated beta prior (Hines et al., 2014) for mathematical convenience. On this occasion, for the sake of the argument, $\mathrm{beta}(1,1)$ seems sufficient. A $\mathrm{beta}(1,1)$ prior is a uniform prior on the open interval $(0,1)$. The estimated true success rate of the RE approach (blue) is $\approx 0.15$ and therefore far away from the success probability an algorithm should have when it is based on an exact likelihood of the data. In contrast, the posterior (green) of the true success probability of the KF resides with a large probability mass between the lower and upper limit of the success probability of an optimal algorithm (given the correct kinetic scheme). As both algorithms use the same prior distribution, the different performance is not induced by the prior.
Exploiting six onedimensional posterior distributions does not necessarily answer whether the posterior is accurate in 6 dimensions but we can refine the used binomial setting. In Figure 4d $\mathbb{P}({\stackrel{~}{k}}_{32},{\stackrel{~}{K}}_{43})$, we see that 2D marginal distributions can, due to their additional degree of freedom, twist around the true value without covering it with HDCV (Equation 14) of reasonable size while simultaneously the two 1–D marginal distribution do cover it with a reasonable HDCI. In general, the KF posterior distribution has its mode much closer to the true value for various parameter combinations and it seems that the posterior is approximately multivariate normal. Further, we recognize that the probability mass of the reasonably sized HDCV of the KF posterior includes the true values whereas the HDCV from the RE does not. In 6 dimensions we lack visual representations of the posterior. Since we showed that both algorithms are consistent for a given identifiable model, we are looking for a way to ask whether the posterior is accurate (has the posterior distribution the right shape). We can answer that question by asking, how much probability mass around the mode (or around multiple modes) needs to be counted to construct a HDCV Equation 14 which includes the true values. Then we can ask for ${N}_{\mathrm{set}}$ data sets how often did we find the true values inside a volume $V(P)$ of a specific probability mass $P$ of the posterior distribution
An algorithm which estimates the parameters of the true process should fulfill this property simultaneously to being consistent. Otherwise credibility volumes or confidence volumes are meaningless. Noteworthy, that this is a empirical test of how sufficient the Bayesian filter and the RE approach hold frequentist coverage property of their HDCVs (Rubin and Schenker, 1986). We explain (Appendix 8) in detail how to quantify the overall shape and $n$dimensional posterior and comment on its geometrical meaning. One way is to use an analytical approximation via the cumulative Chisquared distribution (Figure 5a and b), The other way is to count the probability mass of $n$dimensional histogram bins starting with the highest value until the first bin includes the true values (Figure 5b).
Knowing how much volume/probability mass is needed to include the true rate matrix allows us to test whether all HDCVs constructed from the two probability distributions match the binomial distributions of the ideal model. For each data set and for each HDCV of a fixed probability mass, there are two possible outcomes: The true rate matrix is inside or outside of that volume. For a chosen HDCV with a fixed probability volume, as indicated by a gray space in Figure 5b , we count how many times the true matrix is included in the volume of that probability mass for each trail in a fixed amount of trials. Since the success is binomially distributed, we plot the expected mean of a perfect model $E[y]={N}_{\mathrm{trials}}{P}_{\mathrm{true}}$ and binomial quantiles and compare them with the success rate found in our test runs (Figure 5c) for both algorithms with both methods to determine the posterior shape. The posterior of the KF distributes the probability mass in a consistent manner such that each volume includes the true rate matrix within the quantile range. In contrast, the RE approach fails for all data sets for all HDCVs (from 0 – 0.95 probability mass) and does not include the true values in one single case. Note, that all the binomial trials for each HDCV are made from the same set of data sets which explains the correlated deviation from the mean. For lower but realistic signal to noise ratios, where the fit quality decreases, for example by producing larger errors/wider posterior distributions (Figure 6a), the statistics of the HDCV from the RE approach improve but are still outperformed by the KF. In particular, in our tested case of realistic experimental noise we never find the true values within a 0.65HDCV if the data are analyzed with a RE approach. Even for the highest noise level (Figure 6b), the probability mass of the KF posterior needed to include the true rate matrix remains almost always smaller then the posterior mass of the RE approach. That means that the posterior mass of the KF is much closer to the true value distributed than the posterior mass of the RE. With the KF we find the true rate matrix for one data set in small volume $P<0.05$ around the mode. To achieve the same with the RE approach we need at least a probability mass of 0.3.
In the inset of Figure 6a and b we do not observe a trend, thus no indication that the RE approach has a better performance for large values ${N}_{\mathrm{ch}}$ in this regard. This challenges the common argument that the RE approach should be equivalent to the KF for large ${N}_{\mathrm{ch}}$ because the ratio of mean signal vs. the intrinsic binding and gating noise is so large. Thus, including the autocorrelation into the analysis is important even for unrealistic large ${N}_{\mathrm{ch}}$. One possible explanation is model a signaltonoise ratio which scales $\propto {N}_{\mathrm{ch}}$. From the multinomial distribution both algorithms inherit mean signals which scale $\propto {N}_{\mathrm{ch}}$ and variances which scale in the terms dominating for large ${N}_{\mathrm{ch}}$ similarly with $\propto {N}_{\mathrm{ch}}$. Thus, identical to the real signal, both algorithms model the scaling of the signaltonoise ratio $\propto \sqrt{N}$. It is plausible, that both algorithms remain sensitive for the occurrence of autocorrelation of the noise even for largest signaltonoise ratios. In Figure 5c we compare the Euclidean error of the pessimistic high white noise case with an overoptimistic low noise case. We see, that when increasing ${N}_{\mathrm{ch}}$ there is a regime where the Euclidean error increases faster than ${\sqrt{{N}_{\mathrm{ch}}}}^{1}$ which we indicate with a coarse approximate fit $\propto {N}_{\mathrm{ch}}^{1}$. In that regime two effects happen simultaneously. First, the mean and the intrinsic fluctuations of the signal become more and more dominant over the experimental noise. Second, the standard deviation of intrinsic fluctuations becomes smaller relative to the mean signal. We speculate, that this produces together a learning rate which is faster than the usual asymptotic learning rate ${\sqrt{{N}_{\mathrm{ch}}}}^{1}$ of a regular model but relaxes asymptotically towards ${\sqrt{{N}_{\mathrm{ch}}}}^{1}$.
Statistical properties of both algorithms for more complex models
We have seen in Figure 3c and Figure 4a that the RE and the KF algorithm are consistent estimators, while their error ratio (Figure 7a) seems to have no trend to approach 1 with increasing ${N}_{\mathrm{ch}}$. Adding a second observable increases parameter accuracy and adds identifiability for both algorithms since less aspects of the dynamics need to be statistically inferred (Figure 4a). Furthermore, the second observable takes away much of the tendency (compare Figure 4b 1 – 6 with 8) of the RE approach to overinterpret (overfit) which leads to a shrinking of the error ratio $5.6\pm 1.4$ for PC data to smaller values for cPCF data (Figure 7a) (red) which are on average still bigger than one, while the Euclidean error is reduced (Figure 4a). If we then keep the amount and quality of the PC/cPCF data but increase the complexity of the model which produced the data (Figure 7b and d) from a fourstate to a fivestate model (see kinetic schemes above Figure 7a–c), we see that for cPCF data the error ratio stays roughly the same (difference between Figure 7a and b). For PC data instead both algorithms deliver an unidentified k_{21} for ${N}_{\mathrm{ch}}\leqq 2\cdot {10}^{3}$ (defined as an improper posterior). For larger ${N}_{\mathrm{ch}}$ the KF always identifies all parameters while the RE fails at ${N}_{\mathrm{ch}}\in \{7000,2000,75000\}$ to identify k_{54}. Thus, the KF reduces the risk of unidentified parameters. To calculate the mean error ratio, we exclude the values were some of the parameters are unidentified in total that still amounts to $6.8\pm 2.7$ thus the advantage of the KF (given all parameters are identified) might increase with model complexity for PC data. The lowest Euclidean error for this kinetic scheme has cPCF data analyzed with the KF. (Figure 7d). A 6state1openstates model with cPCF data has again an error ratio of the the usual scale (Figure 7c). As expected, the Euclidean error continuously increases with model complexity (Figure 7d and e). For PC data of the 6state1openstates model even the likelihood of the KF is that weak (Figure 7e) that it delivers unidentified parameters even for and we can detect heavy tailed distributions up until . Using RE on PC data alone does not lead to parameter identification, thus no error ratio can be calculated.
Consistent with our findings, fluorescence data itself, should lower the advantage of the KF compared to PC data simply by signaltonoise arguments. The stochastic aspect of the ligand binding is usually more dominated by the noise of Photon counting and background noise than the stochastic gating is dominated in current data by experimental noise. In terms of uncertainty quantification the advantage of the KF with cPCF varies with the model complexity (see, Appendix 9).
Besides analyzing what causes the changes in the Euclidean error (Figure 7a and b) at the single parameter, we now investigate whether the posterior is a proper representation of uncertainty. Thus, we look back at the HDCIs. The HDCIs of the 4state1openstate (Figure 8) of the PC data from Figure 3 reveal an exacerbated overconfidence problem of the RE approach (blue) compared to cPCFdata (Figures 4b1—6). This, underlines our conclusion of Figures 5 and 6 that the Bayesian posterior sampled by the RE approach is misshaped. As a consequence a confidence volume derived from the curvature at the ML estimate of the RE algorithm understates parameter uncertainty. A possible way for ML methods to derive correct uncertainty quantification is by using bootstrapping data methods (Joshi et al., 2006). Furthermore, the error ratios of each single parameter from its true value in the last column ${\stackrel{~}{k}}_{43}$${\stackrel{~}{K}}_{43}$ strongly increased their magnitudes (insets Figure 8). Even error ratios of $5\cdot {10}^{2}$ are possible. Note, that the way we defined Equation 15 suppresses the influence of the smaller parameter errors in the overall error ratio. Thus the advantage (error ratio) of the KF over RE approach for a single parameter can be much larger or lower compared to the error ratio derived from the Euclidean error if the respective parameter is contributing less to the Euclidean error. The posterior of the KF (green) seems to be unbiased after the transition into the regime $N}_{\mathrm{c}\mathrm{h}}>2\cdot {10}^{3$ where all parameters are identified. Similarly, for the RE algorithm there is no obvious bias in the inference. If we use the RE algorithm and change from the fourstate to the fivestate model (PC data from Figure 7b), bias occurs (Figure 9) in many inferred parameters, even for the highest ${N}_{\mathrm{ch}}$ investigated. Milescu et al., 2005 showed that one or the reason of the biased inference of the RE approach is its ignorance of autocorrelation of the intrinsic noise. We add here that the bias problem clearly aggravates with an increased model complexity. It is even present in unrealistically large patches which in principle could be generated by summing up 10^{2} time traces with ${N}_{\mathrm{ch}}={10}^{3}$. In contrast, the KF algorithm reveals that its parameter inference is either unbiased or at least much less biased in the displayed ${N}_{\mathrm{ch}}$ regime. Furthermore, for both algorithms the position of the HDCI relative to the true value is for some parameters highly correlated, which corresponds to the correlation between optima of the ML method of Milescu et al., 2005 ; Moffatt, 2007.
As a side note, unbiased parameter estimates are a highly desirable feature of an inference algorithm. For example, with a bias in the inference, repeated experiments do not lead to the true value if the arithmetic mean of the parameter inferences is taken. With bias even bootstrapping methods fail to produce reliable uncertainty quantification. Due to the variation of the data the k_{54} parameter is either identified in some neighbourhood of the true value or complete unidentified (Figure 9), if the RE algorithm is used. The unidentified k_{54} occurs even at highquality data such as ${N}_{\mathrm{ch}}=7.5\cdot {10}^{4}$. Only because of the nonphysical prior (Figure 9) of k_{54} induced by the limits of the sampling box of the sampling algorithm the posterior appears to be proper but is in fact either unidentified or or more than two orders of magnitude away from the true value. For the same data using the KF did not result in any unidentified parameters. Note, that comparable inference pathologies such as multimodal distributions of inferred parameter were also reported for the maximum likelihood RE algorithm for low quality PC data or too simple stimulation protocols (Milescu et al., 2005).
In conclusion, the two different perspectives on parameter uncertainty: On the one hand distributions of ML estimates due to the random data (Milescu et al., 2005 ; Moffatt, 2007) and the Bayesian posterior distribution loose their tightly linked (and necessary) connection if the RE algorithm is used. Thus, KF robustifies also ML inferences of the rate matrix. Our findings are consistent with the findings for gene regulatory networks (Gillespie and Golightly, 2012) which show that RE approaches deliver a too narrow posterior in contrast to stochastic approximations which deliver an acceptable posterior compared to the true posterior (defined by a particle filter algorithm). On the data side of the inference problem adding cPCF data eliminates the bias, reduces the variance of the position of the HDCI and eliminates unidentified parameters (Appendix 9—figures 1 and 2) for both investigated algorithms. This advantage increases with modelcomplexity.
For the fivestate and sixstate model, we applied microscopicreversibility (Colquhoun et al., 2004). We enforced it by hierarchical prior distribution (Materials and methods Equation 60) whose parameters can be chosen such that they allow only arbitrarily small violations of microscopicreversibility. But the prior distribution can also be used to enforce some softer regularization around microscopicreversibility. Thus, we can transfer the usually strictly applied algebraic constraint (Salari et al., 2018) of microscopicreversibility to a constraint with scalable softness. In that way we can model the lack of information if microscopicreversibility is exactly fulfilled (Colquhoun et al., 2004) by the given ion channel instead of enforcing the strict constraint upon the model.
Prior critique and model complexity
In the Bayesian framework, the likelihood of the data and the prior generate the posterior. Thus, the performance of both algorithms can be influenced by appropriate prior distributions. We used a uniform prior over the rate matrix which is not optimal. Note, that uniform priors are widely used by several reasons. They appear to be unbiased, and are assumed to be a ‘no prior’ option (which they are not). This is true for location parameters like mean values. In contrast, for other parameters, such as scaling parameters like rates or variances, a uniform prior adds bias to the inference towards faster rates (Zwickl and Holder, 2004). We suspect, that for the PC data even in the simplest model discussed here the lower data quality limit below which we detected unidentified parameters (improper posteriors) is caused by the uniform prior. This lower limit for the KF also increases with the complexity of the model from $N}_{\mathrm{c}\mathrm{h}}<2\cdot {10}^{3$ for the fouestate model till ${N}_{\mathrm{ch}}\leqq 2\cdot {10}^{4}$ for 6state1openstate model. Note, that it is hardly possible to fit the 6state1openstate model with the RE approach for the same amount of PC data. We observe cPCF data eases this problem because the likelihood becomes more concentrated for all parameters. The likelihood dominates the uniform prior. Nevertheless, for most parts of the paper we used a uniform prior over the rates and equilibrium constants to be comparable with the usual default method: a plain ML which influences our results in data regimes in which the data is not strong enough to dominate the bias from the uniform prior. Thus, both algorithms perform better with smarter informative or at least unbiased prior choices for the rate matrix.
In principle, to rule out an influence of the prior, unbiased priors should be used for the rates. The standard concept for unbiased least informative priors is to construct a Jeffreys prior Jeffreys, 1946 for the rate matrix which is, however, beyond the scope of the paper.
The influence of the brightness of the ligands of cPCF data on the inference
To evaluate the advantage of cPCF data Biskup et al., 2007 with respect to PC data only (Figure 10), we compare different types of ligands: Idealized ligands with brightness ${\lambda}_{\text{b}}$, emitting light only when bound to the channels, ‘real’ ligands which also produce background fluorescence when diffusing in the bath solution (Appendix 5) and current data alone. For datasets including fluorescence, the increased precision for the dissociation rate of the first ligand, ${k}_{2,1}$, is that strong that the variance of the posterior $\mathbb{P}({k}_{2,1},{k}_{3,2})$ nearly vanishes in the combined plot with the current data (nearly all probability mass is concentrated in a single point in Figure 10a). The effect on the error of the equilibrium constants ${K}_{i}$ is less strong. Additionally, the bias is reduced and even the estimation of ${N}_{\mathrm{ch}}$ is improved. The brighter the ligands are, the more the posterior of the rates decorrelates, in particular $\mathbb{P}({k}_{2,1},{k}_{3,2})$ (Figure 10a). All median estimates of nine different cPCF data sets (Figure 10b) differ by less than a factor 1.1 from the true parameter except ${k}_{3,2}$, which does not profit as much from the fluorescence data as ${k}_{2,1}$ (Figure 10c). The 95th percentiles, l_{95} of $\mathbb{P}({k}_{2,1})$ and $\mathbb{P}({K}_{1})$ follow ${l}_{95}\sim 1/\sqrt{{\lambda}_{\mathrm{b}}}$. Thus, with increasing magnitude of ligand brightness λ, the estimation of ${k}_{2,1}$ becomes increasingly better compared to that of ${k}_{3,2}$ (Figure 10c). The posterior of the binding and unbinding rates of the first ligand contracts with increasing ${\lambda}_{\mathrm{b}}$. The l_{95} percentiles of other parameters exhibit a weaker dependency on the brightness (${l}_{95}\sim {\lambda}^{0.1}$). For ${\lambda}_{\mathrm{b}}=0.01$ photons per bound ligand and frame, which corresponds to a maximum mean signal of 20 photons per frame, the normal approximation to the Poisson noise hardly captures the asymmetry of photon counting noise included in the time traces. Nevertheless, l_{95} decreases about ten times when cPCF data are used (Figure 10c). The estimated variance of
with the mean predicted signal $\mathbf{H}E[\mathbf{n}({t}_{i})]$, for PC or cPCF data is ${\sigma}^{2}({r}_{i})\approx 1$ (Figure 10d) which means that the modeling predicts the stochastic process correctly up to the variance of the signal. Note that the mean value and covariance of the signal and the state form sufficient statistics of the process, since all involved distributions are approximately multivariate normal. The fat tails and skewness of $\mathbb{P}({k}_{21})$ and $\mathbb{P}({k}_{12})$ arises because the true model is too flexible for current data without further prior information. The KF allows to determine the variance (Figure 10e) of the openchannel current noise for ${\sigma}_{\mathrm{op}}=0.1i$. Adding fluorescence data has roughly the same effect on the estimation of ${\sigma}_{\mathrm{op}}$ like using five times more ion channels to estimate ${\sigma}_{\text{op}}^{2}$.
Sensitivity towards filtering before the analogtodigital conversion of the signal
On the one side, every analog signal to be digitized needs analog filtering for antialiasing according to the Nyquist theorem. On the other side, every analog filter does not only suppress unwanted white noise but also distorts the dynamics (Figure 11a) of the signal of interest (Silberberg and Magleby, 1993). Therefore, (Qin et al., 2000) recommend to avoid analog filtering as much as possible in singlechannel analysis and let the HMM analyze the data in the rawest available form, even with simultaneous drift correction (Sgouralis and Pressé, 2017a). One can also expect that analog filtering of a macroscopic signal is harmful for the inference of the KF and the RE approach. For the CCCO model considered herein we investigated the mean behavior (accuracy and precision) of the posterior of both algorithms with seven data sets (simulated at 100 kHz to mimic an analog signal). A digital fourthorder Bessel filter (Virtanen et al., 2020) was then applied. The maximum analysing frequency f_{ana} of the KF used is $100400$ Hz to be comparable to cPCF setups. The slower frequency at which the Bayesian filter analyzes the data is necessary because the applied Bessel filter has caused additional time correlations in the originally white noise of the signal. Thus, an alldatapoints fit would immediately violate the white noise assumption of Equation 4 which we restore by analyzing at a much lower frequency. We then let the time scales of the induced time correlations become larger and larger by decreasing f_{cut}. Physically, the absolute cutoff frequency f_{cut} is irrelevant; what matters is the magnitude of f_{cut} relative to f_{ana} and to the eigenvalues ${\alpha}_{i}$ of the ensemble (see, Appendix 3), since the eigenvalues determine the time evolution of the mean ensemble state, the autocorrelation, and Fourier spectrum of the fluctuations around the equilibrium distribution (Colquhoun et al., 1997b). The eigenvalues depend on the ligand concentration such that for a fourstate model for each ligand concentration there are three relevant time scales $1/{\alpha}_{i}$ (where $i=2,3,4$) plus the equilibrium solution which satisfies ${\alpha}_{1}=0$. For 10 different time series $3\cdot 10+3$ the outcome is to have different values of ${\alpha}_{i}$.Each eigenvalue is the inverse of the time constant of an exponential decay (see, Appendix 3). For this reason, we normalize in the following (Figure 11) the cutoff frequencies by ${\alpha}_{2}$ at the highest ligand concentration. We analyze the arithmetic mean from 7 different data sets of the median of the posterior of the rate matrix. The mean Euclidean error of the median (Figure 11b) and a series of quantiles demonstrate that overall the error of the mean median of the posterior KF (green) is smaller than that obtained by the RE. For unfiltered data, the accuracy of the mean median of the KF is increased by $\approx 1.6$. Based on the Euclidean error both algorithms benefit slightly from careful analog filtering for ${f}_{\mathrm{cut}}/{\alpha}_{2}\ge 1$ while the offset remains rather constant. A strong negative effect of analog filtering starts for both algorithms around ${f}_{\mathrm{cut}}\approx 1\mathrm{kHZ}$. This is induced by ${f}_{\mathrm{cut}}\to {f}_{\mathrm{ana}}$ (see, Appendix 10). In contrast, based on the level of each individual parameter of the rate matrix (Figure 11c 1–6) the bias induced by analog filtering immediately starts with ${f}_{\mathrm{cut}}=70\mathrm{kHz}$ (Figure 11c 1–3). Note, that visual inspection of the signal (Figure 11a) does not reveal signal distortions ${f}_{\mathrm{cut}}\ge 10\mathrm{kHz}$ though they are detected by both algorithms. For unfiltered data, the maximum of the posterior for the RE approach is a biased estimate $E[{\theta}_{\mathrm{ME}}]\ne {\theta}_{\mathrm{true}}$ for at least the parameters ${\stackrel{~}{k}}_{21},{\stackrel{~}{K}}_{21},{\stackrel{~}{K}}_{32}$ of the true value ${\theta}_{\mathrm{true}}$, which is explained (Milescu et al., 2005) by the fact that RE approaches ignore the autocorrelation of the intrinsic noise. Additionally, the data indicate that for ${\stackrel{~}{K}}_{43}$ the maximum of the posterior is even for the KF a biased estimate which we interpret as limitations induced by the fact that the mean vector and covariancematrix do not constitute sufficient statistics as soon as Poisson distributed photon counting or openchannel noise blurs the signal. For the RE approach, the additional bias induced by the analog filter on the mean maximum of all parameters of the posterior starts with ${f}_{\mathrm{cut}}\approx 70$ kHz or, in other words, at the fastest time scale in the whole data set. The total bias in the estimate is reduced for k_{21} with the additional bias from the analog filtering but increased for k_{32} which for the Euclidean error leads at first to a small increase in accuracy. The KF is more robust towards analog filtering, as the results alter less with f_{cut} (given a reasonable f_{cut}), and less biased for unfiltered data in the estimates of these parameters. On the one hand, the Euclidean error shrinks for ${f}_{\mathrm{c}\mathrm{u}\mathrm{t}}>10$ kHz (Figure 11b). On the other hand, on the singleparameter level (Figure 11c 1–6), the parameter estimates pick up bias due the analog filtering even for high filter frequencies, in particular for the RE approach. Only for k_{43} the KF is more biased than the RE approach.

Figure 11—source data 1
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data1v2.zip

Figure 11—source data 2
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data2v2.zip

Figure 11—source data 3
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data3v2.zip

Figure 11—source data 4
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data4v2.zip

Figure 11—source data 5
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data5v2.zip

Figure 11—source data 6
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data6v2.zip
The KF is the unique minimal variance Bayesian filter for a linear Gaussian process (Anderson and Moore, 2012) which means, given that the assumptions of the KF are fulfilled by the true process of interest, the KF is mathematically proven the best modelbased filter to apply. Consequently, analog filtering does not provide an advantage unless it removes specific high frequency external noise sources (colored noise). We demonstrate (Appendix 10) this for PC data and varied f_{cut} and f_{ana}. On the downside, increasing f_{ana} makes the results of both algorithms more fragile if ${f}_{\mathrm{cut}}\gg {f}_{\mathrm{ana}}$ does not hold. Thus, the critical edge in Figure 11b is indeed induced by f_{cut} approaching f_{ana}. This suggests that the white noise assumption of both algorithms is violated. On the upside, if ${f}_{\mathrm{cut}}\gg {f}_{\mathrm{ana}}$ is given, the KF with an order of magnitude higher f_{ana} has a reduced bias of up to 20% for ${f}_{\mathrm{cut}}\to \mathrm{\infty}$ for individual parameters compared to the KF with lower f_{ana}. Additionally, a higher f_{ana} reduces the variance. To reduce the bias of parameter estimates to a minimum, the experimental design offers two remedies, either doing cPCF experiments with additional discussed advantages or using the KF at a high f_{ana} with even much higher f_{cut}.
By theoretical grounds a further argument for doing less analog filtering is that this benchmark analyzes data of a finite state Markov process, which is a coarse proxy for the true process. In reality, relaxation of a protein is a highdimensional continuousstate Markov process with infinitely many relaxation time scales (eigenvalues) (Frauenfelder et al., 1991) which, however, might be grouped in slower experimentally accessible and nonaccessible faster time scales (Noé et al., 2013). With larger data sets of higher quality from better experiments, the faster time scales might become accessible if they are not distorted by analog filtering. In conclusion, deciding on a specific kinetic scheme and inferring its parameters means finding a model which accommodates in the best way to the set of observed eigenvalues. Analog filtering hampers the RE, KF or HMM forwardbackward algorithm (Qin et al., 2000) to correctly describe the faster time scales.
Error due to finite integration time of fluorescence data
So far, we idealized the fluorescence data integration time as being instantaneously relative to the time scales of ensemble dynamics. In real experiments, the fluorescence signal of cPCF data has orders of magnitude longer minimal integration time ${T}_{\mathrm{int}}$ (time to record all voxels of a frame) or maximal integration frequency ${f}_{\mathrm{int}}=1/{T}_{\mathrm{int}}$, than the possible sampling frequency of current recordings. We mimic the finite integration time
by summing with a sliding window the 100 kHz signal including the white noise to obtain data at an effectively lower sampling frequency (Figure 12a). Additionally we set the Bessel filter for the current data to ${f}_{\mathrm{cut}}/{\alpha}_{2}=4.59$ or ${f}_{\mathrm{cut}}=90$ kHz. The fastest used analysing frequency is ${f}_{\mathrm{ana}}=500\mathrm{Hz}$. We scale mean photo brightness ${\lambda}_{\mathrm{b}}$ and background noise down such that the signaltonoise ratio of the lower integration frequency data is the same as of the highfrequency data ${\lambda}_{\mathrm{b}}/{T}_{\mathrm{int}}=const$ . We do that in order to separate the bias from the finite integration time from other effects such as a better signal to noise ratios for each integrated point. Note that we only analyzed the plot until ${f}_{\mathrm{int}}={f}_{\mathrm{ana}}$. Both algorithms incur very similar bias due to the finite integration time (Figure 12b). The KF (green) is more precise for high integration frequencies ${f}_{\mathrm{cut}}/{\alpha}_{2}$ until ${f}_{\mathrm{cut}}/{\alpha}_{2}\approx 0.08$ then the RE approach becomes more robust. Similar to Besselfiltered current data (Figure 11b) on the single parameter level the systematic deviations start early for example ${f}_{\mathrm{int}}=10$ kHz for K_{21} (Figure 12c4). Possibly the systematic deviations start (Figure 12c2) already at ${f}_{\mathrm{int}}=50$ kHz for k_{32}. The sudden increase of the Euclidean error (Figure 12b) of the mean median at ${f}_{\mathrm{cut}}/{\alpha}_{2}\approx 0.2$ occurs in this case not due to f_{int} approaching f_{cut} but due to ${f}_{\mathrm{int}}\u2a85{\alpha}_{1,2}$ for many ligand concentrations. To show this we plot the results of the fitting of five different data sets without including the highest 4 ligand concentrations (red) which means the largest eigenvalues are much smaller (Figure 12b,C16). Additionally, we keep ${f}_{\mathrm{int}}=const$. Although fluctuations of the posterior medians are higher, the KF becomes more robust. Note, that the fastest eigenvalues of these reduced data sets are indicated by the blue bars (Figure 12b and c4). Based on the Euclidean error (Figures 11a and 12a) the robustness of both algorithms against the cutoff frequency is compared with the robustness against the integration frequency found to be about an order of magnitude higher. That is related to a specific detail of the model used: the binding reaction, corresponds to the fastest time scales of the overall dynamic (difference between Figure 1b and c), which is exposed by the fluorescence signal. Thus, kinetic analysis of any data should make sure that the corresponding frequency of the most dominant timescales of the time series are much slower than the respective f_{int}f_{cut} independently of the investigated algorithms.

Figure 12—source data 1
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data1v2.zip

Figure 12—source data 2
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data2v2.zip

Figure 12—source data 3
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data3v2.zip

Figure 12—source data 4
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data4v2.zip

Figure 12—source data 5
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data5v2.zip

Figure 12—source data 6
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data6v2.zip

Figure 12—source data 7
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data7v2.zip
Conclusions
We generalized the filter equations (Methods Equation 37, 38d, 57, 58 and 59) of the KF for analyzing the gating and binding dynamics of ligandgated ion channels with a realistic signalgenerating model for isolated patchclamp (PC) and confocal patchclamp fluorometry (cPCF) data including openchannel noise, photoncounting noise and background noise. Any other type of linear kinetic scheme (e.g. for voltagedependent channels) and signal can be applied as long as the characteristics of the signal are sufficiently described by normal distributions. Our approach is derived by approximating the chemical master equation of a first order chemical reaction network (which ion channel experiments usually are) which is exact up to the second statistical moment. For firstorder chemical reaction networks, the linear noise approximation (Wallace et al., 2012) are exact up to the second moment too (Grima, 2015). Thus, we can conclude that our Bayesian filter uses a time integrated version of the linear noise approximation. To our understanding of Wallace et al., 2012 our approach is thus equivalent to approaches based on the chemical Langevin or FokkerPlanck equations (Gillespie, 2002). Consequently, this also makes the considerations of the quality of the chemical Langevin equation as an approximation (Gillespie, 2000) of the chemical master equation valid for our approach. Compared to previous attempts Moffatt, 2007, this mathematical generalization is necessary (Figure 3b) in order to use Bayesian filters on macroscopic PC or cPCF data. With our algorithm, we demonstrate (Figures 3c and 7) that the common assumption that for large ensembles of ion channels simpler deterministic modeling by RE approaches is on par with stochastic modeling, such as a KF, is wrong in terms of Euclidean error and uncertainty quantification (Figures 5a–c ,–6a–b).
Enriching the data by fluorescencebased ligand binding reveals two regimes. In one regime, the twodimensional data increase the accuracy of the parameter estimates up to $\approx 10$fold (Figure 4a and c). In the other regime of lower channel expression, enriching the data by the second observable, makes nonidentified parameters to identified parameters. The second observable in cPCF data decreases the overfitting tendency (Figure 4a, b and d) of the RE approach on the true process. Thus, in this regard the advantage of the KF becomes smaller. However, by exploiting Bayesian HDCV we gain a second perspective: We show for various signaltonoise ratios (Figures 5a–c ,–6a–b) that the posterior sampled by a RE approach never covers the true values within a reasonable HDCV. Thus, the central feature of Bayesian statistics, exact uncertainty quantification by having the full posterior, is meaningless in combination with an RE approach (considering the type of data and set of signaltonoise ratios that we tested). This even holds true for very pessimistic signaltonoise assumptions Figure 6b. If HDCVs based on an RE approach cannot be trusted, the same applies to confidence volumes based on the curvature of the likelihood. This is not the case for the KF which delivers properly shaped posteriors (Figures 6a–c ,–5a–c). Increasing the model complexity, at unchanged PC data quality (Figure 7) shows that the RE approach displays unidentified rates even for large ion channel ensembles while our approach identified all parameters for the same data. We also investigated the robustness of both algorithms against the cutoff frequency of a Bessel filter (Figure 11) and showed the overall superior robustness of the KF against errors of analog filtering compared to the RE approach. Analog filtering has its limitations due to distorting the higher frequencies of the Fourier spectrum of the signal. Thus, one should let the KF sample as fast as possible, with a cutoff frequency of at least one order of magnitude higher than the sampling frequency of the KF.
Similar to the Bessel filter, the KF is more robust than the RE approach against errors due to the finite integration time. Nevertheless, it is crucial for both algorithms (Figure 12), that the intrinsic time scales (1/eigenvalues) of the process to be analyzed are slower than the integration time of the data points. Otherwise the accuracy of the inference deteriorates.
Altogether, we demonstrated the performance of the generalized Kalman filter on ion channel data for inference of kinetic schemes. Nevertheless, our approach can approximate any other stochastic system and signal distributions of linear (pseudofirstorder) kinetics (Sorenson and Alspach, 1971). Prospective extensions of the Bayesian filter, for example by Bayesian Gaussian sum filters or similar numerically brute force concepts such as particle filters (Golightly and Wilkinson, 2011; Gillespie and Golightly, 2012), can overcome modeling errors at low ion channel numbers or low photon fluxes.
Materials and methods
We simulated state evolution $s(t)$ with either the software QuB (Nicolai and Sachs, 2014) for PC data or an inhouse Matlab routine (The code will be shared on request.) for cPCF data. The inhouse Matlab routine is an implementation of the Gillespie algorithm Gillespie Daniel T. (1977). Traces were summed up, defining the ensemble state vector $\mathbf{n}(t):={({n}_{1},{n}_{2},{n}_{3},{n}_{4})}^{\top}$, which counts the number of channels in each state. At first we used a 10 kHz sampling frequency for the Gillespie algorithm but for investigating the errors induced by analog filtering the current signal and the finite integration time for each fluorescence data point the Gillespie algorithm sampled at 100 kHz. The KF, RE, and Bayesian filter routines were implemented in Stan (Carpenter et al., 2017) with the interface package PyStan and ran on a high performance computing cluster with O(100) Broadwell and SkyLake nodes. A Tutorial for Patch clamp data can be found on the git hub page https://github.com/JanMuench/Tutorial_Patchclamp_data and for cPCF data, https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data. The cPCF data simulation code can be found here: https://cloudhsm.itdlz.de/s/QB2pQQ7ycMXEitE (Source code 1).
Methods
Hereinafter, we derive the equations for our Bayesian filter for time series analysis of hidden linear chemical reaction networks (kinetic schemes). A detailed description of the experimental noise is provided in the Appendix 5.
The relation of Bayesian statistics to the Kalman filter
Request a detailed protocolThe following conventions are generally used: Bold symbols are used for multidimensional objects such as vectors or matrices. Calligraphic letters are used for (some) vectorial time series and doublestrike letters are used for probabilities and probability densities. Within the Bayesian paradigm (Hines, 2015; Ball, 2016), each unknown quantity, including model parameters $\mathit{\theta}$ and time series of occupancies of hidden states ${N}_{T}={\{\mathbf{n}({t}_{i})\}}_{i=1}^{T}$, are treated as random variables conditioned on observed time series data $\mathcal{Y}}_{T}={\mathbf{y}({t}_{i})}_{i=1}^{T$. The prior $\mathbb{P}(\mathit{\theta})={\prod}_{j}^{{N}_{\mathrm{par}}}\mathbb{P}({\theta}_{j})$ or posterior distribution $\mathbb{P}(\mathit{\theta}{\mathcal{Y}}_{T})$ encodes the available information about the parameter values before and after analysing the data, respectively. According to the Bayesian theorem, the posterior distribution
is a probability distribution of a parameter set $\mathit{\theta}$ conditioned on $\mathcal{Y}}_{T$. The likelihood $\mathbb{L}({\mathcal{Y}}_{T}\mathit{\theta})$ encodes the distribution of the data by modelling the intrinsic fluctuations of the protein as well as noise coming from the experimental devices. The prior provides either assumptions before measuring data or what has been learnt from previous experiments about $\mathit{\theta}$. The normalization constant
ensures that the posterior is a normalized distribution. The KF is a special class of models in the family of Bayesian filters (Ghahramani, 1997), which is a generalisation of the classical KF. Due to its linear time evolution (Equation 1), the KF is particularly useful for modeling time series data of ensembles dynamics of first order chemical networks. It delivers a set of recursive algebraic equations (Materials and methods Equation 28 and Equation 32) for each time point, which allows to express the prior $\mathbb{P}(\mathbf{n}(t){\mathcal{Y}}_{t1})$ and (after incorporating $\mathbf{y}(t)$) the posterior $\mathbb{P}(\mathbf{n}(t){\mathcal{Y}}_{t})$ occupancies of hidden states $\mathbf{n}(t)$ for all $t$ given a set of parameters $\mathit{\theta}$. This means the KF solves the filtering problem (inference of ${N}_{T}$) by explicitly modeling the time evolution of $\mathbf{n}(t)$ by multivariate normal distributions. This allows us to replace $\mathbb{L}({\mathcal{Y}}_{T}\mathit{\theta})$ of Equation 20 by the expression of Equation 9.
The Bayesian framework (as demonstrated in this article) has various properties which makes it superior to ML estimation (MLE) (McElreath, 2018). Those properties are in particular useful for the analysis of biophysical data since very often the dynamics of interest are hidden or latent in the data. Models with a hidden structure are called singular. For regular (nonsingular) statistical models, maxima ${\mathit{\theta}}_{\mathrm{ML}}$ of the posterior or likelihood converge in distribution
to the true value ${\mathit{\theta}}_{\mathrm{true}}$,where ${\mathbf{F}}^{1}({\mathit{\theta}}_{\mathrm{true}})$ is the inverse Fisher information matrix. Under those conditions it is justified to derive from the curvature of the likelihood at ${\mathit{\theta}}_{\mathrm{ML}}$ via the CramerRaobound theorem
a confidence volume for the inferred parameters. In contrast, consider for example the type of data investigated in this study which probes the protein dynamics by current and light. Singularity means that the Fisher information matrix of a model is not invertible leading to the breakdown of the CramerRao Bound theorem. Due to the breakdown, it cannot be guaranteed that even in the asymptotic limit the loglikelihood function can be approximated by a quadratic form Watanabe, 2007. Thus, usually the MLE does not obey Equation 22. Consequently, the posterior distribution is usually not a normal distribution either (Watanabe, 2007). Using the full posterior distribution without further approximations detects the resulting problems such as deviation from normality or nonidentifiability of parameters, related to the singularity. In conclusion, the posterior is still a valid representation of parameter plausibility while ML fails.
Time evolution of a Markov Model for a single channel
Request a detailed protocolIn the following, we write the time $t$ as function argument rather than a subscript. Following standard approaches, we attribute to each state of the Markov model an element of a vector space with dimension $M$. At a time, a channel can only be in a single state. This implies that the set of possible states is S:=$\{(1,0,0,\mathrm{\dots}),(0,1,0,\mathrm{\dots}),\mathrm{\dots},(\mathrm{\dots},0,1)\}\subset {\{0,\mathrm{\hspace{0.17em}1}\}}^{M}$. In the following, Greek subscripts refer to different states while Latin subscripts refer to different channels. By $\mathbf{s}(t)={\mathbf{e}}_{\alpha}$ we specify that the channel is in state α at time $t$. Mathematically, ${\mathbf{e}}_{\alpha}$ stands for the αth canonical unit Cartesian vector (Table 1).
Assuming that the state transitions can be modeled by a first order Markov process, the path probability can be decomposed as the product of conditional probabilities as follows:
Markov models (MMs) and rate models are widely used for modeling molecular kinetics (Appendix 2). They provide an interpretation of the data in terms of a set of conformational states and the transition rates between these states. For exactness it remains indispensable to model the dynamics with a HMMs (Noé et al., 2013). The core of a hidden Markov model is a conventional Markov model, which is supplemented with a an additional observation model. We will therefore first focus on a conventional Markov model. Statetostate transitions can be equivalently described with a transition matrix $\mathbf{T}$ in discrete time or with a rate matrix $\mathbf{K}$ in continuous time, as follows:
where $\mathrm{exp}$ is the matrix exponential. We aim to infer the elements of the rate matrix $\mathbf{K}$, constituting a kinetic model or reaction network of the channel. Realizations of sequences of states can be produced by the DoobGillespie algorithm Gillespie Daniel T. (1977). To derive succinct equations for the stochastic dynamics of a system, it is beneficial to consider the time propagation of an ensemble of virtual system copies. This allows to ascribe a probability vector $\mathbf{p}(t)$ to the system, in which each element ${p}_{\alpha}(t)$ is the probability to find the system at $t$ in state α. One can interpret the probability vector $\mathbf{p}$ as the instantaneous expectation value of the state vector $\mathbf{s}$.
The probability vector obeys the discretetime Master equation
Time evolution of an ensemble of identical noninteracting channels
Request a detailed protocolWe model the experimentally observed system as a collection of noninteracting channels. A single channel can be modeled with a firstorder MM. The same applies to the ensemble of noninteracting channels. We focus on modeling the time course of extensive macroscopic observables such as the mean current and fluorescence signals as well as their fluctuations. A central quantity is the vector $\mathbf{n}(t)$ which is the occupancy of the channel states at time $t$:
This quantity, like $\mathbf{s}(t)$, is a random variate. Unlike $\mathbf{s}(t)$, its domain is not confined to canonical unit vectors but to $\mathbf{n}\in {\mathbb{N}}^{M}$. From the linearity of Equation 28 in the channel dimension and from the singlechannel CME Equation 27b one can immediately derive the equation for the time evolution of the mean occupancy $\overline{\mathbf{n}}(t)=E[\mathbf{n}(t)]$:
with the transition matrix $\mathbf{T}$. The full distribution $\mathbb{P}(\mathbf{n}(t+1)\mathbf{n}(t))$ is a generalized multinomial distribution. To understand the generalized multinomial distribution and how it can be constructed from the (conventional) multinomial distribution, consider the simplified case where all channels are assumed to be in the same state α. Already after one time step, the channels will have spread out over the state space. The channel distribution after one time step is parametrized by the transition probabilities in row number α of the singlechannel transition matrix $\mathbf{T}$. According to the theory of Markov models, the final distribution of channels originating from state α is the multinomial distribution
In general, the initial ensemble will not have only one but multiple occupied channel states. Because of the independence of the channels, one can imagine each initial subpopulation spreading out over the state space independently. Each subpopulation with initial state α gives rise to its own final multinomial distribution that contributes ${n}_{\beta}^{(\alpha )}$ transitions into state β to the total final distribution. The total number of channels at $t+1$ in each state can then be simply found by adding the number of channels transitioning out of the different states α.
Evidently, the total number of channels is conserved during propagation. The distribution of $\mathbf{n}(t+1)$, defined by Equations 30; 31, is called the generalized multinomial distribution:
While no simple expression exists for the generalized multinomial distribution, closed form expressions for its moments can be readily derived. For large ${N}_{\mathrm{ch}}$ each $\mathbb{P}({\mathbf{n}}^{(\alpha )}(t+1)\mid {n}_{\alpha}{\mathbf{e}}_{\alpha})$ can be approximated by a multivariatenormal distribution such that also $\mathrm{general}\mathrm{multinomial}(\mathbf{n}(t),\mathbf{T})$ has a multivariatenormal approximation. In the next section, we combine the kinetics of channel ensembles with the KF by a moment expansion of the governing equations for the ensemble probability evolution.
Moment expansion of ensemble probability evolution
Request a detailed protocolThe multinomial distribution (Fredkin and Rice, 1992) has the following mean and covariance matrix
where ${\mathbf{T}}_{:,\alpha}$ denotes the column number α of the transition matrix and $\mathrm{diag}({\mathbf{T}}_{:,\alpha})$ describes the diagonal matrix with ${\mathbf{T}}_{;,\alpha}$ on its diagonal. Combining Equation 31 with Equations 33; 34 we deduce the mean and variance of the generalized multinomial distribution:
Note that Equations 35; 36 are conditional expectations that depend on the random state $\mathbf{n}$ at the previous time $t$ and not only on the previous mean $\overline{\mathbf{n}}$. To find the absolute mean, the law of total expectation is applied to Equation 35, giving
in agreement with the simple derivation of Equation 29. We introduce a shorthand $\mathbf{P}(t):=\mathrm{c}\mathrm{o}\mathrm{v}(\mathbf{n}(t),\mathbf{n}(t))$ for the absolute covariance matrix of $\mathbf{n}(t+1)$. Similarly, $\mathbf{P}(t)$ can be found by applying the law of total variance decomposition (Weiss, 2005 to Equations 35; 36), giving
Equations 37, 38d dare compact analytical expressions for the mean and the covariance matrix of the occupancy vector $\mathbf{n}$ at $t+1$ that depend on the mean $\overline{\mathbf{n}}$ and covariance matrix $\mathbf{P}$ at the previous time step $t$. Chaining these equations for different time steps $t=0,\phantom{\rule{thinmathspace}{0ex}}\dots ,\phantom{\rule{thinmathspace}{0ex}}T$ allows to model the whole evolution of a channel ensemble. Moreover, these two equations together with the output statistics of $\mathbb{O}(\mathbf{y}\mathbf{n}(t))$ are sufficient to formulate correction equations Equation 59 of the KF (Moffatt, 2007; Anderson and Moore, 2012). These equations will be used in a Bayesian context to sample the posterior distribution of the model parameters. The sampling entails repeated numerical evaluation of the model likelihood. Therefore, analytical equations for the ensemble evolution that can be quickly evaluated on a computer millions of times are indispensable. This was achieved by deriving Equation 37, Equation 38d. Comparing Equation 38d with the KF prediction equation (Anderson and Moore, 2012) for $\mathbf{P}(t)$, we obtain the statedependent covariance matrix of Equation 3 as
In the following section on properties of measured data and the KF, we no longer need to refer to the random variate $\mathbf{n}(t)$. All subsequent equations can be formulated by only using the mean hidden state $\overline{\mathbf{n}}(t)$ and the variancecovariance matrix of the hidden state $\mathbf{P}(t)$. We therefore drop the overbar in $\overline{\mathbf{n}}(t)$ so that the symbol $\mathbf{n}(t)$ refers from now on to the mean hidden state.
Modeling simultaneous measurement of current and fluorescence
Request a detailed protocolIn the following, we develop a model for the conditional observation distribution $\mathbb{O}(\mathbf{y}\mathbf{n}(t))$ (Appendix 5 for experimental details). Together with the hidden ensemble dynamics this will enable us to derive the output statistics of the KF (see, below). Let $\mathbf{y}(t)$ be the vector of all observations at $t$. Components of the vector are the ion current and fluorescence intensity.
As outlined in the introduction part, in Equation 4 we model the observation by using a conditional probability distribution $\mathbb{O}(\mathbf{y}(t)\mathbf{n}(t))$ that only depends on the mean hidden state $\mathbf{n}(t)$, as well as on fixed channel and other measurement parameters. $\mathbb{O}(\mathbf{y}(t)\mathbf{n}(t))$ is modeled as a multivariate normal distribution with mean $\mathbf{Hn}(t)$ and variancecovariance matrix $\mathbf{\Sigma}(t)$, that can in general depend on the mean state vector $\mathbf{n}(t)$ (much like the covariance matrix of the kinetics in (Equation 38d) ). The observation matrix $\mathbf{H}\in {\mathbb{R}}^{{N}_{\mathrm{obs}}\times M}$ projects the hidden state vector $\mathbf{n}(t)$ onto $\mathbf{Hn}(t)\in {\mathbb{R}}^{{N}_{\mathrm{obs}}}$, the observation space. The observation distribution is
This measurement model is very flexible and allows to include different types of signals and error sources arising from both the molecules and the instruments. A summary of the signals and sources of measurement error and their contributions to the parameters of $\mathbb{O}(\mathbf{y}(t)\mathbf{n}(t))$ is provided by Table 2. Below we address the two types of signals and four noise sources one by one. For this, we decompose the observation matrix and the observation noise covariance matrix into the individual terms:
In the following, we report the individual matrices for the exemplary CCCO model with one open state $\alpha =4$ and three closed states $\alpha =1,2,3$. Matrices can be constructed analogously for the other models. For the definition of ${\mathbf{\Sigma}}_{\mathrm{back}}$ refer to (Appendix 5).
Macroscopic current and openchannel noise
Request a detailed protocolWe model the current and the intrinsic fluctuations of the openchannel state $\mathbf{s}={\mathbf{e}}_{4}$ (the open channel noise) by a statedependent normal distribution with mean $i{n}_{4}(t)$ where ${n}_{4}(t)$ is the number of channels in the open state at $t$ and $i$ is the singlechannel current. The additional variance of the singlechannel current is described by ${\sigma}_{\mathrm{open}}^{2}$. The sum of the instrumental noise of the experimental setup and the open channel noise is modeled as uncorrelated (white) normally distributed noise with the mean $E[{\nu}_{\mathrm{I}}(t)]=0$ and variance $E[{\nu}_{\mathrm{I}}^{2}(t)]={\sigma}_{\mathrm{op}}^{2}{n}_{4}(t)+{\sigma}_{\mathrm{m}}^{2}$. By making the openchannel noise dependent on the hidden state population ${n}_{4}(t)$, we fully take advantage of the flexibility of Bayesian filters which admits an (explicitly or implicitly) timedependent observation model. By tabulating the parameters of the two normal distributions into $\mathbf{H}$ and $\mathbf{\Sigma}$, we obtain
One can now ask for the variance of a data point $y(t)$ given the epistemic and aleatory uncertainty of $\mathbf{n}(t)$ encoded by $\mathbf{P}(t)$ in Equation 38d. By using the law of total variance the signal variance follows as:
See, Appendix 6 for further details.
Fluorescence and photoncounting noise
Request a detailed protocolThe statistics of photon counts in the fluorescence signal are described by a Poisson distribution with emission rate ${\lambda}_{\mathrm{Fl}}$
The total emission rate ${\lambda}_{\mathrm{Fl}}$ can be modeled as a weighted sum of the specific emission rates ${\lambda}_{b}$ of each ligand class $\{0,1,2\}$. The weights are given by the stoichiometric factors which reflect the number of bound ligands. In order to cast the Poisson distribution into the functional form of the observation model (Equation 41), we invoke the central limit theorem to approximate
The larger ${\lambda}_{\mathrm{Fl}}$ the better is the approximation. We assume, that the confocal volume is equally illuminated. For our model of ligand fluorescence, we assume for a moment that there is no signal coming from ligands in the bulk. We will drop this assumption in the next section. With these assumptions, we arrive at the following observation matrix
The matrix $\mathbf{H}$ aggregates the states into two conductivity classes: nonconducting and conducting and three different fluorescence classes. The first element ${(\mathbf{Hn})}_{1}$ is the mean fluorescence ${\lambda}_{\mathrm{Fl}}(t)={\lambda}_{b}[{n}_{2}(t)+2({n}_{3}(t)+{n}_{4}(t))]$. The variancecovariance matrix ${\mathbf{\Sigma}}_{\mathrm{binding}}$ can be derived along the same lines using Equation 48. We find
Under these assumptions, the observation matrix can be written as follows
Output statistics of a Kalman Filter
Request a detailed protocolwith twodimensional statedependent noiseNow simultaneously measured current and fluorescence data $\mathbf{y}\in {\mathbb{R}}^{2}$, obtained by cPCF, are modeled. Thus, the observation matrix fulfills $\mathbf{H}\in {\mathbb{R}}^{2\times M}$. One can formulate the observation distribution as
The vector ${\mathit{\nu}}_{\mathrm{m}}$ denotes the experimental noise, with $E[{\mathit{\nu}}_{\mathrm{m}}]=0$ and variance given by the diagonal matrix ${\mathbf{\Sigma}}_{\mathrm{meas}}+{\mathbf{\Sigma}}_{\mathrm{back}}$. The second noise term arises from Poissondistributed photon counting statistics and the openchannel noise. It has the properties
and
The matrix $\mathbf{\Sigma}$ is a diagonal matrix. To derive the covariance matrix $\mathrm{cov}(\mathbf{y}(t))$ we need to additionally calculate $\mathrm{var}({y}_{\mathrm{fluo}}(t))$ and $\mathrm{c}\mathrm{o}\mathrm{v}({y}_{\mathrm{f}\mathrm{l}\mathrm{u}\mathrm{o}}(t),{y}_{\mathrm{p}\mathrm{a}\mathrm{t}\mathrm{c}\mathrm{h}}(t))$. By the same arguments as above we get
The cross terms can be calculated by using the law of total covariance
yielding the matrix
We assumed that the Poisson distribution is well captured by the normal approximation. In cPCF data, the ligand binding to only a subensemble of the channels is monitored, which we assume to represent the conducting ensemble such that ${N}_{\mathrm{ch},\mathrm{FL}}={N}_{\mathrm{ch},\mathrm{I}}$. For real data, further refinement might be necessary to model the randomness of the subensemble in the summed voxels. With the time evolution equations for the mean (Equation 35) and for the covariance matrix Equation 38d as well as with the expressions for the signal variance, we possess all parameters that are needed in the correction equation of the (Kalman, 1960; Anderson and Moore, 2012).
The correction step
Request a detailed protocolFor completeness we write down the correction step (Bayesian update) of the KF, although its derivation can be found in Chen, 2003; Anderson and Moore, 2012; Moffatt, 2007. The mean ensemble state $\mathbf{n}(t)$ is corrected by the current data point
where Kalman gain matrix ${\mathbf{K}}_{\mathrm{Kal}}:=\mathbf{P}{(t)}_{\mathrm{prior}}{\mathbf{H}}^{\top}{\mathbf{\Sigma}}^{1}$ evaluates the intrinsic noise against the experimental noise. How precise are my model predictions about $\mathbf{n}(t)$ compared with the information gained about $\mathbf{n}(t)$ by measuring $\mathbf{y}(t)$. The covariance $\mathbf{P}(t)$ of the ensemble state $\mathbf{n}(t)$ is corrected by
Equation 58,59, 37 and 38d form the filtering equations which summarize the algorithm. One initialises the first $\mathbf{n}(0)$ and $\mathbf{P}(0)$ and with an equilibrium assumption.
Microscopicreversibility as a hierarchical prior
Request a detailed protocolWe applied microscopicreversibility (Colquhoun et al., 2004) by a hierarchical prior distribution. Usually, microreversibility is strictly enforced by setting the product of the rates of the clockwise loop ${k}_{1},{k}_{2},{k}_{3}{k}_{4}$ equal to the anticlockwise loop ${k}_{5},{k}_{6},{k}_{7},{k}_{8}$ and then solving for the desired rate parameter to be replaced. This means that the classical approach can be described by drawing the resulting rate from a Dirac delta distribution prior with
Following Equation 60, we can model microscopicreversibility with any hierarchical prior distribution whose limit for a vanishing variance is Equation 60. For mathematical convenience, we defined the hierarchical prior by a sharply peaking beta distribution
and by rescaling and adding an offset
we derived a conditional prior which allows at maximum a ±0.005 relative deviation from the strict microscopicreversibility. The ±0.005 microreversibility constraint is applied in (Figure 7b–d). In this way, one could model or even test possible small violation of microscopicreversibility if smaller beta parameters such as $\mathrm{beta}(1,1)$ would be chosen.
Appendix 1
Efficiency of the Hamiltonian Monte Carlo sampler
Bayesian posterior sampling requires a fast and efficient sampler which requires in turn a fast evaluation of the likelihood and prior distribution. Optimally, the calculation time should increase as little as possible with the amount of data. This is why we chose the KF framework and parallelized the processing of a time trace across many CPUs. Compared to the maximization of the likelihood, sampling from the posterior requires approximately one to multiple orders of magnitude times more evaluations of the likelihood function to have a decent representation of the posterior. For example, in Moffatt, 2007 both algorithms need roughly 10^{2} to 10^{3} evaluations of the likelihood function to converge. This is only a small fraction of the number of evaluations needed here to obtain a converged estimate of the posterior (in fact the typical "warmup" phase of the sampler which we used is barely half completed with such a small number of evaluations).
The ratio of the number of used samples to the number of likelihood evaluations can serve as a simple efficiency measure. However, this ratio is a too optimistic measure as revealed by a look into the internal dynamics of Hamiltonian Monte Carlo sampler (HMC). HMC treats the current parameter sample as part of a phase space point of a (dynamic) Hamiltonian system. The force acting on the system is the derivative of the negative logarithm of the posterior density Betancourt, 2017. Each suggested sample is derived from a ballistic movement with random kinetic energy in the augmented parameter space. Calculating this movement is done by integration of the corresponding Hamiltonian equations. Therefore, one suggested sample requires many evaluations of the gradient of the log posterior distribution. How many evaluations were needed in turn depends on the shape of the posterior and the length of the ballistic movement.
Strong changes of the gradient of the posterior (high curvature of the posterior) require more integration steps. Setting all those sampler parameters requires either expert knowledge about the statistical model and sampler or a sampler which automatically learns optimal sampler parameter settings (warmup) before it starts the actual sampling of the parameters of interest. Nevertheless, HMC generates samples (Appendix 1—figure 1a) from highdimensional, correlated posteriors more efficiently than many other classical Markov chain Monte Carlo methods (Betancourt, 2017). By more efficient we mean the product of the speed of drawing one correlated sample with the amount of iterations needed for the autocorrelation function of those samples to vanish sufficiently. By construction, the samples from any Markov chain Monte Carlo method are correlated (Appendix 1—figure 1b). The quality of the adaptive NUTS HMC sampler provided by Stan can by seen in Appendix 1—figure 1b. The sample traces show a small anticorrelation whose absolute value is hardly larger than 0.05. This is achieved by letting the sampler adapt to the geometry of the posterior in the warmup phase. The samples collected during the warmup phase are discarded.
Hereinafter, we show that likelihood and sampler considerations are one aspect of computational efficiency. The other is the implementation of a parallelized algorithm. The blue curve (Appendix 1—figure 2) represents the scaling of the KF with an implementation of the algorithm which saves besides the sampled parameters the predicted mean and the covariance of the signal. This means for cPCF data that for each parameter sample $5\cdot {N}_{\mathrm{traces}}\cdot {N}_{\mathrm{data}}/\mathrm{CPU}$ more memory operations are needed on the cluster node. Note, since we assigned each ligand concentration jump and the following data points to its own CPU the following relation holds: ${N}_{\mathrm{data}}/\mathrm{CPU}={N}_{\mathrm{data}}/\mathrm{trace}$. Skipping saving those derived quantities (orange curve) creates a speedup of roughly 2 orders of magnitude. The derived quantities are redundant quantities which we suggest to calculate by feeding a small subsample of the posterior samples to the KF filter which then reanalyzes the traces by the given draws but does not redraw from the posterior. See GitHub: https://github.com/JanMuench/Tutorial_Patchclamp_data.
We compare the computation time for the KF (green) and the RE approach (blue) for the 4state model (Appendix 1—figure 3a) versus the data quality by ${N}_{\mathrm{ch}}$. The calculation time t_{calc} rises for both algorithms roughly like $\sim \sqrt{{N}_{\mathrm{ch}}}$. For each curve the likelihood evaluations stay constant for the whole plot. Thus this scaling relates to the integration time of the HMC sampler. Taking the average of the last 5 calculation times shows that the KF is about $2.7\pm 0.2$ times slower than the RE approach. The computation time for cPCF data of the 5state2open states model (Appendix 1—figure 3b) shows a similar $\sim \sqrt{{N}_{\mathrm{ch}}}$regime until it seems to become independent of ${N}_{\mathrm{ch}}$. Note, the different ${N}_{\mathrm{ch}}$ interval which is displayed.
Taking the average of the last 4 computation times shows that the KF is about $2.24\pm 0.01$ times slower than the RE approach. Surprisingly, with increasing model complexity the ratio of the computation time becomes not necessarily larger. The computation time for cPCF data of the 6state1open state model (Appendix 1—figure 3c) shows a similar $\sim \sqrt{{N}_{\mathrm{ch}}}$regime until it becomes independent of ${N}_{\mathrm{ch}}$. Taking the average of the last 4 calculation times reveals that the KF is about $3.3\pm 0.3$ times slower than the RE approach. With increasing model complexity, the ratio of the computation times becomes larger. Note, that there is little variation of the absolute calculation time of the algorithm across model complexity which is a good prospect for more complex kinetic schemes. Nevertheless, to achieve this computational speed we used in total 72 CPUs on one node, 20 CPUs per chain for 20 time traces. Thus, nodes with more then 80 CPUs would enable a better performance. Overall, the more complex likelihood calculations of the KF require 2–3 times more calculation for the tested models and cPCF data. Surprisingly, for most part of the displayed interval the KF is faster than the RE approach (Appendix 1—figure 4a) for PC data. Considering the more expensive likelihood evaluation (prediction and correction) of the KF, this result reminds that this computational time benchmark is a downstream test that integrates all aspects of the sampling. One possible explanation could be the following: We describe in the main text (Figures 4b1—6) for the RE approach, that due to the treatment of every data point as an individual draw from a multinomial distribution, HDCVs (Figures 5–6) are too narrow. This problem gets exacerbated (Figure 8) as we used 10 times more data points which we interpret as the cause of the KF being faster than the RE approach (Appendix 1—figure 4a). The narrower the posterior is the more integration steps from the sampler are needed to calculate the proposed trajectory with sufficient precision. Another example displayed (Appendix 1—figure 4b), has an increased kinetic scheme complexity. Note that we reduced the analyzing frequency 10 times to be comparable with the analyzing frequency of cPCF data. Under this condition, the intuition from likelihood calculation complexity is confirmed that the KF is slower. Taking the average of the last 4 calculation times shows that the KF is about $1.8\pm 0.2$ slower than the RE algorithm for PC data.
Convergence diagnostics
In the limit of infinitely many samples drawn from a posterior by any MCMC method, given that posterior space is simply connected, the sampling is guaranteed to converge to the typical set Betancourt, 2017. For a finite amount of samples from the posterior it is not guaranteed that the samples are representative for the typical set of the posterior. To ensure that the trace has converged to the typical set is an active field of statistical research. A discussion of these important consistency checks (Gelman and Rubin, 1992a; Gelman et al., 2013; Vats and Knudson, 2018; Gabry et al., 2019; Vehtari et al., 2021) ensuring that the sampled traces represent the typical set, is beyond the scope of this paper. In short, we usually worked with 4 independent sampling chains of the posterior (Gelman and Rubin, 1992b). On the one hand, we visually checked that all 4 chains do report the same typical set to verify that the sampler is fine. This convergence can be monitored by the $\widehat{R}$ statistics (Gelman et al., 2013; Vehtari et al., 2021). $\widehat{R}$ evaluates differences between the within and betweenchain parameter estimates. A bad mixing is indicated by large values of $\widehat{R}$. Perfectly converged chains would be reported with $\widehat{R}=1$. Additionally, the effective sample size ${N}_{\mathrm{eff}}$ of each object of interest needs to be monitored. Both of them are reported by the “summary” method from the “fit” object in PyStan. We show (Appendix 1—table 1) an edited version of the output from this method. The last two columns would indicate signs of missing convergence. The second column reports the Markov sampling error telling if we drew enough samples in total. Note that any quantity far out in the tails would have a much larger error. There were cases when the sampler was not fine for all chains, meaning that 3 chains showed the same posterior but one showed something different. In such cases we simply discarded that chain. This occurred in particular more often with the RE approach. As we are benchmarking methods and know the true parameter values, it was rather obvious which chain did not work well. In particular, the successful binomial test of the HDCV is a downstream test that confirms besides the assumptions of the algorithm also whether the sampler worked sufficiently. For the Bayesian filter, sampler problems occurred at very low ion channel counts below ${N}_{\mathrm{ch}}=200$.
HMC parameter settings
The sampling efficiency of an HMC algorithm is sensitive to three parameters (Neal, 2011, Hoffman and Gelman, 2014). The three parameters, discretization time $\u03f5$, the metric $\mathbf{M}$ and the number of steps taken $L$ can be predefined or/and are adapted in the warmup phase of the sampler. A typical number of iterations ${N}_{\mathrm{warm}}$ in the warmup phase for each chain is between ${N}_{\mathrm{warm}}=3\cdot {10}^{3}\mathrm{\dots}7\cdot {10}^{3}$. Since a sufficient ${N}_{\mathrm{warm}}$ depends on the correlation structure of the posterior, which depends on the quality and quantity of the data, we did not engage to optimize the number of warmup steps. We rather checked whether for a set of different data (for example with different ${N}_{\mathrm{ch}}$) all chains are fine.
Stan
Stan is a probabilistic programming language that provides the research community with easy access to HMC sampling, variational Bayes, and optimization procedures like Maximum posterior/ML approaches. Stan uses automatic analytical differentiation of the statistical model such that no analytical treatment by hand has to be done for any tested model. The only requirement from the user is to define a statistical model of the data. The Stan compiler calculates the analytical derivatives of the model and then translates the statistical model to C++code and after that into an executable program. Furthermore, Stan uses adaptive HMC such that the length of the ballistic trajectory as well as the variance of the random kinetic energy to create the ballistic trajectory and the number of integration steps are optimized automatically. Then the program can be started from any highlevel data processing language of the user’s choice such as Python, Matlab, Julia, R or from the command line.
Appendix 2
Markov models for a single ion channel
Markov models and rate models are widely used for modeling molecular kinetics. They provide an interpretation of the data in terms of a set of functional states and the transition rates between these states. Markov models can be estimated from experimentally recorded data as well as from computer simulation data. The use of Markov models with onestep memory is supported by the concept of the molecular free energy landscape. Molecular energy landscapes are typically characterized by conformationally welldefined freeenergy minima that are separated by freeenergy barriers. State transitions in molecules are thermally activated barriercrossing events on this landscape (Frauenfelder et al., 1991) leading to a rapid equilibration of the system in the vicinity of this new minimum. Memory of other minima that have been visited in the past is not required. Regarding the wide spectrum of time scales at which processes in a protein take place, one has to be aware that there is typically a small number of relaxation modes with excessively long autocorrelation times and many relaxation modes with much faster autocorrelation times. To model the slow, experimentally accessible processes, it is sufficient to retain the small number of slow modes (Noé et al., 2011). It has been shown rigorously that working with the set of slow modes is equivalent to model the state dynamics with a small number of fuzzily defined metastable states in the full conformational space (Deuflhard and Weber, 2005) Later it has been shown that the set of slow modes can be well approximated with a hidden Markov model (Noé et al., 2013).
Appendix 3
Eigenvalues and fitting of sums exponential decays: Mean time evolution of the spectral components of the time evolution
From the experimental perspective we are used to fit sums of exponential decays to time series data of any signal such as currents or fluorescence.
The mathematical background for this procedure is the spectral decomposition of the solution of the corresponding RE equation
which is a vector differential equation whose evolution is governed by the rate matrix $\mathbf{K}$.The general solution to the mean time evolution of the system is
A Markov state model has as many real valued eigenvalues as it has states $M$. With those eigenvalues one can decompose the solution into its spectral components
All of the eigenvalues obey $\alpha \le 0$. One of them always fulfills ${\alpha}_{0}=0$ which belongs to the equilibrium probability distribution of the ion channel. With $exp(0)=1$ we can write the sum as
The eigenvalue ${\alpha}_{0}=0$ determines the equilibrium solution. Due to ${\alpha}_{i}\le 0$ the solution is stable and converges to
The time evolution of the mean signal in its spectral components can then be derived
by multiplying with the observation matrix $\mathbf{H}$ and setting ${\alpha}_{i}=\frac{1}{{\tau}_{i}}$ we derived Equation 63, Colquhoun and Hawkes, 1995a. To derive the likelihood based on this deterministic approximation Milescu et al., 2005 one assumes for each data point that the ensemble state $\mathbf{n}(t)$ is drawn from
with the additional approximation of the multinomial distribution by a multivariate normal distribution. Subsequently, the equation
provides the contribution of the recording system to the signal of the ensemble state. In contrast, the KF approximates a
distribution with multivariate normal distributions (see Methods) and then the details of the experimental setup are adjoined
Appendix 4
A note about Kalman filtering within other frameworks such as variational Bayes, maximum a posteriori and ML
We like to emphasize that the key property of the Kalman filter likelihood for an autocorrelated time series (see Equation 9) to factorize over time enables us to treat each data point as if it were independent. The alternative is to treat the whole time series as one highdimensional data point as done in Celentano and Hawkes, 2004. A ML Kalman filter (Moffatt, 2007) would try to optimize Equation 9, which is usually done in the log space to improve numerical stability. Taking the logarithm of Equation 9 gives also an insight for the reader, who might be more experienced in using residual sums of squares for fitting.
where we used the logarithmic algebra rule $\mathrm{log}({\prod}_{i}{a}_{i})={\sum}_{i}\mathrm{log}({a}_{i})$ Using the functional form of the multivariate normal distribution, we arrive at
which is a vectorial residual sum of squares whose summands are weighted by ${\mathbf{HP}}_{t}{\mathbf{H}}^{\top}+{\mathbf{\Sigma}}_{t}$. The past observations of time series are encoded in ${\mathbf{P}}_{t}$ and $E[{\mathbf{n}}_{t}]$, which are calculated with the Kalman filtering equations before the likelihood can be evaluated.
Generally, the user of our algorithm is not restricted to the full Bayesian inference. He/she can also stick to optimizing Equation 71. This is because in the probabilistic programming language Stan one only formulates the statistical model. The method to evaluate the statistical model and the data is provided by the Stan compiler and the high level programming language interface. Together they allow to switch quickly between HMC sampling, variational Bayesian posterior estimation, maximum a posteriori and ML. In particular, the variational Bayesian posterior might be a way to ease the computational burden due HMC sampling while keeping many of the benefits of knowing the full posterior distribution.
In the previous versions before PyStan 3.0 all three approaches were supported. Currently, with PyStan 3.0 only HMC sampling of the posterior is supported but, if desired, any other interface can be used (Matlab, Julia, R, CmdStan, Pystan 2.9, CmdStanPy,…) to perform optimization or variational Bayesian approaches. No actual changes in the Stan code are required, eventually with the exception that the CPU based parallelization needs to be eliminated. Given that one chooses optimization, the differences between maximum a posteriori/penalized maximum likelihood or only ML are just a matter of prior choice which need to be coded into the statistical model. Apart from those minor points, there is no need of extensive reimplementation of the code.
Appendix 5
The fluorescence signal of cPCF experiments
First four moments of a photomultiplier signal
In this work, the KF analysis assumes Poisson statistics for the fluorescence signal in cPCF. Many commercial microscopes are not equipped with photon counting detectors or detectors are not operated in photoncounting mode, often to due to ease of use or limitation in dynamic range. Therefore, it is important to verify that the fluorescence signal follows, at least approximately, Poisson counting statistics. In particular, for the KF it is assumed that higher order statistics, such as skewness and excess kurtosis, vanish. The central assumption of the derivation of our Bayesian filter is $\mathrm{var}[{y}_{\mathrm{f}\mathrm{l}}]=\mathbb{E}[{y}_{\mathrm{f}\mathrm{l}}]$
Here we show that this assumption for the detectors used in our previously used LSM 710 (Carl Zeiss, Jena) system (Biskup et al., 2007; Kusch et al., 2010) is fulfilled by rescaling to photonnumbers. The measured variance obeys $\mathrm{var}[{y}_{\mathrm{f}\mathrm{l}}]=a\cdot \mathbb{E}[{y}_{\mathrm{f}\mathrm{l}}]$. It depends linearly on the mean signals (Appendix 5—figure 1).
For the scaled signal $x$ being Poisson distributed follows $\mu =a$. Rescaling of the signal by $1/a$ provides approximately Poisson distributed values. A linear fit yields $a=205\text{a.u.}(16\text{bit})/\text{photon}$ (for 680 V PMT voltage, 3.26 $\mu s$ pixel dwell time). (Appendix 5—figures 2 and 3) show that excess kurtosis and skewness remain small at all levels of photons/pixel but are somewhat higher than theoretically predicted for Poissondistributed data. The proportionalities are correctly described by the Poisson distribution assumption but the skewness and the kurtosis are too small by a constant factor of $\sqrt{2}$ and 4, respectively. This finding has to be verified for different experimental conditions, because at lower concentration/particle densities and higher count rates, particle number fluctuations can dominate statistics (Brown et al., 2008). For comparison another option would be a Gamma distribution which has the mean and the variance of $E[y]=k\theta $ and $var[y]=k{\theta}^{2}$, respectively. Thus, the applied scaling requires that $\theta =1$. The Gamma distribution has a higher skewness by factor two (independently of θ) than a Poisson distribution and overscores the skewness and excess kurtosis of the detector. For simplicity only the Poisson distribution is considered in this work. In conclusion: Typical cPCF fluorescence signal detection rates are well approximated by a Gamma or Poisson distribution which in turn have the desired property that can be approximated by a normal distribution.
Background noise statistics
In cPCF measurements with fluorescencelabeled ligands, the signals of the ligands bound to the receptors overlap with the signals from freely diffusing fluorescencelabelled ligands in the bulk. This bulk signal is subtracted from the total signal (Biskup et al., 2007). While the mean difference signal ${y}_{\mathrm{fl},\mathrm{k}}(t)$ of the confocal voxel $k$ represents the bound ligands in that voxel, its noise ${y}_{\zeta ,k}(t)$ originates from both bound and bulk ligands. The additional bulk signal, e.g. the fraction of bulk solution inside that voxel, varies from voxel to voxel and can hardly be described theoretically. Nevertheless, it can be determined experimentally (Biskup et al., 2007). At low expression levels or at ligand concentrations above low nanomolar levels, this background signal is not negligible. It scales linearly with the ligand concentration, while the signal from bound receptors depends on the affinity, as estimated by the concentration of half maximum binding $B{C}_{50}$, and the number of ion channels in the membrane of the observed volume. The binding signal saturates at high concentrations (Appendix 5—figure 4). Thus, both high affinity (low $B{C}_{50}$) and high expression reduce the relative contribution of the background to the overall signal, improving the signal to noise ratio.
Practically, the bulk signal is estimated by counterstaining the solution with a spectrally distinct reference dye (Biskup et al., 2007). The spatial distribution of this dye mimics the spatial distribution of the freely diffusing ligands. The bulk absolute concentration as well as the molecular brightness of the reference dye and the labeled ligand differ. Hence, the binding signal is calculated as the average pixel intensity of the scaled difference image between the signal of labeled ligand and reference dye according to
where ${\widehat{\lambda}}_{\mathrm{lig},\mathrm{back}}$ and ${\widehat{\lambda}}_{\mathrm{ref},\mathrm{back}}$ are the arithmetic mean background signals of the ligand and reference dye recorded beyond the membrane where no signal should be recorded. They represent a signal offset which needs to be subtracted. The mean intensities in the bulk, ${\widehat{\lambda}}_{\mathrm{bulk}}$ and ${\widehat{\lambda}}_{\mathrm{ref}}$, are estimated outside the pipette. In order to get the correct scaling, the mean intensities need to be corrected by the respective mean background signals. If $\frac{{\widehat{\lambda}}_{\mathrm{lig}}{\widehat{\lambda}}_{\mathrm{lig},\mathrm{back}}}{{\widehat{\lambda}}_{\mathrm{ref}}{\widehat{\lambda}}_{\mathrm{ref},\mathrm{back}}}=1$ holds then ${y}_{\mathrm{fl},\mathrm{bin}}$ would be Skellam distributed (Hwang et al., 2007). The total signal is then ${y}_{\mathrm{fl}}={\sum}_{k}{y}_{\mathrm{fl},\mathrm{k}}$. This procedure creates $E[{y}_{\zeta}]=0$ but adds an additional noise term $\mathit{\zeta}({t}_{j})$. For the general case of different intensities, we name the distribution ’scaled Skellam distributed’. The scaling variance of the background noise in each voxel of the difference image
is derived from simulated data further below. ${\lambda}_{\mathrm{lig}}$ and ${\lambda}_{\mathrm{ref}}$ are the fluorescence intensity from the freely diffusing ligands and reference dye molecules per voxel, respectively. ${\lambda}_{\mathrm{lig}}$ and ${\lambda}_{\mathrm{ref}}$ are proportional to the volume fraction of the voxel, which is occupied by the bulk, and to the respective concentrations. To achieve a symmetric $\mathbb{P}(\zeta )$, one can set ${\lambda}_{\mathrm{lig}}={\lambda}_{\mathrm{ref}}$. The summed variance of all selected voxels can be tabulated according to
To mimic an experiment which creates time series data $\zeta (t)$, we draw Poisson numbers for the signal from the membrane $\mathrm{Poisson}(\mathbf{Hn}(t))$ and for the signal from the bulk we draw numbers from the two respective Poisson distributions. Then subtraction of the two background signals is performed according to
assuming that the dark count signal has been correctly subtracted. Then we add the bulk signal to the bound ligand signal. In this way we produce a time trace with colored noise by the Gillespie algorithm and add white noise to time traces as it is observed in real experiments.
Deriving the moments of the background noise for the difference signal
For the KF the variance, skewness and kurtosis arising from the background noise has to be calculated. Skewness and excess kurtosis of the distribution have to be small compared to the total variance of the signal including all noise sources because only in this case the KF algorithm can be considered as the optimal solution for the filtering and inference problem (Anderson and Moore, 2012). In the following the 2nd to 4th moment of ζ are derived. The noise intensity parameter of the reference dye ${\lambda}_{\mathrm{ref}}$ is proportional to ${\rho}_{\mathrm{ref}}{\lambda}_{\mathrm{bulk}}$ with $V$ being the confocal volume fraction containing fluorophores and ${\rho}_{\mathrm{ref}}$ the density of the fluorophores in this volume. In Appendix 5—figure 6 we deduce master curves for the variance skewness and excess kurtosis of the white noise by drawing $4\cdot {10}^{5}$ Poisson numbers from the respective Poisson distribution and subtract them from each other according to Appendix 5 Equation 78. The variance is derived empirically to be
In Appendix 5—figure 6a, we confirm the intuition ${\lambda}_{\mathrm{ref}}\to \mathrm{\infty}\Rightarrow \mathrm{var}(\zeta )={\lambda}_{\mathrm{lig}}$. Optimally, the skewness should be zero to avoid a biased estimate of $\mathit{\theta}$ when the data are analyzed by the KF. Empirically, for ${\lambda}_{lig}\ll {\lambda}_{ref}$ the skewness holds
Additionally for $\lambda}_{lig}<{\lambda}_{ref$ the skewness holds
It is zero when ${\lambda}_{ref}={\lambda}_{lig}$. The KF is optimal if the kurtosis excess approaches zero, in other words if ζ is distributed normally. Empirically the kurtosis holds this
for $\lambda}_{ref}\le {\lambda}_{lig$. The relative intensity $\lambda}_{lig$ of the voxel fraction compared to the intensity $\lambda}_{b$ depends on the affinity of the ligand to the receptor, the number of receptors in the patch, and the density of the fluorophores $p}_{lig$ at the patch. For larger concentrations the ratio should be $\lambda}_{lig}/{\lambda}_{ref$.
Appendix 6
Output statistics of Bayesian filters
Classical Kalman Filter without openchannel noise
Assuming that current measurements are only compromised by additive technical white noise ν but do not contain openchannel noise ${\nu}_{\mathrm{op}}$, then our noise model reduces to
The noise term $\nu}_{\mathrm{m}$ has a mean of $E[{\nu}_{\mathrm{m}}]=0$ and variance $E[{\nu}_{\mathrm{m}}^{2}]={\sigma}_{\mathrm{m}}^{2}=const$. One has to keep in mind that we have to add an extra variance term originating from the dispersion of channels over the state space, as encoded by $\mathbf{P}(t)$ and $\mathbf{n}(t)$. The uncertainty $\mathbf{P}(t)$ is calculated using Methods Equation 30. The variance of the total output is
The two cross terms $E[\nu ({t}_{1}){(\mathbf{n}E[\mathbf{n}])}^{T}{\mathbf{H}}^{T}]$ and $E[\mathbf{H}(\mathbf{n}E[\mathbf{n}])\nu {({t}_{1})}^{T}]$ are zero since ν is independent of $\mathbf{n}$ and $E[{\nu}_{\mathrm{m}}]=0$. Our derivation is equivalent to marginalization over the predicted normal prior of the ensemble state $\mathbb{P}(\mathbf{n}(t){\mathcal{Y}}_{t1})$ at the time of the measurement except that the prior distribution could be any probability distribution with some mean and variance. Equation 84a is the classical KF variance prediction of a signal. The first term in Equation 84a, describes the variance from stochastic gating and that the ensemble state is hidden. Notably, by Methods Equation 30 we realize that $\mathrm{var}(y(t))$ contains information about $\mathbf{T}$ and $\mathbf{n}(t1)$, which we can exploit with the KF framework.
A generalized Kalman filter with statedependent openchannel noise
In addition to the standard KF with only additive noise (Moffatt, 2007; Anderson and Moore, 2012; Chen, 2003), fluctuations arising from the singlechannel gating lead to a second whitenoise term ${\nu}_{\mathrm{op}}{n}_{4}(t)$, causing statedependency of our noise model. The output model is then
The second noise term ${\nu}_{\mathrm{op}}$ is defined in terms of the first two moments $E({\nu}_{\mathrm{op}})=0$ and therefore $\mathrm{var}({\nu}_{\mathrm{op}})=E({\nu}_{\mathrm{op}}^{2})={\sigma}_{\mathrm{op}}^{2}{\mathbf{n}}_{4}(t)$. To the best of our knowledge such a statedependent noise makes the following integration intractable
When assuming that the relative fluctuations of $\mathbf{n}(t)$ are on average small then n_{4} in the denominator is close to $E({n}_{4})$ of the state. Thus, the incremental likelihood can be written as in the standard KF, with the only difference that the measurement noise is the sum of two components.
To see that this approximation of the variance is correct, we apply the law of total variance decomposition Weiss, 2005.
The terms $\mathbf{HP}(t){\mathbf{H}}^{\top}+{\sigma}_{m}^{2}$ are the standard output covariance matrix. Again $\mathbf{P}(t)$ contains information about $\mathbf{T}$, $\mathbf{n}(t1)$ while the additional variance term includes information about the current $\mathbf{n}(t)$. The information contained in the noise influences likelihood in two ways. By the variance or covariance of the current $\mathbf{y}(t)$ but also for $\mathbf{y}(t+1)$ in correction step by the Kalman gain ${\mathbf{K}}_{\mathrm{Kal}}$ matrix defined in the next section.
Appendix 7
Error induced for the RE or KF approach by analog filtering of PC data
In order to mimic an analog signal before the analogtodigital conversion we simulated 5 different data sets of 100 kHz signals which were then filtered by a digital fourthorder Bessel filter. The activation curves were then analyzed with the Bayesian filter at 125 Hz and the deactivation curves at sampling rates between $166500$ Hz. Operating the Bayesian filter at a lower frequency is necessary because due to applying the Bessel filter the former white noise of the signal obtained additional time correlations. Thus, an all data points fit would immediately violate the white noise assumption of Equation 4 which we restore by analyzing at a much lower frequency. We then let the time scales of the induced time correlations become larger and larger by decreasing f_{cut}. We show (Appendix 7—figure 1a) the results of the Euclidean Error for 3 different cases for PC data. The RE approach (blue) and the Bayesian filter (red) share the same analyzing frequency. In contrast, an increased ${f}_{\mathrm{ana}}=16605000$ Hz (green) is used for the Bayesian filter. Both algorithms (blue, red) with the same f_{ana} show a similar rather constant region separated by an offset. Similar, to cPCF data the KF is more robust. The Bayesian filter with ${f}_{\mathrm{ana}}=16605000$ Hz (green) does not show this constant region but outperforms the Bayesian filter with the lower f_{ana} if the recording system is operated with minimal analog filtering. This becomes even more apparent when we consider the single parameter deviation vs. f_{cut} (Appendix 7—figure 1b(16)). Note, the strong dependence of the critical f_{cut} at which the performance of all algorithms becomes suddenly worse scales with f_{ana}. Additionally, it is crucial, independently of the algorithm, that ${f}_{\mathrm{cut}}\gg {f}_{\mathrm{ana}}$. In (Appendix 7—figure 1b(16)) we demonstrate the distribution of the errors on the single parameter level and compare only the KF for different f_{ana}. Similar to the findings in the main text, on the one hand, the Euclidean error (red) shows some robustness against f_{cut}. On the other hand, the influence of f_{cut} on the individual parameter level sets in immediately and is complex. The best results are achieved with minimal analog filtering and a high f_{ana}.
Appendix 8
Details of the specified two methods to count the probability mass needed to include the true value
In Figure 4c we compare the true against the expected success probabilities of finding the complete true rate matrix within an $n$dimensional HDCV for different expected success probabilities of a perfect model. By ‘expected’ and ‘perfect’ we refer to a fictitious ideal algorithm which exactly models exhaustively all details of the true process, including all measurement details. The first way assumes a multivariate normal posterior distribution with $\mathbb{P}{(\mathit{\theta}y)}_{\mathrm{post}}\approx \mathrm{normal}(\mathit{\theta}E[\mathit{\theta}],\mathbf{\Sigma})$. Then the ellipsoid of a constant probability density is exactly the surface of a HDCV of given probability mass $P$. For a twodimensional representation see Figure 4d below the diagonal. In one dimension, the ellipsoid consists of the two points with a distance ${d}_{\mathrm{Mah}}=\sqrt{{(\theta E[\theta ])}^{2}}/\sigma $ from the mean (inset Figure 5a). In general, the $n$dimensional ellipsoids around the mean of a multivariate normal distribution can be described by points which have the same Mahalanobis distance ${d}_{\mathrm{Mah}}$ from the mean.
For a multivariate standard normal distribution without correlation, the Mahalanobis distance becomes the Euclidean distance $\sqrt{{(\mathit{\theta}E[\mathit{\theta}])}^{T}(\mathit{\theta}E[\mathit{\theta}])}$. Rewriting the multivariate normal distribution in terms of the Mahalanobis distance
reveals that all points $\mathit{\theta}$ with ${d}_{\mathrm{Mah}}=const$ have the same probability density. The random variable ${d}_{\mathrm{Mah}}^{2}$ is χsquare distributed. Thus the χsquare distribution (Figure 5a) is the probability density of drawing an $\mathit{\theta}$ in $n$ dimensions and finding it on the ellipsoid’s $n1$dimensional surface, at a distance ${d}_{\mathrm{Mah}}$ from the mean. This allows us to use the cumulative ${\chi}^{2}$square distribution function to calculate the probability mass inside the ellipsoid which is the desired HDCV. By evaluating ${d}_{\mathrm{Mah}}=\sqrt{({\mathit{\theta}}_{\mathrm{true}}E[\mathit{\theta}]){\mathrm{\Sigma}}^{1}({\mathit{\theta}}_{\mathrm{true}}E[\mathit{\theta}])}$ and ${\chi}_{\mathrm{cdf}}^{2}({d}_{\mathrm{Mah}}^{2})$ we specify how much volume, in units of probability mass, has to be counted until the volume includes the true value. Note, that with increasing dimensionality (Figure 5a) the probability mass is shifted away from the mode of the probability density. For general probability densities, this is also true in higher dimensions where one only rarely finds the true value within a small distance to the maximum of the probability density. Note, that there is no mathematical theorem for singular models (Watanabe, 2007) saying that the posterior approaches asymptotically a multivariate normal with increasing data quality/quantity. Consequently, the underlying assumption that the posterior distribution is multivariate normal, is situation dependent valid or highly questionable. Thus the displayed method should be validated additionally in an independent way. To determine the probability mass needed to include the true value into the HDCV for such a posterior, we can also use a histogrambased method. One starts by constructing an $n$dimensional histogram from the samples of the posterior and by initializing a global variable with zero. Then we start counting from the bin with most counts and check whether the true value falls inside this bin. If not, we add the probability mass inside that bin to the global variable. Repeating this for next lower bins eventually leads to the detection of the bin with the true rate matrix inside. On detection, the global variable contains the soughtafter value $P$. While this procedure does not depend on a multivariate normal assumption, it is prone to errors due to the discrete bins and the finite samples from the posterior.
Nevertheless, both procedures show (Figure 5b) good agreement when plotting the volume/ probability mass needed to include the true parameter value vs. ${N}_{\mathrm{ch}}$. Again ${N}_{\mathrm{ch}}\approx 200$ seems to be a reasonable data quality to trust the statistics of the posterior using the KF. In contrast, the posterior of the RE approach never includes the true values with a reasonable HDCV. Note, that a probability mass of $\approx 1$ to reach the true value means that qualitatively speaking the estimate of the RE approach is infinitely far away from the true value in terms of the Mahalanobis distance. Further, due to the finite sampling the method of histogram counting is not qualified for the largest HDCVs approaching $P\approx 1$. The two developed methods contrast the Euclidean distance from the true values to the maximum mode, median or even the mean of the distribution against all higher moments of the distribution. Thus it tests the overall shape of the posterior.
Appendix 9
Uncertainty quantification for the 5state and 6state model with cPCF data
The effect of the second observable on the single parameter level for the 5state model can be seen in Appendix 9—figure 1. The biased estimates and the unidentified parameters of the RE approach for PC data (Figure 9) are eliminated by the fluorescence data.
Even the overconfidence problem seems to be decreased (compare with Figures 4b1—6). In contrast, the HDCIs of the 6states1openstate model for cPCF data (Appendix 9—figure 2) display again a much increased overconfidence problem of the rate equation approach. This indicates that not simply counting of states but the actual topology of the kinetic scheme and the structure of the observation matrix $\mathbf{H}$ determines the scale of the overconfidence problem. Apparently, the more information comes from the fluorescence data relative to the information from the current data the less grave is the overconfidence problem and the less different are the Euclidean errors.
Data availability
We included the simulated data time traces into supporting files and we uploaded the source code on https://github.com/JanMuench/Tutorial_Patchclamp_data and https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data.
References

Bayesian inference for ion–channel gating mechanisms directly from single–channel recordings, using Markov chain Monte CarloProceedings of the Royal Society of London. Series A 455:2879–2932.https://doi.org/10.1098/rspa.1999.0432

MCMC for IonChannel SojournTime Data: A Good ProposalBiophysical Journal 111:267–268.https://doi.org/10.1016/j.bpj.2016.02.042

Kinetic relationship between the voltage sensor and the activation gate in spHCN channelsThe Journal of General Physiology 130:71–81.https://doi.org/10.1085/jgp.200709769

Filtering and inference for stochastic oscillators with distributed delaysBioinformatics (Oxford, England) 35:1380–1387.https://doi.org/10.1093/bioinformatics/bty782

Stan: A probabilistic programming languageJournal of Statistical Software 76:10.https://doi.org/10.18637/jss.v076.i01

On kalman filter for linear system with colored measurement noiseJournal of Geodesy 88:1163–1170.https://doi.org/10.1007/s0019001407517

Characterization of single channel currents using digital signal processing techniques based on Hidden Markov ModelsPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 329:265–285.https://doi.org/10.1098/rstb.1990.0170

BookA QMatrix CookbookIn: Sakmann B, Neher E, editors. SingleChannel Recording. Boston, MA: Springer. pp. 589–633.https://doi.org/10.1007/9781441912299_20

BookThe Principles of the Stochastic Interpretation of IonChannel MechanismsIn: Sakmann B, Neher E, editors. SingleChannel Recording. Boston, MA: Springer. pp. 397–482.https://doi.org/10.1007/9781441912299_18

On the stochastic properties of single ion channelsProceedings of the Royal Society of London. Series B. Biological Sciences 211:205–235.https://doi.org/10.1098/rspb.1981.0003

How to impose microscopic reversibility in complex reaction mechanismsBiophysical Journal 86:3510–3518.https://doi.org/10.1529/biophysj.103.038679

Statistical Analysis of Ion Channel Data Using Hidden Markov Models With Correlated StateDependent Noise and FilteringJournal of the American Statistical Association 96:805–815.https://doi.org/10.1198/016214501753208519

Robust Perron cluster analysis in conformation dynamicsLinear Algebra and Its Applications 398:161–184.https://doi.org/10.1016/j.laa.2004.10.026

Trajectory inference and parameter estimation in stochastic models with temporally aggregated dataStatistics and Computing 28:1053–1072.https://doi.org/10.1007/s112220179779x

The energy landscapes and motions of proteinsScience (New York, N.Y.) 254:1598–1603.https://doi.org/10.1126/science.1749933

Maximum likelihood estimation and identification directly from singlechannel recordingsProceedings. Biological Sciences 249:125–132.https://doi.org/10.1098/rspb.1992.0094

Visualization in Bayesian workflowJournal of the Royal Statistical Society 182:389–402.https://doi.org/10.1111/rssa.12378

A single series from the gibbs sampler provides A false sense of securityBayesian Statistics 4:75.

Inference from Iterative Simulation Using Multiple SequencesStatistical Science 7:1136.https://doi.org/10.1214/ss/1177011136

Understanding predictive information criteria for Bayesian modelsStatistics and Computing 24:997–1016.https://doi.org/10.1007/s1122201394162

Stan: A probabilistic programming language for bayesian inference and optimizationJ. of Educational and Behavioral Statistics 40:113.https://doi.org/10.3102/1076998615606113

The chemical Langevin equationThe Journal of Chemical Physics 113:297–306.https://doi.org/10.1063/1.481811

Exact stochastic simulation of coupled chemical reactionsThe Journal of Physical Chemistry 81:2340–2361.https://doi.org/10.1021/j100540a008

BookBayesian Inference for the Chemical Master Equation Using Approximate ModelsUlm, Germany: Elsevier.

Markov chain Monte Carlo fitting of singlechannel data from inositol trisphosphate receptorsJournal of Theoretical Biology 257:460–474.https://doi.org/10.1016/j.jtbi.2008.12.020

Linearnoise approximation and the chemical master equation agree up to secondorder moments for a class of chemical systemsPhysical Review. E, Statistical, Nonlinear, and Soft Matter Physics 92:042124.https://doi.org/10.1103/PhysRevE.92.042124

Determination of parameter identifiability in nonlinear biophysical models: A Bayesian approachThe Journal of General Physiology 143:401–416.https://doi.org/10.1085/jgp.201311116

A primer on Bayesian inference for biophysical systemsBiophysical Journal 108:2103–2113.https://doi.org/10.1016/j.bpj.2015.03.042

Analyzing singlemolecule time series via nonparametric Bayesian inferenceBiophysical Journal 108:540–556.https://doi.org/10.1016/j.bpj.2014.12.016

The nouturn sampler: adaptively setting path lengths in hamiltonian monte carloJournal of Machine Learning Research 15:1593–1623.

Estimating kinetic constants from single channel dataBiophysical Journal 43:207–223.https://doi.org/10.1016/S00063495(83)843410

ConferenceSensor noise modeling using the Skellam distribution: Application to the color edge detectionIEEE conference on computer vision and pattern recognition.https://doi.org/10.1109/CVPR.2007.383004

Solving the chemical master equation for monomolecular reaction systems analyticallyJournal of Mathematical Biology 54:1–26.https://doi.org/10.1007/s002850060034x

BookConfidence intervals vs bayesian intervalsIn: Harper WL, Hooker CA, editors. Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science. Berlin, Germany: Springer. pp. 175–257.https://doi.org/10.1007/9789401014366_6

An invariant form for the prior probability in estimation problemsProceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 186:453–461.https://doi.org/10.1098/rspa.1946.0056

A New Approach to Linear Filtering and Prediction ProblemsJournal of Basic Engineering 82:35–45.https://doi.org/10.1115/1.3662552

The Relationship between Stochastic and Deterministic Models for Chemical ReactionsThe Journal of Chemical Physics 57:2976–2978.https://doi.org/10.1063/1.1678692

How subunits cooperate in cAMPinduced activation of homotetrameric HCN2 channelsNature Chemical Biology 8:162–169.https://doi.org/10.1038/nchembio.747

Accounting for variability in ion current recordings using a mathematical model of artefacts in voltageclamp experimentsPhilosophical Transactions of the Royal Society A 378:20190348.https://doi.org/10.1098/rsta.2019.0348

Structural identifiability of equilibrium ligandbinding parametersThe Journal of General Physiology 149:105–119.https://doi.org/10.1085/jgp.201611702

The structure of binding curves and practical identifiability of equilibrium ligandbinding parametersThe Journal of General Physiology 149:121–147.https://doi.org/10.1085/jgp.201611703

Maximum likelihood estimation of ion channel kinetics from macroscopic currentsBiophysical Journal 88:2494–2515.https://doi.org/10.1529/biophysj.104.053256

Listening to the noise: random fluctuations reveal gene network parametersMolecular Systems Biology 5:318.https://doi.org/10.1038/msb.2009.75

Estimating kinetic mechanisms with prior knowledge II: Behavioral constraints and numerical testsJournal of General Physiology 150:339–354.https://doi.org/10.1085/jgp.201711912

Solving ion channel kinetics with the qub softwareBiophysical Reviews and Letters 08:191–211.https://doi.org/10.1142/S1793048013300053

Projected and hidden Markov models for calculating kinetics and metastable states of complex moleculesThe Journal of Chemical Physics 139:184114.https://doi.org/10.1063/1.4828816

Ion Channels in Genetic Epilepsy: From Genes and Mechanisms to DiseaseTargeted TherapiesPharmacological Reviews 70:142–173.https://doi.org/10.1124/pr.117.014456

A tutorial on hidden Markov models and selected applications in speech recognitionProceedings of the IEEE 77:257–286.https://doi.org/10.1109/5.18626

Bayesian restoration of ion channel records using hidden Markov modelsBiophysical Journal 80:1088–1103.https://doi.org/10.1016/S00063495(01)760870

MCMC for hidden Markov models incorporating aggregation of states and filteringBulletin of Mathematical Biology 66:1173–1199.https://doi.org/10.1016/j.bulm.2003.12.001

Efficiently Simulating the Coverage Properties of Interval EstimatesApplied Statistics 35:159.https://doi.org/10.2307/2347266

BookSingleChannel RecordingBerlin, Germany: Springer Science & Business Media.https://doi.org/10.1007/9781441912299

Estimating kinetic mechanisms with prior knowledge I: Linear parameter constraintsThe Journal of General Physiology 150:323–338.https://doi.org/10.1085/jgp.201711911

An Introduction to Infinite HMMs for SingleMolecule Data AnalysisBiophysical Journal 112:2021–2029.https://doi.org/10.1016/j.bpj.2017.04.027

ICON: An Adaptation of Infinite HMMs for Time Traces with DriftBiophysical Journal 112:2117–2126.https://doi.org/10.1016/j.bpj.2017.04.009

MCMC estimation of Markov models for ion channelsBiophysical Journal 100:1919–1929.https://doi.org/10.1016/j.bpj.2011.02.059

MCMC Can Detect Nonidentifiable ModelsBiophysical Journal 103:2275–2286.https://doi.org/10.1016/j.bpj.2012.10.024

Modelling modal gating of ion channels with hierarchical Markov modelsProceedings of the Royal Society A 472:20160122.https://doi.org/10.1098/rspa.2016.0122

Maximum likelihood estimation of biophysical parameters of synaptic receptors from macroscopic currentsFrontiers in Cellular Neuroscience 8:303.https://doi.org/10.3389/fncel.2014.00303

Structural dynamics in the gating ring of cyclic nucleotidegated ion channelsNature Structural & Molecular Biology 14:854–860.https://doi.org/10.1038/nsmb1281

BookStochastic Processes in Physics and ChemistryAmsterdam, Netherlands: Elsevier.https://doi.org/10.1016/B9780444529657.X50004

Applying hidden Markov models to the analysis of single ion channel activityBiophysical Journal 82:1930–1942.https://doi.org/10.1016/S00063495(02)755422

Pacemaker activity of the human sinoatrial node: effects of HCN4 mutations on the hyperpolarizationactivated currentJournal of the Working Groups on Cardiac Pacing 16:384–395.https://doi.org/10.1093/europace/eut348

ConferenceAlmost All Learning Machines are Singular2007 IEEE Symposium on Foundations of Computational Intelligence. pp. 383–388.https://doi.org/10.1109/FOCI.2007.371500

Statedependent cAMP binding to functioning HCN channels studied by patchclamp fluorometryBiophysical Journal 100:1226–1232.https://doi.org/10.1016/j.bpj.2011.01.034
Decision letter

Marcel P GoldschenOhmReviewing Editor; University of Texas at Austin, United States

Richard W AldrichSenior Editor; University of Texas at Austin, United States

Marcel P GoldschenOhmReviewer; University of Texas at Austin, United States

Lorin S MilescuReviewer; University of Maryland, Baltimore, United States

Colin KinzThompsonReviewer; Rutgers, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Decision letter after peer review:
Thank you for submitting your article "Bayesian selection of Hidden Markov models for multidimensional ion channel data" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Marcel P GoldschenOhm as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Richard Aldrich as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Lorin S Milescu (Reviewer #2); Colin KinzThompson (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
We would like to draw your attention to changes in our revision policy that we have made in response to COVID19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require additional discussion or a modest amount of additional analysis, as they do with your paper, we are asking that the manuscript be revised to address the points raised by the reviewers.
Our expectation is that the authors will eventually carry out the additional analyses and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.
Summary:
Extracting ion channel kinetic models from experimental data is an important and perennial problem. Much work has been done over the years by different groups, with theoretical frameworks and computational algorithms developed for specific combinations of data and experimental paradigms, from single channels to realtime approaches in live neurons. At one extreme of the data spectrum, single channel currents are traditionally analyzed by maximum likelihood fitting of dwell time probability distributions; at the other extreme, macroscopic currents are typically analyzed by fitting the average current and other extracted features, such as activation curves. Robust analysis packages exist (e.g., HJCFIT, QuB), and they have been put to good use in the literature.
Münch et al. focus here on several areas that need improvement: dealing with macroscopic recordings containing relatively low numbers of channels (i.e., hundreds to tens of thousands), combining multiple types of data (e.g., electrical and optical signals), incorporating prior information, and selecting models. The main idea is to approach the data with a predictorcorrector type of algorithm, implemented via a Kalman filter that approximates the discretestate process (a metaMarkov model of the ensemble of active channels in the preparation) with a continuousstate process that can be handled efficiently within a Bayesian estimation framework, which is also used for parameter estimation and model selection.
With this approach, one doesn't fit the macroscopic current against a predicted deterministic curve, but rather infers – point by point – the ensemble state trajectory given the data and a set of parameters, themselves treated as random variables. This approach, which originated in the signal processing literature as the ForwardBackward procedure (and the related BaumWelch algorithm), has been applied since the early 90s to single channel recordings (e.g., Chung et al., 1990), and later has been extended to macroscopic data, in a breakthrough study by Moffatt (2007). In this respect, the study by Münch et al. is not necessarily a conceptual leap forward. However, their work strengthens the existing mathematical formalism of state inference for macroscopic ion channel data, and embeds it very nicely in a rigorous Bayesian estimation framework.
The main results are very convincing: basically, model parameters can be estimated with greater precision – as much as an order of magnitude better – relative to the traditional approach where the macroscopic data are treated as noisy but deterministic (but see comments below). Estimated uncertainty can be further improved by incorporating prior information on parameters (e.g., diffusion limits), and by including other types of data, such as fluorescence.
The manuscript is well written and overall clear, and the mathematical treatment is a rigorous tourdeforce. However, the reviewers raised a number points that need further clarification, better discussion or amendment. These concerns are likely to be addressable largely by changes to the main text and software documentation along with some additional analyses. Once addressed, a revised version of this manuscript would likely be suitable for publication in eLife. That said, the study is very nice and ambitious, but clarity is a bit impaired by dealing with perhaps too many issues. The state inference and the bayesian model selection are very important but completely different issues. The authors should consider whether they may be better treated separately, or even perhaps in a more specialized journal where they can be developed in more detail.
Essential revisions:
1. Tutorialstyle computational examples must be provided, along with well commented/documented code. The interested readers should be able to implement the method described here in their own code/program. The supplied code is not well documented, and it is unclear whether it is applicable beyond the specific models examined in the paper. It was supplied as.txt files, but looks like C code, and is unlikely to be easily adaptable by others to their own models.
2. The authors should clearly discuss the types of data and experimental paradigms that can be optimally handled by this approach, and they must explain when and where it fails or cannot be applied or becomes inefficient in comparison with other methods. One must be aware that ion channel data are very often subject to noise and artifacts that alter the structure of microscopic fluctuations. It seems that the state inference algorithm would work optimally with low noise, stable, patchclamp recordings (and matching fluorescence recordings) in heterologous expression systems (e.g., HEK293 cells), where the currents are relatively small, and only the channel of interest is expressed (macropatches?). In contrast, it may not be effective with large currents that are recorded with low gain, are subject to finite series resistance, limited rise time, restricted bandwidth, colored noise, contaminated by other currents that are (partially) eliminated with the P/n protocol with the side effect of altering the noise structure, power line 50/60 Hz noise, baseline fluctuations, etc. This basically excludes some types of experimental data and experimental paradigms, such as recordings from neurons in brain slices or in vivo, oocytes, etc. Of course, artifacts can affect all estimation algorithms, but approaches based on fitting the predicted average current have the obvious benefit of averaging out some of these artifacts.
The discussion in the manuscript is insufficient in this regard and must be expanded. The method should be tested under nonideal but commonly occurring conditions, such as limited bandwidth and in the presence of contaminating noise. For example, compare estimates obtained without filtering with estimates obtained with 2, 3 times overfiltering, with and without large measurement noise added (whole cell recordings with lowgain feedback resistors and series resistance compensation are quite noisy), with and without 50/60 Hz interference. How does the algorithm deal with limited bandwidth that distorts the noise spectrum? How are the estimated parameters affected? The reader will have to get sense of how sensitive is this method to artifacts. Also, fluorescence data in particular is usually of much lower temporal resolution than current measurements. How does this impact its benefit to parameter estimation?
3. Even more emphasis on the approximation of n(t) as being distributed according to a multivariate normal, and thus being continuous, should be placed in the main text. It seems that this may limit the applicability of the method to data with > ~100s of channels; although the point is not investigated. In Figure 3, the method is only benchmarked to a lower limit of ~500 channels. As shown in Milescu et al., 2005, fitting macroscopic currents is asymptotically unbiased. In other words, the estimates are accurate, unless the number of channels is small (tens or hundreds), in which case the multinomial distribution is not very well approximated by a Gaussian. What about the predictorcorrector method? How accurate are the estimates, particularly at low channel counts (10 or 100)? Since the Kalman filter also uses a Gaussian to approximate the multinomial distribution of state fluctuations, I would also expect asymptotic accuracy. Parameter accuracy should be tested, not just precision.
4. Achieving the ability to rigorously perform model selection is very impressive aspect of this work and a large contribution to the field. However, the manuscript offers too many solutions to performing that model selection itself along with a long discussion of the field (for instance, line 376395 could be completely cut). Since probabilistic model selection is an entire area of study by itself, the authors do not need to present underdeveloped investigations of each of them in a paper on modeling channel data (e.g., of course WAIC outperforms AIC. Why not cover BIC and WBIC?). The authors should pick one, and maybe write a second paper on the others instead of presenting nonrigorous comparisons (e.g., one kinetic scheme and set of parameters). As a side note, it is strange that the authors did not consider obtaining evidence or Bayes factors to directly perform Bayesian model selection – for instance, they could have used thermodynamic integration since they used MC to obtain posteriors anyway (c.f., Computing Bayes Factors Using Thermodynamic Integration by Lartillot and Philippe, Systematic Biology, 2006, 55(2), 195207. DOI: 10.1080/10635150500433722)
5. Regarding model selection for the PC data in Figure 7, why does the RE model with BC appear to provide a visually better identification of the true model than the KF model with BC if the KF model does a better job at estimating the parameters and setting non true rates to zero? Doesn't this suggest that RE with cross validation is better than the proposed KF approach in regards to model selection? In terms of parameter estimates (i.e. as shown in Figure 3), how does RE + BC stack up?
6. A better comparison with alternative parameter estimation approaches is necessary. First of all, explain more clearly what is different from the predictorcorrector formalism originally proposed by Moffatt (2007). The manuscript mentions that it expands on that, but exactly how? If it is only an incremental improvement, of interest to a very limited audience, a specialized, technical journal, is more appropriate.
Second, the method proposed by Celentano and Hawkes, 2004, is not a predictorcorrector type but it utilizes the full covariance matrix between data values at different time points. It seems that the covariance matrix approach uses all the information contained in the macroscopic data and should be on par with the state inference approach. However, this method is only briefly mentioned here and then it's quickly dismissed as "impractical". We all agree that it's a slower computation than, say, fitting exponentials, but so is the Kalman filter. Where do we draw the line of impracticability? Computational speed should be balanced with computational simplicity, estimation accuracy, and parameter and model identifiability. Moreover, that method was published in 2004, and the computational costs reported there should be projected to present day computational power. We are not saying that the authors should code the C&H procedure and run it here, but should at least give it credit and discuss its potential against the KF method.
The only comparison provided in the manuscript is with the "rate equation" approach, by which the authors understand the family of methods that fit the data against a predicted average trajectory. In principle, this comparison is sufficient, but there are some issues with the way it's done.
Table 3 compares different features of their state inference algorithm and the "rate equation fitting", referencing Milescu et al., 2005. However, there seems to be a misunderstanding: the algorithm presented in that paper does in fact predict and use not only the average but also – optionally – the variance of the current, as contributed by stochastic state fluctuations and measurement noise. These quantities are predicted at any point in time as a function of the initial state, which is calculated from the experimental conditions. In contrast, the KF calculates the average and variance at one point in time as a projection of the average and variance at the previous point. However, both methods (can) compare the data value against a predicted probability distribution. The Kalman filter can produce more precise estimates but presumably with the cost of more complex and slower computation, and increased sensitivity to data artifacts.
Figure 3 is very informative in this sense, showing that estimates obtained with the state inference (KF) algorithm are about 10 times more precise that those obtained with the "rate equation" approach. However, for this test, the "rate equation" method was allowed to use only the average, not the variance.
Considering this, the comparison made in Figure 3 should be redone against a "rate equation" method that utilizes not only the expected average but also the expected variance to fit the data, as in Milescu et al., 2005. Calculating this variance is trivial and the authors should be able to implement it easily (reviewers are happy to provide feedback). The comparison should include calculation times, as well as convergence.
7. The manuscript nicely points out that a "rate equation" approach would need 10 times more channels (N) to attain the same parameter precision as with the Kalman filter, when the number of channels is in the approximate range of 10^2 … 10^4. With larger N, the two methods become comparable in this respect.
This is very important, because it means that estimate precision increases with N, regardless of the method, which also means that one should try to optimize the experimental approach to maximize the number of channels in the preparation. However, it should be pointed out that one could simply repeat the recording protocol 10 times (in the same cell or across cells) to accumulate 10 times more channels, and then use a "rate equation" algorithm to obtain estimates that are just as good. Presumably, the "rate equation" calculation is significantly faster than the Kalman filter (particularly when one fits "features", such as activation curves), and repeating a recording may only add seconds or minutes of experiment time, compared to a comprehensive data analysis that likely involves hours and perhaps days. Although obvious, this point can be easily missed by the casual reader and so it would be useful to be mentioned in the manuscript.
8. A misunderstanding is that a current normalization is mandatory with "rate equation" algorithms. This is really not the case, as shown in Milescu et al., 2005, where it is demonstrated clearly that one can explicitly use channel count and unitary current to predict the observed macroscopic data. Consequently, these quantities can also be estimated, but state variance must be included in the calculation. Without variance, one can only estimate the product i x N, where i is unitary current and N is channel count. This should be clarified in the manuscript: any method that uses variance can be used to estimate i and N, not just the Kalman filter. In fact, the nonstationary noise analysis does exactly that: a modelblind estimation of N and i from nonequilibrium data. Also, one should be realistic here: in some circumstances it is far more efficient to fit data "features", such as the activation curve, in which case the current needs to be normalized.
9. It's great that the authors develop a rigorous Bayesian formalism here, but it would be a good idea to explain – even briefly – how to implement a (presumably simpler) maximum likelihood version that uses the Kalman filter. This should satisfy those readers who are less interested in the Bayesian approach and will also be suitable for situations when no prior information is available.
10. The Bayesian formalism is not the only way of incorporating prior knowledge into an estimation algorithm. In fact, the reader may have more practical and pressing problems than guessing what the parameter prior distribution should be, whether uniform or Gaussian or other. More likely one would want to enforce a certain KD, microscopic (i)reversibility, an (in)equality relationship between parameters, a minimum or maximum rate constant value, or complex model properties and behaviors, such as maximum Popen or halfactivation voltage. A comprehensive framework for handling these situations via parameter constraints (linear or nonlinear) and cost function penalty has been recently published (Salari et al/Navarro et al., 2018). Obviously, the Bayesian approach has merit, but the authors should discuss how it can better handle the types of practical problems presented in those papers, if it is to be considered an advance in the field, or at least a usable alternative.
11. The methods section should include information concerning the parameter initialization choices, HMC parameters (e.g. number of steps) and any burnin period used in the analyses used in Figures 36. For example, how is convergence established? How many iterations does it take to reach convergence? How long does it take to run? How does it scale with the data length, channel count, and model state count? How long does it take to optimize a large model (e.g., 10 or 20 states)? Provide some comparison with the "rate equation method".
12. In the section on priors, the entire part concerning the use of a β distribution should be removed or replaced, because it is a probabilistic misrepresentation of the actual prior information that the authors claim to have in the manuscript text. The maxentropy prior derived for the situation described in the text (i.e., an unknown magnitude where you don't know any moments but do have upper and lower bounds; the latter could be from the length from the experiment) is actually P(x) = (ln(x_{max}) – ln(x_{min}))^{1} * x^{1}. Reviewers are happy to discuss more with the authors.
13. Here and there, the manuscript somehow gives the impression that existing algorithms that extract kinetic parameters by fitting the average macroscopic current ("fitting rate equations") are less "correct", or ignorant of the true mathematical description of the data. This is not the case. Published algorithms often clearly state what data they apply to, what their limitations are, and what approximations were made, and thus they are correct within that defined context and are meant to be more effective than alternatives. Some quick editing throughout the manuscript should eliminate this impression.
14. The manuscript refers to the method where the data are fitted against a predicted current as "rate equations". However, it is not completely clear what that means. The rate equation is something intrinsic to the model, not a feature of any algorithm. An alternative terminology must be found. Perhaps different algorithms could be classified based on what statistical properties are used and how. E.g., average (+variance) predicted from the starting probabilities (Milescu et al., 2005), full covariance (Celentano and Hawkes, 2004), pointbypoint predictorcorrector (Moffatt, 2007).
15. The manuscript needs line editing and proofreading (e.g., on line 494, "Roa" should be "Rao"; missing an equals sign in equation 13). Additionally, in many paragraphs, several of the sentences are tangential and distract from communicating the message of the paper (e.g., line 55). Removing them will help to streamline the text, which is quite long.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for submitting your article "Bayesian inference of kinetic schemes for ion channels by Kalman filtering" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Marcel P GoldschenOhm as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Richard Aldrich as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Colin KinzThompson (Reviewer #3); Lorin Milescu (Reviewer #4).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
Reviewer #1 (Recommendations for the authors):
The authors develop a Bayesian approach to modeling macroscopic signals arising from ensembles of individual units described by a Markov process, such as a collection of ion channels. Their approach utilizes a Kalman filter to account for temporal correlations in the bulk signal. For simulated data from a simple ion channel model where ligand binding drives pore opening, they show that their approach enhances parameter identifiability and/or estimates of parameter uncertainty over more traditional approaches. Furthermore, simultaneous measurement of both binding and pore gating signals further increases parameter identifiability. The developed approach provides a valuable tool for modeling macroscopic time series data with multiple observation channels.
The authors have spent considerable effort to address the previous reviewer comments, and I applaud them for the breadth and depth of their current analysis.
1. The figure caption titles tend to say what analysis or comparison is being presented, but not what the take home message of the figure is. I suggest changing them to emphasize the latter. This will especially help nonexperts to understand what the figures are trying to convey to them.
2. I very much appreciate the GitHub code and examples for running your software. However, I feel that a handholding stepbystep explicit example of running a new model on new data is likely necessary for many to be able to utilize your software. Much more handholding than the current instructions on the GitHub repo.
3. Figure captions sometimes do not explain enough of what is shown. I appreciate that many of these details are in the main text, but data that is displayed in the figure and not concretely described in the caption can makes the figures hard to follow. e.g. Figure 4a – "With lighter green we indicate the Euclidean error of patchclamp data set." But what data set do the other colors reflect? It is not stated in the caption. Again, I realize this is described in the main text, but it also needs to be defined in the caption where the data is presented. Figure 4d – Please spell out what "both algorithms" intends. Also, a suggestion: instead of having to refer to the caption for the color codes (i.e. RE vs. KF, etc.) it would speed figure interpretation to have a legend in the figures themselves. Few other examples: Box 1. Figure 2. – Please define the solid vs. dashed lines in the caption. Figure 3c – Please define the solid vs. dashed lines in the caption. Figure 12 – "We simulated 5 different 100kHz signals." What kind of signals? Fluorescence I assume, but this needs to be explicitly defined. I'd check all figures for similar.
Reviewer #3 (Recommendations for the authors):
In this revised manuscript, the Münch et al. have addressed all of my original concerns. It is significantly revised, though, and includes many new investigations of the algorithm's performance. Overall, the narrative of this manuscript is now to introduce an approximation to the solution for a Bayesian Kalman Filter, and then spend time demonstrating that this approximation is reasonable and even better than previous methods. In my opinion, they successfully do this, although, as they mention in their comments, their manuscript is very long.
I am not 100% certain, but the approximation that the author's make seems to be equivalent (or at least similar to) an approximation of the chemical master equation using just the 1st and 2nd moments, which is just the FokkerPlanck equation. The authors should discuss any connection to this approximation, as there is a great deal of literature on this topic (e.g., see van Kampen's book).
In Figures 3A and 4D, it is unclear to me what is plotted for the RE and classical Kalman filter (i.e., how is there a posterior if they are not Bayesian methods)? Perhaps it is buried in the methods or appendices, but, if so, it needs to at least be clarified in the figure captions.
The Bayesian statistical tests devised to determine success (e.g., on pgs. 1213) seem a little ad hoc, but are technically acceptable. I do not see a need for additional metrics.
Line 939: Equation 61 is absolutely not "close to uniform distribution". The α and β parameters of 100.01 are much larger than 1. It is incredibly peaked around 0.5. Perhaps this is a typo?
Line 942: The allowed small breaking of microscopic reversibility in the prior is an interesting idea that I wish the authors would expound upon more.
Line 712: The authors state that the simulation 'code will be shared upon request'. They should include it with their github pages tutorials for running the examples in case others wish to check their work and/or use it. There is no reason to withhold it.
Line 707: 'Perspectively' is not a commonly used word.
Reviewer #4 (Recommendations for the authors):
The authors have addressed all my comments and suggestions. The manuscript is nice and extremely comprehensive, and should advance the field.
Nevertheless, the manuscript is also very long (but justifiably so), and certain statements could be a little clearer. Most of these statements refer to the comparison with the socalled rate equation (RE) methods, with which I'm more familiar. For example:
Abstract: "Furthermore, the Bayesian filter delivers unbiased estimates for a wider range of data quality and identifies parameters which the rate equation approach does not identify."
The first part of this statement is not quite true, as Figure 4 shows clearly that the Bayesian estimates are biased (and the authors acknowledge this elsewhere in the manuscript). If they are biased in one "range of data quality", that probably means they are always biased, just to different degrees. This is not surprising, because the Kalman filter is a continuous state approximation to a discrete state process, and the overall estimation algorithm makes a number of approximations to intractable probability distributions. It would definitely be correct to say that the estimates are very good, but not unbiased.
Second, this statement is also ambiguous. Are you referring to the theoretical nonidentifiability caused by having more parameters in the model than what the combination of estimator+experimental protocol can extract from data? In this case, it's not a matter of certain parameters that cannot be identified, but a matter of how many can be identified uniquely. The more information is extracted from the data, the more unique parameters can be identified, so the Kalman filter should do better. Or, alternatively, are you referring to specific parameters that are poorly identified because, for example, they refer to transitions that are rarely undertaken by the channel? In this case, it would be a matter of obtaining looser or tighter estimates with one method or another, but the parameters should be intrinsically identifiable, I imagine, regardless of the algorithm. In any case, it's not clear that the better identifiability is the result of the Bayesian side, or of the predictorcorrector state inference filter. I would guess it is the Kalman filter, but I'm not sure.
Perhaps it would be clearer if you said that the KF method produces good estimates and generally improves parameter identifiability compared to other methods, as it extracts more information from the data?
Introduction:
32: I'm not sure, but if the intention here is to cite mathematical treatments for estimation, you may add references to the "macroscopic" papers by Celentano, Milescu, Stepaniuk and perhaps a few others that use "rate equations". Also, you may cite Qin et al., 1996, as a single channel paper describing a method used in hundreds of studies.
Pg. 3:
51: I remain skeptical that it is a good idea to use "rate equations" (RE) as a term to refer to those methods that are fundamentally different from the approach described here (also see my comment to the first submission). The rate equations must always be used to predict a future state from a past or current state, by all methods, explicitly or implicitly, because REs simply describe the channel dynamics. In this very manuscript, Equation 3, central to the Kalman filter formalism, is nothing but a deterministic rate equation with a stochastic term added to approximate the stochastic evolution of an ensemble of channels. In fact, there are some old papers by Fox and some more recent by Osorio (I'm not exactly sure of the name and I don't remember the years) that discuss that approximation and its shortcomings – perhaps that is the source of bias?
Whether that prediction is then corrected, as in the Kalman filter approach, with a corrector step is irrelevant, as far as the rate equations. Furthermore, it's all a matter of degree. For example, the Milescu et al. approach, which is classified here under the RE umbrella, predicts future states as a mean + variance (or as a mean only), but only using the initial state. There is no correction at each point, as with the Kalman filter method, but there is a correction of the initial state from one iteration to the next (by the way, it's not clear if you implemented that feature here, which would make a big difference). Then, because it considers the stochastical aspect of the data, the Milescu et al. approach should also be considered a stochastic method, just one that doesn't use all the information contained in the data (and so it is fast). Imagine a situation where you ran the same stimulation protocol multiple times, but each time you record only one data point, further and further away from the beginning of the stimulation protocol. A "stochastic" algorithm applied to this type of data would be exactly as described in Milescu et al. Of course, in reality all points are recorded in sequence, but that doesn't mean that the approach is not stochastic, just that it 's simplifying and discarding some information to gain speed. All methods (such as the Kalman filter described here) make some compromises to reduce complexity and increase speed.
The bottom line is that there is the most basic approach of solving the rate equations deterministically, without considering any variance whatsoever, and then there is everything else.
58: "Thus, a time trace with a finite number of channels contains, strictly speaking, only one independent data point."
I don't understand this sentence. Could you please clarify?
74: "The KF is the minimal variance filter Anderson and Moore (2012). Thus, instead of excessive analog filtering of currents with the inevitable signal distortions Silberberg and Magleby (1993) one can apply the KF with higher analysing frequency on less filtered data."
Yes, but I'm not sure why should we use excessive filtering? Where is this idea coming from?
131: "However, even with simulated data of unrealistic high numbers of channels per patch, the KF outperforms the deterministic approach in estimating the model parameters."
It is clearly true from your figures, but please give a number as to what is unrealistic (10,000? 100,000?). Also, outperforms, but by how much? As I commented above and below, all methods tested here seem to produce good estimates under the conditions they were designed for (and even outside), and one might argue that it's not worth adding the additional computational complexity and running time for a possibly small increase in accuracy. How is one to judge that increase in accuracy?
276: Has the RE method been used with the mean + variance or mean only? Also, how was the initial probability vector calculated with the RE method? If you used the mean + variance, then (as mentioned above), you can't call it deterministic. If you used only the mean, then it is indeed a deterministic method, but it's not the approach described in Milescu et al. Please explain.
229: "added, in Equation 9 to the variance … which originates (Equation 38d) from the fact that we do not know the true system state"
Which noise are you referring to? The state noise or the measurement noise?
326: Which are the two algorithms?
278: "Both former algorithms are far away from the true parameter values with their maxima and also fail to cover the true values within reasonable tails of their posterior."
I think the authors might have gotten a little carried away when they made this statement. There is no doubt that the Kalman/Bayesian method produces more accurate estimates, but the other two methods (KF and RE) are very reasonable as well. I see estimates that are within 5, 10, or 20% from the true values, for most parameters. This is not "failure" by any stretch of the imagination. Most people would call this quite good, given the nature of the data.
294: "The small advantage of our algorithm for small … over Moffatt (2007) is due to the fact that we could apply an informative prior in the formulation of the inference problem … by taking advantage of our Bayesian filter. "
This is an interesting and important statement. I interpret it as saying that the Bayesian aspect itself makes only a small contribution to the quality of the estimates, when comparing it with the Moffatt method, which is a Kalman filter as well. The only issue with the Moffatt method is the lack of an explicit formalism for the excess state noise (which could presumably be added). It also suggests to me that any method that tries to use the noise may run into difficulties when the noise model is unknown (typical reallife scenario). The Moffatt method is confronted with unknown noise and it fails. What about when the Bayesian method is confronted with unknown noise? There are some comments in the manuscript, but nothing tangible. Could you please comment?
Figure 3: There are no data sets in the figure. What data were analyzed here?
Is there a typo regarding the colors in a? What are the red, blue, black, green symbols? Please verify.
Assuming the red is KF, there is something very curious about its estimates, which are very different in distribution from the Bayesian estimates. Could there be a coding problem? If red is actually RE, then it would make more sense. Could you explain? Is this the effect of the "excessive" open state noise? If so, I find this situation a bit unfair, because then the 2007 KF algorithm is tested with data that it is not meant to work with.
Also, I think it would be more interesting to have a comparison between the original KF and the newer Bayesian approach, so we can understand what Bayesian estimation does. In any case, I can't really interpret the data until you clarify the colors.
The legend is unclear overall.
303: What means "singular"?
324: What shaded area in Figure 1a? Perhaps you mean Figure 4a?
Figure 4: I don't see any light green.
368: "The estimated true success rate by the RE approach (blue) is ≈ 0.15 and therefore far away from the true success probability. In contrast, the posterior (green) of the true success probability of the KF resides with a large probability mass between the lower and upper limit of the success probability of an optimal algorithm (given the correct kinetic scheme)."
I'm really a bit puzzled here: yes, the KF is more accurate (given the correct kinetic scheme), but what I see in this figure is that both algorithms are biased, yet both are generally within 10% (and quite a bit better for KF) of the true values. If one would plot the log of the rates, to transform into δ Gs, the differences would be even smaller. I would bet that experimental artifacts would contort the estimates to a greater degree anyway.
It's very nice that the RE approach was embedded and tested within a Bayesian framework, but it would still be interesting to know what is the contribution of the Bayesian aspect (unless I missed this point in the manuscript).
436: "…while their error ratio (Figure 7a) seems to have no trend to diminish with increasing NCh".
This would be unexpected, because the Kalman filter will always use more information than RE. Why would the KF approach become relatively more erroneous at higher Nc? I don't see any reason. To me, an important question is when do the overall errors become dominated by external factors, such as recording artifacts, using the wrong model, etc.
511: "Thus, transferring the unusually strictly applied algebraic constraint Salari et al. (2018) of microscopicreversibility to constraint with scalable softness we can model the lack information if microscopicreversibility is exactly fulfilled Colquhoun et al. (2004) by the given ion channel instead of forcing it upon the model."
I'm not sure I understand what you mean here. I think it is very usual to constrain microscopic reversibility EXACTLY, whether through the SVD method (Qin et al., Plested et al., Milescu et al., Salari et al), or through some other algebraic method (Plested et al). It's not "forcing" it on the channel, but simply testing the data for that condition. Of course, one could test the condition where there is no reversibility enforced at all (i.e., it's "given by the channel"), or anything in between.
675: "With our algorithm we demonstrate (Figure 3c and 7) that the common assumption that for large ensembles of ion channels simpler deterministic modeling by RE approaches is on par with stochastic modeling, such as a KF, is wrong in terms of Euclidean error and uncertainty quantification (Figure 5ac and Figure 6ab)".
I find this statement a little subjective. I think anyone who cares about stochastic vs. deterministic modeling knows enough that any method that uses more information from data should produce better estimates, regardless of the number of channels. I would say that the more likely assumption is that deterministic estimators produce poor estimates with small numbers of channels, perfect estimates with infinitely many channels, and anything in between. In fact, the KF method behaves exactly the same way, just with overall higher accuracy. Looking at the tests in this manuscript, I would say that all the previous studies that modeled macroscopic data using deterministic methods are safe. They don't need to be redone. The future, of course, is a different matter.
https://doi.org/10.7554/eLife.62714.sa1Author response
Essential revisions:
1. Tutorialstyle computational examples must be provided, along with well commented/documented code. The interested readers should be able to implement the method described here in their own code/program. The supplied code is not well documented, and it is unclear whether it is applicable beyond the specific models examined in the paper. It was supplied as.txt files, but looks like C code, and is unlikely to be easily adaptable by others to their own models.
An extended tutorialstyle github page for PC data can be found at https://github.com/JanMuench/Tutorial_Patchclamp_data and for confocal patchclamp fluorometry data, https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data.
2. The authors should clearly discuss the types of data and experimental paradigms that can be optimally handled by this approach, and they must explain when and where it fails or cannot be applied or becomes inefficient in comparison with other methods. One must be aware that ion channel data are very often subject to noise and artifacts that alter the structure of microscopic fluctuations. It seems that the state inference algorithm would work optimally with low noise, stable, patchclamp recordings (and matching fluorescence recordings) in heterologous expression systems (e.g., HEK293 cells), where the currents are relatively small, and only the channel of interest is expressed (macropatches?). In contrast, it may not be effective with large currents that are recorded with low gain, are subject to finite series resistance, limited rise time, restricted bandwidth, colored noise, contaminated by other currents that are (partially) eliminated with the P/n protocol with the side effect of altering the noise structure, power line 50/60 Hz noise, baseline fluctuations, etc. This basically excludes some types of experimental data and experimental paradigms, such as recordings from neurons in brain slices or in vivo, oocytes, etc. Of course, artifacts can affect all estimation algorithms, but approaches based on fitting the predicted average current have the obvious benefit of averaging out some of these artifacts.
The discussion in the manuscript is insufficient in this regard and must be expanded. The method should be tested under nonideal but commonly occurring conditions, such as limited bandwidth and in the presence of contaminating noise. For example, compare estimates obtained without filtering with estimates obtained with 2, 3 times overfiltering, with and without large measurement noise added (whole cell recordings with lowgain feedback resistors and series resistance compensation are quite noisy), with and without 50/60 Hz interference. How does the algorithm deal with limited bandwidth that distorts the noise spectrum? How are the estimated parameters affected? The reader will have to get sense of how sensitive is this method to artifacts. Also, fluorescence data in particular is usually of much lower temporal resolution than current measurements. How does this impact its benefit to parameter estimation?
We developed the algorithm for analysing patchclamp or patchclamp fluorometry data though we are convinced that other experimental settings are applicable, as long as normal distributions can be reasonably applied and the white noise assumption for the measurement process holds. We show that it is advantageous to use the data in the ”rawest” form possible, e.g. without additional filtering (Figure 11 and App. 7 for patchclamp data). Instead of smoothing, averaging, or subtraction, optimal statistical description of noise sources is needed in our approach. For example, in uncompensated wholecell measurements the RCcircuit of the membrane is a lowpass filter with known characteristics. It should be noted that any active compensation of the series resistance certainly affects the noise characteristics in a hardly predictable way. It should therefore be used with particular care. In contrast, traces corrected by P/n protocols (given sufficiently low openprobability of the inactive channel) should be treatable with our approach when including the additional variance of the noise introduced by the P/n trace adequately. In fact, the assumed fluorescence signal in a confocal patchclamp fluorometry experiment is a difference signal just as the current signal obtained with the P/n technique. Here, the amount of mean signal of swimming bulk ligands is subtracted from the original signal. Thus, in the current signal in Equation 4 we simply need to add a second noise term whose variance needs to be characterized. Nonstochastic powerline or slow baseline fluctuations are not described by our approach and should be minimized by experimental means. Although, adding a drift term in Equation 4, governed by a Gaussian process with an appropriate kernel, is worth to try. In the similar way, with a periodic kernel powerline errors could be tackled. In the “Kalman filter derived from a Bayesian filter” section (line 244255) we specify those considerations.
Moreover, we selected now two nonideal aspects induced by limitations of real experiments to investigate the robustness of the idealizations of our algorithm versus the approach by Milescu et al., 2005. First, we analyzed the bias induced by the analog filtering of confocal patchclamp fluorometry data before the analogtodigital conversion (Figure 11). For PC only data see App. 7. Second, we investigated finite integration times (Figure 12) of fluorescence data (as explicitly asked by Reviewer 1). We chose to combine those two aspects because of their similar physical origin. Both of them induce an intrinsic time scale of the recording system which is not represented in the algorithms.
3. Even more emphasis on the approximation of n(t) as being distributed according to a multivariate normal, and thus being continuous, should be placed in the main text. It seems that this may limit the applicability of the method to data with > ~100s of channels; although the point is not investigated. In Figure 3, the method is only benchmarked to a lower limit of ~500 channels.
We expanded for confocal patchclamp fluorometry data (Figure 4a, 7d,e) the benchmark to channel numbers of 5 · 10^{1} to 10^{6} per patch. At the lower limit below N_{ch} < 200 we see that highest density volumina (HDCV) do not report the correct success probability of the true parameters to be inside this volume. Below this limit both algorithms have very similar Euclidean errors and the deviations of the highest density intervals for both algorithms for each N_{ch} become similar. Therefore, we suspect that the error due to the approximations of the multinomial distribution becomes more relevant than the autocorrelation of the intrinsic noise. In principle, the almost equal errors could also result from a bias induced by the uniform prior over the rate matrix. Therefore, we tested the Jeffreys Prior (loguniform on the dwell times) and dirichilet distribution on the probabilities which transition is chosen and did not detect differences for confocal patchclamp fluorometry data. For PC data (Figure 3c), we see that below N_{ch} < 2000 the posterior becomes improper for some parameters (unidentified parameters) such that here a comparison is problematic to interpret. Below N_{ch} < 2000 the mentioned Jeffreys prior has an enormous effect but to avoid prolongation of the text we save this aspect for a separate paper about priors in kinetic scheme estimation. We added in the main text a reference to Moffat, 2007 (line 201207). He applied his algorithm even for ion channel numbers as low as N_{ch} = 1 and found in Figure 4 an inverse relationship $\propto \frac{1}{{N}_{ch}}$ for the difference between the true log likelihood value and the loglikelihood of his Kalman filter approximation which establishes around N_{ch} = 20. Of course, the differences also depend on which data points of the time traces are used for the inference because the quality of the used approximations also depend on where on the probability simplex it is applied. Thus, in this regard our results should be seen as a rule of thumb for the lower limit. Nevertheless, we expect a similar behaviour.
As shown in Milescu et al., 2005, fitting macroscopic currents is asymptotically unbiased. In other words, the estimates are accurate, unless the number of channels is small (tens or hundreds), in which case the multinomial distribution is not very well approximated by a Gaussian. What about the predictorcorrector method? How accurate are the estimates, particularly at low channel counts (10 or 100)?
We expanded the benchmarking for confocal patchclamp fluorometry data (Figure 4) to low channel counts of 50. We observe that both algorithms have similar Euclidean errors if the channel numbers for confocal patchclamp fluorometry data for the simple model drops below N_{ch} < 200. For those small ion channel ensembles, the uncertainty quantification by Bayesian credibility intervals/volume (Figure 4. b16) becomes unreliable even for the Kalman filter. Note, we showed that uncertainty quantification by Bayesian credibility volume in combination with the rate equation approach is generally unreliable because the posterior is to narrow (overconfident) leading to bad coverage properties of its Highest Density Credibility Volume/Interval (Figure 5 and 6). Thus, in that regime both algorithms fail in a similar way.
On the one hand, these problems can originate from the used approximations of both algorithms to derive the likelihood. One the other hand, as information from the data gets weaker in the lower regime it could also be a bias from the uniform prior of the rate matrix. This is because a uniform prior is not an unbiased minimum information prior for rates, as indicated by Reviewer 3. Tests with a Jeffreys prior for confocal patchclamp fluorometry data with the 4state model did not reveal a significant difference such that we consider in the article only the first option and for clarity took out prior discussions.
We also like to note that the difference between the exact likelihood and a Bayesian/Kalman Filter approximation is investigated for patchclamp data of a 3state model in (Moffatt, 2007) (Figure 2, 3 and 4). He shows that the difference of the log likelihood of data between an algorithm with the exact likelihood and the Kalman filter implementation scales with ∝ 1/N_{ch} if N_{ch} > 20.
Since the Kalman filter also uses a Gaussian to approximate the multinomial distribution of state fluctuations, I would also expect asymptotic accuracy.
For a 5state2openstates model for PC data (Figure 9), we demonstrate that the advantage of the KF is a bias reduction compared to the RE approach. Thus, a high accuracy is reached by the KF already at N_{ch} = 10^{3}. In contrast, we detect bias and inaccurate inferences of the RE approach even for ensemble sizes as large as N_{ch} = 5·10^{5}. Due to the small number of test data sets a small undetected amount of inaccuracy in parameter inferences of the KF is possible. Additionally, for some data sets of high quality N_{ch} = 7.5·10^{4} the RE algorithm fails to identify parameters, whereas the KF algorithm has no identification problems.
In contrast, with confocal patchclamp fluorometry data for all tested models bias and inaccuracy is a smaller but still present issue (Figure 11 and 12). The overconfidence problem of the RE algorithm remains for all tested models.
Parameter accuracy should be tested, not just precision.
consistent estimators (given an identifiable model) as their Euclidean error (distance from the true value) vanishes in distributions with increasing data quality. If any estimator is consistent it has to be asymptotically accurate and precise.
Accurate means
accurate = E[θ_{i} − θ_{i,}_{true}] = 0 (1)
with i indicating the ith parameter and E means the arithmetic mean of many inferences with different data sets. The common mathematical definition of precision is
For a data set of finite quality or amount, the estimators could still be inaccurate (biased) and/or imprecise but it could also be that inferences are perfectly accurate (unbiased) and only imprecise. To see that the last aspect is true, take for example a perfectly accurate inference: Thus, the mean equals exactly the true value E[θ] − θ_{true} = 0 and assume its imprecision is Σ = 1. In this case the Euclidean error $\sqrt{\sum _{i}^{{N}_{\text{prameters}}}{\theta}_{i}^{2}}$ , is a random variable that is χdistributed. Thus, the mean Euclidean distance equates to $E\left[\text{error}\right]=\sqrt{2}\frac{\mathrm{\Gamma}(({N}_{\text{parameter}}+1)/2)}{\mathrm{\Gamma}({N}_{\text{parameter}}/2)}\ne 0$, with Γ() being the γ function. In conclusion, the Euclidean error is not a direct estimator of precision but has good summarizing properties.
We display now in Figure 4b16, 8 and 9 and App. 9 accuracy and precision implicitly by the narrowest highest density interval. By implicitly, we mean that both aspects are demonstrated by the local trend (accuracy) and the spread (precision) around the trend vs. N_{ch}. In this representation, perfect accuracy means E[θ^{˜}_{i}] = 1 as the parameters are normalized to their true value. The advantage of the single parameter level is that systematically inaccurate (biased) estimates E[θ^{˜}_{i}] 6 = 1 for a finite data quality can be detected Figure 9 for which the more summarizing Euclidean error is blind.
Conceptually, we prefer the Bayesian highest density credibility interval. In the new version of the article we argue that Euclidean error and Bayesian highest density volumes together are a very effective way to probe the performance of an algorithm (Figure 5. and Figure 6). Highest density credibility volumes (HDCV) allow uncertainty quantification exactly in the way the researcher desires in most cases. HDCVs allow to determine the probability mass of the true parameters to be in a certain volume for a given data set. This guided us to the binomial testing of the overall shape of the posterior. This interpretation is not possible for the usual confidence intervals/volumes of maximum likelihood, Janes, et al., 1976. In the benchmark situation of this article, we used the HDCV to show that the RE approach is generally too confident in the Bayesian parameter uncertainty quantification, which also leads to too confident (narrow) confidence volumes or errorbars in a maximum likelihood context. Note, that this shortcoming had remained undetected if we would have explored only accuracy and precision. Later in the article we worked explicitly with the classical definition of accuracy and precision because we want to understand the mean bias (for all possible random data sets) induced by analog Besselfiltering (Figure 11 and App. 7 Figure 1) the data before analysing them. The same applies to the mean bias due to limited time resolution of fluorescence data (Figure 12).
4. Achieving the ability to rigorously perform model selection is very impressive aspect of this work and a large contribution to the field. However, the manuscript offers too many solutions to performing that model selection itself along with a long discussion of the field (for instance, line 376395 could be completely cut). Since probabilistic model selection is an entire area of study by itself, the authors do not need to present underdeveloped investigations of each of them in a paper on modeling channel data (e.g., of course WAIC outperforms AIC. Why not cover BIC and WBIC?). The authors should pick one, and maybe write a second paper on the others instead of presenting nonrigorous comparisons (e.g., one kinetic scheme and set of parameters). As a side note, it is strange that the authors did not consider obtaining evidence or Bayes factors to directly perform Bayesian model selection – for instance, they could have used thermodynamic integration since they used MC to obtain posteriors anyway (c.f., Computing Bayes Factors Using Thermodynamic Integration by Lartillot and Philippe, Systematic Biology, 2006, 55(2), 195207. DOI: 10.1080/10635150500433722)
We took out the whole model selection part and saved it for a successor publication.
5. Regarding model selection for the PC data in Figure 7, why does the RE model with BC appear to provide a visually better identification of the true model than the KF model with BC if the KF model does a better job at estimating the parameters and setting non true rates to zero? Doesn't this suggest that RE with cross validation is better than the proposed KF approach in regards to model selection? In terms of parameter estimates (i.e. as shown in Figure 3), how does RE + BC stack up?
We are pleased to clarify this point though the model selection will be placed in a second article. In the previous version of the article, we presented in Figure 6 and Figure 7 two related aspects.
First, we define overfitting in the usual way as a performance difference of a trained model when it is used to predict the training data and prediction data. A trained model, no matter if under complex or overly complex performs on average better to predict the already used trainings data than unseen new data. When we choose a too complex kinetic scheme in which the true process is nested we observe, that the RE approach, when applied to a more complex model, has a higher tendency to overfit. We concluded that from the observation that the RE algorithm concentrates less the posterior mass around zero for rates which do not exist in the true process (Figure 6 of the previous article) than the KF. The RE algorithm places during the training more statistical weight on values of rates to explain fluctuations which are specific for the training data set. This happens no matter which model is used just because the amount of data is finite. That leads due to the higher flexibility of a too complex model to even more ways to overfit the data set. This means for a kinetic scheme trained by an RE algorithm, that if the trained model is now used to predict new data, it does not perform (generalize) as good as if the simpler model would have been trained and then used to predict new data (cross validation). In mathematical terms, more statistical weight on structures which are not present in the real process reduces, of course, the value of Equation 16 of the previous version of the article. This tendency to overfit more of the RE approach can be exploited if the model inference is done by Bayesian cross validation (Figure 7a, see Equation 16 of the previous version of the article). Because crossvalidation penalizes more the bigger tendency of RE approach on the too complex model to overfit to the trainings data.
Thus, the conclusion of Reviewer 1 is the same as we drew in lines 406 410 of the previous version of the article “This means, ignoring intrinsic fluctuations of macroscopic data, such as RE approaches do, leads to the inference of more states than present in the true model if the model performance is evaluated by the training data. One can exploit this weakness by choosing the kinetic scheme on crossvalidated data, since too complex models derived from the RE approach do not generalize as good to new data as a parsimonious model”. We were intending to state that if one uses a RE approach one is limited to cross validation for the kinetic scheme selection. To the best of our knowledge, crossvalidation is not the default strategy how models are selected. But the data (Figure 7a in the previous version of the article) show that it should be the method of choice if an RE approach is used.
The Bayesian filter is more flexible than a RE approach. It is more reliable to detect structures that are overfitted by asking for which rates is the probability mass close to zero (continuous model expansion). But it can also use estimators of the minimum KullbackLeibler entropy (Bayesian cross validation or information criteria such as WAIC). Because, Bayesian cross validation is data intensive, it is an advantage to have an algorithm which is able to perform this task on the training data by WAIC.
6. A better comparison with alternative parameter estimation approaches is necessary. First of all, explain more clearly what is different from the predictorcorrector formalism originally proposed by Moffatt (2007). The manuscript mentions that it expands on that, but exactly how? If it is only an incremental improvement, of interest to a very limited audience, a specialized, technical journal, is more appropriate.
We thank the Reviewer for mentioning these concerns. First, we inserted the new Figure 3 to address these concerns. In particular, Figure 3b shows even for tiny standard deviations of the openchannel noise, measured in units of single channel current σ_{op}/i ≈ 0.01, that the previous Kalman filter, Moffat, 2007, produces a larger Euclidean error than our algorithm. The magnitude of the Euclidean error of the parameters of the kinetic scheme inferred by Moffat, 2007, becomes quickly even worse than the Euclidean Error of Milescu et al., 2005 while our algorithm and the one used by, Milescu et al., 2005 is hardly affected. The reason for this is revealed if one considers the scaling of the total amount of openchannel noise in the macroscopic signal. In fact, we interpret our findings such that for most ion channels this generalization from pure additive to additional multiplicative noise (from adding instrumental noise of a constant variance to adding instrumental noise whose variance depends on the ensemble state) is a necessary step to apply the Kalman filter. Second, we discuss the mathematical difficulties now in more detail in the main text in the subsection “Kalman filter derived from a Bayesian filter”.
Second, the method proposed by Celentano and Hawkes, 2004, is not a predictorcorrector type but it utilizes the full covariance matrix between data values at different time points. It seems that the covariance matrix approach uses all the information contained in the macroscopic data and should be on par with the state inference approach. However, this method is only briefly mentioned here and then it's quickly dismissed as "impractical".
Thanks for pointing out that our concerns with the algorithm by Celentano and Hawkes, 2004, were not convincing. We agree that the likelihood of the Celentano and Hawkes algorithm seems to be a mathematical equivalent. But our major concern is computational speed. We agree that there is no sharp borderline between possible and impossible with regard to a Bayesian adaptation of the Celentano and Hawkes, 2004, algorithm and our algorithm. Thus, we softened the tone by writing “nonoptimal or even impractical” (line 6263). Nevertheless, we clarified our reasons to call it impractical here after the next quote and in the main article (line 6072).
We all agree that it's a slower computation than, say, fitting exponentials, but so is the Kalman filter. Where do we draw the line of impracticability? Computational speed should be balanced with computational simplicity, estimation accuracy, and parameter and model identifiability. Moreover, that method was published in 2004, and the computational costs reported there should be projected to present day computational power.
To clarify our concerns, note that a maximum likelihood optimization usually takes some orders of magnitude fewer likelihood evaluations compared to the number of posterior evaluations when one samples the posterior (App. 1). The reason is not simply the larger number of samples by some orders of magnitude compared to the iteration steps of a maximum likelihood optimizer: For each excepted sample, the Hamiltonian Monte Carlo sampler (Betancourt, 2017) evaluates the derivative of the likelihood and prior distribution many times. This is necessary for numerically integrating the Hamiltonian equations via a leapfrog integrator. The evaluations of the likelihood by Celentano and Hawkes, 2004, have a computational complexity due to matrix inversion from N_{data}^{2.3} to N_{data}^{3} while the Kalman filter has a linear computational complexity in N_{data} (App. 1).
We are not saying that the authors should code the C&H procedure and run it here, but should at least give it credit and discuss its potential against the KF method.
We expanded our considerations on the Celentano and Hawkes algorithm, and its offsprings, in the introduction (line 6072). Based on the Celentano and Hawkes algorithm, Stepanyuk and Borisyuk, 2011, and Stepanyuk, 2014, published an algorithm which evaluates the likelihood quicker. Under certain conditions they claim that their algorithm can be faster than the Kalman filter from Moffatt, 2007.
A second point relates again to the computational complexity. Analog filtering before the analogtodigital conversion distorts the dynamics of interest. Instead of a extensive analog filtering, it is beneficial to use the Kalman filter with a high analysing frequency because it is mathematically proven to be the minimal variance filter, Anderson et all, 2012, given the assumptions mentioned in the article are fulfilled. Under these conditions the linear computational complexity relative to N_{data} becomes even more important.
That said it is not sufficient to just care about the scaling of the computational time. Memory operations (App. 1 Figure 2) of the code between the CPUs become quickly the bottleneck of the total computational time. For example, if the algorithm is asked for each data point to save the covariance matrix and mean vector of the signal, the computational time increases roughly by 5 · 10 − 10^{2} times (App. 1 Figure 2).
A third reason for choosing the KF, is the implementation of a second dimension of the signal which is straightforward for the Bayesian filter as all of the Kalman filter equations are vector and matrix equations.
The only comparison provided in the manuscript is with the "rate equation" approach, by which the authors understand the family of methods that fit the data against a predicted average trajectory. In principle, this comparison is sufficient, but there are some issues with the way it's done.
Table 3 compares different features of their state inference algorithm and the "rate equation fitting", referencing Milescu et al., 2005. However, there seems to be a misunderstanding: the algorithm presented in that paper does in fact predict and use not only the average but also – optionally – the variance of the current, as contributed by stochastic state fluctuations and measurement noise. These quantities are predicted at any point in time as a function of the initial state, which is calculated from the experimental conditions. In contrast, the KF calculates the average and variance at one point in time as a projection of the average and variance at the previous point. However, both methods (can) compare the data value against a predicted probability distribution. The Kalman filter can produce more precise estimates but presumably with the cost of more complex and slower computation, and increased sensitivity to data artifacts.
We thank the Reviewer for this criticism because it allows us to point out what we liked to say with Table 3 in the previous version of the article. We used the terminology of the Kalman filter which separates the prediction equations for the mean and covariances of the system from the correction equations. The prediction equation evolves the mean and the covariance in time. The prediction of the mean value of the RE approach is identical to the one of the Kalman filter but there is no prediction equation for the covariance in the terminology sense of the Kalman filter. This does not mean that we intended to say that the RE approach cannot predict a covariance of the signal and state. The RE approach uses the multinomial assumption and the mean values for the prediction of the covariance. But this distinction obviously leads to unnecessary confusion and is not necessary. We therefore deleted the table.
As a rule of thumb at moderate analysing frequencies, the KF is usually 2 to 3 times slower than the RE algorithm (App. 1 Figure 3) for cPCF data. Using higher analysing frequencies for the patchclamp data (App. 1 Figure 4a) the computational time of the Kalman filter can be even faster. This effect is not present at a lower analysing frequency. (App. 1 Figure 4b). For the two investigated data artifacts (Figure 11, 12 and App. 7) we could show that the Kalman filter performs better.
Figure 3 is very informative in this sense, showing that estimates obtained with the state inference (KF) algorithm are about 10 times more precise that those obtained with the "rate equation" approach. However, for this test, the "rate equation" method was allowed to use only the average, not the variance.
Considering this, the comparison made in Figure 3 should be redone against a "rate equation" method that utilizes not only the expected average but also the expected variance to fit the data, as in Milescu et al., 2005. Calculating this variance is trivial and the authors should be able to implement it easily (reviewers are happy to provide feedback). The comparison should include calculation times, as well as convergence.
We benchmarked against an inhouse algorithm, but of course, Milescu et al., 2005, and Moffat, 2007, should be considered the gold standard. Therefore, we now challenge (Figure 3) our algorithm for patchclamp data by the Bayesian version of Milescu et al., 2005 and Moffat, 2007. The error ratio between both algorithms amounts to 5.6 ± 1.4 for 4state model and to 6.8 ± 2.7 for 5state model (Figure 7).
In particular, the algorithm used by Moffat, 2007 suffers from statedepended noise (Figure 3b). Consequently, his approach is not applicable for Poissonian distributed fluorometry data. Thus, we benchmark only against Milescu et al., 2005 for confocal patchclamp fluorometry data on a 4state model (Figure 4), and on two more complex models (Figure 7,8 and 9) with 5 and 6 states. We show that the error ratio between the KF and the RE algorithm gets smaller when the second observable is added. For all tested models (Figure 7) the error ratio is about 1.52.0. Thus, adding the variance information, as asked by the Reviewer, made a large difference, similar to patchclamp data, Milescu et al., 2005. That said, we show in Figure 4,5,6 that even if the error ratios are smaller the posterior sampled by an RE algorithm is meaningless as it is overconfident, which has also consequences for the confidence volume of maximum likelihood estimation (see discussion around Figure 4b,5,6,8,9).
In Figure 7b and d, we show that the advantage of the KF for patchclamp data depends on the complexity of the process to be inferred. The KF is unbiased and identifies all parameters while the RE algorithm generates on the same data biased estimates and sometimes unidentified parameters.
For patchclamp data of the most complex tested 6state model (Figure 7 e) even the KF requires at least N_{ch} > 4 · 10^{4} channels per patch to identify all parameters. We suspect that this effect is caused by the 12 parameter dimensions of the rate matrix. Probably, the 12 dimensions make it difficult for the likelihood to dominate the bias picked up from the uniform prior.
We discuss calculation times and convergence in App. 1 Figure 24. Normally, the Kalman filter is 23 times slower than the RE approach. But against intuition, (App. 1 Figure 4a) we show that the Kalman filter can be even faster than a Rate equation type algorithm if the analysing frequency is high. This is only possible if the integration times of the Hamiltonian dynamics of the sampler are taking much longer for the RE approach. We showed that the RE approach creates too confident posteriors which in turn could lead to more integration steps of the leapfrog integrator of the sampler. Thus, algorithms with lower computational complexity in the likelihood evaluation can still be slower regarding the total computational time.
7. The manuscript nicely points out that a "rate equation" approach would need 10 times more channels (N) to attain the same parameter precision as with the Kalman filter, when the number of channels is in the approximate range of 10^2 … 10^4. With larger N, the two methods become comparable in this respect.
This is very important, because it means that estimate precision increases with N, regardless of the method, which also means that one should try to optimize the experimental approach to maximize the number of channels in the preparation. However, it should be pointed out that one could simply repeat the recording protocol 10 times (in the same cell or across cells) to accumulate 10 times more channels, and then use a "rate equation" algorithm to obtain estimates that are just as good.
As we were asked to redo the benchmark against the RE approach using the likelihood of Milescu, 2005, instead of least squares, there are many things that changed in our interpretation. First, we do not see anymore, in any of the tested scenarios for patchclamp data (Figure 3c) our patchclamp fluorometry data (Figures 4a, 7d,e) that both algorithms become equivalent in terms of the Euclidean error (Figures 3c, 4a, 7d,e) or HDCV (Figures 5c, 6ab) with increasing channel numbers. They rather have a constant mean error ratio (Figure 7a,b) such that our old interpretation, to repeat the experiment 10 times, does not hold anymore.
Second, the ratio of the Euclidean error changes, such that with the second observable the KF is less better compared to the rate equation approach, given a constant model complexity (Figure 7a).
Third, we show that the rate equation approach is incapable to produce reliable credibility volumes. Thus, the Bayesian benefit that a parameter uncertainty quantification based on one data set is possible requires to use an algorithm which mimics the autocorrelation of the intrinsic noise, even if accuracy differences are small (Figures 5ac). That is true even under pessimistic signal to experimental noise conditions (Figure 6b). This has consequences also for confidence volumes of maximum likelihood which at best are equivalent to the credibility volume if the likelihood dominates the prior, Jaynes, et all, 1976. For a discussion on limitations and work arounds of maximum likelihood error quantification of a dynamical system, see Joshi, et all, 2006.
Presumably, the "rate equation" calculation is significantly faster than the Kalman filter (particularly when one fits "features", such as activation curves), and repeating a recording may only add seconds or minutes of experiment time, compared to a comprehensive data analysis that likely involves hours and perhaps days. Although obvious, this point can be easily missed by the casual reader and so it would be useful to be mentioned in the manuscript.
Of course, the computational complexity for the calculation of the likelihood in a rate equation approach is smaller than the computational complexity of the Bayesian filter. We confirm the intuition that typically the RE algorithm is 1.8 till 3 times faster (App. 1 Figure 3) with one important exception for which the RE algorithm is slower (App. 1 Figure 4). We hypothesize that this exception is caused by the fact that, the full computational time to gain N samples depends not only on the computational complexity and memory operations but also how many integration steps of the Hamiltonian equations need to be applied to suggest a new sample. If an algorithm such as the RE approach generates much narrower posteriors, the curvature of the posterior is higher and, consequently the HMC sampling would require more integration.
But even if our discussed limits of the RE approach are considered to be unimportant, the bottle neck in the whole scientific process data acquisition or data analysis/modeling depends on the one hand on both the speed of the data acquisition and the complexity of the modeling question. On the other hand, it depends on the resources of the laboratory. Data acquisition of fast voltagegated channels might take many orders of magnitude less time than a conconfocal patchclamp fluorometry experiments in slowly gating HCN channels. Taking other limitations into account, such as e.g. photo toxicity when working with fluorescent ligands, it is not even a question of the time scales but also a matter of how many repetitions can be done at all.
8. A misunderstanding is that a current normalization is mandatory with "rate equation" algorithms. This is really not the case, as shown in Milescu et al., 2005, where it is demonstrated clearly that one can explicitly use channel count and unitary current to predict the observed macroscopic data. Consequently, these quantities can also be estimated, but state variance must be included in the calculation. Without variance, one can only estimate the product i x N, where i is unitary current and N is channel count. This should be clarified in the manuscript: any method that uses variance can be used to estimate i and N, not just the Kalman filter. In fact, the nonstationary noise analysis does exactly that: a modelblind estimation of N and i from nonequilibrium data. Also, one should be realistic here: in some circumstances it is far more efficient to fit data "features", such as the activation curve, in which case the current needs to be normalized.
We apologise for this error and corrected the introduction accordingly. In our old version this sentence referred rather to the inhouse algorithm against which we benchmarked originally. The subtle but important differences in RE approaches were carelessly deleted during the text refinement. We indeed intended this article to be about kinetic scheme estimation not about fitting certain aspects of the data. Thus, we assume a situation where the data is qualified enough to try kinetic scheme inference.
9. It's great that the authors develop a rigorous Bayesian formalism here, but it would be a good idea to explain – even briefly – how to implement a (presumably simpler) maximum likelihood version that uses the Kalman filter. This should satisfy those readers who are less interested in the Bayesian approach and will also be suitable for situations when no prior information is available.
The programming language Stan offers to select from three options: Sampling the posterior, variational Bayesian inference and maximising the posterior. The same code provided by us can be used (see, Appendix 4). If one maximises the parameter of a posterior or simply of a likelihood it is then a matter of statements of prior distributions within the code. There is no need for reimplementation of the code.
10. The Bayesian formalism is not the only way of incorporating prior knowledge into an estimation algorithm. In fact, the reader may have more practical and pressing problems than guessing what the parameter prior distribution should be, whether uniform or Gaussian or other. More likely one would want to enforce a certain KD, microscopic (i)reversibility,
We thank the Reviewer for pointing out that our former text was not stating that adding prior information is not unique to Bayesian statistics. Similar approaches with regard to the maximum likelihood method for algebraic constrains (Salari et al., 2018) and behavioral constraints (Navarro et al., 2018) called penalized maximum likelihood, are now cited. We like to note, that the penalized maximum likelihood is synonymously called maximum a posteriori because the penalizing function is usually the logarithm of a prior (Gelman et all 2013). Thus, we do not think that it is helpful to separate simple prior considerations like uniform, Jeffrey or informative priors on subsets of or single parameters from more complex priors incorporating relations (Navarro et al., 2018) between parameters: For example, we demonstrate in the discussion of the methods section related to (Figures 7 and 9) how enforce from strictly to softly micro reversibility by means of a conditional prior on
In this way we translated the algebraic constraint (Salari et al., 2018) to a rather behavioral constraint (Navarro et al., 2018) which might be chosen to be strict (thus to be effectively algebraic) or soft to allow testing whether for a channel the thermodynamic equilibrium assumption is fulfilled.
As this article is already quite long with investigating the behaviour and the robustness of the likelihood, we hardly dwell into prior selecting questions expect that we demonstrate a way to incorporate microreversibility. In general, all problems which can be formulated as loglikelihood with an added penalizing function for the parameters can be expressed as a prior multiplied by the likelihood and sampled (given that exp[penalizing function] can be integrated).
…an (in)equality relationship between parameters, a minimum or maximum rate constant value, or complex model properties and behaviors, such as maximum Popen or halfactivation voltage.
The programming language Stan employs base parameter types, whose support (the interval between their minimal and maximal values) can be readily defined by the user. The Stan compiler does then the necessary transformation to an unconstrained parameter space where the sampling is takes places. More complex parameter relationships, such as inequalities, are incorporated by other predefined parameter types such as simplexes, ordered vectors etc. Again, the stan compiler does the nasty parameter transformation to an unconstrained space automatically such that these constraints are fulfilled. Thus, testing model assumptions can be implemented fast. A constraint such as P_{open} can quickly be enforced from the fact that
P_{open} = f(K,L) (4)
which means for each parameter sample of the rate matrix K we can calculate the open probability for any ligand concentration L. Suppose that we have derived from previous data posteriors P(P_{open}(L)) of a set of open probabilities P_{open} for a set ligand concentrations. Then we can simply define the prior by
P(P_{open}) = P(f(K,L)) (5)
It is not even necessary that the inverse function f^{−1}(P_{open}) = K exists (which it obviously does not). The only prerequisite for doing this information fusion in a mathematically rigorous attempt is that the previous analysis derived a posterior on P_{open}. Proxy’s such as point estimates and their standard errors of should be used cautiously.
A hard limit for a derived quantity, which is not naturally fulfilled, can be rather tricky. A possible solution is defining a own prior on this derived quantity in the same manner as Equation 2. The Prerequisite is that it can be differentiated such as a sum of Gaussians. With increasing terms in the sum it is possible to model a hard limit as a soft limit. That in turn alleviates hard limit problems.
A comprehensive framework for handling these situations via parameter constraints (linear or nonlinear) and cost function penalty has been recently published (Salari et al/Navarro et al., 2018). Obviously, the Bayesian approach has merit, but the authors should discuss how it can better handle the types of practical problems presented in those papers, if it is to be considered an advance in the field, or at least a usable alternative.
Unless the asymptotic large data limiting case of the Bernsteinvon Mises theorem hasn’t taken over, it is crucial to select a reasonable prior distribution in the Bayesian context. Nevertheless, we restricted ourselves mainly to the behaviour of the likelihood of both algorithms (in the asymptotic strong data case) which offers many things to explore. If faced with the situation that a substantial amount of information enters the parameter inference via soft or inequality constraints (Salari et al/Navarro et al., 2018) one risks a waste of information if one uses a point estimator. That is because if one for example reports only the parameter vector of a penalized maximum likelihood inference and its covariance matrix most of the time one does not report the sufficient statistics of the corresponding posterior (Gelman et al., Chapter 4.5, 2013 third Edition). Take for example an inequality constraint that cuts at one standard deviation distance from the mean through the sampling distribution of a maximum likelihood inference. The mentioned Bernsteinvon Mises theorem is a Bayesian justification for doing maximum likelihood inferences in the asymptotic limit (Gelman et al., Chapter 4.5, 2013 third Edition) when all prior information becomes irrelevant. Thus, we do not see any relevant differences in which statistical framework prior information for a parametric model can be formulated. But Bayesian statistics will make the most of the prior information unless the data completely dominates the prior information. A rather detailed discussion on priors is beyond the scope of this article.
11. The methods section should include information concerning the parameter initialization choices, HMC parameters (e.g. number of steps) and any burnin period used in the analyses used in Figures 36.
We thank the Reviewer for pointing out the missing information: We included in App. 1 a section about the typically used parameter settings and convergence diagnostics. The Stan compiler generates a HMC sampler which adaptively tunes the sampling parameters. Additionally, the noUturn implementation (NUTS) automatically selects an appropriate number of leapfrog steps in each iteration. In that way there is much less knowledge of the user required to operate the program. Typically, we used 4 independent sampling chains with a warm up phase of 3 − 9 · 10^{3} iterations per chain followed by the actual iterations which generate the posterior. Convergence is monitored by the Rˆ statistics (Appendix 1).
For example, how is convergence established? How many iterations does it take to reach convergence? How long does it take to run? How does it scale with the data length, channel count, and model state count? How long does it take to optimize a large model (e.g., 10 or 20 states)? Provide some comparison with the "rate equation method".
We dedicated App. 1 to this discussion
12. In the section on priors, the entire part concerning the use of a β distribution should be removed or replaced, because it is a probabilistic misrepresentation of the actual prior information that the authors claim to have in the manuscript text. The maxentropy prior derived for the situation described in the text (i.e., an unknown magnitude where you don't know any moments but do have upper and lower bounds; the latter could be from the length from the experiment) is actually P(x) = (ln(x_{max}) – ln(x_{min}))^{1} * x^{1}. Reviewers are happy to discuss more with the authors.
We picked up these suggestions. We deleted this section and continued to employ uniform priors. We do not want to advocate a uniform prior for rate matrices. We use the uniform prior to be comparable to plain maximum likelihood, as it is the default method of most researchers, and to reduce the size of this publication. Unfortunately, the article is already quite long which makes us focus on the likelihood side of the Bayesian formalism. Therefore, we restrict the comparison between the algorithms to the situation where the likelihood dominates the uniform prior. Nevertheless, there are clear indications in Figure 3, 7, 8 and 9 that the uniform prior becomes insufficient for patchclamp data, in particular for more complex (data generating) models and if the RE approach is used. We will present a detailed prior investigation in the near future in the archive version of another article.
13. Here and there, the manuscript somehow gives the impression that existing algorithms that extract kinetic parameters by fitting the average macroscopic current ("fitting rate equations") are less "correct", or ignorant of the true mathematical description of the data. This is not the case. Published algorithms often clearly state what data they apply to, what their limitations are, and what approximations were made, and thus they are correct within that defined context and are meant to be more effective than alternatives. Some quick editing throughout the manuscript should eliminate this impression.
We did not intend to generate this impression. What we liked to communicate was to emphasize both the similarities but in several details also relevant differences in the assumptions adopted by both approaches. We revised the manuscript accordingly.
14. The manuscript refers to the method where the data are fitted against a predicted current as "rate equations". However, it is not completely clear what that means. The rate equation is something intrinsic to the model, not a feature of any algorithm. An alternative terminology must be found. Perhaps different algorithms could be classified based on what statistical properties are used and how. E.g., average (+variance) predicted from the starting probabilities (Milescu et al., 2005), full covariance (Celentano and Hawkes, 2004), pointbypoint predictorcorrector (Moffatt, 2007).
We thank the Reviewer for this criticism which helped us to clarify our terminology. We agree that rate equations are something intrinsic of a kinetic scheme but this does not exclude them to be also a feature of an algorithm: A good Bayesian model is always a forward model. This means, given a proper prior, one can simulate parameters and then (given parameters) sample data at every time point from the likelihood which in turn is calculated either by the deterministic rate equation or by the first order Markov process of the KF. For the inverse problem every Bayesian inference algorithm takes the model assumptions of the forward model and uses them to implement a calculation rule for the likelihood and the prior given the data. In this way, it makes sense to us to speak of an algorithm as a mathematical (nonunique) representation of the set of model assumptions/features which allows to calculate the posterior. The same applies to maximum likelihood models except that parameters are fixed and can not be updated once new information is available. They have to be refitted completely. Our terminology is coherent with the terminology of the statistical research community. They mean by statistical model either a likelihood or the combination of likelihood and prior.
15. The manuscript needs line editing and proofreading (e.g., on line 494, "Roa" should be "Rao"; missing an equals sign in equation 13). Additionally, in many paragraphs, several of the sentences are tangential and distract from communicating the message of the paper (e.g., line 55). Removing them will help to streamline the text, which is quite long.
Thanks. We changed Roa to Rao and also streamlined and divided the text. The changes evoked by the detailed criticism slightly prolonged the article.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Essential revisions:
Reviewer #1 (Recommendations for the authors):
The authors develop a Bayesian approach to modeling macroscopic signals arising from ensembles of individual units described by a Markov process, such as a collection of ion channels. Their approach utilizes a Kalman filter to account for temporal correlations in the bulk signal. For simulated data from a simple ion channel model where ligand binding drives pore opening, they show that their approach enhances parameter identifiability and/or estimates of parameter uncertainty over more traditional approaches. Furthermore, simultaneous measurement of both binding and pore gating signals further increases parameter identifiability. The developed approach provides a valuable tool for modeling macroscopic time series data with multiple observation channels.
The authors have spent considerable effort to address the previous reviewer comments, and I applaud them for the breadth and depth of their current analysis.
We are glad that our revision could convince the reviewer.
1. The figure caption titles tend to say what analysis or comparison is being presented, but not what the take home message of the figure is. I suggest changing them to emphasize the latter. This will especially help nonexperts to understand what the figures are trying to convey to them.
We checked and modified the figure caption titles that all provide now a brief take home message.
2. I very much appreciate the GitHub code and examples for running your software. However, I feel that a handholding stepbystep explicit example of running a new model on new data is likely necessary for many to be able to utilize your software. Much more handholding than the current instructions on the GitHub repo.
We followed the suggestion and edited the tutorial https://github.com/
JanMuench/Tutorial_Patchclamp_data/blob/main/README.md
and https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data/blob/main/ README.md to make it more a stepbystep description of how to analyze a new data set with a new model.
3. Figure captions sometimes do not explain enough of what is shown. I appreciate that many of these details are in the main text, but data that is displayed in the figure and not concretely described in the caption can makes the figures hard to follow. e.g. Figure 4a – "With lighter green we indicate the Euclidean error of patchclamp data set." But what data set do the other colors reflect? It is not stated in the caption. Again, I realize this is described in the main text, but it also needs to be defined in the caption where the data is presented. Figure 4d – Please spell out what "both algorithms" intends. Also, a suggestion: instead of having to refer to the caption for the color codes (i.e. RE vs. KF, etc.) it would speed figure interpretation to have a legend in the figures themselves. Few other examples: Box 1. Figure 2. – Please define the solid vs. dashed lines in the caption. Figure 3c – Please define the solid vs. dashed lines in the caption. Figure 12 – "We simulated 5 different 100kHz signals." What kind of signals? Fluorescence I assume, but this needs to be explicitly defined. I'd check all figures for similar.
We edited the Figure 3,4 caption in this regard. Every panel has its legend and the color coding of all panel matches. But in order to keep the figure caption and the figure together small enough that they fit onto one page we took away some of the redundant information from the caption. Additionally, Figure 8, 9 got the information how much probability mass is included in each HDCI. We also edited the caption of Figure 12 and state now that we simulated cPCF data.
Reviewer #3 (Recommendations for the authors):
In this revised manuscript, the Münch et al. have addressed all of my original concerns. It is significantly revised, though, and includes many new investigations of the algorithm's performance. Overall, the narrative of this manuscript is now to introduce an approximation to the solution for a Bayesian Kalman Filter, and then spend time demonstrating that this approximation is reasonable and even better than previous methods. In my opinion, they successfully do this, although, as they mention in their comments, their manuscript is very long.
We thank the reviewer that he appreciates our work.
I am not 100% certain, but the approximation that the author's make seems to be equivalent (or at least similar to) an approximation of the chemical master equation using just the 1st and 2nd moments, which is just the FokkerPlanck equation. The authors should discuss any connection to this approximation, as there is a great deal of literature on this topic (e.g., see van Kampen's book).
We added a respective sentence in the introduction about protein expression studies which approximates the solution of the chemical master equation (CME) with the linear noise approximation to derive a Kalman filter. The linear noise approximation is the van Kampen’s System size expansion “pursued only up to first order”[13], or a linearized FokkerPlanck equation [11].
We added a paragraph in the conclusion to contextualize our method against the background of the classical approximations of the CME. We conclude due to the first order dynamics of chemical reaction network our approach is actually equivalent to the full chemical FokkerPlanck equation.
In Figures 3A and 4D, it is unclear to me what is plotted for the RE and classical Kalman filter (i.e., how is there a posterior if they are not Bayesian methods)? Perhaps it is buried in the methods or appendices, but, if so, it needs to at least be clarified in the figure captions.
We implemented a full Bayesian Version of Milescu 2005, algorithm and Moffatt 2007, algorithm, thus all results in this paper were obtained by posterior sampling. Since we used uniform priors, the mode of the posterior equals the maximum of the likelihood, i.e. the outcome of the classical implementation as a point estimator. However, our implementation of Milescu 2005 algorithm and Moffatt 2007 algorithms has the advantage that all sorts of priors can be applied. Unfortunately, our wording was somewhat ambiguous: Every Kalman filter is Bayesian filter (Moffatt, 2007) regarding the time varying unknown n(t) ensemble state. That is independent of how the time constant parameters such as the rate matrix parameters are inferred. The wording “Bayesian Filter” for our algorithm was used to distinguish it from the previous algorithm (Moffatt, 2007), as only our generalized Bayesian filter allows the inclusion of state dependent noise. We added to the manuscript in the caption and the main text that we implemented both algorithms in the full Bayesian framework.
The Bayesian statistical tests devised to determine success (e.g., on pgs. 1213) seem a little ad hoc, but are technically acceptable. I do not see a need for additional metrics.
Interestingly, we found out is is not that actually not that “ad hoc” as we thought in the beginning. We in fact test the frequentist coverage property of “credibility” intervals or volumes, see [8] Equation 2.1. Usually for parametric Bayesian statistics it is at least asymptotically clear by the Bernstein vonMises theorem (when the data swamped the prior) that coverage property should hold the frequentist interpretation of probability [9]. Just like us, others interpret the Bayesian uncertainty quantification as “invalid” if this frequentist coverage does not hold. [10]. We add a sentence “Noteworthy, that this is a empirical test of how sufficient the Bayesian filter and the RE approach hold frequentist coverage property of their HDCVs.” to indicate the theoretical foundation of our approach.
Line 939: Equation 61 is absolutely not "close to uniform distribution". The α and β parameters of 100.01 are much larger than 1. It is incredibly peaked around 0.5. Perhaps this is a typo?
This was indeed a leftover text fragment from the previous version of the article. We changed it to ”sharply peaked β distribution”.
Line 942: The allowed small breaking of microscopic reversibility in the prior is an interesting idea that I wish the authors would expound upon more.
Thanks for this encouragement. Indeed, we plan to go more into detail on this topic in the future.
Line 712: The authors state that the simulation 'code will be shared upon request'. They should include it with their github pages tutorials for running the examples in case others wish to check their work and/or use it. There is no reason to withhold it.
The simulation code is exemplified here:
https://cloudhsm.itdlz.de/s/QB2pQQ7ycMXEitE and now also in the manuscript.
Line 707: 'Perspectively' is not a commonly used word.
Thanks. We changed “Perspectively, extensions …” to “Prospective extensions …”
Reviewer #4 (Recommendations for the authors):
The authors have addressed all my comments and suggestions. The manuscript is nice and extremely comprehensive, and should advance the field.
Nevertheless, the manuscript is also very long (but justifiably so), and certain statements could be a little clearer. Most of these statements refer to the comparison with the socalled rate equation (RE) methods, with which I'm more familiar. For example:
We thank the Reviewer for his appreciation of our work.
Abstract: "Furthermore, the Bayesian filter delivers unbiased estimates for a wider range of data quality and identifies parameters which the rate equation approach does not identify."
The first part of this statement is not quite true, as Figure 4 shows clearly that the Bayesian estimates are biased (and the authors acknowledge this elsewhere in the manuscript). If they are biased in one "range of data quality", that probably means they are always biased, just to different degrees.
We agree that it is right to say “If they are biased in one “range of data quality” this probably means they are always biased, just to different degrees”. The approximations lead to all sorts of errors including bias. This argument is also supported by the findings of Moffatt, 2007 (Figure 11) in which he detects bias in the deterministic approach (Milescu, 2005) and his own Kalman filter. It can certainly be stated that Figure 4b indicates a bias. However, even for optimistically high signaltonoise in the fluorescence signal as in Figure 4, reasonable large HDCIs such as the 0.95−HDCIs of Figure 4c are covering the true value. Thus, the bias can be interpreted as negligible relative to the parameter uncertainty quantification. One can actually observe how the size (the probability mass) of the HDCV of the Bayesian filter overcomes the bias in Figure 5c. We toned down our statement by adding the word “negligible” in the context with “biased estimates” in the abstract. Considering the differences in the bias in Figure 9, this seems justified to us. We also mention the bias now in the discussion of Figure 4.
This is not surprising, because the Kalman filter is a continuous state approximation to a discrete state process, and the overall estimation algorithm makes a number of approximations to intractable probability distributions. It would definitely be correct to say that the estimates are very good, but not unbiased.
We agree with the reviewer and added to the discussion of Figure 4 a statement indicating the bias of both algorithms.
Second, this statement is also ambiguous. Are you referring to the theoretical nonidentifiability caused by having more parameters in the model than what the combination of estimator+experimental protocol can extract from data? In this case, it's not a matter of certain parameters that cannot be identified, but a matter of how many can be identified uniquely. The more information is extracted from the data, the more unique parameters can be identified, so the Kalman filter should do better. Or, alternatively, are you referring to specific parameters that are poorly identified because, for example, they refer to transitions that are rarely undertaken by the channel? In this case, it would be a matter of obtaining looser or tighter estimates with one method or another, but the parameters should be intrinsically identifiable, I imagine, regardless of the algorithm.
In the context of the whole article this sentence seems to be unambiguous to us. The abstract alone leaves of course room for interpretation but we defined what we mean by unidentified parameters by Equation 16 in the main article. To avoid confusion we changed in the abstract the sentence to “For some data sets it identifies more parameters than the rate equation approach”. By our definition we are able to detect many cases of both practical and structural unidentifiability (Middendorf,Aldrich 2017). But our definition is not equivalent to the definition by (Middendorf,Aldrich 2017). We added an explanatory paragraph after Equation 16.
In any case, it's not clear that the better identifiability is the result of the Bayesian side, or of the predictorcorrector state inference filter. I would guess it is the Kalman filter, but I'm not sure.
Indeed, it is only the “predictorcorrector” side of the algorithm. Unfortunately, our wording was somewhat ambiguous which gives rise to a couple of the Reviewer’s following questions: Every Kalman filter is a Bayesian Filter but not every Bayesian Filter is (the classical) Kalman filter. In fact, Moffatt, 2007 derives his Kalman filter as a Bayesian forward algorithm in continuous space. He thus shows that the forward algorithm and the Kalman filtering performed for discrete and continuous state Markov processes, respectively, are the same. We come to the same conclusion herein, see Equation 59. (Nevertheless, he optimizes the rate matrix etc. via maximum likelihood optimization.)
We additionally describe in Equation 11 the consequences of the openchannel noise, that Equation 11 cannot be derived by the original Kalman filter equations. A more flexible noise modeling is necessary. This leads to our algorithm which is exact up to the second moment. That also includes other statedependent noise such as photon counting noise. We believe that this explicit noise description is the reason for the performance advantage of our algorithm. We would also expect this advantage if our algorithm would be implemented as point estimator. The full Bayesian implementation results in a fully sampled posterior and, in principle, allows for arbitrary priors. Since we used uniform priors, the mode of the posterior equals the maximum of the likelihood, i.e. the outcome of the classical implementation as point estimator.
Perhaps it would be clearer if you said that the KF method produces good estimates and generally improves parameter identifiability compared to other methods, as it extracts more information from the data?
As mentioned above, we changed the abstract to avoid misunderstandings.
Introduction:
32: I'm not sure, but if the intention here is to cite mathematical treatments for estimation, you may add references to the "macroscopic" papers by Celentano, Milescu, Stepaniuk and perhaps a few others that use "rate equations". Also, you may cite Qin et al., 1996, as a single channel paper describing a method used in hundreds of studies.
Thank you for pointing out the Qin paper, 1996. We also refer now to the other proposed papers at that position in the text.
Pg. 3:
51: I remain skeptical that it is a good idea to use "rate equations" (RE) as a term to refer to those methods that are fundamentally different from the approach described here (also see my comment to the first submission).
We follow with our terminology on stochastic processes that of other groups in the community [6], [4], (Walczak 2011). In this terminology, the rate equation or (reaction)rate equation is a deterministic ordinary differential equation describing the time evolution of the first moment (the mean value) of the chemical master equation (CME) [3, 4]. To our understanding the rate equation is the basis of the work of Milescu 2005 which justifies the wording.
The rate equations must always be used to predict a future state from a past or current state, by all methods, explicitly or implicitly, because REs simply describe the channel dynamics. In this very manuscript, Equation 3, central to the Kalman filter formalism, is nothing but a deterministic rate equation with a stochastic term added to approximate the stochastic evolution of an ensemble of channels.
Indeed, all approximations (deterministic or stochastic) which make physically sense should have as the large system limit the solution of the RE and thus solve at least implicitly the RE. We hesitate to use today only the celebrated traditional chemical kinetics perspective [5] that the Res describe the (full) channel dynamics. But we acknowledge that some others do. Also, we are aware of its central importance as one aspect of the evolution of a chemical system. However, we prefer the accepted terminology [4, 5, 12] that the CME is the governing equation of a well stirred ensemble of chemically reacting molecules or in other words of a Markov jump process. Our Bayesian filter uses only stochastic approximation Equation 3 of the CME, not the CME itself.
Strictly speaking, Equation 3 is the solution of the rate equation plus a stochastic perturbation. The perturbation around the mean evolution defines it as a stochastic approach. It is a standard technique to approximate the solution of the CME around the mean evolution (the RE) with a Gaussian perturbation [4],[12]. In this regard our method does not differ from the cited methods but it does differ from the RE approach by its rigorous evaluation of the likelihood. Consequently the solution of the RE without any perturbation does also occur in the Bayesian filter as the time evolution of the mean value of the ensemble vector E[n(t+∆t)n(t)] = Tn(t). Everything else would be mathematically surprising.
In fact, there are some old papers by Fox and some more recent by Osorio (I'm not exactly sure of the name and I don't remember the years) that discuss that approximation and its shortcomings – perhaps that is the source of bias?
We would be really interested to discuss the mentioned papers but without further comments we are not able to find them. We agree that the approximations are the origin of the bias.
Whether that prediction is then corrected, as in the Kalman filter approach, with a corrector step is irrelevant, as far as the rate equations.
It is true that inbetween the correction steps the solution of the RE is used to predict the future mean state and similarly a matrix expression is used to deterministically predict the future covariance. But the defining stochastic perturbation of Equation 3 allows to go beyond a deterministic mean evolution (RE approach) of the system. This is exactly induced by the “correction step”. Therefore, we believe it is appropriate to call Milescu 2005 a deterministic RE approach. On a further note, if one would be only interested in solving the dynamic equations without relation to experimental data, we agree that the difference between REs with added variance and Equation 3 would be one of wording; it is, however, the combination of the stochastic dynamic model with the rigorous likelihood computation via the correction steps which makes the Kalman filter perform differently from RE approaches when applied to data.
Furthermore, it's all a matter of degree. For example, the Milescu et al. approach, which is classified here under the RE umbrella, predicts future states as a mean + variance (or as a mean only), but only using the initial state. There is no correction at each point, as with the Kalman filter method, but there is a correction of the initial state from one iteration to the next (by the way, it's not clear if you implemented that feature here, which would make a big difference). Then, because it considers the stochastical aspect of the data, the Milescu et al. approach should also be considered a stochastic method, just one that doesn't use all the information contained in the data (and so it is fast).
To our understanding we are consistent with the common terminology when distinguishing deterministic (Milescu, 2005) and stochastic modelling (Moffat, 2007) in that way: An illustration of this distinction can be seen in Figure 2. panel A and B in Moffatt, 2007: The RE approach (panel A in Moffatt, 2007) creates a mean signal which is continuous and derivatives can be taken. Panel B and C are only continuous but one cannot take the derivative of the mean signal (at least not in the strict sense of a mathematical limit), which is one central mathematical problem of stochastic methods. In particular that is true for stochastic differential equations which therefore are usually represented in an integrated form such as Equation 3. We like to state that we do appreciate the work of (Milescu, 2005) and (Moffat, 2007) because it is the basis for our stochastic extension elaborated herein. We only disagree with the suggested wording. The following refers to the statement in brackets: We like to draw the Reviewer’s attention to two instances of the article where we state how initialisation was done:
1. Directly after Equation 8 line 224 (which is maybe a bit indirect).
2. After Equation 59 which is explicit.
Similar for the RE approach the equilibrium assumption is used as described in (Milescu, 2005).
Imagine a situation where you ran the same stimulation protocol multiple times, but each time you record only one data point, further and further away from the beginning of the stimulation protocol. A "stochastic" algorithm applied to this type of data would be exactly as described in Milescu et al.
Indeed, in this case the deterministic RE approach (Milescu, 2005) derives a solution which is identical to the solution of the master equation of a closed system if the initial conditions are multinomial [6]. We agree that Milescu, 2005 is a sound approximation of a stochastic process, however, it is not stochastic modeling itself (to our understanding of the terminology), even if the derived likelihood is identical in the mentioned example to solving the master equation.
Of course, in reality all points are recorded in sequence, but that doesn't mean that the approach is not stochastic, just that it's simplifying and discarding some information to gain speed. All methods (such as the Kalman filter described here) make some compromises to reduce complexity and increase speed.
The bottom line is that there is the most basic approach of solving the rate equations deterministically, without considering any variance whatsoever, and then there is everything else.
All methods make compromises and so does the Bayesian filter. Using the word deterministic simply points to the details of the assumptions which are adopted to calculate a likelihood, given the terminology that we chose.
58: "Thus, a time trace with a finite number of channels contains, strictly speaking, only one independent data point." I don't understand this sentence. Could you please clarify?
This is exactly meant in the way the Reviewer constructed his example. The RE plus the multinomial assumption is exact if one uses only one data point per relaxation because implicitly (Milescu, 2005) models each data point as an independent draw from a multinomial distribution (plus the experimental noise) whose probability vector evolves in time governed by the rate equation. But in the common perception that the dynamics of the ion channel can be described by a first order Markov process, every following data point has some conditional dependency on the first one. Thus, every following data point is statistically not independent.
74: "The KF is the minimal variance filter Anderson and Moore (2012). Thus, instead of excessive analog filtering of currents with the inevitable signal distortions Silberberg and Magleby (1993) one can apply the KF with higher analysing frequency on less filtered data."
Yes, but I'm not sure why should we use excessive filtering? Where is this idea coming from?
We thank the Reviewer for his careful reading. We changed the sentence to “The KF is the minimal variance filter [?]. Instead of strong analog filtering of currents to reduce the noise, but with the inevitable signal distortions (Silberberg and Magleby (1993)), we suggest to apply the KF with higher analyzing frequency on minimally filtered data.”
131: "However, even with simulated data of unrealistic high numbers of channels per patch, the KF outperforms the deterministic approach in estimating the model parameters."
It is clearly true from your figures, but please give a number as to what is unrealistic (10,000? 100,000?). Also, outperforms, but by how much? As I commented above and below, all methods tested here seem to produce good estimates under the conditions they were designed for (and even outside), and one might argue that it's not worth adding the additional computational complexity and running time for a possibly small increase in accuracy. How is one to judge that increase in accuracy?
We added a “more than a couple of thousands within one patch” to specify typical channels numbers among patches. In our opinion, evaluating the performance of an algorithm by one number (in terms of accuracy only) does not give all the relevant information. There is the relative distance to the true values which the Reviewer has his focus on. But there is also the problem with the uncertain quantification of the RE (Figure 4,5,6,8,9) approach and with the fact that the rate equation approach sometimes does not identify all parameters (Figure 9) which can be identified by the Bayesian filter. All of these aspects are central points of this article and are lumped together in the word “outperforms”. We do not see how to break this down to a simple number.
To be less vague, we added a sentence at the end of the introduction saying which aspects of performance are considered ”We consider the performance of our algorithm against the gold standards in four different aspects: (I) The relative distance of the posterior to the true values, (II) the uncertainty quantification, here in the form of the shape of the posterior, (III) parameter identifiability, and (IV) robustness against typical misspecifications of the likelihood (such as ignoring that currents are filtered or that the integration time of fluorescence data points is finite) of real experimental data.”
276: Has the RE method been used with the mean + variance or mean only?
We used the (Milescu, 2005) algorithm with the variance information.
Also, how was the initial probability vector calculated with the RE method? If you used the mean + variance, then (as mentioned above), you can't call it deterministic. If you used only the mean, then it is indeed a deterministic method, but it's not the approach described in Milescu et al. Please explain.
The publication (Milescu, 2005) describes how to initialize the time evolution by making use of the equilibrium assumption over all states. In the same way, we solve the transition matrix for its equilibrium vector (explained after Equation 59). For the RE approach the probability vector is evolved in time whereas for the Bayesian filter both the mean vector and the covariance matrix of the initial distribution are evolved in time.
To our understanding the variance is a deterministic parameter describing a stochastic process, but is not stochastic itself. Its definition is based on a mean value
To the best of our knowledge the deterministic differential equation for the evolution of the variance is derived by applying the definition above to the CME [15].
It is the same situation for the initial distribution. The definition
of the mean value which is a deterministic object enables to express the starting probability vector as
This equation holds independently of the fact that all investigated algorithms calculate their initial probability vector by Equation 14 of (Milescu, 2005).
We are aware that an estimator of the mean or the variance or the location of the maximum of the likelihood in parameter space are mathematically random variables because they are datadependent. However, to our understanding of the terminology, that does not necessarily mean that stochastic modeling has been used to define the estimator (of a mean value, a variance or a full likelihood). To exemplify this for the initial probability vector Ξ(t_{0}) both algorithms start with the identical multinomial assumption. But given the current parameter sample the Bayesian filter updates from the multinomial distribution by accounting the first data point y(t_{0}) the first two moments E[n(t_{0})] and P(t_{0}) before the two moments are evolved deterministically in time. This update takes explicitly the stochastic nature of the process into account but it is not present in the RE approach. In our opinion that is meant with the distinction between deterministic and stochastic in the terminology.
229: "added, in Equation 9 to the variance … which originates (Equation 38d) from the fact that we do not know the true system state"
Which noise are you referring to? The state noise or the measurement noise?
Both noise aspects are influencing this term: P(t) is the covariance matrix of the aleatory uncertainty about n(t). Thus, the covariance matrix of the prior distribution over n(t) before the information of the current data Equation 5859 has been taken into account (before the correction step). So it is influenced by the combination of both intrinsic and experimental noise sources Equation 38d. Only at the initial step, where the equilibrium distribution of the channel is used, P(t) is the covariance matrix of the intrinsic noise alone. To clarify this in the article, we added the sentence “P_{t} is the covariance of the prior distribution over n(t) before the KF took y(t) into account” and we used the word “aleatory”.
326: Which are the two algorithms?
We changed “the two algorithms” to “KF and the RE approach”.
278: "Both former algorithms are far away from the true parameter values with their maxima and also fail to cover the true values within reasonable tails of their posterior."
I think the authors might have gotten a little carried away when they made this statement. There is no doubt that the Kalman/Bayesian method produces more accurate estimates, but the other two methods (KF and RE) are very reasonable as well. I see estimates that are within 5, 10, or 20% from the true values, for most parameters. This is not "failure" by any stretch of the imagination. Most people would call this quite good, given the nature of the data.
We replaced “far” by “further” to avoid any subjective definition about what is close and what is far. We added a quantification of parameter divergence: “[…] E.g., the relative error of the maximum of the posterior are ∆k_{21} ≈ 200% for [1] and ∆k_{32} ≈ 240% for [2]. The 4 other parameters including the three equilibrium constants behave less problematic judged by their relative error […]” We also like to direct the attention of the reader to our statement “[…] also fail to cover the true values within reasonable tails of their posterior”, this means