Bayesian inference of kinetic schemes for ion channels by Kalman filtering
Abstract
Inferring adequate kinetic schemes for ion channel gating from ensemble currents is a daunting task due to limited information in the data. We address this problem by using a parallelized Bayesian filter to specify hidden Markov models for current and fluorescence data. We demonstrate the flexibility of this algorithm by including different noise distributions. Our generalized Kalman filter outperforms both a classical Kalman filter and a rate equation approach when applied to patchclamp data exhibiting realistic openchannel noise. The derived generalization also enables inclusion of orthogonal fluorescence data, making unidentifiable parameters identifiable and increasing the accuracy of the parameter estimates by an order of magnitude. By using Bayesian highest credibility volumes, we found that our approach, in contrast to the rate equation approach, yields a realistic uncertainty quantification. Furthermore, the Bayesian filter delivers negligibly biased estimates for a wider range of data quality. For some data sets, it identifies more parameters than the rate equation approach. These results also demonstrate the power of assessing the validity of algorithms by Bayesian credibility volumes in general. Finally, we show that our Bayesian filter is more robust against errors induced by either analog filtering before analogtodigital conversion or by limited time resolution of fluorescence data than a rate equation approach.
Editor's evaluation
The authors develop a Bayesian approach to modeling signals arising from ensembles of ion channels that can incorporate multiple simultaneously recorded signals such as fluorescence and ionic current. For simulated data from a simple ion channel model where ligand binding drives pore opening, they show that their approach enhances parameter identifiability and/or estimates of parameter uncertainty over more traditional approaches. The developed approach provides a valuable tool for modeling macroscopic time series data including data with multiple observation channels.
https://doi.org/10.7554/eLife.62714.sa0Introduction
Ion channels are essential proteins for the homeostasis of an organism. Disturbance of their function by mutations often causes severe diseases, such as epilepsy (Oyrer et al., 2018; GoldschenOhm et al., 2010), sudden cardiac death (Clancy and Rudy, 2001), or sick sinus syndrome (Verkerk and Wilders, 2014) indicating a medical need (GoldschenOhm et al., 2010) to gain further insight into the biophysics of ion channels. The gating of ion channels is usually interpreted by kinetic schemes which are inferred either from macroscopic currents with rate equations (REs) (Colquhoun and Hawkes, 1995b; Celentano and Hawkes, 2004; Milescu et al., 2005; Stepanyuk et al., 2011; Wang et al., 2012) or from singlechannel currents using dwell time distributions (Neher and Sakmann, 1976; Colquhoun et al., 1997a; Horn and Lange, 1983; Qin et al., 1996; Epstein et al., 2016; Siekmann et al., 2016) or hidden Markov models (HMMs) (Chung et al., 1990; Fredkin and Rice, 1992; Qin et al., 2000; Venkataramanan and Sigworth, 2002). A HMM consists of a discrete set of metastable states. Changes of their occupation occur as random events over time. Each state is characterized by transition probabilities, related to transition rates, and a probability distribution of the observed signal (Rabiner, 1989). It is becoming increasingly clear that the use of Bayesian statistics in HMM estimation constitutes a major advantage (Ball et al., 1999; De Gunst et al., 2001; Rosales et al., 2001; Rosales, 2004; Gin et al., 2009; Siekmann et al., 2012; Siekmann et al., 2011; Hines et al., 2015; Sgouralis and Pressé, 2017b; Sgouralis and Pressé, 2017a; KinzThompson and Gonzalez, 2018). In ensemble patches, simultaneous orthogonal fluorescence measurement of either conformational changes (Zheng and Zagotta, 2000; Taraska and Zagotta, 2007; Taraska et al., 2009; BrueningWright et al., 2007; Kalstrup and Blunck, 2013; Kalstrup and Blunck, 2018; Wulf and Pless, 2018) or ligand binding itself (Biskup et al., 2007; Kusch et al., 2010; Kusch et al., 2011; Wu et al., 2011) has increased insight into the complexity of channel activation.
Currently, a Bayesian estimator that can collect information from crosscorrelations and time correlations inherent in multidimensional signals of ensembles of ion channels is still missing. Traditionally, macroscopic currents are analyzed with solutions of REs which yield a point estimate of the rate matrix or its eigenvalues (Colquhoun et al., 1997a; Sakmann and Neher, 2013; d’Alcantara et al., 2002; Milescu et al., 2005; Wang et al., 2012) if they are fitted to the data. The RE approach is based on a deterministic differential equation derived by averaging the chemical master equation (CME) for the underlying kinetic scheme (Kurtz, 1972; Van Kampen, 1992; Jahnke and Huisinga, 2007). Its accuracy can be improved by processing the information contained in the intrinsic noise (stochastic gating and binding) (Milescu et al., 2005; Munsky et al., 2009). Nevertheless, all deterministic approaches do not use the information of the time and crosscorrelations of the intrinsic noise. These deterministic approaches are asymptotically valid for an infinite number of channels. Thus, a time trace with a finite number of channels contains, strictly speaking, only one independent data point. Previous rigorous attempts to incorporate the autocorrelation of the intrinsic noise of current data into the estimation (Celentano and Hawkes, 2004) suffer from cubic computational complexity (Stepanyuk et al., 2011) in the amount of data points, rendering the algorithm nonoptimal or even impractical for a Bayesian analysis of larger data set. To understand this, note, that a maximum likelihood optimization (ML) usually takes several orders of magnitude fewer likelihood evaluations to converge compared to the number of posterior evaluations when one samples the posterior. One Monte Carlo iteration (Betancourt, 2017) evaluates the posterior distribution and its derivatives many times to propose one sample from the posterior. Stepanyuk suggested an algorithm (Stepanyuk et al., 2011; Stepanyuk et al., 2014) which derives from the algorithm of Celentano and Hawkes, 2004 but evaluates the likelihood quicker. Under certain conditions, Stepanyuk’s algorithm can be faster than the Kalman filter (Moffatt, 2007). The algorithm by Milescu et al., 2005 achieves its superior computation time efficiency at the cost of ignoring the time correlations of the fluctuations. A further argument for our approach, independent of the Bayesian context, is investigated in this paper: The KF is the minimal variance filter (Anderson and Moore, 2012). Instead of strong analog filtering of currents to reduce the noise, but with the inevitable signal distortions (Silberberg and Magleby, 1993), we suggest to apply the KF with higher analyzing frequency on minimally filtered data.
Phenomenological difference between an RE approach and our Bayesian filter
Two major problems for parameter inference for the dynamics of the ion channel ensemble $\mathbf{n}(t)$ are: (I) that currents are only lowdimensional observations (e.g. one dimension for patchclamp or two for cPCF) of a highdimensional process (dimension being the number of model states) blurred by noise and (II) the fluctuations due to the stochastic gating and binding process cause autocorrelation in the signal. Traditional analyses for macroscopic PC data (and also for related fluorescence data) by the RE approach, e.g. Milescu et al., 2005 ignores the longlasting autocorrelations of the deviations (Box 1—figure 1a) blue and orange curves from the mean time trace (black) that occur in real data measured from a finite ensemble. To account for the autocorrelation in the signal, an optimal prediction (Box 1—figure 1b) of the future signal distribution $\mathbb{P}(y({t}_{2}))$ should use the measurement $y({t}_{1})$ from the current time step t_{1} to update the belief about the underlying $\mathbf{n}({t}_{1})$. Based on stochastic modelling of the time evolution of the channel ensemble, it then predicts $\mathbb{P}(y({t}_{2}))$.
To demonstrate the difference how the two algorithms analyze the data, we compute the autocorrelation of the residuals of the data. After the analysis with either the RE approach or the KF, we can construct from the model with the mean predicted signal $\mathbf{H}\mathbb{E}[\mathbf{n}({t}_{i})]$ (see Eq. 4 for the definition of $\mathbf{H}$) and the predicted standard deviation $\sqrt{\mathrm{v}\mathrm{a}\mathrm{r}[y({t}_{i})]}$ the normalized residual time trace of the data which are defined as
Filtering (fitting) with the KF (given the true kinetic scheme) one expects to find a whitenoise process for the residuals. Plots of the autocorrelation function of both signal components (Box 1—figure 2a) confirms our expectation. The estimated autocorrelation vanishes after one multiple of the lag time (the interval between sampling points), which means that the residuals are indeed a whitenoise process. In contrast, the residuals derived from the RE approach (Box 1—figure 2b) display long lasting periodic autocorrelations.
On the one hand, a complete HMM analysis (forward algorithm) would deliver the most exact likelihood of macroscopic data. On the other hand, the computational complexity of the forward algorithm limits this type of analysis in ensemble patches to no more than a few hundred channels per time trace (Moffatt, 2007). To tame the computational complexity (Jahnke and Huisinga, 2007), we approximate the solution of the CME with a Kalman filter (KF), thereby remaining in a stochastic framework Kalman, 1960. This allows us to explicitly model the time evolution of the first two moments (mean value and covariance matrix) of the probability distribution of the hidden channel states. Notably, for linear (first or pseudo) Gaussian system dynamics, the KF is optimal in producing a minimal prediction error for the mean state. KFs have been used previously in several protein expression studies which also demonstrate the connection of the KF to the linear noise approximation (Komorowski et al., 2009; Finkenstädt et al., 2013; Fearnhead et al., 2014; Folia and Rattray, 2018; Calderazzo et al., 2019; Gopalakrishnan et al., 2011).
Our approach generalizes the work of Moffatt, 2007 by including statedependent fluctuations such as openchannel noise and Poisson noise in additional fluorescence data. A central technical difficulty which we solved is that due to the statedependent noise the central Bayesian update equation loses its analytical solution. We derived an approximation which is correct for the first two moments of the probability distributions. Stochastic rather than deterministic modeling is generally preferable for small systems or nonlinear dynamics (Van Kampen, 1992; Gillespie and Golightly, 2012). However, even with simulated data of unrealistic high numbers of channels per patch (more than several thousands within one patch), the KF outperforms the deterministic approach in estimating the model parameters. Moffatt, 2007 already demonstrated the advantage of the KF to learn absolute rates from time traces at equilibrium. Like all algorithms that estimate the variance and the mean (Milescu et al., 2005) the KF can infer the number of channels ${N}_{\mathrm{ch}}$ for each time trace, the singlechannel current $i$ and analogous in optical recordings the mean number ${\lambda}_{\text{b}}$ of photons from bound ligands per recorded frame. To select models and to identify parameters, stochastic models are formulated within the framework of Bayesian statistics where parameters are assigned uncertainties by treating them as random variables (Hines, 2015; Ball, 2016). In contrast, previous work on ensemble currents combined the KF only with ML estimation (Moffatt, 2007). Difficulties in treating simple stochastic models by ML approaches in combination with the KF (AugerMéthé et al., 2016), especially with nonobservable dynamics, justify the computational burden of Bayesian statistics. Bayesian statistics has an intuitive way to incorporate soft or hard constrains from diverse sources of prior information. Those sources include mathematical prerequisites, other experiments, simulations or theoretical assumptions. They are applied as additional model assumptions by a prior probability distribution over the possible parameter space. Hence, knowledge of the model parameters prior to the experiment are correctly accounted for in the analyzes of the new data. Alternatively, some of these benefits of prior knowledge can be incorporated by penalized maximum likelihood (Salari et al., 2018; Navarro et al., 2018). Bayesian inference provides outmatching tools for modeling over point estimates: First, the Bayesian approach is still applicable in situations where parameters are not identifiable (Hines et al., 2014; Middendorf and Aldrich, 2017b) or posteriors are nonGaussian, whereas ML fitting ceases to be valid (Calderhead et al., 2013; Watanabe, 2007). Second, a Bayesian approach provides superior model selection tools for singular models such as HMMs or KFs Gelman et al. (2014). Third, Bayesian statistics has a correct uncertainty quantification (Gillespie and Golightly, 2012) based on the data and the prior for the statistical problem. In contrast, ML or maximum posterior approaches lack uncertainty quantification based on one data set (Joshi et al., 2006). Only under optimal conditions their uncertainty quantification becomes equivalent to Bayesian credibility volumes (Jaynes and Kempthorne, 1976). This study focuses on the effects on the posterior due to formulating the likelihood via a KF instead of an RE approach and the benefits of adding a second dimension of observation. We consider the performance of our algorithm against the gold standards in four different aspects: (I) The relative distance of the posterior to the true values, (II) the uncertainty quantification, here in the form of the shape of the posterior, (III) parameter identifiability, and (IV) robustness against typical misspecifications of the likelihood (such as ignoring that currents are filtered or that the integration time of fluorescence data points is finite) of real experimental data.
Results and discussion
Simulation of ligandgated ionchannel data
Here we treat an exemplary ligandgated channel with two ligand binding steps and one openclosed isomerization described by an HMM (see Figure 1a). For this model, confocal patchclamp fluorometry (cPCF) data were simulated: time courses of ligand binding and channel current upon concentration jumps were generated (see Appendix 5 and Materials and methods section). Idealized example data with added white noise are shown in Figure 1b–d. We added realistic instrumental noise to the simulated data (see Appendix 5). A qualitative description of the statistical problem that needs to be addressed when modeling time series data such as the simulated is outlined in Box. 1.
Kalman filter derived from a Bayesian filter
Here and in the Materials and methods section, we derive the mathematical tools to account correctly for the stochastic Markov dynamics of single molecules in the fluctuations of macroscopic signals. The KF is a Bayesian filter (see Materials and methods), that is a continuous state HMM with a multivariate normal transition probability Ghahramani, 1997 (Figure 2a). We define the hidden ensemble state vector
which counts the number of channels in each state $\mathbf{s}$ (see Methods). To make use of the KF, we assume the following general form of the dynamic model: The evolution of $\mathbf{n}(t)$ is determined by a linear model that is parametrized by the state evolution matrix $\mathrm{T}$
where ∼ means sampled from and $\mathcal{N}(\cdot \mathit{\mu},\mathbf{\Sigma})$ is a shorthand for the multivariate normal distribution, with the mean μ and the variancecovariance matrix $\mathbf{\Sigma}$. The state evolution matrix (transition matrix) is related to the rate matrix $\mathbf{K}$ by the matrix exponential $\mathbf{T}=\mathrm{exp}(\mathbf{K}\mathrm{\Delta}t)$. The mean of the hidden state evolves according to the equation $\mathbb{E}[{\mathbf{n}}_{t+1}{\mathbf{n}}_{t}]=\mathbf{T}{\mathbf{n}}_{t}$. It is perturbed by normally distributed white process noise $\mathit{\omega}$ with the following properties: The mean value of the noise fulfills $\mathbb{E}[{\mathit{\omega}}_{t}]=0$ and the variancecovariance matrix of the noise is $\mathrm{cov}[{\mathit{\omega}}_{t},{\mathit{\omega}}_{t}]=\mathbf{Q}(\mathbf{T},{\mathbf{n}}_{t})$ (see Materials and methods Equation 38d, Ball, 2016). In short, Equation 3 defines a Gaussian Markov process.
The observations $\mathbf{y}}_{t$ depend linearly on the hidden state $\mathbf{n}}_{t$. The linear map is determined by an observation matrix $\mathbf{H}$.
The noise of the measurement setup (Appendix 5 and Equation 43) is modeled as a random perturbation of the mean observation vector. The noise fulfills $\mathbb{E}[\mathit{\nu}]=0$ and $\mathrm{cov}[{\mathit{\nu}}_{t},{\mathit{\nu}}_{t}]={\mathbf{\Sigma}}_{t}$. Equation 4 defines the stateconditioned observation distribution $\mathbb{O}$ (Figure 2a). If the system strictly obeys Equation 3 and Equation 4 then the KF is optimal in the sense that it is the minimum variance filter of that system Anderson and Moore, 2012. If the distributions of ν and ω are not normal, the KF is still the minimum variance filter in the class of all linear filters but there might be better nonlinear filters. In case of colored noise ν and ω the filtering equations (see Materials and methods) can be reformulated by state augmentation or measurementtimedifference approach techniques Chang, 2014. For each element in a sequence of hidden states $\{{\mathbf{n}}_{t}:0<t<T\}$ and for a fixed set of parameters $\mathit{\theta}$, an algorithm based on a Bayesian filter (Figure 2a), explicitly exploits the conditional dependencies of the assumed Markov process. A Bayesian filter recursively predicts prior distributions for the next $\mathbf{n}}_{t$
given what is known about $\mathbf{n}}_{t1$ due to $y}_{t1$. The KF as a special Bayesian filter assumes that the transition probability is multivariate normal according to Equation 3
Note, that Equation 6 is a central approximation of the KF. While the exact transition distribution of an ensemble of ion channels is the generalizedmultinomial distribution (Methods Equation 32), the quality of normal approximations to multinomial Milescu et al., 2005 or generalizedmultinomial Moffatt, 2007 distributions depends on the number of ion channels ${N}_{\mathrm{ch}}$ in the patch and on the position of the probability vector in the simplex space. The difference between the loglikelihoods of the true generalizedmultinomial dynamics and Equation 6 type approximation scales as $1/{N}_{\mathrm{c}\mathrm{h}}$ Moffatt, 2007. As a rule of thumb one should be careful with both algorithms for time traces with ${N}_{\mathrm{c}\mathrm{h}}\in \left[{10}^{1},{10}^{2}\right]$. Below or even inside this interval there are more qualified concepts such as the forward algorithm or even particle filters (Golightly and Wilkinson, 2011; Gillespie and Golightly, 2012) which avoid the normal approximation.
Each prediction of $\mathbf{n}}_{t$ (Equation 6) is followed by a correction step,
that allows to incorporate the current data point into the estimate, based on the Bayesian theorem (Chen, 2003). Additionally, the KF assumes (Anderson and Moore, 2012; Moffatt, 2007) a multivariate normal observation distribution
If the initial prior distribution is multivariate normal then due to the mathematical properties of the normal distributions the prior and posterior $\mathbb{P}(\cdot )$ in Equation 8 become multivariate normal Chen, 2003 for each time step. In this case, one can derive algebraic equations for the prediction (Materials and methods Equation 37, Equation 38d) and correction (Materials and methods Equation 58 and Equation 58) of the mean and covariance. The algebraic equations originate from the fact that a normal prior is the conjugated prior for the mean value of a normal likelihood. Due to the recursiveness of its equations, the KF has a time complexity that is linear in the number of data points, allowing a fast algorithm. The denominator of Equation 8 is the normal distributed marginal likelihood $\mathbb{L}({\mathbf{y}}_{t}{\mathcal{Y}}_{t1},\mathit{\theta})$ for each data point, which constructs by
a product marginal likelihood of normal distributions of the whole time trace $\mathcal{Y}}_{\mathrm{T}}=\{{\mathbf{y}}_{1},\dots ,{\mathbf{y}}_{{N}_{\mathrm{T}}}\$ of length ${N}_{\mathrm{T}}$ for the KF. For the derivation of $\mathbf{P}}_{t$ and $\mathbf{\Sigma}}_{t$ see Materials and methods (Equation 38d) and Equation 43. $\mathbf{P}}_{t$ is the covariance of the prior distribution over $\mathbf{n}(t)$ before the KF took $\mathbf{y}(t)$ into account. The likelihood for the data allows to ascribe a probability to the parameters $\mathit{\theta}$, given the observed data (Methods Equation 20). An illustration for the operation of the KF on the observation space is given in Figure 2b. The predicted mean signal $\mathbf{H}\mathbb{E}[\mathbf{n}(t)]$ corresponds to binding degree $B(t)=\frac{\mathbf{H}\mathbb{E}[\mathbf{n}(t){]}_{1}}{{N}_{\mathrm{c}\mathrm{h}}}$ and open probability $P}_{O}(t)=\frac{\mathbf{H}\mathbb{E}[\mathbf{n}(t){]}_{2}}{{N}_{\mathrm{c}\mathrm{h}}$. These values are plotted as vector trajectories.
The standard KF (Moffatt, 2007; Anderson and Moore, 2012; Chen, 2003) has additive constant noise ${\mathrm{\Sigma}}_{t}=const$ in the observation model. Thus, in this case a constant variance term $\mathrm{\Sigma}$ is added, in Equation 9 to the aleatory variance $\mathbf{H}{\mathbf{P}}_{t}{\mathbf{H}}^{\mathrm{\top}}$ which, as mentioned above, originates (Equation 38d) from the the fact that we do not know the true system state $\mathbf{n}(t)$. For signals with Poissondistributed photon counting or openchannel noise, we need to generalize the noise model to account for additional whitenoise fluctuations with $\mathbf{n}(t)$dependent variance. For instance, in singlechannel currents additional noise is often observed whose variance is referred to by $\sigma}_{\mathrm{o}\mathrm{p}}^{2$. In macroscopic currents this additional noise can be modeled by a term ${\sigma}_{\mathrm{o}\mathrm{p}}^{2}{n}_{4}(t)$, causing statedependency of our noise model.
The second noise term ${\nu}_{\mathrm{op}}$ is defined in terms of the first two moments $\mathbb{E}({\nu}_{\mathrm{o}\mathrm{p}})=0$ and $\mathrm{v}\mathrm{a}\mathrm{r}({\nu}_{\mathrm{o}\mathrm{p}})=\mathbb{E}({\nu}_{\mathrm{o}\mathrm{p}}^{2})={\sigma}_{\mathrm{o}\mathrm{p}}^{2}{n}_{4}(t)$. To the best of our knowledge such a statedependent noise makes the integration of the denominator of Equation 8 (which is also the incremental likelihood) intractable
This is because the state distribution $\mathcal{N}(\mathbf{n}\overline{\mathbf{n}}(t),\mathbf{P}(t))$ as the prior also influences the variance parameter of the likelihood which means that the conjugacy property is lost. While a normal distribution is the conjugated prior of the mean of a normal likelihood, it is not the conjugated prior for the variance. However, by applying the theorem of total variance decomposition Equation 46a we deduce a normal approximation to Equation 8 and to the related problem of Poissondistributed noise in fluorescence Equation 57, Equation 55a data. By computing the mean and the variance or covariance matrix of the signal, we can reformulate the noise model to fit the form of the traditional KF framework. Note, that the derived equations for the covariance matrix are still exact for the more general noise model. Mean and covariance just do not form a set of sufficient statistics anymore.
Our derivation is not limited to ligandgated ion channels. For example, when investigating voltagegated channels, the corresponding noise model can be easily adapted. This holds also when using the P/n protocol for which the noise model resembles that of the additional variance in the fluorescence signal. The additional variance is induced because the mean signal from the ligands swimming in the bulk (Materials amd methods Equation 43 Appendix 5) is eliminated by subtracting scaled mean reference signal which itself has an error. This manipulation adds additional variance to the resulting signal comparable to P/n protocol. Other experimental challenges, as for example series resistance compensation promoting oscillatory behavior of the amplifier, deserve certainly advanced treatment. Nevertheless, for voltageclamp experiments with a rate equation approach it also becomes clear (Lei et al., 2020) that modeling of the actual experimental limitations, including series resistance, membrane and pipette capacitance, voltage offsets, imperfect compensations by the amplifier, and leak currents are necessary for consistent kinetic scheme inference.
The Bayesian posterior distribution
encodes all information from model assumptions and experimental data used during model training (see Materials and methods). A full Bayesian inference is usually not an optimization (finding the global maximum or mode of the posterior or likelihood) but calculates all sorts of quantities derived from the posterior distribution such as mean values of any function $f$ including the mean value or covariance matrix of the parameters themselves or even the likelihood of the data.
Besides the covariance matrix of the parameter to express parameter uncertainty, the posterior allows to calculate a credibility volume. The smallest volume $V}_{P$ that encloses a probability mass $P$ of
is called the Highest Density Credibility Volume/Interval (HDCV/HDCI). Those credibility volumes should not be confused with confidence volumes although under certain conditions they can become equivalent. Given that our model sufficiently captures the true process, the true values $\mathit{\theta}}_{\mathrm{t}\mathrm{r}\mathrm{u}\mathrm{e}$ will be inside that volume with a probability $P$. Unfortunately, typically there is no analytical solution to Equation 12 . However, it can be solved numerically with Monte Carlo techniques, enabling to calculate all quantities related to Equation 13 and Equation 14 . Our algorithm uses automatic differentiation of the statistical model to sample from the posterior (Appendix 1—figure 1a) via Hamiltonian Monte Carlo (HMC) (Betancourt, 2017), see Appendix 7 , as provided by the Stan software (Hoffman and Gelman, 2014; Gelman et al., 2015).
Benchmark for PC data against the gold standard algorithms
We compare the posterior distribution (Figure 3) of our algorithm against Bayesian versions of the deterministic (Milescu et al., 2005) and stochastic (Moffatt, 2007) algorithms, which we consider as the gold standard algorithms for macroscopic patchclamp data. Simulated currents of a patch with $N}_{ch}=5\cdot {10}^{3$ are shown in (Figure 3d). The resulting posteriors (Figure 3a) show that both former algorithms are further away from the true parameter values with their maxima or mean values (Figure 3a). E.g., the relative error of the maximum of the posterior are $\mathrm{\Delta}{k}_{21}\approx 200\mathrm{\%}$ for Milescu et al., 2005 and $\mathrm{\Delta}{k}_{32}\approx 240\mathrm{\%}$ for Moffatt, 2007 . The four other parameters including the three equilibrium constants behave less problematic as judged by their relative error. Additionally, if one does not only judge the performance by the relative distance of maximum (or some other significant point) of the posterior but considers the spread of the posterior as well, it becomes apparent, that the marginal posterior of both former algorithms fail to cover the true values within at least the reasonable parts of their tails. Accordingly, for maximum likelihood inferences the true value would be far outside the estimated confidence interval. For the RE approach only the marginal posterior of $\stackrel{~}{K}}_{21$ is nicely centered over the true values and the marginal of $\stackrel{~}{K}}_{32$ could be considered to cover within a reasonable part of the distribution the true value. Uncertainty quantification is investigated in more detail further down (Figures 4—9). Note that in Figure 3a, parameter unidentifiability by heavy tails/ multiple maxima of the posterior distribution or (anti) correlation is easily visible as non axial symmetric patterns.

Figure 4—source data 1
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data1v2.zip

Figure 4—source data 2
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data2v2.zip

Figure 4—source data 3
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data3v2.zip

Figure 4—source data 4
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data4v2.zip

Figure 4—source data 5
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data5v2.zip

Figure 4—source data 6
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data6v2.zip

Figure 4—source data 7
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data7v2.zip

Figure 4—source data 8
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data8v2.zip

Figure 4—source data 9
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data9v2.zip

Figure 4—source data 10
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data10v2.zip

Figure 4—source data 11
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data11v2.zip

Figure 4—source data 12
 https://cdn.elifesciences.org/articles/62714/elife62714fig4data12v2.zip
To assess the location of the posterior conditioned on ${N}_{\mathrm{ch}}$, we select the median vector $\mathit{\theta}$ of the marginal posteriors and calculate its Euclidean distance to the true values by:
This defines a single value to judge the overall accuracy of the posterior. Varying ${\sigma}_{\text{op}}/i$ reveals the range of the validity (Figure 3b) of the algorithm (red) from Moffatt, 2007 . While both stochastic approaches are nearly equivalent for low openchannel noise, the RE (blue) performs consistently poorer. It may seem surprising that even for In fact, the KF method beha the two stochastic algorithms start to produce different results. But considering the scaling (Materials and methods Equation 46a) of the total openchannel noise (top axis) from currents of an ensemble patch $\propto {({N}_{\mathrm{ch}}{P}_{\mathrm{open},\mathrm{max}}0.5)}^{0.5}{\sigma}_{\mathrm{open}}$ one sees that if $\propto {({N}_{\mathrm{ch}}{P}_{\mathrm{open},\mathrm{max}}0.5)}^{0.5}{\sigma}_{\mathrm{open}}$ approaches σ the traditional KF suffers from ignoring state dependent noise contributions. The lower scale changes with experiments (e.g. ${N}_{\mathrm{ch}}$ and ${\sigma}_{op}$). In contrast, the upper scale is largely independent of the particular measurements. The two different normalizations indicate an experimental intuition: “ Why should I consider the extra noise from the open state of the single channel if only ${\sigma}_{\mathrm{op}}/i={\sigma}_{\mathrm{op}}/\sigma \approx 0.01$” is misleading. The small advantage of our algorithm for small ${\sigma}_{\mathrm{op}}/i$ over Moffatt, 2007 is due to the fact that we could apply an informative prior in the formulation of the inference problem on ${\sigma}_{\mathrm{exp}}\sim \mathrm{normal}({\sigma}_{\mathrm{exp},\mathrm{true}}^{2},{\sigma}_{\mathrm{exp},\mathrm{true}}^{2}\cdot 0.01)$ by taking advantage of our generalization (Equation 46a) Bayesian filter. Further, Figure 3b indicates the importance that the functional form of the likelihood is flexible enough to capture the second order statistics of the noise of the data sufficiently.
For an increasing data quality, which in our benchmark is an increasing ${N}_{\mathrm{ch}}$ per trace, we show (Figure 3c) that the deterministic RE and our Bayesian filter are consistent estimators, that is they converge in distribution to the true parameter values with their posterior maxima or median for increasing data quality. The scaling of the RE approach (blue) and our Bayesian filter (green) vs. ${N}_{\mathrm{ch}}$ shows that for large ${N}_{\mathrm{ch}}$ both algorithms seem to have a constant error ratio relative to each other. They are both well described by $\mathrm{error}({N}_{\mathrm{ch}})\propto a/\sqrt{{N}_{\mathrm{ch}}}$ with an error ratio computed from the fit of 4.4. Thus, although our statistical model is singular (meaning that the fisher information matrix is singular Watanabe, 2007), its asymptotic learning behaviour is similar to a regular model (Figure 4c) which, however, means that the euclidean error from both algorithms stays different also for large ${N}_{\mathrm{ch}}$. For data with $N}_{\mathrm{c}\mathrm{h}}<2\cdot {10}^{3$ the samples from the posterior typically indicate that the posterior is improper which is defined as
We consider this as the case of unidentified parameters. This datadriven definition is in so far different from structural and practical identifiability definitions (Middendorf and Aldrich, 2017a; Middendorf and Aldrich, 2017b) as the two latter cases are not distinguished. Still the practical consequence of structural or practical unidentifiability, which is usually an improper posterior, is captured. Cases of structural or practical unidentifiability which lead to a confined region of constant posterior density will be considered identified as the posterior is still normalizable thus the uncertainty quantification will still be correct, even when this finding is not sufficient to answer the research question at hand.
Benchmarking for cPCF data against the gold standard algorithm
For the simulated time traces with an optimistically high signaltonoise assumption, the posterior of the KF (from hereon KF denotes our Bayesian Filter) and a RE (Milescu et al., 2005) approach are compared for cPCF data (Figure 4a–d). For a brief introduction of the RE approach, see Appendix 8 . The failure to analyze PC data with moderate openchannel noise (Moffatt, 2007; Figure 3a) disqualifies the classical KF with its constant noise variance also as a useful algorithm for fluorescence data, because here the Poisson distribution of the signal generates an even stronger state dependency of the signal variance.
By “high signaltonoise assumption” , we refer to an experimental situation with a standard deviation of the current recordings ${\sigma}_{\mathrm{ex}}/i=0.5$, a low additional ${\sigma}_{\mathrm{op}}/i=0.05$, and a high mean photon rate per bound ligand and frame ${\lambda}_{\mathrm{b}}=5$. Additionally, we assume vanishing fluorescence background noise generated by the ligands in the bulk. The benefit of the high signaltonoise is that the limitations of the two different approximations to the stochastic process of binding and gating can be investigated without running into the risk of being compensated or obscured by the noise from the experimental setup. For these experimental settings (Figure 4a), we calculate the Euclidean distance of the median (Equation 15) for different ${N}_{\mathrm{ch}}$. For ${N}_{\mathrm{c}\mathrm{h}}<500$ (gray shaded area in Figure 4a), the Euclidean error of both algorithms is roughly the same. On the single parameter level (Figure 4b), this can be seen as an onset of correlated deviations from the true value for both algorithms. Each marginal posterior has for each ${N}_{\mathrm{ch}}$ a similar deviation in magnitude and direction. That is in particular true for ${\stackrel{~}{k}}_{32}$ and ${\stackrel{~}{K}}_{32}$ which dominate Equation 15 . In spite of the correlation in direction of the errors of ${\stackrel{~}{k}}_{21}$ and ${\stackrel{~}{K}}_{21}$ their magnitude is still smaller for the KF. In summary, this indicates that in this regime the approximations to the involved multinomial distributions fail in a similar manner for both algorithms. That implies that treating the autocorrelation of the gating and binding becomes similar important compared to the error induced by normal approximations (which are used by the KF and the RE approach). For larger ${N}_{\mathrm{ch}}$, the Euclidean error of the RE is on average 1.6 times larger than the corresponding error of the posterior mode of the KF, which we deduce by fitting the function $\mathrm{error}({N}_{\mathrm{ch}})=\frac{a}{\sqrt{{N}_{\mathrm{ch}}}}$. On the one hand, both algorithms are better in approaching the true values than with patchclamp data alone. On the other hand, the smaller error ratio means, that adding a second observable constrains the posterior, such that much of the overfitting is prevented for the RE approach. By overfitting, we define the adaptation of any inference algorithm to the specific details of the used data set due to experimental and intrinsic noise which is aggravated if too complex kinetic schemes are used. Similarly, (Milescu et al., 2005) showed that the over fitting tendency of the RE can be reduced if the autocorrelation of the data is eliminated. The dotted green curve derives from PC data. The Euclidean error is roughly an order of magnitude larger for ${N}_{\mathrm{c}\mathrm{h}}>2000$. Thus, in this regime the cPCF data set is equivalent to 10^{2} fold more time traces or 10^{2} more ${N}_{\mathrm{ch}}$ in a similar PC data set. For ${N}_{\mathrm{c}\mathrm{h}}<2000$ only cPCF establishes parameter identifiability (given a data set of 10 ligand concentrations and no other prior information). In Figure 4b(16), we demonstrate the 0.95HDCI (Equation 14) of all parameters and their modes vs. ${N}_{\mathrm{ch}}$. Even though the Bayesian filter and the RE approach are both consistent estimators, the RE approach covers the true values with its 0.95HDCI only occasionally. The modeling assumption of the RE approach of treating each data point as if it does not come from a Markov process but from an individual draw from a multinomial distribution with deterministically evolving mean and variance makes the parameter estimates overly confident (Figure 4b(16)) . A likely explanation can be found by analyzing the extreme case where data points are sampled at high frequency relative to the time scales of the channel dynamics. The RE approach treats each data point as a new draw from Equation 67 while in reality the ion channel ensemble had no time to evolve into a new state. In contrast, the KF updates its information about the ensemble state after incorporating the current data point and then predicts from this updated information the generalised multinomial distribution of the next data point. For ${N}_{\mathrm{c}\mathrm{h}}>200$, the marginal posterior of the KF usually contains the true value. Nevertheless, one might depict a bias in both algorithms, in particular (Figure 4b 2,4) for ${\stackrel{~}{k}}_{32}$ and ${\stackrel{~}{K}}_{32}$ for $N}_{\text{ch}}<2\cdot {10}^{3$, similar to the findings of Moffatt, 2007 . A proper investigation of bias can be found in Figure 11 and 12 and in the Appendix. Notably, with the more realistic higher experimental noise level, in those tests the bias is hardly or not all detectable (consider the unfiltered or infinitely fast integrated data). A plausible explanation is that the bias only occurs (Figure 4 2,4) because the data are that perfect that the discrete nature of the ensemble dynamics is almost visually detectable, thus deviating from to the modeling assumption of multivariate normal distributions.
To investigate the six onedimensional 0.95HDCIs simultaneously, we declare the analysis of a data set as successful if all 0.95HDCIs include the true values. Otherwise we define it as a failure. This enables to determine the true probability at which the probability mass of the KF and the RE approach covers the true values in a binomial setting. The left blue vertical line in Figure 4c indicates $p={0.95}^{6}\approx 0.735$ which is the lower limit and which would be the true success probability for an ideal model whose six 0.95HDCIs are drawn from $y\sim \mathrm{binomial}(0.95,6)$. This is the probability of getting 6 successes in 6 trials. The right blue vertical line equals $p=0.95$, signifying the upper limit obtained by treating the six $0.95$ HDCIs as being drawn from $y\sim \mathrm{binomial}(0.95,1)$ each, which is a rather loose approximation. All marginal distributions are computed from the same highdimensional posterior which is formed by one data set for each trial. Thus, the six $0.95$ HDCIs $y\sim \mathrm{binomial}(0.95,1)$ must have success rates between those two extremes if the algorithm creates an accurate posterior. We next combine the binomial likelihood with the conjugated beta prior (Hines et al., 2014) for mathematical convenience. On this occasion, for the sake of the argument, $\mathrm{beta}(1,1)$ seems sufficient. A $\mathrm{beta}(1,1)$ prior is a uniform prior on the open interval $(0,1)$. The estimated true success rate of the RE approach (blue) is $\approx 0.15$ and therefore far away from the success probability an algorithm should have when it is based on an exact likelihood of the data. In contrast, the posterior (green) of the true success probability of the KF resides with a large probability mass between the lower and upper limit of the success probability of an optimal algorithm (given the correct kinetic scheme). As both algorithms use the same prior distribution, the different performance is not induced by the prior.
Exploiting six onedimensional posterior distributions does not necessarily answer whether the posterior is accurate in 6 dimensions but we can refine the used binomial setting. In Figure 4d $\mathbb{P}({\stackrel{~}{k}}_{32},{\stackrel{~}{K}}_{43})$, we see that 2D marginal distributions can, due to their additional degree of freedom, twist around the true value without covering it with HDCV (Equation 14) of reasonable size while simultaneously the two 1–D marginal distribution do cover it with a reasonable HDCI. In general, the KF posterior distribution has its mode much closer to the true value for various parameter combinations and it seems that the posterior is approximately multivariate normal. Further, we recognize that the probability mass of the reasonably sized HDCV of the KF posterior includes the true values whereas the HDCV from the RE does not. In 6 dimensions we lack visual representations of the posterior. Since we showed that both algorithms are consistent for a given identifiable model, we are looking for a way to ask whether the posterior is accurate (has the posterior distribution the right shape). We can answer that question by asking, how much probability mass around the mode (or around multiple modes) needs to be counted to construct a HDCV Equation 14 which includes the true values. Then we can ask for ${N}_{\mathrm{set}}$ data sets how often did we find the true values inside a volume $V(P)$ of a specific probability mass $P$ of the posterior distribution
An algorithm which estimates the parameters of the true process should fulfill this property simultaneously to being consistent. Otherwise credibility volumes or confidence volumes are meaningless. Noteworthy, that this is a empirical test of how sufficient the Bayesian filter and the RE approach hold frequentist coverage property of their HDCVs (Rubin and Schenker, 1986). We explain (Appendix 8) in detail how to quantify the overall shape and $n$dimensional posterior and comment on its geometrical meaning. One way is to use an analytical approximation via the cumulative Chisquared distribution (Figure 5a and b), The other way is to count the probability mass of $n$dimensional histogram bins starting with the highest value until the first bin includes the true values (Figure 5b).
Knowing how much volume/probability mass is needed to include the true rate matrix allows us to test whether all HDCVs constructed from the two probability distributions match the binomial distributions of the ideal model. For each data set and for each HDCV of a fixed probability mass, there are two possible outcomes: The true rate matrix is inside or outside of that volume. For a chosen HDCV with a fixed probability volume, as indicated by a gray space in Figure 5b , we count how many times the true matrix is included in the volume of that probability mass for each trail in a fixed amount of trials. Since the success is binomially distributed, we plot the expected mean of a perfect model $E[y]={N}_{\mathrm{trials}}{P}_{\mathrm{true}}$ and binomial quantiles and compare them with the success rate found in our test runs (Figure 5c) for both algorithms with both methods to determine the posterior shape. The posterior of the KF distributes the probability mass in a consistent manner such that each volume includes the true rate matrix within the quantile range. In contrast, the RE approach fails for all data sets for all HDCVs (from 0 – 0.95 probability mass) and does not include the true values in one single case. Note, that all the binomial trials for each HDCV are made from the same set of data sets which explains the correlated deviation from the mean. For lower but realistic signal to noise ratios, where the fit quality decreases, for example by producing larger errors/wider posterior distributions (Figure 6a), the statistics of the HDCV from the RE approach improve but are still outperformed by the KF. In particular, in our tested case of realistic experimental noise we never find the true values within a 0.65HDCV if the data are analyzed with a RE approach. Even for the highest noise level (Figure 6b), the probability mass of the KF posterior needed to include the true rate matrix remains almost always smaller then the posterior mass of the RE approach. That means that the posterior mass of the KF is much closer to the true value distributed than the posterior mass of the RE. With the KF we find the true rate matrix for one data set in small volume $P<0.05$ around the mode. To achieve the same with the RE approach we need at least a probability mass of 0.3.
In the inset of Figure 6a and b we do not observe a trend, thus no indication that the RE approach has a better performance for large values ${N}_{\mathrm{ch}}$ in this regard. This challenges the common argument that the RE approach should be equivalent to the KF for large ${N}_{\mathrm{ch}}$ because the ratio of mean signal vs. the intrinsic binding and gating noise is so large. Thus, including the autocorrelation into the analysis is important even for unrealistic large ${N}_{\mathrm{ch}}$. One possible explanation is model a signaltonoise ratio which scales $\propto {N}_{\mathrm{ch}}$. From the multinomial distribution both algorithms inherit mean signals which scale $\propto {N}_{\mathrm{ch}}$ and variances which scale in the terms dominating for large ${N}_{\mathrm{ch}}$ similarly with $\propto {N}_{\mathrm{ch}}$. Thus, identical to the real signal, both algorithms model the scaling of the signaltonoise ratio $\propto \sqrt{N}$. It is plausible, that both algorithms remain sensitive for the occurrence of autocorrelation of the noise even for largest signaltonoise ratios. In Figure 5c we compare the Euclidean error of the pessimistic high white noise case with an overoptimistic low noise case. We see, that when increasing ${N}_{\mathrm{ch}}$ there is a regime where the Euclidean error increases faster than ${\sqrt{{N}_{\mathrm{ch}}}}^{1}$ which we indicate with a coarse approximate fit $\propto {N}_{\mathrm{ch}}^{1}$. In that regime two effects happen simultaneously. First, the mean and the intrinsic fluctuations of the signal become more and more dominant over the experimental noise. Second, the standard deviation of intrinsic fluctuations becomes smaller relative to the mean signal. We speculate, that this produces together a learning rate which is faster than the usual asymptotic learning rate ${\sqrt{{N}_{\mathrm{ch}}}}^{1}$ of a regular model but relaxes asymptotically towards ${\sqrt{{N}_{\mathrm{ch}}}}^{1}$.
Statistical properties of both algorithms for more complex models
We have seen in Figure 3c and Figure 4a that the RE and the KF algorithm are consistent estimators, while their error ratio (Figure 7a) seems to have no trend to approach 1 with increasing ${N}_{\mathrm{ch}}$. Adding a second observable increases parameter accuracy and adds identifiability for both algorithms since less aspects of the dynamics need to be statistically inferred (Figure 4a). Furthermore, the second observable takes away much of the tendency (compare Figure 4b 1 – 6 with 8) of the RE approach to overinterpret (overfit) which leads to a shrinking of the error ratio $5.6\pm 1.4$ for PC data to smaller values for cPCF data (Figure 7a) (red) which are on average still bigger than one, while the Euclidean error is reduced (Figure 4a). If we then keep the amount and quality of the PC/cPCF data but increase the complexity of the model which produced the data (Figure 7b and d) from a fourstate to a fivestate model (see kinetic schemes above Figure 7a–c), we see that for cPCF data the error ratio stays roughly the same (difference between Figure 7a and b). For PC data instead both algorithms deliver an unidentified k_{21} for ${N}_{\mathrm{ch}}\leqq 2\cdot {10}^{3}$ (defined as an improper posterior). For larger ${N}_{\mathrm{ch}}$ the KF always identifies all parameters while the RE fails at ${N}_{\mathrm{ch}}\in \{7000,2000,75000\}$ to identify k_{54}. Thus, the KF reduces the risk of unidentified parameters. To calculate the mean error ratio, we exclude the values were some of the parameters are unidentified in total that still amounts to $6.8\pm 2.7$ thus the advantage of the KF (given all parameters are identified) might increase with model complexity for PC data. The lowest Euclidean error for this kinetic scheme has cPCF data analyzed with the KF. (Figure 7d). A 6state1openstates model with cPCF data has again an error ratio of the the usual scale (Figure 7c). As expected, the Euclidean error continuously increases with model complexity (Figure 7d and e). For PC data of the 6state1openstates model even the likelihood of the KF is that weak (Figure 7e) that it delivers unidentified parameters even for and we can detect heavy tailed distributions up until . Using RE on PC data alone does not lead to parameter identification, thus no error ratio can be calculated.
Consistent with our findings, fluorescence data itself, should lower the advantage of the KF compared to PC data simply by signaltonoise arguments. The stochastic aspect of the ligand binding is usually more dominated by the noise of Photon counting and background noise than the stochastic gating is dominated in current data by experimental noise. In terms of uncertainty quantification the advantage of the KF with cPCF varies with the model complexity (see, Appendix 9).
Besides analyzing what causes the changes in the Euclidean error (Figure 7a and b) at the single parameter, we now investigate whether the posterior is a proper representation of uncertainty. Thus, we look back at the HDCIs. The HDCIs of the 4state1openstate (Figure 8) of the PC data from Figure 3 reveal an exacerbated overconfidence problem of the RE approach (blue) compared to cPCFdata (Figures 4b1—6). This, underlines our conclusion of Figures 5 and 6 that the Bayesian posterior sampled by the RE approach is misshaped. As a consequence a confidence volume derived from the curvature at the ML estimate of the RE algorithm understates parameter uncertainty. A possible way for ML methods to derive correct uncertainty quantification is by using bootstrapping data methods (Joshi et al., 2006). Furthermore, the error ratios of each single parameter from its true value in the last column ${\stackrel{~}{k}}_{43}$${\stackrel{~}{K}}_{43}$ strongly increased their magnitudes (insets Figure 8). Even error ratios of $5\cdot {10}^{2}$ are possible. Note, that the way we defined Equation 15 suppresses the influence of the smaller parameter errors in the overall error ratio. Thus the advantage (error ratio) of the KF over RE approach for a single parameter can be much larger or lower compared to the error ratio derived from the Euclidean error if the respective parameter is contributing less to the Euclidean error. The posterior of the KF (green) seems to be unbiased after the transition into the regime $N}_{\mathrm{c}\mathrm{h}}>2\cdot {10}^{3$ where all parameters are identified. Similarly, for the RE algorithm there is no obvious bias in the inference. If we use the RE algorithm and change from the fourstate to the fivestate model (PC data from Figure 7b), bias occurs (Figure 9) in many inferred parameters, even for the highest ${N}_{\mathrm{ch}}$ investigated. Milescu et al., 2005 showed that one or the reason of the biased inference of the RE approach is its ignorance of autocorrelation of the intrinsic noise. We add here that the bias problem clearly aggravates with an increased model complexity. It is even present in unrealistically large patches which in principle could be generated by summing up 10^{2} time traces with ${N}_{\mathrm{ch}}={10}^{3}$. In contrast, the KF algorithm reveals that its parameter inference is either unbiased or at least much less biased in the displayed ${N}_{\mathrm{ch}}$ regime. Furthermore, for both algorithms the position of the HDCI relative to the true value is for some parameters highly correlated, which corresponds to the correlation between optima of the ML method of Milescu et al., 2005 ; Moffatt, 2007.
As a side note, unbiased parameter estimates are a highly desirable feature of an inference algorithm. For example, with a bias in the inference, repeated experiments do not lead to the true value if the arithmetic mean of the parameter inferences is taken. With bias even bootstrapping methods fail to produce reliable uncertainty quantification. Due to the variation of the data the k_{54} parameter is either identified in some neighbourhood of the true value or complete unidentified (Figure 9), if the RE algorithm is used. The unidentified k_{54} occurs even at highquality data such as ${N}_{\mathrm{ch}}=7.5\cdot {10}^{4}$. Only because of the nonphysical prior (Figure 9) of k_{54} induced by the limits of the sampling box of the sampling algorithm the posterior appears to be proper but is in fact either unidentified or or more than two orders of magnitude away from the true value. For the same data using the KF did not result in any unidentified parameters. Note, that comparable inference pathologies such as multimodal distributions of inferred parameter were also reported for the maximum likelihood RE algorithm for low quality PC data or too simple stimulation protocols (Milescu et al., 2005).
In conclusion, the two different perspectives on parameter uncertainty: On the one hand distributions of ML estimates due to the random data (Milescu et al., 2005 ; Moffatt, 2007) and the Bayesian posterior distribution loose their tightly linked (and necessary) connection if the RE algorithm is used. Thus, KF robustifies also ML inferences of the rate matrix. Our findings are consistent with the findings for gene regulatory networks (Gillespie and Golightly, 2012) which show that RE approaches deliver a too narrow posterior in contrast to stochastic approximations which deliver an acceptable posterior compared to the true posterior (defined by a particle filter algorithm). On the data side of the inference problem adding cPCF data eliminates the bias, reduces the variance of the position of the HDCI and eliminates unidentified parameters (Appendix 9—figures 1 and 2) for both investigated algorithms. This advantage increases with modelcomplexity.
For the fivestate and sixstate model, we applied microscopicreversibility (Colquhoun et al., 2004). We enforced it by hierarchical prior distribution (Materials and methods Equation 60) whose parameters can be chosen such that they allow only arbitrarily small violations of microscopicreversibility. But the prior distribution can also be used to enforce some softer regularization around microscopicreversibility. Thus, we can transfer the usually strictly applied algebraic constraint (Salari et al., 2018) of microscopicreversibility to a constraint with scalable softness. In that way we can model the lack of information if microscopicreversibility is exactly fulfilled (Colquhoun et al., 2004) by the given ion channel instead of enforcing the strict constraint upon the model.
Prior critique and model complexity
In the Bayesian framework, the likelihood of the data and the prior generate the posterior. Thus, the performance of both algorithms can be influenced by appropriate prior distributions. We used a uniform prior over the rate matrix which is not optimal. Note, that uniform priors are widely used by several reasons. They appear to be unbiased, and are assumed to be a ‘no prior’ option (which they are not). This is true for location parameters like mean values. In contrast, for other parameters, such as scaling parameters like rates or variances, a uniform prior adds bias to the inference towards faster rates (Zwickl and Holder, 2004). We suspect, that for the PC data even in the simplest model discussed here the lower data quality limit below which we detected unidentified parameters (improper posteriors) is caused by the uniform prior. This lower limit for the KF also increases with the complexity of the model from $N}_{\mathrm{c}\mathrm{h}}<2\cdot {10}^{3$ for the fouestate model till ${N}_{\mathrm{ch}}\leqq 2\cdot {10}^{4}$ for 6state1openstate model. Note, that it is hardly possible to fit the 6state1openstate model with the RE approach for the same amount of PC data. We observe cPCF data eases this problem because the likelihood becomes more concentrated for all parameters. The likelihood dominates the uniform prior. Nevertheless, for most parts of the paper we used a uniform prior over the rates and equilibrium constants to be comparable with the usual default method: a plain ML which influences our results in data regimes in which the data is not strong enough to dominate the bias from the uniform prior. Thus, both algorithms perform better with smarter informative or at least unbiased prior choices for the rate matrix.
In principle, to rule out an influence of the prior, unbiased priors should be used for the rates. The standard concept for unbiased least informative priors is to construct a Jeffreys prior Jeffreys, 1946 for the rate matrix which is, however, beyond the scope of the paper.
The influence of the brightness of the ligands of cPCF data on the inference
To evaluate the advantage of cPCF data Biskup et al., 2007 with respect to PC data only (Figure 10), we compare different types of ligands: Idealized ligands with brightness ${\lambda}_{\text{b}}$, emitting light only when bound to the channels, ‘real’ ligands which also produce background fluorescence when diffusing in the bath solution (Appendix 5) and current data alone. For datasets including fluorescence, the increased precision for the dissociation rate of the first ligand, ${k}_{2,1}$, is that strong that the variance of the posterior $\mathbb{P}({k}_{2,1},{k}_{3,2})$ nearly vanishes in the combined plot with the current data (nearly all probability mass is concentrated in a single point in Figure 10a). The effect on the error of the equilibrium constants ${K}_{i}$ is less strong. Additionally, the bias is reduced and even the estimation of ${N}_{\mathrm{ch}}$ is improved. The brighter the ligands are, the more the posterior of the rates decorrelates, in particular $\mathbb{P}({k}_{2,1},{k}_{3,2})$ (Figure 10a). All median estimates of nine different cPCF data sets (Figure 10b) differ by less than a factor 1.1 from the true parameter except ${k}_{3,2}$, which does not profit as much from the fluorescence data as ${k}_{2,1}$ (Figure 10c). The 95th percentiles, l_{95} of $\mathbb{P}({k}_{2,1})$ and $\mathbb{P}({K}_{1})$ follow ${l}_{95}\sim 1/\sqrt{{\lambda}_{\mathrm{b}}}$. Thus, with increasing magnitude of ligand brightness λ, the estimation of ${k}_{2,1}$ becomes increasingly better compared to that of ${k}_{3,2}$ (Figure 10c). The posterior of the binding and unbinding rates of the first ligand contracts with increasing ${\lambda}_{\mathrm{b}}$. The l_{95} percentiles of other parameters exhibit a weaker dependency on the brightness (${l}_{95}\sim {\lambda}^{0.1}$). For ${\lambda}_{\mathrm{b}}=0.01$ photons per bound ligand and frame, which corresponds to a maximum mean signal of 20 photons per frame, the normal approximation to the Poisson noise hardly captures the asymmetry of photon counting noise included in the time traces. Nevertheless, l_{95} decreases about ten times when cPCF data are used (Figure 10c). The estimated variance of
with the mean predicted signal $\mathbf{H}E[\mathbf{n}({t}_{i})]$, for PC or cPCF data is ${\sigma}^{2}({r}_{i})\approx 1$ (Figure 10d) which means that the modeling predicts the stochastic process correctly up to the variance of the signal. Note that the mean value and covariance of the signal and the state form sufficient statistics of the process, since all involved distributions are approximately multivariate normal. The fat tails and skewness of $\mathbb{P}({k}_{21})$ and $\mathbb{P}({k}_{12})$ arises because the true model is too flexible for current data without further prior information. The KF allows to determine the variance (Figure 10e) of the openchannel current noise for ${\sigma}_{\mathrm{op}}=0.1i$. Adding fluorescence data has roughly the same effect on the estimation of ${\sigma}_{\mathrm{op}}$ like using five times more ion channels to estimate ${\sigma}_{\text{op}}^{2}$.
Sensitivity towards filtering before the analogtodigital conversion of the signal
On the one side, every analog signal to be digitized needs analog filtering for antialiasing according to the Nyquist theorem. On the other side, every analog filter does not only suppress unwanted white noise but also distorts the dynamics (Figure 11a) of the signal of interest (Silberberg and Magleby, 1993). Therefore, (Qin et al., 2000) recommend to avoid analog filtering as much as possible in singlechannel analysis and let the HMM analyze the data in the rawest available form, even with simultaneous drift correction (Sgouralis and Pressé, 2017a). One can also expect that analog filtering of a macroscopic signal is harmful for the inference of the KF and the RE approach. For the CCCO model considered herein we investigated the mean behavior (accuracy and precision) of the posterior of both algorithms with seven data sets (simulated at 100 kHz to mimic an analog signal). A digital fourthorder Bessel filter (Virtanen et al., 2020) was then applied. The maximum analysing frequency f_{ana} of the KF used is $100400$ Hz to be comparable to cPCF setups. The slower frequency at which the Bayesian filter analyzes the data is necessary because the applied Bessel filter has caused additional time correlations in the originally white noise of the signal. Thus, an alldatapoints fit would immediately violate the white noise assumption of Equation 4 which we restore by analyzing at a much lower frequency. We then let the time scales of the induced time correlations become larger and larger by decreasing f_{cut}. Physically, the absolute cutoff frequency f_{cut} is irrelevant; what matters is the magnitude of f_{cut} relative to f_{ana} and to the eigenvalues ${\alpha}_{i}$ of the ensemble (see, Appendix 3), since the eigenvalues determine the time evolution of the mean ensemble state, the autocorrelation, and Fourier spectrum of the fluctuations around the equilibrium distribution (Colquhoun et al., 1997b). The eigenvalues depend on the ligand concentration such that for a fourstate model for each ligand concentration there are three relevant time scales $1/{\alpha}_{i}$ (where $i=2,3,4$) plus the equilibrium solution which satisfies ${\alpha}_{1}=0$. For 10 different time series $3\cdot 10+3$ the outcome is to have different values of ${\alpha}_{i}$.Each eigenvalue is the inverse of the time constant of an exponential decay (see, Appendix 3). For this reason, we normalize in the following (Figure 11) the cutoff frequencies by ${\alpha}_{2}$ at the highest ligand concentration. We analyze the arithmetic mean from 7 different data sets of the median of the posterior of the rate matrix. The mean Euclidean error of the median (Figure 11b) and a series of quantiles demonstrate that overall the error of the mean median of the posterior KF (green) is smaller than that obtained by the RE. For unfiltered data, the accuracy of the mean median of the KF is increased by $\approx 1.6$. Based on the Euclidean error both algorithms benefit slightly from careful analog filtering for ${f}_{\mathrm{cut}}/{\alpha}_{2}\ge 1$ while the offset remains rather constant. A strong negative effect of analog filtering starts for both algorithms around ${f}_{\mathrm{cut}}\approx 1\mathrm{kHZ}$. This is induced by ${f}_{\mathrm{cut}}\to {f}_{\mathrm{ana}}$ (see, Appendix 10). In contrast, based on the level of each individual parameter of the rate matrix (Figure 11c 1–6) the bias induced by analog filtering immediately starts with ${f}_{\mathrm{cut}}=70\mathrm{kHz}$ (Figure 11c 1–3). Note, that visual inspection of the signal (Figure 11a) does not reveal signal distortions ${f}_{\mathrm{cut}}\ge 10\mathrm{kHz}$ though they are detected by both algorithms. For unfiltered data, the maximum of the posterior for the RE approach is a biased estimate $E[{\theta}_{\mathrm{ME}}]\ne {\theta}_{\mathrm{true}}$ for at least the parameters ${\stackrel{~}{k}}_{21},{\stackrel{~}{K}}_{21},{\stackrel{~}{K}}_{32}$ of the true value ${\theta}_{\mathrm{true}}$, which is explained (Milescu et al., 2005) by the fact that RE approaches ignore the autocorrelation of the intrinsic noise. Additionally, the data indicate that for ${\stackrel{~}{K}}_{43}$ the maximum of the posterior is even for the KF a biased estimate which we interpret as limitations induced by the fact that the mean vector and covariancematrix do not constitute sufficient statistics as soon as Poisson distributed photon counting or openchannel noise blurs the signal. For the RE approach, the additional bias induced by the analog filter on the mean maximum of all parameters of the posterior starts with ${f}_{\mathrm{cut}}\approx 70$ kHz or, in other words, at the fastest time scale in the whole data set. The total bias in the estimate is reduced for k_{21} with the additional bias from the analog filtering but increased for k_{32} which for the Euclidean error leads at first to a small increase in accuracy. The KF is more robust towards analog filtering, as the results alter less with f_{cut} (given a reasonable f_{cut}), and less biased for unfiltered data in the estimates of these parameters. On the one hand, the Euclidean error shrinks for ${f}_{\mathrm{c}\mathrm{u}\mathrm{t}}>10$ kHz (Figure 11b). On the other hand, on the singleparameter level (Figure 11c 1–6), the parameter estimates pick up bias due the analog filtering even for high filter frequencies, in particular for the RE approach. Only for k_{43} the KF is more biased than the RE approach.

Figure 11—source data 1
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data1v2.zip

Figure 11—source data 2
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data2v2.zip

Figure 11—source data 3
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data3v2.zip

Figure 11—source data 4
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data4v2.zip

Figure 11—source data 5
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data5v2.zip

Figure 11—source data 6
 https://cdn.elifesciences.org/articles/62714/elife62714fig11data6v2.zip
The KF is the unique minimal variance Bayesian filter for a linear Gaussian process (Anderson and Moore, 2012) which means, given that the assumptions of the KF are fulfilled by the true process of interest, the KF is mathematically proven the best modelbased filter to apply. Consequently, analog filtering does not provide an advantage unless it removes specific high frequency external noise sources (colored noise). We demonstrate (Appendix 10) this for PC data and varied f_{cut} and f_{ana}. On the downside, increasing f_{ana} makes the results of both algorithms more fragile if ${f}_{\mathrm{cut}}\gg {f}_{\mathrm{ana}}$ does not hold. Thus, the critical edge in Figure 11b is indeed induced by f_{cut} approaching f_{ana}. This suggests that the white noise assumption of both algorithms is violated. On the upside, if ${f}_{\mathrm{cut}}\gg {f}_{\mathrm{ana}}$ is given, the KF with an order of magnitude higher f_{ana} has a reduced bias of up to 20% for ${f}_{\mathrm{cut}}\to \mathrm{\infty}$ for individual parameters compared to the KF with lower f_{ana}. Additionally, a higher f_{ana} reduces the variance. To reduce the bias of parameter estimates to a minimum, the experimental design offers two remedies, either doing cPCF experiments with additional discussed advantages or using the KF at a high f_{ana} with even much higher f_{cut}.
By theoretical grounds a further argument for doing less analog filtering is that this benchmark analyzes data of a finite state Markov process, which is a coarse proxy for the true process. In reality, relaxation of a protein is a highdimensional continuousstate Markov process with infinitely many relaxation time scales (eigenvalues) (Frauenfelder et al., 1991) which, however, might be grouped in slower experimentally accessible and nonaccessible faster time scales (Noé et al., 2013). With larger data sets of higher quality from better experiments, the faster time scales might become accessible if they are not distorted by analog filtering. In conclusion, deciding on a specific kinetic scheme and inferring its parameters means finding a model which accommodates in the best way to the set of observed eigenvalues. Analog filtering hampers the RE, KF or HMM forwardbackward algorithm (Qin et al., 2000) to correctly describe the faster time scales.
Error due to finite integration time of fluorescence data
So far, we idealized the fluorescence data integration time as being instantaneously relative to the time scales of ensemble dynamics. In real experiments, the fluorescence signal of cPCF data has orders of magnitude longer minimal integration time ${T}_{\mathrm{int}}$ (time to record all voxels of a frame) or maximal integration frequency ${f}_{\mathrm{int}}=1/{T}_{\mathrm{int}}$, than the possible sampling frequency of current recordings. We mimic the finite integration time
by summing with a sliding window the 100 kHz signal including the white noise to obtain data at an effectively lower sampling frequency (Figure 12a). Additionally we set the Bessel filter for the current data to ${f}_{\mathrm{cut}}/{\alpha}_{2}=4.59$ or ${f}_{\mathrm{cut}}=90$ kHz. The fastest used analysing frequency is ${f}_{\mathrm{ana}}=500\mathrm{Hz}$. We scale mean photo brightness ${\lambda}_{\mathrm{b}}$ and background noise down such that the signaltonoise ratio of the lower integration frequency data is the same as of the highfrequency data ${\lambda}_{\mathrm{b}}/{T}_{\mathrm{int}}=const$ . We do that in order to separate the bias from the finite integration time from other effects such as a better signal to noise ratios for each integrated point. Note that we only analyzed the plot until ${f}_{\mathrm{int}}={f}_{\mathrm{ana}}$. Both algorithms incur very similar bias due to the finite integration time (Figure 12b). The KF (green) is more precise for high integration frequencies ${f}_{\mathrm{cut}}/{\alpha}_{2}$ until ${f}_{\mathrm{cut}}/{\alpha}_{2}\approx 0.08$ then the RE approach becomes more robust. Similar to Besselfiltered current data (Figure 11b) on the single parameter level the systematic deviations start early for example ${f}_{\mathrm{int}}=10$ kHz for K_{21} (Figure 12c4). Possibly the systematic deviations start (Figure 12c2) already at ${f}_{\mathrm{int}}=50$ kHz for k_{32}. The sudden increase of the Euclidean error (Figure 12b) of the mean median at ${f}_{\mathrm{cut}}/{\alpha}_{2}\approx 0.2$ occurs in this case not due to f_{int} approaching f_{cut} but due to ${f}_{\mathrm{int}}\u2a85{\alpha}_{1,2}$ for many ligand concentrations. To show this we plot the results of the fitting of five different data sets without including the highest 4 ligand concentrations (red) which means the largest eigenvalues are much smaller (Figure 12b,C16). Additionally, we keep ${f}_{\mathrm{int}}=const$. Although fluctuations of the posterior medians are higher, the KF becomes more robust. Note, that the fastest eigenvalues of these reduced data sets are indicated by the blue bars (Figure 12b and c4). Based on the Euclidean error (Figures 11a and 12a) the robustness of both algorithms against the cutoff frequency is compared with the robustness against the integration frequency found to be about an order of magnitude higher. That is related to a specific detail of the model used: the binding reaction, corresponds to the fastest time scales of the overall dynamic (difference between Figure 1b and c), which is exposed by the fluorescence signal. Thus, kinetic analysis of any data should make sure that the corresponding frequency of the most dominant timescales of the time series are much slower than the respective f_{int}f_{cut} independently of the investigated algorithms.

Figure 12—source data 1
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data1v2.zip

Figure 12—source data 2
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data2v2.zip

Figure 12—source data 3
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data3v2.zip

Figure 12—source data 4
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data4v2.zip

Figure 12—source data 5
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data5v2.zip

Figure 12—source data 6
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data6v2.zip

Figure 12—source data 7
 https://cdn.elifesciences.org/articles/62714/elife62714fig12data7v2.zip
Conclusions
We generalized the filter equations (Methods Equation 37, 38d, 57, 58 and 59) of the KF for analyzing the gating and binding dynamics of ligandgated ion channels with a realistic signalgenerating model for isolated patchclamp (PC) and confocal patchclamp fluorometry (cPCF) data including openchannel noise, photoncounting noise and background noise. Any other type of linear kinetic scheme (e.g. for voltagedependent channels) and signal can be applied as long as the characteristics of the signal are sufficiently described by normal distributions. Our approach is derived by approximating the chemical master equation of a first order chemical reaction network (which ion channel experiments usually are) which is exact up to the second statistical moment. For firstorder chemical reaction networks, the linear noise approximation (Wallace et al., 2012) are exact up to the second moment too (Grima, 2015). Thus, we can conclude that our Bayesian filter uses a time integrated version of the linear noise approximation. To our understanding of Wallace et al., 2012 our approach is thus equivalent to approaches based on the chemical Langevin or FokkerPlanck equations (Gillespie, 2002). Consequently, this also makes the considerations of the quality of the chemical Langevin equation as an approximation (Gillespie, 2000) of the chemical master equation valid for our approach. Compared to previous attempts Moffatt, 2007, this mathematical generalization is necessary (Figure 3b) in order to use Bayesian filters on macroscopic PC or cPCF data. With our algorithm, we demonstrate (Figures 3c and 7) that the common assumption that for large ensembles of ion channels simpler deterministic modeling by RE approaches is on par with stochastic modeling, such as a KF, is wrong in terms of Euclidean error and uncertainty quantification (Figures 5a–c ,–6a–b).
Enriching the data by fluorescencebased ligand binding reveals two regimes. In one regime, the twodimensional data increase the accuracy of the parameter estimates up to $\approx 10$fold (Figure 4a and c). In the other regime of lower channel expression, enriching the data by the second observable, makes nonidentified parameters to identified parameters. The second observable in cPCF data decreases the overfitting tendency (Figure 4a, b and d) of the RE approach on the true process. Thus, in this regard the advantage of the KF becomes smaller. However, by exploiting Bayesian HDCV we gain a second perspective: We show for various signaltonoise ratios (Figures 5a–c ,–6a–b) that the posterior sampled by a RE approach never covers the true values within a reasonable HDCV. Thus, the central feature of Bayesian statistics, exact uncertainty quantification by having the full posterior, is meaningless in combination with an RE approach (considering the type of data and set of signaltonoise ratios that we tested). This even holds true for very pessimistic signaltonoise assumptions Figure 6b. If HDCVs based on an RE approach cannot be trusted, the same applies to confidence volumes based on the curvature of the likelihood. This is not the case for the KF which delivers properly shaped posteriors (Figures 6a–c ,–5a–c). Increasing the model complexity, at unchanged PC data quality (Figure 7) shows that the RE approach displays unidentified rates even for large ion channel ensembles while our approach identified all parameters for the same data. We also investigated the robustness of both algorithms against the cutoff frequency of a Bessel filter (Figure 11) and showed the overall superior robustness of the KF against errors of analog filtering compared to the RE approach. Analog filtering has its limitations due to distorting the higher frequencies of the Fourier spectrum of the signal. Thus, one should let the KF sample as fast as possible, with a cutoff frequency of at least one order of magnitude higher than the sampling frequency of the KF.
Similar to the Bessel filter, the KF is more robust than the RE approach against errors due to the finite integration time. Nevertheless, it is crucial for both algorithms (Figure 12), that the intrinsic time scales (1/eigenvalues) of the process to be analyzed are slower than the integration time of the data points. Otherwise the accuracy of the inference deteriorates.
Altogether, we demonstrated the performance of the generalized Kalman filter on ion channel data for inference of kinetic schemes. Nevertheless, our approach can approximate any other stochastic system and signal distributions of linear (pseudofirstorder) kinetics (Sorenson and Alspach, 1971). Prospective extensions of the Bayesian filter, for example by Bayesian Gaussian sum filters or similar numerically brute force concepts such as particle filters (Golightly and Wilkinson, 2011; Gillespie and Golightly, 2012), can overcome modeling errors at low ion channel numbers or low photon fluxes.
Materials and methods
We simulated state evolution $s(t)$ with either the software QuB (Nicolai and Sachs, 2014) for PC data or an inhouse Matlab routine (The code will be shared on request.) for cPCF data. The inhouse Matlab routine is an implementation of the Gillespie algorithm Gillespie Daniel T. (1977). Traces were summed up, defining the ensemble state vector $\mathbf{n}(t):={({n}_{1},{n}_{2},{n}_{3},{n}_{4})}^{\top}$, which counts the number of channels in each state. At first we used a 10 kHz sampling frequency for the Gillespie algorithm but for investigating the errors induced by analog filtering the current signal and the finite integration time for each fluorescence data point the Gillespie algorithm sampled at 100 kHz. The KF, RE, and Bayesian filter routines were implemented in Stan (Carpenter et al., 2017) with the interface package PyStan and ran on a high performance computing cluster with O(100) Broadwell and SkyLake nodes. A Tutorial for Patch clamp data can be found on the git hub page https://github.com/JanMuench/Tutorial_Patchclamp_data and for cPCF data, https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data. The cPCF data simulation code can be found here: https://cloudhsm.itdlz.de/s/QB2pQQ7ycMXEitE (Source code 1).
Methods
Hereinafter, we derive the equations for our Bayesian filter for time series analysis of hidden linear chemical reaction networks (kinetic schemes). A detailed description of the experimental noise is provided in the Appendix 5.
The relation of Bayesian statistics to the Kalman filter
Request a detailed protocolThe following conventions are generally used: Bold symbols are used for multidimensional objects such as vectors or matrices. Calligraphic letters are used for (some) vectorial time series and doublestrike letters are used for probabilities and probability densities. Within the Bayesian paradigm (Hines, 2015; Ball, 2016), each unknown quantity, including model parameters $\mathit{\theta}$ and time series of occupancies of hidden states ${N}_{T}={\{\mathbf{n}({t}_{i})\}}_{i=1}^{T}$, are treated as random variables conditioned on observed time series data $\mathcal{Y}}_{T}={\mathbf{y}({t}_{i})}_{i=1}^{T$. The prior $\mathbb{P}(\mathit{\theta})={\prod}_{j}^{{N}_{\mathrm{par}}}\mathbb{P}({\theta}_{j})$ or posterior distribution $\mathbb{P}(\mathit{\theta}{\mathcal{Y}}_{T})$ encodes the available information about the parameter values before and after analysing the data, respectively. According to the Bayesian theorem, the posterior distribution
is a probability distribution of a parameter set $\mathit{\theta}$ conditioned on $\mathcal{Y}}_{T$. The likelihood $\mathbb{L}({\mathcal{Y}}_{T}\mathit{\theta})$ encodes the distribution of the data by modelling the intrinsic fluctuations of the protein as well as noise coming from the experimental devices. The prior provides either assumptions before measuring data or what has been learnt from previous experiments about $\mathit{\theta}$. The normalization constant
ensures that the posterior is a normalized distribution. The KF is a special class of models in the family of Bayesian filters (Ghahramani, 1997), which is a generalisation of the classical KF. Due to its linear time evolution (Equation 1), the KF is particularly useful for modeling time series data of ensembles dynamics of first order chemical networks. It delivers a set of recursive algebraic equations (Materials and methods Equation 28 and Equation 32) for each time point, which allows to express the prior $\mathbb{P}(\mathbf{n}(t){\mathcal{Y}}_{t1})$ and (after incorporating $\mathbf{y}(t)$) the posterior $\mathbb{P}(\mathbf{n}(t){\mathcal{Y}}_{t})$ occupancies of hidden states $\mathbf{n}(t)$ for all $t$ given a set of parameters $\mathit{\theta}$. This means the KF solves the filtering problem (inference of ${N}_{T}$) by explicitly modeling the time evolution of $\mathbf{n}(t)$ by multivariate normal distributions. This allows us to replace $\mathbb{L}({\mathcal{Y}}_{T}\mathit{\theta})$ of Equation 20 by the expression of Equation 9.
The Bayesian framework (as demonstrated in this article) has various properties which makes it superior to ML estimation (MLE) (McElreath, 2018). Those properties are in particular useful for the analysis of biophysical data since very often the dynamics of interest are hidden or latent in the data. Models with a hidden structure are called singular. For regular (nonsingular) statistical models, maxima ${\mathit{\theta}}_{\mathrm{ML}}$ of the posterior or likelihood converge in distribution
to the true value ${\mathit{\theta}}_{\mathrm{true}}$,where ${\mathbf{F}}^{1}({\mathit{\theta}}_{\mathrm{true}})$ is the inverse Fisher information matrix. Under those conditions it is justified to derive from the curvature of the likelihood at ${\mathit{\theta}}_{\mathrm{ML}}$ via the CramerRaobound theorem
a confidence volume for the inferred parameters. In contrast, consider for example the type of data investigated in this study which probes the protein dynamics by current and light. Singularity means that the Fisher information matrix of a model is not invertible leading to the breakdown of the CramerRao Bound theorem. Due to the breakdown, it cannot be guaranteed that even in the asymptotic limit the loglikelihood function can be approximated by a quadratic form Watanabe, 2007. Thus, usually the MLE does not obey Equation 22. Consequently, the posterior distribution is usually not a normal distribution either (Watanabe, 2007). Using the full posterior distribution without further approximations detects the resulting problems such as deviation from normality or nonidentifiability of parameters, related to the singularity. In conclusion, the posterior is still a valid representation of parameter plausibility while ML fails.
Time evolution of a Markov Model for a single channel
Request a detailed protocolIn the following, we write the time $t$ as function argument rather than a subscript. Following standard approaches, we attribute to each state of the Markov model an element of a vector space with dimension $M$. At a time, a channel can only be in a single state. This implies that the set of possible states is S:=$\{(1,0,0,\mathrm{\dots}),(0,1,0,\mathrm{\dots}),\mathrm{\dots},(\mathrm{\dots},0,1)\}\subset {\{0,\mathrm{\hspace{0.17em}1}\}}^{M}$. In the following, Greek subscripts refer to different states while Latin subscripts refer to different channels. By $\mathbf{s}(t)={\mathbf{e}}_{\alpha}$ we specify that the channel is in state α at time $t$. Mathematically, ${\mathbf{e}}_{\alpha}$ stands for the αth canonical unit Cartesian vector (Table 1).
Assuming that the state transitions can be modeled by a first order Markov process, the path probability can be decomposed as the product of conditional probabilities as follows:
Markov models (MMs) and rate models are widely used for modeling molecular kinetics (Appendix 2). They provide an interpretation of the data in terms of a set of conformational states and the transition rates between these states. For exactness it remains indispensable to model the dynamics with a HMMs (Noé et al., 2013). The core of a hidden Markov model is a conventional Markov model, which is supplemented with a an additional observation model. We will therefore first focus on a conventional Markov model. Statetostate transitions can be equivalently described with a transition matrix $\mathbf{T}$ in discrete time or with a rate matrix $\mathbf{K}$ in continuous time, as follows:
where $\mathrm{exp}$ is the matrix exponential. We aim to infer the elements of the rate matrix $\mathbf{K}$, constituting a kinetic model or reaction network of the channel. Realizations of sequences of states can be produced by the DoobGillespie algorithm Gillespie Daniel T. (1977). To derive succinct equations for the stochastic dynamics of a system, it is beneficial to consider the time propagation of an ensemble of virtual system copies. This allows to ascribe a probability vector $\mathbf{p}(t)$ to the system, in which each element ${p}_{\alpha}(t)$ is the probability to find the system at $t$ in state α. One can interpret the probability vector $\mathbf{p}$ as the instantaneous expectation value of the state vector $\mathbf{s}$.
The probability vector obeys the discretetime Master equation
Time evolution of an ensemble of identical noninteracting channels
Request a detailed protocolWe model the experimentally observed system as a collection of noninteracting channels. A single channel can be modeled with a firstorder MM. The same applies to the ensemble of noninteracting channels. We focus on modeling the time course of extensive macroscopic observables such as the mean current and fluorescence signals as well as their fluctuations. A central quantity is the vector $\mathbf{n}(t)$ which is the occupancy of the channel states at time $t$:
This quantity, like $\mathbf{s}(t)$, is a random variate. Unlike $\mathbf{s}(t)$, its domain is not confined to canonical unit vectors but to $\mathbf{n}\in {\mathbb{N}}^{M}$. From the linearity of Equation 28 in the channel dimension and from the singlechannel CME Equation 27b one can immediately derive the equation for the time evolution of the mean occupancy $\overline{\mathbf{n}}(t)=E[\mathbf{n}(t)]$:
with the transition matrix $\mathbf{T}$. The full distribution $\mathbb{P}(\mathbf{n}(t+1)\mathbf{n}(t))$ is a generalized multinomial distribution. To understand the generalized multinomial distribution and how it can be constructed from the (conventional) multinomial distribution, consider the simplified case where all channels are assumed to be in the same state α. Already after one time step, the channels will have spread out over the state space. The channel distribution after one time step is parametrized by the transition probabilities in row number α of the singlechannel transition matrix $\mathbf{T}$. According to the theory of Markov models, the final distribution of channels originating from state α is the multinomial distribution
In general, the initial ensemble will not have only one but multiple occupied channel states. Because of the independence of the channels, one can imagine each initial subpopulation spreading out over the state space independently. Each subpopulation with initial state α gives rise to its own final multinomial distribution that contributes ${n}_{\beta}^{(\alpha )}$ transitions into state β to the total final distribution. The total number of channels at $t+1$ in each state can then be simply found by adding the number of channels transitioning out of the different states α.
Evidently, the total number of channels is conserved during propagation. The distribution of $\mathbf{n}(t+1)$, defined by Equations 30; 31, is called the generalized multinomial distribution:
While no simple expression exists for the generalized multinomial distribution, closed form expressions for its moments can be readily derived. For large ${N}_{\mathrm{ch}}$ each $\mathbb{P}({\mathbf{n}}^{(\alpha )}(t+1)\mid {n}_{\alpha}{\mathbf{e}}_{\alpha})$ can be approximated by a multivariatenormal distribution such that also $\mathrm{general}\mathrm{multinomial}(\mathbf{n}(t),\mathbf{T})$ has a multivariatenormal approximation. In the next section, we combine the kinetics of channel ensembles with the KF by a moment expansion of the governing equations for the ensemble probability evolution.
Moment expansion of ensemble probability evolution
Request a detailed protocolThe multinomial distribution (Fredkin and Rice, 1992) has the following mean and covariance matrix
where ${\mathbf{T}}_{:,\alpha}$ denotes the column number α of the transition matrix and $\mathrm{diag}({\mathbf{T}}_{:,\alpha})$ describes the diagonal matrix with ${\mathbf{T}}_{;,\alpha}$ on its diagonal. Combining Equation 31 with Equations 33; 34 we deduce the mean and variance of the generalized multinomial distribution:
Note that Equations 35; 36 are conditional expectations that depend on the random state $\mathbf{n}$ at the previous time $t$ and not only on the previous mean $\overline{\mathbf{n}}$. To find the absolute mean, the law of total expectation is applied to Equation 35, giving
in agreement with the simple derivation of Equation 29. We introduce a shorthand $\mathbf{P}(t):=\mathrm{c}\mathrm{o}\mathrm{v}(\mathbf{n}(t),\mathbf{n}(t))$ for the absolute covariance matrix of $\mathbf{n}(t+1)$. Similarly, $\mathbf{P}(t)$ can be found by applying the law of total variance decomposition (Weiss, 2005 to Equations 35; 36), giving
Equations 37, 38d dare compact analytical expressions for the mean and the covariance matrix of the occupancy vector $\mathbf{n}$ at $t+1$ that depend on the mean $\overline{\mathbf{n}}$ and covariance matrix $\mathbf{P}$ at the previous time step $t$. Chaining these equations for different time steps $t=0,\phantom{\rule{thinmathspace}{0ex}}\dots ,\phantom{\rule{thinmathspace}{0ex}}T$ allows to model the whole evolution of a channel ensemble. Moreover, these two equations together with the output statistics of $\mathbb{O}(\mathbf{y}\mathbf{n}(t))$ are sufficient to formulate correction equations Equation 59 of the KF (Moffatt, 2007; Anderson and Moore, 2012). These equations will be used in a Bayesian context to sample the posterior distribution of the model parameters. The sampling entails repeated numerical evaluation of the model likelihood. Therefore, analytical equations for the ensemble evolution that can be quickly evaluated on a computer millions of times are indispensable. This was achieved by deriving Equation 37, Equation 38d. Comparing Equation 38d with the KF prediction equation (Anderson and Moore, 2012) for $\mathbf{P}(t)$, we obtain the statedependent covariance matrix of Equation 3 as
In the following section on properties of measured data and the KF, we no longer need to refer to the random variate $\mathbf{n}(t)$. All subsequent equations can be formulated by only using the mean hidden state $\overline{\mathbf{n}}(t)$ and the variancecovariance matrix of the hidden state $\mathbf{P}(t)$. We therefore drop the overbar in $\overline{\mathbf{n}}(t)$ so that the symbol $\mathbf{n}(t)$ refers from now on to the mean hidden state.
Modeling simultaneous measurement of current and fluorescence
Request a detailed protocolIn the following, we develop a model for the conditional observation distribution $\mathbb{O}(\mathbf{y}\mathbf{n}(t))$ (Appendix 5 for experimental details). Together with the hidden ensemble dynamics this will enable us to derive the output statistics of the KF (see, below). Let $\mathbf{y}(t)$ be the vector of all observations at $t$. Components of the vector are the ion current and fluorescence intensity.
As outlined in the introduction part, in Equation 4 we model the observation by using a conditional probability distribution $\mathbb{O}(\mathbf{y}(t)\mathbf{n}(t))$ that only depends on the mean hidden state $\mathbf{n}(t)$, as well as on fixed channel and other measurement parameters. $\mathbb{O}(\mathbf{y}(t)\mathbf{n}(t))$ is modeled as a multivariate normal distribution with mean $\mathbf{Hn}(t)$ and variancecovariance matrix $\mathbf{\Sigma}(t)$, that can in general depend on the mean state vector $\mathbf{n}(t)$ (much like the covariance matrix of the kinetics in (Equation 38d) ). The observation matrix $\mathbf{H}\in {\mathbb{R}}^{{N}_{\mathrm{obs}}\times M}$ projects the hidden state vector $\mathbf{n}(t)$ onto $\mathbf{Hn}(t)\in {\mathbb{R}}^{{N}_{\mathrm{obs}}}$, the observation space. The observation distribution is
This measurement model is very flexible and allows to include different types of signals and error sources arising from both the molecules and the instruments. A summary of the signals and sources of measurement error and their contributions to the parameters of $\mathbb{O}(\mathbf{y}(t)\mathbf{n}(t))$ is provided by Table 2. Below we address the two types of signals and four noise sources one by one. For this, we decompose the observation matrix and the observation noise covariance matrix into the individual terms:
In the following, we report the individual matrices for the exemplary CCCO model with one open state $\alpha =4$ and three closed states $\alpha =1,2,3$. Matrices can be constructed analogously for the other models. For the definition of ${\mathbf{\Sigma}}_{\mathrm{back}}$ refer to (Appendix 5).
Macroscopic current and openchannel noise
Request a detailed protocolWe model the current and the intrinsic fluctuations of the openchannel state $\mathbf{s}={\mathbf{e}}_{4}$ (the open channel noise) by a statedependent normal distribution with mean $i{n}_{4}(t)$ where ${n}_{4}(t)$ is the number of channels in the open state at $t$ and $i$ is the singlechannel current. The additional variance of the singlechannel current is described by ${\sigma}_{\mathrm{open}}^{2}$. The sum of the instrumental noise of the experimental setup and the open channel noise is modeled as uncorrelated (white) normally distributed noise with the mean $E[{\nu}_{\mathrm{I}}(t)]=0$ and variance $E[{\nu}_{\mathrm{I}}^{2}(t)]={\sigma}_{\mathrm{op}}^{2}{n}_{4}(t)+{\sigma}_{\mathrm{m}}^{2}$. By making the openchannel noise dependent on the hidden state population ${n}_{4}(t)$, we fully take advantage of the flexibility of Bayesian filters which admits an (explicitly or implicitly) timedependent observation model. By tabulating the parameters of the two normal distributions into $\mathbf{H}$ and $\mathbf{\Sigma}$, we obtain
One can now ask for the variance of a data point $y(t)$ given the epistemic and aleatory uncertainty of $\mathbf{n}(t)$ encoded by $\mathbf{P}(t)$ in Equation 38d. By using the law of total variance the signal variance follows as:
See, Appendix 6 for further details.
Fluorescence and photoncounting noise
Request a detailed protocolThe statistics of photon counts in the fluorescence signal are described by a Poisson distribution with emission rate ${\lambda}_{\mathrm{Fl}}$
The total emission rate ${\lambda}_{\mathrm{Fl}}$ can be modeled as a weighted sum of the specific emission rates ${\lambda}_{b}$ of each ligand class $\{0,1,2\}$. The weights are given by the stoichiometric factors which reflect the number of bound ligands. In order to cast the Poisson distribution into the functional form of the observation model (Equation 41), we invoke the central limit theorem to approximate
The larger ${\lambda}_{\mathrm{Fl}}$ the better is the approximation. We assume, that the confocal volume is equally illuminated. For our model of ligand fluorescence, we assume for a moment that there is no signal coming from ligands in the bulk. We will drop this assumption in the next section. With these assumptions, we arrive at the following observation matrix
The matrix $\mathbf{H}$ aggregates the states into two conductivity classes: nonconducting and conducting and three different fluorescence classes. The first element ${(\mathbf{Hn})}_{1}$ is the mean fluorescence ${\lambda}_{\mathrm{Fl}}(t)={\lambda}_{b}[{n}_{2}(t)+2({n}_{3}(t)+{n}_{4}(t))]$. The variancecovariance matrix ${\mathbf{\Sigma}}_{\mathrm{binding}}$ can be derived along the same lines using Equation 48. We find
Under these assumptions, the observation matrix can be written as follows
Output statistics of a Kalman Filter
Request a detailed protocolwith twodimensional statedependent noiseNow simultaneously measured current and fluorescence data $\mathbf{y}\in {\mathbb{R}}^{2}$, obtained by cPCF, are modeled. Thus, the observation matrix fulfills $\mathbf{H}\in {\mathbb{R}}^{2\times M}$. One can formulate the observation distribution as
The vector ${\mathit{\nu}}_{\mathrm{m}}$ denotes the experimental noise, with $E[{\mathit{\nu}}_{\mathrm{m}}]=0$ and variance given by the diagonal matrix ${\mathbf{\Sigma}}_{\mathrm{meas}}+{\mathbf{\Sigma}}_{\mathrm{back}}$. The second noise term arises from Poissondistributed photon counting statistics and the openchannel noise. It has the properties
and
The matrix $\mathbf{\Sigma}$ is a diagonal matrix. To derive the covariance matrix $\mathrm{cov}(\mathbf{y}(t))$ we need to additionally calculate $\mathrm{var}({y}_{\mathrm{fluo}}(t))$ and $\mathrm{c}\mathrm{o}\mathrm{v}({y}_{\mathrm{f}\mathrm{l}\mathrm{u}\mathrm{o}}(t),{y}_{\mathrm{p}\mathrm{a}\mathrm{t}\mathrm{c}\mathrm{h}}(t))$. By the same arguments as above we get
The cross terms can be calculated by using the law of total covariance
yielding the matrix
We assumed that the Poisson distribution is well captured by the normal approximation. In cPCF data, the ligand binding to only a subensemble of the channels is monitored, which we assume to represent the conducting ensemble such that ${N}_{\mathrm{ch},\mathrm{FL}}={N}_{\mathrm{ch},\mathrm{I}}$. For real data, further refinement might be necessary to model the randomness of the subensemble in the summed voxels. With the time evolution equations for the mean (Equation 35) and for the covariance matrix Equation 38d as well as with the expressions for the signal variance, we possess all parameters that are needed in the correction equation of the (Kalman, 1960; Anderson and Moore, 2012).
The correction step
Request a detailed protocolFor completeness we write down the correction step (Bayesian update) of the KF, although its derivation can be found in Chen, 2003; Anderson and Moore, 2012; Moffatt, 2007. The mean ensemble state $\mathbf{n}(t)$ is corrected by the current data point
where Kalman gain matrix ${\mathbf{K}}_{\mathrm{Kal}}:=\mathbf{P}{(t)}_{\mathrm{prior}}{\mathbf{H}}^{\top}{\mathbf{\Sigma}}^{1}$ evaluates the intrinsic noise against the experimental noise. How precise are my model predictions about $\mathbf{n}(t)$ compared with the information gained about $\mathbf{n}(t)$ by measuring $\mathbf{y}(t)$. The covariance $\mathbf{P}(t)$ of the ensemble state $\mathbf{n}(t)$ is corrected by
Equation 58,59, 37 and 38d form the filtering equations which summarize the algorithm. One initialises the first $\mathbf{n}(0)$ and $\mathbf{P}(0)$ and with an equilibrium assumption.
Microscopicreversibility as a hierarchical prior
Request a detailed protocolWe applied microscopicreversibility (Colquhoun et al., 2004) by a hierarchical prior distribution. Usually, microreversibility is strictly enforced by setting the product of the rates of the clockwise loop ${k}_{1},{k}_{2},{k}_{3}{k}_{4}$ equal to the anticlockwise loop ${k}_{5},{k}_{6},{k}_{7},{k}_{8}$ and then solving for the desired rate parameter to be replaced. This means that the classical approach can be described by drawing the resulting rate from a Dirac delta distribution prior with
Following Equation 60, we can model microscopicreversibility with any hierarchical prior distribution whose limit for a vanishing variance is Equation 60. For mathematical convenience, we defined the hierarchical prior by a sharply peaking beta distribution
and by rescaling and adding an offset
we derived a conditional prior which allows at maximum a ±0.005 relative deviation from the strict microscopicreversibility. The ±0.005 microreversibility constraint is applied in (Figure 7b–d). In this way, one could model or even test possible small violation of microscopicreversibility if smaller beta parameters such as $\mathrm{beta}(1,1)$ would be chosen.
Appendix 1
Efficiency of the Hamiltonian Monte Carlo sampler
Bayesian posterior sampling requires a fast and efficient sampler which requires in turn a fast evaluation of the likelihood and prior distribution. Optimally, the calculation time should increase as little as possible with the amount of data. This is why we chose the KF framework and parallelized the processing of a time trace across many CPUs. Compared to the maximization of the likelihood, sampling from the posterior requires approximately one to multiple orders of magnitude times more evaluations of the likelihood function to have a decent representation of the posterior. For example, in Moffatt, 2007 both algorithms need roughly 10^{2} to 10^{3} evaluations of the likelihood function to converge. This is only a small fraction of the number of evaluations needed here to obtain a converged estimate of the posterior (in fact the typical "warmup" phase of the sampler which we used is barely half completed with such a small number of evaluations).
The ratio of the number of used samples to the number of likelihood evaluations can serve as a simple efficiency measure. However, this ratio is a too optimistic measure as revealed by a look into the internal dynamics of Hamiltonian Monte Carlo sampler (HMC). HMC treats the current parameter sample as part of a phase space point of a (dynamic) Hamiltonian system. The force acting on the system is the derivative of the negative logarithm of the posterior density Betancourt, 2017. Each suggested sample is derived from a ballistic movement with random kinetic energy in the augmented parameter space. Calculating this movement is done by integration of the corresponding Hamiltonian equations. Therefore, one suggested sample requires many evaluations of the gradient of the log posterior distribution. How many evaluations were needed in turn depends on the shape of the posterior and the length of the ballistic movement.
Strong changes of the gradient of the posterior (high curvature of the posterior) require more integration steps. Setting all those sampler parameters requires either expert knowledge about the statistical model and sampler or a sampler which automatically learns optimal sampler parameter settings (warmup) before it starts the actual sampling of the parameters of interest. Nevertheless, HMC generates samples (Appendix 1—figure 1a) from highdimensional, correlated posteriors more efficiently than many other classical Markov chain Monte Carlo methods (Betancourt, 2017). By more efficient we mean the product of the speed of drawing one correlated sample with the amount of iterations needed for the autocorrelation function of those samples to vanish sufficiently. By construction, the samples from any Markov chain Monte Carlo method are correlated (Appendix 1—figure 1b). The quality of the adaptive NUTS HMC sampler provided by Stan can by seen in Appendix 1—figure 1b. The sample traces show a small anticorrelation whose absolute value is hardly larger than 0.05. This is achieved by letting the sampler adapt to the geometry of the posterior in the warmup phase. The samples collected during the warmup phase are discarded.
Hereinafter, we show that likelihood and sampler considerations are one aspect of computational efficiency. The other is the implementation of a parallelized algorithm. The blue curve (Appendix 1—figure 2) represents the scaling of the KF with an implementation of the algorithm which saves besides the sampled parameters the predicted mean and the covariance of the signal. This means for cPCF data that for each parameter sample $5\cdot {N}_{\mathrm{traces}}\cdot {N}_{\mathrm{data}}/\mathrm{CPU}$ more memory operations are needed on the cluster node. Note, since we assigned each ligand concentration jump and the following data points to its own CPU the following relation holds: ${N}_{\mathrm{data}}/\mathrm{CPU}={N}_{\mathrm{data}}/\mathrm{trace}$. Skipping saving those derived quantities (orange curve) creates a speedup of roughly 2 orders of magnitude. The derived quantities are redundant quantities which we suggest to calculate by feeding a small subsample of the posterior samples to the KF filter which then reanalyzes the traces by the given draws but does not redraw from the posterior. See GitHub: https://github.com/JanMuench/Tutorial_Patchclamp_data.
We compare the computation time for the KF (green) and the RE approach (blue) for the 4state model (Appendix 1—figure 3a) versus the data quality by ${N}_{\mathrm{ch}}$. The calculation time t_{calc} rises for both algorithms roughly like $\sim \sqrt{{N}_{\mathrm{ch}}}$. For each curve the likelihood evaluations stay constant for the whole plot. Thus this scaling relates to the integration time of the HMC sampler. Taking the average of the last 5 calculation times shows that the KF is about $2.7\pm 0.2$ times slower than the RE approach. The computation time for cPCF data of the 5state2open states model (Appendix 1—figure 3b) shows a similar $\sim \sqrt{{N}_{\mathrm{ch}}}$regime until it seems to become independent of ${N}_{\mathrm{ch}}$. Note, the different ${N}_{\mathrm{ch}}$ interval which is displayed.
Taking the average of the last 4 computation times shows that the KF is about $2.24\pm 0.01$ times slower than the RE approach. Surprisingly, with increasing model complexity the ratio of the computation time becomes not necessarily larger. The computation time for cPCF data of the 6state1open state model (Appendix 1—figure 3c) shows a similar $\sim \sqrt{{N}_{\mathrm{ch}}}$regime until it becomes independent of ${N}_{\mathrm{ch}}$. Taking the average of the last 4 calculation times reveals that the KF is about $3.3\pm 0.3$ times slower than the RE approach. With increasing model complexity, the ratio of the computation times becomes larger. Note, that there is little variation of the absolute calculation time of the algorithm across model complexity which is a good prospect for more complex kinetic schemes. Nevertheless, to achieve this computational speed we used in total 72 CPUs on one node, 20 CPUs per chain for 20 time traces. Thus, nodes with more then 80 CPUs would enable a better performance. Overall, the more complex likelihood calculations of the KF require 2–3 times more calculation for the tested models and cPCF data. Surprisingly, for most part of the displayed interval the KF is faster than the RE approach (Appendix 1—figure 4a) for PC data. Considering the more expensive likelihood evaluation (prediction and correction) of the KF, this result reminds that this computational time benchmark is a downstream test that integrates all aspects of the sampling. One possible explanation could be the following: We describe in the main text (Figures 4b1—6) for the RE approach, that due to the treatment of every data point as an individual draw from a multinomial distribution, HDCVs (Figures 5–6) are too narrow. This problem gets exacerbated (Figure 8) as we used 10 times more data points which we interpret as the cause of the KF being faster than the RE approach (Appendix 1—figure 4a). The narrower the posterior is the more integration steps from the sampler are needed to calculate the proposed trajectory with sufficient precision. Another example displayed (Appendix 1—figure 4b), has an increased kinetic scheme complexity. Note that we reduced the analyzing frequency 10 times to be comparable with the analyzing frequency of cPCF data. Under this condition, the intuition from likelihood calculation complexity is confirmed that the KF is slower. Taking the average of the last 4 calculation times shows that the KF is about $1.8\pm 0.2$ slower than the RE algorithm for PC data.
Convergence diagnostics
In the limit of infinitely many samples drawn from a posterior by any MCMC method, given that posterior space is simply connected, the sampling is guaranteed to converge to the typical set Betancourt, 2017. For a finite amount of samples from the posterior it is not guaranteed that the samples are representative for the typical set of the posterior. To ensure that the trace has converged to the typical set is an active field of statistical research. A discussion of these important consistency checks (Gelman and Rubin, 1992a; Gelman et al., 2013; Vats and Knudson, 2018; Gabry et al., 2019; Vehtari et al., 2021) ensuring that the sampled traces represent the typical set, is beyond the scope of this paper. In short, we usually worked with 4 independent sampling chains of the posterior (Gelman and Rubin, 1992b). On the one hand, we visually checked that all 4 chains do report the same typical set to verify that the sampler is fine. This convergence can be monitored by the $\widehat{R}$ statistics (Gelman et al., 2013; Vehtari et al., 2021). $\widehat{R}$ evaluates differences between the within and betweenchain parameter estimates. A bad mixing is indicated by large values of $\widehat{R}$. Perfectly converged chains would be reported with $\widehat{R}=1$. Additionally, the effective sample size ${N}_{\mathrm{eff}}$ of each object of interest needs to be monitored. Both of them are reported by the “summary” method from the “fit” object in PyStan. We show (Appendix 1—table 1) an edited version of the output from this method. The last two columns would indicate signs of missing convergence. The second column reports the Markov sampling error telling if we drew enough samples in total. Note that any quantity far out in the tails would have a much larger error. There were cases when the sampler was not fine for all chains, meaning that 3 chains showed the same posterior but one showed something different. In such cases we simply discarded that chain. This occurred in particular more often with the RE approach. As we are benchmarking methods and know the true parameter values, it was rather obvious which chain did not work well. In particular, the successful binomial test of the HDCV is a downstream test that confirms besides the assumptions of the algorithm also whether the sampler worked sufficiently. For the Bayesian filter, sampler problems occurred at very low ion channel counts below ${N}_{\mathrm{ch}}=200$.
HMC parameter settings
The sampling efficiency of an HMC algorithm is sensitive to three parameters (Neal, 2011, Hoffman and Gelman, 2014). The three parameters, discretization time $\u03f5$, the metric $\mathbf{M}$ and the number of steps taken $L$ can be predefined or/and are adapted in the warmup phase of the sampler. A typical number of iterations ${N}_{\mathrm{warm}}$ in the warmup phase for each chain is between ${N}_{\mathrm{warm}}=3\cdot {10}^{3}\mathrm{\dots}7\cdot {10}^{3}$. Since a sufficient ${N}_{\mathrm{warm}}$ depends on the correlation structure of the posterior, which depends on the quality and quantity of the data, we did not engage to optimize the number of warmup steps. We rather checked whether for a set of different data (for example with different ${N}_{\mathrm{ch}}$) all chains are fine.
Stan
Stan is a probabilistic programming language that provides the research community with easy access to HMC sampling, variational Bayes, and optimization procedures like Maximum posterior/ML approaches. Stan uses automatic analytical differentiation of the statistical model such that no analytical treatment by hand has to be done for any tested model. The only requirement from the user is to define a statistical model of the data. The Stan compiler calculates the analytical derivatives of the model and then translates the statistical model to C++code and after that into an executable program. Furthermore, Stan uses adaptive HMC such that the length of the ballistic trajectory as well as the variance of the random kinetic energy to create the ballistic trajectory and the number of integration steps are optimized automatically. Then the program can be started from any highlevel data processing language of the user’s choice such as Python, Matlab, Julia, R or from the command line.
Appendix 2
Markov models for a single ion channel
Markov models and rate models are widely used for modeling molecular kinetics. They provide an interpretation of the data in terms of a set of functional states and the transition rates between these states. Markov models can be estimated from experimentally recorded data as well as from computer simulation data. The use of Markov models with onestep memory is supported by the concept of the molecular free energy landscape. Molecular energy landscapes are typically characterized by conformationally welldefined freeenergy minima that are separated by freeenergy barriers. State transitions in molecules are thermally activated barriercrossing events on this landscape (Frauenfelder et al., 1991) leading to a rapid equilibration of the system in the vicinity of this new minimum. Memory of other minima that have been visited in the past is not required. Regarding the wide spectrum of time scales at which processes in a protein take place, one has to be aware that there is typically a small number of relaxation modes with excessively long autocorrelation times and many relaxation modes with much faster autocorrelation times. To model the slow, experimentally accessible processes, it is sufficient to retain the small number of slow modes (Noé et al., 2011). It has been shown rigorously that working with the set of slow modes is equivalent to model the state dynamics with a small number of fuzzily defined metastable states in the full conformational space (Deuflhard and Weber, 2005) Later it has been shown that the set of slow modes can be well approximated with a hidden Markov model (Noé et al., 2013).
Appendix 3
Eigenvalues and fitting of sums exponential decays: Mean time evolution of the spectral components of the time evolution
From the experimental perspective we are used to fit sums of exponential decays to time series data of any signal such as currents or fluorescence.
The mathematical background for this procedure is the spectral decomposition of the solution of the corresponding RE equation
which is a vector differential equation whose evolution is governed by the rate matrix $\mathbf{K}$.The general solution to the mean time evolution of the system is
A Markov state model has as many real valued eigenvalues as it has states $M$. With those eigenvalues one can decompose the solution into its spectral components
All of the eigenvalues obey $\alpha \le 0$. One of them always fulfills ${\alpha}_{0}=0$ which belongs to the equilibrium probability distribution of the ion channel. With $exp(0)=1$ we can write the sum as
The eigenvalue ${\alpha}_{0}=0$ determines the equilibrium solution. Due to ${\alpha}_{i}\le 0$ the solution is stable and converges to
The time evolution of the mean signal in its spectral components can then be derived
by multiplying with the observation matrix $\mathbf{H}$ and setting ${\alpha}_{i}=\frac{1}{{\tau}_{i}}$ we derived Equation 63, Colquhoun and Hawkes, 1995a. To derive the likelihood based on this deterministic approximation Milescu et al., 2005 one assumes for each data point that the ensemble state $\mathbf{n}(t)$ is drawn from
with the additional approximation of the multinomial distribution by a multivariate normal distribution. Subsequently, the equation
provides the contribution of the recording system to the signal of the ensemble state. In contrast, the KF approximates a
distribution with multivariate normal distributions (see Methods) and then the details of the experimental setup are adjoined
Appendix 4
A note about Kalman filtering within other frameworks such as variational Bayes, maximum a posteriori and ML
We like to emphasize that the key property of the Kalman filter likelihood for an autocorrelated time series (see Equation 9) to factorize over time enables us to treat each data point as if it were independent. The alternative is to treat the whole time series as one highdimensional data point as done in Celentano and Hawkes, 2004. A ML Kalman filter (Moffatt, 2007) would try to optimize Equation 9, which is usually done in the log space to improve numerical stability. Taking the logarithm of Equation 9 gives also an insight for the reader, who might be more experienced in using residual sums of squares for fitting.
where we used the logarithmic algebra rule $\mathrm{log}({\prod}_{i}{a}_{i})={\sum}_{i}\mathrm{log}({a}_{i})$ Using the functional form of the multivariate normal distribution, we arrive at
which is a vectorial residual sum of squares whose summands are weighted by ${\mathbf{HP}}_{t}{\mathbf{H}}^{\top}+{\mathbf{\Sigma}}_{t}$. The past observations of time series are encoded in ${\mathbf{P}}_{t}$ and $E[{\mathbf{n}}_{t}]$, which are calculated with the Kalman filtering equations before the likelihood can be evaluated.
Generally, the user of our algorithm is not restricted to the full Bayesian inference. He/she can also stick to optimizing Equation 71. This is because in the probabilistic programming language Stan one only formulates the statistical model. The method to evaluate the statistical model and the data is provided by the Stan compiler and the high level programming language interface. Together they allow to switch quickly between HMC sampling, variational Bayesian posterior estimation, maximum a posteriori and ML. In particular, the variational Bayesian posterior might be a way to ease the computational burden due HMC sampling while keeping many of the benefits of knowing the full posterior distribution.
In the previous versions before PyStan 3.0 all three approaches were supported. Currently, with PyStan 3.0 only HMC sampling of the posterior is supported but, if desired, any other interface can be used (Matlab, Julia, R, CmdStan, Pystan 2.9, CmdStanPy,…) to perform optimization or variational Bayesian approaches. No actual changes in the Stan code are required, eventually with the exception that the CPU based parallelization needs to be eliminated. Given that one chooses optimization, the differences between maximum a posteriori/penalized maximum likelihood or only ML are just a matter of prior choice which need to be coded into the statistical model. Apart from those minor points, there is no need of extensive reimplementation of the code.
Appendix 5
The fluorescence signal of cPCF experiments
First four moments of a photomultiplier signal
In this work, the KF analysis assumes Poisson statistics for the fluorescence signal in cPCF. Many commercial microscopes are not equipped with photon counting detectors or detectors are not operated in photoncounting mode, often to due to ease of use or limitation in dynamic range. Therefore, it is important to verify that the fluorescence signal follows, at least approximately, Poisson counting statistics. In particular, for the KF it is assumed that higher order statistics, such as skewness and excess kurtosis, vanish. The central assumption of the derivation of our Bayesian filter is $\mathrm{var}[{y}_{\mathrm{f}\mathrm{l}}]=\mathbb{E}[{y}_{\mathrm{f}\mathrm{l}}]$
Here we show that this assumption for the detectors used in our previously used LSM 710 (Carl Zeiss, Jena) system (Biskup et al., 2007; Kusch et al., 2010) is fulfilled by rescaling to photonnumbers. The measured variance obeys $\mathrm{var}[{y}_{\mathrm{f}\mathrm{l}}]=a\cdot \mathbb{E}[{y}_{\mathrm{f}\mathrm{l}}]$. It depends linearly on the mean signals (Appendix 5—figure 1).
For the scaled signal $x$ being Poisson distributed follows $\mu =a$. Rescaling of the signal by $1/a$ provides approximately Poisson distributed values. A linear fit yields $a=205\text{a.u.}(16\text{bit})/\text{photon}$ (for 680 V PMT voltage, 3.26 $\mu s$ pixel dwell time). (Appendix 5—figures 2 and 3) show that excess kurtosis and skewness remain small at all levels of photons/pixel but are somewhat higher than theoretically predicted for Poissondistributed data. The proportionalities are correctly described by the Poisson distribution assumption but the skewness and the kurtosis are too small by a constant factor of $\sqrt{2}$ and 4, respectively. This finding has to be verified for different experimental conditions, because at lower concentration/particle densities and higher count rates, particle number fluctuations can dominate statistics (Brown et al., 2008). For comparison another option would be a Gamma distribution which has the mean and the variance of $E[y]=k\theta $ and $var[y]=k{\theta}^{2}$, respectively. Thus, the applied scaling requires that $\theta =1$. The Gamma distribution has a higher skewness by factor two (independently of θ) than a Poisson distribution and overscores the skewness and excess kurtosis of the detector. For simplicity only the Poisson distribution is considered in this work. In conclusion: Typical cPCF fluorescence signal detection rates are well approximated by a Gamma or Poisson distribution which in turn have the desired property that can be approximated by a normal distribution.
Background noise statistics
In cPCF measurements with fluorescencelabeled ligands, the signals of the ligands bound to the receptors overlap with the signals from freely diffusing fluorescencelabelled ligands in the bulk. This bulk signal is subtracted from the total signal (Biskup et al., 2007). While the mean difference signal ${y}_{\mathrm{fl},\mathrm{k}}(t)$ of the confocal voxel $k$ represents the bound ligands in that voxel, its noise ${y}_{\zeta ,k}(t)$ originates from both bound and bulk ligands. The additional bulk signal, e.g. the fraction of bulk solution inside that voxel, varies from voxel to voxel and can hardly be described theoretically. Nevertheless, it can be determined experimentally (Biskup et al., 2007). At low expression levels or at ligand concentrations above low nanomolar levels, this background signal is not negligible. It scales linearly with the ligand concentration, while the signal from bound receptors depends on the affinity, as estimated by the concentration of half maximum binding $B{C}_{50}$, and the number of ion channels in the membrane of the observed volume. The binding signal saturates at high concentrations (Appendix 5—figure 4). Thus, both high affinity (low $B{C}_{50}$) and high expression reduce the relative contribution of the background to the overall signal, improving the signal to noise ratio.
Practically, the bulk signal is estimated by counterstaining the solution with a spectrally distinct reference dye (Biskup et al., 2007). The spatial distribution of this dye mimics the spatial distribution of the freely diffusing ligands. The bulk absolute concentration as well as the molecular brightness of the reference dye and the labeled ligand differ. Hence, the binding signal is calculated as the average pixel intensity of the scaled difference image between the signal of labeled ligand and reference dye according to
where ${\widehat{\lambda}}_{\mathrm{lig},\mathrm{back}}$ and ${\widehat{\lambda}}_{\mathrm{ref},\mathrm{back}}$ are the arithmetic mean background signals of the ligand and reference dye recorded beyond the membrane where no signal should be recorded. They represent a signal offset which needs to be subtracted. The mean intensities in the bulk, ${\widehat{\lambda}}_{\mathrm{bulk}}$ and ${\widehat{\lambda}}_{\mathrm{ref}}$, are estimated outside the pipette. In order to get the correct scaling, the mean intensities need to be corrected by the respective mean background signals. If $\frac{{\widehat{\lambda}}_{\mathrm{lig}}{\widehat{\lambda}}_{\mathrm{lig},\mathrm{back}}}{{\widehat{\lambda}}_{\mathrm{ref}}{\widehat{\lambda}}_{\mathrm{ref},\mathrm{back}}}=1$ holds then ${y}_{\mathrm{fl},\mathrm{bin}}$ would be Skellam distributed (Hwang et al., 2007). The total signal is then ${y}_{\mathrm{fl}}={\sum}_{k}{y}_{\mathrm{fl},\mathrm{k}}$. This procedure creates $E[{y}_{\zeta}]=0$ but adds an additional noise term $\mathit{\zeta}({t}_{j})$. For the general case of different intensities, we name the distribution ’scaled Skellam distributed’. The scaling variance of the background noise in each voxel of the difference image
is derived from simulated data further below. ${\lambda}_{\mathrm{lig}}$ and ${\lambda}_{\mathrm{ref}}$ are the fluorescence intensity from the freely diffusing ligands and reference dye molecules per voxel, respectively. ${\lambda}_{\mathrm{lig}}$ and ${\lambda}_{\mathrm{ref}}$ are proportional to the volume fraction of the voxel, which is occupied by the bulk, and to the respective concentrations. To achieve a symmetric $\mathbb{P}(\zeta )$, one can set ${\lambda}_{\mathrm{lig}}={\lambda}_{\mathrm{ref}}$. The summed variance of all selected voxels can be tabulated according to
To mimic an experiment which creates time series data $\zeta (t)$, we draw Poisson numbers for the signal from the membrane $\mathrm{Poisson}(\mathbf{Hn}(t))$ and for the signal from the bulk we draw numbers from the two respective Poisson distributions. Then subtraction of the two background signals is performed according to
assuming that the dark count signal has been correctly subtracted. Then we add the bulk signal to the bound ligand signal. In this way we produce a time trace with colored noise by the Gillespie algorithm and add white noise to time traces as it is observed in real experiments.
Deriving the moments of the background noise for the difference signal
For the KF the variance, skewness and kurtosis arising from the background noise has to be calculated. Skewness and excess kurtosis of the distribution have to be small compared to the total variance of the signal including all noise sources because only in this case the KF algorithm can be considered as the optimal solution for the filtering and inference problem (Anderson and Moore, 2012). In the following the 2nd to 4th moment of ζ are derived. The noise intensity parameter of the reference dye ${\lambda}_{\mathrm{ref}}$ is proportional to ${\rho}_{\mathrm{ref}}{\lambda}_{\mathrm{bulk}}$ with $V$ being the confocal volume fraction containing fluorophores and ${\rho}_{\mathrm{ref}}$ the density of the fluorophores in this volume. In Appendix 5—figure 6 we deduce master curves for the variance skewness and excess kurtosis of the white noise by drawing $4\cdot {10}^{5}$ Poisson numbers from the respective Poisson distribution and subtract them from each other according to Appendix 5 Equation 78. The variance is derived empirically to be
In Appendix 5—figure 6a, we confirm the intuition ${\lambda}_{\mathrm{ref}}\to \mathrm{\infty}\Rightarrow \mathrm{var}(\zeta )={\lambda}_{\mathrm{lig}}$. Optimally, the skewness should be zero to avoid a biased estimate of $\mathit{\theta}$ when the data are analyzed by the KF. Empirically, for ${\lambda}_{lig}\ll {\lambda}_{ref}$ the skewness holds
Additionally for $\lambda}_{lig}<{\lambda}_{ref$ the skewness holds
It is zero when ${\lambda}_{ref}={\lambda}_{lig}$. The KF is optimal if the kurtosis excess approaches zero, in other words if ζ is distributed normally. Empirically the kurtosis holds this
for $\lambda}_{ref}\le {\lambda}_{lig$. The relative intensity $\lambda}_{lig$ of the voxel fraction compared to the intensity $\lambda}_{b$ depends on the affinity of the ligand to the receptor, the number of receptors in the patch, and the density of the fluorophores $p}_{lig$ at the patch. For larger concentrations the ratio should be $\lambda}_{lig}/{\lambda}_{ref$.
Appendix 6
Output statistics of Bayesian filters
Classical Kalman Filter without openchannel noise
Assuming that current measurements are only compromised by additive technical white noise ν but do not contain openchannel noise ${\nu}_{\mathrm{op}}$, then our noise model reduces to
The noise term $\nu}_{\mathrm{m}$ has a mean of $E[{\nu}_{\mathrm{m}}]=0$ and variance $E[{\nu}_{\mathrm{m}}^{2}]={\sigma}_{\mathrm{m}}^{2}=const$. One has to keep in mind that we have to add an extra variance term originating from the dispersion of channels over the state space, as encoded by $\mathbf{P}(t)$ and $\mathbf{n}(t)$. The uncertainty $\mathbf{P}(t)$ is calculated using Methods Equation 30. The variance of the total output is
The two cross terms $E[\nu ({t}_{1}){(\mathbf{n}E[\mathbf{n}])}^{T}{\mathbf{H}}^{T}]$ and $E[\mathbf{H}(\mathbf{n}E[\mathbf{n}])\nu {({t}_{1})}^{T}]$ are zero since ν is independent of $\mathbf{n}$ and $E[{\nu}_{\mathrm{m}}]=0$. Our derivation is equivalent to marginalization over the predicted normal prior of the ensemble state $\mathbb{P}(\mathbf{n}(t){\mathcal{Y}}_{t1})$ at the time of the measurement except that the prior distribution could be any probability distribution with some mean and variance. Equation 84a is the classical KF variance prediction of a signal. The first term in Equation 84a, describes the variance from stochastic gating and that the ensemble state is hidden. Notably, by Methods Equation 30 we realize that $\mathrm{var}(y(t))$ contains information about $\mathbf{T}$ and $\mathbf{n}(t1)$, which we can exploit with the KF framework.
A generalized Kalman filter with statedependent openchannel noise
In addition to the standard KF with only additive noise (Moffatt, 2007; Anderson and Moore, 2012; Chen, 2003), fluctuations arising from the singlechannel gating lead to a second whitenoise term ${\nu}_{\mathrm{op}}{n}_{4}(t)$, causing statedependency of our noise model. The output model is then
The second noise term ${\nu}_{\mathrm{op}}$ is defined in terms of the first two moments $E({\nu}_{\mathrm{op}})=0$ and therefore $\mathrm{var}({\nu}_{\mathrm{op}})=E({\nu}_{\mathrm{op}}^{2})={\sigma}_{\mathrm{op}}^{2}{\mathbf{n}}_{4}(t)$. To the best of our knowledge such a statedependent noise makes the following integration intractable
When assuming that the relative fluctuations of $\mathbf{n}(t)$ are on average small then n_{4} in the denominator is close to $E({n}_{4})$ of the state. Thus, the incremental likelihood can be written as in the standard KF, with the only difference that the measurement noise is the sum of two components.
To see that this approximation of the variance is correct, we apply the law of total variance decomposition Weiss, 2005.
The terms $\mathbf{HP}(t){\mathbf{H}}^{\top}+{\sigma}_{m}^{2}$ are the standard output covariance matrix. Again $\mathbf{P}(t)$ contains information about $\mathbf{T}$, $\mathbf{n}(t1)$ while the additional variance term includes information about the current $\mathbf{n}(t)$. The information contained in the noise influences likelihood in two ways. By the variance or covariance of the current $\mathbf{y}(t)$ but also for $\mathbf{y}(t+1)$ in correction step by the Kalman gain ${\mathbf{K}}_{\mathrm{Kal}}$ matrix defined in the next section.
Appendix 7
Error induced for the RE or KF approach by analog filtering of PC data
In order to mimic an analog signal before the analogtodigital conversion we simulated 5 different data sets of 100 kHz signals which were then filtered by a digital fourthorder Bessel filter. The activation curves were then analyzed with the Bayesian filter at 125 Hz and the deactivation curves at sampling rates between $166500$ Hz. Operating the Bayesian filter at a lower frequency is necessary because due to applying the Bessel filter the former white noise of the signal obtained additional time correlations. Thus, an all data points fit would immediately violate the white noise assumption of Equation 4 which we restore by analyzing at a much lower frequency. We then let the time scales of the induced time correlations become larger and larger by decreasing f_{cut}. We show (Appendix 7—figure 1a) the results of the Euclidean Error for 3 different cases for PC data. The RE approach (blue) and the Bayesian filter (red) share the same analyzing frequency. In contrast, an increased ${f}_{\mathrm{ana}}=16605000$ Hz (green) is used for the Bayesian filter. Both algorithms (blue, red) with the same f_{ana} show a similar rather constant region separated by an offset. Similar, to cPCF data the KF is more robust. The Bayesian filter with ${f}_{\mathrm{ana}}=16605000$ Hz (green) does not show this constant region but outperforms the Bayesian filter with the lower f_{ana} if the recording system is operated with minimal analog filtering. This becomes even more apparent when we consider the single parameter deviation vs. f_{cut} (Appendix 7—figure 1b(16)). Note, the strong dependence of the critical f_{cut} at which the performance of all algorithms becomes suddenly worse scales with f_{ana}. Additionally, it is crucial, independently of the algorithm, that ${f}_{\mathrm{cut}}\gg {f}_{\mathrm{ana}}$. In (Appendix 7—figure 1b(16)) we demonstrate the distribution of the errors on the single parameter level and compare only the KF for different f_{ana}. Similar to the findings in the main text, on the one hand, the Euclidean error (red) shows some robustness against f_{cut}. On the other hand, the influence of f_{cut} on the individual parameter level sets in immediately and is complex. The best results are achieved with minimal analog filtering and a high f_{ana}.
Appendix 8
Details of the specified two methods to count the probability mass needed to include the true value
In Figure 4c we compare the true against the expected success probabilities of finding the complete true rate matrix within an $n$dimensional HDCV for different expected success probabilities of a perfect model. By ‘expected’ and ‘perfect’ we refer to a fictitious ideal algorithm which exactly models exhaustively all details of the true process, including all measurement details. The first way assumes a multivariate normal posterior distribution with $\mathbb{P}{(\mathit{\theta}y)}_{\mathrm{post}}\approx \mathrm{normal}(\mathit{\theta}E[\mathit{\theta}],\mathbf{\Sigma})$. Then the ellipsoid of a constant probability density is exactly the surface of a HDCV of given probability mass $P$. For a twodimensional representation see Figure 4d below the diagonal. In one dimension, the ellipsoid consists of the two points with a distance ${d}_{\mathrm{Mah}}=\sqrt{{(\theta E[\theta ])}^{2}}/\sigma $ from the mean (inset Figure 5a). In general, the $n$dimensional ellipsoids around the mean of a multivariate normal distribution can be described by points which have the same Mahalanobis distance ${d}_{\mathrm{Mah}}$ from the mean.
For a multivariate standard normal distribution without correlation, the Mahalanobis distance becomes the Euclidean distance $\sqrt{{(\mathit{\theta}E[\mathit{\theta}])}^{T}(\mathit{\theta}E[\mathit{\theta}])}$. Rewriting the multivariate normal distribution in terms of the Mahalanobis distance
reveals that all points $\mathit{\theta}$ with ${d}_{\mathrm{Mah}}=const$ have the same probability density. The random variable ${d}_{\mathrm{Mah}}^{2}$ is χsquare distributed. Thus the χsquare distribution (Figure 5a) is the probability density of drawing an $\mathit{\theta}$ in $n$ dimensions and finding it on the ellipsoid’s $n1$dimensional surface, at a distance ${d}_{\mathrm{Mah}}$ from the mean. This allows us to use the cumulative ${\chi}^{2}$square distribution function to calculate the probability mass inside the ellipsoid which is the desired HDCV. By evaluating ${d}_{\mathrm{Mah}}=\sqrt{({\mathit{\theta}}_{\mathrm{true}}E[\mathit{\theta}]){\mathrm{\Sigma}}^{1}({\mathit{\theta}}_{\mathrm{true}}E[\mathit{\theta}])}$ and ${\chi}_{\mathrm{cdf}}^{2}({d}_{\mathrm{Mah}}^{2})$ we specify how much volume, in units of probability mass, has to be counted until the volume includes the true value. Note, that with increasing dimensionality (Figure 5a) the probability mass is shifted away from the mode of the probability density. For general probability densities, this is also true in higher dimensions where one only rarely finds the true value within a small distance to the maximum of the probability density. Note, that there is no mathematical theorem for singular models (Watanabe, 2007) saying that the posterior approaches asymptotically a multivariate normal with increasing data quality/quantity. Consequently, the underlying assumption that the posterior distribution is multivariate normal, is situation dependent valid or highly questionable. Thus the displayed method should be validated additionally in an independent way. To determine the probability mass needed to include the true value into the HDCV for such a posterior, we can also use a histogrambased method. One starts by constructing an $n$dimensional histogram from the samples of the posterior and by initializing a global variable with zero. Then we start counting from the bin with most counts and check whether the true value falls inside this bin. If not, we add the probability mass inside that bin to the global variable. Repeating this for next lower bins eventually leads to the detection of the bin with the true rate matrix inside. On detection, the global variable contains the soughtafter value $P$. While this procedure does not depend on a multivariate normal assumption, it is prone to errors due to the discrete bins and the finite samples from the posterior.
Nevertheless, both procedures show (Figure 5b) good agreement when plotting the volume/ probability mass needed to include the true parameter value vs. ${N}_{\mathrm{ch}}$. Again ${N}_{\mathrm{ch}}\approx 200$ seems to be a reasonable data quality to trust the statistics of the posterior using the KF. In contrast, the posterior of the RE approach never includes the true values with a reasonable HDCV. Note, that a probability mass of $\approx 1$ to reach the true value means that qualitatively speaking the estimate of the RE approach is infinitely far away from the true value in terms of the Mahalanobis distance. Further, due to the finite sampling the method of histogram counting is not qualified for the largest HDCVs approaching $P\approx 1$. The two developed methods contrast the Euclidean distance from the true values to the maximum mode, median or even the mean of the distribution against all higher moments of the distribution. Thus it tests the overall shape of the posterior.
Appendix 9
Uncertainty quantification for the 5state and 6state model with cPCF data
The effect of the second observable on the single parameter level for the 5state model can be seen in Appendix 9—figure 1. The biased estimates and the unidentified parameters of the RE approach for PC data (Figure 9) are eliminated by the fluorescence data.
Even the overconfidence problem seems to be decreased (compare with Figures 4b1—6). In contrast, the HDCIs of the 6states1openstate model for cPCF data (Appendix 9—figure 2) display again a much increased overconfidence problem of the rate equation approach. This indicates that not simply counting of states but the actual topology of the kinetic scheme and the structure of the observation matrix $\mathbf{H}$ determines the scale of the overconfidence problem. Apparently, the more information comes from the fluorescence data relative to the information from the current data the less grave is the overconfidence problem and the less different are the Euclidean errors.
Data availability
We included the simulated data time traces into supporting files and we uploaded the source code on https://github.com/JanMuench/Tutorial_Patchclamp_data and https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data.
References

Bayesian inference for ion–channel gating mechanisms directly from single–channel recordings, using Markov chain Monte CarloProceedings of the Royal Society of London. Series A 455:2879–2932.https://doi.org/10.1098/rspa.1999.0432

MCMC for IonChannel SojournTime Data: A Good ProposalBiophysical Journal 111:267–268.https://doi.org/10.1016/j.bpj.2016.02.042

Kinetic relationship between the voltage sensor and the activation gate in spHCN channelsThe Journal of General Physiology 130:71–81.https://doi.org/10.1085/jgp.200709769

Filtering and inference for stochastic oscillators with distributed delaysBioinformatics (Oxford, England) 35:1380–1387.https://doi.org/10.1093/bioinformatics/bty782

Stan: A probabilistic programming languageJournal of Statistical Software 76:10.https://doi.org/10.18637/jss.v076.i01

On kalman filter for linear system with colored measurement noiseJournal of Geodesy 88:1163–1170.https://doi.org/10.1007/s0019001407517

Characterization of single channel currents using digital signal processing techniques based on Hidden Markov ModelsPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 329:265–285.https://doi.org/10.1098/rstb.1990.0170

BookA QMatrix CookbookIn: Sakmann B, Neher E, editors. SingleChannel Recording. Boston, MA: Springer. pp. 589–633.https://doi.org/10.1007/9781441912299_20

BookThe Principles of the Stochastic Interpretation of IonChannel MechanismsIn: Sakmann B, Neher E, editors. SingleChannel Recording. Boston, MA: Springer. pp. 397–482.https://doi.org/10.1007/9781441912299_18

On the stochastic properties of single ion channelsProceedings of the Royal Society of London. Series B. Biological Sciences 211:205–235.https://doi.org/10.1098/rspb.1981.0003

How to impose microscopic reversibility in complex reaction mechanismsBiophysical Journal 86:3510–3518.https://doi.org/10.1529/biophysj.103.038679

Statistical Analysis of Ion Channel Data Using Hidden Markov Models With Correlated StateDependent Noise and FilteringJournal of the American Statistical Association 96:805–815.https://doi.org/10.1198/016214501753208519

Robust Perron cluster analysis in conformation dynamicsLinear Algebra and Its Applications 398:161–184.https://doi.org/10.1016/j.laa.2004.10.026

Trajectory inference and parameter estimation in stochastic models with temporally aggregated dataStatistics and Computing 28:1053–1072.https://doi.org/10.1007/s112220179779x

The energy landscapes and motions of proteinsScience (New York, N.Y.) 254:1598–1603.https://doi.org/10.1126/science.1749933

Maximum likelihood estimation and identification directly from singlechannel recordingsProceedings. Biological Sciences 249:125–132.https://doi.org/10.1098/rspb.1992.0094

Visualization in Bayesian workflowJournal of the Royal Statistical Society 182:389–402.https://doi.org/10.1111/rssa.12378

A single series from the gibbs sampler provides A false sense of securityBayesian Statistics 4:75.

Inference from Iterative Simulation Using Multiple SequencesStatistical Science 7:1136.https://doi.org/10.1214/ss/1177011136

Understanding predictive information criteria for Bayesian modelsStatistics and Computing 24:997–1016.https://doi.org/10.1007/s1122201394162

Stan: A probabilistic programming language for bayesian inference and optimizationJ. of Educational and Behavioral Statistics 40:113.https://doi.org/10.3102/1076998615606113

The chemical Langevin equationThe Journal of Chemical Physics 113:297–306.https://doi.org/10.1063/1.481811

Exact stochastic simulation of coupled chemical reactionsThe Journal of Physical Chemistry 81:2340–2361.https://doi.org/10.1021/j100540a008

BookBayesian Inference for the Chemical Master Equation Using Approximate ModelsUlm, Germany: Elsevier.

Markov chain Monte Carlo fitting of singlechannel data from inositol trisphosphate receptorsJournal of Theoretical Biology 257:460–474.https://doi.org/10.1016/j.jtbi.2008.12.020

Linearnoise approximation and the chemical master equation agree up to secondorder moments for a class of chemical systemsPhysical Review. E, Statistical, Nonlinear, and Soft Matter Physics 92:042124.https://doi.org/10.1103/PhysRevE.92.042124

Determination of parameter identifiability in nonlinear biophysical models: A Bayesian approachThe Journal of General Physiology 143:401–416.https://doi.org/10.1085/jgp.201311116

A primer on Bayesian inference for biophysical systemsBiophysical Journal 108:2103–2113.https://doi.org/10.1016/j.bpj.2015.03.042

Analyzing singlemolecule time series via nonparametric Bayesian inferenceBiophysical Journal 108:540–556.https://doi.org/10.1016/j.bpj.2014.12.016

The nouturn sampler: adaptively setting path lengths in hamiltonian monte carloJournal of Machine Learning Research 15:1593–1623.

Estimating kinetic constants from single channel dataBiophysical Journal 43:207–223.https://doi.org/10.1016/S00063495(83)843410

ConferenceSensor noise modeling using the Skellam distribution: Application to the color edge detectionIEEE conference on computer vision and pattern recognition.https://doi.org/10.1109/CVPR.2007.383004

Solving the chemical master equation for monomolecular reaction systems analyticallyJournal of Mathematical Biology 54:1–26.https://doi.org/10.1007/s002850060034x

BookConfidence intervals vs bayesian intervalsIn: Harper WL, Hooker CA, editors. Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science. Berlin, Germany: Springer. pp. 175–257.https://doi.org/10.1007/9789401014366_6

An invariant form for the prior probability in estimation problemsProceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 186:453–461.https://doi.org/10.1098/rspa.1946.0056

A New Approach to Linear Filtering and Prediction ProblemsJournal of Basic Engineering 82:35–45.https://doi.org/10.1115/1.3662552

The Relationship between Stochastic and Deterministic Models for Chemical ReactionsThe Journal of Chemical Physics 57:2976–2978.https://doi.org/10.1063/1.1678692

How subunits cooperate in cAMPinduced activation of homotetrameric HCN2 channelsNature Chemical Biology 8:162–169.https://doi.org/10.1038/nchembio.747

Accounting for variability in ion current recordings using a mathematical model of artefacts in voltageclamp experimentsPhilosophical Transactions of the Royal Society A 378:20190348.https://doi.org/10.1098/rsta.2019.0348

Structural identifiability of equilibrium ligandbinding parametersThe Journal of General Physiology 149:105–119.https://doi.org/10.1085/jgp.201611702

The structure of binding curves and practical identifiability of equilibrium ligandbinding parametersThe Journal of General Physiology 149:121–147.https://doi.org/10.1085/jgp.201611703

Maximum likelihood estimation of ion channel kinetics from macroscopic currentsBiophysical Journal 88:2494–2515.https://doi.org/10.1529/biophysj.104.053256

Listening to the noise: random fluctuations reveal gene network parametersMolecular Systems Biology 5:318.https://doi.org/10.1038/msb.2009.75

Estimating kinetic mechanisms with prior knowledge II: Behavioral constraints and numerical testsJournal of General Physiology 150:339–354.https://doi.org/10.1085/jgp.201711912

Solving ion channel kinetics with the qub softwareBiophysical Reviews and Letters 08:191–211.https://doi.org/10.1142/S1793048013300053

Projected and hidden Markov models for calculating kinetics and metastable states of complex moleculesThe Journal of Chemical Physics 139:184114.https://doi.org/10.1063/1.4828816

Ion Channels in Genetic Epilepsy: From Genes and Mechanisms to DiseaseTargeted TherapiesPharmacological Reviews 70:142–173.https://doi.org/10.1124/pr.117.014456

A tutorial on hidden Markov models and selected applications in speech recognitionProceedings of the IEEE 77:257–286.https://doi.org/10.1109/5.18626

Bayesian restoration of ion channel records using hidden Markov modelsBiophysical Journal 80:1088–1103.https://doi.org/10.1016/S00063495(01)760870

MCMC for hidden Markov models incorporating aggregation of states and filteringBulletin of Mathematical Biology 66:1173–1199.https://doi.org/10.1016/j.bulm.2003.12.001

Efficiently Simulating the Coverage Properties of Interval EstimatesApplied Statistics 35:159.https://doi.org/10.2307/2347266

BookSingleChannel RecordingBerlin, Germany: Springer Science & Business Media.https://doi.org/10.1007/9781441912299

Estimating kinetic mechanisms with prior knowledge I: Linear parameter constraintsThe Journal of General Physiology 150:323–338.https://doi.org/10.1085/jgp.201711911

An Introduction to Infinite HMMs for SingleMolecule Data AnalysisBiophysical Journal 112:2021–2029.https://doi.org/10.1016/j.bpj.2017.04.027

ICON: An Adaptation of Infinite HMMs for Time Traces with DriftBiophysical Journal 112:2117–2126.https://doi.org/10.1016/j.bpj.2017.04.009

MCMC estimation of Markov models for ion channelsBiophysical Journal 100:1919–1929.https://doi.org/10.1016/j.bpj.2011.02.059

MCMC Can Detect Nonidentifiable ModelsBiophysical Journal 103:2275–2286.https://doi.org/10.1016/j.bpj.2012.10.024

Modelling modal gating of ion channels with hierarchical Markov modelsProceedings of the Royal Society A 472:20160122.https://doi.org/10.1098/rspa.2016.0122

Maximum likelihood estimation of biophysical parameters of synaptic receptors from macroscopic currentsFrontiers in Cellular Neuroscience 8:303.https://doi.org/10.3389/fncel.2014.00303

Structural dynamics in the gating ring of cyclic nucleotidegated ion channelsNature Structural & Molecular Biology 14:854–860.https://doi.org/10.1038/nsmb1281

BookStochastic Processes in Physics and ChemistryAmsterdam, Netherlands: Elsevier.https://doi.org/10.1016/B9780444529657.X50004

Applying hidden Markov models to the analysis of single ion channel activityBiophysical Journal 82:1930–1942.https://doi.org/10.1016/S00063495(02)755422

Pacemaker activity of the human sinoatrial node: effects of HCN4 mutations on the hyperpolarizationactivated currentJournal of the Working Groups on Cardiac Pacing 16:384–395.https://doi.org/10.1093/europace/eut348

ConferenceAlmost All Learning Machines are Singular2007 IEEE Symposium on Foundations of Computational Intelligence. pp. 383–388.https://doi.org/10.1109/FOCI.2007.371500

Statedependent cAMP binding to functioning HCN channels studied by patchclamp fluorometryBiophysical Journal 100:1226–1232.https://doi.org/10.1016/j.bpj.2011.01.034
Article and author information
Author details
Funding
Deutsche Forschungsgemeinschaft (TRR 166 ReceptorLight (Project A5) of the)
 Klaus Benndorf
Deutsche Forschungsgemeinschaft (Research Unit 2518 DynIon (Project P2))
 Klaus Benndorf
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
The authors are grateful to E Schulz for designing a software to simulate channel activity in ensemble patches and to Th Eick for performing the simulations. FP acknowledges funding from the Yen PostDoctoral Fellowship in Interdisciplinary Research and from the National Cancer Institute of the National Institutes of Health (NIH) through Grant CAO93577. The authors are also indebted to M Habeck and I Schroeder for comments on the manuscript, to M Bücker for help with the computer cluster at the Friedrich Schiller University Jena, and to F Noé, R Blunck, G Mirams and S Pressé for helpful discussions. This work was supported by the Research Unit 2,518 DynIon (Project P2) and the TRR 166 ReceptorLight (Project A5) of the Deutsche Forschungsgemeinschaft to KB. Model parameterization, prior distributions, and the general timereversible model in bayesian phylogenetics. Systematic Biology, 53(6):877 – 888.
Copyright
© 2022, Münch et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,159
 views

 277
 downloads

 4
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Cancer Biology
 Computational and Systems Biology
This study investigates the variability among patients with nonsmall cell lung cancer (NSCLC) in their responses to immune checkpoint inhibitors (ICIs). Recognizing that patients with advancedstage NSCLC rarely qualify for surgical interventions, it becomes crucial to identify biomarkers that influence responses to ICI therapy. We conducted an analysis of singlecell transcriptomes from 33 lung cancer biopsy samples, with a particular focus on 14 core samples taken before the initiation of palliative ICI treatment. Our objective was to link tumor and immune cell profiles with patient responses to ICI. We discovered that ICI nonresponders exhibited a higher presence of CD4+ regulatory T cells, resident memory T cells, and TH17 cells. This contrasts with the diverse activated CD8+ T cells found in responders. Furthermore, tumor cells in nonresponders frequently showed heightened transcriptional activity in the NFkB and STAT3 pathways, suggesting a potential inherent resistance to ICI therapy. Through the integration of immune cell profiles and tumor molecular signatures, we achieved an discriminative power (area under the curve [AUC]) exceeding 95% in identifying patient responses to ICI treatment. These results underscore the crucial importance of the interplay between tumor and immune microenvironment, including within metastatic sites, in affecting the effectiveness of ICIs in NSCLC.