Bayesian inference of kinetic schemes for ion channels by Kalman filtering

  1. Jan L Münch  Is a corresponding author
  2. Fabian Paul
  3. Ralf Schmauder
  4. Klaus Benndorf  Is a corresponding author
  1. Institut für Physiologie II, Universitätsklinikum Jena, Friedrich Schiller University Jena, Germany
  2. Department of Biochemistry and Molecular Biology, University of Chicago, United States

Abstract

Inferring adequate kinetic schemes for ion channel gating from ensemble currents is a daunting task due to limited information in the data. We address this problem by using a parallelized Bayesian filter to specify hidden Markov models for current and fluorescence data. We demonstrate the flexibility of this algorithm by including different noise distributions. Our generalized Kalman filter outperforms both a classical Kalman filter and a rate equation approach when applied to patch-clamp data exhibiting realistic open-channel noise. The derived generalization also enables inclusion of orthogonal fluorescence data, making unidentifiable parameters identifiable and increasing the accuracy of the parameter estimates by an order of magnitude. By using Bayesian highest credibility volumes, we found that our approach, in contrast to the rate equation approach, yields a realistic uncertainty quantification. Furthermore, the Bayesian filter delivers negligibly biased estimates for a wider range of data quality. For some data sets, it identifies more parameters than the rate equation approach. These results also demonstrate the power of assessing the validity of algorithms by Bayesian credibility volumes in general. Finally, we show that our Bayesian filter is more robust against errors induced by either analog filtering before analog-to-digital conversion or by limited time resolution of fluorescence data than a rate equation approach.

Editor's evaluation

The authors develop a Bayesian approach to modeling signals arising from ensembles of ion channels that can incorporate multiple simultaneously recorded signals such as fluorescence and ionic current. For simulated data from a simple ion channel model where ligand binding drives pore opening, they show that their approach enhances parameter identifiability and/or estimates of parameter uncertainty over more traditional approaches. The developed approach provides a valuable tool for modeling macroscopic time series data including data with multiple observation channels.

https://doi.org/10.7554/eLife.62714.sa0

Introduction

Ion channels are essential proteins for the homeostasis of an organism. Disturbance of their function by mutations often causes severe diseases, such as epilepsy (Oyrer et al., 2018; Goldschen-Ohm et al., 2010), sudden cardiac death (Clancy and Rudy, 2001), or sick sinus syndrome (Verkerk and Wilders, 2014) indicating a medical need (Goldschen-Ohm et al., 2010) to gain further insight into the biophysics of ion channels. The gating of ion channels is usually interpreted by kinetic schemes which are inferred either from macroscopic currents with rate equations (REs) (Colquhoun and Hawkes, 1995b; Celentano and Hawkes, 2004; Milescu et al., 2005; Stepanyuk et al., 2011; Wang et al., 2012) or from single-channel currents using dwell time distributions (Neher and Sakmann, 1976; Colquhoun et al., 1997a; Horn and Lange, 1983; Qin et al., 1996; Epstein et al., 2016; Siekmann et al., 2016) or hidden Markov models (HMMs) (Chung et al., 1990; Fredkin and Rice, 1992; Qin et al., 2000; Venkataramanan and Sigworth, 2002). A HMM consists of a discrete set of metastable states. Changes of their occupation occur as random events over time. Each state is characterized by transition probabilities, related to transition rates, and a probability distribution of the observed signal (Rabiner, 1989). It is becoming increasingly clear that the use of Bayesian statistics in HMM estimation constitutes a major advantage (Ball et al., 1999; De Gunst et al., 2001; Rosales et al., 2001; Rosales, 2004; Gin et al., 2009; Siekmann et al., 2012; Siekmann et al., 2011; Hines et al., 2015; Sgouralis and Pressé, 2017b; Sgouralis and Pressé, 2017a; Kinz-Thompson and Gonzalez, 2018). In ensemble patches, simultaneous orthogonal fluorescence measurement of either conformational changes (Zheng and Zagotta, 2000; Taraska and Zagotta, 2007; Taraska et al., 2009; Bruening-Wright et al., 2007; Kalstrup and Blunck, 2013; Kalstrup and Blunck, 2018; Wulf and Pless, 2018) or ligand binding itself (Biskup et al., 2007; Kusch et al., 2010; Kusch et al., 2011; Wu et al., 2011) has increased insight into the complexity of channel activation.

Currently, a Bayesian estimator that can collect information from cross-correlations and time correlations inherent in multi-dimensional signals of ensembles of ion channels is still missing. Traditionally, macroscopic currents are analyzed with solutions of REs which yield a point estimate of the rate matrix or its eigenvalues (Colquhoun et al., 1997a; Sakmann and Neher, 2013; d’Alcantara et al., 2002; Milescu et al., 2005; Wang et al., 2012) if they are fitted to the data. The RE approach is based on a deterministic differential equation derived by averaging the chemical master equation (CME) for the underlying kinetic scheme (Kurtz, 1972; Van Kampen, 1992; Jahnke and Huisinga, 2007). Its accuracy can be improved by processing the information contained in the intrinsic noise (stochastic gating and binding) (Milescu et al., 2005; Munsky et al., 2009). Nevertheless, all deterministic approaches do not use the information of the time- and cross-correlations of the intrinsic noise. These deterministic approaches are asymptotically valid for an infinite number of channels. Thus, a time trace with a finite number of channels contains, strictly speaking, only one independent data point. Previous rigorous attempts to incorporate the autocorrelation of the intrinsic noise of current data into the estimation (Celentano and Hawkes, 2004) suffer from cubic computational complexity (Stepanyuk et al., 2011) in the amount of data points, rendering the algorithm non-optimal or even impractical for a Bayesian analysis of larger data set. To understand this, note, that a maximum likelihood optimization (ML) usually takes several orders of magnitude fewer likelihood evaluations to converge compared to the number of posterior evaluations when one samples the posterior. One Monte Carlo iteration (Betancourt, 2017) evaluates the posterior distribution and its derivatives many times to propose one sample from the posterior. Stepanyuk suggested an algorithm (Stepanyuk et al., 2011; Stepanyuk et al., 2014) which derives from the algorithm of Celentano and Hawkes, 2004 but evaluates the likelihood quicker. Under certain conditions, Stepanyuk’s algorithm can be faster than the Kalman filter (Moffatt, 2007). The algorithm by Milescu et al., 2005 achieves its superior computation time efficiency at the cost of ignoring the time correlations of the fluctuations. A further argument for our approach, independent of the Bayesian context, is investigated in this paper: The KF is the minimal variance filter (Anderson and Moore, 2012). Instead of strong analog filtering of currents to reduce the noise, but with the inevitable signal distortions (Silberberg and Magleby, 1993), we suggest to apply the KF with higher analyzing frequency on minimally filtered data.

Box 1

Phenomenological difference between an RE approach and our Bayesian filter

Box 1—figure 1
The First-order Markov property of the data requires a recursive prediction of the signal (by the model) and correction (by the data) scheme of the algorithm.

(a) Idealized patch-clamp (PC) data in the absence of instrumental noise for either ten (colored) or an infinite number of channels generating the mean time trace (black). The fluctuations with respect to the mean time trace (black) reveal autocorrelation (b) Conceptual idea of the Kalman Filter (KF): the stochastic evolution of the ensemble signal is predicted and the prediction model updated recursively.

Two major problems for parameter inference for the dynamics of the ion channel ensemble n(t) are: (I) that currents are only low-dimensional observations (e.g. one dimension for patch-clamp or two for cPCF) of a high-dimensional process (dimension being the number of model states) blurred by noise and (II) the fluctuations due to the stochastic gating and binding process cause autocorrelation in the signal. Traditional analyses for macroscopic PC data (and also for related fluorescence data) by the RE approach, e.g. Milescu et al., 2005 ignores the long-lasting autocorrelations of the deviations (Box 1—figure 1a) blue and orange curves from the mean time trace (black) that occur in real data measured from a finite ensemble. To account for the autocorrelation in the signal, an optimal prediction (Box 1—figure 1b) of the future signal distribution (y(t2)) should use the measurement y(t1) from the current time step t1 to update the belief about the underlying n(t1). Based on stochastic modelling of the time evolution of the channel ensemble, it then predicts (y(t2)).

Box 1—figure 2
The residuals between model prediction and data reveal long autocorrelations if the analysis algorithm ignores the first-order Markov property.

(a) Autocorrelation of the residuals rI of two ligand concentrations of currents (blue) and of the fluorescence (red) after the data have been analyzed with the KF approach. (b) autocorrelation of rI after analysing with he RE approach.

To demonstrate the difference how the two algorithms analyze the data, we compute the autocorrelation of the residuals of the data. After the analysis with either the RE approach or the KF, we can construct from the model with the mean predicted signal HE[n(ti)] (see Eq. 4 for the definition of H) and the predicted standard deviation var[y(ti)] the normalized residual time trace of the data which are defined as

(1) r(ti):=y(ti)(HE[n(ti)])var[y(ti)].

Filtering (fitting) with the KF (given the true kinetic scheme) one expects to find a white-noise process for the residuals. Plots of the autocorrelation function of both signal components (Box 1—figure 2a) confirms our expectation. The estimated autocorrelation vanishes after one multiple of the lag time (the interval between sampling points), which means that the residuals are indeed a white-noise process. In contrast, the residuals derived from the RE approach (Box 1—figure 2b) display long lasting periodic autocorrelations.

On the one hand, a complete HMM analysis (forward algorithm) would deliver the most exact likelihood of macroscopic data. On the other hand, the computational complexity of the forward algorithm limits this type of analysis in ensemble patches to no more than a few hundred channels per time trace (Moffatt, 2007). To tame the computational complexity (Jahnke and Huisinga, 2007), we approximate the solution of the CME with a Kalman filter (KF), thereby remaining in a stochastic framework Kalman, 1960. This allows us to explicitly model the time evolution of the first two moments (mean value and covariance matrix) of the probability distribution of the hidden channel states. Notably, for linear (first or pseudo) Gaussian system dynamics, the KF is optimal in producing a minimal prediction error for the mean state. KFs have been used previously in several protein expression studies which also demonstrate the connection of the KF to the linear noise approximation (Komorowski et al., 2009; Finkenstädt et al., 2013; Fearnhead et al., 2014; Folia and Rattray, 2018; Calderazzo et al., 2019; Gopalakrishnan et al., 2011).

Our approach generalizes the work of Moffatt, 2007 by including state-dependent fluctuations such as open-channel noise and Poisson noise in additional fluorescence data. A central technical difficulty which we solved is that due to the state-dependent noise the central Bayesian update equation loses its analytical solution. We derived an approximation which is correct for the first two moments of the probability distributions. Stochastic rather than deterministic modeling is generally preferable for small systems or non-linear dynamics (Van Kampen, 1992; Gillespie and Golightly, 2012). However, even with simulated data of unrealistic high numbers of channels per patch (more than several thousands within one patch), the KF outperforms the deterministic approach in estimating the model parameters. Moffatt, 2007 already demonstrated the advantage of the KF to learn absolute rates from time traces at equilibrium. Like all algorithms that estimate the variance and the mean (Milescu et al., 2005) the KF can infer the number of channels Nch for each time trace, the single-channel current i and analogous in optical recordings the mean number λb of photons from bound ligands per recorded frame. To select models and to identify parameters, stochastic models are formulated within the framework of Bayesian statistics where parameters are assigned uncertainties by treating them as random variables (Hines, 2015; Ball, 2016). In contrast, previous work on ensemble currents combined the KF only with ML estimation (Moffatt, 2007). Difficulties in treating simple stochastic models by ML approaches in combination with the KF (Auger-Méthé et al., 2016), especially with non-observable dynamics, justify the computational burden of Bayesian statistics. Bayesian statistics has an intuitive way to incorporate soft or hard constrains from diverse sources of prior information. Those sources include mathematical prerequisites, other experiments, simulations or theoretical assumptions. They are applied as additional model assumptions by a prior probability distribution over the possible parameter space. Hence, knowledge of the model parameters prior to the experiment are correctly accounted for in the analyzes of the new data. Alternatively, some of these benefits of prior knowledge can be incorporated by penalized maximum likelihood (Salari et al., 2018; Navarro et al., 2018). Bayesian inference provides outmatching tools for modeling over point estimates: First, the Bayesian approach is still applicable in situations where parameters are not identifiable (Hines et al., 2014; Middendorf and Aldrich, 2017b) or posteriors are non-Gaussian, whereas ML fitting ceases to be valid (Calderhead et al., 2013; Watanabe, 2007). Second, a Bayesian approach provides superior model selection tools for singular models such as HMMs or KFs Gelman et al. (2014). Third, Bayesian statistics has a correct uncertainty quantification (Gillespie and Golightly, 2012) based on the data and the prior for the statistical problem. In contrast, ML or maximum posterior approaches lack uncertainty quantification based on one data set (Joshi et al., 2006). Only under optimal conditions their uncertainty quantification becomes equivalent to Bayesian credibility volumes (Jaynes and Kempthorne, 1976). This study focuses on the effects on the posterior due to formulating the likelihood via a KF instead of an RE approach and the benefits of adding a second dimension of observation. We consider the performance of our algorithm against the gold standards in four different aspects: (I) The relative distance of the posterior to the true values, (II) the uncertainty quantification, here in the form of the shape of the posterior, (III) parameter identifiability, and (IV) robustness against typical misspecifications of the likelihood (such as ignoring that currents are filtered or that the integration time of fluorescence data points is finite) of real experimental data.

Results and discussion

Simulation of ligand-gated ion-channel data

Here we treat an exemplary ligand-gated channel with two ligand binding steps and one open-closed isomerization described by an HMM (see Figure 1a). For this model, confocal patch-clamp fluorometry (cPCF) data were simulated: time courses of ligand binding and channel current upon concentration jumps were generated (see Appendix 5 and Materials and methods section). Idealized example data with added white noise are shown in Figure 1b–d. We added realistic instrumental noise to the simulated data (see Appendix 5). A qualitative description of the statistical problem that needs to be addressed when modeling time series data such as the simulated is outlined in Box. 1.

Kinetic scheme and simulated data.

(a) The Markov state model (kinetic scheme) consists of two binding steps and one opening step. The rate matrix K is parametrized by the absolute rates kij, the ratios Kij between on and off rates (i.e. equilibrium constants) and L, the ligand concentration in the solution. The units of the rates are s-1 and μM-1s-1, respectively. The liganded states are C2, C3, O4. The open state O4 conducts a mean single-channel current i=1. Note, that absolute magnitude of the single channel current is irrelevant regarding this study what matters is its relative magnitude compared with σop and σex. Simulations were performed with 10 kHz or 100 kHz (for Figures 11 and 12) sampling, KF analysis frequency fana for cPCF data is in the range of (200-500) Hz while pure current data is analyzed at 2-5 kHz. (b-c) Normalized time traces of simulated relaxation experiments of ligand concentration jumps with Nch=103 channels, λb=0.375 mean photons per bound ligand per frame and single-channel current i=1, open-channel noise with σop2=0.1i2 and an instrumental noise with the variance σm2=i2. The current ycurr and fluorescence yflu time courses are calculated from the same simulation. For visualization, the signals are normalized by the respective median estimates of the KF. The black lines are the theoretical open probabilities Po(t) and the average binding per channel B(t) for Nch of the used model. Typically, we used 10 ligand concentrations which are (0.0625, 0.125, 0.25, 0.5, 1, 2,4, 8, 16, 64) μM. d, Equilibrium binding and open probability as function of the ligand concentration L.

Kalman filter derived from a Bayesian filter

Here and in the Materials and methods section, we derive the mathematical tools to account correctly for the stochastic Markov dynamics of single molecules in the fluctuations of macroscopic signals. The KF is a Bayesian filter (see Materials and methods), that is a continuous state HMM with a multivariate normal transition probability Ghahramani, 1997 (Figure 2a). We define the hidden ensemble state vector

(2) n(t):=(n1(t),n2(t),n3(t),n4(t))=i=1Nchsi(t),

which counts the number of channels in each state s (see Methods). To make use of the KF, we assume the following general form of the dynamic model: The evolution of n(t) is determined by a linear model that is parametrized by the state evolution matrix T

(3) nt+1=Tnt+ωtN(|Tnt,Qt),

where ∼ means sampled from and N(|μ,Σ) is a shorthand for the multivariate normal distribution, with the mean μ and the variance-covariance matrix Σ. The state evolution matrix (transition matrix) is related to the rate matrix K by the matrix exponential T=exp(KΔt). The mean of the hidden state evolves according to the equation E[nt+1|nt]=Tnt. It is perturbed by normally distributed white process noise ω with the following properties: The mean value of the noise fulfills E[ωt]=0 and the variance-covariance matrix of the noise is cov[ωt,ωt]=Q(T,nt) (see Materials and methods Equation 38d, Ball, 2016). In short, Equation 3 defines a Gaussian Markov process.

The first-order hidden Markov structure is explicitly used by the Bayesian filter.

The filter can be seen as continuous state analog of the forward algorithm. (a) Graphical model of the conditional dependencies of the stochastic process. Horizontal black arrows represent the conditional multivariate normal transition probability N(nt+1|Tnt,Qt) of a continuous state Markov process. Notably, it is n(t) which is treated as the Markov state by the KF. The transition matrix T and the time-dependent covariance Qt=Q(T,nt) characterise the single-channel dynamics. The vertical black arrows represent the conditional observation distribution O(yt|nt). The observation distribution summarizes the noise of the experiment, which in the KF is assumed to be multivariate normal. Given a set of model parameters and a data point yt, the Bayesian theorem allows to calculate in the correction step P(nt|yt) (red arrow). The posterior is propagated linearly in time by the model, predicting a state distribution P(nt+1) (orange arrow). The propagated posterior predicts together with the observation distribution the mean and covariance of the next observation. Thus, it creates a multivariate normal likelihood for each data point in the observation space. (b) Observation space trajectories of the predictions and data of the binding per channel vs. open probability for different ligand concentrations. The curves are normalized by the median estimates of λb, i and Nch and the ratio of open-channels ycurrNchi which approximates the open probability Po(t). The black crosses represent the predicted mean signal HE[nt+1], which is calculated by multiplying the observational matrix H with the mean predicted state E[nt+1]. For clarity, we used the mean value of the posterior of the KF. The green and blue trajectories represent the part of the time traces with after the jump to non-zero ligand concentration and after jumping backt to zero ligand concentration in the bulk, respectively.

The observations yt depend linearly on the hidden state nt. The linear map is determined by an observation matrix H.

(4) yt=Hnt+νtO(|Hnt):=N(|Hnt,Σt)

The noise of the measurement setup (Appendix 5 and Equation 43) is modeled as a random perturbation of the mean observation vector. The noise fulfills E[ν]=0 and cov[νt,νt]=Σt. Equation 4 defines the state-conditioned observation distribution O (Figure 2a). If the system strictly obeys Equation 3 and Equation 4 then the KF is optimal in the sense that it is the minimum variance filter of that system Anderson and Moore, 2012. If the distributions of ν and ω are not normal, the KF is still the minimum variance filter in the class of all linear filters but there might be better non-linear filters. In case of colored noise ν and ω the filtering equations (see Materials and methods) can be reformulated by state augmentation or measurement-time-difference approach techniques Chang, 2014. For each element in a sequence of hidden states {nt:0<t<T} and for a fixed set of parameters θ, an algorithm based on a Bayesian filter (Figure 2a), explicitly exploits the conditional dependencies of the assumed Markov process. A Bayesian filter recursively predicts prior distributions for the next nt

(5) P(nt)=P(nt|nt1)P(nt1|yt1)dnt1,

given what is known about nt1 due to yt1. The KF as a special Bayesian filter assumes that the transition probability is multivariate normal according to Equation 3

(6) P(nt)=N(nt|Tnt1,Qt1)P(nt1|yt1)dnt1

Note, that Equation 6 is a central approximation of the KF. While the exact transition distribution of an ensemble of ion channels is the generalized-multinomial distribution (Methods Equation 32), the quality of normal approximations to multinomial Milescu et al., 2005 or generalized-multinomial Moffatt, 2007 distributions depends on the number of ion channels Nch in the patch and on the position of the probability vector in the simplex space. The difference between the log-likelihoods of the true generalized-multinomial dynamics and Equation 6 type approximation scales as 1/Nch Moffatt, 2007. As a rule of thumb one should be careful with both algorithms for time traces with Nch[101,102]. Below or even inside this interval there are more qualified concepts such as the forward algorithm or even particle filters (Golightly and Wilkinson, 2011; Gillespie and Golightly, 2012) which avoid the normal approximation.

Each prediction of nt (Equation 6) is followed by a correction step,

(7) P(nt|yt)=O(yt|nt)P(nt)O(yt|nt)P(nt)dnt,

that allows to incorporate the current data point into the estimate, based on the Bayesian theorem (Chen, 2003). Additionally, the KF assumes (Anderson and Moore, 2012; Moffatt, 2007) a multivariate normal observation distribution

(8) P(nt|yt)=N(yt|Hnt,Σt)P(nt)N(yt|Hnt,Σt)P(nt)dnt,

If the initial prior distribution is multivariate normal then due to the mathematical properties of the normal distributions the prior and posterior () in Equation 8 become multivariate normal Chen, 2003 for each time step. In this case, one can derive algebraic equations for the prediction (Materials and methods Equation 37, Equation 38d) and correction (Materials and methods Equation 58 and Equation 58) of the mean and covariance. The algebraic equations originate from the fact that a normal prior is the conjugated prior for the mean value of a normal likelihood. Due to the recursiveness of its equations, the KF has a time complexity that is linear in the number of data points, allowing a fast algorithm. The denominator of Equation 8 is the normal distributed marginal likelihood L(yt|Yt1,θ) for each data point, which constructs by

(9) L(YT|θ)=t=2NTL(yt|Yt1,θ)=t=2NTO(yt|nt)P(nt|Yt1,θ)dnt=t=2NTN(yt|HE[nt],HPtH+Σt),

a product marginal likelihood of normal distributions of the whole time trace YT={y1,,yNT} of length NT for the KF. For the derivation of Pt and Σt see Materials and methods (Equation 38d) and Equation 43. Pt is the covariance of the prior distribution over n(t) before the KF took y(t) into account. The likelihood for the data allows to ascribe a probability to the parameters θ, given the observed data (Methods Equation 20). An illustration for the operation of the KF on the observation space is given in Figure 2b. The predicted mean signal HE[n(t)] corresponds to binding degree B(t)=HE[n(t)]1Nch and open probability PO(t)=HE[n(t)]2Nch. These values are plotted as vector trajectories.

The standard KF (Moffatt, 2007; Anderson and Moore, 2012; Chen, 2003) has additive constant noise Σt=const in the observation model. Thus, in this case a constant variance term Σ is added, in Equation 9 to the aleatory variance HPtH which, as mentioned above, originates (Equation 38d) from the the fact that we do not know the true system state n(t). For signals with Poisson-distributed photon counting or open-channel noise, we need to generalize the noise model to account for additional white-noise fluctuations with n(t)-dependent variance. For instance, in single-channel currents additional noise is often observed whose variance is referred to by σop2. In macroscopic currents this additional noise can be modeled by a term σop2n4(t), causing state-dependency of our noise model.

(10) y(t)=Hn(t)+νm(t)+νop(t)yO(y|n)=N(y|Hn(t),σm2+n4(t)σop2)=N(y|Hn(t),Σt)

The second noise term νop is defined in terms of the first two moments E(νop)=0 and var(νop)=E(νop2)=σop2n4(t). To the best of our knowledge such a state-dependent noise makes the integration of the denominator of Equation 8 (which is also the incremental likelihood) intractable

(11a) P(y(t))=N(y|Hn,σm2+n4σop2)N(n|n¯(t),P(t))dn
(11b) =1constexp((yHn)22(σm2+n4σop2))exp(12(nn¯(t))P1(nn¯(t)))dn

This is because the state distribution N(n|n¯(t),P(t)) as the prior also influences the variance parameter of the likelihood which means that the conjugacy property is lost. While a normal distribution is the conjugated prior of the mean of a normal likelihood, it is not the conjugated prior for the variance. However, by applying the theorem of total variance decomposition Equation 46a we deduce a normal approximation to Equation 8 and to the related problem of Poisson-distributed noise in fluorescence Equation 57, Equation 55a data. By computing the mean and the variance or covariance matrix of the signal, we can reformulate the noise model to fit the form of the traditional KF framework. Note, that the derived equations for the covariance matrix are still exact for the more general noise model. Mean and covariance just do not form a set of sufficient statistics anymore.

Our derivation is not limited to ligand-gated ion channels. For example, when investigating voltage-gated channels, the corresponding noise model can be easily adapted. This holds also when using the P/n protocol for which the noise model resembles that of the additional variance in the fluorescence signal. The additional variance is induced because the mean signal from the ligands swimming in the bulk (Materials amd methods Equation 43 Appendix 5) is eliminated by subtracting scaled mean reference signal which itself has an error. This manipulation adds additional variance to the resulting signal comparable to P/n protocol. Other experimental challenges, as for example series resistance compensation promoting oscillatory behavior of the amplifier, deserve certainly advanced treatment. Nevertheless, for voltage-clamp experiments with a rate equation approach it also becomes clear (Lei et al., 2020) that modeling of the actual experimental limitations, including series resistance, membrane and pipette capacitance, voltage offsets, imperfect compensations by the amplifier, and leak currents are necessary for consistent kinetic scheme inference.

The Bayesian posterior distribution

(12) P(θ|YT)=L(YT|θ)P(θ)L(YT|θ)P(θ)dθ

encodes all information from model assumptions and experimental data used during model training (see Materials and methods). A full Bayesian inference is usually not an optimization (finding the global maximum or mode of the posterior or likelihood) but calculates all sorts of quantities derived from the posterior distribution such as mean values of any function f including the mean value or covariance matrix of the parameters themselves or even the likelihood of the data.

(13) E[f]=f(θ)P(θ|YT)dθ

Besides the covariance matrix of the parameter to express parameter uncertainty, the posterior allows to calculate a credibility volume. The smallest volume VP that encloses a probability mass P of

(14) P=VPP(θ|YT)dθ.

is called the Highest Density Credibility Volume/Interval (HDCV/HDCI). Those credibility volumes should not be confused with confidence volumes although under certain conditions they can become equivalent. Given that our model sufficiently captures the true process, the true values θtrue will be inside that volume with a probability P. Unfortunately, typically there is no analytical solution to Equation 12 . However, it can be solved numerically with Monte Carlo techniques, enabling to calculate all quantities related to Equation 13 and Equation 14 . Our algorithm uses automatic differentiation of the statistical model to sample from the posterior (Appendix 1—figure 1a) via Hamiltonian Monte Carlo (HMC) (Betancourt, 2017), see Appendix 7 , as provided by the Stan software (Hoffman and Gelman, 2014; Gelman et al., 2015).

Benchmark for PC data against the gold standard algorithms

We compare the posterior distribution (Figure 3) of our algorithm against Bayesian versions of the deterministic (Milescu et al., 2005) and stochastic (Moffatt, 2007) algorithms, which we consider as the gold standard algorithms for macroscopic patch-clamp data. Simulated currents of a patch with Nch=5103 are shown in (Figure 3d). The resulting posteriors (Figure 3a) show that both former algorithms are further away from the true parameter values with their maxima or mean values (Figure 3a). E.g., the relative error of the maximum of the posterior are Δk21200% for Milescu et al., 2005 and Δk32240% for Moffatt, 2007 . The four other parameters including the three equilibrium constants behave less problematic as judged by their relative error. Additionally, if one does not only judge the performance by the relative distance of maximum (or some other significant point) of the posterior but considers the spread of the posterior as well, it becomes apparent, that the marginal posterior of both former algorithms fail to cover the true values within at least the reasonable parts of their tails. Accordingly, for maximum likelihood inferences the true value would be far outside the estimated confidence interval. For the RE approach only the marginal posterior of K~21 is nicely centered over the true values and the marginal of K~32 could be considered to cover within a reasonable part of the distribution the true value. Uncertainty quantification is investigated in more detail further down (Figures 49). Note that in Figure 3a, parameter unidentifiability by heavy tails/ multiple maxima of the posterior distribution or (anti-) correlation is easily visible as non axial symmetric patterns.

The Bayesian filter overcomes the sensitivity to varying (open-channel) noise of the classic Kalman filter and does not show the overconfidence of the RE-approach.

Overall it shows the highest accuracy and the posterior covers the true values. The classical deterministic RE (blue), 2007 Kalman filter (red) and our Bayesian filter (green) are implemented as a full Bayesian version and the obtained posterior distributions are compared. For all PC data sets in the figure the analysing frequencies fana ranges within 2-5. (a) Posterior of the parameters for the 3 algorithms for the data set displayed in panel d. The blue crosses indicate the true values. All samples are normalized by their true values which is indicated by the ∼ above the parameters. For clarity, we only show a fraction of the samples of the posterior for blue and red. b, Effect of open channel noise: The Euclidean error for all three approaches is plotted vs. σop/i (low axis).The upper axis displays the ratio of the ‘typical’ standard deviation of the open channel excess noise of the ensemble of channels σopN0.5Po,max to the standard deviation of instrumental noise. c, Influence of patch size: Scaling of the Euclidean error vs. Nch follows (Nch)0.5 indicated by the dashed lines for Nch>2103 for the RE and the Bayesian filter approach. The data indicates a constant error ratio (orange) for large Nch. For Nch<2103 samples of the posteriors for many data sets suggest an improper posterior. An instrumental noise of σex/i=1 and σop/i=0.01 was used. (d) The time traces on which the posteriors of panel a are based (for the ligand concentrations see Figure 1). Panel b used the same data too, but σ and σop were varied.

Figure 3—source data 1

The data folder includes all 15 sets of time traces.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig3-data1-v2.zip
For multidimensional data (cPCF) the RE approach almost approaches the accuracy (Euclidean error) of the Bayesian Filter.

However, only the Bayesian filter covers the true value in a reasonable HDCV while RE based posteriors are too narrow. All samples are normalized by their true values which is indicated by the ∼ above the parameters. (a) Euclidean errors of the maximum for the rate kij and equilibrium constants Kij obtained by the KF (green) and from the REs (blue) are plotted against Nch for σex/i=0.5, σop/i=0.05 and λb=5. Both algorithms scale like 1/Nch (dashed lines) for larger Nch which is the expected scaling For smaller Nch<500 (gray range) the error is roughly the same indicating that limitations of the normal approximation to the multinomial distribution dominate the overall error in this regime. The combination of fluorescence and current data(cPCF) decreases the eucleadian error for both approaches compared to current data alone(PC). (b), HDCI and the mode of the 3 kij and 3 Kij plotted vs. Nch revealing that the maximum is a consistent estimator (converges in distribution to the true value with increasing data quality). While the KF (green) 0.95-HDCI includes usually the true value, the RE HDCI (blue) is too narrow and, thus, the real values are frequently not included. (c) Bayesian estimation of true success probability for the event that all 6 0.95-HDCI include the respective true values at the same time by a binomial likelihood. Since the data sets have different Nch and the model approximations become better with increasing Nch, we use a cut-off for Nch=200. d, Comparison of 1-D and combinations of 2-D marginal posteriors of the parameters of interest for both algorithms calculated from a Nch=103 simulation. Blue lines indicate the true value. We depict that in two dimensions the disproportion of the deviation of the mode and the spread of RE (blue) approach is worsened while KF (green) posterior includes the true values with more reasonable probability mass.

Figure 4—source data 1

6µMol.

Each of the source data folders contains for a specific ligand concentration the time traces of cPCF data for all Nch.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data1-v2.zip
Figure 4—source data 2

64µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data2-v2.zip
Figure 4—source data 3

32µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data3-v2.zip
Figure 4—source data 4

8µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data4-v2.zip
Figure 4—source data 5

4µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data5-v2.zip
Figure 4—source data 6

1µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data6-v2.zip
Figure 4—source data 7

2µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data7-v2.zip
Figure 4—source data 8

025µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data8-v2.zip
Figure 4—source data 9

05µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data9-v2.zip
Figure 4—source data 10

00625µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data10-v2.zip
Figure 4—source data 11

003125µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data11-v2.zip
Figure 4—source data 12

0125µMol.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig4-data12-v2.zip
The HDCV of the posterior of the KF follows the requested binomial statistics while the HDCVs of the RE approach are too narrow.

(a) Cumulative χ-square distribution vs. the Mahalanobis distance dMah. The y axis denotes the probability mass which is counted by moving away from the maximum before an ellipsoid with distance dMah is reached. The different colours represent the changes of the cdf with an increasing number of rate parameters. The blue cdf at dMah=1 represents how much probability mass can be found from σσnormal(θ,0,σ)dθ, see inset. In one dimension, we can expect to find the true value within 2σ around the mean with the usual probability of P=0.682 for univariate normally distributed random variables. The six parameters (brown) of the full rate matrix will almost certainly be beyond dMah=1.0. The higher the dimensions of the space the less important becomes the maximum of the probability density distribution for the typical set which is by definition the region where the probability mass resides. The mathematical reason for this is that the probability mass P=V(θ)dV is the integrated product of volume and probability density. b, The two methods to count volume in units of probability mass for the KF (green) and the RE (blue). The gray area indicates which data sets are considered a success if one chooses to evaluate a proababilty mass of 0.4 of each posterior around its mode. All data sets in the white area are considered a failure. For the optimistic noise assumptions σex=0.5i, σop=0.05i and a mean photon count per bound ligand per frame λb=5 the RE approach (blue) distributes the probability mass such that the HDCV never includes the true rate matrix. From Nch>100 both HDCV estimates of the KF posterior (green curves) include the true value within a reasonable volume and show a similar behaviour. c, Binomial success statistics of HDCV to cover the true value vs. the expected probability constructed from the data of (b). Calculated for i=0.25σ and σop=0.025i and λb=5 and minimal background noise.

Even for the highest tested experimental noise the RE approach does not follow the required binomial statistics, generating an underestimated uncertainty.

(a) Binomial success statistics of HDCV to cover the true value vs. the expected probability. Calculated for i=σ and σop=0.1i and λb=0.375 and a strong background noise. (b) Binomial success statistics of HDCV to cover the true value vs. the expected probability. For 10i=σ and σop=1i and λb=0.375 and a strong background noise. For both algorithms, the adaptation of the sampler of the posterior was more fragile for small Nch, leading to differences in the posterior if the posterior is constructed from different independent sampling chains. Those data sets were then excluded. We assume that these instabilities are induced in both algorithms by the shortcomings of the multivariate normal assumptions. (c) Comparison of the Euclidean error vs. Nch for the pessimistic noise case (solid lines) with Euclidean error for the optimistic noise case (dotted lines).

Higher model complexity drastically increases the minimal requirements of the data.

With PC data the RE approach is frequently incapable to identify all parameters while the Bayesian filter is more robust. cPCF data alleviate the parameter unidentifiabilities for patch sizes for which PC data are insufficient. Each panel column corresponds to a particular true process with increasing complexity from left to right, as indicated by the model schemes on top. Within all kinetic schemes, each transition to the right adds one bound ligand. Each transition to left is an unbinding step. Vertical transitions are either conformational or opening transitions. Plots in each row share the same y-axis respectively. Each column shares the same abscissa. (a-c) Error ratio for PC data (blue) and cPCF data (red). The dashed lines indicate the mean error ratio under the simplifying assumption that the error ratio does not depend on Nch. The vertical bars are the standard deviations of the mean values. Theses values were calculated from the Euclidean errors shown in Figures 3c and 4a for a, and panels (d-e), for (b-c), respectively. Results from the KF algorithm (green) and the RE algorithm (blue) are compared for PC (lighter shades) and cPCF (strong lines). The diagonal gray areas indicate a (Nch)-0.5 proportionality. For simulating the underlying PC data, we used standard deviations of σop=0.1 and σ=1 and for the cPCF data additionally a ligand of brightness λb=5. To facilitate the inference for the two more complex models, we assumed that the experimental noise and the single channel current are well characterized, meaning iN(i|1,0.01), σN(σ|1,0.01) and σopgamma(σop|1,100). In the models containing loops (last 2 columns), a prior was used to enforce microscopic-reversibility and set to k25beta(100,100) multiplied by k1=k5k6k7k8(k2k3k4)-10.995+0.01k1.

Figure 7—source data 1

The folder of the five-state model includes 15 sets of time traces in the interval Nch[103,5105].

Each of them has 10 ligand concentrations. The number in the file name reports the amount of ion channels in the patch.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig7-data1-v2.zip
Figure 7—source data 2

The folder of the six-state model includes 14 sets of time traces in the interval Nch[103,106].

Each of them has 10 ligand concentrations. The number in the file name reports the amount of ion channels in the patch.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig7-data2-v2.zip
Revisiting the PC data obtained by the 4-state-1-open-state model shows that the KF succeeds to produce realistic uncertainty quantification, while the overconfidence problem (unreliable uncertainty quantification) of the RE approach remains.

Comparison of a series of HDCIs shown as functions of Nch for each parameter of the rate matrix obtained by the KF (green) and the RE algorithm (blue). The differing shades of green and blue indicate the set of (0.95,0.6,0.2,0.1)-HDCIs. Only the interval Nch>2103 in which all parameters are identified is displayed. The data are taken from the KF vs. RE benchmark of Figures 3c and 7a . The first row corresponds to three rates kij the second row to the equilibrium constants Kij. All parameters are normalized by their true value. The insets show the error ratios of the respective single parameter estimates. Note that the error ratios on the single-parameter level can be even of the order of magnitude of 102. Thus, they can be much larger than the error ratios calculated from the Euclidean error if the errors of the respective parameters are small compared to other error terms in the Euclidean error Equation 15 .The lowest Euclidean error for this kinetic scheme has cPCF data analyzed with the KF. (Figure 7d). A 6-state-1-open-states model with cPCF data has again an error ratio of the the usual scale (Figure 7c). As expected, the Euclidean error continuously increases with model complexity (Figure 7d and e). For PC data of the 6-state-1-open-states model even the likelihood of the KF is that weak (Figure 7e) that it delivers unidentified parameters even for Nch=104 and we can detect heavy tailed distributions up until Nch=105. Using RE on PC data alone does not lead to parameter identification, thus no error ratio can be calculated.

The HDCIs for PC data for a 5-state-2-open-states model show negligible bias for the KF with the true value being included.

In contrast, the HDCE for RE approach frequently does not include the true value and in general appears biased and frequently leaves certain parameters unidentified. Comparison of a series of (0.95,0.6,0.2,0.1)-HDCIs as functions of Nch for each parameter of the rate matrix obtained by the KF (green) and the RE algorithm (blue). The HDCIs correspond to the PC data displayed in Figure 7b and d . The first row corresponds to three rates kij the second row to the equilibrium constants Kij. All parameters are normalized by their true value. K~25 is because of the microscopic-reversibility prior a parameter which is strongly dependent on the other rates and ratios. Refer to the caption of Figure 7 for details about the prior that enforces microscopic-reversibility. Thus, the deviations of K~25 are inherited from the other parameters. The rate k~54 is frequently not identified by the RE approach and only the limits of the sampling box confines he posterior.

To assess the location of the posterior conditioned on Nch, we select the median vector θ of the marginal posteriors and calculate its Euclidean distance to the true values by:

(15) Euclidean Error=i[θi/θi,true-1]2

This defines a single value to judge the overall accuracy of the posterior. Varying σop/i reveals the range of the validity (Figure 3b) of the algorithm (red) from Moffatt, 2007 . While both stochastic approaches are nearly equivalent for low open-channel noise, the RE (blue) performs consistently poorer. It may seem surprising that even for In fact, the KF method beha the two stochastic algorithms start to produce different results. But considering the scaling (Materials and methods Equation 46a) of the total open-channel noise (top axis) from currents of an ensemble patch (NchPopen,max0.5)0.5σopen one sees that if (NchPopen,max0.5)0.5σopen approaches σ the traditional KF suffers from ignoring state dependent noise contributions. The lower scale changes with experiments (e.g. Nch and σop). In contrast, the upper scale is largely independent of the particular measurements. The two different normalizations indicate an experimental intuition: “ Why should I consider the extra noise from the open state of the single channel if only σop/i=σop/σ0.01” is misleading. The small advantage of our algorithm for small σop/i over Moffatt, 2007 is due to the fact that we could apply an informative prior in the formulation of the inference problem on σexpnormal(σexp,true2,σexp,true20.01) by taking advantage of our generalization (Equation 46a) Bayesian filter. Further, Figure 3b indicates the importance that the functional form of the likelihood is flexible enough to capture the second order statistics of the noise of the data sufficiently.

For an increasing data quality, which in our benchmark is an increasing Nch per trace, we show (Figure 3c) that the deterministic RE and our Bayesian filter are consistent estimators, that is they converge in distribution to the true parameter values with their posterior maxima or median for increasing data quality. The scaling of the RE approach (blue) and our Bayesian filter (green) vs. Nch shows that for large Nch both algorithms seem to have a constant error ratio relative to each other. They are both well described by error(Nch)a/Nch with an error ratio computed from the fit of 4.4. Thus, although our statistical model is singular (meaning that the fisher information matrix is singular Watanabe, 2007), its asymptotic learning behaviour is similar to a regular model (Figure 4c) which, however, means that the euclidean error from both algorithms stays different also for large Nch. For data with Nch<2103 the samples from the posterior typically indicate that the posterior is improper which is defined as

(16) (θ|y)dθ=

We consider this as the case of unidentified parameters. This data-driven definition is in so far different from structural and practical identifiability definitions (Middendorf and Aldrich, 2017a; Middendorf and Aldrich, 2017b) as the two latter cases are not distinguished. Still the practical consequence of structural or practical unidentifiability, which is usually an improper posterior, is captured. Cases of structural or practical unidentifiability which lead to a confined region of constant posterior density will be considered identified as the posterior is still normalizable thus the uncertainty quantification will still be correct, even when this finding is not sufficient to answer the research question at hand.

Benchmarking for cPCF data against the gold standard algorithm

For the simulated time traces with an optimistically high signal-to-noise assumption, the posterior of the KF (from hereon KF denotes our Bayesian Filter) and a RE (Milescu et al., 2005) approach are compared for cPCF data (Figure 4a–d). For a brief introduction of the RE approach, see Appendix 8 . The failure to analyze PC data with moderate open-channel noise (Moffatt, 2007; Figure 3a) disqualifies the classical KF with its constant noise variance also as a useful algorithm for fluorescence data, because here the Poisson distribution of the signal generates an even stronger state dependency of the signal variance.

By “high signal-to-noise assumption” , we refer to an experimental situation with a standard deviation of the current recordings σex/i=0.5, a low additional σop/i=0.05, and a high mean photon rate per bound ligand and frame λb=5. Additionally, we assume vanishing fluorescence background noise generated by the ligands in the bulk. The benefit of the high signal-to-noise is that the limitations of the two different approximations to the stochastic process of binding and gating can be investigated without running into the risk of being compensated or obscured by the noise from the experimental setup. For these experimental settings (Figure 4a), we calculate the Euclidean distance of the median (Equation 15) for different Nch. For Nch<500 (gray shaded area in Figure 4a), the Euclidean error of both algorithms is roughly the same. On the single parameter level (Figure 4b), this can be seen as an onset of correlated deviations from the true value for both algorithms. Each marginal posterior has for each Nch a similar deviation in magnitude and direction. That is in particular true for k~32 and K~32 which dominate Equation 15 . In spite of the correlation in direction of the errors of k~21 and K~21 their magnitude is still smaller for the KF. In summary, this indicates that in this regime the approximations to the involved multinomial distributions fail in a similar manner for both algorithms. That implies that treating the autocorrelation of the gating and binding becomes similar important compared to the error induced by normal approximations (which are used by the KF and the RE approach). For larger Nch, the Euclidean error of the RE is on average 1.6 times larger than the corresponding error of the posterior mode of the KF, which we deduce by fitting the function error(Nch)=aNch. On the one hand, both algorithms are better in approaching the true values than with patch-clamp data alone. On the other hand, the smaller error ratio means, that adding a second observable constrains the posterior, such that much of the overfitting is prevented for the RE approach. By overfitting, we define the adaptation of any inference algorithm to the specific details of the used data set due to experimental and intrinsic noise which is aggravated if too complex kinetic schemes are used. Similarly, (Milescu et al., 2005) showed that the over fitting tendency of the RE can be reduced if the autocorrelation of the data is eliminated. The dotted green curve derives from PC data. The Euclidean error is roughly an order of magnitude larger for Nch>2000. Thus, in this regime the cPCF data set is equivalent to 102 fold more time traces or 102 more Nch in a similar PC data set. For Nch<2000 only cPCF establishes parameter identifiability (given a data set of 10 ligand concentrations and no other prior information). In Figure 4b(1-6), we demonstrate the 0.95-HDCI (Equation 14) of all parameters and their modes vs. Nch. Even though the Bayesian filter and the RE approach are both consistent estimators, the RE approach covers the true values with its 0.95-HDCI only occasionally. The modeling assumption of the RE approach of treating each data point as if it does not come from a Markov process but from an individual draw from a multinomial distribution with deterministically evolving mean and variance makes the parameter estimates overly confident (Figure 4b(1-6)) . A likely explanation can be found by analyzing the extreme case where data points are sampled at high frequency relative to the time scales of the channel dynamics. The RE approach treats each data point as a new draw from Equation 67 while in reality the ion channel ensemble had no time to evolve into a new state. In contrast, the KF updates its information about the ensemble state after incorporating the current data point and then predicts from this updated information the generalised multinomial distribution of the next data point. For Nch>200, the marginal posterior of the KF usually contains the true value. Nevertheless, one might depict a bias in both algorithms, in particular (Figure 4b 2,4) for k~32 and K~32 for Nch<2103, similar to the findings of Moffatt, 2007 . A proper investigation of bias can be found in Figure 11 and 12 and in the Appendix. Notably, with the more realistic higher experimental noise level, in those tests the bias is hardly or not all detectable (consider the unfiltered or infinitely fast integrated data). A plausible explanation is that the bias only occurs (Figure 4 2,4) because the data are that perfect that the discrete nature of the ensemble dynamics is almost visually detectable, thus deviating from to the modeling assumption of multi-variate normal distributions.

To investigate the six one-dimensional 0.95-HDCIs simultaneously, we declare the analysis of a data set as successful if all 0.95-HDCIs include the true values. Otherwise we define it as a failure. This enables to determine the true probability at which the probability mass of the KF and the RE approach covers the true values in a binomial setting. The left blue vertical line in Figure 4c indicates p=0.9560.735 which is the lower limit and which would be the true success probability for an ideal model whose six 0.95-HDCIs are drawn from ybinomial(0.95,6). This is the probability of getting 6 successes in 6 trials. The right blue vertical line equals p=0.95, signifying the upper limit obtained by treating the six 0.95- HDCIs as being drawn from ybinomial(0.95,1) each, which is a rather loose approximation. All marginal distributions are computed from the same high-dimensional posterior which is formed by one data set for each trial. Thus, the six 0.95- HDCIs ybinomial(0.95,1) must have success rates between those two extremes if the algorithm creates an accurate posterior. We next combine the binomial likelihood with the conjugated beta prior (Hines et al., 2014) for mathematical convenience. On this occasion, for the sake of the argument, beta(1,1) seems sufficient. A beta(1,1) prior is a uniform prior on the open interval (0,1). The estimated true success rate of the RE approach (blue) is 0.15 and therefore far away from the success probability an algorithm should have when it is based on an exact likelihood of the data. In contrast, the posterior (green) of the true success probability of the KF resides with a large probability mass between the lower and upper limit of the success probability of an optimal algorithm (given the correct kinetic scheme). As both algorithms use the same prior distribution, the different performance is not induced by the prior.

Exploiting six one-dimensional posterior distributions does not necessarily answer whether the posterior is accurate in 6 dimensions but we can refine the used binomial setting. In Figure 4d (k~32,K~43), we see that 2-D marginal distributions can, due to their additional degree of freedom, twist around the true value without covering it with HDCV (Equation 14) of reasonable size while simultaneously the two 1–D marginal distribution do cover it with a reasonable HDCI. In general, the KF posterior distribution has its mode much closer to the true value for various parameter combinations and it seems that the posterior is approximately multivariate normal. Further, we recognize that the probability mass of the reasonably sized HDCV of the KF posterior includes the true values whereas the HDCV from the RE does not. In 6 dimensions we lack visual representations of the posterior. Since we showed that both algorithms are consistent for a given identifiable model, we are looking for a way to ask whether the posterior is accurate (has the posterior distribution the right shape). We can answer that question by asking, how much probability mass around the mode (or around multiple modes) needs to be counted to construct a HDCV Equation 14 which includes the true values. Then we can ask for Nset data sets how often did we find the true values inside a volume V(P) of a specific probability mass P of the posterior distribution

(17) successbinomial(Nset,P(V)).

An algorithm which estimates the parameters of the true process should fulfill this property simultaneously to being consistent. Otherwise credibility volumes or confidence volumes are meaningless. Noteworthy, that this is a empirical test of how sufficient the Bayesian filter and the RE approach hold frequentist coverage property of their HDCVs (Rubin and Schenker, 1986). We explain (Appendix 8) in detail how to quantify the overall shape and n-dimensional posterior and comment on its geometrical meaning. One way is to use an analytical approximation via the cumulative Chi-squared distribution (Figure 5a and b), The other way is to count the probability mass of n-dimensional histogram bins starting with the highest value until the first bin includes the true values (Figure 5b).

Knowing how much volume/probability mass is needed to include the true rate matrix allows us to test whether all HDCVs constructed from the two probability distributions match the binomial distributions of the ideal model. For each data set and for each HDCV of a fixed probability mass, there are two possible outcomes: The true rate matrix is inside or outside of that volume. For a chosen HDCV with a fixed probability volume, as indicated by a gray space in Figure 5b , we count how many times the true matrix is included in the volume of that probability mass for each trail in a fixed amount of trials. Since the success is binomially distributed, we plot the expected mean of a perfect model E[y]=NtrialsPtrue and binomial quantiles and compare them with the success rate found in our test runs (Figure 5c) for both algorithms with both methods to determine the posterior shape. The posterior of the KF distributes the probability mass in a consistent manner such that each volume includes the true rate matrix within the quantile range. In contrast, the RE approach fails for all data sets for all HDCVs (from 0 – 0.95 probability mass) and does not include the true values in one single case. Note, that all the binomial trials for each HDCV are made from the same set of data sets which explains the correlated deviation from the mean. For lower but realistic signal to noise ratios, where the fit quality decreases, for example by producing larger errors/wider posterior distributions (Figure 6a), the statistics of the HDCV from the RE approach improve but are still outperformed by the KF. In particular, in our tested case of realistic experimental noise we never find the true values within a 0.65-HDCV if the data are analyzed with a RE approach. Even for the highest noise level (Figure 6b), the probability mass of the KF posterior needed to include the true rate matrix remains almost always smaller then the posterior mass of the RE approach. That means that the posterior mass of the KF is much closer to the true value distributed than the posterior mass of the RE. With the KF we find the true rate matrix for one data set in small volume P<0.05 around the mode. To achieve the same with the RE approach we need at least a probability mass of 0.3.

In the inset of Figure 6a and b we do not observe a trend, thus no indication that the RE approach has a better performance for large values Nch in this regard. This challenges the common argument that the RE approach should be equivalent to the KF for large Nch because the ratio of mean signal vs. the intrinsic binding and gating noise is so large. Thus, including the autocorrelation into the analysis is important even for unrealistic large Nch. One possible explanation is model a signal-to-noise ratio which scales Nch. From the multinomial distribution both algorithms inherit mean signals which scale Nch and variances which scale in the terms dominating for large Nch similarly with Nch. Thus, identical to the real signal, both algorithms model the scaling of the signal-to-noise ratio N. It is plausible, that both algorithms remain sensitive for the occurrence of autocorrelation of the noise even for largest signal-to-noise ratios. In Figure 5c we compare the Euclidean error of the pessimistic high white noise case with an over-optimistic low noise case. We see, that when increasing Nch there is a regime where the Euclidean error increases faster than Nch-1 which we indicate with a coarse approximate fit Nch-1. In that regime two effects happen simultaneously. First, the mean and the intrinsic fluctuations of the signal become more and more dominant over the experimental noise. Second, the standard deviation of intrinsic fluctuations becomes smaller relative to the mean signal. We speculate, that this produces together a learning rate which is faster than the usual asymptotic learning rate Nch-1 of a regular model but relaxes asymptotically towards Nch-1.

Statistical properties of both algorithms for more complex models

We have seen in Figure 3c and Figure 4a that the RE and the KF algorithm are consistent estimators, while their error ratio (Figure 7a) seems to have no trend to approach 1 with increasing Nch. Adding a second observable increases parameter accuracy and adds identifiability for both algorithms since less aspects of the dynamics need to be statistically inferred (Figure 4a). Furthermore, the second observable takes away much of the tendency (compare Figure 4b 1 – 6 with 8) of the RE approach to overinterpret (overfit) which leads to a shrinking of the error ratio 5.6±1.4 for PC data to smaller values for cPCF data (Figure 7a) (red) which are on average still bigger than one, while the Euclidean error is reduced (Figure 4a). If we then keep the amount and quality of the PC/cPCF data but increase the complexity of the model which produced the data (Figure 7b and d) from a four-state to a five-state model (see kinetic schemes above Figure 7a–c), we see that for cPCF data the error ratio stays roughly the same (difference between Figure 7a and b). For PC data instead both algorithms deliver an unidentified k21 for Nch2103 (defined as an improper posterior). For larger Nch the KF always identifies all parameters while the RE fails at Nch{7000,2000,75000} to identify k54. Thus, the KF reduces the risk of unidentified parameters. To calculate the mean error ratio, we exclude the values were some of the parameters are unidentified in total that still amounts to 6.8±2.7 thus the advantage of the KF (given all parameters are identified) might increase with model complexity for PC data. The lowest Euclidean error for this kinetic scheme has cPCF data analyzed with the KF. (Figure 7d). A 6-state-1-open-states model with cPCF data has again an error ratio of the the usual scale (Figure 7c). As expected, the Euclidean error continuously increases with model complexity (Figure 7d and e). For PC data of the 6-state-1-open-states model even the likelihood of the KF is that weak (Figure 7e) that it delivers unidentified parameters even for and we can detect heavy tailed distributions up until . Using RE on PC data alone does not lead to parameter identification, thus no error ratio can be calculated.

Consistent with our findings, fluorescence data itself, should lower the advantage of the KF compared to PC data simply by signal-to-noise arguments. The stochastic aspect of the ligand binding is usually more dominated by the noise of Photon counting and background noise than the stochastic gating is dominated in current data by experimental noise. In terms of uncertainty quantification the advantage of the KF with cPCF varies with the model complexity (see, Appendix 9).

Besides analyzing what causes the changes in the Euclidean error (Figure 7a and b) at the single parameter, we now investigate whether the posterior is a proper representation of uncertainty. Thus, we look back at the HDCIs. The HDCIs of the 4-state-1-open-state (Figure 8) of the PC data from Figure 3 reveal an exacerbated over-confidence problem of the RE approach (blue) compared to cPCF-data (Figures 4b16). This, underlines our conclusion of Figures 5 and 6 that the Bayesian posterior sampled by the RE approach is misshaped. As a consequence a confidence volume derived from the curvature at the ML estimate of the RE algorithm understates parameter uncertainty. A possible way for ML methods to derive correct uncertainty quantification is by using bootstrapping data methods (Joshi et al., 2006). Furthermore, the error ratios of each single parameter from its true value in the last column k~43K~43 strongly increased their magnitudes (insets Figure 8). Even error ratios of 5102 are possible. Note, that the way we defined Equation 15 suppresses the influence of the smaller parameter errors in the overall error ratio. Thus the advantage (error ratio) of the KF over RE approach for a single parameter can be much larger or lower compared to the error ratio derived from the Euclidean error if the respective parameter is contributing less to the Euclidean error. The posterior of the KF (green) seems to be unbiased after the transition into the regime Nch>2103 where all parameters are identified. Similarly, for the RE algorithm there is no obvious bias in the inference. If we use the RE algorithm and change from the four-state to the five-state model (PC data from Figure 7b), bias occurs (Figure 9) in many inferred parameters, even for the highest Nch investigated. Milescu et al., 2005 showed that one or the reason of the biased inference of the RE approach is its ignorance of autocorrelation of the intrinsic noise. We add here that the bias problem clearly aggravates with an increased model complexity. It is even present in unrealistically large patches which in principle could be generated by summing up 102 time traces with Nch=103. In contrast, the KF algorithm reveals that its parameter inference is either unbiased or at least much less biased in the displayed Nch regime. Furthermore, for both algorithms the position of the HDCI relative to the true value is for some parameters highly correlated, which corresponds to the correlation between optima of the ML method of Milescu et al., 2005 ; Moffatt, 2007.

As a side note, unbiased parameter estimates are a highly desirable feature of an inference algorithm. For example, with a bias in the inference, repeated experiments do not lead to the true value if the arithmetic mean of the parameter inferences is taken. With bias even bootstrapping methods fail to produce reliable uncertainty quantification. Due to the variation of the data the k54 parameter is either identified in some neighbourhood of the true value or complete unidentified (Figure 9), if the RE algorithm is used. The unidentified k54 occurs even at high-quality data such as Nch=7.5104. Only because of the nonphysical prior (Figure 9) of k54 induced by the limits of the sampling box of the sampling algorithm the posterior appears to be proper but is in fact either unidentified or or more than two orders of magnitude away from the true value. For the same data using the KF did not result in any unidentified parameters. Note, that comparable inference pathologies such as multimodal distributions of inferred parameter were also reported for the maximum likelihood RE algorithm for low quality PC data or too simple stimulation protocols (Milescu et al., 2005).

In conclusion, the two different perspectives on parameter uncertainty: On the one hand distributions of ML estimates due to the random data (Milescu et al., 2005 ; Moffatt, 2007) and the Bayesian posterior distribution loose their tightly linked (and necessary) connection if the RE algorithm is used. Thus, KF robustifies also ML inferences of the rate matrix. Our findings are consistent with the findings for gene regulatory networks (Gillespie and Golightly, 2012) which show that RE approaches deliver a too narrow posterior in contrast to stochastic approximations which deliver an acceptable posterior compared to the true posterior (defined by a particle filter algorithm). On the data side of the inference problem adding cPCF data eliminates the bias, reduces the variance of the position of the HDCI and eliminates unidentified parameters (Appendix 9—figures 1 and 2) for both investigated algorithms. This advantage increases with model-complexity.

For the five-state and six-state model, we applied microscopic-reversibility (Colquhoun et al., 2004). We enforced it by hierarchical prior distribution (Materials and methods Equation 60) whose parameters can be chosen such that they allow only arbitrarily small violations of microscopic-reversibility. But the prior distribution can also be used to enforce some softer regularization around microscopic-reversibility. Thus, we can transfer the usually strictly applied algebraic constraint (Salari et al., 2018) of microscopic-reversibility to a constraint with scalable softness. In that way we can model the lack of information if microscopic-reversibility is exactly fulfilled (Colquhoun et al., 2004) by the given ion channel instead of enforcing the strict constraint upon the model.

Prior critique and model complexity

In the Bayesian framework, the likelihood of the data and the prior generate the posterior. Thus, the performance of both algorithms can be influenced by appropriate prior distributions. We used a uniform prior over the rate matrix which is not optimal. Note, that uniform priors are widely used by several reasons. They appear to be unbiased, and are assumed to be a ‘no prior’ option (which they are not). This is true for location parameters like mean values. In contrast, for other parameters, such as scaling parameters like rates or variances, a uniform prior adds bias to the inference towards faster rates (Zwickl and Holder, 2004). We suspect, that for the PC data even in the simplest model discussed here the lower data quality limit below which we detected unidentified parameters (improper posteriors) is caused by the uniform prior. This lower limit for the KF also increases with the complexity of the model from Nch<2103 for the foue-state model till Nch2104 for 6-state-1-open-state model. Note, that it is hardly possible to fit the 6-state-1-open-state model with the RE approach for the same amount of PC data. We observe cPCF data eases this problem because the likelihood becomes more concentrated for all parameters. The likelihood dominates the uniform prior. Nevertheless, for most parts of the paper we used a uniform prior over the rates and equilibrium constants to be comparable with the usual default method: a plain ML which influences our results in data regimes in which the data is not strong enough to dominate the bias from the uniform prior. Thus, both algorithms perform better with smarter informative or at least unbiased prior choices for the rate matrix.

In principle, to rule out an influence of the prior, unbiased priors should be used for the rates. The standard concept for unbiased least informative priors is to construct a Jeffreys prior Jeffreys, 1946 for the rate matrix which is, however, beyond the scope of the paper.

The influence of the brightness of the ligands of cPCF data on the inference

To evaluate the advantage of cPCF data Biskup et al., 2007 with respect to PC data only (Figure 10), we compare different types of ligands: Idealized ligands with brightness λb, emitting light only when bound to the channels, ‘real’ ligands which also produce background fluorescence when diffusing in the bath solution (Appendix 5) and current data alone. For datasets including fluorescence, the increased precision for the dissociation rate of the first ligand, k2,1, is that strong that the variance of the posterior (k2,1,k3,2) nearly vanishes in the combined plot with the current data (nearly all probability mass is concentrated in a single point in Figure 10a). The effect on the error of the equilibrium constants Ki is less strong. Additionally, the bias is reduced and even the estimation of Nch is improved. The brighter the ligands are, the more the posterior of the rates decorrelates, in particular (k2,1,k3,2) (Figure 10a). All median estimates of nine different cPCF data sets (Figure 10b) differ by less than a factor 1.1 from the true parameter except k3,2, which does not profit as much from the fluorescence data as k2,1 (Figure 10c). The 95th percentiles, l95 of (k2,1) and (K1) follow l951/λb. Thus, with increasing magnitude of ligand brightness λ, the estimation of k2,1 becomes increasingly better compared to that of k3,2 (Figure 10c). The posterior of the binding and unbinding rates of the first ligand contracts with increasing λb. The l95 percentiles of other parameters exhibit a weaker dependency on the brightness (l95λ-0.1). For λb=0.01 photons per bound ligand and frame, which corresponds to a maximum mean signal of 20 photons per frame, the normal approximation to the Poisson noise hardly captures the asymmetry of photon counting noise included in the time traces. Nevertheless, l95 decreases about ten times when cPCF data are used (Figure 10c). The estimated variance of

(18) r(ti):=y(ti)(HE[n(ti)])var[y(ti)]

with the mean predicted signal HE[n(ti)], for PC or cPCF data is σ2(ri)1 (Figure 10d) which means that the modeling predicts the stochastic process correctly up to the variance of the signal. Note that the mean value and covariance of the signal and the state form sufficient statistics of the process, since all involved distributions are approximately multivariate normal. The fat tails and skewness of (k21) and (k12) arises because the true model is too flexible for current data without further prior information. The KF allows to determine the variance (Figure 10e) of the open-channel current noise for σop=0.1i. Adding fluorescence data has roughly the same effect on the estimation of σop like using five times more ion channels to estimate σop2.

The Benchmark of the KF for PC versus cPCF data with different bright ligands shows that even adding a weak fluorescence binding signal can add enough information to identify before unidentified parameters.

(a) Posteriors of PC data (blue), cPCF data with λb=0.00375 (orange) and cPCF data with λb=0.375 (green). For the data set with λb=0.375, we additionally accounted for the superimposing fluorescence of unbound ligands in solution. In all cases Nch=103. The black lines represent the true values of the simulated data. The posteriors for cPCF (k2,1,k3,2) are centered around the true values that are hardly visible on the scale of the posterior for the PC data. The solid lines on the diagonal are kernel estimates of the probability density. (b) Accuracy and precision of the median estimates visualized by a violin plot for the parameters of the rate matrix for 5different data sets. Four of the five data sets are used a second time with different instrumental noise, with λb=0.375 and superimposing bulk signal. The blue lines represent the median, mean and the maximal and minimal extreme value. (c) The 95th percentile of the marginalized posteriors vs. λb normalized by the true value of each parameter. A regime with l951/λ is shown for k2,1 and K1, while other parameters show a weaker dependency on the ligand brightness. (d) Histograms of the residuals r of cPCF with λb=2.510-3 data and PC data. The randomness of the normalized residuals of the cPCF or PC data is well described by rinormal(0,σres2=1). The estimated variance is σres2=0.98+0.26. Note that the fluorescence signal per frame is very low such that the normal approximation to Poisson counting statistics does not hold. e, Posterior of the open-channel noise (σop2/σop,true2) for PC data with Nch=103 (green) and Nch=105 (blue) as well as for cPCF data with Nch=103 (red) with λb=0.375. We assumed as prior for the instrumental variance P(σ2)=N(1,0.01).

Figure 10—source data 1

Five different sets of time traces of panel b.

All instrumental noise is already added.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig10-data1-v2.zip
Figure 10—source data 2

Eighteen sets of time traces of panel c.

The number in the file name indicates the brightness.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig10-data2-v2.zip

Sensitivity towards filtering before the analog-to-digital conversion of the signal

On the one side, every analog signal to be digitized needs analog filtering for antialiasing according to the Nyquist theorem. On the other side, every analog filter does not only suppress unwanted white noise but also distorts the dynamics (Figure 11a) of the signal of interest (Silberberg and Magleby, 1993). Therefore, (Qin et al., 2000) recommend to avoid analog filtering as much as possible in single-channel analysis and let the HMM analyze the data in the rawest available form, even with simultaneous drift correction (Sgouralis and Pressé, 2017a). One can also expect that analog filtering of a macroscopic signal is harmful for the inference of the KF and the RE approach. For the CCCO model considered herein we investigated the mean behavior (accuracy and precision) of the posterior of both algorithms with seven data sets (simulated at 100 kHz to mimic an analog signal). A digital fourth-order Bessel filter (Virtanen et al., 2020) was then applied. The maximum analysing frequency fana of the KF used is 100-400 Hz to be comparable to cPCF setups. The slower frequency at which the Bayesian filter analyzes the data is necessary because the applied Bessel filter has caused additional time correlations in the originally white noise of the signal. Thus, an all-data-points fit would immediately violate the white noise assumption of Equation 4 which we restore by analyzing at a much lower frequency. We then let the time scales of the induced time correlations become larger and larger by decreasing fcut. Physically, the absolute cut-off frequency fcut is irrelevant; what matters is the magnitude of fcut relative to fana and to the eigenvalues αi of the ensemble (see, Appendix 3), since the eigenvalues determine the time evolution of the mean ensemble state, the autocorrelation, and Fourier spectrum of the fluctuations around the equilibrium distribution (Colquhoun et al., 1997b). The eigenvalues depend on the ligand concentration such that for a four-state model for each ligand concentration there are three relevant time scales -1/αi (where i=2,3,4) plus the equilibrium solution which satisfies α1=0. For 10 different time series 310+3 the outcome is to have different values of αi.Each eigenvalue is the inverse of the time constant of an exponential decay (see, Appendix 3). For this reason, we normalize in the following (Figure 11) the cut-off frequencies by α2 at the highest ligand concentration. We analyze the arithmetic mean from 7 different data sets of the median of the posterior of the rate matrix. The mean Euclidean error of the median (Figure 11b) and a series of quantiles demonstrate that overall the error of the mean median of the posterior KF (green) is smaller than that obtained by the RE. For unfiltered data, the accuracy of the mean median of the KF is increased by 1.6. Based on the Euclidean error both algorithms benefit slightly from careful analog filtering for fcut/α21 while the offset remains rather constant. A strong negative effect of analog filtering starts for both algorithms around fcut1kHZ. This is induced by fcutfana (see, Appendix 10). In contrast, based on the level of each individual parameter of the rate matrix (Figure 11c 1–6) the bias induced by analog filtering immediately starts with fcut=70kHz (Figure 11c 1–3). Note, that visual inspection of the signal (Figure 11a) does not reveal signal distortions fcut10kHz though they are detected by both algorithms. For unfiltered data, the maximum of the posterior for the RE approach is a biased estimate E[θME]θtrue for at least the parameters k~21,K~21,K~32 of the true value θtrue, which is explained (Milescu et al., 2005) by the fact that RE approaches ignore the autocorrelation of the intrinsic noise. Additionally, the data indicate that for K~43 the maximum of the posterior is even for the KF a biased estimate which we interpret as limitations induced by the fact that the mean vector and covariance-matrix do not constitute sufficient statistics as soon as Poisson distributed photon counting or open-channel noise blurs the signal. For the RE approach, the additional bias induced by the analog filter on the mean maximum of all parameters of the posterior starts with fcut70 kHz or, in other words, at the fastest time scale in the whole data set. The total bias in the estimate is reduced for k21 with the additional bias from the analog filtering but increased for k32 which for the Euclidean error leads at first to a small increase in accuracy. The KF is more robust towards analog filtering, as the results alter less with fcut (given a reasonable fcut), and less biased for unfiltered data in the estimates of these parameters. On the one hand, the Euclidean error shrinks for fcut>10 kHz (Figure 11b). On the other hand, on the single-parameter level (Figure 11c 1–6), the parameter estimates pick up bias due the analog filtering even for high filter frequencies, in particular for the RE approach. Only for k43 the KF is more biased than the RE approach.

The KF is robust against moderate analog filtering of the current signal.

High (Bayesian) sampling frequencies and minimal analog filtering does minimize bias which otherwise deteriorates parameter identification. In order to mimic an analog signal before the analog-to-digital conversion we simulated seven different 100 kHz signals which were then filtered by a digital fourth-order (4 pole) Bessel filter. The activation curves were then analyzed with the Bayesian filter at 125 Hz and the deactivation curves at sampling rates between 166-500 Hz. We chose for the analog signal σexp/i=10, σop/i=0.1, thus a stronger background noise, and we set the mean photon count per bound ligand as λb=5. For the ensemble size we choose Nch=103. (a) Current time trace filtered with different fcut. Except for 100 Hz (red) the signal distortion is visually undetectable. Nevertheless, the invisible signal distortions from analog filtering are problematic for both algorithms. (b) Estimate of the distribution mean Euclidean error of the median of the posterior vs. the cut-off frequency of a 4 pole Bessel filter (upper scale is in units of kHz) or scaled to the channel time scale (lower scale, see text). The fastest two eigenvalues α1,2/α2 for the highest ligand concentration are indicated by the black vertical lines. The fastest ratios α1,2/α2 for the next smaller ligand concentration are indicated by the red vertical lines. The slowest eigenvalue ratio α3/α2 at the highest ligand concentration is beyond the left limit of the x-axis. The solid line is the mean median of five data sets of the respective RE posterior (blue) and KF posterior (green). The green shaded area indicates the 0.6 quantile (ranging from the 20th percentile till the 80th percentile), demonstrating the distribution of the error of the posterior median due to the randomness of the data. (c) 1–3, Accuracy (bias) and precision of the maxima of the posterior k~max,ij of the posterior maxima of the rates vs. the cut-off frequency of a Bessel filter. The shaded areas indicate the 0.6 quantiles (ranging from the 20th percentile till the 80th percentile) due the variability among data sets while the error bars show the standard error of the mean. The deviation of the mean from the true value is an estimate of the accuracy of the algorithm while the quantile indicates the precision. (c) 4–6, Accuracy and precision of the maxima of the posterior K~max,ij of the posterior maxima of the corresponding equilibria vs. the cut-off frequency of a Bessel filter.

Figure 11—source data 1

Representative data set of cPCF data whose current has been filtered with decreasing fcut.

In the name of each file is fcut encoded in units of 100 kHz. The time axis is identical for all time traces and saved in the file Time.txt. The folder is similar to six other provided source data folders which have in total been used in this figure.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig11-data1-v2.zip
Figure 11—source data 2

Representative data set of cPCF data whose current has been filtered with decreasing fcut.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig11-data2-v2.zip
Figure 11—source data 3

Representative data set of cPCF data whose current has been filtered with decreasing fcut.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig11-data3-v2.zip
Figure 11—source data 4

Representative data set of cPCF data whose current has been filtered with decreasing fcut.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig11-data4-v2.zip
Figure 11—source data 5

Representative data set of cPCF data whose current has been filtered with decreasing fcut.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig11-data5-v2.zip
Figure 11—source data 6

Representative data set of cPCF data whose current has been filtered with decreasing fcut.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig11-data6-v2.zip

The KF is the unique minimal variance Bayesian filter for a linear Gaussian process (Anderson and Moore, 2012) which means, given that the assumptions of the KF are fulfilled by the true process of interest, the KF is mathematically proven the best model-based filter to apply. Consequently, analog filtering does not provide an advantage unless it removes specific high frequency external noise sources (colored noise). We demonstrate (Appendix 10) this for PC data and varied fcut and fana. On the downside, increasing fana makes the results of both algorithms more fragile if fcutfana does not hold. Thus, the critical edge in Figure 11b is indeed induced by fcut approaching fana. This suggests that the white noise assumption of both algorithms is violated. On the upside, if fcutfana is given, the KF with an order of magnitude higher fana has a reduced bias of up to 20% for fcut for individual parameters compared to the KF with lower fana. Additionally, a higher fana reduces the variance. To reduce the bias of parameter estimates to a minimum, the experimental design offers two remedies, either doing cPCF experiments with additional discussed advantages or using the KF at a high fana with even much higher fcut.

By theoretical grounds a further argument for doing less analog filtering is that this benchmark analyzes data of a finite state Markov process, which is a coarse proxy for the true process. In reality, relaxation of a protein is a high-dimensional continuous-state Markov process with infinitely many relaxation time scales (eigenvalues) (Frauenfelder et al., 1991) which, however, might be grouped in slower experimentally accessible and non-accessible faster time scales (Noé et al., 2013). With larger data sets of higher quality from better experiments, the faster time scales might become accessible if they are not distorted by analog filtering. In conclusion, deciding on a specific kinetic scheme and inferring its parameters means finding a model which accommodates in the best way to the set of observed eigenvalues. Analog filtering hampers the RE, KF or HMM forward-backward algorithm (Qin et al., 2000) to correctly describe the faster time scales.

Error due to finite integration time of fluorescence data

So far, we idealized the fluorescence data integration time as being instantaneously relative to the time scales of ensemble dynamics. In real experiments, the fluorescence signal of cPCF data has orders of magnitude longer minimal integration time Tint (time to record all voxels of a frame) or maximal integration frequency fint=1/Tint, than the possible sampling frequency of current recordings. We mimic the finite integration time

(19) ydigital(ti)=tstartti=tstart+Tintyanalog(t)dtj[tstart,tstart+Tint]y(tj)Δt

by summing with a sliding window the 100 kHz signal including the white noise to obtain data at an effectively lower sampling frequency (Figure 12a). Additionally we set the Bessel filter for the current data to fcut/α2=4.59 or fcut=90 kHz. The fastest used analysing frequency is fana=500Hz. We scale mean photo brightness λb and background noise down such that the signal-to-noise ratio of the lower integration frequency data is the same as of the high-frequency data λb/Tint=const . We do that in order to separate the bias from the finite integration time from other effects such as a better signal to noise ratios for each integrated point. Note that we only analyzed the plot until fint=fana. Both algorithms incur very similar bias due to the finite integration time (Figure 12b). The KF (green) is more precise for high integration frequencies fcut/α2 until fcut/α20.08 then the RE approach becomes more robust. Similar to Bessel-filtered current data (Figure 11b) on the single parameter level the systematic deviations start early for example fint=10 kHz for K21 (Figure 12c4). Possibly the systematic deviations start (Figure 12c2) already at fint=50 kHz for k32. The sudden increase of the Euclidean error (Figure 12b) of the mean median at fcut/α20.2 occurs in this case not due to fint approaching fcut but due to fintα1,2 for many ligand concentrations. To show this we plot the results of the fitting of five different data sets without including the highest 4 ligand concentrations (red) which means the largest eigenvalues are much smaller (Figure 12b,C1-6). Additionally, we keep fint=const. Although fluctuations of the posterior medians are higher, the KF becomes more robust. Note, that the fastest eigenvalues of these reduced data sets are indicated by the blue bars (Figure 12b and c4). Based on the Euclidean error (Figures 11a and 12a) the robustness of both algorithms against the cut-off frequency is compared with the robustness against the integration frequency found to be about an order of magnitude higher. That is related to a specific detail of the model used: the binding reaction, corresponds to the fastest time scales of the overall dynamic (difference between Figure 1b and c), which is exposed by the fluorescence signal. Thus, kinetic analysis of any data should make sure that the corresponding frequency of the most dominant timescales of the time series are much slower than the respective fintfcut independently of the investigated algorithms.

Finite integration time of fluorescence recordings acts also as a filter.

Thus the sampling should be faster than the fastest eigenvalues to avoid biased results. We simulated five different 100 kHz cPCF signals. All forms of noise were added and then the fluorescence signal was summed up using a sliding window to account for the integration time to produce one digital data point. The activation curves were then analyzed with the Bayesian filter at 125 Hz and the deactivation curves at 166-500 kHz, see caption of Figure 8. We plot the 0.6-quantile (interval between the 20th and the 80th percentile) to mimic ±one standard deviation from the mean as well as the mean of the distribution of the maxima of the posterior for different data sets. (Note, this is not equivalent to the mean and quantiles of the posterior of a single data set.). The quantiles represent the randomness of the data while the error bars indicate the standard error of the mean maximum of the posterior. Blue (RE) and green (KF) indicate the two algorithms with the standard data set while red (KF) shows examples that use only the six smallest ligand concentrations for the analysis in order to limit the highest eigenvalues. a, Instantaneous probing of the ligand binding (blue) compared with a probing signal which runs at fint=1 kHz. The integrated brightness of the bound ligand is λb=5 photons per frame. Although the red curves seem like decent measurements of the process except for the highest two shown ligand concentrations, the mean error is roughly an order of magnitude worse than for fint=10 kHz. Note, that for visualization we plot at a higher frequency than the Kalman filter analyzed the data. b, Estimate of the distribution of the (Euclidean error of the mean median of the posterior) vs. the scaled integration frequency fint/α2=1/(α2Tintegration). We use integration frequency instead of the integration time to make the plot comparable to the Bessel filter plot. The solid line is the mean median of five data sets of the respective KF posterior (green, red) and RE posterior (blue). The shaded areas indicate the 0.6-quantile which visualizes the spread of the distribution of point estimates. The two fastest time scales (eigenvalues) at the highest ligand concentration are indicated by the vertical black lines, the time scales of the next lower ligand concentrations with the red vertical lines. c 1–3, Accuracy (bias) and precision of the maxima of the posterior kij,max rates vs. the integration frequency. c 4–6, Accuracy and precision of the maxima of the posterior Kij,max of the corresponding equilibria vs. the cut-off frequency of a Bessel filter.

Figure 12—source data 1

The data folder contains one data set of cPCF data.

The fluorescence has been integrated with decreasing fint. The number in the file name encodes the amount of 100kHz data points which have been summed up to mimic the integration time. The time axis is identical for all time traces and saved in the file Time.txt. The folder is similar to six other provided source data folders which have in total been used in this figure.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig12-data1-v2.zip
Figure 12—source data 2

The data folder contains one data set of cPCF data.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig12-data2-v2.zip
Figure 12—source data 3

The data folder contains one data set of cPCF data.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig12-data3-v2.zip
Figure 12—source data 4

The data folder contains one data set of cPCF data.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig12-data4-v2.zip
Figure 12—source data 5

The data folder contains one data set of cPCF data.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig12-data5-v2.zip
Figure 12—source data 6

The data folder contains one data set of cPCF data.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig12-data6-v2.zip
Figure 12—source data 7

The data folder contains one data set of cPCF data.

https://cdn.elifesciences.org/articles/62714/elife-62714-fig12-data7-v2.zip

Conclusions

We generalized the filter equations (Methods Equation 37, 38d, 57, 58 and 59) of the KF for analyzing the gating and binding dynamics of ligand-gated ion channels with a realistic signal-generating model for isolated patch-clamp (PC) and confocal patch-clamp fluorometry (cPCF) data including open-channel noise, photon-counting noise and background noise. Any other type of linear kinetic scheme (e.g. for voltage-dependent channels) and signal can be applied as long as the characteristics of the signal are sufficiently described by normal distributions. Our approach is derived by approximating the chemical master equation of a first order chemical reaction network (which ion channel experiments usually are) which is exact up to the second statistical moment. For first-order chemical reaction networks, the linear noise approximation (Wallace et al., 2012) are exact up to the second moment too (Grima, 2015). Thus, we can conclude that our Bayesian filter uses a time integrated version of the linear noise approximation. To our understanding of Wallace et al., 2012 our approach is thus equivalent to approaches based on the chemical Langevin or Fokker-Planck equations (Gillespie, 2002). Consequently, this also makes the considerations of the quality of the chemical Langevin equation as an approximation (Gillespie, 2000) of the chemical master equation valid for our approach. Compared to previous attempts Moffatt, 2007, this mathematical generalization is necessary (Figure 3b) in order to use Bayesian filters on macroscopic PC or cPCF data. With our algorithm, we demonstrate (Figures 3c and 7) that the common assumption that for large ensembles of ion channels simpler deterministic modeling by RE approaches is on par with stochastic modeling, such as a KF, is wrong in terms of Euclidean error and uncertainty quantification (Figures 5a–c ,6a–b).

Enriching the data by fluorescence-based ligand binding reveals two regimes. In one regime, the two-dimensional data increase the accuracy of the parameter estimates up to 10-fold (Figure 4a and c). In the other regime of lower channel expression, enriching the data by the second observable, makes non-identified parameters to identified parameters. The second observable in cPCF data decreases the overfitting tendency (Figure 4a, b and d) of the RE approach on the true process. Thus, in this regard the advantage of the KF becomes smaller. However, by exploiting Bayesian HDCV we gain a second perspective: We show for various signal-to-noise ratios (Figures 5a–c ,6a–b) that the posterior sampled by a RE approach never covers the true values within a reasonable HDCV. Thus, the central feature of Bayesian statistics, exact uncertainty quantification by having the full posterior, is meaningless in combination with an RE approach (considering the type of data and set of signal-to-noise ratios that we tested). This even holds true for very pessimistic signal-to-noise assumptions Figure 6b. If HDCVs based on an RE approach cannot be trusted, the same applies to confidence volumes based on the curvature of the likelihood. This is not the case for the KF which delivers properly shaped posteriors (Figures 6a–c ,5a–c). Increasing the model complexity, at unchanged PC data quality (Figure 7) shows that the RE approach displays unidentified rates even for large ion channel ensembles while our approach identified all parameters for the same data. We also investigated the robustness of both algorithms against the cut-off frequency of a Bessel filter (Figure 11) and showed the overall superior robustness of the KF against errors of analog filtering compared to the RE approach. Analog filtering has its limitations due to distorting the higher frequencies of the Fourier spectrum of the signal. Thus, one should let the KF sample as fast as possible, with a cut-off frequency of at least one order of magnitude higher than the sampling frequency of the KF.

Similar to the Bessel filter, the KF is more robust than the RE approach against errors due to the finite integration time. Nevertheless, it is crucial for both algorithms (Figure 12), that the intrinsic time scales (1/eigenvalues) of the process to be analyzed are slower than the integration time of the data points. Otherwise the accuracy of the inference deteriorates.

Altogether, we demonstrated the performance of the generalized Kalman filter on ion channel data for inference of kinetic schemes. Nevertheless, our approach can approximate any other stochastic system and signal distributions of linear (pseudo-first-order) kinetics (Sorenson and Alspach, 1971). Prospective extensions of the Bayesian filter, for example by Bayesian Gaussian sum filters or similar numerically brute force concepts such as particle filters (Golightly and Wilkinson, 2011; Gillespie and Golightly, 2012), can overcome modeling errors at low ion channel numbers or low photon fluxes.

Materials and methods

We simulated state evolution s(t) with either the software QuB (Nicolai and Sachs, 2014) for PC data or an inhouse Matlab routine (The code will be shared on request.) for cPCF data. The inhouse Matlab routine is an implementation of the Gillespie algorithm Gillespie Daniel T. (1977). Traces were summed up, defining the ensemble state vector n(t):=(n1,n2,n3,n4), which counts the number of channels in each state. At first we used a 10 kHz sampling frequency for the Gillespie algorithm but for investigating the errors induced by analog filtering the current signal and the finite integration time for each fluorescence data point the Gillespie algorithm sampled at 100 kHz. The KF, RE, and Bayesian filter routines were implemented in Stan (Carpenter et al., 2017) with the interface package PyStan and ran on a high performance computing cluster with O(100) Broadwell and SkyLake nodes. A Tutorial for Patch clamp data can be found on the git hub page https://github.com/JanMuench/Tutorial_Patch-clamp_data and for cPCF data, https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data. The cPCF data simulation code can be found here: https://cloudhsm.it-dlz.de/s/QB2pQQ7ycMXEitE (Source code 1).

Methods

Hereinafter, we derive the equations for our Bayesian filter for time series analysis of hidden linear chemical reaction networks (kinetic schemes). A detailed description of the experimental noise is provided in the Appendix 5.

The relation of Bayesian statistics to the Kalman filter

Request a detailed protocol

The following conventions are generally used: Bold symbols are used for multi-dimensional objects such as vectors or matrices. Calligraphic letters are used for (some) vectorial time series and double-strike letters are used for probabilities and probability densities. Within the Bayesian paradigm (Hines, 2015; Ball, 2016), each unknown quantity, including model parameters θ and time series of occupancies of hidden states NT={n(ti)}i=1T, are treated as random variables conditioned on observed time series data YT=y(ti)i=1T. The prior (θ)=jNpar(θj) or posterior distribution P(θ|YT) encodes the available information about the parameter values before and after analysing the data, respectively. According to the Bayesian theorem, the posterior distribution

(20) P(θ|YT)=1Z(YT)L(YT|θ)jNparP(θj)

is a probability distribution of a parameter set θ conditioned on YT. The likelihood L(YT|θ) encodes the distribution of the data by modelling the intrinsic fluctuations of the protein as well as noise coming from the experimental devices. The prior provides either assumptions before measuring data or what has been learnt from previous experiments about θ. The normalization constant

(21) Z(YT)=L(YT|θ)P(θ)dθ

ensures that the posterior is a normalized distribution. The KF is a special class of models in the family of Bayesian filters (Ghahramani, 1997), which is a generalisation of the classical KF. Due to its linear time evolution (Equation 1), the KF is particularly useful for modeling time series data of ensembles dynamics of first order chemical networks. It delivers a set of recursive algebraic equations (Materials and methods Equation 28 and Equation 32) for each time point, which allows to express the prior P(n(t)|Yt1) and (after incorporating y(t)) the posterior P(n(t)|Yt) occupancies of hidden states n(t) for all t given a set of parameters θ. This means the KF solves the filtering problem (inference of NT) by explicitly modeling the time evolution of n(t) by multivariate normal distributions. This allows us to replace L(YT|θ) of Equation 20 by the expression of Equation 9.

The Bayesian framework (as demonstrated in this article) has various properties which makes it superior to ML estimation (MLE) (McElreath, 2018). Those properties are in particular useful for the analysis of biophysical data since very often the dynamics of interest are hidden or latent in the data. Models with a hidden structure are called singular. For regular (non-singular) statistical models, maxima θML of the posterior or likelihood converge in distribution

(22) limnn(θMLθtrue)N(0,F1(θtrue))

to the true value θtrue,where F-1(θtrue) is the inverse Fisher information matrix. Under those conditions it is justified to derive from the curvature of the likelihood at θML via the Cramer-Rao-bound theorem

(23) covar[θML]=F-1(θML)

a confidence volume for the inferred parameters. In contrast, consider for example the type of data investigated in this study which probes the protein dynamics by current and light. Singularity means that the Fisher information matrix of a model is not invertible leading to the breakdown of the Cramer-Rao Bound theorem. Due to the breakdown, it cannot be guaranteed that even in the asymptotic limit the log-likelihood function can be approximated by a quadratic form Watanabe, 2007. Thus, usually the MLE does not obey Equation 22. Consequently, the posterior distribution is usually not a normal distribution either (Watanabe, 2007). Using the full posterior distribution without further approximations detects the resulting problems such as deviation from normality or non-identifiability of parameters, related to the singularity. In conclusion, the posterior is still a valid representation of parameter plausibility while ML fails.

Time evolution of a Markov Model for a single channel

Request a detailed protocol

In the following, we write the time t as function argument rather than a subscript. Following standard approaches, we attribute to each state of the Markov model an element of a vector space with dimension M. At a time, a channel can only be in a single state. This implies that the set of possible states is S:={(1,0,0,),(0,1,0,),,(,0,1)}{0, 1}M. In the following, Greek subscripts refer to different states while Latin subscripts refer to different channels. By s(t)=eα we specify that the channel is in state α at time t. Mathematically, eα stands for the α-th canonical unit Cartesian vector (Table 1).

Table 1
Important symbols.
SymbolMeaning
θSet of all unknown model parameters for which the posterior distribution is sampled
n(t)Hidden ensemble occupancy vector of channel states in a specific patch at time t which is a continuous Markov state vector n(t)RM
P(t)Variance-covariance matrix of a hidden ensemble state n(t) n a specific patch at time t which contains the dispersion of the ensemble and the lacking knowledge of the algorithm about the true n(t)
TTransition matrix of a single channel
KRate matrix which is the logarithm of the transition matrix
HObservation matrix which projects the hidden ensemble state vector onto its mean signal.
sSingle-molecule Markov state vector
ki,jSpecific transition rate from state j to state i, [K]i,j=ki,j ,
KiRatio of two transition rates i.e. an equilibrium constant
y(t)Data point at time
TNumber of observations in a time series
YTTime series of T data points, YT=y(ti)i=1T
NTTime series of T hidden ensemble states, NT={n(ti)}i=1T
Nch,jNumber of channels in patch number
iMean electrical current through a single-channel
σm2Variance of the current including all noise from the patch and the recording system
σop2Variance of the current noise generated by a single open-channel
λbMean brightness of a bound ligand
λFlMean brightness of the fluorescence signal from bulk and bound ligands
σbulk2Variance of the fluorescence generated by unbound ligands after subtraction of the image obtained for the reference dye
MNumber of single-channel states which is the dimension of n(t)NM in the KF algorithm
NobsDimensions of the observational space
F(Y)True probability density of Y, i.e. the true data-generating process
L(Y|θ)Likelihood function of the model parameters
P(θ|Y)Posterior distribution of the model parameters
Ppred(Y~|Y)Predictive distribution of the new data points
O(y|n)Distribution of observables for a single time step
N(|μ,Σ)Normal distribution with mean μ and variance-covariance matrix 
E[]Mean value

Assuming that the state transitions can be modeled by a first order Markov process, the path probability can be decomposed as the product of conditional probabilities as follows:

(24) (path)=(s(0),s(1),,s(T))=(s(0))(s(1)s(0))(s(2)s(1))(s(T)s(T-1)).

Markov models (MMs) and rate models are widely used for modeling molecular kinetics (Appendix 2). They provide an interpretation of the data in terms of a set of conformational states and the transition rates between these states. For exactness it remains indispensable to model the dynamics with a HMMs (Noé et al., 2013). The core of a hidden Markov model is a conventional Markov model, which is supplemented with a an additional observation model. We will therefore first focus on a conventional Markov model. State-to-state transitions can be equivalently described with a transition matrix T in discrete time or with a rate matrix K in continuous time, as follows:

(25) Tα,β:=(s(t+1)=eαs(t)=eβ)=exp(KΔt)α,β,

where exp is the matrix exponential. We aim to infer the elements of the rate matrix K, constituting a kinetic model or reaction network of the channel. Realizations of sequences of states can be produced by the Doob-Gillespie algorithm Gillespie Daniel T. (1977). To derive succinct equations for the stochastic dynamics of a system, it is beneficial to consider the time propagation of an ensemble of virtual system copies. This allows to ascribe a probability vector p(t) to the system, in which each element pα(t) is the probability to find the system at t in state α. One can interpret the probability vector p as the instantaneous expectation value of the state vector s.

(26) p(t)=E[s(t)]

The probability vector obeys the discrete-time Master equation

(27a) p(t+1)=Tp(t)
(27b) E[s(t+1)]=TE[s(t)]

Time evolution of an ensemble of identical non-interacting channels

Request a detailed protocol

We model the experimentally observed system as a collection of non-interacting channels. A single channel can be modeled with a first-order MM. The same applies to the ensemble of non-interacting channels. We focus on modeling the time course of extensive macroscopic observables such as the mean current and fluorescence signals as well as their fluctuations. A central quantity is the vector n(t) which is the occupancy of the channel states at time t:

(28) n(t)=i=1Nchsi(t)

This quantity, like s(t), is a random variate. Unlike s(t), its domain is not confined to canonical unit vectors but to nM. From the linearity of Equation 28 in the channel dimension and from the single-channel CME Equation 27b one can immediately derive the equation for the time evolution of the mean occupancy n¯(t)=E[n(t)]:

(29) n¯α(t+1)=βTα,βn¯β(t)

with the transition matrix T. The full distribution (n(t+1)|n(t)) is a generalized multinomial distribution. To understand the generalized multinomial distribution and how it can be constructed from the (conventional) multinomial distribution, consider the simplified case where all channels are assumed to be in the same state α. Already after one time step, the channels will have spread out over the state space. The channel distribution after one time step is parametrized by the transition probabilities in row number α of the single-channel transition matrix T. According to the theory of Markov models, the final distribution of channels originating from state α is the multinomial distribution

(30) (n(α)(t+1)nαeα)=(n1,,nMn(t)=nαeα)=nα!n1!nM!T1,αn1TM,αnM

In general, the initial ensemble will not have only one but multiple occupied channel states. Because of the independence of the channels, one can imagine each initial sub-population spreading out over the state space independently. Each sub-population with initial state α gives rise to its own final multinomial distribution that contributes nβ(α) transitions into state β to the total final distribution. The total number of channels at t+1 in each state can then be simply found by adding the number of channels transitioning out of the different states α.

(31) n(t+1)=αn(α)(t+1)

Evidently, the total number of channels is conserved during propagation. The distribution of n(t+1), defined by Equations 30; 31, is called the generalized multinomial distribution:

(32) n(t+1)general-multinomial(n(t),T)

While no simple expression exists for the generalized multinomial distribution, closed form expressions for its moments can be readily derived. For large Nch each (n(α)(t+1)nαeα) can be approximated by a multivariate-normal distribution such that also general-multinomial(n(t),T) has a multivariate-normal approximation. In the next section, we combine the kinetics of channel ensembles with the KF by a moment expansion of the governing equations for the ensemble probability evolution.

Moment expansion of ensemble probability evolution

Request a detailed protocol

The multinomial distribution (Fredkin and Rice, 1992) has the following mean and covariance matrix

(33) n¯(α)(t+1)=nαT:,α
(34) Σ(α)(t+1)=nαdiag(T:,α)-nαT:,α,:T:,α

where T:,α denotes the column number α of the transition matrix and diag(T:,α) describes the diagonal matrix with T;,α on its diagonal. Combining Equation 31 with Equations 33; 34 we deduce the mean and variance of the generalized multinomial distribution:

(35) E[n(t+1)n(t)]=αnα(t)T:,α=Tn(t)
(36) cov[n(t+1),n(t+1)n(t)]=αnα(t)(diag(T:,α)-T:,αT:,α)=diag(Tn(t))-Tdiag(n(t))T

Note that Equations 35; 36 are conditional expectations that depend on the random state n at the previous time t and not only on the previous mean n¯. To find the absolute mean, the law of total expectation is applied to Equation 35, giving

(37) n¯(t+1)=E[E[n(t+1)|n(t)]]=Tn¯(t),

in agreement with the simple derivation of Equation 29. We introduce a shorthand P(t):=cov(n(t),n(t)) for the absolute covariance matrix of n(t+1). Similarly, P(t) can be found by applying the law of total variance decomposition (Weiss, 2005 to Equations 35; 36), giving

(38a) P(t+1)=E[cov(n(t+1),n(t+1)n(t))]+cov[E(n(t+1)n(t)),E(n(t+1)n(t))]
(38b) =diag(Tn¯(t))Tdiag(n¯(t))T+cov(Tn(t),Tn(t))
(38c) =diag(Tn¯(t))Tdiag(n¯(t))T+Tcov(n(t),n(t))T
(38d) =diag(Tn¯(t))Tdiag(n¯(t))T+TP(t)T

Equations 37, 38d dare compact analytical expressions for the mean and the covariance matrix of the occupancy vector n at t+1 that depend on the mean n¯ and covariance matrix P at the previous time step t. Chaining these equations for different time steps t=0,,T allows to model the whole evolution of a channel ensemble. Moreover, these two equations together with the output statistics of O(y|n(t)) are sufficient to formulate correction equations Equation 59 of the KF (Moffatt, 2007; Anderson and Moore, 2012). These equations will be used in a Bayesian context to sample the posterior distribution of the model parameters. The sampling entails repeated numerical evaluation of the model likelihood. Therefore, analytical equations for the ensemble evolution that can be quickly evaluated on a computer millions of times are indispensable. This was achieved by deriving Equation 37, Equation 38d. Comparing Equation 38d with the KF prediction equation (Anderson and Moore, 2012) for P(t), we obtain the state-dependent covariance matrix of Equation 3 as

(39) Q(T,n¯(t))=diag(Tn¯(t))Tdiag(n¯(t))TT

In the following section on properties of measured data and the KF, we no longer need to refer to the random variate n(t). All subsequent equations can be formulated by only using the mean hidden state n¯(t) and the variance-covariance matrix of the hidden state P(t). We therefore drop the overbar in n¯(t) so that the symbol n(t) refers from now on to the mean hidden state.

Modeling simultaneous measurement of current and fluorescence

Request a detailed protocol

In the following, we develop a model for the conditional observation distribution O(y|n(t)) (Appendix 5 for experimental details). Together with the hidden ensemble dynamics this will enable us to derive the output statistics of the KF (see, below). Let y(t) be the vector of all observations at t. Components of the vector are the ion current and fluorescence intensity.

(40) y(t)=(fluorescence intensity(t)ion current(t))=(yflu(t)ycurr(t))

As outlined in the introduction part, in Equation 4 we model the observation by using a conditional probability distribution O(y(t)|n(t)) that only depends on the mean hidden state n(t), as well as on fixed channel and other measurement parameters. O(y(t)|n(t)) is modeled as a multivariate normal distribution with mean Hn(t) and variance-covariance matrix Σ(t), that can in general depend on the mean state vector n(t) (much like the covariance matrix of the kinetics in (Equation 38d) ). The observation matrix HNobs×M projects the hidden state vector n(t) onto Hn(t)Nobs, the observation space. The observation distribution is

(41) O(y(t)|n(t))=N(y(t)|Hn(t),Σ(n(t)))y(t)=Hn(t)+ν(t).

This measurement model is very flexible and allows to include different types of signals and error sources arising from both the molecules and the instruments. A summary of the signals and sources of measurement error and their contributions to the parameters of O(y(t)|n(t)) is provided by Table 2. Below we address the two types of signals and four noise sources one by one. For this, we decompose the observation matrix and the observation noise covariance matrix into the individual terms:

(42) H=HI+Hbinding
(43) Σ(t)=Σopen(t)+Σmeas.+Σbinding(t)+Σback
Table 2
Summary of signals and noise sources for the exemplary CCCO model with the closed states α=1,2,3 and the open state α=4.

The observed space is two-dimensional with yFl=fluorescence and yI=ion current. The fluorescence signal is assumed to be derived from the difference of two spectrally different Poisson distributed fluorescent signals. That procedure results in a scaled Skellam distribution of the noise.

ion currentfluorescence
current signalmeasurement noisefluorescence signalbackground fluorescence
Signaling statesOpen state-Ligand-bound states-
Error termOpen-channel noiseMeasurement noisePhoton countsBulk noise
Affected signalCurrentCurrentFluorescenceFluorescence
DistributionNormal (in4,σop2n4)Normal (0,σm2)Poisson (λbni(t))Scaled Skellam
Contribution to HH2,4=i-H1,:=(0,λb,2λb,2λb)-
Contribution to ΣΣ2,2=σop2n4(t)Σ2,2=σm2Σ1,1=(0,λb,2λb,2λb)n(t)Σ1,1=σback2

In the following, we report the individual matrices for the exemplary CCCO model with one open state α=4 and three closed states α=1,2,3. Matrices can be constructed analogously for the other models. For the definition of Σback refer to (Appendix 5).

Macroscopic current and open-channel noise

Request a detailed protocol

We model the current and the intrinsic fluctuations of the open-channel state s=e4 (the open channel noise) by a state-dependent normal distribution with mean in4(t) where n4(t) is the number of channels in the open state at t and i is the single-channel current. The additional variance of the single-channel current is described by σopen2. The sum of the instrumental noise of the experimental setup and the open channel noise is modeled as uncorrelated (white) normally distributed noise with the mean E[νI(t)]=0 and variance E[νI2(t)]=σop2n4(t)+σm2. By making the open-channel noise dependent on the hidden state population n4(t), we fully take advantage of the flexibility of Bayesian filters which admits an (explicitly or implicitly) time-dependent observation model. By tabulating the parameters of the two normal distributions into H and Σ, we obtain

(44) HI:=(0000000i)
(45) Σopen(t)+Σmeas.:=(000σop2n4(t)+σm2)

One can now ask for the variance of a data point y(t) given the epistemic and aleatory uncertainty of n(t) encoded by P(t) in Equation 38d. By using the law of total variance the signal variance follows as:

(46a) var(y(t))=E[var[y(t)|n(t)]]+var[E[y(t)|n(t)]]
(46b) =E[σop2n4(t)+σm2]+var[HIn(t)]
(46c) =σop2E[n4(t)]+σm2+(HIP(t)HI)2,2

See, Appendix 6 for further details.

Fluorescence and photon-counting noise

Request a detailed protocol

The statistics of photon counts in the fluorescence signal are described by a Poisson distribution with emission rate λFl

(47) yFl(t)Pois(λFl(t)).

The total emission rate λFl can be modeled as a weighted sum of the specific emission rates λb of each ligand class {0,1,2}. The weights are given by the stoichiometric factors which reflect the number of bound ligands. In order to cast the Poisson distribution into the functional form of the observation model (Equation 41), we invoke the central limit theorem to approximate

(48) yFlPois(λFl)N(λFl(t),λFl(t))

The larger λFl the better is the approximation. We assume, that the confocal volume is equally illuminated. For our model of ligand fluorescence, we assume for a moment that there is no signal coming from ligands in the bulk. We will drop this assumption in the next section. With these assumptions, we arrive at the following observation matrix

(49) Hbinding:=(0λb2λb2λb0000)

The matrix H aggregates the states into two conductivity classes: non-conducting and conducting and three different fluorescence classes. The first element (Hn)1 is the mean fluorescence λFl(t)=λb[n2(t)+2(n3(t)+n4(t))]. The variance-covariance matrix Σbinding can be derived along the same lines using Equation 48. We find

(50) Σbinding(t):=((Hn(t))1000)

Under these assumptions, the observation matrix can be written as follows

(51) H:=(0λb2λb2λb000i)

Output statistics of a Kalman Filter

Request a detailed protocol

with two-dimensional state-dependent noiseNow simultaneously measured current and fluorescence data y2, obtained by cPCF, are modeled. Thus, the observation matrix fulfills H2×M. One can formulate the observation distribution as

(52) y(t)=Hn(t)+νm(t)+(νpois(t)νop(t))yN(Hn(t),Σ(t)).

The vector νm denotes the experimental noise, with E[νm]=0 and variance given by the diagonal matrix Σmeas+Σback. The second noise term arises from Poisson-distributed photon counting statistics and the open-channel noise. It has the properties

(53) E[(νpois(t)νop(t))]=0

and

(54) cov((νpois(t)νop(t)),(νpois(t)νop(t)))=Σopen(t)+Σbinding(t).

The matrix Σ is a diagonal matrix. To derive the covariance matrix cov(y(t)) we need to additionally calculate var(yfluo(t)) and cov(yfluo(t),ypatch(t)). By the same arguments as above we get

(55a) var[yfluo(t)]=E[var(y(t)|n(t))]+var[E(y(t)|n(t)]
(55b) =E[σback2+(Hn(t))1]+var(Hn(t))
(55c) =σback2+(Hn(t))1+(Hn(t))HT)1,1

The cross terms can be calculated by using the law of total covariance

(56a) cov(ypatch,yfluo)=E[cov(ypatch,yfluo|n)]+cov(E(ypatch|n),E(yfluo|n))
(56b) =0+cov(H2,:n,H1,:n)
(56c) =H2,:cov(n,n)H1,:=H2,:P(t)H1,:

yielding the matrix

(57) cov(y,y)=HP(t)H+Σ(t)

We assumed that the Poisson distribution is well captured by the normal approximation. In cPCF data, the ligand binding to only a sub-ensemble of the channels is monitored, which we assume to represent the conducting ensemble such that Nch,FL=Nch,I. For real data, further refinement might be necessary to model the randomness of the sub-ensemble in the summed voxels. With the time evolution equations for the mean (Equation 35) and for the covariance matrix Equation 38d as well as with the expressions for the signal variance, we possess all parameters that are needed in the correction equation of the (Kalman, 1960; Anderson and Moore, 2012).

The correction step

Request a detailed protocol

For completeness we write down the correction step (Bayesian update) of the KF, although its derivation can be found in Chen, 2003; Anderson and Moore, 2012; Moffatt, 2007. The mean ensemble state n(t) is corrected by the current data point

(58) nposterior(t)=+nprior(t)+KKal(y(t)-Hnprior(t))

where Kalman gain matrix KKal:=P(t)priorHΣ-1 evaluates the intrinsic noise against the experimental noise. How precise are my model predictions about n(t) compared with the information gained about n(t) by measuring y(t). The covariance P(t) of the ensemble state n(t) is corrected by

(59) Pposterior(t)=Pprior(t)-KKal(HPprior(t)H+Σ(t))K

Equation 58,59, 37 and 38d form the filtering equations which summarize the algorithm. One initialises the first n(0) and P(0) and with an equilibrium assumption.

Microscopic-reversibility as a hierarchical prior

Request a detailed protocol

We applied microscopic-reversibility (Colquhoun et al., 2004) by a hierarchical prior distribution. Usually, micro-reversibility is strictly enforced by setting the product of the rates of the clockwise loop k1,k2,k3k4 equal to the anti-clockwise loop k5,k6,k7,k8 and then solving for the desired rate parameter to be replaced. This means that the classical approach can be described by drawing the resulting rate from a Dirac delta distribution prior with

(60) k1δ(k1-k5k6k7k8k2k3k4)

Following Equation 60, we can model microscopic-reversibility with any hierarchical prior distribution whose limit for a vanishing variance is Equation 60. For mathematical convenience, we defined the hierarchical prior by a sharply peaking beta distribution

(61) k1beta(100.01,100.01)

and by rescaling and adding an offset

(62) k1=k5k6k7k8k2k3k40.995+0.01k1

we derived a conditional prior which allows at maximum a ±0.005 relative deviation from the strict microscopic-reversibility. The ±0.005 micro-reversibility constraint is applied in (Figure 7b–d). In this way, one could model or even test possible small violation of microscopic-reversibility if smaller beta parameters such as beta(1,1) would be chosen.

Appendix 1

Efficiency of the Hamiltonian Monte Carlo sampler

Bayesian posterior sampling requires a fast and efficient sampler which requires in turn a fast evaluation of the likelihood and prior distribution. Optimally, the calculation time should increase as little as possible with the amount of data. This is why we chose the KF framework and parallelized the processing of a time trace across many CPUs. Compared to the maximization of the likelihood, sampling from the posterior requires approximately one to multiple orders of magnitude times more evaluations of the likelihood function to have a decent representation of the posterior. For example, in Moffatt, 2007 both algorithms need roughly 102 to 103 evaluations of the likelihood function to converge. This is only a small fraction of the number of evaluations needed here to obtain a converged estimate of the posterior (in fact the typical "warm-up" phase of the sampler which we used is barely half completed with such a small number of evaluations).

Appendix 1—figure 1
The posterior samples display autocorrelation which lowers the effective sample size.

(a) Sample traces from the 6-dimensional posterior of the rate matrix recorded after the warm-up of the sampler. (b) These traces are anti-correlated (in time) such that the effective sample size with which all quantities of relevance are estimated is smaller than the actual number of samples. Note that even at one iteration the absolute value of the auto-correlation is never larger than 0.05. The standard error of the bootstrapped mean auto-correlation is in the order of 10-4.

The ratio of the number of used samples to the number of likelihood evaluations can serve as a simple efficiency measure. However, this ratio is a too optimistic measure as revealed by a look into the internal dynamics of Hamiltonian Monte Carlo sampler (HMC). HMC treats the current parameter sample as part of a phase space point of a (dynamic) Hamiltonian system. The force acting on the system is the derivative of the negative logarithm of the posterior density Betancourt, 2017. Each suggested sample is derived from a ballistic movement with random kinetic energy in the augmented parameter space. Calculating this movement is done by integration of the corresponding Hamiltonian equations. Therefore, one suggested sample requires many evaluations of the gradient of the log posterior distribution. How many evaluations were needed in turn depends on the shape of the posterior and the length of the ballistic movement.

Appendix 1—figure 2
Computational time in units of 1000 samples per day vs Ndata/CPU10-4.

The two colors represent two different implementations. The difference between the two implementations is that the derived quantities which are the predicted mean signal and the predicted signal variance for each data point in the less efficient sampling algorithm are saved. In the more efficient algorithm these derived quantities are discarded.

Strong changes of the gradient of the posterior (high curvature of the posterior) require more integration steps. Setting all those sampler parameters requires either expert knowledge about the statistical model and sampler or a sampler which automatically learns optimal sampler parameter settings (warm-up) before it starts the actual sampling of the parameters of interest. Nevertheless, HMC generates samples (Appendix 1—figure 1a) from high-dimensional, correlated posteriors more efficiently than many other classical Markov chain Monte Carlo methods (Betancourt, 2017). By more efficient we mean the product of the speed of drawing one correlated sample with the amount of iterations needed for the autocorrelation function of those samples to vanish sufficiently. By construction, the samples from any Markov chain Monte Carlo method are correlated (Appendix 1—figure 1b). The quality of the adaptive NUTS HMC sampler provided by Stan can by seen in Appendix 1—figure 1b. The sample traces show a small anticorrelation whose absolute value is hardly larger than 0.05. This is achieved by letting the sampler adapt to the geometry of the posterior in the warm-up phase. The samples collected during the warm-up phase are discarded.

Hereinafter, we show that likelihood and sampler considerations are one aspect of computational efficiency. The other is the implementation of a parallelized algorithm. The blue curve (Appendix 1—figure 2) represents the scaling of the KF with an implementation of the algorithm which saves besides the sampled parameters the predicted mean and the covariance of the signal. This means for cPCF data that for each parameter sample 5NtracesNdata/CPU more memory operations are needed on the cluster node. Note, since we assigned each ligand concentration jump and the following data points to its own CPU the following relation holds: Ndata/CPU=Ndata/trace. Skipping saving those derived quantities (orange curve) creates a speedup of roughly 2 orders of magnitude. The derived quantities are redundant quantities which we suggest to calculate by feeding a small subsample of the posterior samples to the KF filter which then reanalyzes the traces by the given draws but does not redraw from the posterior. See Git-Hub: https://github.com/JanMuench/Tutorial_Patch-clamp_data.

We compare the computation time for the KF (green) and the RE approach (blue) for the 4-state model (Appendix 1—figure 3a) versus the data quality by Nch. The calculation time tcalc rises for both algorithms roughly like Nch. For each curve the likelihood evaluations stay constant for the whole plot. Thus this scaling relates to the integration time of the HMC sampler. Taking the average of the last 5 calculation times shows that the KF is about 2.7±0.2 times slower than the RE approach. The computation time for cPCF data of the 5-state-2-open states model (Appendix 1—figure 3b) shows a similar Nch-regime until it seems to become independent of Nch. Note, the different Nch interval which is displayed.

Appendix 1—figure 3
Computational time for cPCF data.

Computational time in units of 1000 samples per day vs. the amount of ion channels per trace Nch. The green curve belongs to the results from the KF, the blue curve to the results from the RE approach. The brighter blue curve corresponds to the ratio of computational time. a, four-state model with cPCF data. The posterior of the RE approach was sampled from three independent chains. We drew 18,000 samples per chain which included a large fraction of 6000 warm-up samples per chain. The posterior of the KF was sampled from 4independent chains. We drew 26,000 samples per chain which included a large fraction of 4000 warm-up samples per chain. b Computational time of the 5-state-2-open-states model with cPCF data. The KF algorithm drew 9000 samples which included a large fraction of 6,000 warm-up samples per chain. The RE algorithm drew 10,000 samples which included a large fraction of 5,000 warm-up samples per chain. In total, we used four independent chains. c, Computational time of the 6-state-1-open-state model with cPCF data. The KF algorithm drew 10,000 samples which included a large fraction of 6000 warm-up samples per chain. The RE algorithm drew 12,000 samples which included a large fraction of 7000 warm-up samples per chain. In total, we used four independent sampling chains.

Taking the average of the last 4 computation times shows that the KF is about 2.24±0.01 times slower than the RE approach. Surprisingly, with increasing model complexity the ratio of the computation time becomes not necessarily larger. The computation time for cPCF data of the 6-state-1-open state model (Appendix 1—figure 3c) shows a similar Nch-regime until it becomes independent of Nch. Taking the average of the last 4 calculation times reveals that the KF is about 3.3±0.3 times slower than the RE approach. With increasing model complexity, the ratio of the computation times becomes larger. Note, that there is little variation of the absolute calculation time of the algorithm across model complexity which is a good prospect for more complex kinetic schemes. Nevertheless, to achieve this computational speed we used in total 72 CPUs on one node, 20 CPUs per chain for 20 time traces. Thus, nodes with more then 80 CPUs would enable a better performance. Overall, the more complex likelihood calculations of the KF require 2–3 times more calculation for the tested models and cPCF data. Surprisingly, for most part of the displayed interval the KF is faster than the RE approach (Appendix 1—figure 4a) for PC data. Considering the more expensive likelihood evaluation (prediction and correction) of the KF, this result reminds that this computational time benchmark is a downstream test that integrates all aspects of the sampling. One possible explanation could be the following: We describe in the main text (Figures 4b16) for the RE approach, that due to the treatment of every data point as an individual draw from a multinomial distribution, HDCVs (Figures 56) are too narrow. This problem gets exacerbated (Figure 8) as we used 10 times more data points which we interpret as the cause of the KF being faster than the RE approach (Appendix 1—figure 4a). The narrower the posterior is the more integration steps from the sampler are needed to calculate the proposed trajectory with sufficient precision. Another example displayed (Appendix 1—figure 4b), has an increased kinetic scheme complexity. Note that we reduced the analyzing frequency 10 times to be comparable with the analyzing frequency of cPCF data. Under this condition, the intuition from likelihood calculation complexity is confirmed that the KF is slower. Taking the average of the last 4 calculation times shows that the KF is about 1.8±0.2 slower than the RE algorithm for PC data.

Appendix 1—figure 4
Computational time for PC data.

The computational time in units of 1,000 samples per day are plotted vs the amount of ion channels per trace Nch. The green curve belongs to the results from the KF, the blue curve to the results from the RE approach. a, Four-state model with PC data. In contrast to the cPCF data (Appendix 1—figure 3), fana is increased by an order of magnitude. Thus, 10 times more data points per time trace and CPU are evaluated. These results were obtained by four independent chains which drew 9000 samples per chain, including a large fraction of 7000 warm-up samples per chain.b, Five-state model with PC data. Note that here fana equals the analyzing frequency of cPCF data (Appendix 1—figure 3).

Convergence diagnostics

In the limit of infinitely many samples drawn from a posterior by any MCMC method, given that posterior space is simply connected, the sampling is guaranteed to converge to the typical set Betancourt, 2017. For a finite amount of samples from the posterior it is not guaranteed that the samples are representative for the typical set of the posterior. To ensure that the trace has converged to the typical set is an active field of statistical research. A discussion of these important consistency checks (Gelman and Rubin, 1992a; Gelman et al., 2013; Vats and Knudson, 2018; Gabry et al., 2019; Vehtari et al., 2021) ensuring that the sampled traces represent the typical set, is beyond the scope of this paper. In short, we usually worked with 4 independent sampling chains of the posterior (Gelman and Rubin, 1992b). On the one hand, we visually checked that all 4 chains do report the same typical set to verify that the sampler is fine. This convergence can be monitored by the R^ statistics (Gelman et al., 2013; Vehtari et al., 2021). R^ evaluates differences between the within- and between-chain parameter estimates. A bad mixing is indicated by large values of R^. Perfectly converged chains would be reported with R^=1. Additionally, the effective sample size Neff of each object of interest needs to be monitored. Both of them are reported by the “summary” method from the “fit” object in PyStan. We show (Appendix 1—table 1) an edited version of the output from this method. The last two columns would indicate signs of missing convergence. The second column reports the Markov sampling error telling if we drew enough samples in total. Note that any quantity far out in the tails would have a much larger error. There were cases when the sampler was not fine for all chains, meaning that 3 chains showed the same posterior but one showed something different. In such cases we simply discarded that chain. This occurred in particular more often with the RE approach. As we are benchmarking methods and know the true parameter values, it was rather obvious which chain did not work well. In particular, the successful binomial test of the HDCV is a downstream test that confirms besides the assumptions of the algorithm also whether the sampler worked sufficiently. For the Bayesian filter, sampler problems occurred at very low ion channel counts below Nch=200.

Appendix 1—table 1
Typical summary statistics of the posterior as printed by the “summary” method from the “fit” object in Pystan.

We deleted the 25% and 75% percentiles to decrease the size of the table and changed some of the names with respect to the used name in the algorithm to the symbol used in this article. Note that we assumed to measure 2 ligand concentrations (which includes 4 jumps) from one patch. Thus, we assumed for 10 ligand concentrations that they are coming from 5 patches which leads to a five-dimensional channels per patch vector Nch. The bias of Nchi discussed in the main article, is also observed here.

meansemeansd2.5%50%97.5%neffRhat
k21507.650.0611.91484.74507.46531.62412621.0
k32481.150.1628.63428.34480.19540.32336101.0
k43509.950.046.99496.39509.91523.77360061.0
K10.52.6e-55.2e-30.490.50.51405481.0
K20.516.7e-50.010.490.510.54332281.0
K30.494.2e-57.5e-30.470.490.5314451.0
Nch[1]1007.10.1420.05968.841006.81047.2198381.0
Nch[2]1006.10.1420.04967.831005.81046.0198741.0
Nch[3]1003.90.1420.0965.661003.61043.8198771.0
Nch[4]1007.00.1420.09968.671006.61047.1198431.0
Nch[5]1005.90.1420.09967.611005.61045.9198741.0
i1.07.0e-59.8e-30.981.01.02199711.0
σexp20.251.2e-52.5e-30.250.250.26431421.0
σop22.5e-31.2e-72.5e-52.5e-32.5e-32.5e-3427071.0
σfluo21.3e-52.3e-84.6e-61.0e-51.2e-52.5e-5413061.0
λ9.981.1e-40.029.949.9810.02351091.0

HMC parameter settings

The sampling efficiency of an HMC algorithm is sensitive to three parameters (Neal, 2011, Hoffman and Gelman, 2014). The three parameters, discretization time ϵ, the metric M and the number of steps taken L can be predefined or/and are adapted in the warm-up phase of the sampler. A typical number of iterations Nwarm in the warm-up phase for each chain is between Nwarm=31037103. Since a sufficient Nwarm depends on the correlation structure of the posterior, which depends on the quality and quantity of the data, we did not engage to optimize the number of warm-up steps. We rather checked whether for a set of different data (for example with different Nch) all chains are fine.

Stan

Stan is a probabilistic programming language that provides the research community with easy access to HMC sampling, variational Bayes, and optimization procedures like Maximum posterior/ML approaches. Stan uses automatic analytical differentiation of the statistical model such that no analytical treatment by hand has to be done for any tested model. The only requirement from the user is to define a statistical model of the data. The Stan compiler calculates the analytical derivatives of the model and then translates the statistical model to C++code and after that into an executable program. Furthermore, Stan uses adaptive HMC such that the length of the ballistic trajectory as well as the variance of the random kinetic energy to create the ballistic trajectory and the number of integration steps are optimized automatically. Then the program can be started from any high-level data processing language of the user’s choice such as Python, Matlab, Julia, R or from the command line.

Appendix 2

Markov models for a single ion channel

Markov models and rate models are widely used for modeling molecular kinetics. They provide an interpretation of the data in terms of a set of functional states and the transition rates between these states. Markov models can be estimated from experimentally recorded data as well as from computer simulation data. The use of Markov models with one-step memory is supported by the concept of the molecular free energy landscape. Molecular energy landscapes are typically characterized by conformationally well-defined free-energy minima that are separated by free-energy barriers. State transitions in molecules are thermally activated barrier-crossing events on this landscape (Frauenfelder et al., 1991) leading to a rapid equilibration of the system in the vicinity of this new minimum. Memory of other minima that have been visited in the past is not required. Regarding the wide spectrum of time scales at which processes in a protein take place, one has to be aware that there is typically a small number of relaxation modes with excessively long autocorrelation times and many relaxation modes with much faster autocorrelation times. To model the slow, experimentally accessible processes, it is sufficient to retain the small number of slow modes (Noé et al., 2011). It has been shown rigorously that working with the set of slow modes is equivalent to model the state dynamics with a small number of fuzzily defined metastable states in the full conformational space (Deuflhard and Weber, 2005) Later it has been shown that the set of slow modes can be well approximated with a hidden Markov model (Noé et al., 2013).

Appendix 3

Eigenvalues and fitting of sums exponential decays: Mean time evolution of the spectral components of the time evolution

From the experimental perspective we are used to fit sums of exponential decays to time series data of any signal such as currents or fluorescence.

(63) I(t)=I+i=1MIiexp(-t/τi)

The mathematical background for this procedure is the spectral decomposition of the solution of the corresponding RE equation

(64) dp(t)dt=Kp(t),

which is a vector differential equation whose evolution is governed by the rate matrix K.The general solution to the mean time evolution of the system is

p(t)=exp(Kt)p(t=0),

A Markov state model has as many real valued eigenvalues as it has states M. With those eigenvalues one can decompose the solution into its spectral components

p(t)=C1exp(α0t)p~0+i=1M-1Ciexp(αit)p~i

All of the eigenvalues obey α0. One of them always fulfills α0=0 which belongs to the equilibrium probability distribution of the ion channel. With exp(0)=1 we can write the sum as

(65) p(t)=C1p~0+i=1M-1Ciexp(αit)p~i

The eigenvalue α0=0 determines the equilibrium solution. Due to αi0 the solution is stable and converges to

p(t=)=C1p~0

The time evolution of the mean signal in its spectral components can then be derived

(66) I(t)=H(C1p~0+i=1M-1Ciexp(αit)p~i)

by multiplying with the observation matrix H and setting αi=-1τi we derived Equation 63, Colquhoun and Hawkes, 1995a. To derive the likelihood based on this deterministic approximation Milescu et al., 2005 one assumes for each data point that the ensemble state n(t) is drawn from

(67) n(t)multinomial(Nch,p(t))

with the additional approximation of the multinomial distribution by a multivariate normal distribution. Subsequently, the equation

(68) y(t)normal(Hn(t),Σ(t))

provides the contribution of the recording system to the signal of the ensemble state. In contrast, the KF approximates a

(69) n(t+Δt)generalized-multinomial(n(t),T)

distribution with multivariate normal distributions (see Methods) and then the details of the experimental setup are adjoined

Appendix 4

A note about Kalman filtering within other frameworks such as variational Bayes, maximum a posteriori and ML

We like to emphasize that the key property of the Kalman filter likelihood for an autocorrelated time series (see Equation 9) to factorize over time enables us to treat each data point as if it were independent. The alternative is to treat the whole time series as one high-dimensional data point as done in Celentano and Hawkes, 2004. A ML Kalman filter (Moffatt, 2007) would try to optimize Equation 9, which is usually done in the log space to improve numerical stability. Taking the logarithm of Equation 9 gives also an insight for the reader, who might be more experienced in using residual sums of squares for fitting.

(70) logL(YT|θ)=logt=2NTN(yt|HE[nt],HPtH+Σt)=t=2NTlogN(yt|HE[nt],HPtH+Σt),

where we used the logarithmic algebra rule log(iai)=ilog(ai) Using the functional form of the multi-variate normal distribution, we arrive at

(71) logL(YT|θ)=t=2NTlog(1(2π)2det(HPtH+Σt))(0.5)(ytHE[nt])[HPtH+Σt]1(ytHE[nt]),

which is a vectorial residual sum of squares whose summands are weighted by HPtH+Σt. The past observations of time series are encoded in Pt and E[nt], which are calculated with the Kalman filtering equations before the likelihood can be evaluated.

Generally, the user of our algorithm is not restricted to the full Bayesian inference. He/she can also stick to optimizing Equation 71. This is because in the probabilistic programming language Stan one only formulates the statistical model. The method to evaluate the statistical model and the data is provided by the Stan compiler and the high level programming language interface. Together they allow to switch quickly between HMC sampling, variational Bayesian posterior estimation, maximum a posteriori and ML. In particular, the variational Bayesian posterior might be a way to ease the computational burden due HMC sampling while keeping many of the benefits of knowing the full posterior distribution.

In the previous versions before PyStan 3.0 all three approaches were supported. Currently, with PyStan 3.0 only HMC sampling of the posterior is supported but, if desired, any other interface can be used (Matlab, Julia, R, CmdStan, Pystan 2.9, CmdStanPy,…) to perform optimization or variational Bayesian approaches. No actual changes in the Stan code are required, eventually with the exception that the CPU based parallelization needs to be eliminated. Given that one chooses optimization, the differences between maximum a posteriori/penalized maximum likelihood or only ML are just a matter of prior choice which need to be coded into the statistical model. Apart from those minor points, there is no need of extensive re-implementation of the code.

Appendix 5

The fluorescence signal of cPCF experiments

First four moments of a photomultiplier signal

Appendix 5—figure 1
Benchmark of the signal variance vs its mean for experimental solution data recorded under cPCF conditions: The concentrations of the fluorescent ligand were 0.25, 3, and 15 μM and a reference dye was present. The laser intensities covered 1.6 orders of magnitude at constant detection settings. The data points were obtained from 1.4106pixel. The red and blue lines indicate the theoretical prediction for a Poisson and Gamma distribution, respectively, assuming θ=1. The linear relation allows to relate the measured a.u. (top, right axis) to photons (bottom, left axis). The inset provides the corresponding log-log plot. Important for the KF algorithm is that skewness and excess kurtosis is small.

In this work, the KF analysis assumes Poisson statistics for the fluorescence signal in cPCF. Many commercial microscopes are not equipped with photon counting detectors or detectors are not operated in photon-counting mode, often to due to ease of use or limitation in dynamic range. Therefore, it is important to verify that the fluorescence signal follows, at least approximately, Poisson counting statistics. In particular, for the KF it is assumed that higher order statistics, such as skewness and excess kurtosis, vanish. The central assumption of the derivation of our Bayesian filter is var[yfl]=E[yfl]

Here we show that this assumption for the detectors used in our previously used LSM 710 (Carl Zeiss, Jena) system (Biskup et al., 2007; Kusch et al., 2010) is fulfilled by re-scaling to photon-numbers. The measured variance obeys var[yfl]=aE[yfl]. It depends linearly on the mean signals (Appendix 5—figure 1).

(72) var[y]=aE[y]
(73) var[μx]=aE[μx]
(74) μ2var[x]=aμE[x]

For the scaled signal x being Poisson distributed follows μ=a. Re-scaling of the signal by 1/a provides approximately Poisson distributed values. A linear fit yields a=205a.u.(16bit)/photon (for 680 V PMT voltage, 3.26 μs pixel dwell time). (Appendix 5—figures 2 and 3) show that excess kurtosis and skewness remain small at all levels of photons/pixel but are somewhat higher than theoretically predicted for Poisson-distributed data. The proportionalities are correctly described by the Poisson distribution assumption but the skewness and the kurtosis are too small by a constant factor of 2 and 4, respectively. This finding has to be verified for different experimental conditions, because at lower concentration/particle densities and higher count rates, particle number fluctuations can dominate statistics (Brown et al., 2008). For comparison another option would be a Gamma distribution which has the mean and the variance of E[y]=kθ and var[y]=kθ2, respectively. Thus, the applied scaling requires that θ=1. The Gamma distribution has a higher skewness by factor two (independently of θ) than a Poisson distribution and overscores the skewness and excess kurtosis of the detector. For simplicity only the Poisson distribution is considered in this work. In conclusion: Typical cPCF fluorescence signal detection rates are well approximated by a Gamma or Poisson distribution which in turn have the desired property that can be approximated by a normal distribution.

Appendix 5—figure 2
Benchmark of the signal Skewness vs its the mean for the identical experimental solution data under cPCF conditions.

The Skewness is small but the values are slightly larger than theoretically predicted. The inset provides the corresponding log-log plot. Important for the KF algorithm is that skewness and excess kurtosis is small.

Appendix 5—figure 3
Benchmark of the signal Kurtosis vs its mean for the identical experimental solution data under cPCF conditions.

The Kurtosis is small but the values are slightly larger than theoretically predicted. The inset provides the corresponding log-log plot. Important for the KF algorithm is that skewness and excess kurtosis is small.

Background noise statistics

In cPCF measurements with fluorescence-labeled ligands, the signals of the ligands bound to the receptors overlap with the signals from freely diffusing fluorescence-labelled ligands in the bulk. This bulk signal is subtracted from the total signal (Biskup et al., 2007). While the mean difference signal yfl,k(t) of the confocal voxel k represents the bound ligands in that voxel, its noise yζ,k(t) originates from both bound and bulk ligands. The additional bulk signal, e.g. the fraction of bulk solution inside that voxel, varies from voxel to voxel and can hardly be described theoretically. Nevertheless, it can be determined experimentally (Biskup et al., 2007). At low expression levels or at ligand concentrations above low nano-molar levels, this background signal is not negligible. It scales linearly with the ligand concentration, while the signal from bound receptors depends on the affinity, as estimated by the concentration of half maximum binding BC50, and the number of ion channels in the membrane of the observed volume. The binding signal saturates at high concentrations (Appendix 5—figure 4). Thus, both high affinity (low BC50) and high expression reduce the relative contribution of the background to the overall signal, improving the signal to noise ratio.

Appendix 5—figure 4
Simulated binding signals.

a, Comparison of binding of a labeled ligand at two concentrations. A simple two-ligand binding process was simulated with the Hill equation for the two expression levels of 1000 or 10,000 binding sites per patch and a BC50 of 1000 (BC50a) or 10,000(BC50b), respectively, given in molecules per observation unit. The observed signal is the sum of the signal from ligands free in solution and bound to the receptors. The solution signal scales linearly with the concentration, while the binding signal saturates.

Appendix 5—figure 5
Simulated binding signals.

Relative contribution of the binding signal to the total signal. Note that the contribution of the binding signal scales linearly with the expression level and inversely with the.BC50

Practically, the bulk signal is estimated by counter-staining the solution with a spectrally distinct reference dye (Biskup et al., 2007). The spatial distribution of this dye mimics the spatial distribution of the freely diffusing ligands. The bulk absolute concentration as well as the molecular brightness of the reference dye and the labeled ligand differ. Hence, the binding signal is calculated as the average pixel intensity of the scaled difference image between the signal of labeled ligand and reference dye according to

(75) yfl,k=ylig,total-λ^lig,back-(yfl,ref-λ^ref,back)λ^lig-λ^lig,backλ^ref-λ^ref,back,

where λ^lig,back and λ^ref,back are the arithmetic mean background signals of the ligand and reference dye recorded beyond the membrane where no signal should be recorded. They represent a signal offset which needs to be subtracted. The mean intensities in the bulk, λ^bulk and λ^ref, are estimated outside the pipette. In order to get the correct scaling, the mean intensities need to be corrected by the respective mean background signals. If λ^lig-λ^lig,backλ^ref-λ^ref,back=1 holds then yfl,bin would be Skellam distributed (Hwang et al., 2007). The total signal is then yfl=kyfl,k. This procedure creates E[yζ]=0 but adds an additional noise term ζ(tj). For the general case of different intensities, we name the distribution ’scaled Skellam distributed’. The scaling variance of the background noise in each voxel of the difference image

(76) σζ2=λlig+λlig2λref

is derived from simulated data further below. λlig and λref are the fluorescence intensity from the freely diffusing ligands and reference dye molecules per voxel, respectively. λlig and λref are proportional to the volume fraction of the voxel, which is occupied by the bulk, and to the respective concentrations. To achieve a symmetric (ζ), one can set λlig=λref. The summed variance of all selected voxels can be tabulated according to

(77) Σback=(σζ2000)

To mimic an experiment which creates time series data ζ(t), we draw Poisson numbers for the signal from the membrane Poisson(Hn(t)) and for the signal from the bulk we draw numbers from the two respective Poisson distributions. Then subtraction of the two background signals is performed according to

(78) ybulk=ylig,bulk-yref,bulkλlig,bulkλref,bulk

assuming that the dark count signal has been correctly subtracted. Then we add the bulk signal to the bound ligand signal. In this way we produce a time trace with colored noise by the Gillespie algorithm and add white noise to time traces as it is observed in real experiments.

Deriving the moments of the background noise for the difference signal

Appendix 5—figure 6
Master curves of 2nd till 4th centralized moment of photon counting noise ζ arising from the difference signal of fluorescent ligands and the dye in the bulk.

The curves are created from 4105 draws from Poisson distributions with different combinations of intensities for the reference dye λref and of the intensity of the confocal voxel fraction.λlig

For the KF the variance, skewness and kurtosis arising from the background noise has to be calculated. Skewness and excess kurtosis of the distribution have to be small compared to the total variance of the signal including all noise sources because only in this case the KF algorithm can be considered as the optimal solution for the filtering and inference problem (Anderson and Moore, 2012). In the following the 2nd to 4th moment of ζ are derived. The noise intensity parameter of the reference dye λref is proportional to ρrefλbulk with V being the confocal volume fraction containing fluorophores and ρref the density of the fluorophores in this volume. In Appendix 5—figure 6 we deduce master curves for the variance skewness and excess kurtosis of the white noise by drawing 4105 Poisson numbers from the respective Poisson distribution and subtract them from each other according to Appendix 5 Equation 78. The variance is derived empirically to be

(79) σζ2λref=λligλref+λlig2λref2.

In Appendix 5—figure 6a, we confirm the intuition λrefvar(ζ)=λlig. Optimally, the skewness should be zero to avoid a biased estimate of θ when the data are analyzed by the KF. Empirically, for λligλref the skewness holds

(80) skew(ζ)λref=λrefλlig

Additionally for λlig<λref the skewness holds

(81) skew(ζ)λrefλrefλlig

It is zero when λref=λlig. The KF is optimal if the kurtosis excess approaches zero, in other words if ζ is distributed normally. Empirically the kurtosis holds this

(82) kur(ζ)λrefλrefλlig

for λrefλlig. The relative intensity λlig of the voxel fraction compared to the intensity λb depends on the affinity of the ligand to the receptor, the number of receptors in the patch, and the density of the fluorophores plig at the patch. For larger concentrations the ratio should be λlig/λref.

Appendix 6

Output statistics of Bayesian filters

Classical Kalman Filter without open-channel noise

Assuming that current measurements are only compromised by additive technical white noise ν but do not contain open-channel noise νop, then our noise model reduces to

(83) y(t)=Hn(t)+ν(t)yO(y|n)=normal(Hn(t),σm2)

The noise term νm has a mean of E[νm]=0 and variance E[νm2]=σm2=const. One has to keep in mind that we have to add an extra variance term originating from the dispersion of channels over the state space, as encoded by P(t) and n(t). The uncertainty P(t) is calculated using Methods Equation 30. The variance of the total output is

(84a) var(y(t),y(t))=E[(y(t)E[y(t)])(y(t)E[y(t)])]
(84b) =E[(y(t)-HE[n(t)])(y(t)-HE[n(t)])]
(84c) =E[(Hn(t)+ν(t)-HE[n(t)])(Hn(t)+ν(t)-HE[n(t)])]
(84d) =HE[(n(t)-E[n(t)])(n(t)-E[n(t)]T]H+E[ν(t)2]
(84e) =HP(t)H+σm

The two cross terms E[ν(t1)(n-E[n])THT] and E[H(n-E[n])ν(t1)T] are zero since ν is independent of n and E[νm]=0. Our derivation is equivalent to marginalization over the predicted normal prior of the ensemble state P(n(t)|Yt1) at the time of the measurement except that the prior distribution could be any probability distribution with some mean and variance. Equation 84a is the classical KF variance prediction of a signal. The first term in Equation 84a, describes the variance from stochastic gating and that the ensemble state is hidden. Notably, by Methods Equation 30 we realize that var(y(t)) contains information about T and n(t-1), which we can exploit with the KF framework.

A generalized Kalman filter with state-dependent open-channel noise

In addition to the standard KF with only additive noise (Moffatt, 2007; Anderson and Moore, 2012; Chen, 2003), fluctuations arising from the single-channel gating lead to a second white-noise term νopn4(t), causing state-dependency of our noise model. The output model is then

(85) y(t)=Hn(t)+νm(t)+νop(t)yp(y|n)=normal(y|Hn(t),σm2+n4(t)σop2)

The second noise term νop is defined in terms of the first two moments E(νop)=0 and therefore var(νop)=E(νop2)=σop2n4(t). To the best of our knowledge such a state-dependent noise makes the following integration intractable

(86a) (y(t))=normal(y|Hn,σm2+n4σop2)normal(n|n¯(t),P(t))dn
(86b) =1constexp((y-Hn)22(σm2+n4σop2))exp(12(n-n¯(t))P-1(n-n¯(t)))dn

When assuming that the relative fluctuations of n(t) are on average small then n4 in the denominator is close to E(n4) of the state. Thus, the incremental likelihood can be written as in the standard KF, with the only difference that the measurement noise is the sum of two components.

(87) y(t)normal(Hn¯(t),σm2+σop2n¯4(t)+HPH)

To see that this approximation of the variance is correct, we apply the law of total variance decomposition Weiss, 2005.

(88a) var(y(t))=E[var[y(t)|n(t)]]+var[E[y(t)|n(t)]]
(88b) =E[Σ+σop2n4(t)]+var[Hn(t)]
(88c) =σm2+σop2E[n4(t)]+HP(t)H

The terms HP(t)H+σm2 are the standard output covariance matrix. Again P(t) contains information about T, n(t-1) while the additional variance term includes information about the current n(t). The information contained in the noise influences likelihood in two ways. By the variance or covariance of the current y(t) but also for y(t+1) in correction step by the Kalman gain KKal matrix defined in the next section.

Appendix 7

Error induced for the RE or KF approach by analog filtering of PC data

Appendix 7—figure 1
Influence of analog filtering on the accuracy and precision of the maximum of the posterior from PC data We chose for the analog signal σexp/i=10, σop/i=0.1, thus a strong background noise.

For the ensemble size, we chose Nch=4103 such that the likelihood dominates the uniform prior. (a) Estimate of the distribution of the accuracy (mean Euclidean error of the median of the posterior) vs. the cut-off frequency of a 4th-order Bessel filter scaled to the channel time scale. The solid line is the mean median of 5 data sets of the respective RE posterior (blue) and KF posterior (green). The green shaded area indicates the 0.6 quantile (ranging from 0.2 till 0.8) demonstrating the distribution of the error of the median of the posterior due to the randomness of data.(b 1–3) Accuracy and precision of the maxima of the posterior k~ij,max of the maxima of the posterior rates vs. the cut-off frequency of a Bessel filter. The shaded areas indicate the 0.6 quantiles (ranging from 0.2 till 0.8) due to the variability from data set do data set while the error bars show the standard error of the mean. (b 4–6) Accuracy and precision of the maxima of the posterior K~ij,max of the maxima of the posterior of the corresponding equilibria vs. the cut-off frequency of a Bessel filter.

In order to mimic an analog signal before the analog-to-digital conversion we simulated 5 different data sets of 100 kHz signals which were then filtered by a digital fourth-order Bessel filter. The activation curves were then analyzed with the Bayesian filter at 125 Hz and the deactivation curves at sampling rates between 166-500 Hz. Operating the Bayesian filter at a lower frequency is necessary because due to applying the Bessel filter the former white noise of the signal obtained additional time correlations. Thus, an all data points fit would immediately violate the white noise assumption of Equation 4 which we restore by analyzing at a much lower frequency. We then let the time scales of the induced time correlations become larger and larger by decreasing fcut. We show (Appendix 7—figure 1a) the results of the Euclidean Error for 3 different cases for PC data. The RE approach (blue) and the Bayesian filter (red) share the same analyzing frequency. In contrast, an increased fana=1660-5000 Hz (green) is used for the Bayesian filter. Both algorithms (blue, red) with the same fana show a similar rather constant region separated by an offset. Similar, to cPCF data the KF is more robust. The Bayesian filter with fana=1660-5000 Hz (green) does not show this constant region but outperforms the Bayesian filter with the lower fana if the recording system is operated with minimal analog filtering. This becomes even more apparent when we consider the single parameter deviation vs. fcut (Appendix 7—figure 1b(1-6)). Note, the strong dependence of the critical fcut at which the performance of all algorithms becomes suddenly worse scales with fana. Additionally, it is crucial, independently of the algorithm, that fcutfana. In (Appendix 7—figure 1b(1-6)) we demonstrate the distribution of the errors on the single parameter level and compare only the KF for different fana. Similar to the findings in the main text, on the one hand, the Euclidean error (red) shows some robustness against fcut. On the other hand, the influence of fcut on the individual parameter level sets in immediately and is complex. The best results are achieved with minimal analog filtering and a high fana.

Appendix 8

Details of the specified two methods to count the probability mass needed to include the true value

In Figure 4c we compare the true against the expected success probabilities of finding the complete true rate matrix within an n-dimensional HDCV for different expected success probabilities of a perfect model. By ‘expected’ and ‘perfect’ we refer to a fictitious ideal algorithm which exactly models exhaustively all details of the true process, including all measurement details. The first way assumes a multivariate normal posterior distribution with (θ|y)postnormal(θ|E[θ],Σ). Then the ellipsoid of a constant probability density is exactly the surface of a HDCV of given probability mass P. For a two-dimensional representation see Figure 4d below the diagonal. In one dimension, the ellipsoid consists of the two points with a distance dMah=(θ-E[θ])2/σ from the mean (inset Figure 5a). In general, the n-dimensional ellipsoids around the mean of a multivariate normal distribution can be described by points which have the same Mahalanobis distance dMah from the mean.

(89) dMah(θ)=(θ-E[θ])Σ-1(θ-E[θ])

For a multivariate standard normal distribution without correlation, the Mahalanobis distance becomes the Euclidean distance (θ-E[θ])T(θ-E[θ]). Rewriting the multivariate normal distribution in terms of the Mahalanobis distance

(90) normal(θ)=1det(2πΣ)exp[-0.5d(θ)Mah2]

reveals that all points θ with dMah=const have the same probability density. The random variable dMah2 is χ-square distributed. Thus the χ-square distribution (Figure 5a) is the probability density of drawing an θ in n dimensions and finding it on the ellipsoid’s n-1-dimensional surface, at a distance dMah from the mean. This allows us to use the cumulative χ2-square distribution function to calculate the probability mass inside the ellipsoid which is the desired HDCV. By evaluating dMah=(θtrue-E[θ])Σ-1(θtrue-E[θ]) and χcdf2(dMah2) we specify how much volume, in units of probability mass, has to be counted until the volume includes the true value. Note, that with increasing dimensionality (Figure 5a) the probability mass is shifted away from the mode of the probability density. For general probability densities, this is also true in higher dimensions where one only rarely finds the true value within a small distance to the maximum of the probability density. Note, that there is no mathematical theorem for singular models (Watanabe, 2007) saying that the posterior approaches asymptotically a multivariate normal with increasing data quality/quantity. Consequently, the underlying assumption that the posterior distribution is multivariate normal, is situation dependent valid or highly questionable. Thus the displayed method should be validated additionally in an independent way. To determine the probability mass needed to include the true value into the HDCV for such a posterior, we can also use a histogram-based method. One starts by constructing an n-dimensional histogram from the samples of the posterior and by initializing a global variable with zero. Then we start counting from the bin with most counts and check whether the true value falls inside this bin. If not, we add the probability mass inside that bin to the global variable. Repeating this for next lower bins eventually leads to the detection of the bin with the true rate matrix inside. On detection, the global variable contains the sought-after value P. While this procedure does not depend on a multivariate normal assumption, it is prone to errors due to the discrete bins and the finite samples from the posterior.

Nevertheless, both procedures show (Figure 5b) good agreement when plotting the volume/ probability mass needed to include the true parameter value vs. Nch. Again Nch200 seems to be a reasonable data quality to trust the statistics of the posterior using the KF. In contrast, the posterior of the RE approach never includes the true values with a reasonable HDCV. Note, that a probability mass of 1 to reach the true value means that qualitatively speaking the estimate of the RE approach is infinitely far away from the true value in terms of the Mahalanobis distance. Further, due to the finite sampling the method of histogram counting is not qualified for the largest HDCVs approaching P1. The two developed methods contrast the Euclidean distance from the true values to the maximum mode, median or even the mean of the distribution against all higher moments of the distribution. Thus it tests the overall shape of the posterior.

Appendix 9

Uncertainty quantification for the 5-state and 6-state model with cPCF data

The effect of the second observable on the single parameter level for the 5-state model can be seen in Appendix 9—figure 1. The biased estimates and the unidentified parameters of the RE approach for PC data (Figure 9) are eliminated by the fluorescence data.

Appendix 9—figure 1
HDCIs vs Nch for the 5-state-2-open-states model for cPCF data.

The single parameter level corresponds to the Euclidean error of Figure 7d. Compared to the main text we switched rows with columns of the panels, the first column represents now the rates and the second column the equilibrium constants. The prior that enforces microscopic-reversibility is identical to the prior mentioned in the caption of Figure 7.

Even the over-confidence problem seems to be decreased (compare with Figures 4b16). In contrast, the HDCIs of the 6-states-1-open-state model for cPCF data (Appendix 9—figure 2) display again a much increased over-confidence problem of the rate equation approach. This indicates that not simply counting of states but the actual topology of the kinetic scheme and the structure of the observation matrix H determines the scale of the over-confidence problem. Apparently, the more information comes from the fluorescence data relative to the information from the current data the less grave is the over-confidence problem and the less different are the Euclidean errors.

Appendix 9—figure 2
HDCIs vs Nch for the 6-states-1-open-state model for cPCF data.

The single parameter level corresponds to the Euclidean error of Figure 7e. Compared to the main text we switched rows with columns of the panels, the first column represents now the rates and the second column the equilibrium constants. The prior that enforces microscopic-reversibility is identical to the prior mentioned in the Figure 7.

Data availability

We included the simulated data time traces into supporting files and we uploaded the source code on https://github.com/JanMuench/Tutorial_Patch-clamp_data and https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data.

References

  1. Book
    1. Anderson BD
    2. Moore JB
    (2012)
    Optimal Filtering
    Massachusetts, United States: Courier Corporation.
  2. Book
    1. Colquhoun D
    2. Hawkes GA
    (1995a) A Q-Matrix Cookbook
    In: Sakmann B, Neher E, editors. Single-Channel Recording. Boston, MA: Springer. pp. 589–633.
    https://doi.org/10.1007/978-1-4419-1229-9_20
  3. Book
    1. Colquhoun D
    2. Hawkes GA
    (1995b) The Principles of the Stochastic Interpretation of Ion-Channel Mechanisms
    In: Sakmann B, Neher E, editors. Single-Channel Recording. Boston, MA: Springer. pp. 397–482.
    https://doi.org/10.1007/978-1-4419-1229-9_18
    1. Colquhoun D
    2. Hawkes GA
    3. Bernard K
    (1997a) On the stochastic properties of single ion channels
    Proceedings of the Royal Society of London. Series B. Biological Sciences 211:205–235.
    https://doi.org/10.1098/rspb.1981.0003
    1. Gelman A
    2. Rubin DB
    (1992a)
    A single series from the gibbs sampler provides A false sense of security
    Bayesian Statistics 4:75.
  4. Book
    1. Ghahramani Z
    (1997)
    Learning Dynamic Bayesian Networks
    Berlin, Germany: Springer.
  5. Book
    1. Gillespie CS
    2. Golightly A
    (2012)
    Bayesian Inference for the Chemical Master Equation Using Approximate Models
    Ulm, Germany: Elsevier.
    1. Hoffman MD
    2. Gelman A
    (2014)
    The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo
    Journal of Machine Learning Research 15:1593–1623.
  6. Book
    1. Jaynes ET
    2. Kempthorne O
    (1976) Confidence intervals vs bayesian intervals
    In: Harper WL, Hooker CA, editors. Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science. Berlin, Germany: Springer. pp. 175–257.
    https://doi.org/10.1007/978-94-010-1436-6_6
    1. Jeffreys H
    (1946) An invariant form for the prior probability in estimation problems
    Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 186:453–461.
    https://doi.org/10.1098/rspa.1946.0056
  7. Book
    1. Sakmann B
    2. Neher E
    (2013) Single-Channel Recording
    Berlin, Germany: Springer Science & Business Media.
    https://doi.org/10.1007/978-1-4419-1229-9
  8. Conference
    1. Watanabe S
    (2007) Almost All Learning Machines are Singular
    2007 IEEE Symposium on Foundations of Computational Intelligence. pp. 383–388.
    https://doi.org/10.1109/FOCI.2007.371500
  9. Book
    1. Weiss NA
    (2005)
    A Course in Probability
    Boston, U.S.A: Addison-Wesley.

Decision letter

  1. Marcel P Goldschen-Ohm
    Reviewing Editor; University of Texas at Austin, United States
  2. Richard W Aldrich
    Senior Editor; University of Texas at Austin, United States
  3. Marcel P Goldschen-Ohm
    Reviewer; University of Texas at Austin, United States
  4. Lorin S Milescu
    Reviewer; University of Maryland, Baltimore, United States
  5. Colin Kinz-Thompson
    Reviewer; Rutgers, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Bayesian selection of Hidden Markov models for multi-dimensional ion channel data" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Marcel P Goldschen-Ohm as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Richard Aldrich as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Lorin S Milescu (Reviewer #2); Colin Kinz-Thompson (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require additional discussion or a modest amount of additional analysis, as they do with your paper, we are asking that the manuscript be revised to address the points raised by the reviewers.

Our expectation is that the authors will eventually carry out the additional analyses and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

Extracting ion channel kinetic models from experimental data is an important and perennial problem. Much work has been done over the years by different groups, with theoretical frameworks and computational algorithms developed for specific combinations of data and experimental paradigms, from single channels to real-time approaches in live neurons. At one extreme of the data spectrum, single channel currents are traditionally analyzed by maximum likelihood fitting of dwell time probability distributions; at the other extreme, macroscopic currents are typically analyzed by fitting the average current and other extracted features, such as activation curves. Robust analysis packages exist (e.g., HJCFIT, QuB), and they have been put to good use in the literature.

Münch et al. focus here on several areas that need improvement: dealing with macroscopic recordings containing relatively low numbers of channels (i.e., hundreds to tens of thousands), combining multiple types of data (e.g., electrical and optical signals), incorporating prior information, and selecting models. The main idea is to approach the data with a predictor-corrector type of algorithm, implemented via a Kalman filter that approximates the discrete-state process (a meta-Markov model of the ensemble of active channels in the preparation) with a continuous-state process that can be handled efficiently within a Bayesian estimation framework, which is also used for parameter estimation and model selection.

With this approach, one doesn't fit the macroscopic current against a predicted deterministic curve, but rather infers – point by point – the ensemble state trajectory given the data and a set of parameters, themselves treated as random variables. This approach, which originated in the signal processing literature as the Forward-Backward procedure (and the related Baum-Welch algorithm), has been applied since the early 90s to single channel recordings (e.g., Chung et al., 1990), and later has been extended to macroscopic data, in a breakthrough study by Moffatt (2007). In this respect, the study by Münch et al. is not necessarily a conceptual leap forward. However, their work strengthens the existing mathematical formalism of state inference for macroscopic ion channel data, and embeds it very nicely in a rigorous Bayesian estimation framework.

The main results are very convincing: basically, model parameters can be estimated with greater precision – as much as an order of magnitude better – relative to the traditional approach where the macroscopic data are treated as noisy but deterministic (but see comments below). Estimated uncertainty can be further improved by incorporating prior information on parameters (e.g., diffusion limits), and by including other types of data, such as fluorescence.

The manuscript is well written and overall clear, and the mathematical treatment is a rigorous tour-de-force. However, the reviewers raised a number points that need further clarification, better discussion or amendment. These concerns are likely to be addressable largely by changes to the main text and software documentation along with some additional analyses. Once addressed, a revised version of this manuscript would likely be suitable for publication in eLife. That said, the study is very nice and ambitious, but clarity is a bit impaired by dealing with perhaps too many issues. The state inference and the bayesian model selection are very important but completely different issues. The authors should consider whether they may be better treated separately, or even perhaps in a more specialized journal where they can be developed in more detail.

Essential revisions:

1. Tutorial-style computational examples must be provided, along with well commented/documented code. The interested readers should be able to implement the method described here in their own code/program. The supplied code is not well documented, and it is unclear whether it is applicable beyond the specific models examined in the paper. It was supplied as.txt files, but looks like C code, and is unlikely to be easily adaptable by others to their own models.

2. The authors should clearly discuss the types of data and experimental paradigms that can be optimally handled by this approach, and they must explain when and where it fails or cannot be applied or becomes inefficient in comparison with other methods. One must be aware that ion channel data are very often subject to noise and artifacts that alter the structure of microscopic fluctuations. It seems that the state inference algorithm would work optimally with low noise, stable, patch-clamp recordings (and matching fluorescence recordings) in heterologous expression systems (e.g., HEK293 cells), where the currents are relatively small, and only the channel of interest is expressed (macropatches?). In contrast, it may not be effective with large currents that are recorded with low gain, are subject to finite series resistance, limited rise time, restricted bandwidth, colored noise, contaminated by other currents that are (partially) eliminated with the P/n protocol with the side effect of altering the noise structure, power line 50/60 Hz noise, baseline fluctuations, etc. This basically excludes some types of experimental data and experimental paradigms, such as recordings from neurons in brain slices or in vivo, oocytes, etc. Of course, artifacts can affect all estimation algorithms, but approaches based on fitting the predicted average current have the obvious benefit of averaging out some of these artifacts.

The discussion in the manuscript is insufficient in this regard and must be expanded. The method should be tested under non-ideal but commonly occurring conditions, such as limited bandwidth and in the presence of contaminating noise. For example, compare estimates obtained without filtering with estimates obtained with 2, 3 times over-filtering, with and without large measurement noise added (whole cell recordings with low-gain feedback resistors and series resistance compensation are quite noisy), with and without 50/60 Hz interference. How does the algorithm deal with limited bandwidth that distorts the noise spectrum? How are the estimated parameters affected? The reader will have to get sense of how sensitive is this method to artifacts. Also, fluorescence data in particular is usually of much lower temporal resolution than current measurements. How does this impact its benefit to parameter estimation?

3. Even more emphasis on the approximation of n(t) as being distributed according to a multivariate normal, and thus being continuous, should be placed in the main text. It seems that this may limit the applicability of the method to data with > ~100s of channels; although the point is not investigated. In Figure 3, the method is only benchmarked to a lower limit of ~500 channels. As shown in Milescu et al., 2005, fitting macroscopic currents is asymptotically unbiased. In other words, the estimates are accurate, unless the number of channels is small (tens or hundreds), in which case the multinomial distribution is not very well approximated by a Gaussian. What about the predictor-corrector method? How accurate are the estimates, particularly at low channel counts (10 or 100)? Since the Kalman filter also uses a Gaussian to approximate the multinomial distribution of state fluctuations, I would also expect asymptotic accuracy. Parameter accuracy should be tested, not just precision.

4. Achieving the ability to rigorously perform model selection is very impressive aspect of this work and a large contribution to the field. However, the manuscript offers too many solutions to performing that model selection itself along with a long discussion of the field (for instance, line 376-395 could be completely cut). Since probabilistic model selection is an entire area of study by itself, the authors do not need to present underdeveloped investigations of each of them in a paper on modeling channel data (e.g., of course WAIC outperforms AIC. Why not cover BIC and WBIC?). The authors should pick one, and maybe write a second paper on the others instead of presenting non-rigorous comparisons (e.g., one kinetic scheme and set of parameters). As a side note, it is strange that the authors did not consider obtaining evidence or Bayes factors to directly perform Bayesian model selection – for instance, they could have used thermodynamic integration since they used MC to obtain posteriors anyway (c.f., Computing Bayes Factors Using Thermodynamic Integration by Lartillot and Philippe, Systematic Biology, 2006, 55(2), 195-207. DOI: 10.1080/10635150500433722)

5. Regarding model selection for the PC data in Figure 7, why does the RE model with BC appear to provide a visually better identification of the true model than the KF model with BC if the KF model does a better job at estimating the parameters and setting non true rates to zero? Doesn't this suggest that RE with cross validation is better than the proposed KF approach in regards to model selection? In terms of parameter estimates (i.e. as shown in Figure 3), how does RE + BC stack up?

6. A better comparison with alternative parameter estimation approaches is necessary. First of all, explain more clearly what is different from the predictor-corrector formalism originally proposed by Moffatt (2007). The manuscript mentions that it expands on that, but exactly how? If it is only an incremental improvement, of interest to a very limited audience, a specialized, technical journal, is more appropriate.

Second, the method proposed by Celentano and Hawkes, 2004, is not a predictor-corrector type but it utilizes the full covariance matrix between data values at different time points. It seems that the covariance matrix approach uses all the information contained in the macroscopic data and should be on par with the state inference approach. However, this method is only briefly mentioned here and then it's quickly dismissed as "impractical". We all agree that it's a slower computation than, say, fitting exponentials, but so is the Kalman filter. Where do we draw the line of impracticability? Computational speed should be balanced with computational simplicity, estimation accuracy, and parameter and model identifiability. Moreover, that method was published in 2004, and the computational costs reported there should be projected to present day computational power. We are not saying that the authors should code the C&H procedure and run it here, but should at least give it credit and discuss its potential against the KF method.

The only comparison provided in the manuscript is with the "rate equation" approach, by which the authors understand the family of methods that fit the data against a predicted average trajectory. In principle, this comparison is sufficient, but there are some issues with the way it's done.

Table 3 compares different features of their state inference algorithm and the "rate equation fitting", referencing Milescu et al., 2005. However, there seems to be a misunderstanding: the algorithm presented in that paper does in fact predict and use not only the average but also – optionally – the variance of the current, as contributed by stochastic state fluctuations and measurement noise. These quantities are predicted at any point in time as a function of the initial state, which is calculated from the experimental conditions. In contrast, the KF calculates the average and variance at one point in time as a projection of the average and variance at the previous point. However, both methods (can) compare the data value against a predicted probability distribution. The Kalman filter can produce more precise estimates but presumably with the cost of more complex and slower computation, and increased sensitivity to data artifacts.

Figure 3 is very informative in this sense, showing that estimates obtained with the state inference (KF) algorithm are about 10 times more precise that those obtained with the "rate equation" approach. However, for this test, the "rate equation" method was allowed to use only the average, not the variance.

Considering this, the comparison made in Figure 3 should be redone against a "rate equation" method that utilizes not only the expected average but also the expected variance to fit the data, as in Milescu et al., 2005. Calculating this variance is trivial and the authors should be able to implement it easily (reviewers are happy to provide feedback). The comparison should include calculation times, as well as convergence.

7. The manuscript nicely points out that a "rate equation" approach would need 10 times more channels (N) to attain the same parameter precision as with the Kalman filter, when the number of channels is in the approximate range of 10^2 … 10^4. With larger N, the two methods become comparable in this respect.

This is very important, because it means that estimate precision increases with N, regardless of the method, which also means that one should try to optimize the experimental approach to maximize the number of channels in the preparation. However, it should be pointed out that one could simply repeat the recording protocol 10 times (in the same cell or across cells) to accumulate 10 times more channels, and then use a "rate equation" algorithm to obtain estimates that are just as good. Presumably, the "rate equation" calculation is significantly faster than the Kalman filter (particularly when one fits "features", such as activation curves), and repeating a recording may only add seconds or minutes of experiment time, compared to a comprehensive data analysis that likely involves hours and perhaps days. Although obvious, this point can be easily missed by the casual reader and so it would be useful to be mentioned in the manuscript.

8. A misunderstanding is that a current normalization is mandatory with "rate equation" algorithms. This is really not the case, as shown in Milescu et al., 2005, where it is demonstrated clearly that one can explicitly use channel count and unitary current to predict the observed macroscopic data. Consequently, these quantities can also be estimated, but state variance must be included in the calculation. Without variance, one can only estimate the product i x N, where i is unitary current and N is channel count. This should be clarified in the manuscript: any method that uses variance can be used to estimate i and N, not just the Kalman filter. In fact, the non-stationary noise analysis does exactly that: a model-blind estimation of N and i from non-equilibrium data. Also, one should be realistic here: in some circumstances it is far more efficient to fit data "features", such as the activation curve, in which case the current needs to be normalized.

9. It's great that the authors develop a rigorous Bayesian formalism here, but it would be a good idea to explain – even briefly – how to implement a (presumably simpler) maximum likelihood version that uses the Kalman filter. This should satisfy those readers who are less interested in the Bayesian approach and will also be suitable for situations when no prior information is available.

10. The Bayesian formalism is not the only way of incorporating prior knowledge into an estimation algorithm. In fact, the reader may have more practical and pressing problems than guessing what the parameter prior distribution should be, whether uniform or Gaussian or other. More likely one would want to enforce a certain KD, microscopic (i)reversibility, an (in)equality relationship between parameters, a minimum or maximum rate constant value, or complex model properties and behaviors, such as maximum Popen or half-activation voltage. A comprehensive framework for handling these situations via parameter constraints (linear or non-linear) and cost function penalty has been recently published (Salari et al/Navarro et al., 2018). Obviously, the Bayesian approach has merit, but the authors should discuss how it can better handle the types of practical problems presented in those papers, if it is to be considered an advance in the field, or at least a usable alternative.

11. The methods section should include information concerning the parameter initialization choices, HMC parameters (e.g. number of steps) and any burn-in period used in the analyses used in Figures 3-6. For example, how is convergence established? How many iterations does it take to reach convergence? How long does it take to run? How does it scale with the data length, channel count, and model state count? How long does it take to optimize a large model (e.g., 10 or 20 states)? Provide some comparison with the "rate equation method".

12. In the section on priors, the entire part concerning the use of a β distribution should be removed or replaced, because it is a probabilistic misrepresentation of the actual prior information that the authors claim to have in the manuscript text. The max-entropy prior derived for the situation described in the text (i.e., an unknown magnitude where you don't know any moments but do have upper and lower bounds; the latter could be from the length from the experiment) is actually P(x) = (ln(x_{max}) – ln(x_{min}))^{-1} * x^{-1}. Reviewers are happy to discuss more with the authors.

13. Here and there, the manuscript somehow gives the impression that existing algorithms that extract kinetic parameters by fitting the average macroscopic current ("fitting rate equations") are less "correct", or ignorant of the true mathematical description of the data. This is not the case. Published algorithms often clearly state what data they apply to, what their limitations are, and what approximations were made, and thus they are correct within that defined context and are meant to be more effective than alternatives. Some quick editing throughout the manuscript should eliminate this impression.

14. The manuscript refers to the method where the data are fitted against a predicted current as "rate equations". However, it is not completely clear what that means. The rate equation is something intrinsic to the model, not a feature of any algorithm. An alternative terminology must be found. Perhaps different algorithms could be classified based on what statistical properties are used and how. E.g., average (+variance) predicted from the starting probabilities (Milescu et al., 2005), full covariance (Celentano and Hawkes, 2004), point-by-point predictor-corrector (Moffatt, 2007).

15. The manuscript needs line editing and proofreading (e.g., on line 494, "Roa" should be "Rao"; missing an equals sign in equation 13). Additionally, in many paragraphs, several of the sentences are tangential and distract from communicating the message of the paper (e.g., line 55). Removing them will help to streamline the text, which is quite long.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "Bayesian inference of kinetic schemes for ion channels by Kalman filtering" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Marcel P Goldschen-Ohm as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Richard Aldrich as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Colin Kinz-Thompson (Reviewer #3); Lorin Milescu (Reviewer #4).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

Reviewer #1 (Recommendations for the authors):

The authors develop a Bayesian approach to modeling macroscopic signals arising from ensembles of individual units described by a Markov process, such as a collection of ion channels. Their approach utilizes a Kalman filter to account for temporal correlations in the bulk signal. For simulated data from a simple ion channel model where ligand binding drives pore opening, they show that their approach enhances parameter identifiability and/or estimates of parameter uncertainty over more traditional approaches. Furthermore, simultaneous measurement of both binding and pore gating signals further increases parameter identifiability. The developed approach provides a valuable tool for modeling macroscopic time series data with multiple observation channels.

The authors have spent considerable effort to address the previous reviewer comments, and I applaud them for the breadth and depth of their current analysis.

1. The figure caption titles tend to say what analysis or comparison is being presented, but not what the take home message of the figure is. I suggest changing them to emphasize the latter. This will especially help non-experts to understand what the figures are trying to convey to them.

2. I very much appreciate the GitHub code and examples for running your software. However, I feel that a hand-holding step-by-step explicit example of running a new model on new data is likely necessary for many to be able to utilize your software. Much more hand-holding than the current instructions on the GitHub repo.

3. Figure captions sometimes do not explain enough of what is shown. I appreciate that many of these details are in the main text, but data that is displayed in the figure and not concretely described in the caption can makes the figures hard to follow. e.g. Figure 4a – "With lighter green we indicate the Euclidean error of patch-clamp data set." But what data set do the other colors reflect? It is not stated in the caption. Again, I realize this is described in the main text, but it also needs to be defined in the caption where the data is presented. Figure 4d – Please spell out what "both algorithms" intends. Also, a suggestion: instead of having to refer to the caption for the color codes (i.e. RE vs. KF, etc.) it would speed figure interpretation to have a legend in the figures themselves. Few other examples: Box 1. Figure 2. – Please define the solid vs. dashed lines in the caption. Figure 3c – Please define the solid vs. dashed lines in the caption. Figure 12 – "We simulated 5 different 100kHz signals." What kind of signals? Fluorescence I assume, but this needs to be explicitly defined. I'd check all figures for similar.

Reviewer #3 (Recommendations for the authors):

In this revised manuscript, the Münch et al. have addressed all of my original concerns. It is significantly revised, though, and includes many new investigations of the algorithm's performance. Overall, the narrative of this manuscript is now to introduce an approximation to the solution for a Bayesian Kalman Filter, and then spend time demonstrating that this approximation is reasonable and even better than previous methods. In my opinion, they successfully do this, although, as they mention in their comments, their manuscript is very long.

I am not 100% certain, but the approximation that the author's make seems to be equivalent (or at least similar to) an approximation of the chemical master equation using just the 1st and 2nd moments, which is just the Fokker-Planck equation. The authors should discuss any connection to this approximation, as there is a great deal of literature on this topic (e.g., see van Kampen's book).

In Figures 3A and 4D, it is unclear to me what is plotted for the RE and classical Kalman filter (i.e., how is there a posterior if they are not Bayesian methods)? Perhaps it is buried in the methods or appendices, but, if so, it needs to at least be clarified in the figure captions.

The Bayesian statistical tests devised to determine success (e.g., on pgs. 12-13) seem a little ad hoc, but are technically acceptable. I do not see a need for additional metrics.

Line 939: Equation 61 is absolutely not "close to uniform distribution". The α and β parameters of 100.01 are much larger than 1. It is incredibly peaked around 0.5. Perhaps this is a typo?

Line 942: The allowed small breaking of microscopic reversibility in the prior is an interesting idea that I wish the authors would expound upon more.

Line 712: The authors state that the simulation 'code will be shared up-on request'. They should include it with their github pages tutorials for running the examples in case others wish to check their work and/or use it. There is no reason to withhold it.

Line 707: 'Perspectively' is not a commonly used word.

Reviewer #4 (Recommendations for the authors):

The authors have addressed all my comments and suggestions. The manuscript is nice and extremely comprehensive, and should advance the field.

Nevertheless, the manuscript is also very long (but justifiably so), and certain statements could be a little clearer. Most of these statements refer to the comparison with the so-called rate equation (RE) methods, with which I'm more familiar. For example:

Abstract: "Furthermore, the Bayesian filter delivers unbiased estimates for a wider range of data quality and identifies parameters which the rate equation approach does not identify."

The first part of this statement is not quite true, as Figure 4 shows clearly that the Bayesian estimates are biased (and the authors acknowledge this elsewhere in the manuscript). If they are biased in one "range of data quality", that probably means they are always biased, just to different degrees. This is not surprising, because the Kalman filter is a continuous state approximation to a discrete state process, and the overall estimation algorithm makes a number of approximations to intractable probability distributions. It would definitely be correct to say that the estimates are very good, but not unbiased.

Second, this statement is also ambiguous. Are you referring to the theoretical non-identifiability caused by having more parameters in the model than what the combination of estimator+experimental protocol can extract from data? In this case, it's not a matter of certain parameters that cannot be identified, but a matter of how many can be identified uniquely. The more information is extracted from the data, the more unique parameters can be identified, so the Kalman filter should do better. Or, alternatively, are you referring to specific parameters that are poorly identified because, for example, they refer to transitions that are rarely undertaken by the channel? In this case, it would be a matter of obtaining looser or tighter estimates with one method or another, but the parameters should be intrinsically identifiable, I imagine, regardless of the algorithm. In any case, it's not clear that the better identifiability is the result of the Bayesian side, or of the predictor-corrector state inference filter. I would guess it is the Kalman filter, but I'm not sure.

Perhaps it would be clearer if you said that the KF method produces good estimates and generally improves parameter identifiability compared to other methods, as it extracts more information from the data?

Introduction:

32: I'm not sure, but if the intention here is to cite mathematical treatments for estimation, you may add references to the "macroscopic" papers by Celentano, Milescu, Stepaniuk and perhaps a few others that use "rate equations". Also, you may cite Qin et al., 1996, as a single channel paper describing a method used in hundreds of studies.

Pg. 3:

51: I remain skeptical that it is a good idea to use "rate equations" (RE) as a term to refer to those methods that are fundamentally different from the approach described here (also see my comment to the first submission). The rate equations must always be used to predict a future state from a past or current state, by all methods, explicitly or implicitly, because REs simply describe the channel dynamics. In this very manuscript, Equation 3, central to the Kalman filter formalism, is nothing but a deterministic rate equation with a stochastic term added to approximate the stochastic evolution of an ensemble of channels. In fact, there are some old papers by Fox and some more recent by Osorio (I'm not exactly sure of the name and I don't remember the years) that discuss that approximation and its shortcomings – perhaps that is the source of bias?

Whether that prediction is then corrected, as in the Kalman filter approach, with a corrector step is irrelevant, as far as the rate equations. Furthermore, it's all a matter of degree. For example, the Milescu et al. approach, which is classified here under the RE umbrella, predicts future states as a mean + variance (or as a mean only), but only using the initial state. There is no correction at each point, as with the Kalman filter method, but there is a correction of the initial state from one iteration to the next (by the way, it's not clear if you implemented that feature here, which would make a big difference). Then, because it considers the stochastical aspect of the data, the Milescu et al. approach should also be considered a stochastic method, just one that doesn't use all the information contained in the data (and so it is fast). Imagine a situation where you ran the same stimulation protocol multiple times, but each time you record only one data point, further and further away from the beginning of the stimulation protocol. A "stochastic" algorithm applied to this type of data would be exactly as described in Milescu et al. Of course, in reality all points are recorded in sequence, but that doesn't mean that the approach is not stochastic, just that it 's simplifying and discarding some information to gain speed. All methods (such as the Kalman filter described here) make some compromises to reduce complexity and increase speed.

The bottom line is that there is the most basic approach of solving the rate equations deterministically, without considering any variance whatsoever, and then there is everything else.

58: "Thus, a time trace with a finite number of channels contains, strictly speaking, only one independent data point."

I don't understand this sentence. Could you please clarify?

74: "The KF is the minimal variance filter Anderson and Moore (2012). Thus, instead of excessive analog filtering of currents with the inevitable signal distortions Silberberg and Magleby (1993) one can apply the KF with higher analysing frequency on less filtered data."

Yes, but I'm not sure why should we use excessive filtering? Where is this idea coming from?

131: "However, even with simulated data of unrealistic high numbers of channels per patch, the KF outperforms the deterministic approach in estimating the model parameters."

It is clearly true from your figures, but please give a number as to what is unrealistic (10,000? 100,000?). Also, outperforms, but by how much? As I commented above and below, all methods tested here seem to produce good estimates under the conditions they were designed for (and even outside), and one might argue that it's not worth adding the additional computational complexity and running time for a possibly small increase in accuracy. How is one to judge that increase in accuracy?

276: Has the RE method been used with the mean + variance or mean only? Also, how was the initial probability vector calculated with the RE method? If you used the mean + variance, then (as mentioned above), you can't call it deterministic. If you used only the mean, then it is indeed a deterministic method, but it's not the approach described in Milescu et al. Please explain.

229: "added, in Equation 9 to the variance … which originates (Equation 38d) from the fact that we do not know the true system state"

Which noise are you referring to? The state noise or the measurement noise?

326: Which are the two algorithms?

278: "Both former algorithms are far away from the true parameter values with their maxima and also fail to cover the true values within reasonable tails of their posterior."

I think the authors might have gotten a little carried away when they made this statement. There is no doubt that the Kalman/Bayesian method produces more accurate estimates, but the other two methods (KF and RE) are very reasonable as well. I see estimates that are within 5, 10, or 20% from the true values, for most parameters. This is not "failure" by any stretch of the imagination. Most people would call this quite good, given the nature of the data.

294: "The small advantage of our algorithm for small … over Moffatt (2007) is due to the fact that we could apply an informative prior in the formulation of the inference problem … by taking advantage of our Bayesian filter. "

This is an interesting and important statement. I interpret it as saying that the Bayesian aspect itself makes only a small contribution to the quality of the estimates, when comparing it with the Moffatt method, which is a Kalman filter as well. The only issue with the Moffatt method is the lack of an explicit formalism for the excess state noise (which could presumably be added). It also suggests to me that any method that tries to use the noise may run into difficulties when the noise model is unknown (typical real-life scenario). The Moffatt method is confronted with unknown noise and it fails. What about when the Bayesian method is confronted with unknown noise? There are some comments in the manuscript, but nothing tangible. Could you please comment?

Figure 3: There are no data sets in the figure. What data were analyzed here?

Is there a typo regarding the colors in a? What are the red, blue, black, green symbols? Please verify.

Assuming the red is KF, there is something very curious about its estimates, which are very different in distribution from the Bayesian estimates. Could there be a coding problem? If red is actually RE, then it would make more sense. Could you explain? Is this the effect of the "excessive" open state noise? If so, I find this situation a bit unfair, because then the 2007 KF algorithm is tested with data that it is not meant to work with.

Also, I think it would be more interesting to have a comparison between the original KF and the newer Bayesian approach, so we can understand what Bayesian estimation does. In any case, I can't really interpret the data until you clarify the colors.

The legend is unclear overall.

303: What means "singular"?

324: What shaded area in Figure 1a? Perhaps you mean Figure 4a?

Figure 4: I don't see any light green.

368: "The estimated true success rate by the RE approach (blue) is ≈ 0.15 and therefore far away from the true success probability. In contrast, the posterior (green) of the true success probability of the KF resides with a large probability mass between the lower and upper limit of the success probability of an optimal algorithm (given the correct kinetic scheme)."

I'm really a bit puzzled here: yes, the KF is more accurate (given the correct kinetic scheme), but what I see in this figure is that both algorithms are biased, yet both are generally within 10% (and quite a bit better for KF) of the true values. If one would plot the log of the rates, to transform into δ Gs, the differences would be even smaller. I would bet that experimental artifacts would contort the estimates to a greater degree anyway.

It's very nice that the RE approach was embedded and tested within a Bayesian framework, but it would still be interesting to know what is the contribution of the Bayesian aspect (unless I missed this point in the manuscript).

436: "…while their error ratio (Figure 7a) seems to have no trend to diminish with increasing NCh".

This would be unexpected, because the Kalman filter will always use more information than RE. Why would the KF approach become relatively more erroneous at higher Nc? I don't see any reason. To me, an important question is when do the overall errors become dominated by external factors, such as recording artifacts, using the wrong model, etc.

511: "Thus, transferring the unusually strictly applied algebraic constraint Salari et al. (2018) of microscopic-reversibility to constraint with scalable softness we can model the lack information if microscopic-reversibility is exactly fulfilled Colquhoun et al. (2004) by the given ion channel instead of forcing it upon the model."

I'm not sure I understand what you mean here. I think it is very usual to constrain microscopic reversibility EXACTLY, whether through the SVD method (Qin et al., Plested et al., Milescu et al., Salari et al), or through some other algebraic method (Plested et al). It's not "forcing" it on the channel, but simply testing the data for that condition. Of course, one could test the condition where there is no reversibility enforced at all (i.e., it's "given by the channel"), or anything in between.

675: "With our algorithm we demonstrate (Figure 3c and 7) that the common assumption that for large ensembles of ion channels simpler deterministic modeling by RE approaches is on par with stochastic modeling, such as a KF, is wrong in terms of Euclidean error and uncertainty quantification (Figure 5a-c and Figure 6a-b)".

I find this statement a little subjective. I think anyone who cares about stochastic vs. deterministic modeling knows enough that any method that uses more information from data should produce better estimates, regardless of the number of channels. I would say that the more likely assumption is that deterministic estimators produce poor estimates with small numbers of channels, perfect estimates with infinitely many channels, and anything in between. In fact, the KF method behaves exactly the same way, just with overall higher accuracy. Looking at the tests in this manuscript, I would say that all the previous studies that modeled macroscopic data using deterministic methods are safe. They don't need to be redone. The future, of course, is a different matter.

https://doi.org/10.7554/eLife.62714.sa1

Author response

Essential revisions:

1. Tutorial-style computational examples must be provided, along with well commented/documented code. The interested readers should be able to implement the method described here in their own code/program. The supplied code is not well documented, and it is unclear whether it is applicable beyond the specific models examined in the paper. It was supplied as.txt files, but looks like C code, and is unlikely to be easily adaptable by others to their own models.

An extended tutorial-style git-hub page for PC data can be found at https://github.com/JanMuench/Tutorial_Patch-clamp_data and for confocal patch-clamp fluorometry data, https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data.

2. The authors should clearly discuss the types of data and experimental paradigms that can be optimally handled by this approach, and they must explain when and where it fails or cannot be applied or becomes inefficient in comparison with other methods. One must be aware that ion channel data are very often subject to noise and artifacts that alter the structure of microscopic fluctuations. It seems that the state inference algorithm would work optimally with low noise, stable, patch-clamp recordings (and matching fluorescence recordings) in heterologous expression systems (e.g., HEK293 cells), where the currents are relatively small, and only the channel of interest is expressed (macropatches?). In contrast, it may not be effective with large currents that are recorded with low gain, are subject to finite series resistance, limited rise time, restricted bandwidth, colored noise, contaminated by other currents that are (partially) eliminated with the P/n protocol with the side effect of altering the noise structure, power line 50/60 Hz noise, baseline fluctuations, etc. This basically excludes some types of experimental data and experimental paradigms, such as recordings from neurons in brain slices or in vivo, oocytes, etc. Of course, artifacts can affect all estimation algorithms, but approaches based on fitting the predicted average current have the obvious benefit of averaging out some of these artifacts.

The discussion in the manuscript is insufficient in this regard and must be expanded. The method should be tested under non-ideal but commonly occurring conditions, such as limited bandwidth and in the presence of contaminating noise. For example, compare estimates obtained without filtering with estimates obtained with 2, 3 times over-filtering, with and without large measurement noise added (whole cell recordings with low-gain feedback resistors and series resistance compensation are quite noisy), with and without 50/60 Hz interference. How does the algorithm deal with limited bandwidth that distorts the noise spectrum? How are the estimated parameters affected? The reader will have to get sense of how sensitive is this method to artifacts. Also, fluorescence data in particular is usually of much lower temporal resolution than current measurements. How does this impact its benefit to parameter estimation?

We developed the algorithm for analysing patch-clamp or patch-clamp fluorometry data though we are convinced that other experimental settings are applicable, as long as normal distributions can be reasonably applied and the white noise assumption for the measurement process holds. We show that it is advantageous to use the data in the ”rawest” form possible, e.g. without additional filtering (Figure 11 and App. 7 for patch-clamp data). Instead of smoothing, averaging, or subtraction, optimal statistical description of noise sources is needed in our approach. For example, in uncompensated whole-cell measurements the RC-circuit of the membrane is a low-pass filter with known characteristics. It should be noted that any active compensation of the series resistance certainly affects the noise characteristics in a hardly predictable way. It should therefore be used with particular care. In contrast, traces corrected by P/n protocols (given sufficiently low open-probability of the inactive channel) should be treatable with our approach when including the additional variance of the noise introduced by the P/n trace adequately. In fact, the assumed fluorescence signal in a confocal patch-clamp fluorometry experiment is a difference signal just as the current signal obtained with the P/n technique. Here, the amount of mean signal of swimming bulk ligands is subtracted from the original signal. Thus, in the current signal in Equation 4 we simply need to add a second noise term whose variance needs to be characterized. Non-stochastic powerline or slow baseline fluctuations are not described by our approach and should be minimized by experimental means. Although, adding a drift term in Equation 4, governed by a Gaussian process with an appropriate kernel, is worth to try. In the similar way, with a periodic kernel powerline errors could be tackled. In the “Kalman filter derived from a Bayesian filter” section (line 244-255) we specify those considerations.

Moreover, we selected now two non-ideal aspects induced by limitations of real experiments to investigate the robustness of the idealizations of our algorithm versus the approach by Milescu et al., 2005. First, we analyzed the bias induced by the analog filtering of confocal patch-clamp fluorometry data before the analog-to-digital conversion (Figure 11). For PC only data see App. 7. Second, we investigated finite integration times (Figure 12) of fluorescence data (as explicitly asked by Reviewer 1). We chose to combine those two aspects because of their similar physical origin. Both of them induce an intrinsic time scale of the recording system which is not represented in the algorithms.

3. Even more emphasis on the approximation of n(t) as being distributed according to a multivariate normal, and thus being continuous, should be placed in the main text. It seems that this may limit the applicability of the method to data with > ~100s of channels; although the point is not investigated. In Figure 3, the method is only benchmarked to a lower limit of ~500 channels.

We expanded for confocal patch-clamp fluorometry data (Figure 4a, 7d,e) the benchmark to channel numbers of 5 · 101 to 106 per patch. At the lower limit below Nch < 200 we see that highest density volumina (HDCV) do not report the correct success probability of the true parameters to be inside this volume. Below this limit both algorithms have very similar Euclidean errors and the deviations of the highest density intervals for both algorithms for each Nch become similar. Therefore, we suspect that the error due to the approximations of the multinomial distribution becomes more relevant than the autocorrelation of the intrinsic noise. In principle, the almost equal errors could also result from a bias induced by the uniform prior over the rate matrix. Therefore, we tested the Jeffreys Prior (loguniform on the dwell times) and dirichilet distribution on the probabilities which transition is chosen and did not detect differences for confocal patch-clamp fluorometry data. For PC data (Figure 3c), we see that below Nch < 2000 the posterior becomes improper for some parameters (unidentified parameters) such that here a comparison is problematic to interpret. Below Nch < 2000 the mentioned Jeffreys prior has an enormous effect but to avoid prolongation of the text we save this aspect for a separate paper about priors in kinetic scheme estimation. We added in the main text a reference to Moffat, 2007 (line 201-207). He applied his algorithm even for ion channel numbers as low as Nch = 1 and found in Figure 4 an inverse relationship 1Nch for the difference between the true log likelihood value and the loglikelihood of his Kalman filter approximation which establishes around Nch = 20. Of course, the differences also depend on which data points of the time traces are used for the inference because the quality of the used approximations also depend on where on the probability simplex it is applied. Thus, in this regard our results should be seen as a rule of thumb for the lower limit. Nevertheless, we expect a similar behaviour.

As shown in Milescu et al., 2005, fitting macroscopic currents is asymptotically unbiased. In other words, the estimates are accurate, unless the number of channels is small (tens or hundreds), in which case the multinomial distribution is not very well approximated by a Gaussian. What about the predictor-corrector method? How accurate are the estimates, particularly at low channel counts (10 or 100)?

We expanded the benchmarking for confocal patch-clamp fluorometry data (Figure 4) to low channel counts of 50. We observe that both algorithms have similar Euclidean errors if the channel numbers for confocal patch-clamp fluorometry data for the simple model drops below Nch < 200. For those small ion channel ensembles, the uncertainty quantification by Bayesian credibility intervals/volume (Figure 4. b1-6) becomes unreliable even for the Kalman filter. Note, we showed that uncertainty quantification by Bayesian credibility volume in combination with the rate equation approach is generally unreliable because the posterior is to narrow (over-confident) leading to bad coverage properties of its Highest Density Credibility Volume/Interval (Figure 5 and 6). Thus, in that regime both algorithms fail in a similar way.

On the one hand, these problems can originate from the used approximations of both algorithms to derive the likelihood. One the other hand, as information from the data gets weaker in the lower regime it could also be a bias from the uniform prior of the rate matrix. This is because a uniform prior is not an unbiased minimum information prior for rates, as indicated by Reviewer 3. Tests with a Jeffreys prior for confocal patch-clamp fluorometry data with the 4-state model did not reveal a significant difference such that we consider in the article only the first option and for clarity took out prior discussions.

We also like to note that the difference between the exact likelihood and a Bayesian/Kalman Filter approximation is investigated for patch-clamp data of a 3-state model in (Moffatt, 2007) (Figure 2, 3 and 4). He shows that the difference of the log likelihood of data between an algorithm with the exact likelihood and the Kalman filter implementation scales with ∝ 1/Nch if Nch > 20.

Since the Kalman filter also uses a Gaussian to approximate the multinomial distribution of state fluctuations, I would also expect asymptotic accuracy.

For a 5-state-2-open-states model for PC data (Figure 9), we demonstrate that the advantage of the KF is a bias reduction compared to the RE approach. Thus, a high accuracy is reached by the KF already at Nch = 103. In contrast, we detect bias and inaccurate inferences of the RE approach even for ensemble sizes as large as Nch = 5·105. Due to the small number of test data sets a small undetected amount of inaccuracy in parameter inferences of the KF is possible. Additionally, for some data sets of high quality Nch = 7.5·104 the RE algorithm fails to identify parameters, whereas the KF algorithm has no identification problems.

In contrast, with confocal patch-clamp fluorometry data for all tested models bias and inaccuracy is a smaller but still present issue (Figure 11 and 12). The over-confidence problem of the RE algorithm remains for all tested models.

Parameter accuracy should be tested, not just precision.

consistent estimators (given an identifiable model) as their Euclidean error (distance from the true value) vanishes in distributions with increasing data quality. If any estimator is consistent it has to be asymptotically accurate and precise.

Accurate means

accurate = E[θiθi,true] = 0 (1)

with i indicating the i-th parameter and E means the arithmetic mean of many inferences with different data sets. The common mathematical definition of precision is

(2) σ=1/NinferenceiNinference(θiE[θ])2.

For a data set of finite quality or amount, the estimators could still be inaccurate (biased) and/or imprecise but it could also be that inferences are perfectly accurate (unbiased) and only imprecise. To see that the last aspect is true, take for example a perfectly accurate inference: Thus, the mean equals exactly the true value E[θ] − θtrue = 0 and assume its imprecision is Σ = 1. In this case the Euclidean error iNprametersθi2 , is a random variable that is χ-distributed. Thus, the mean Euclidean distance equates to E[error]=2Γ((Nparameter+1)/2)Γ(Nparameter/2)0, with Γ() being the γ function. In conclusion, the Euclidean error is not a direct estimator of precision but has good summarizing properties.

We display now in Figure 4b1-6, 8 and 9 and App. 9 accuracy and precision implicitly by the narrowest highest density interval. By implicitly, we mean that both aspects are demonstrated by the local trend (accuracy) and the spread (precision) around the trend vs. Nch. In this representation, perfect accuracy means E[θ˜i] = 1 as the parameters are normalized to their true value. The advantage of the single parameter level is that systematically inaccurate (biased) estimates E[θ˜i] 6 = 1 for a finite data quality can be detected Figure 9 for which the more summarizing Euclidean error is blind.

Conceptually, we prefer the Bayesian highest density credibility interval. In the new version of the article we argue that Euclidean error and Bayesian highest density volumes together are a very effective way to probe the performance of an algorithm (Figure 5. and Figure 6). Highest density credibility volumes (HDCV) allow uncertainty quantification exactly in the way the researcher desires in most cases. HDCVs allow to determine the probability mass of the true parameters to be in a certain volume for a given data set. This guided us to the binomial testing of the overall shape of the posterior. This interpretation is not possible for the usual confidence intervals/volumes of maximum likelihood, Janes, et al., 1976. In the benchmark situation of this article, we used the HDCV to show that the RE approach is generally too confident in the Bayesian parameter uncertainty quantification, which also leads to too confident (narrow) confidence volumes or errorbars in a maximum likelihood context. Note, that this shortcoming had remained undetected if we would have explored only accuracy and precision. Later in the article we worked explicitly with the classical definition of accuracy and precision because we want to understand the mean bias (for all possible random data sets) induced by analog Bessel-filtering (Figure 11 and App. 7 Figure 1) the data before analysing them. The same applies to the mean bias due to limited time resolution of fluorescence data (Figure 12).

4. Achieving the ability to rigorously perform model selection is very impressive aspect of this work and a large contribution to the field. However, the manuscript offers too many solutions to performing that model selection itself along with a long discussion of the field (for instance, line 376-395 could be completely cut). Since probabilistic model selection is an entire area of study by itself, the authors do not need to present underdeveloped investigations of each of them in a paper on modeling channel data (e.g., of course WAIC outperforms AIC. Why not cover BIC and WBIC?). The authors should pick one, and maybe write a second paper on the others instead of presenting non-rigorous comparisons (e.g., one kinetic scheme and set of parameters). As a side note, it is strange that the authors did not consider obtaining evidence or Bayes factors to directly perform Bayesian model selection – for instance, they could have used thermodynamic integration since they used MC to obtain posteriors anyway (c.f., Computing Bayes Factors Using Thermodynamic Integration by Lartillot and Philippe, Systematic Biology, 2006, 55(2), 195-207. DOI: 10.1080/10635150500433722)

We took out the whole model selection part and saved it for a successor publication.

5. Regarding model selection for the PC data in Figure 7, why does the RE model with BC appear to provide a visually better identification of the true model than the KF model with BC if the KF model does a better job at estimating the parameters and setting non true rates to zero? Doesn't this suggest that RE with cross validation is better than the proposed KF approach in regards to model selection? In terms of parameter estimates (i.e. as shown in Figure 3), how does RE + BC stack up?

We are pleased to clarify this point though the model selection will be placed in a second article. In the previous version of the article, we presented in Figure 6 and Figure 7 two related aspects.

First, we define overfitting in the usual way as a performance difference of a trained model when it is used to predict the training data and prediction data. A trained model, no matter if under complex or overly complex performs on average better to predict the already used trainings data than unseen new data. When we choose a too complex kinetic scheme in which the true process is nested we observe, that the RE approach, when applied to a more complex model, has a higher tendency to overfit. We concluded that from the observation that the RE algorithm concentrates less the posterior mass around zero for rates which do not exist in the true process (Figure 6 of the previous article) than the KF. The RE algorithm places during the training more statistical weight on values of rates to explain fluctuations which are specific for the training data set. This happens no matter which model is used just because the amount of data is finite. That leads due to the higher flexibility of a too complex model to even more ways to overfit the data set. This means for a kinetic scheme trained by an RE algorithm, that if the trained model is now used to predict new data, it does not perform (generalize) as good as if the simpler model would have been trained and then used to predict new data (cross validation). In mathematical terms, more statistical weight on structures which are not present in the real process reduces, of course, the value of Equation 16 of the previous version of the article. This tendency to overfit more of the RE approach can be exploited if the model inference is done by Bayesian cross validation (Figure 7a, see Equation 16 of the previous version of the article). Because cross-validation penalizes more the bigger tendency of RE approach on the too complex model to overfit to the trainings data.

Thus, the conclusion of Reviewer 1 is the same as we drew in lines 406 -410 of the previous version of the article “This means, ignoring intrinsic fluctuations of macroscopic data, such as RE approaches do, leads to the inference of more states than present in the true model if the model performance is evaluated by the training data. One can exploit this weakness by choosing the kinetic scheme on cross-validated data, since too complex models derived from the RE approach do not generalize as good to new data as a parsimonious model”. We were intending to state that if one uses a RE approach one is limited to cross validation for the kinetic scheme selection. To the best of our knowledge, cross-validation is not the default strategy how models are selected. But the data (Figure 7a in the previous version of the article) show that it should be the method of choice if an RE approach is used.

The Bayesian filter is more flexible than a RE approach. It is more reliable to detect structures that are over-fitted by asking for which rates is the probability mass close to zero (continuous model expansion). But it can also use estimators of the minimum Kullback-Leibler entropy (Bayesian cross validation or information criteria such as WAIC). Because, Bayesian cross validation is data intensive, it is an advantage to have an algorithm which is able to perform this task on the training data by WAIC.

6. A better comparison with alternative parameter estimation approaches is necessary. First of all, explain more clearly what is different from the predictor-corrector formalism originally proposed by Moffatt (2007). The manuscript mentions that it expands on that, but exactly how? If it is only an incremental improvement, of interest to a very limited audience, a specialized, technical journal, is more appropriate.

We thank the Reviewer for mentioning these concerns. First, we inserted the new Figure 3 to address these concerns. In particular, Figure 3b shows even for tiny standard deviations of the open-channel noise, measured in units of single channel current σop/i ≈ 0.01, that the previous Kalman filter, Moffat, 2007, produces a larger Euclidean error than our algorithm. The magnitude of the Euclidean error of the parameters of the kinetic scheme inferred by Moffat, 2007, becomes quickly even worse than the Euclidean Error of Milescu et al., 2005 while our algorithm and the one used by, Milescu et al., 2005 is hardly affected. The reason for this is revealed if one considers the scaling of the total amount of open-channel noise in the macroscopic signal. In fact, we interpret our findings such that for most ion channels this generalization from pure additive to additional multiplicative noise (from adding instrumental noise of a constant variance to adding instrumental noise whose variance depends on the ensemble state) is a necessary step to apply the Kalman filter. Second, we discuss the mathematical difficulties now in more detail in the main text in the subsection “Kalman filter derived from a Bayesian filter”.

Second, the method proposed by Celentano and Hawkes, 2004, is not a predictor-corrector type but it utilizes the full covariance matrix between data values at different time points. It seems that the covariance matrix approach uses all the information contained in the macroscopic data and should be on par with the state inference approach. However, this method is only briefly mentioned here and then it's quickly dismissed as "impractical".

Thanks for pointing out that our concerns with the algorithm by Celentano and Hawkes, 2004, were not convincing. We agree that the likelihood of the Celentano and Hawkes algorithm seems to be a mathematical equivalent. But our major concern is computational speed. We agree that there is no sharp borderline between possible and impossible with regard to a Bayesian adaptation of the Celentano and Hawkes, 2004, algorithm and our algorithm. Thus, we softened the tone by writing “non-optimal or even impractical” (line 62-63). Nevertheless, we clarified our reasons to call it impractical here after the next quote and in the main article (line 60-72).

We all agree that it's a slower computation than, say, fitting exponentials, but so is the Kalman filter. Where do we draw the line of impracticability? Computational speed should be balanced with computational simplicity, estimation accuracy, and parameter and model identifiability. Moreover, that method was published in 2004, and the computational costs reported there should be projected to present day computational power.

To clarify our concerns, note that a maximum likelihood optimization usually takes some orders of magnitude fewer likelihood evaluations compared to the number of posterior evaluations when one samples the posterior (App. 1). The reason is not simply the larger number of samples by some orders of magnitude compared to the iteration steps of a maximum likelihood optimizer: For each excepted sample, the Hamiltonian Monte Carlo sampler (Betancourt, 2017) evaluates the derivative of the likelihood and prior distribution many times. This is necessary for numerically integrating the Hamiltonian equations via a leapfrog integrator. The evaluations of the likelihood by Celentano and Hawkes, 2004, have a computational complexity due to matrix inversion from Ndata2.3 to Ndata3 while the Kalman filter has a linear computational complexity in Ndata (App. 1).

We are not saying that the authors should code the C&H procedure and run it here, but should at least give it credit and discuss its potential against the KF method.

We expanded our considerations on the Celentano and Hawkes algorithm, and its offsprings, in the introduction (line 60-72). Based on the Celentano and Hawkes algorithm, Stepanyuk and Borisyuk, 2011, and Stepanyuk, 2014, published an algorithm which evaluates the likelihood quicker. Under certain conditions they claim that their algorithm can be faster than the Kalman filter from Moffatt, 2007.

A second point relates again to the computational complexity. Analog filtering before the analog-to-digital conversion distorts the dynamics of interest. Instead of a extensive analog filtering, it is beneficial to use the Kalman filter with a high analysing frequency because it is mathematically proven to be the minimal variance filter, Anderson et all, 2012, given the assumptions mentioned in the article are fulfilled. Under these conditions the linear computational complexity relative to Ndata becomes even more important.

That said it is not sufficient to just care about the scaling of the computational time. Memory operations (App. 1 Figure 2) of the code between the CPUs become quickly the bottleneck of the total computational time. For example, if the algorithm is asked for each data point to save the covariance matrix and mean vector of the signal, the computational time increases roughly by 5 · 10 − 102 times (App. 1 Figure 2).

A third reason for choosing the KF, is the implementation of a second dimension of the signal which is straightforward for the Bayesian filter as all of the Kalman filter equations are vector and matrix equations.

The only comparison provided in the manuscript is with the "rate equation" approach, by which the authors understand the family of methods that fit the data against a predicted average trajectory. In principle, this comparison is sufficient, but there are some issues with the way it's done.

Table 3 compares different features of their state inference algorithm and the "rate equation fitting", referencing Milescu et al., 2005. However, there seems to be a misunderstanding: the algorithm presented in that paper does in fact predict and use not only the average but also – optionally – the variance of the current, as contributed by stochastic state fluctuations and measurement noise. These quantities are predicted at any point in time as a function of the initial state, which is calculated from the experimental conditions. In contrast, the KF calculates the average and variance at one point in time as a projection of the average and variance at the previous point. However, both methods (can) compare the data value against a predicted probability distribution. The Kalman filter can produce more precise estimates but presumably with the cost of more complex and slower computation, and increased sensitivity to data artifacts.

We thank the Reviewer for this criticism because it allows us to point out what we liked to say with Table 3 in the previous version of the article. We used the terminology of the Kalman filter which separates the prediction equations for the mean and covariances of the system from the correction equations. The prediction equation evolves the mean and the covariance in time. The prediction of the mean value of the RE approach is identical to the one of the Kalman filter but there is no prediction equation for the covariance in the terminology sense of the Kalman filter. This does not mean that we intended to say that the RE approach cannot predict a covariance of the signal and state. The RE approach uses the multinomial assumption and the mean values for the prediction of the covariance. But this distinction obviously leads to unnecessary confusion and is not necessary. We therefore deleted the table.

As a rule of thumb at moderate analysing frequencies, the KF is usually 2 to 3 times slower than the RE algorithm (App. 1 Figure 3) for cPCF data. Using higher analysing frequencies for the patch-clamp data (App. 1 Figure 4a) the computational time of the Kalman filter can be even faster. This effect is not present at a lower analysing frequency. (App. 1 Figure 4b). For the two investigated data artifacts (Figure 11, 12 and App. 7) we could show that the Kalman filter performs better.

Figure 3 is very informative in this sense, showing that estimates obtained with the state inference (KF) algorithm are about 10 times more precise that those obtained with the "rate equation" approach. However, for this test, the "rate equation" method was allowed to use only the average, not the variance.

Considering this, the comparison made in Figure 3 should be redone against a "rate equation" method that utilizes not only the expected average but also the expected variance to fit the data, as in Milescu et al., 2005. Calculating this variance is trivial and the authors should be able to implement it easily (reviewers are happy to provide feedback). The comparison should include calculation times, as well as convergence.

We benchmarked against an in-house algorithm, but of course, Milescu et al., 2005, and Moffat, 2007, should be considered the gold standard. Therefore, we now challenge (Figure 3) our algorithm for patch-clamp data by the Bayesian version of Milescu et al., 2005 and Moffat, 2007. The error ratio between both algorithms amounts to 5.6 ± 1.4 for 4-state model and to 6.8 ± 2.7 for 5-state model (Figure 7).

In particular, the algorithm used by Moffat, 2007 suffers from state-depended noise (Figure 3b). Consequently, his approach is not applicable for Poissonian distributed fluorometry data. Thus, we benchmark only against Milescu et al., 2005 for confocal patch-clamp fluorometry data on a 4-state model (Figure 4), and on two more complex models (Figure 7,8 and 9) with 5 and 6 states. We show that the error ratio between the KF and the RE algorithm gets smaller when the second observable is added. For all tested models (Figure 7) the error ratio is about 1.5-2.0. Thus, adding the variance information, as asked by the Reviewer, made a large difference, similar to patch-clamp data, Milescu et al., 2005. That said, we show in Figure 4,5,6 that even if the error ratios are smaller the posterior sampled by an RE algorithm is meaningless as it is over-confident, which has also consequences for the confidence volume of maximum likelihood estimation (see discussion around Figure 4b,5,6,8,9).

In Figure 7b and d, we show that the advantage of the KF for patch-clamp data depends on the complexity of the process to be inferred. The KF is unbiased and identifies all parameters while the RE algorithm generates on the same data biased estimates and sometimes unidentified parameters.

For patch-clamp data of the most complex tested 6-state model (Figure 7 e) even the KF requires at least Nch > 4 · 104 channels per patch to identify all parameters. We suspect that this effect is caused by the 12 parameter dimensions of the rate matrix. Probably, the 12 dimensions make it difficult for the likelihood to dominate the bias picked up from the uniform prior.

We discuss calculation times and convergence in App. 1 Figure 2-4. Normally, the Kalman filter is 2-3 times slower than the RE approach. But against intuition, (App. 1 Figure 4a) we show that the Kalman filter can be even faster than a Rate equation type algorithm if the analysing frequency is high. This is only possible if the integration times of the Hamiltonian dynamics of the sampler are taking much longer for the RE approach. We showed that the RE approach creates too confident posteriors which in turn could lead to more integration steps of the leap-frog integrator of the sampler. Thus, algorithms with lower computational complexity in the likelihood evaluation can still be slower regarding the total computational time.

7. The manuscript nicely points out that a "rate equation" approach would need 10 times more channels (N) to attain the same parameter precision as with the Kalman filter, when the number of channels is in the approximate range of 10^2 … 10^4. With larger N, the two methods become comparable in this respect.

This is very important, because it means that estimate precision increases with N, regardless of the method, which also means that one should try to optimize the experimental approach to maximize the number of channels in the preparation. However, it should be pointed out that one could simply repeat the recording protocol 10 times (in the same cell or across cells) to accumulate 10 times more channels, and then use a "rate equation" algorithm to obtain estimates that are just as good.

As we were asked to redo the benchmark against the RE approach using the likelihood of Milescu, 2005, instead of least squares, there are many things that changed in our interpretation. First, we do not see anymore, in any of the tested scenarios for patch-clamp data (Figure 3c) our patch-clamp fluorometry data (Figures 4a, 7d,e) that both algorithms become equivalent in terms of the Euclidean error (Figures 3c, 4a, 7d,e) or HDCV (Figures 5c, 6a-b) with increasing channel numbers. They rather have a constant mean error ratio (Figure 7a,b) such that our old interpretation, to repeat the experiment 10 times, does not hold anymore.

Second, the ratio of the Euclidean error changes, such that with the second observable the KF is less better compared to the rate equation approach, given a constant model complexity (Figure 7a).

Third, we show that the rate equation approach is incapable to produce reliable credibility volumes. Thus, the Bayesian benefit that a parameter uncertainty quantification based on one data set is possible requires to use an algorithm which mimics the autocorrelation of the intrinsic noise, even if accuracy differences are small (Figures 5a-c). That is true even under pessimistic signal to experimental noise conditions (Figure 6b). This has consequences also for confidence volumes of maximum likelihood which at best are equivalent to the credibility volume if the likelihood dominates the prior, Jaynes, et all, 1976. For a discussion on limitations and work arounds of maximum likelihood error quantification of a dynamical system, see Joshi, et all, 2006.

Presumably, the "rate equation" calculation is significantly faster than the Kalman filter (particularly when one fits "features", such as activation curves), and repeating a recording may only add seconds or minutes of experiment time, compared to a comprehensive data analysis that likely involves hours and perhaps days. Although obvious, this point can be easily missed by the casual reader and so it would be useful to be mentioned in the manuscript.

Of course, the computational complexity for the calculation of the likelihood in a rate equation approach is smaller than the computational complexity of the Bayesian filter. We confirm the intuition that typically the RE algorithm is 1.8 till 3 times faster (App. 1 Figure 3) with one important exception for which the RE algorithm is slower (App. 1 Figure 4). We hypothesize that this exception is caused by the fact that, the full computational time to gain N samples depends not only on the computational complexity and memory operations but also how many integration steps of the Hamiltonian equations need to be applied to suggest a new sample. If an algorithm such as the RE approach generates much narrower posteriors, the curvature of the posterior is higher and, consequently the HMC sampling would require more integration.

But even if our discussed limits of the RE approach are considered to be unimportant, the bottle neck in the whole scientific process data acquisition or data analysis/modeling depends on the one hand on both the speed of the data acquisition and the complexity of the modeling question. On the other hand, it depends on the resources of the laboratory. Data acquisition of fast voltage-gated channels might take many orders of magnitude less time than a conconfocal patch-clamp fluorometry experiments in slowly gating HCN channels. Taking other limitations into account, such as e.g. photo toxicity when working with fluorescent ligands, it is not even a question of the time scales but also a matter of how many repetitions can be done at all.

8. A misunderstanding is that a current normalization is mandatory with "rate equation" algorithms. This is really not the case, as shown in Milescu et al., 2005, where it is demonstrated clearly that one can explicitly use channel count and unitary current to predict the observed macroscopic data. Consequently, these quantities can also be estimated, but state variance must be included in the calculation. Without variance, one can only estimate the product i x N, where i is unitary current and N is channel count. This should be clarified in the manuscript: any method that uses variance can be used to estimate i and N, not just the Kalman filter. In fact, the non-stationary noise analysis does exactly that: a model-blind estimation of N and i from non-equilibrium data. Also, one should be realistic here: in some circumstances it is far more efficient to fit data "features", such as the activation curve, in which case the current needs to be normalized.

We apologise for this error and corrected the introduction accordingly. In our old version this sentence referred rather to the in-house algorithm against which we benchmarked originally. The subtle but important differences in RE approaches were carelessly deleted during the text refinement. We indeed intended this article to be about kinetic scheme estimation not about fitting certain aspects of the data. Thus, we assume a situation where the data is qualified enough to try kinetic scheme inference.

9. It's great that the authors develop a rigorous Bayesian formalism here, but it would be a good idea to explain – even briefly – how to implement a (presumably simpler) maximum likelihood version that uses the Kalman filter. This should satisfy those readers who are less interested in the Bayesian approach and will also be suitable for situations when no prior information is available.

The programming language Stan offers to select from three options: Sampling the posterior, variational Bayesian inference and maximising the posterior. The same code provided by us can be used (see, Appendix 4). If one maximises the parameter of a posterior or simply of a likelihood it is then a matter of statements of prior distributions within the code. There is no need for reimplementation of the code.

10. The Bayesian formalism is not the only way of incorporating prior knowledge into an estimation algorithm. In fact, the reader may have more practical and pressing problems than guessing what the parameter prior distribution should be, whether uniform or Gaussian or other. More likely one would want to enforce a certain KD, microscopic (i)reversibility,

We thank the Reviewer for pointing out that our former text was not stating that adding prior information is not unique to Bayesian statistics. Similar approaches with regard to the maximum likelihood method for algebraic constrains (Salari et al., 2018) and behavioral constraints (Navarro et al., 2018) called penalized maximum likelihood, are now cited. We like to note, that the penalized maximum likelihood is synonymously called maximum a posteriori because the penalizing function is usually the logarithm of a prior (Gelman et all 2013). Thus, we do not think that it is helpful to separate simple prior considerations like uniform, Jeffrey or informative priors on subsets of or single parameters from more complex priors incorporating relations (Navarro et al., 2018) between parameters: For example, we demonstrate in the discussion of the methods section related to (Figures 7 and 9) how enforce from strictly to softly micro reversibility by means of a conditional prior on

(3) k5P(k5|k1k2k3k4k6k7k8)

In this way we translated the algebraic constraint (Salari et al., 2018) to a rather behavioral constraint (Navarro et al., 2018) which might be chosen to be strict (thus to be effectively algebraic) or soft to allow testing whether for a channel the thermodynamic equilibrium assumption is fulfilled.

As this article is already quite long with investigating the behaviour and the robustness of the likelihood, we hardly dwell into prior selecting questions expect that we demonstrate a way to incorporate micro-reversibility. In general, all problems which can be formulated as log-likelihood with an added penalizing function for the parameters can be expressed as a prior multiplied by the likelihood and sampled (given that exp[penalizing function] can be integrated).

…an (in)equality relationship between parameters, a minimum or maximum rate constant value, or complex model properties and behaviors, such as maximum Popen or half-activation voltage.

The programming language Stan employs base parameter types, whose support (the interval between their minimal and maximal values) can be readily defined by the user. The Stan compiler does then the necessary transformation to an unconstrained parameter space where the sampling is takes places. More complex parameter relationships, such as inequalities, are incorporated by other predefined parameter types such as simplexes, ordered vectors etc. Again, the stan compiler does the nasty parameter transformation to an unconstrained space automatically such that these constraints are fulfilled. Thus, testing model assumptions can be implemented fast. A constraint such as Popen can quickly be enforced from the fact that

Popen = f(K,L) (4)

which means for each parameter sample of the rate matrix K we can calculate the open probability for any ligand concentration L. Suppose that we have derived from previous data posteriors P(Popen(L)) of a set of open probabilities Popen for a set ligand concentrations. Then we can simply define the prior by

P(Popen) = P(f(K,L)) (5)

It is not even necessary that the inverse function f−1(Popen) = K exists (which it obviously does not). The only prerequisite for doing this information fusion in a mathematically rigorous attempt is that the previous analysis derived a posterior on Popen. Proxy’s such as point estimates and their standard errors of should be used cautiously.

A hard limit for a derived quantity, which is not naturally fulfilled, can be rather tricky. A possible solution is defining a own prior on this derived quantity in the same manner as Equation 2. The Prerequisite is that it can be differentiated such as a sum of Gaussians. With increasing terms in the sum it is possible to model a hard limit as a soft limit. That in turn alleviates hard limit problems.

A comprehensive framework for handling these situations via parameter constraints (linear or non-linear) and cost function penalty has been recently published (Salari et al/Navarro et al., 2018). Obviously, the Bayesian approach has merit, but the authors should discuss how it can better handle the types of practical problems presented in those papers, if it is to be considered an advance in the field, or at least a usable alternative.

Unless the asymptotic large data limiting case of the Bernstein-von Mises theorem hasn’t taken over, it is crucial to select a reasonable prior distribution in the Bayesian context. Nevertheless, we restricted ourselves mainly to the behaviour of the likelihood of both algorithms (in the asymptotic strong data case) which offers many things to explore. If faced with the situation that a substantial amount of information enters the parameter inference via soft or inequality constraints (Salari et al/Navarro et al., 2018) one risks a waste of information if one uses a point estimator. That is because if one for example reports only the parameter vector of a penalized maximum likelihood inference and its covariance matrix most of the time one does not report the sufficient statistics of the corresponding posterior (Gelman et al., Chapter 4.5, 2013 third Edition). Take for example an inequality constraint that cuts at one standard deviation distance from the mean through the sampling distribution of a maximum likelihood inference. The mentioned Bernstein-von Mises theorem is a Bayesian justification for doing maximum likelihood inferences in the asymptotic limit (Gelman et al., Chapter 4.5, 2013 third Edition) when all prior information becomes irrelevant. Thus, we do not see any relevant differences in which statistical framework prior information for a parametric model can be formulated. But Bayesian statistics will make the most of the prior information unless the data completely dominates the prior information. A rather detailed discussion on priors is beyond the scope of this article.

11. The methods section should include information concerning the parameter initialization choices, HMC parameters (e.g. number of steps) and any burn-in period used in the analyses used in Figures 3-6.

We thank the Reviewer for pointing out the missing information: We included in App. 1 a section about the typically used parameter settings and convergence diagnostics. The Stan compiler generates a HMC sampler which adaptively tunes the sampling parameters. Additionally, the no-U-turn implementation (NUTS) automatically selects an appropriate number of leapfrog steps in each iteration. In that way there is much less knowledge of the user required to operate the program. Typically, we used 4 independent sampling chains with a warm up phase of 3 − 9 · 103 iterations per chain followed by the actual iterations which generate the posterior. Convergence is monitored by the Rˆ statistics (Appendix 1).

For example, how is convergence established? How many iterations does it take to reach convergence? How long does it take to run? How does it scale with the data length, channel count, and model state count? How long does it take to optimize a large model (e.g., 10 or 20 states)? Provide some comparison with the "rate equation method".

We dedicated App. 1 to this discussion

12. In the section on priors, the entire part concerning the use of a β distribution should be removed or replaced, because it is a probabilistic misrepresentation of the actual prior information that the authors claim to have in the manuscript text. The max-entropy prior derived for the situation described in the text (i.e., an unknown magnitude where you don't know any moments but do have upper and lower bounds; the latter could be from the length from the experiment) is actually P(x) = (ln(x_{max}) – ln(x_{min}))^{-1} * x^{-1}. Reviewers are happy to discuss more with the authors.

We picked up these suggestions. We deleted this section and continued to employ uniform priors. We do not want to advocate a uniform prior for rate matrices. We use the uniform prior to be comparable to plain maximum likelihood, as it is the default method of most researchers, and to reduce the size of this publication. Unfortunately, the article is already quite long which makes us focus on the likelihood side of the Bayesian formalism. Therefore, we restrict the comparison between the algorithms to the situation where the likelihood dominates the uniform prior. Nevertheless, there are clear indications in Figure 3, 7, 8 and 9 that the uniform prior becomes insufficient for patch-clamp data, in particular for more complex (data generating) models and if the RE approach is used. We will present a detailed prior investigation in the near future in the archive version of another article.

13. Here and there, the manuscript somehow gives the impression that existing algorithms that extract kinetic parameters by fitting the average macroscopic current ("fitting rate equations") are less "correct", or ignorant of the true mathematical description of the data. This is not the case. Published algorithms often clearly state what data they apply to, what their limitations are, and what approximations were made, and thus they are correct within that defined context and are meant to be more effective than alternatives. Some quick editing throughout the manuscript should eliminate this impression.

We did not intend to generate this impression. What we liked to communicate was to emphasize both the similarities but in several details also relevant differences in the assumptions adopted by both approaches. We revised the manuscript accordingly.

14. The manuscript refers to the method where the data are fitted against a predicted current as "rate equations". However, it is not completely clear what that means. The rate equation is something intrinsic to the model, not a feature of any algorithm. An alternative terminology must be found. Perhaps different algorithms could be classified based on what statistical properties are used and how. E.g., average (+variance) predicted from the starting probabilities (Milescu et al., 2005), full covariance (Celentano and Hawkes, 2004), point-by-point predictor-corrector (Moffatt, 2007).

We thank the Reviewer for this criticism which helped us to clarify our terminology. We agree that rate equations are something intrinsic of a kinetic scheme but this does not exclude them to be also a feature of an algorithm: A good Bayesian model is always a forward model. This means, given a proper prior, one can simulate parameters and then (given parameters) sample data at every time point from the likelihood which in turn is calculated either by the deterministic rate equation or by the first order Markov process of the KF. For the inverse problem every Bayesian inference algorithm takes the model assumptions of the forward model and uses them to implement a calculation rule for the likelihood and the prior given the data. In this way, it makes sense to us to speak of an algorithm as a mathematical (non-unique) representation of the set of model assumptions/features which allows to calculate the posterior. The same applies to maximum likelihood models except that parameters are fixed and can not be updated once new information is available. They have to be refitted completely. Our terminology is coherent with the terminology of the statistical research community. They mean by statistical model either a likelihood or the combination of likelihood and prior.

15. The manuscript needs line editing and proofreading (e.g., on line 494, "Roa" should be "Rao"; missing an equals sign in equation 13). Additionally, in many paragraphs, several of the sentences are tangential and distract from communicating the message of the paper (e.g., line 55). Removing them will help to streamline the text, which is quite long.

Thanks. We changed Roa to Rao and also streamlined and divided the text. The changes evoked by the detailed criticism slightly prolonged the article.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Essential revisions:

Reviewer #1 (Recommendations for the authors):

The authors develop a Bayesian approach to modeling macroscopic signals arising from ensembles of individual units described by a Markov process, such as a collection of ion channels. Their approach utilizes a Kalman filter to account for temporal correlations in the bulk signal. For simulated data from a simple ion channel model where ligand binding drives pore opening, they show that their approach enhances parameter identifiability and/or estimates of parameter uncertainty over more traditional approaches. Furthermore, simultaneous measurement of both binding and pore gating signals further increases parameter identifiability. The developed approach provides a valuable tool for modeling macroscopic time series data with multiple observation channels.

The authors have spent considerable effort to address the previous reviewer comments, and I applaud them for the breadth and depth of their current analysis.

We are glad that our revision could convince the reviewer.

1. The figure caption titles tend to say what analysis or comparison is being presented, but not what the take home message of the figure is. I suggest changing them to emphasize the latter. This will especially help non-experts to understand what the figures are trying to convey to them.

We checked and modified the figure caption titles that all provide now a brief take home message.

2. I very much appreciate the GitHub code and examples for running your software. However, I feel that a hand-holding step-by-step explicit example of running a new model on new data is likely necessary for many to be able to utilize your software. Much more hand-holding than the current instructions on the GitHub repo.

We followed the suggestion and edited the tutorial https://github.com/

JanMuench/Tutorial_Patch-clamp_data/blob/main/README.md

and https://github.com/JanMuench/Tutorial_Bayesian_Filter_cPCF_data/blob/main/ README.md to make it more a step-by-step description of how to analyze a new data set with a new model.

3. Figure captions sometimes do not explain enough of what is shown. I appreciate that many of these details are in the main text, but data that is displayed in the figure and not concretely described in the caption can makes the figures hard to follow. e.g. Figure 4a – "With lighter green we indicate the Euclidean error of patch-clamp data set." But what data set do the other colors reflect? It is not stated in the caption. Again, I realize this is described in the main text, but it also needs to be defined in the caption where the data is presented. Figure 4d – Please spell out what "both algorithms" intends. Also, a suggestion: instead of having to refer to the caption for the color codes (i.e. RE vs. KF, etc.) it would speed figure interpretation to have a legend in the figures themselves. Few other examples: Box 1. Figure 2. – Please define the solid vs. dashed lines in the caption. Figure 3c – Please define the solid vs. dashed lines in the caption. Figure 12 – "We simulated 5 different 100kHz signals." What kind of signals? Fluorescence I assume, but this needs to be explicitly defined. I'd check all figures for similar.

We edited the Figure 3,4 caption in this regard. Every panel has its legend and the color coding of all panel matches. But in order to keep the figure caption and the figure together small enough that they fit onto one page we took away some of the redundant information from the caption. Additionally, Figure 8, 9 got the information how much probability mass is included in each HDCI. We also edited the caption of Figure 12 and state now that we simulated cPCF data.

Reviewer #3 (Recommendations for the authors):

In this revised manuscript, the Münch et al. have addressed all of my original concerns. It is significantly revised, though, and includes many new investigations of the algorithm's performance. Overall, the narrative of this manuscript is now to introduce an approximation to the solution for a Bayesian Kalman Filter, and then spend time demonstrating that this approximation is reasonable and even better than previous methods. In my opinion, they successfully do this, although, as they mention in their comments, their manuscript is very long.

We thank the reviewer that he appreciates our work.

I am not 100% certain, but the approximation that the author's make seems to be equivalent (or at least similar to) an approximation of the chemical master equation using just the 1st and 2nd moments, which is just the Fokker-Planck equation. The authors should discuss any connection to this approximation, as there is a great deal of literature on this topic (e.g., see van Kampen's book).

We added a respective sentence in the introduction about protein expression studies which approximates the solution of the chemical master equation (CME) with the linear noise approximation to derive a Kalman filter. The linear noise approximation is the van Kampen’s System size expansion “pursued only up to first order”[13], or a linearized Fokker-Planck equation [11].

We added a paragraph in the conclusion to contextualize our method against the background of the classical approximations of the CME. We conclude due to the first order dynamics of chemical reaction network our approach is actually equivalent to the full chemical Fokker-Planck equation.

In Figures 3A and 4D, it is unclear to me what is plotted for the RE and classical Kalman filter (i.e., how is there a posterior if they are not Bayesian methods)? Perhaps it is buried in the methods or appendices, but, if so, it needs to at least be clarified in the figure captions.

We implemented a full Bayesian Version of Milescu 2005, algorithm and Moffatt 2007, algorithm, thus all results in this paper were obtained by posterior sampling. Since we used uniform priors, the mode of the posterior equals the maximum of the likelihood, i.e. the outcome of the classical implementation as a point estimator. However, our implementation of Milescu 2005 algorithm and Moffatt 2007 algorithms has the advantage that all sorts of priors can be applied. Unfortunately, our wording was somewhat ambiguous: Every Kalman filter is Bayesian filter (Moffatt, 2007) regarding the time varying unknown n(t) ensemble state. That is independent of how the time constant parameters such as the rate matrix parameters are inferred. The wording “Bayesian Filter” for our algorithm was used to distinguish it from the previous algorithm (Moffatt, 2007), as only our generalized Bayesian filter allows the inclusion of state dependent noise. We added to the manuscript in the caption and the main text that we implemented both algorithms in the full Bayesian framework.

The Bayesian statistical tests devised to determine success (e.g., on pgs. 12-13) seem a little ad hoc, but are technically acceptable. I do not see a need for additional metrics.

Interestingly, we found out is is not that actually not that “ad hoc” as we thought in the beginning. We in fact test the frequentist coverage property of “credibility” intervals or volumes, see [8] Equation 2.1. Usually for parametric Bayesian statistics it is at least asymptotically clear by the Bernstein von-Mises theorem (when the data swamped the prior) that coverage property should hold the frequentist interpretation of probability [9]. Just like us, others interpret the Bayesian uncertainty quantification as “invalid” if this frequentist coverage does not hold. [10]. We add a sentence “Noteworthy, that this is a empirical test of how sufficient the Bayesian filter and the RE approach hold frequentist coverage property of their HDCVs.” to indicate the theoretical foundation of our approach.

Line 939: Equation 61 is absolutely not "close to uniform distribution". The α and β parameters of 100.01 are much larger than 1. It is incredibly peaked around 0.5. Perhaps this is a typo?

This was indeed a left-over text fragment from the previous version of the article. We changed it to ”sharply peaked β distribution”.

Line 942: The allowed small breaking of microscopic reversibility in the prior is an interesting idea that I wish the authors would expound upon more.

Thanks for this encouragement. Indeed, we plan to go more into detail on this topic in the future.

Line 712: The authors state that the simulation 'code will be shared up-on request'. They should include it with their github pages tutorials for running the examples in case others wish to check their work and/or use it. There is no reason to withhold it.

The simulation code is exemplified here:

https://cloudhsm.it-dlz.de/s/QB2pQQ7ycMXEitE and now also in the manuscript.

Line 707: 'Perspectively' is not a commonly used word.

Thanks. We changed “Perspectively, extensions …” to “Prospective extensions …”

Reviewer #4 (Recommendations for the authors):

The authors have addressed all my comments and suggestions. The manuscript is nice and extremely comprehensive, and should advance the field.

Nevertheless, the manuscript is also very long (but justifiably so), and certain statements could be a little clearer. Most of these statements refer to the comparison with the so-called rate equation (RE) methods, with which I'm more familiar. For example:

We thank the Reviewer for his appreciation of our work.

Abstract: "Furthermore, the Bayesian filter delivers unbiased estimates for a wider range of data quality and identifies parameters which the rate equation approach does not identify."

The first part of this statement is not quite true, as Figure 4 shows clearly that the Bayesian estimates are biased (and the authors acknowledge this elsewhere in the manuscript). If they are biased in one "range of data quality", that probably means they are always biased, just to different degrees.

We agree that it is right to say “If they are biased in one “range of data quality” this probably means they are always biased, just to different degrees”. The approximations lead to all sorts of errors including bias. This argument is also supported by the findings of Moffatt, 2007 (Figure 11) in which he detects bias in the deterministic approach (Milescu, 2005) and his own Kalman filter. It can certainly be stated that Figure 4b indicates a bias. However, even for optimistically high signal-to-noise in the fluorescence signal as in Figure 4, reasonable large HDCIs such as the 0.95−HDCIs of Figure 4c are covering the true value. Thus, the bias can be interpreted as negligible relative to the parameter uncertainty quantification. One can actually observe how the size (the probability mass) of the HDCV of the Bayesian filter overcomes the bias in Figure 5c. We toned down our statement by adding the word “negligible” in the context with “biased estimates” in the abstract. Considering the differences in the bias in Figure 9, this seems justified to us. We also mention the bias now in the discussion of Figure 4.

This is not surprising, because the Kalman filter is a continuous state approximation to a discrete state process, and the overall estimation algorithm makes a number of approximations to intractable probability distributions. It would definitely be correct to say that the estimates are very good, but not unbiased.

We agree with the reviewer and added to the discussion of Figure 4 a statement indicating the bias of both algorithms.

Second, this statement is also ambiguous. Are you referring to the theoretical non-identifiability caused by having more parameters in the model than what the combination of estimator+experimental protocol can extract from data? In this case, it's not a matter of certain parameters that cannot be identified, but a matter of how many can be identified uniquely. The more information is extracted from the data, the more unique parameters can be identified, so the Kalman filter should do better. Or, alternatively, are you referring to specific parameters that are poorly identified because, for example, they refer to transitions that are rarely undertaken by the channel? In this case, it would be a matter of obtaining looser or tighter estimates with one method or another, but the parameters should be intrinsically identifiable, I imagine, regardless of the algorithm.

In the context of the whole article this sentence seems to be unambiguous to us. The abstract alone leaves of course room for interpretation but we defined what we mean by unidentified parameters by Equation 16 in the main article. To avoid confusion we changed in the abstract the sentence to “For some data sets it identifies more parameters than the rate equation approach”. By our definition we are able to detect many cases of both practical and structural unidentifiability (Middendorf,Aldrich 2017). But our definition is not equivalent to the definition by (Middendorf,Aldrich 2017). We added an explanatory paragraph after Equation 16.

In any case, it's not clear that the better identifiability is the result of the Bayesian side, or of the predictor-corrector state inference filter. I would guess it is the Kalman filter, but I'm not sure.

Indeed, it is only the “predictor-corrector” side of the algorithm. Unfortunately, our wording was somewhat ambiguous which gives rise to a couple of the Reviewer’s following questions: Every Kalman filter is a Bayesian Filter but not every Bayesian Filter is (the classical) Kalman filter. In fact, Moffatt, 2007 derives his Kalman filter as a Bayesian forward algorithm in continuous space. He thus shows that the forward algorithm and the Kalman filtering performed for discrete and continuous state Markov processes, respectively, are the same. We come to the same conclusion herein, see Equation 5-9. (Nevertheless, he optimizes the rate matrix etc. via maximum likelihood optimization.)

We additionally describe in Equation 11 the consequences of the open-channel noise, that Equation 11 cannot be derived by the original Kalman filter equations. A more flexible noise modeling is necessary. This leads to our algorithm which is exact up to the second moment. That also includes other state-dependent noise such as photon counting noise. We believe that this explicit noise description is the reason for the performance advantage of our algorithm. We would also expect this advantage if our algorithm would be implemented as point estimator. The full Bayesian implementation results in a fully sampled posterior and, in principle, allows for arbitrary priors. Since we used uniform priors, the mode of the posterior equals the maximum of the likelihood, i.e. the outcome of the classical implementation as point estimator.

Perhaps it would be clearer if you said that the KF method produces good estimates and generally improves parameter identifiability compared to other methods, as it extracts more information from the data?

As mentioned above, we changed the abstract to avoid misunderstandings.

Introduction:

32: I'm not sure, but if the intention here is to cite mathematical treatments for estimation, you may add references to the "macroscopic" papers by Celentano, Milescu, Stepaniuk and perhaps a few others that use "rate equations". Also, you may cite Qin et al., 1996, as a single channel paper describing a method used in hundreds of studies.

Thank you for pointing out the Qin paper, 1996. We also refer now to the other proposed papers at that position in the text.

Pg. 3:

51: I remain skeptical that it is a good idea to use "rate equations" (RE) as a term to refer to those methods that are fundamentally different from the approach described here (also see my comment to the first submission).

We follow with our terminology on stochastic processes that of other groups in the community [6], [4], (Walczak 2011). In this terminology, the rate equation or (reaction-)rate equation is a deterministic ordinary differential equation describing the time evolution of the first moment (the mean value) of the chemical master equation (CME) [3, 4]. To our understanding the rate equation is the basis of the work of Milescu 2005 which justifies the wording.

The rate equations must always be used to predict a future state from a past or current state, by all methods, explicitly or implicitly, because REs simply describe the channel dynamics. In this very manuscript, Equation 3, central to the Kalman filter formalism, is nothing but a deterministic rate equation with a stochastic term added to approximate the stochastic evolution of an ensemble of channels.

Indeed, all approximations (deterministic or stochastic) which make physically sense should have as the l