Linking timeseries of singlemolecule experiments with molecular dynamics simulations by machine learning
Abstract
Singlemolecule experiments and molecular dynamics (MD) simulations are indispensable tools for investigating protein conformational dynamics. The former provide timeseries data, such as donoracceptor distances, whereas the latter give atomistic information, although this information is often biased by model parameters. Here, we devise a machinelearning method to combine the complementary information from the two approaches and construct a consistent model of conformational dynamics. It is applied to the folding dynamics of the forminbinding protein WW domain. MD simulations over 400 μs led to an initial Markov state model (MSM), which was then "refined" using singlemolecule Förster resonance energy transfer (FRET) data through hidden Markov modeling. The refined or dataassimilated MSM reproduces the FRET data and features hairpin one in the transitionstate ensemble, consistent with mutation experiments. The folding pathway in the dataassimilated MSM suggests interplay between hydrophobic contacts and turn formation. Our method provides a general framework for investigating conformational transitions in other proteins.
https://doi.org/10.7554/eLife.32668.001Introduction
Protein folding is an important subject not only for basic research in molecular biology but also for understanding folding diseases and designing new polymeric materials (Dill and MacCallum, 2012). Transient, partially folded states are often encountered on folding pathways, and have been characterized experimentally in solution by methods such as laser temperature jumps, fluorescence labeling, and solution Xray scattering. Mutagenesis evaluated by Φvalue analysis (Fersht and Daggett, 2002), for instance, has also provided residuelevel information on transition states. Recently, stateoftheart singlemolecule (sm) measurements, singlemolecule Förster resonance energy transfer (smFRET) (Chung et al., 2012; Oikawa et al., 2013) and force spectroscopy (smFS) (Neupane et al., 2016) have become powerful tools in proteinfolding research, providing reliable information on transitionpath (barriercrossing process) times (Chung et al., 2012) and heterogeneity in the unfolded state (Oikawa et al., 2013). A major limitation of smFRET is that the observables are restricted to ‘lowdimensional’ structural data, such as the donor–acceptor distance. Computational modeling should help us to interpret singlemolecule timeseries measurements and should contribute to solving some remaining puzzles, such as the reduced solvent viscosity dependence of the transitionpath times (Chung and Eaton, 2013), the internal viscosity (Chung and Eaton, 2013) and the impact of nonMarkovian property (Chung et al., 2015). Theories and computational methods have been developed to extract structural information and dynamics from smFRET data (Haas et al., 2013; Hoefling et al., 2011; Matsunaga et al., 2015; Sun et al., 2016).
Molecular dynamics (MD) simulation is another powerful approach for investigating protein dynamics and folding over relatively long time periods — hundreds of microseconds or longer (LindorffLarsen et al., 2011). In theory, ‘lowdimensional’ smFRET measurements are interpreted in terms of the atomic structural models obtained with MD simulations. However, it is still a challenge to achieve quantitative agreement between simulation and experimental data during the entire folding process, owing not only to simulation time limitations but also to inherent forcefield biases. In particular, while local interactions are well described by the current force fields, it is still difficult to reproduce energetic balances between unfolded and folded states. Indeed, Piana and coworkers showed, in their protein folding simulation study, that the folding mechanism of the villin headpiece depends substantially on the choice of force field (Piana et al., 2011). It is also known that most force fields produce unfolded states that are more compact and structured than those suggested experimentally (Piana et al., 2014). Methods based on maximum entropy (Beauchamp et al., 2014; Boomsma et al., 2014; Cavalli et al., 2013; Olsson et al., 2013, 2014, 2017; Pitera and Chodera, 2012; Roux and Weare, 2013) or Bayesian statistics (Bonomi et al., 2016) were recently developed for guiding simulations or models to generate ensembles that match experimental ensembleaveraged observations.
Exploiting timeseries data from singlemolecule experiments is another way to link simulation with experiment and has several advantages over the ensembleaverage based approaches: (i) more latent states can be uncovered by inferring states from their historical evolution than from their static ensembles (Li et al., 2008; Matsunaga et al., 2015; Schuetz et al., 2010); (ii) the transition state can be uniquely identified as a dynamic bottleneck by following the actual dynamics. However, the timescale gap between experiment and simulation previously hampered the direct use of timeseries analysis methods in other disciplines, such as data assimilations that are based on the sequential Monte Carlo method (Matsunaga et al., 2015).
Here, we develop a new approach for singlemolecule timeseries data based on the Markov state model (MSM) (Pande et al., 2010), a statistical model that approximates dynamics by memoryless probabilistic transitions between discrete conformational states. In MSM, the probability of transition from the discrete state i to state j in a lag time of τ is described by a transitionprobability matrix $T\left(\tau \right)=\left\{{T}_{ij}\left(\tau \right)\right\}$. $T\left(\tau \right)$ can be estimated from a set of short simulations instead of a single long simulation. Therefore, MSM is often used to generate longtime dynamics (Silva et al., 2014) whose timescale is comparable to those of experimental measurements (Feng et al., 2016; Noé et al., 2011; Snow et al., 2002). However, estimation of $T\left(\tau \right)$ is largely dependent on the simulation forcefields, which may have uncertainties or biases for certain conformations (Olsson et al., 2017). To overcome this problem, we propose to "refine" $T\left(\tau \right)$ using highresolution measurement of singlemolecules as timeseries data. Specifically, we use a machinelearning approach to estimate $T\left(\tau \right)$ between hidden Markov states from lowdimensional timeseries data (Bishop, 2006). To distinguish the original $T\left(\tau \right)\left(={T}_{\text{simulation}}\left(\tau \right)\right)$, we refer to the refined $T\left(\tau \right)$ as ${T}_{\text{experiment}}\left(\tau \right)$ in this paper.
We propose a twostep procedure in our machinelearning approach, which links simulations and singlemolecule experiments (Figure 1). (i) Supervised learning step. We first construct an initial MSM from a raw set of simulation data. After defining discrete conformational states by clustering the trajectory snapshots, $T\left(\tau \right)={T}_{\text{simulation}}\left(\tau \right)$ is estimated directly by counting transitions between the discrete states in the simulation trajectories. This step is the same as conventional MSM (Pande et al., 2010). (ii) Unsupervised learning step. Using ${T}_{\text{simulation}}\left(\tau \right)$ as an initial estimate, we perform hidden Markov modeling (Bronson et al., 2009; McKinney et al., 2006; Okamoto and Sako, 2012; Pirchi et al., 2016; Rabiner and Juang, 1986; Schröder and Grubmüller, 2003) to refine the initial MSM using singlemolecule measurement timeseries data. $T\left(\tau \right)$ is optimized so that the "refined" or dataassimilated MSM with ${T}_{\text{experiment}}\left(\tau \right)$ can reproduce the timeseries data most accurately.
We applied this procedure to the folding dynamics of the forminbinding protein (FBP) WW domain, a 37residue threestranded βsheet protein. In the construction of the initial MSM, extensive MD simulations of a dyelabeled WW domain were performed for an aggregated time of ~400 μs. Although there are a number of folding simulation studies for the FBP WW domain and its homologs (Ensign and Pande, 2009; Freddolino et al., 2008; Karanicolas and Brooks, 2003; ZanettiPolzi et al., 2017; Zhou et al., 2014), this may be the first folding simulation to use FRET dyes. High timeresolution smFRET measurements of WW domain folding and unfolding dynamics were used for the unsupervised learning (Chung et al., 2012). The data were previously measured to resolve the durations of the folding and unfolding transitions with microsecond resolution by Chung and coworkers (Chung et al., 2012). The initial and dataassimilated MSM showed different folding pathways and transitionstate ensembles of WW domains. An independent mutation experiment with Φvalue analysis (Fersht and Daggett, 2002) validated the dataassimilated MSM. We discuss our timeseries analysis method in the context of machinelearning theory and its applicability to conformational transitions in other biomolecules.
Results
Singlemolecule FRET measurements
SmFRET experiments were carried out by Chung and coworkers, and details for the experiments are given in Chung et al., 2012. Here we summarize the essential experimental setups necessary for our machinelearning procedure. Photon trajectories were measured for the FBP WW domain with donor (Alexa 488) and acceptor (Alexa 594) fluorophores attached to the terminal residues in the protein. In order to improve the time resolution of the smFRET data, Chung and coworkers illuminated the protein with a very high intensity laser (10 kW/cm^{2}), increasing the number of photons observed per time (~650 photons/ms) (Chung et al., 2012). Photon color, either donor green or acceptor red, and the absolute time of arrivals were recorded within ~0.5 ns. Each photon trajectory was split into folded and unfolded segments by finding the photon interval with the maximum transition probability (Gopich and Szabo, 2009). The final set of smFRET photon sequences comprises 527 trajectories, each of which contains a single folding or unfolding event.
Molecular dynamics simulations
A dyelabeled WW domain was built in silico for MD simulations (Figure 1—figure supplement 1). Starting from the NMR structure of the WW domain (PDB code: 1E0L [Macias et al., 2000]), we made the same substitution mutation W30A as in the experiments (Chung et al., 2012), and the terminal residues were labeled with donor (Alexa Fluor 488) or acceptor (Alexa Fluor 594) dyes by using the AMBERDYES package (Graen et al., 2014).
We conducted eleven simulations of length 25.6 μs in the NVT ensemble (370 K) from the unfolded structures. By monitoring the fraction of native contacts, Q, we observed folding events in four trajectories out of the eleven (Figure 2—figure supplement 1A). Also, six simulations of length 10 μs and four simulations of 14 μs were performed from the folded structure. In these ten trajectories, we observed seven unfolding events. Our simulations, thus, sampled intermediate regions between the unfolded and folded states sufficiently for construction of an initial MSM. The simulation length is ~400 μs in total.
Clustering of sampled structures in MD simulations
For the MSM construction, we chose a twodimensional (2D) space spanned by the native contact Q and the expected FRET efficiency ε. The value of ε is calculated from the distance r between donor and acceptor dyes using the Förster theory, $\epsilon =1/\left[1+{\left(r/{R}_{0}\right)}^{6}\right]$, where R_{0} is the Förster radius (see 'Methods' for details). Here, Q is chosen because it is historically known to be the best reaction coordinate to describe a folding process. Also, ε is employed for comparison with smFRET data as well as for differentiating compact and elongated structures. Figure 2A shows a scatter plot of the sampled conformations in the MD simulations. Expected FRET efficiency ε successfully resolves the elongated unfolded states and compact states, whereas ε fails to discriminate the folded state (corresponding to Q ~ 0.7–1.0 and ε ~ 0.7–1.0) from the compact unfolded state (Q ~ 0.0–0.3 and ε ~ 0.7–1.0) without the help of Q. This suggests that approaches that are based on ensemble average may have a difficulty when the histogram of ε is used to add biases to the protein in MD simulations.
Sampled conformations were clustered into discrete states in the 2D space (cluster centers are shown in Figure 2B). Regular spatial clustering was applied to partition the space in an equidistant manner regardless of the population size (Senne et al., 2012). This spatial clustering is essential in this study because relatively minor populations can have high probabilities after the refinement of the initial MSM with the help of experimental timeseries data.
In order to obtain structural insights, we calculated both the mean and the variance of the donoracceptor distance r for each state. Figure 2—figure supplement 2 shows samples plotted in the 2D space spanned by Q and r. There are distance gaps between states, which are supportive of the Markovian assumption. Large standard deviations are observed in the case of small donoracceptor distances. This results from the lower spatial resolution in FRET efficiency ε when ε is not close to 0.5 (e.g., states with ε ~ 1 cover r = 030 Å) as discussed in the informationbinning study by Watkins and Yang (2004).
Construction of an initial MSM as the supervised learning step
MSM has two tunable parameters: the number of discrete states and the lag time τ for $T\left(\tau \right)$. Theoretically, increasing the number of discrete states and/or the lag time will produce a MSM with the least discretization error (Prinz et al., 2011), while a large number of discrete states or a longer lag time will decrease the number of samples for estimation, resulting in large statistical errors (the socalled biasvariance tradeoff) (McGibbon et al., 2014). Thus, it is common to make the number of discrete states and the lag time as small as possible, even though it compromises the accuracy of the model. Here, we examined various numbers of discrete states for MSM by calculating implied time scales (Schwantes and Pande, 2013). The ith implied time scale t_{i} of a MSM with $T\left(\tau \right)$ is given by
where λ_{i} is the ith eigenvalue of $T\left(\tau \right)$. As the implied time scales are always underestimated relative to their true values, the slower time scales are indicative of smaller discretization errors. The implied time scales were calculated for various numbers of clustered states as a function of the lag time τ (Figure 2—figure supplement 3A). We found that the converged values of the slowest implied time scale (related to folding dynamics) successfully reproduce the time scale of folding (~5 μs) in the MD data when the number of states is larger than 80 (Figure 2—figure supplement 3B). From these observations, we chose 87 states by adjusting the cluster radius (0.08) in the regular spatial clustering.
A lag time of τ = 200 ns was chosen as a minimum time scale to achieve converged implied time scales. Note that the number of states chosen here is an order of magnitude fewer than those in other MSM studies. It is well known that the RMSD metric requires a larger number of discrete states in MSM, whereas welldefined metrics that are based on slow motions or smooth coordinates (such as contactmaps [Kellogg et al., 2012], or coordinates extracted by timestructurebased independent component analysis [Schwantes and Pande, 2013]) require a smaller number of states. A long lag time of τ = 200 ns also helps the MSM to satisfy the Markov assumption as well as improving FRET photoncount statistics in the next unsupervised learning step.
Figure 2C shows a graphical representation of the initial MSM constructed only from MD simulation data. The node areas are proportional to the equilibrium populations, and the edge line widths are proportional to the transition probabilities, ${T}_{ij}\left(\tau \right)$. The MSM seems to overemphasize compact unfolded states (Q ~ 0.0–0.3 and ε ~ 0.7–1.0), which results from biases of the forcefield parameters not only for proteins (Best et al., 2014; Piana et al., 2014) but also for FRET dyes (Best et al., 2015).
Refinement of transition probabilities as the unsupervised learning step
In the unsupervised learning, the total log likelihood function of all smFRET photoncounting sequences $\mathrm{l}\mathrm{n}\text{}L\left(\mathbf{T}\left(\tau \right)\right)=\sum _{k}\mathrm{l}\mathrm{n}\text{}{L}_{k}\left(\mathbf{T}\left(\tau \right)\right)$ is maximized by optimizing $T\left(\tau \right)$. First, we treated smFRET data as N independent photoncounting sequences in discretized time windows. Each sequence O_{k} = $\left\{{o}_{1}^{(k)}\dots {o}_{I}^{(k)}\right\}$ consists of a set of donor and acceptor photon counts $o}_{i}=\left\{{N}_{D}^{(i)},{N}_{A}^{(i)}\right\$ detected in ith time window. I is the number of time windows and 200 ns was chosen for the photoncounting timewindow width as well as for the lag time τ of MSM. The likelihood function ${L}_{k}\left(T\left(\tau \right)\right)$ is then defined as a probability to observe the kth smFRET photoncounting sequence O_{k} with a given $T\left(\tau \right)$ (Gopich and Szabo, 2012):
M is the number of discrete states in MSM, and s_{i} denotes MSM’s state at the ith time window. $p\left({s}_{i}{s}_{i1}\right)={T}_{{s}_{i1}{s}_{i}}\left(\tau \right)$ is the transition probability from state s_{i}_{1} to state s_{i}. $p\left({s}_{1}\right)$ is the equilibrium probability of being in state s_{1}. $h\left({o}_{i}{s}_{i}\right)$ is the probability of observing donor and acceptor photoncounts $o}_{i}=\left\{{N}_{D}^{(i)},{N}_{A}^{(i)}\right\$ given state s_{i}.
By maximizing the above likelihood function, we obtained the dataassimilated MSM with the optimized $T\left(\tau \right)={T}_{\text{experiment}}\left(\tau \right)$, which matches with the smFRET timeseries data. Figure 2D shows that the dataassimilated MSM differs from the initial MSM. In the dataassimilated MSM, the compact unfolded state (Q ~ 0.0–0.3 and ε ~ 0.7–1.0) disappears due to a minor population with high ε in the smFRET data (see Figure 3B), while an elongated unfolded region (Q ~ 0.0–0.2 and ε ~ 0.5–0.6) is stabilized. As noted, this may reflect the biases of the forcefield. Also, the different solvent conditions between the simulation (TIP3P water molecules) and the smFRET experiment (denaturant concentration) may affect the unfolded state distribution (Zheng et al., 2016). Interestingly, as another stable region, the folded state (Q ~ 0.8 and ε ~ 0.6–0.8) appears instead of other states with the same ε.
We compared the transition probabilities ${T}_{ij}\left(\tau \right)$ of the two MSMs (Figure 2—figure supplement 4). Figure 2—figure supplement 4A shows the implied time scales of the MSMs. We can see that the slowest time scale increases from 2.6 μs to 100 μs after the hidden Markov modeling. These time scales are related to folding/unfolding transitions of WW domain in both cases. Simulation time scale is faster than that in experiments because of the lower viscosity of the TIP3P water model (Mahoney and Jorgensen, 2001) compared to those of pure water and the viscogen added in the smFRET experiment (Chung et al., 2012). This gap was improved by the information from the smFRET data. In Figure 2—figure supplement 4B, ${T}_{ij}\left(\tau \right)$ are directly compared in a scatter plot where each ${T}_{ij}\left(\tau \right)$ is colored using the FRET efficiency ε of state i before transition. ${T}_{ij}\left(\tau \right)$ from states with high FRET efficiencies are correlated with each other. This means that ${T}_{ij}\left(\tau \right)$ related to compact states derived from MD simulations are consistent with the smFRET data. By contrast, ${T}_{ij}\left(\tau \right)$ from states with middle or low FRET efficiencies are less correlated. This means that the hidden Markov modeling mainly updated ${T}_{ij}\left(\tau \right)$ of elongated states.
Reproducibility of experimental data
Figure 3 shows histograms of the ‘measured’ FRET efficiency E of the original smFRET data and those generated by the initial and dataassimilated MSMs (see 'Methods' for emulation or stochastic simulation of smFRET data). The measured FRET efficiency E is calculated from the numbers of photons emitted from donor and acceptor dyes in a certain time window, N_{D} and N_{A}, respectively. It is defined by the ratio of donor photon counts to total photon counts, E = N_{A}/(N_{A} + N_{D}). E is calculated from measured photons in a time window whereas ε is calculated from the donoracceptor distance of each instantaneous structure. In Figures 3A, 200 ns timewindows were used for photon counts. The dataassimilated MSM produced a histogram close to that found with original smFRET data (the mean squared error between the normalized histograms is $3.7\times {10}^{5}$), confirming the reliability of the optimized parameters obtained by machine learning.
We also calculated the observed FRET efficiency E with 50 μs time windows (Figure 3B), which corresponds to the folding timescale observed in the smFRET data (Chung et al., 2012). The histogram of the original smFRET data has double peaks, corresponding to unfolded and folded states, respectively. The initial MSM, however, only shows a single sharp peak at high FRET efficiency because of overemphasis of compact conformations (Best et al., 2014). The dataassimilated MSM successfully reproduces the double peaks of the histogram of the original data. The histogram of the dataassimilated MSM seems smoothed compared to that obtained experimentally, presumably because of the accumulation of photoncounting noise using 200 ns timewindows.
Folding mechanisms of the FBP WW domain
To quantify the difference between the initial and dataassimilated MSMs, we calculated p_{fold} by applying transitionpath theory (Metzner et al., 2009; Noé et al., 2009) (Figure 4A and B). p_{fold} is the probability of undergoing a folding transition defined for each state. States with p_{fold} >0.5 are kinetically closer to the folded state, whereas those with p_{fold} <0.5 approximate the unfolded state. Those with p_{fold} = 0.5 define the transitionstate ensemble. The calculated p_{fold} tends to depend only on Q for the initial MSM (Figure 4A), as corroborated by a previous analysis of folding simulation data for various proteins (Best et al., 2013). On the other hand, p_{fold} of the dataassimilated MSM depends on ε as well as Q, suggesting that not only Q but also compactness needs to be factored into the folding mechanism (Figure 4B). p_{fold} was mapped onto the RMSDs of hairpins 1 and 2 of the native structure (Figure 4C and d). In the initial MSM, the transitionstate region (p_{fold} = 0.4–0.6) is located in a rather compact region where the formation of both hairpins 1 and 2 can be just discerned (Figure 4E). In the dataassimilated MSM, the transitionstate ensemble presents only hairpin 1 (Figure 4F). This is consistent with a mutagenesis experiment, where mutations in hairpin 1 produce large Φvalues (Petrovich et al., 2006), implying that formation of hairpin 1 contributes energetically to the transitionstate ensemble.
The flux of folding trajectories can be decomposed into individual pathways for both models (Figure 5A and B). The decomposition extracts a set of pathways along with their fluxes. The dominant pathways with large fluxes provide the statistically probable order of events during folding. The figures show that folding pathways with largest fluxes contribute 50% of the total flux. In the dataassimilated MSM (Figure 5B), the formation of hydrophobic sidechain cores (core 1 consists of Trp8, Tyr20, Asn22, Thr29 and Pro33, and core 2 consists of Thr9, Tyr11, Tyr 19 and Tyr21) stabilizes the βsheet structure in hairpin 1. The increased stability of hairpin 1 seems to guide the formation of the second hairpin (hairpin 2) by the interstrand hydrophobic interactions. Again, this scenario is consistent with sitedirected mutagenesis experiments for FBP (Petrovich et al., 2006) and the homologous Pin WW domains (Jäger et al., 2001). These experiments implied that interactions between conserved hydrophobic residues contribute to the stability of only the native state and not to the transition state ensemble. Previous simulation studies have suggested the existence of registershifted structures as trapped (NoeNoé et al., 2009) or intermediate (Mu et al., 2006) states, whereas in the dataassimilated MSM, such states were rarely observed. Hairpin 1 formation as a ratelimiting step together with the paucity of registershifted states in the dataassimilated MSM are consistent with the scenario of the WakoSaitôMuñozEaton (WSME) model (Muñoz et al., 1997; Wako and Saitô, 1978), which postulates that the formation of a local turn is a bottleneck for βsheet formation. Interestingly, hairpin 2 formation is driven by a hydrophobic collapse (Dinner et al., 1999) in the dataassimilated MSM. This implies that the interplay between turns and hydrophobic clusters has an important role in the formation of multiple hairpins in βsheet proteins.
In terms of the theory of the coilglobule transition (collapse transition) (Ziv et al., 2009), the formation of hairpin 1 would be the collapse step. This collapse transition is specific in the sense that the collapsed structure does not contain a large number of nonnative contacts, whereas the collapse of homopolymers are often treated as a nonspecific transition. As predicted by the theory (Ziv et al., 2009), this indicates that the folding and collapse transition temperatures are close for this WW domain.
Discussion
We have proposed a twostep procedure for the construction of a dataassimilated MSM with $T\left(\tau \right)={T}_{\text{experiment}}\left(\tau \right)$ matching singlemolecule timeseries data. Using smFRET data for the FBP WW domain, we show that the dataassimilated MSM successfully reproduces the original smFRET data, and yields a transitionstate ensemble consistent with an independent mutational experiment (Petrovich et al., 2006). The folding mechanism based on the dataassimilated MSM suggests an interplay between hairpin and hydrophobic formations.
In the context of machinelearning theory, the proposed twostep procedure can be regarded as a semisupervised learning algorithm, which tries to learn from both labeled and unlabeled data (Rudzinski et al., 2016; Zhu and Goldberg, 2009). In the context of MSM, simulation data correspond to labeled data while experimental data are unlabeled data. In a typical case of semisupervised learning (e.g., image recognition), the labeled data are correct and usually expensive (e.g., images that are manually labeled by investigator). Therefore, unlabeled data are often ‘deemphasized’ by scaling their contribution in the likelihood function (Zhu and Goldberg, 2009). By contrast, in our case, labeled data (simulation) may have incorrect transition counts caused by forcefield biases whereas unlabeled data (experimental) possess more reliable information on dynamics. Thus, in our twostep procedure, the estimates ${T}_{\text{simulation}}\left(\tau \right)$ from labeled data (simulation) are replaced with ${T}_{\text{experiment}}\left(\tau \right)$ "refined" with unlabeled data (experiment). This is regarded as the limit of ‘deemphasis’ on labeled data (simulation).
When fitting a rather complex model to any experimental data, the model can overreact to noise in the data (the overfitting problem). In particular, our MSM for the FBP WW domain has a rather large number of parameters ($87\times 87$ transition probabilities), which could be easily overfitted to the smFRET data. To assess the overfitting in the unsupervised learning, we divided the smFRET data in half, and unsupervised learning was independently applied to the two subsets. Qualitatively similar network structures appeared in both, and were similar to that obtained with the full data set (Figure 2—figure supplement 4). This is because the effective number of parameters is considerably reduced from $87\times 87$ down to only those involving populated states (ε ~ 0.5–0.8). In order to counteract overfitting, a maximum caliber approach for minimally perturbing the initial MSM could be a promising direction for a future study (Dixit and Dill, 2014, 2015, 2018; Wan et al., 2016; Zhou et al., 2017). Furthermore, to see the dependence on the choice of the Förster radius R_{0}, we carried out the unsupervised learning using a set of different R_{0} values (R_{0} = 54, 55, 57, and 58 Å). The overall structure of the MSM network was robust against the choice of R_{0} except for R_{0} = 58 Å (Figure 2—figure supplement 5).
The initial condition in the unsupervised learning is another issue. Since MSM has a larger number of parameters than in typical hidden Markov modeling, unsupervised learning requires a good initial condition for optimization to avoid being trapped in a local minimum. By using ${T}_{\text{simulation}}\left(\tau \right)$ as the initial condition, we achieved a likelihood convergence with $\mathrm{ln}L\left(T\left(\tau \right)\right)$ = −584,947 with 10,000step optimization (taking one week using the parallel implementation of the BaumWelch algorithm [Rabiner and Juang, 1986]). For comparison, we also performed the optimization using a random matrix as the initial condition (Figure 2—figure supplement 8C). In the figure, the optimization of the likelihood looks stacked after 10,000 iterations and its value is lower than that of ${T}_{\text{simulation}}\left(\tau \right)$ as the initial condition. This result suggests that the model from a random matrix could be ruled out because of trapping in a local minimum, or at least that global optimization of a random matrix is practically very inefficient. This also emphasizes the importance of transferring knowledge ${T}_{\text{simulation}}\left(\tau \right)$ learned from simulations for improving the unsupervised learning ${T}_{\text{experiment}}\left(\tau \right)$ from experimental data (Torrey and Jude, 2009). Although just ${T}_{\text{simulation}}\left(\tau \right)$ was used as the initial condition here, advanced algorithms in transfer learning (Torrey and Jude, 2009) can be incorporated in a future study.
We here used the conventional constanttime binning for photon counting because the standard MSM is based on constanttime binning. A promising possibility for future studies is to apply continuoustime Markov modeling (McGibbon and Pande, 2015), which may allow us to use informationbased binning (Watkins and Yang, 2004) or photonbyphoton analysis (Gopich and Szabo, 2009; Okamoto and Sako, 2012; Pirchi et al., 2016), avoiding photon counting noise.
In conclusion, exploiting the temporal information embedded in experimental timeseries data to improve the simulationbased model has provided a rich, dynamic and experimentally consistent picture of the folding mechanism for the FBP WW domain. The dataassimilated MSM pathway could be used to improve the forcefield parameters of proteins, nucleic acids, and other biomolecules. The semisupervised learning combined with MSM method developed here is a quite general framework that can be used to understand conformational transitions in proteins and other biomolecules. It can be extended to interpret other experimental data possibly using more advanced techniques.
Methods
Molecular dynamics simulation
Request a detailed protocolMonte Carlo searches were performed for labeling the dyes without any steric crashes with the protein. The constructed dyelabeled WW domain was solvated by TIP3P water molecules in a cubic box of 64.3 Å side length. Sodium ions were added to make the net charge of the system neutral.
In order to obtain unfolded structures as the initial structures for production runs, we first performed eleven 80 ns simulations at high temperature (600 K) in the NVT ensemble. Then, each system was equilibrated by 40 ns simulation in the NPT ensemble (1 atm and 370 K, slightly lower than the estimated melting temperature in the previous simulation with a different force field [Mu et al., 2006]). After determining the average volume size in these eleven trajectories, the volume size of each simulation was reset to the average value. Then, we conducted production simulation of eleven systems for 25.6 μs in the NVT ensemble (370 K). Furthermore, we performed six additional production runs of lengths 10 μs and another four simulations of 14 μs. All of these additional ten simulations started from the native structure.
All production simulations were conducted with the Amber 14 GPU version of the PMEMD module (SalomonFerrer et al., 2013) (using the SPFP precision model [Le Grand et al., 2013]) on GPU computers. Amber ff99SB (Hornak et al., 2006) was used for the force field. For the FRET dyes (Alexa 488, Alexa 594, and linkers), we used the AMBERDYE force field (Graen et al., 2014), which is optimized for use with the Amber ff99SB and TIP3P water model. A cutoff of 8 Å was applied for the LennardJones and shortrange electrostatic interactions. For the longrange electrostatic interactions, we used the Particle Mesh Ewald method (Darden et al., 1993). All bonds involving hydrogen atoms were constrained with the SHAKE/SETTLE algorithm (Miyamoto and Kollman, 1992; Ryckaert et al., 1977). Using hydrogen mass repartitioning (Hopkins et al., 2015), a time step of 4 fs was used. Temperature and pressure were controlled by a Berendsen thermostat (Berendsen et al., 1984) with a coupling constant of 1 ps and the Monte Carlo barostat, respectively. Trajectories were saved every 200 ps. Q was calculated following the definition of Best et al. (2013).
It is known that conventional force fields including Amber99SB used in this study overstabilize compact states in the unfolded or disordered states (Piana et al., 2014, 2015). Recently, Best and coworkers modified shortrange proteinrwater pair interactions to correct this bias for a derivative of the Amber ff03 force field with the TIP4P/2005 water model (Best et al., 2014). Specifically, they scaled the LennardJones ε_{Oi} between the oxygen of water molecules and all protein atoms by using a factor of 1.1. In order to sample noncompact conformations, we scaled ε_{Oi} of Amber ff99SB and TIP3P in the same manner, and conducted folding simulations. Starting from the unfolded structures, which were generated in the NVT ensemble (600 K), we performed ten 7 μs simulations in the NPT ensemble (1 atm, and 360 K, slightly lower than the previous case for more conservative simulations). We confirmed that the unfolded states in these trajectories prefer more elongated conformations compared with the original Amber ff99SB. However, we did not observe any folding events, suggesting that the scaling may destabilize the native state at least in the case of Amber ff99SB and TIP3P (Figure 2—figure supplement 1B). Thus, we decided to use only the simulation data of the original Amber ff99SB in this work. For these additional simulations, we used GENESIS (Jung et al., 2015; Kobayashi et al., 2017) and K computer as well as the Amber 14 PMEMD module and GPU computers. All structural figures were prepared with VMD (Humphrey et al., 1996).
Markov state model and semisupervised learning
Request a detailed protocolThe regular spatial clustering was applied with RegularSpatial function in MSMBuilder (Harrigan et al., 2017) in the 2D space spanned by Q and ε.
$h\left({o}_{i}{s}_{i}\right)$ in the likelihood function (Equation 2) is the probability of observing donor and acceptor photoncounts ${o}_{i}=\left\{{N}_{D},{N}_{A}\right\}$ given MSM’s state s_{i}. Denoting the donor and acceptor photon count rates in the state s_{i} by n_{D}(s_{i}) and n_{A}(s_{i}), this probability is given by the product of Poisson distributions (Gopich and Szabo, 2012),
Following the previous analysis by Chung and coworkers (Chung et al., 2012), we applied the condition that the sum of the donor and acceptor count rates is independent of the conformational states, that is, $n={n}_{D}\left({s}_{i}\right)+{n}_{A}\left({s}_{i}\right)\equiv \text{const}$. This condition is met when the gamma factor, which is the ratio of the quantum yields and detection efficiencies of the acceptor and donor photons, is equal to one in all conformational states. Under this condition, Equation 3 is rewritten as
Here, the expected FRET efficiency $\epsilon ={n}_{A}\left({s}_{i}\right)/\left({n}_{D}\left({s}_{i}\right)+{n}_{A}\left({s}_{i}\right)\right)$ is related to the distance between donor and acceptor r(s_{i}) through the Förster theory,
where R_{0} and κ^{2} are the Förster radius and the orientation factor between the transition dipoles of dyes, respectively. By analyzing the structures in MD simulations, we evaluated the contribution of the orientation factor κ^{2} to R_{0}. We calculated the directions of the dipoles by assuming that the transition dipole moments are aligned with the long axis of each chromophore (Best et al., 2015). Rather than evaluating the orientation factor of each MSM state, we evaluated the averages and standard deviations of the orientation factor in four local regions defined along the Q and ε axes respectively because fluctuations in the instantaneous orientation factor required a large number of samples. The average (standard deviation) of each region along the Q axis was κ^{2} = 0.63 (0.64) for Q = 0.00–0.25, 0.63 (0.64), for Q = 0.25–0.50, 0.64 (0.63), for Q = 0.50–0.75, and 0.60 (0.62) for Q = 0.75–1.00. Also, the average (standard deviation) of each region along the ε axis was κ^{2} = 0.66 (0.69) for ε = 0.00–0.25, 0.64 (0.68) for ε = 0.25–0.50, 0.61 (0.66) for ε = 0.50–0.75, and 0.63 (0.64) for ε = 0.75–1.00. The results suggest that κ^{2} hardly depends on states within standard deviations. Thus, we here employed the isotropic average approximation κ^{2} = 2/3, and R_{0} = 56 Å (Jäger et al., 2006) was used. In the same way, the donoracceptor distance r was calculated from the geometric centers of the donor and acceptor chromophores. The averaged value of r within each state s_{i} was used for r(s_{i}).
The total log likelihood function $\mathrm{I}\mathrm{n}\text{}L\left(\mathbf{T}\left(\tau \right)\right)=\sum _{k}\mathrm{I}\mathrm{n}\text{}{L}_{k}\left(\mathbf{T}\left(\tau \right)\right)$ of observing smFRET timeseries data was optimized using the BaumWelch algorithm (Rabiner and Juang, 1986), imposing the detailedbalance condition as a constraint (McGibbon et al., 2014a; Noé et al., 2013). A numerical benefit of imposing the condition is that the maximum eigenvalue of the transition probability matrix always becomes one and its corresponding eigenvector represents the equilibrium probabilities of states. For this intensive calculation, inhouse MATLAB codes (https://github.com/ymatsunaga/mdtoolbox) were developed and parallelized over photonsequences. The codes are publicly available at https://github.com/ymatsunaga/mdtoolbox under the BSD 3Clause License (Matsunaga, 2018); copy archived at https://github.com/elifesciencespublications/mdtoolbox). In the BaumWelch algorithm, the parameters whose initial values are zero are always kept as zero. In order to relax this topological constraint, very weak random noise was added to ${T}_{\text{simulation}}\left(\tau \right)$ before the optimization. In the early phase of the optimization (100 steps), unfolded states are stabilized irrespective of their compactness (with the likelihood value of $\mathrm{I}\mathrm{n}\text{}L\left(\mathbf{T}\left(\tau \right)\right)$ = −588,314, Figure 2—figure supplement 7B). Then, during the convergence of the likelihood in an optimization of 10,000 steps, the compact unfolded state disappeared while the folded state becomes stabilized (with a larger likelihood value of $\mathrm{I}\mathrm{n}\text{}L\left(\mathbf{T}\left(\tau \right)\right)$ = −584,947, Figure 2—figure supplement 7C).
In order to examine the overfitting of the model to the smFRET data, we divided smFRET data into halves, and the likelihood optimization was independently applied to the two subsets. Both results generated qualitatively similar network structures as with the full data set (Figure 2—figure supplement 4).
To see the dependence on the choice of the Förster radius R_{0}, we carried out the unsupervised learning using a set of different R_{0} values (R_{0} = 54 Å, 55 Å, 57 Å, and 58 Å) with the same 87 states. The overall structure of the MSM network was robust against the choice of R_{0} except for R_{0} = 58 Å (Figure 2—figure supplement 5).
It is known that Equation (5) only approximately holds at the weak excitationlimit of the donor dye. Here, its validity of the assumption is questionable because a very high intensity laser (10 kW/cm^{2}) was used in the current smFRET measurement to increase the number of photons (Chung et al., 2012). Thus, we examined the FRET efficiency outside the weakexcitation limit by using the following relation given by Camley et al. (2009):
Here, Λ depends on all rates of dye photophysics other than the energy transfer. Λ = 1 corresponds to the weakfield limit (Equation 5). Λ >1 reflects the inability of doubly excited dye pairs to undergo FRET within commonly accepted physical models. Here, using the same 87 states, we optimized the likelihood function by using ε’ with Λ = 1.065, a value used in Camley et al., 2009. The optimized model is plotted in Figure 2—figure supplement 6. In the figure, although the intermediate states look more stabilized, the locations of stabilized states are qualitatively the same as the weakfield limit (Λ = 1, Equation 5). This suggests that the folding mechanism is robust against the definition of ε.
In order to evaluate the dependence on the initial condition, we also performed the optimization using a random matrix as the initial condition. The convergence of the likelihood function is shown in Figure 2—figure supplement 8.
Analysis of Markov state models
Request a detailed protocolWe analyzed the dynamic properties of the constructed MSMs by generating long MSM simulation trajectories with stochastic simulations. We first generated trajectories of states by using $T\left(\tau \right)$ (τ = 200 ns). Specifically, a random number between 0 and 1 was drawn at every step to determine which state the system will jump to in the next step according to $T\left(\tau \right)$. For the reproducibility test against the original smFRET data, we generated a total of 10 independent trajectories of states each having the same timelength as the smFRET data, and then virtually emitted photons according to the likelihood function (Equation 2) from the states. We compared the histogram of observed FRET efficiencies E = N_{A}/(N_{A} + N_{D}) using 200 ns and 50 μs timewindows.
In order to examine the overfitting again, we performed a kfold cross validation test (with k=4) and calculated errors by using histograms of measured FRET efficiencies E. We partitioned the smFRET data set into k subsets (k1 subsets as the training data, a single subset as the test data). We evaluated the mean squared error between the normalized FRET efficiency histograms calculated from the model and the subsets of the smFRET data (shown in Figure 3—figure supplement 1). The prediction error (the socalled cross validation error) for the test data was found to be $11.8\times {10}^{5}$, which is quite small. This suggests that overfitting is not a critical issue in our modeling.
Trajectories of conformations were generated from the trajectories of states by choosing a random conformation from a state at each step. These conformational trajectories were used to characterize the timecourse behavior of Q and the gyration radius (Figure 4—figure supplement 1), as well as the transition state ensemble (Figure 4E and F).
The folding behavior was further characterized by calculating p_{fold}, the probability of a given state to fold before it unfolds. The p_{fold} was solved by applying the transitionpath theory (Metzner et al., 2009; Noé et al., 2009) (with committors function in MSMBuilder [Harrigan et al., 2017]). The p_{fold} was mapped onto geometric space (CαRMSDs of the hairpins 1 and 2) using the trajectories generated as described above.
We conducted pathway analysis from the unfolded to the folded state by decomposing the flux of folding trajectories into individual pathways (Metzner et al., 2009; Noé et al., 2009). In the algorithm, after calculating the net flux matrix between states, the largest flux pathway from the unfolded to the folded state was searched by using Dijkstra’s algorithm. Then, the largest flux was subtracted from the net flux matrix. Subsequently, the second largest flux pathway was determined by using Dijkstra’s algorithm. Representative pathways were obtained by repeating this procedure using the paths function of MSMbuilder (Harrigan et al., 2017).
Data availability
All of the inhouse program codes for the method proposed in this study, including test datasets, are available at GitHub: https://github.com/ymatsunaga/mdtoolbox.
References

Bayesian energy landscape tilting: towards concordant models of molecular ensemblesBiophysical Journal 106:1381–1390.https://doi.org/10.1016/j.bpj.2014.02.009

Molecular dynamics with coupling to an external bathThe Journal of Chemical Physics 81:3684–3690.https://doi.org/10.1063/1.448118

Balanced proteinwater interactions improve properties of disordered proteins and nonspecific protein associationJournal of Chemical Theory and Computation 10:5113–5124.https://doi.org/10.1021/ct500569b

Metainference: A Bayesian inference method for heterogeneous systemsScience Advances 2:e1501177.https://doi.org/10.1126/sciadv.1501177

Combining experiments and simulations using the maximum entropy principlePLoS Computational Biology 10:e1003406.https://doi.org/10.1371/journal.pcbi.1003406

Förster transfer outside the weakexcitation limitThe Journal of Chemical Physics 131:104509.https://doi.org/10.1063/1.3230974

Particle mesh Ewald: An N ⋅log( N ) method for Ewald sums in large systemsThe Journal of Chemical Physics 98:10089–10092.https://doi.org/10.1063/1.464397

The proteinfolding problem, 50 years onScience 338:1042–1046.https://doi.org/10.1126/science.1219021

Inferring microscopic kinetic rates from stationary state distributionsJournal of Chemical Theory and Computation 10:3002–3005.https://doi.org/10.1021/ct5001389

Caliber corrected markov modeling (c_{2}m_{2}): Correcting equilibrium markov modelsJournal of Chemical Theory and Computation 14:1111–1119.https://doi.org/10.1021/acs.jctc.7b01126

Inferring transition rates of networks from populations in continuoustime markov processesJournal of Chemical Theory and Computation 11:5464–5472.https://doi.org/10.1021/acs.jctc.5b00537

Conformational dynamics of apoglnbp revealed by experimental and computational analysisAngewandte Chemie International Edition 55:13990–13994.https://doi.org/10.1002/anie.201606613

Tenmicrosecond molecular dynamics simulation of a fastfolding WW domainBiophysical Journal 94:L75–L77.https://doi.org/10.1529/biophysj.108.131565

Theory of singlemolecule FRET efficiency histogramsAdvances in Chemical Physics 146:245–297.https://doi.org/10.1002/9781118131374.ch10

Decoding the pattern of photon colors in singlemolecule FRETThe Journal of Physical Chemistry B 113:10965–10973.https://doi.org/10.1021/jp903671p

AMBERDYES: Characterization of charge fluctuations and force field parameterization of fluorescent dyes for molecular dynamics simulationsJournal of Chemical Theory and Computation 10:5505–5512.https://doi.org/10.1021/ct500869p

Expectationmaximization of the potential of mean force and diffusion coefficient in Langevin dynamics from single molecule FRET data photon by photonThe Journal of Physical Chemistry B 117:15591–15605.https://doi.org/10.1021/jp405983d

MSMBuilder: Statistical models for biomolecular dynamicsBiophysical Journal 112:10–15.https://doi.org/10.1016/j.bpj.2016.10.042

LongTimeStep Molecular Dynamics through Hydrogen Mass RepartitioningJournal of Chemical Theory and Computation 11:1864–1874.https://doi.org/10.1021/ct5010406

Comparison of multiple Amber force fields and development of improved protein backbone parametersProteins: Structure, Function, and Bioinformatics 65:712–725.https://doi.org/10.1002/prot.21123

VMD: visual molecular dynamicsJournal of Molecular Graphics 14:33–38.https://doi.org/10.1016/02637855(96)000185

GENESIS: a hybridparallel and multiscale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulationsWiley Interdisciplinary Reviews: Computational Molecular Science 5:310–323.https://doi.org/10.1002/wcms.1220

The folding mechanism of a betasheet: the WW domainJournal of Molecular Biology 311:373–393.https://doi.org/10.1006/jmbi.2001.4873

Evaluation and optimization of discrete state models of protein foldingThe Journal of Physical Chemistry B 116:11405–11413.https://doi.org/10.1021/jp3044303

GENESIS 1.1: A hybridparallel molecular dynamics simulator with enhanced sampling algorithms on multiple computational platformsJournal of Computational Chemistry 38:2193–2206.https://doi.org/10.1002/jcc.24874

SPFP: Speed without compromise—A mixed precision model for GPU accelerated molecular dynamics simulationsComputer Physics Communications 184:374–380.https://doi.org/10.1016/j.cpc.2012.09.022

Structural analysis of WW domains and design of a WW prototypeNature Structural Biology 7:375–379.https://doi.org/10.1038/75144

Diffusion constant of the TIP5P model of liquid waterThe Journal of Chemical Physics 114:363–366.https://doi.org/10.1063/1.1329346

Sequential data assimilation for singlemolecule FRET photoncounting dataThe Journal of Chemical Physics 142:214115.https://doi.org/10.1063/1.4921983

Efficient maximum likelihood parameterization of continuoustime Markov processesThe Journal of Chemical Physics 143:034109.https://doi.org/10.1063/1.4926516

ConferenceUnderstanding Protein Dynamics with L1Regularized Reversible Hidden Markov ModelsPaper Presented at the Proc. 31st Intl. Conf. on Machine Learning (ICML).

Statistical model selection for Markov models of biomolecular dynamicsThe Journal of Physical Chemistry B 118:6475–6481.https://doi.org/10.1021/jp411822r

Analysis of singlemolecule FRET trajectories using hidden Markov modelingBiophysical Journal 91:1941–1951.https://doi.org/10.1529/biophysj.106.082487

Transition path theory for markov jump processesMultiscale Modeling & Simulation 7:1192–1219.https://doi.org/10.1137/070699500

Settle: An analytical version of the SHAKE and RATTLE algorithm for rigid water modelsJournal of Computational Chemistry 13:952–962.https://doi.org/10.1002/jcc.540130805

Folding, misfolding, and amyloid protofibril formation of WW domain FBP28Biophysical Journal 90:3983–3992.https://doi.org/10.1529/biophysj.105.076406

Projected and hidden Markov models for calculating kinetics and metastable states of complex moleculesThe Journal of Chemical Physics 139:184114.https://doi.org/10.1063/1.4828816

Probabilistic determination of native state ensembles of proteinsJournal of Chemical Theory and Computation 10:3484–3491.https://doi.org/10.1021/ct5001236

Phianalysis at the experimental limits: mechanism of betahairpin formationJournal of Molecular Biology 360:865–881.https://doi.org/10.1016/j.jmb.2006.05.050

Water dispersion interactions strongly influence simulated structural properties of disordered protein statesThe Journal of Physical Chemistry B 119:5113–5123.https://doi.org/10.1021/jp508971m

Assessing the accuracy of physical models used in proteinfolding simulations: quantitative evidence from long molecular dynamics simulationsCurrent Opinion in Structural Biology 24:98–105.https://doi.org/10.1016/j.sbi.2013.12.006

Photonbyphoton hidden markov model analysis for microsecond singlemolecule fret kineticsThe Journal of Physical Chemistry B 120:13065–13075.https://doi.org/10.1021/acs.jpcb.6b10726

On the use of experimental observations to bias simulated ensemblesJournal of Chemical Theory and Computation 8:3445–3451.https://doi.org/10.1021/ct300112v

Markov models of molecular kinetics: generation and validationThe Journal of Chemical Physics 134:174105.https://doi.org/10.1063/1.3565032

An introduction to hidden Markov modelsIEEE ASSP Magazine 3:4–16.https://doi.org/10.1109/MASSP.1986.1165342

On the statistical equivalence of restrainedensemble simulations with the maximum entropy methodThe Journal of Chemical Physics 138:084107.https://doi.org/10.1063/1.4792208

Communication: Consistent interpretation of molecular simulation kinetics using Markov state models biased with external informationThe Journal of Chemical Physics 144:051102.https://doi.org/10.1063/1.4941455

Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of nalkanesJournal of Computational Physics 23:327–341.https://doi.org/10.1016/00219991(77)900985

Routine microsecond molecular dynamics simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh EwaldJournal of Chemical Theory and Computation 9:3878–3888.https://doi.org/10.1021/ct400314y

Maximum likelihood trajectories from single molecule fluorescence resonance energy transfer experimentsThe Journal of Chemical Physics 119:9920–9924.https://doi.org/10.1063/1.1616511

Free energy surfaces from singledistance informationThe Journal of Physical Chemistry B 114:15227–15235.https://doi.org/10.1021/jp1053698

Improvements in markov state model construction reveal many nonnative interactions in the folding of NTL9Journal of Chemical Theory and Computation 9:2000–2009.https://doi.org/10.1021/ct300878a

EMMA: A Software Package for Markov Model Building and AnalysisJournal of Chemical Theory and Computation 8:2223–2238.https://doi.org/10.1021/ct300274u

Extraction of protein conformational modes from distance distributions using structurally imputed bayesian data augmentationThe Journal of Physical Chemistry B 120:10469–10482.https://doi.org/10.1021/acs.jpcb.6b07767

BookTransfer LearningIn: Soria E, Martin J, Magdalena R, Martinez M, Serrano A, editors. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques 1. IGI Global. pp. 242–264.

Statistical mechanical theory of the protein conformation. Ii. Folding pathway for proteinJournal of the Physical Society of Japan 44:1939–1945.https://doi.org/10.1143/JPSJ.44.1939

A maximumcaliber approach to predicting perturbed folding kinetics due to mutationsJournal of Chemical Theory and Computation 12:5768–5776.https://doi.org/10.1021/acs.jctc.6b00938

Information bounds and optimal analysis of dynamic single molecule measurementsBiophysical Journal 86:4015–4029.https://doi.org/10.1529/biophysj.103.037739

Probing the Action of Chemical Denaturant on an Intrinsically Disordered Protein by Simulation and ExperimentJournal of the American Chemical Society 138:11702–11713.https://doi.org/10.1021/jacs.6b05443

Introduction to semisupervised learningSynthesis Lectures on Artificial Intelligence and Machine Learning 3:1–130.https://doi.org/10.2200/S00196ED1V01Y200906AIM006

Collapse transition in proteinsPhysical Chemistry Chemical Physics 11:83–93.https://doi.org/10.1039/B813961J
Decision letter

Nir BenTalReviewing Editor; Tel Aviv University, Israel
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Linking time series of singlemolecule experiments with molecular dynamics simulations by machine learning" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Arup Chakraborty as the Senior Editor. The following individual involved in review of your submission has agreed to reveal his identity: John Straub (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
The manuscript describes a novel approach based on hidden Markov models to connect singlemolecular FRET measurements with MD simulations, and application to the folding of forminbinding protein WW domain. The manuscript address two key issues in improving the validity of Markov state models used in protein folding studies, namely, 1) the construction of the discrete states, and 2) the construction of the continuous time Markov transition matrix T(\tau). Initially, discrete conformational states were obtained by clustering trajectory snapshots, and transition rates were then estimated from trajectories. To correct potential bias due to problems in the force field, the authors then used a hidden Markov model to model smFRET time series measurement and a) modified the discrete states and b) estimated the T(\tau) values between the discrete states so the smFRET is better explained. Indeed, the refined Markov state model can better account for the singlemolecule timeseries data: The improvement in reproducing the smFRET histogram of 50 uS as shown in Figure 3 is dramatic. It further revealed a different folding pathway of the WWdomain, as well as a different transition state ensemble.
The approach of using MDderived MSM as basis and initial input for likelihood optimization with respect to smFRET data is very interesting and novel, and the calculations and uncertainty analysis were conducted carefully. The insight into the folding mechanism is very interesting. However, quite a few outstanding issues, including some serious concerns, remain to be addressed to assess whether publication is warranted.
Essential revisions:
1) An overall question about the current practice of Markov state models in MDbased protein folding studies. Assuming both simulation and experimental studies are under the same conditions (salt, dye, temperature, etc.), they are probing the same underlying physical process. Therefore, they should reflect the same groundtruth reality. As MD can provide atomistic details within the time scale of simulation, MD studies should provide far more details than smFRET. However, this is not the case. Rather, MD simulation cannot stand on its own, and needs to be corrected by smFRET. As described by the authors, the MSM is shifted before and after incorporating smFRET data. The authors showed that the discrete states are very different when smFRET data are brought into consideration. While it is not explicitly stated (please clarify), it seems that the refined Markov states are derived from the original Markov states by clustering of MD trajectories, but updated with new equilibrium probability (hence different bubble areas in Figure 2) and new transition rates. How can the ground truth, for example, equilibrium probabilities of the Markov states, change if it is examined one way (by MD) or another way (MD+smFRET)? This would suggest there is no objective ground truth that is accessible by current MD simulations.
Lending support to this conclusion, is the discussion you provided, that in other studies prior to this manuscript, the number of discrete states and the nature of these states are all malleable, depending on which metric one examines (RMSD or contact map). This would again suggest that there is no objective truth in determining what constitutes a set of discrete states that are Markovian and what their transition rates are. Rather, all depends on which metric and/or which additional experimental data one chooses to examine.
One cannot help but speculate: would one expect some other alternative discrete Markovian states different from the currently reported ones to emerge, when other types of experimental time series data other than \epsilon is incorporated. Is it likely that yet more different Markovian states, different dynamics of transition rates, different folding dynamics, and different transition state ensembles will be identified, all for the same WWdomain under the same condition (dye, pH, salt, temperature, if well controlled)?
This aspect of shifting ground truth in MD simulation and Markov state modeling is rather unsettling. A pessimistic, but not unreasonable, view is that without a principled approach in defining true reaction coordinates for protein folding one has to make do with rather ad hoc and heuristic approaches in defining the discrete states with unexamined consequences, and hence this unsettling aspect may be with us for a long while and the truth may be elusive.
Here is a suggestion in this respect: The authors employ Equation 5 for a model of FRET efficiency, however, more detailed models exist such as http://aip.scitation.org/doi/pdf/10.1063/1.3230974
The authors could evaluate their approach and determine if a more detailed model might be justified or needed in making contact with experimental measurements.
2) While concern 1 is more general, there is another question more specific to connecting MD and the smFRET measurements. It will be necessary for the authors to show how well the HMM learned T(\tau) conform with MD simulation generated T(\tau) after the refined discrete states are obtained, in a statistically significant way. For example, the authors should compare the HMM rates and the MD rates for the subsets of statestate transitions with the highest and relevant lowest rates among the 87x87 parameters. In addition, the flux and pathway analysis should also be carried out using MDderived rates after the refined states are obtained by the HMM model, for key paths carrying most of the flux, and compared with the analysis using HMM rates. The authors may have already done some of these in the transition state ensemble analysis, and may already have all necessary data.
This is important for validating the connection between the smFRET (HMM modeling) time series and the MD trajectories the authors are proposing. It would be fantastic if all works out, as this would naturally suggest that in the future one no longer needs to do MD simulation. Rather, one could use smFRET to fit dynamic rates, as long as the discrete states can be obtained, for example, by some other means. However, one should be cautious about being overly optimistic here, as there are significant problems in the enormous search space of very high dimension. It will be very interesting if the authors can show that the HMM rates and MD rates after refinement are the same.
3) Potential overfitting remains a concern, as there is a large number (87x87 = 7,569) parameters that need to be fitted. While the authors qualitatively compared results using each half of the data through plotting, it would be necessary to do some crossvalidation tests, which are standard in machine learning. Specifically, the authors could use (n1)/n balanced smFRET and MD data (e.g. similar number of folding trajectories) to identify refined states, and estimating \tau values, then test whether the remaining 1/n data falls within these defined states, and whether the unseen epsilon histograms can be reproduced accurately.
4) In Equation 5, the average values are used in calculating \epsilon. It seems that kappa has little difference at different regions of Q. Is the difference along the dimension of \epsilon that is used for clustering due to difference in distances r(s_{i})? It will be helpful to also show both mean and variance of r(s_{i}) for each of the major discrete states, and justify whether the degree of incluster and betweencluster heterogeneity/homogeneity conforms to the Markovian assumption of each discrete state. This relates to the issue of validating the defined discrete states.
5) The calculations of FRET efficiencies from MD trajectories need to be stated explicitly. Furthermore, the estimation of FRET efficiencies from experimentally measured photon sequences also needs to be presently clearly. The latter in fact is a longstanding issue in the singlemolecule community, for example, the time binning convention and the informationbase binning by Haw Yang.
6) It is not obvious how the folding pathways were decomposed into individual pathways for both MSM models. How this procedure was conducted would critically impact the interpretation of results. This part thus needs careful and clear explanation.
7) Certain key references of interring dynamics from smFRET trajectories were missed in the context of this work, such as Haas et al., 2013.
8) Obtaining a solution of similar likelihood value starting from a random matrix is a bit worrisome. Can the authors elaborate more on how this model was ruled out? Furthermore, the convergence of likelihood optimization should also be discussed to reveal the robustness of this approach.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Linking time series of singlemolecule experiments with molecular dynamics simulations by machine learning" for further consideration at eLife. Your revised article has been favorably evaluated by Arup Chakraborty (Senior Editor), a Reviewing Editor, and three reviewers.
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:
Overview:
The unsettling aspects of shifting ground truth, or rather, the lack of ground truth remains. This is, however, an issue the community of protein folding simulations has to deal with.
Using Q as an adequate reaction coordinate is similarly problematic and inadequate, as Q is no substitute to reaction coordinates in the strict sense of earlier classic studies of Chandler and Dinner. But this again is a community problem, and we do not wish to specifically penalize the authors of the current work. Therefore, we accept the authors' arguments, although at least one of the reviewers had some fundamental concerns about the entire set of approaches in this area.
Specific points to address:
1) There is one issue that the authors did not provide answers to but they can easily do, namely, regarding the question of providing details of the 87x87 rates from the Hidden Markov model and the corresponding rates from MD simulation. It will be necessary to make first a general statement on how similar/different these two sets of matched/paired rates are, and second, provide details, e.g., in the Appendix, with sidebyside or overlaid histograms. If these two distributions are similar, it would be a comforting result. However, even if these two distributions do not agree, it will be important for readers to know this fact so they can directly draw their own inferences. This can be easily done with existing data.
2) A grammar and expression overhaul is highly recommended to enhance the readability of the work.
https://doi.org/10.7554/eLife.32668.021Author response
Essential revisions:
1) An overall question about the current practice of Markov state models in MDbased protein folding studies. Assuming both simulation and experimental studies are under the same conditions (salt, dye, temperature, etc.), they are probing the same underlying physical process. Therefore, they should reflect the same groundtruth reality. As MD can provide atomistic details within the time scale of simulation, MD studies should provide far more details than smFRET. However, this is not the case. Rather, MD simulation cannot stand on its own, and needs to be corrected by smFRET. As described by the authors, the MSM is shifted before and after incorporating smFRET data. The authors showed that the discrete states are very different when smFRET data are brought into consideration. While it is not explicitly stated (please clarify), it seems that the refined Markov states are derived from the original Markov states by clustering of MD trajectories, but updated with new equilibrium probability (hence different bubble areas in Figure 2) and new transition rates. How can the ground truth, for example, equilibrium probabilities of the Markov states, change if it is examined one way (by MD) or another way (MD+smFRET)? This would suggest there is no objective ground truth that is accessible by current MD simulations.
We agree with the reviewers that both simulation and experimental data should reflect the same groundtruth reality. However, despite improvements over the decades, force field parameters used in MD simulation are not perfectly correct as reaching the groundtruth reality. In particular, while the local interactions are well described by the current force fields it is still difficult to reproduce the energetic balance between global states (such as unfolded and folded states). Indeed, Piana and coworkers (Piana et al., 2011) showed, in their protein folding simulation study, the folding mechanism of the villin headpiece depends substantially on the choice of force field. It is also known that most force fields produce unfolded states that are more compact and structured than those suggested experimentally (Piana et al., 2014). As suggested by these studies, a subtle free energy balance between the unfolded and folded states is crucial for determining the folding mechanism. Considering this background, the motivation of this work is to make MD data toward the groundtruth reality in a datadriven way. We reweighted the transition probabilities by using smFRET data to achieve the subtle balance in free energy. In the future study, using the free energy landscape and folding pathways obtained by our scheme, it is possible to improve force field parameters closer to the ground truth reality. According to this discussion, we have revised the Introduction section (second paragraph).
Lending support to this conclusion, is the discussion you provided, that in other studies prior to this manuscript, the number of discrete states and the nature of these states are all malleable, depending on which metric one examines (RMSD or contact map). This would again suggest that there is no objective truth in determining what constitutes a set of discrete states that are Markovian and what their transition rates are. Rather, all depends on which metric and/or which additional experimental data one chooses to examine.
One cannot help but speculate: would one expect some other alternative discrete Markovian states different from the currently reported ones to emerge, when other types of experimental time series data other than \epsilon is incorporated. Is it likely that yet more different Markovian states, different dynamics of transition rates, different folding dynamics, and different transition state ensembles will be identified, all for the same WWdomain under the same condition (dye, pH, salt, temperature, if well controlled)?
This aspect of shifting ground truth in MD simulation and Markov state modeling is rather unsettling. A pessimistic, but not unreasonable, view is that without a principled approach in defining true reaction coordinates for protein folding one has to make do with rather ad hoc and heuristic approaches in defining the discrete states with unexamined consequences, and hence this unsettling aspect may be with us for a long while and the truth may be elusive.
We agree with the reviewers that Markov state modeling depends on the choice of coordinate or metric one examines. It generally quite difficult to find the true reaction coordinate for defining states in complex systems. On the other hand, in the case of protein folding problem, theoretical and simulation studies over the decades have shown that the fraction of native contacts, Q, is the best reaction coordinate for describing folding process. For example, Best and coworkers, by analyzing folding MD simulation data of various proteins, showed that Q successfully captures the transition states for all the proteins (Best, Hummer and Eaton, 2013). The FRET efficiency \epsilon was chosen for matching with smFRET data and for differentiating the compact and extended states. These two coordinates (Q and \epsilon) are necessary and sufficient for describing the progress of folding as well as refining the bias of the current force field parameters toward compact structures. Also, since the structure of WW domain is rather simple (threestrands, betasheetonly), these coordinates are enough for uniquely identifying tertiary structures. Considering these reasons and the validation through the comparison with the Phivalue analysis, we think that the picture obtained by the current MSM may be the robust against the choice of coordinates. We have added this discussion in the Results section (subsection “Molecular dynamics simulations”, second paragraph).
Here is a suggestion in this respect: The authors employ Equation 5 for a model of FRET efficiency, however, more detailed models exist such as http://aip.scitation.org/doi/pdf/10.1063/1.3230974
The authors could evaluate their approach and determine if a more detailed model might be justified or needed in making contact with experimental measurements.
We thank the reviewers for letting us know the important paper. Following the reviewers’ suggestion, we applied our scheme with the new definition for FRE efficiency 𝜀 = 1⁄[Λ + (𝑟⁄𝑅_{*})^{}] where Λ > 1. Here, we optimized the likelihood function with Λ = 1.065, a value used in the paper. The result has been added as Figure 2—figure supplement 6. In the figure, although the transition states look more stabilized, the locations of stabilized states are qualitatively same as the original one (Λ = 1.0). This suggests that the story of the manuscript would be robust against the definition of ε (subsection “Markov state model and semisupervised learning”, seventh paragraph).
2) While concern 1 is more general, there is another question more specific to connecting MD and the smFRET measurements. It will be necessary for the authors to show how well the HMM learned T(\tau) conform with MD simulation generated T(\tau) after the refined discrete states are obtained, in a statistically significant way. For example, the authors should compare the HMM rates and the MD rates for the subsets of statestate transitions with the highest and relevant lowest rates among the 87x87 parameters. In addition, the flux and pathway analysis should also be carried out using MDderived rates after the refined states are obtained by the HMM model, for key paths carrying most of the flux, and compared with the analysis using HMM rates. The authors may have already done some of these in the transition state ensemble analysis, and may already have all necessary data.
This is important for validating the connection between the smFRET (HMM modeling) time series and the MD trajectories the authors are proposing. It would be fantastic if all works out, as this would naturally suggest that in the future one no longer needs to do MD simulation. Rather, one could use smFRET to fit dynamic rates, as long as the discrete states can be obtained, for example, by some other means. However, one should be cautious about being overly optimistic here, as there are significant problems in the enormous search space of very high dimension. It will be very interesting if the authors can show that the HMM rates and MD rates after refinement are the same.
In this study, the states were defined through the structural clustering of 400 microseconds MD simulation data. As 400 microseconds should be long enough for sampling all possible structures of WW domain, we think that the defined states are transferable during the HMM refinement. We would like to point that we already characterized the difference in T(\tau) from the MD and HMM refinement: Figure 4 shows the result of flux analysis of folding trajectories, which reveals the bottleneck of folding (the transition state ensemble). Figure 5 shows the results of the pathway analysis, which reveals folding pathways with largest fluxes.
As the reviewers pointed out, it would be nice if all transition probabilities are accurately estimated by the HMM from smFRET data. However, as shown by the HMM refinement starting from a random matrix (in response to comment 8), it is still difficult to estimate accurate kinetics along the Q axis from the limited number of smFRET observations. This suggests an importance of complementary use of both data (MD and smFRET), as proposed by our scheme (the MD information as the initial condition for HMM).
3) Potential overfitting remains a concern, as there is a large number (87x87 = 7,569) parameters that need to be fitted. While the authors qualitatively compared results using each half of the data through plotting, it would be necessary to do some crossvalidation tests, which are standard in machine learning. Specifically, the authors could use (n1)/n balanced smFRET and MD data (e.g. similar number of folding trajectories) to identify refined states, and estimating \tau values, then test whether the remaining 1/n data falls within these defined states, and whether the unseen epsilon histograms can be reproduced accurately.
We thank the reviewers for the important suggestion. Following the reviewers’ comment, we have performed a kfold cross validation test (with k=4) by partitioning the smFRET data set into k subsets (k1 subsets as the training data, a single subset as the test data). We have evaluated the mean squared errors between two normalized FRET efficiency histograms calculated from the model and the subsets of smFRET data. The prediction error (the socalled cross validation error) for the test data was found to be 11.8 × 10^{78}, which is quite small. This suggests that the overfitting may not be a major issue in our modeling. We have added this result (Figure 3—figure supplement 1) in the Materials and methods section (subsection “Analysis of Markov state models”, second paragraph).
4) In Equation 5, the average values are used in calculating \epsilon. It seems that kappa has little difference at different regions of Q. Is the difference along the dimension of \epsilon that is used for clustering due to difference in distances r(s_{i})? It will be helpful to also show both mean and variance of r(s_{i}) for each of the major discrete states, and justify whether the degree of incluster and betweencluster heterogeneity/homogeneity conforms to the Markovian assumption of each discrete state. This relates to the issue of validating the defined discrete states.
Following the reviewers’ comment, we have calculated the orientation factor k^{2} along the e (FRET efficiency) axis. The average (standard deviation) in each region was k^{2} = 0.66 (0.69) for e = 0.000.25, 0.64 (0.68) for e = 0.250.50, 0.61 (0.66) for e = 0.500.75, and 0.63 (0.64) for e = 0.751.00. Since the orientation factor k^{2} show little dependence on both Q and e axes, it would be reasonable to use the isotropic average approximation as employed in the manuscript. We have added this result in the Materials and methods section (subsection “Markov state model and semisupervised learning”, third paragraph).
Also, we have calculated both mean and variance of donoracceptor distance r(s_{i}) for each state. It has been plotted in the 2D space spanned by Q and r (added as Figure 2—figure supplement 2). The figure shows that there are distance gaps between states, which is supportive for the Markovian assumption for the states. Large standard deviations are observed within states of small donoracceptor distances. This is due to the worsening of spatial resolution in FRET efficiency when r(s_{i}) is small compared to R_{0} as discussed in the informationbinning study by HawYang and coworkers (Watkins and Yang, 2004). We have added this result (Figure 2—figure supplement 2) in the Materials and methods section (subsection “Molecular dynamics simulations”, last paragraph).
5) The calculations of FRET efficiencies from MD trajectories need to be stated explicitly. Furthermore, the estimation of FRET efficiencies from experimentally measured photon sequences also needs to be presently clearly. The latter in fact is a longstanding issue in the singlemolecule community, for example, the time binning convention and the informationbase binning by Haw Yang.
We have explicitly described the calculations of FRET efficiencies from MD trajectories and photon sequences in the Results and Materials and methods sections. In particular, we have now used different terms for two FRET efficiencies (‘FRET efficiency \epsilon’ estimated from MD, ‘Measured FRET efficiency E’ from photon counts). We here used the conventional constanttime binning because the Markov state modeling is based on the constanttime binning. As a future direction, it is promising to apply a continuoustime Markov modeling (McGibbon and Pande, 2015), which may allow us to use the informationbased binning (Watkins and Yang, 2004) or photonbyphoton analysis (Gopich and A. Szabo, 2009). We have described this direction in the Discussion section, adding citations for these works (fifth paragraph).
6) It is not obvious how the folding pathways were decomposed into individual pathways for both MSM models. How this procedure was conducted would critically impact the interpretation of results. This part thus needs careful and clear explanation.
We have described how the folding pathways were decomposed and its impact in the Results and Materials and methods sections. In the Results section we have added the following:
“The flux of folding trajectories was decomposed into individual pathways for both models (Figures 5A and 5B). […] The figures show dominant folding pathways with each of largest fluxes contributing to 50% of the total flux.”
7) Certain key references of interring dynamics from smFRET trajectories were missed in the context of this work, such as Haas et al., 2013.
We thank the reviewers for letting us know the important paper. We have added following references for recent advances in extracting more dynamical and structural information from smFRET trajectory (Introduction, first paragraph):
Haas, Yang and Chu, 2013; Matsunaga, Kidera, and Sugita, 2015; Sun, Morrell and Yang, 2016; Hoefling et al., 2011.
8) Obtaining a solution of similar likelihood value starting from a random matrix is a bit worrisome. Can the authors elaborate more on how this model was ruled out? Furthermore, the convergence of likelihood optimization should also be discussed to reveal the robustness of this approach.
We have extended the optimization calculation of a random matrix and examined the convergence of the log likelihood function. The result has been added as Figure 2—figure supplement 8. In the figure, the optimization of the log likelihood from random matrix looks stacked after 10000 iterations and its value is lower than that of MD data. This suggests that the optimization gets trapped in a local minimum in the parameter space. From these observations, we conclude that the model from a random matrix could be ruled out, or at least, global optimization of a random matrix is practically very inefficient. We have added this result (Figure 2—figure supplement 8) in the Materials and methods section (subsection “Markov state model and semisupervised learning”, last paragraph).
[Editors' note: further revisions were requested prior to acceptance, as described below.]
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:
Overview:
The unsettling aspects of shifting ground truth, or rather, the lack of ground truth remains. This is, however, an issue the community of protein folding simulations has to deal with.
Using Q as an adequate reaction coordinate is similarly problematic and inadequate, as Q is no substitute to reaction coordinates in the strict sense of earlier classic studies of Chandler and Dinner. But this again is a community problem, and we do not wish to specifically penalize the authors of the current work. Therefore, we accept the authors' arguments, although at least one of the reviewers had some fundamental concerns about the entire set of approaches in this area.
Specific points to address:
1) There is one issue that the authors did not provide answers to but they can easily do, namely, regarding the question of providing details of the 87x87 rates from the Hidden Markov model and the corresponding rates from MD simulation. It will be necessary to make first a general statement on how similar/different these two sets of matched/paired rates are, and second, provide details, e.g., in the Appendix, with sidebyside or overlaid histograms. If these two distributions are similar, it would be a comforting result. However, even if these two distributions do not agree, it will be important for readers to know this fact so they can directly draw their own inferences. This can be easily done with existing data.
Following the reviewers’ comment, we have added new plots comparing the rates of the two MSMs as Figure 2—figure supplement 4.
First (in panel A), we decomposed each transition probability matrix to obtain characteristic time scales before and after the hidden Markov modeling. We can see that the slowest time scale increases from 2.6 microseconds to 100 microseconds after the hidden Markov modeling. These time scales are related to folding/unfolding transitions of WW domain. Generally, simulation time scale is faster than experiments because of lower viscosity of TIP3P water model. This gap was improved by the information of the singlemolecule experimental data.
In panel B, 87x87 T_{ij} rates before and after the hidden Markov modeling are directly compared in a scatter plot where each rate is colored using the FRET efficiency e_{i}. T_{ij} rates from states with high FRET efficiencies are well correlated with each other. This means that T_{ij} rates between compact states derived from MD simulations are consistent with the experimental data. In contrast, T_{ij} rates from states with middle or low FRET efficiencies are less correlated. This means that hidden Markov modeling mainly “refined” the rates between elongated states.
In this revision, we have added these explanations in the main text.
2) A grammar and expression overhaul is highly recommended to enhance the readability of the work.
In this revision, a native speaker has carefully edited the language. We have also checked a number of expressions.
https://doi.org/10.7554/eLife.32668.022Article and author information
Author details
Funding
Japan Science and Technology Agency (JPMJPR1679)
 Yasuhiro Matsunaga
Ministry of Education, Culture, Sports, Science, and Technology (26120533)
 Yasuhiro Matsunaga
RIKEN (Dynamic Structural Biology)
 Yuji Sugita
Research Organization for Information Science and Technology (hp160022)
 Yasuhiro Matsunaga
 Yuji Sugita
Japan Science and Technology Agency (JPMJCR13M3)
 Yuji Sugita
Ministry of Education, Culture, Sports, Science, and Technology (26119006)
 Yuji Sugita
Research Organization for Information Science and Technology (ra000009)
 Yasuhiro Matsunaga
 Yuji Sugita
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We extend our sincere thanks to William A Eaton and Hoi Sung Chung for providing the singlemolecule FRET data and helpful comments on the draft. Thanks also to Satoshi Takahashi, Hiroyuki Oikawa, Simon Olsson and Suyong Re for their comments. We acknowledge help from Timo Graen in using the AMBERDYES force field. Computational resources were provided by HOKUSAI GreatWave in the RIKEN Advanced Center for Computing and Communication and by K computer in the RIKEN Advanced Institute for Computational Science through the HPCI System Research project (Project IDs. hp160022, hp170137 and ra000009). This research has been funded as a RIKEN pioneering project ‘Dynamic Structural Biology’ (funding awarded to YS), as part of the strategic programs for innovation research (‘Computational life science and application in drug discovery and medical development’ and ‘Novel measurement techniques for visualizing live protein molecules at work’ [Grant No. 26119006] [to YS]), by JST CREST under the ‘Structural Life Science and Advanced Core Technologies for Innovative Life Science Research’ program (Grant No. JPMJCR13M3) (to YS), by MEXT Japan under their ‘Initiative for HighDimensional DataDriven Science through Deepening of Sparse Modeling’ (Grant No. 26120533) (to YM), and by JST PRESTO (Grant No. JPMJPR1679) (to YM).
Reviewing Editor
 Nir BenTal, Tel Aviv University, Israel
Publication history
 Received: October 10, 2017
 Accepted: April 23, 2018
 Version of Record published: May 3, 2018 (version 1)
Copyright
© 2018, Matsunaga et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 3,269
 Page views

 531
 Downloads

 25
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Computational and Systems Biology
 Neuroscience
Epifluorescence miniature microscopes ('miniscopes') are widely used for in vivo calcium imaging of neural population activity. Imaging data is typically collected during a behavioral task and stored for later offline analysis, but emerging techniques for online imaging can support novel closedloop experiments in which neural population activity is decoded in real time to trigger neurostimulation or sensory feedback. To achieve short feedback latencies, online imaging systems must be optimally designed to maximize computational speed and efficiency while minimizing errors in population decoding. Here we introduce DeCalciOn, an opensource device for realtime imaging and population decoding of in vivo calcium signals that is hardware compatible with all miniscopes that use the UCLA Data Acquisition (DAQ) interface. DeCalciOn performs online motion stabilization, neural enhancement, calcium trace extraction, and decoding of up to 1024 traces per frame at latencies of <50 ms after fluorescence photons arrive at the miniscope image sensor. We show that DeCalciOn can accurately decode the position of rats (n=12) running on a linear track from calcium fluorescence in the hippocampal CA1 layer, and can categorically classify behaviors performed by rats (n=2) during an instrumental task from calcium fluorescence in orbitofrontal cortex (OFC). DeCalciOn achieves high decoding accuracy at short latencies using innovations such as fieldprogrammable gate array (FPGA) hardware for real time image processing and contourfree methods to efficiently extract calcium traces from sensor images. In summary, our system offers an affordable plugandplay solution for realtime calcium imaging experiments in behaving animals.

 Computational and Systems Biology
 Immunology and Inflammation
Highthroughput sequencing of adaptive immune receptor repertoires is a valuable tool for receiving insights in adaptive immunity studies. Several powerful TCR/BCR repertoire reconstruction and analysis methods have been developed in the past decade. However, detecting and correcting the discrepancy between real and experimentally observed lymphocyte clone frequencies is still challenging. Here we discovered a hallmark anomaly in the ratio between read count and clone countbased frequencies of nonfunctional clonotypes in multiplex PCRbased immune repertoires. Calculating this anomaly, we formulated a quantitative measure of V and Jgenes frequency bias driven by multiplex PCR during library preparation called Over Amplification Rate (OAR). Based on the OAR concept, we developed an original software for multiplex PCRspecific bias evaluation and correction named iROAR: Immune Repertoire Over Amplification Removal (https://github.com/smiranast/iROAR). The iROAR algorithm was successfully tested on previously published TCR repertoires obtained using both 5' RACE (Rapid Amplification of cDNA Ends)based and multiplex PCRbased approaches and compared with a biological spikeinbased method for PCR bias evaluation. The developed approach can increase the accuracy and consistency of repertoires reconstructed by different methods making them more applicable for comparative analysis.