Flexible and efficient simulation-based inference for models of decision-making
Abstract
Inferring parameters of computational models that capture experimental data is a central task in cognitive neuroscience. Bayesian statistical inference methods usually require the ability to evaluate the likelihood of the model—however, for many models of interest in cognitive neuroscience, the associated likelihoods cannot be computed efficiently. Simulation-based inference (SBI) offers a solution to this problem by only requiring access to simulations produced by the model. Previously, Fengler et al. introduced likelihood approximation networks (LANs, Fengler et al., 2021) which make it possible to apply SBI to models of decision-making but require billions of simulations for training. Here, we provide a new SBI method that is substantially more simulation efficient. Our approach, mixed neural likelihood estimation (MNLE), trains neural density estimators on model simulations to emulate the simulator and is designed to capture both the continuous (e.g., reaction times) and discrete (choices) data of decision-making models. The likelihoods of the emulator can then be used to perform Bayesian parameter inference on experimental data using standard approximate inference methods like Markov Chain Monte Carlo sampling. We demonstrate MNLE on two variants of the drift-diffusion model and show that it is substantially more efficient than LANs: MNLE achieves similar likelihood accuracy with six orders of magnitude fewer training simulations and is significantly more accurate than LANs when both are trained with the same budget. Our approach enables researchers to perform SBI on custom-tailored models of decision-making, leading to fast iteration of model design for scientific discovery.
Editor's evaluation
This paper provides a new approach, Mixed Neural Likelihood Estimator (MNLE), to build likelihood emulators for simulation-based models where the likelihood is unavailable. The authors show that the MNLE approach is equally accurate but orders of magnitude more efficient than a recent proposal, likelihood approximation networks (LAN), on two variants of the drift-diffusion model (a widely used model in cognitive neuroscience). This work provides a practical approach for fitting more complex models of behavior and neural activity for which likelihoods are unavailable.
https://doi.org/10.7554/eLife.77220.sa0

Introduction
Computational modeling is an essential part of the scientific process in cognitive neuroscience: Models are developed from prior knowledge and hypotheses, and compared to experimentally observed phenomena (Churchland and Sejnowski, 1988; McClelland, 2009). Computational models usually have free parameters which need to be tuned to find those models that capture experimental data. This is often approached by searching for single best-fitting parameters using grid search or optimization methods. While this point-wise approach has been used successfully (Lee et al., 2016; Patil et al., 2016), it can be scientifically more informative to perform Bayesian inference over the model parameters: Bayesian inference takes into account prior knowledge, reveals all the parameters consistent with observed data, and thus can be used for quantifying uncertainty, hypothesis testing, and model selection (Lee, 2008; Shiffrin et al., 2008; Lee and Wagenmakers, 2014; Schad et al., 2021). Yet, as the complexity of models used in cognitive neuroscience increases, Bayesian inference becomes challenging for two reasons. First, for many commonly used models, computational evaluation of likelihoods is challenging because often no analytical form is available. Numerical approximations of the likelihood are typically computationally expensive, rendering standard approximate inference methods like Markov Chain Monte Carlo (MCMC) inapplicable. Second, models and experimental paradigms in cognitive neuroscience often induce scenarios in which inference is repeated for varying numbers of experimental trials and changing hierarchical dependencies between model parameters (Lee, 2011). As such, fitting computational models with arbitrary hierarchical structures to experimental data often still requires idiosyncratic and complex inference algorithms.
Approximate Bayesian computation (ABC, Sisson, 2018) offers a solution to the first challenge by enabling Bayesian inference based on comparing simulated with experimental data, without the need to evaluate an explicit likelihood function. Accordingly, various ABC methods have been applied to and developed for models in cognitive neuroscience and related fields (Turner and Van Zandt, 2012; Turner and Van Zandt, 2018; Palestro et al., 2009; Kangasrääsiö et al., 2019). However, ABC methods are limited regarding the second challenge because they become inefficient as the number of model parameters increases (Lueckmann et al., 2021) and require generating new simulations whenever the observed data or parameter dependencies change.
More recent approaches from the field of simulation-based inference (SBI, Cranmer et al., 2020) have the potential to overcome these limitations by using machine learning algorithms such as neural networks. Recently, Fengler et al., 2021 presented an SBI algorithm for a specific problem in cognitive neuroscience—inference for drift-diffusion models (DDMs). They introduced a new approach, called likelihood approximation networks (LANs), which uses neural networks to predict log-likelihoods from data and parameters. The predicted likelihoods can subsequently be used to generate posterior samples using MCMC methods. LANs are trained in a three-step procedure. First, a set of $N$ parameters is generated and for each of the $N$ parameters the model is simulated $M$ times. Second, for each of the $N$ parameters, empirical likelihood targets are estimated from the $M$ model simulations using kernel density estimation (KDE) or empirical histograms. Third, a training dataset consisting of parameters, data points, and empirical likelihood targets is constructed by augmenting the initial set of $N$ parameters by a factor of 1000: for each parameter, 1000 data points and empirical likelihood targets are generated from the learned KDE. Finally, supervised learning is used to train a neural network to predict log-likelihoods, by minimizing a loss function (the Huber loss) between the network-predicted log-likelihoods and the (log of) the empirically estimated likelihoods. Overall, LANs require a large number of model simulations such that the histogram probability of each possible observed data point, for each possible combination of input parameters, can be accurately estimated—$N\cdot M$ model simulations, for example, $1.5\times 10^{6}\times 10^{5}=1.5\times 10^{11}$ (150 billion) for the examples used in Fengler et al., 2021.
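The second step of this procedure (turning raw simulations into empirical likelihood targets) can be sketched as a Gaussian KDE over simulated reaction times. This is a toy illustration only: the function name, kernel choice, and bandwidth are assumptions, not LANs' actual implementation.

```python
import numpy as np

def empirical_likelihood_targets(rts, eval_points, bandwidth=0.05):
    # Gaussian KDE over M simulated reaction times for one parameter set;
    # the densities evaluated at eval_points would serve as empirical
    # likelihood targets for supervised training (bandwidth is assumed).
    diffs = (eval_points[:, None] - rts[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth
```

With enough simulations per parameter, the KDE approaches the true density; the large $M$ required for an accurate estimate at every parameter is exactly what drives LANs' simulation cost.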
The extremely high number of model simulations will make it infeasible for most users to run this training themselves, so that there would need to be a repository from which users can download pretrained LANs. This restricts the application of LANs to a small set of canonical models like DDMs, and prohibits customization and iteration of models by users. In addition, the high simulation requirement limits this approach to models whose parameters and observations are sufficiently low dimensional for histograms to be sampled densely.
To overcome these limitations, we propose an alternative approach called mixed neural likelihood estimation (MNLE). MNLE builds on recent advances in probabilistic machine learning, and in particular on the framework of neural likelihood estimation (Papamakarios et al., 2019b; Lueckmann et al., 2019), but is designed to specifically capture the mixed data types arising in models of decision-making, for example, discrete choices and continuous reaction times. Neural likelihood estimation has its origin in classical synthetic likelihood (SL) approaches (Wood, 2010; Price et al., 2018). Classical SL approaches assume the likelihood of the simulation-based model to be Gaussian (so that its moments can be estimated from model simulations) and then use MCMC methods for inference. This approach and various extensions of it have been widely used (Price et al., 2018; Ong et al., 2009; An et al., 2019; Priddle et al., 2022)—but inherently they need multiple model simulations for each parameter in the MCMC chain to estimate the associated likelihood.
Neural likelihood approaches instead perform conditional density estimation, that is, they train a neural network to predict the parameters of the approximate likelihood conditioned on the model parameters (e.g., Papamakarios et al., 2019b; Lueckmann et al., 2019). By using a conditional density estimator, it is possible to exploit continuity across different model parameters, rather than having to learn a separate density for each individual parameter as in classical SL. Recent advances in conditional density estimation (such as normalizing flows, Papamakarios et al., 2019a) further allow lifting the parametric assumptions of classical SL methods and learning flexible conditional density estimators which are able to model a wide range of densities, as well as highly nonlinear dependencies on the conditioning variable. In addition, neural likelihood estimators yield estimates of the probability density which are guaranteed to be nonnegative and normalized, and which can be both sampled and evaluated, acting as a probabilistic emulator of the simulator (Lueckmann et al., 2019).
Our approach, MNLE, uses neural likelihood estimation to learn an emulator of the simulator. The training phase is a simple two-step procedure: first, a training dataset of $N$ parameters $\mathit{\theta}$ is sampled from a proposal distribution and corresponding model simulations $\mathbf{x}$ are generated. Second, the $N$ parameter–data pairs $(\mathit{\theta},\mathbf{x})$ are directly used to train a conditional neural likelihood estimator to estimate $p(\mathbf{x}\mid \mathit{\theta})$. As for LANs, the proposal distribution for the training data can be any distribution over $\mathit{\theta}$, and should be chosen to cover all parameter values one expects to encounter in empirical data. Thus, the prior distribution used for Bayesian inference constitutes a useful choice, but in principle any distribution that contains the support of the prior can be used. To account for mixed data types, we learn the likelihood estimator as a mixed model composed of one neural density estimator for categorical data and one for continuous data, conditioned on the categorical data. This separation allows us to choose the appropriate neural density estimator for each data type, for example, a Bernoulli model for the categorical data and a normalizing flow (Papamakarios et al., 2019a) for the continuous data. The resulting joint density estimator gives access to the likelihood, which enables inference via MCMC methods. See Figure 1 for an illustration of our approach, and Methods and materials for details.
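The underlying factorization $p(\mathbf{x}\mid\mathit{\theta}) = p(c\mid\mathit{\theta})\,p(rt\mid c,\mathit{\theta})$ can be illustrated with a toy stand-in: a Bernoulli for the choice and a log-normal in place of the learned normalizing flow. All names and parameter values below are hypothetical, chosen only to show how the two factors combine.

```python
import numpy as np

def mixed_log_likelihood(choice, rt, p_choice, mu, sigma):
    # Toy version of the mixed-model factorization
    #   log p(x | theta) = log p(c | theta) + log p(rt | c, theta).
    # A Bernoulli models the discrete choice; a log-normal stands in for
    # the conditional reaction-time density that MNLE would learn with a
    # normalizing flow (mu and sigma may depend on the choice).
    log_p_c = np.log(p_choice if choice == 1 else 1.0 - p_choice)
    # log-normal density evaluated at rt (requires rt > 0)
    log_p_rt = (-np.log(rt * sigma * np.sqrt(2.0 * np.pi))
                - (np.log(rt) - mu) ** 2 / (2.0 * sigma ** 2))
    return log_p_c + log_p_rt
```

In MNLE both factors are learned neural density estimators, so the same sum of log-terms is evaluated with network outputs instead of closed-form densities.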
Both LANs and MNLEs allow for flexible inference scenarios common in cognitive neuroscience, for example, varying numbers of trials with the same underlying experimental conditions or hierarchical inference, and need to be trained only once. However, there is a key difference between the two approaches. LANs use feed-forward neural networks to perform regression from model parameters to empirical likelihood targets obtained from KDE. MNLE instead learns the likelihood directly by performing conditional density estimation on the simulated data without requiring likelihood targets. This makes MNLE by design more simulation efficient than LANs—we demonstrate empirically that it can learn likelihood estimators which are as good as or better than those reported in the LAN paper (Fengler et al., 2021), but using a factor of 1,000,000 fewer simulations. When using the same simulation budget for both approaches, MNLE substantially outperforms LANs across several performance metrics. Moreover, MNLE results in a density estimator that is guaranteed to correspond to a valid probability distribution and can also act as an emulator that can generate synthetic data without running the simulator. The simulation efficiency of MNLE allows users to explore and iterate on their own models without generating a massive training dataset, rather than restricting their investigations to canonical models for which pretrained networks have been provided by a central resource. To facilitate this process, we implemented our method as an extension to an open-source toolbox for SBI methods (Tejero-Cantero et al., 2020), and provide detailed documentation and tutorials.
Results
Evaluating the performance of MNLE on the DDM
We first demonstrate the efficiency and performance of MNLE on a classical model of decision-making, the DDM (Ratcliff and McKoon, 2008). The DDM is an influential phenomenological model of a two-alternative perceptual decision-making task. It simulates the evolution of an internal decision variable that integrates sensory evidence until one of two decision boundaries is reached and a choice is made (Figure 1, left). The decision variable is modeled with a stochastic differential equation which, in the ‘simple’ DDM version (as used in Fengler et al., 2021), has four parameters: the drift rate $v$, boundary separation $a$, the starting point $w$ of the decision variable, and the non-decision time τ. Given these four parameters $\mathit{\theta}=(v,a,w,\tau )$, a single simulation of the DDM returns data $\mathbf{x}$ containing a choice $c\in \{0,1\}$ and the corresponding reaction time in seconds $rt\in (\tau ,\mathrm{\infty})$.
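For illustration, a minimal Euler–Maruyama simulation of this model might look as follows. This is a sketch, not the implementation used in the paper: the step size, time-out, and boundary convention (bounds at 0 and $a$, start at $w\cdot a$) are assumptions.

```python
import numpy as np

def simulate_ddm(v, a, w, tau, dt=1e-3, max_t=10.0, rng=None):
    # Euler-Maruyama simulation of the simple DDM: the decision variable
    # starts at w * a, drifts with rate v under unit diffusion, and a
    # choice is made when it crosses 0 (choice 0) or a (choice 1).
    # Returns (choice, rt), with rt including the non-decision time tau.
    rng = np.random.default_rng() if rng is None else rng
    x, t = w * a, 0.0
    while 0.0 < x < a and t < max_t:
        x += v * dt + np.sqrt(dt) * rng.standard_normal()
        t += dt
    choice = 1 if x >= a else 0
    return choice, tau + t
```

One such call produces a single $(c, rt)$ trial; repeating it for many parameter draws yields exactly the kind of training data MNLE consumes.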
MNLE learns accurate likelihoods with a fraction of the simulation budget
The simple version of the DDM is the ideal candidate for comparing the performance of different inference methods because the likelihood of an observation given the parameters, $L(\mathbf{x}\mid \mathit{\theta})$, can be calculated analytically (Navarro and Fuss, 2009; this is in contrast to more complicated versions of the DDM, e.g., Ratcliff and Rouder, 1998; Usher and McClelland, 2001; Reynolds and Rhodes, 2009). This enabled us to evaluate MNLE’s performance with respect to the analytical likelihoods and the corresponding inferred posteriors of the DDM, and to compare against that of LANs on a range of simulation budgets. For MNLE, we used a budget of 10^{5} simulations (henceforth referred to as MNLE^{5}); for LANs, we used budgets of 10^{5} and 10^{8} simulations (LAN^{5} and LAN^{8}, respectively, trained by us) and the pretrained version based on 10^{11} simulations (LAN^{11}) provided by Fengler et al., 2021.
First, we evaluated the quality of likelihood approximations of MNLE^{5}, and compared it to that of LAN^{5,8,11}. Both MNLEs and LANs were in principle able to accurately approximate the likelihoods for both decisions and a wide range of reaction times (see Figure 2a for an example, and Details of the numerical comparison). However, LANs require a much larger simulation budget than MNLE to achieve accurate likelihood approximations, that is, LANs trained with 10^{5} or 10^{8} simulations show visible deviations, both in the linear and in the log-domain (Figure 2a, lines for LAN^{5} and LAN^{8}).
To quantify the quality of likelihood approximation, we calculated the Huber loss and the mean-squared error (MSE) between the true and approximated likelihoods (Figure 2b, c), as well as between the log-likelihoods (Figure 2d, e). The metrics were calculated as averages over (log-)likelihoods of a fixed observation given 1000 parameters sampled from the prior, repeated for 100 observations simulated from the DDM. For metrics calculated on the untransformed likelihoods (Figure 2b, c), we found that MNLE^{5} was more accurate than LAN^{5,8,11} on all simulation budgets, showing smaller Huber loss than LAN^{5,8,11} in 99, 81, and 66 out of 100 comparisons, and smaller MSE than LAN^{5,8,11} in 98, 81, and 66 out of 100 comparisons, respectively. Similarly, for the MSE calculated on the log-likelihoods (Figure 2e), MNLE^{5} achieved smaller MSE than LAN^{5,8,11} in 100, 100, and 75 out of 100 comparisons, respectively. For the Huber loss calculated on the log-likelihoods (Figure 2d), we found that MNLE^{5} was more accurate than LAN^{5} and LAN^{8}, but slightly less accurate than LAN^{11}, showing smaller Huber loss than LAN^{5,8} in all 100 comparisons, and larger Huber loss than LAN^{11} in 62 out of 100 comparisons. All the above pairwise comparisons were significant under the binomial test (p < 0.01); note, however, that these are simulated data, so significance can be inflated arbitrarily by increasing the number of comparisons. We also note that the Huber loss on the log-likelihoods is the loss which is directly optimized by LANs, and thus this comparison should in theory favor LANs over alternative approaches. Furthermore, the MNLE^{5} results shown here represent averages over 10 random neural network initializations (five of which achieved smaller Huber loss than LAN^{11}), whereas the LAN^{11} results are based on a single pretrained network.
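A minimal sketch of the two comparison metrics follows; the Huber delta of 1.0 is an assumption, not necessarily the value used in the paper.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    # Huber loss averaged over elements: quadratic for errors within
    # delta, linear beyond it (robust to occasional large deviations).
    err = np.abs(y_true - y_pred)
    quad = 0.5 * err ** 2
    lin = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quad, lin))

def mse(y_true, y_pred):
    # mean-squared error between true and approximated (log-)likelihoods
    return np.mean((y_true - y_pred) ** 2)
```

Both metrics would be applied once to the likelihoods and once to the log-likelihoods, then averaged over parameters and observations as described above.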
Finally, we also investigated MNLE’s ability to act as an emulator of the simulator and found that the synthetic reaction times and choices generated from the MNLE emulator accurately matched corresponding data simulated from the DDM (see Figure 2—figure supplement 1 and Appendix 1).
When using the learned likelihood estimators for inference with MCMC methods, their evaluation speed can also be important because MCMC often requires thousands of likelihood evaluations. We found that evaluating MNLE for a batch of 100 trials and 10 model parameters (as used during MCMC) took on average 4.14 ms (mean over 100 repetitions), compared to 1.02 ms for LANs, that is, MNLE incurred a larger computational footprint at evaluation time. Note that these timings are based on an improved implementation of LANs compared to the one originally presented in Fengler et al., 2021, and evaluation times can depend on the implementation, compute infrastructure, and parameter settings (see Details of the numerical comparison and Discussion). In summary, we found that MNLE trained with 10^{5} simulations performed substantially better than LANs trained with 10^{5} or 10^{8} simulations, and similarly well or better than LANs trained with 10^{11} simulations, on all likelihood approximation accuracy metrics.
MNLE enables accurate flexible posterior inference with MCMC
In the previous section, we showed that MNLE requires substantially fewer training simulations than LANs to approximate the likelihood accurately. To investigate whether these likelihood estimates were accurate enough to support accurate parameter inference, we evaluated the quality of the resulting posteriors, using a framework for benchmarking SBI algorithms (Lueckmann et al., 2021). We used the analytical likelihoods of the simple DDM to obtain reference posteriors for 100 different observations, via MCMC sampling. Each observation consisted of 100 independent and identically distributed (i.i.d.) trials simulated with parameters sampled from the prior (see Figure 3a for an example, details in Materials and methods). We then performed inference using MCMC based on the approximate likelihoods obtained with MNLE (10^{5} budget, MNLE^{5}) and the ones obtained with LANs for each of the three simulation budgets (LAN^{5,8,11}, respectively).
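The way an approximate likelihood plugs into MCMC can be sketched with a random-walk Metropolis sampler over a scalar parameter. Here a toy Gaussian likelihood with a known posterior stands in for the learned emulator; the function names and settings are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(log_lik, log_prior, x_obs, theta0, n_steps=5000,
                        step=0.2, rng=None):
    # Random-walk Metropolis over a scalar parameter. log_lik(theta, x)
    # returns per-trial log-likelihoods (for MNLE this would be the
    # learned emulator); i.i.d. trials are combined by summing.
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0
    logp = log_prior(theta) + log_lik(theta, x_obs).sum()
    samples = []
    for _ in range(n_steps):
        prop = theta + step * rng.standard_normal()
        logp_prop = log_prior(prop) + log_lik(prop, x_obs).sum()
        # accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        samples.append(theta)
    return np.array(samples)
```

Because the emulator only enters through `log_lik`, the same sampler supports varying trial counts or hierarchical structures without retraining the density estimator.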
Overall, we found that the likelihood approximation performances presented above were reflected in the inference performances: MNLE^{5} performed substantially better than LAN^{5} and LAN^{8}, and equally well or better than LAN^{11} (Figure 3b–d). In particular, MNLE^{5} approximated the posterior mean more accurately than LAN^{5,8,11} (Figure 3b), being more accurate than LAN^{5,8,11} in 100, 90, and 67 out of 100 comparisons, respectively. In terms of posterior variance, MNLE^{5} performed better than LAN^{5,8} and on par with LAN^{11} (Figure 3c), being more accurate than LAN^{5,8,11} in 100, 93 (p < 0.01, binomial test), and 58 ($p=0.13$) out of 100 pairwise comparisons, respectively.
Additionally, we measured the parameter estimation accuracy as the MSE between the posterior mean and the ground-truth parameters underlying the observed data. We found that MNLE^{5} estimation error was indistinguishable from that of the reference posterior, and that LAN performance was similar only for the substantially larger simulation budget of LAN^{11} (Figure 3c), with MNLE being closer to reference performance than LAN^{5,8,11} in 100, 91, and 66 out of 100 comparisons, respectively (all p < 0.01). Note that all three metrics were reported as averages over the four parameter dimensions of the DDM to keep the visualizations compact, and that this average did not affect the results qualitatively. We report metrics for each dimension in Figure 3—figure supplement 1, as well as additional inference accuracy results for smaller LAN simulation budgets (Figure 3—figure supplement 2) and for different numbers of observed trials (Figure 3—figure supplement 3).
Finally, we used the classifier 2-sample test (C2ST, Lopez-Paz and Oquab, 2017; Lueckmann et al., 2021) to quantify the similarity between the estimated and reference posterior distributions. The C2ST is defined as the accuracy of a classification algorithm trained to distinguish samples of the true posterior from samples of the estimated posterior. Thus, it ranges from 0.5 (no difference between the distributions, the classifier is at chance level) to 1.0 (the classifier can perfectly distinguish the two distributions). We note that the C2ST is a highly sensitive measure of discrepancy between two multivariate distributions—for example, if the two distributions differ in any one dimension, the C2ST will be close to 1 even if all other dimensions match perfectly. We found that neither of the two approaches was able to achieve perfect approximations, but that MNLE^{5} achieved lower C2ST scores compared to LAN^{5,8,11} on all simulation budgets (Figure 3e): mean C2ST score LAN^{5,8,11}, 0.96, 0.78, 0.70 vs. MNLE^{5}, 0.65, with MNLE^{5} showing a better score than LAN^{5,8,11} in 100, 91, and 68 out of 100 pairwise comparisons, respectively (all p < 0.01). In summary, MNLE achieves more accurate recovery of posterior means than LANs, similar or better recovery of posterior variances, and overall more accurate posteriors (as quantified by C2ST).
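A minimal version of this test can be built from a 1-nearest-neighbour classifier as a simple stand-in for the trained classifier used in the benchmark (the classifier choice and train/test split are assumptions):

```python
import numpy as np

def c2st_1nn(samples_p, samples_q, rng=None):
    # Classifier two-sample test: held-out accuracy of a 1-NN classifier
    # distinguishing the two (n, d) sample sets. Scores near 0.5 mean the
    # distributions are indistinguishable; near 1.0, clearly distinct.
    rng = np.random.default_rng() if rng is None else rng
    X = np.concatenate([samples_p, samples_q])
    y = np.concatenate([np.zeros(len(samples_p)), np.ones(len(samples_q))])
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    n_train = len(X) // 2
    Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    # classify each held-out point by the label of its nearest training point
    d = np.linalg.norm(Xte[:, None, :] - Xtr[None, :, :], axis=-1)
    pred = ytr[np.argmin(d, axis=1)]
    return np.mean(pred == yte)
```

In practice a more powerful classifier (e.g., a small neural network with cross-validation) is typically used, but the interpretation of the score is the same.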
MNLE posteriors have uncertainties which are well calibrated
For practical applications of inference, it is often desirable to know how well an inference procedure can recover the ground-truth parameters, and whether the uncertainty estimates are well calibrated (Cook et al., 2006), that is, that the uncertainty estimates of the posterior are balanced, and neither overconfident nor underconfident. For the DDM, we found that the posteriors inferred with MNLE and LANs (when using LAN^{11}) recovered the ground-truth parameters accurately (in terms of posterior means, Figure 3d and Figure 4a)—in fact, parameter recovery was similarly accurate to using the ‘true’ analytical likelihoods, indicating that much of the residual error is due to stochasticity of the observations, and not the inaccuracy of the likelihood approximations.
To assess posterior calibration, we used simulation-based calibration (SBC, Talts et al., 2018). The basic idea of SBC is the following: If one repeats the inference with simulations from many different prior samples, then, with a well-calibrated inference method, the combined samples from all the inferred posteriors should be distributed according to the prior. One way to test this is to calculate the rank of each ground-truth parameter (samples from the prior) under its corresponding posterior, and to check whether all the ranks follow a uniform distribution. SBC results indicated that MNLE posteriors were as well calibrated as the reference posteriors, that is, the empirical cumulative distribution functions of the ranks were close to that of a uniform distribution (Figure 4b)—thus, on this example, MNLE inferences would likely be of similar quality compared to using the analytical likelihoods. When trained with the large simulation budget of 10^{11} simulations, LANs, too, appeared to recover most of the ground-truth parameters well. However, SBC detected a systematic underestimation of the parameter $a$ and overestimation of the parameter τ, and this bias increased for the smaller simulation budgets of LAN^{5} and LAN^{8} (Figure 4b, see the deviation below and above the desired uniform distribution of ranks, respectively).
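The rank-statistic computation behind SBC can be illustrated on a toy conjugate-Gaussian model, where the exact posterior is known and the ranks are therefore uniform by construction. The model and all settings below are illustrative assumptions.

```python
import numpy as np

def sbc_ranks(n_runs=1000, n_post=99, rng=None):
    # Simulation-based calibration on a toy conjugate model:
    # prior theta ~ N(0, 1), likelihood x ~ N(theta, 1), so the exact
    # posterior is N(x/2, 1/2). With exact inference, the rank of each
    # ground-truth theta among n_post posterior samples is uniform
    # on {0, ..., n_post}.
    rng = np.random.default_rng() if rng is None else rng
    ranks = np.empty(n_runs, dtype=int)
    for i in range(n_runs):
        theta = rng.standard_normal()              # draw from the prior
        x = theta + rng.standard_normal()          # simulate data
        post = x / 2 + np.sqrt(0.5) * rng.standard_normal(n_post)
        ranks[i] = np.sum(post < theta)            # rank statistic
    return ranks
```

With an approximate posterior in place of the exact one, systematic deviations of the rank histogram from uniformity reveal over- or underconfidence, which is exactly the diagnostic applied to MNLE and LANs above.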
The results so far (i.e., Figures 3 and 4) indicate that both LAN^{11} and MNLE^{5} lead to similar parameter recovery, but only MNLE^{5} leads to posteriors which were well calibrated for all parameters. These results were obtained using a scenario with 100 i.i.d. trials. When increasing the number of trials (e.g., to 1000 trials), posteriors become very concentrated around the ground-truth value. In that case, while the posteriors overall identified the ground-truth parameter value very well (Figure 4—figure supplement 1c), even small deviations in the posteriors can have large effects on the posterior metrics (Figure 3—figure supplement 3). This effect was also detected by SBC, showing systematic biases for some parameters (Figure 4—figure supplement 2). For MNLE, we found that these biases were smaller, and furthermore that it was possible to mitigate this effect by inferring the posterior using ensembles, for example, by combining samples inferred with five MNLEs trained with identical settings but different random initialization (see Appendix 1 for details). These results show the utility of using SBC as a tool to test posterior coverage, especially when studying models for which reference posteriors are not available, as we demonstrate in the next section.
MNLE infers well-calibrated, predictive posteriors for a DDM with collapsing bounds
MNLE was designed to be applicable to models for which evaluation of the likelihood is not practical so that standard inference tools cannot be used. To demonstrate this, we applied MNLE to a variant of the DDM for which analytical likelihoods are not available (note, however, that numerical approximation of likelihoods for this model would be possible, see e.g., Shinn et al., 2020, Materials and methods for details). This DDM variant simulates a decision variable like the simple DDM used above, but with linearly collapsing instead of constant decision boundaries (see e.g., Hawkins et al., 2015; Palestro et al., 2018). The collapsing bounds are incorporated with an additional parameter γ indicating the slope of the decision boundary, such that $\mathit{\theta}=(a,v,w,\tau ,\gamma )$ (see Details of the numerical comparison).
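A sketch of how the collapsing bounds change the simulator relative to the simple DDM follows; the exact boundary parameterization (upper bound $a - \gamma t$, lower bound $\gamma t$) is an assumption for illustration.

```python
import numpy as np

def simulate_ddm_collapsing(v, a, w, tau, gamma, dt=1e-3, max_t=10.0,
                            rng=None):
    # DDM with linearly collapsing decision boundaries: the upper bound
    # a - gamma*t and the lower bound gamma*t approach each other with
    # slope gamma until the decision variable crosses one of them or the
    # bounds meet. Returns (choice, rt) with rt including tau.
    rng = np.random.default_rng() if rng is None else rng
    x, t = w * a, 0.0
    while t < max_t:
        lower, upper = gamma * t, a - gamma * t
        if lower >= upper:          # bounds have fully collapsed
            break
        if x >= upper:
            return 1, tau + t
        if x <= lower:
            return 0, tau + t
        x += v * dt + np.sqrt(dt) * rng.standard_normal()
        t += dt
    # no crossing before collapse/time-out: choose by final position
    return (1 if x >= a / 2 else 0), tau + t
```

Because no analytical likelihood is available for this variant, simulated $(c, rt)$ pairs from such a simulator are exactly what MNLE trains on.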
We tested inference with MNLE on the DDM with linearly collapsing bounds using observations comprised of 100 i.i.d. trials simulated with parameters sampled from the prior. Using the same MNLE training and MCMC settings as above, we found that the posterior density concentrated around the underlying ground-truth parameters (see Figure 5a), suggesting that MNLE learned the underlying likelihood accurately. To assess inference quality systematically without needing reference posteriors, we performed posterior predictive checks by running simulations with the inferred posterior samples and comparing them to observed (simulated) data, and checked posterior calibration properties using SBC. We found that the inferred posteriors had good predictive performance, that is, data simulated from the inferred posterior samples accurately matched the observed data (Figure 5b), and that their uncertainties were well calibrated as quantified by the SBC results (Figure 5c). Overall, this indicated that MNLE accurately inferred the posterior of this intractable variant of the DDM.
Discussion
Statistical inference for computational models in cognitive neuroscience can be challenging because models often do not have tractable likelihood functions. The recently proposed LAN method (Fengler et al., 2021) performs SBI for a subset of such models (DDMs) by training neural networks with model simulations to approximate the intractable likelihood. However, LANs require large amounts of training data, restricting their use to canonical models. We proposed an alternative approach called MNLE, a synthetic neural likelihood method which is tailored to the data types encountered in many models of decision-making.
Our comparison on a tractable example problem used in Fengler et al., 2021 showed that MNLE performed on par with LANs using six orders of magnitude fewer model simulations for training. While Fengler et al., 2021 discuss that LANs were not optimized for simulation efficiency and that it might be possible to reduce the required model simulations, we emphasize that the difference in simulation efficiency is due to an inherent property of LANs. For each parameter in the training data, LANs require empirical likelihood targets that have to be estimated by building histograms or kernel density estimates from thousands of simulations. MNLE, instead, performs conditional density estimation without the need for likelihood targets and can work with only one simulation per parameter. Because of these conceptual differences, we expect the substantial performance advantage of MNLE to be robust to the specifics of the implementation.
After the networks are trained, the time needed for each evaluation determines the speed of inference. In that respect, both LANs and MNLE are conceptually similar in that they require a single forward pass through a neural network for each evaluation, and we found MNLE and the original implementation of LANs to require comparable computation times. However, evaluation time will depend, for example, on the exact network architecture, software framework, and computing infrastructure used. Code for a new PyTorch implementation of LANs has recently been released and improved upon the evaluation speed of the original implementation we compared to. While this new implementation made LANs significantly faster for a single forward pass, we observed that the resulting time difference with the MCMC settings used here was only on the order of minutes, whereas the difference in simulation time for LAN^{11} vs. MNLE^{5} was on the order of days. The exact timings will always be implementation specific, and whether or not these differences are important will depend on the application at hand. In a situation where iteration over model design is required (i.e., custom DDMs), an increase in training efficiency on the order of days would be advantageous.
There exist a number of approaches with corresponding software packages for estimating parameters of cognitive neuroscience models, and DDMs in particular. However, these approaches either only estimate single best-fitting parameters (Voss and Voss, 2007; Wagenmakers et al., 2007; Chandrasekaran and Hawkins, 2019; Heathcote et al., 2019; Shinn et al., 2020) or, if they perform fully Bayesian inference, can only produce Gaussian approximations to posteriors (Feltgen and Daunizeau, 2021), or are restricted to variants of the DDM for which the likelihood can be evaluated (Wiecki et al., 2013, HDDM [Hierarchical DDM] toolbox). A recent extension of the HDDM toolbox includes LANs, thereby combining HDDM’s flexibility with LAN’s ability to perform inference without access to the likelihood function (but this remains restricted to variants of the DDM for which LAN can be pretrained). In contrast, MNLE can be applied to any simulation-based model with intractable likelihoods and mixed data type outputs. Here, we focused on the direct comparison to LANs based on variants of the DDM. We note that these models have a rather low-dimensional observation structure (as common in many cognitive neuroscience models), and that our examples did not include additional parameter structure, for example, stimulus encoding parameters, which would increase the dimensionality of the learning problem. However, other variants of neural density estimation have been applied successfully to a variety of problems with higher dimensionality (see e.g., Gonçalves et al., 2020; Lueckmann et al., 2021; Glöckler et al., 2021; Dax et al., 2022). Therefore, we expect MNLE to be applicable to other simulation-based problems with higher-dimensional observation structure and parameter spaces, and to scale more favorably than LANs.
As for any neural network-based approach, applying MNLE to different inference problems may require selecting different architecture and training hyperparameter settings, which may induce additional computational training costs. To help with this process, we adopted default settings which have been shown to work well on a large range of SBI benchmarking problems (Lueckmann et al., 2021), and we integrated MNLE into the established sbi python package, which provides well-documented implementations of SBI training and inference algorithms.
Several extensions to classical synthetic likelihood (SL) approaches have addressed the problem of bias in the likelihood approximation due to strong parametric assumptions (i.e., Gaussianity), the use of summary statistics, or finite-sample biases (Price et al., 2018; Ong et al., 2009; van Opheusden et al., 2020). MNLE builds on flexible neural likelihood estimators, for example, normalizing flows, and does not require summary statistics for a low-dimensional simulator like the DDM, so it would be less susceptible to the first two biases. It could be subject to biases resulting from the estimation of the log-likelihoods from a finite number of simulations. In our numerical experiments, and for the simulation budgets we used, we did not observe biased inference results. We speculate that the ability of neural density estimators to pool data across multiple parameter settings (rather than using only data from a specific parameter set, as in classical SL methods) mitigates finite-sample effects.
MNLE is an SBI method which uses neural density estimators to estimate likelihoods. Alternatives to neural likelihood estimation include neural posterior estimation (NPE; Papamakarios and Murray, 2016; Lueckmann et al., 2017; Greenberg et al., 2019), which uses conditional density estimation to learn the posterior directly, and neural ratio estimation (NRE; Hermans et al., 2020; Durkan et al., 2020), which uses classification to approximate the likelihood-to-evidence ratio and then performs MCMC. These approaches could in principle be applied here as well; however, they are not as well suited as MNLE for the flexible inference scenarios common in decision-making models. NPE directly targets the posterior and therefore, by design, typically requires retraining if the structure of the problem changes (e.g., if the prior or the hierarchical structure of the model changes). There are variants of NPE that use embedding nets which can amortize over a changing number of trials, avoiding retraining (Radev et al., 2022; von Krause et al., 2022). NRE learns the likelihood-to-evidence ratio using ratio estimation (and not density estimation) and would not provide an emulator of the simulator.
Regarding future research directions, MNLE has the potential to become more simulation efficient by using weight sharing between the discrete and the continuous neural density estimators (rather than using separate neural networks, as we did here). Moreover, for high-dimensional inference problems in which slice-sampling-based MCMC might struggle, MNLE could be used in conjunction with gradient-based MCMC methods like Hamiltonian Monte Carlo (HMC; Brooks et al., 2011; Hoffman and Gelman, 2014), or with variational inference as recently proposed by Wiqvist et al., 2021 and Glöckler et al., 2021. With MNLE’s full integration into the sbi package, both gradient-based MCMC methods from Pyro (Bingham et al., 2019) and variational inference for SBI (SNVI; Glöckler et al., 2021) are available as inference methods for MNLE (a comparison of HMC and SNVI to slice-sampling MCMC on one example observation resulted in indistinguishable posterior samples). Finally, using its emulator property (see Appendix 1), MNLE could be applied in an active learning setting for highly expensive simulators, in which new simulations are chosen adaptively according to an acquisition function in a Bayesian optimization framework (Gutmann and Corander, 2016; Lueckmann et al., 2019; Järvenpää et al., 2019).
In summary, MNLE enables flexible and efficient inference of the parameters of models in cognitive neuroscience with intractable likelihoods. The training efficiency and flexibility of the underlying neural density estimators overcome the limitations of LANs (Fengler et al., 2021). Critically, these features enable researchers to develop customized models of decision-making rather than just apply existing models to new data. We implemented our approach as an extension to the public sbi python package, with detailed documentation and examples to make it accessible for practitioners.
Materials and methods
Mixed neural likelihood estimation
MNLE extends the framework of neural likelihood estimation (Papamakarios et al., 2019a; Lueckmann et al., 2019) to be applicable to simulation-based models with mixed data types. It learns a parametric model ${q}_{\psi}(\mathbf{x}\mid\mathit{\theta})$ of the intractable likelihood $p(\mathbf{x}\mid\mathit{\theta})$ defined implicitly by the simulation-based model. The parameters ψ are learned from training data ${\{{\mathit{\theta}}_{n},{\mathbf{x}}_{n}\}}_{1:N}$ comprised of model parameters ${\mathit{\theta}}_{n}$ and the corresponding data simulated from the model, ${\mathbf{x}}_{n}\sim p(\mathbf{x}\mid{\mathit{\theta}}_{n})$. The parameters are sampled from a proposal distribution, ${\mathit{\theta}}_{n}\sim p(\mathit{\theta})$. The proposal could be any distribution, but it determines the parameter regions for which the density estimator will be accurate in estimating likelihoods. Thus, the prior, or a distribution that contains the support of the prior (or even of all priors which one expects to use in the future), constitutes a useful choice. After training, the emulator can be used to generate synthetic data $\mathbf{x}\sim {q}_{\psi}(\mathbf{x}\mid\mathit{\theta})$ given parameters, and to evaluate the SL ${q}_{\psi}(\mathbf{x}\mid\mathit{\theta})$ given experimentally observed data. Finally, the SL can be used to obtain posterior samples via

$$p(\mathit{\theta}\mid\mathbf{x})\propto {q}_{\psi}(\mathbf{x}\mid\mathit{\theta})\,p(\mathit{\theta})$$

through approximate inference with MCMC. Importantly, the training is amortized, that is, the emulator can be applied to new experimental data without retraining (running MCMC is still required).
We tailored MNLE to simulation-based models that return mixed data, for example, in the form of reaction times $rt$ and (usually categorical) choices $c$, as for the DDM. Conditional density estimation with normalizing flows has been proposed for continuous random variables (Papamakarios et al., 2019a) and for discrete random variables (Tran et al., 2019), but not for mixed data. Our solution for estimating the likelihood of mixed data is to simply factorize the likelihood into continuous and discrete parts,

$$p(rt,c\mid\mathit{\theta})=p(c\mid\mathit{\theta})\,p(rt\mid\mathit{\theta},c),$$

and to use two separate neural likelihood estimators: a discrete one, ${q}_{{\psi}_{c}}$, to estimate $p(c\mid\mathit{\theta})$, and a continuous one, ${q}_{{\psi}_{rt}}$, to estimate $p(rt\mid\mathit{\theta},c)$. We defined ${q}_{{\psi}_{c}}$ to be a Bernoulli model and used a neural network to learn the Bernoulli probability ρ given parameters $\mathit{\theta}$. For ${q}_{{\psi}_{rt}}$ we used a conditional neural spline flow (NSF, Durkan et al., 2019) to learn the density of $rt$ given a parameter $\mathit{\theta}$ and choice $c$. The two estimators are trained separately on the same training data (see Neural network architecture, training and hyperparameters for details). After training, the full neural likelihood is constructed by multiplying the two likelihood estimates back together:

$${q}_{\psi}(rt,c\mid\mathit{\theta})={q}_{{\psi}_{c}}(c\mid\mathit{\theta})\,{q}_{{\psi}_{rt}}(rt\mid\mathit{\theta},c).$$
Note that, as the second estimator ${q}_{{\psi}_{rt}}(rt\mid c,\mathit{\theta})$ is conditioned on the choice $c$, our likelihood model can account for statistical dependencies between choices and reaction times. The neural likelihood can then be used to approximate the intractable likelihood defined by the simulator, for example, for inference with MCMC. Additionally, it can be used to sample synthetic data given model parameters, without running the simulator (see The emulator property of MNLE).
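As a minimal numeric illustration of this construction (with hand-picked stand-in densities, not the trained networks; the logistic choice probability and log-normal reaction-time density below are purely hypothetical), the mixed likelihood is the product of the two factors and, summed over choices and integrated over reaction times, totals one:

```python
import numpy as np

def choice_prob(theta):
    # Stand-in for the Bernoulli model q_{psi_c}(c=1 | theta).
    return 1.0 / (1.0 + np.exp(-theta[0]))

def rt_density(rt, theta, c):
    # Stand-in for the flow q_{psi_rt}(rt | theta, c): a log-normal density
    # whose location shifts with the choice, capturing rt-choice dependence.
    mu, sigma = theta[1] + 0.2 * c, 0.5
    return np.exp(-((np.log(rt) - mu) ** 2) / (2 * sigma**2)) / (
        rt * sigma * np.sqrt(2 * np.pi)
    )

def mixed_likelihood(rt, c, theta):
    # q(rt, c | theta) = q(c | theta) * q(rt | theta, c)
    p_c = choice_prob(theta) if c == 1 else 1.0 - choice_prob(theta)
    return p_c * rt_density(rt, theta, c)

theta = np.array([0.5, 0.0])
# Riemann sum over rt for both choices: the mixed density is normalized.
rts = np.linspace(1e-3, 50.0, 200_000)
drt = rts[1] - rts[0]
total = sum(mixed_likelihood(rts, c, theta).sum() * drt for c in (0, 1))
```

In MNLE, both factors are learned from simulations instead of being specified by hand, but they are combined in exactly this way.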
Relation to LAN
Neural likelihood estimation can be much more simulation efficient than previous approaches because it exploits continuity across the parameter space by making the density estimation conditional. Fengler et al., 2021, too, aim to exploit continuity across the parameter space by training a neural network to predict the value of the likelihood function from parameters $\mathit{\theta}$ and data $\mathbf{x}$. However, the difference to neural likelihood estimation is that they do not use the neural network for density estimation directly, but instead perform classical neural network-based regression on likelihood targets. Crucially, the likelihood targets first have to be obtained for each parameter in the training dataset. To do so, one has to perform density estimation using KDE (as proposed by Turner et al., 2015) or empirical histograms for every parameter separately. Once trained, LANs do indeed exploit the continuity across the parameter space (they can predict log-likelihoods given unseen data and parameters), but they do so at a very high simulation cost: for a training dataset of $N$ parameters, they perform $N$ separate KDEs based on $M$ simulations each (for models with categorical output, that is, all decision-making models, KDE is performed separately for each choice), resulting in an overall simulation budget of $N\cdot M$ ($N=$ 1.5 million and $M=$ 100,000 for the ‘pointwise’ LAN approach).
Details of the numerical comparison
The comparison between MNLE and LAN is based on the DDM. The DDM simulates a decision variable $X$ as a stochastic differential equation with parameters $\mathit{\theta}=(v,a,w,\tau )$:

$$dX = v\,dt + dW, \qquad X(0) = w\cdot a,$$

where $W$ is a Wiener noise process and the simulation terminates when $X$ crosses one of the two decision boundaries at $0$ and $a$. The priors over the parameters are defined to be uniform: $v\sim \mathcal{U}(-2,2)$ is the drift, $a\sim \mathcal{U}(0.5,2)$ the boundary separation, $w\sim \mathcal{U}(0.3,0.7)$ the initial offset, and $\tau \sim \mathcal{U}(0.2,1.8)$ the non-decision time. A single simulation from the model returns a choice $c\in \{0,1\}$ and the corresponding reaction time in seconds, $rt\in (\tau ,\mathrm{\infty})$.
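A minimal Euler–Maruyama implementation of this simulator can sketch the data-generating process (illustrative only; the paper uses the optimized DiffModels.jl implementation, and the step size and timeout here are arbitrary choices):

```python
import numpy as np

def simulate_ddm(v, a, w, tau, dt=1e-3, max_t=10.0, rng=None):
    """Simulate one DDM trial and return (choice, reaction time).

    The decision variable starts at w * a and evolves with drift v plus
    unit-variance Wiener noise until it crosses 0 (choice 0) or a (choice 1);
    the non-decision time tau is added to the first-passage time.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, t = w * a, 0.0
    while 0.0 < x < a and t < max_t:
        x += v * dt + np.sqrt(dt) * rng.standard_normal()
        t += dt
    return (1 if x >= a else 0), tau + t

rng = np.random.default_rng(0)
choices, rts = zip(*(simulate_ddm(v=1.0, a=1.5, w=0.5, tau=0.3, rng=rng)
                     for _ in range(200)))
```

With a positive drift, most trials end at the upper boundary, and no reaction time can be shorter than the non-decision time.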
For this version of the DDM the likelihood of an observation $(c,rt)$ given parameters $\mathit{\theta}$, $L(c,rt\mathit{\theta})$, can be calculated analytically (Navarro and Fuss, 2009). To simulate the DDM and calculate analytical likelihoods we used the approach and the implementation proposed by Drugowitsch, 2016. We numerically confirmed that the simulations and the analytical likelihoods match those obtained from the research code provided by Fengler et al., 2021.
To run LANs, we downloaded the neural network weights of the pretrained models from the repository mentioned in Fengler et al., 2021. The budget of training simulations used for the LANs was $1.5\times {10}^{11}$ (1.5 million training data points, each obtained from KDE based on 10^{5} simulations). We only considered the approach based on training a multilayer perceptron on single-trial likelihoods (‘pointwise approach’, Fengler et al., 2021). At a later stage of our study we performed additional experiments to evaluate the performance of LANs trained with smaller simulation budgets of 10^{5} and 10^{8} simulations. For this analysis, we used an updated implementation of LANs based on PyTorch that was provided by the authors. We used the training routines and default settings provided with that implementation. To train LANs at the smaller budgets we used the following splits of budgets into the number of parameter settings drawn from the prior and the number of simulations per parameter setting used for fitting the KDE: for the 10^{5} budget we used 10^{2} and 10^{3}, respectively (we ran experiments splitting the other way around, but results were slightly better for this split); for the 10^{8} budget we used an equal split of 10^{4} each (all code publicly available, see Code availability).
To run MNLE, we extended the implementation of neural likelihood estimation in the sbi toolbox (Tejero-Cantero et al., 2020). All comparisons were performed on a single AMD Ryzen Threadripper 1920X 12-core processor at 2.2 GHz, and the code is publicly available (see Code availability).
For the DDM variant with linearly collapsing decision boundaries, the boundaries were parametrized by the initial boundary separation, $a$, and one additional parameter, γ, indicating the slope with which the boundary approaches zero. This resulted in a five-dimensional parameter space, for which we used the same prior as above plus an additional uniform prior for the slope, $\gamma \sim \mathcal{U}(-1.0,0)$. To simulate this DDM variant, we again used the Julia package by Drugowitsch, 2016, but we note that for this variant no analytical likelihoods are available. While it would be possible to approximate the likelihoods numerically using the Fokker–Planck equations (see e.g., Shinn et al., 2020), this would usually involve a trade-off between computation time and accuracy as well as the design of bespoke solutions for individual models, and was not pursued here.
Flexible Bayesian inference with MCMC
Once the MNLE is trained, it can be used for MCMC to obtain posterior samples $\mathit{\theta}\sim p(\mathit{\theta}\mid\mathbf{x})$ given experimentally observed data $\mathbf{x}$. To sample from posteriors via MCMC, we transformed the parameters to unconstrained space, used slice sampling (Neal, 2009), and initialized ten parallel chains using sequential importance sampling (Papamakarios et al., 2019a), all as implemented in the sbi toolbox. We ran MCMC with identical settings for MNLE and LAN.
Importantly, performing MNLE and then using MCMC to obtain posterior samples allows for flexible inference scenarios because the form of $\mathbf{x}$ is not fixed. For example, when the model produces trial-based data that satisfy the i.i.d. assumption, for example, a set of reaction times and choices $\mathbf{X}={\{(r{t}_{i},{c}_{i})\}}_{i=1}^{N}$ in a DDM, then MNLE makes it possible to perform inference given varying numbers of trials, without retraining. This is achieved by training MNLE on single-trial likelihoods once and combining multiple trials into the joint likelihood only when running MCMC:

$$p(\mathbf{X}\mid\mathit{\theta})=\prod _{i=1}^{N}{q}_{\psi}(r{t}_{i},{c}_{i}\mid\mathit{\theta}).$$
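To make this concrete, the sketch below combines single-trial terms into a joint log-likelihood and feeds it to a minimal random-walk Metropolis sampler (the single-trial likelihood is a hand-picked stand-in for the trained emulator, and the paper uses slice sampling rather than Metropolis; any MCMC scheme that only needs log-likelihood evaluations works the same way):

```python
import numpy as np

def joint_loglik(trials, theta):
    # Stand-in for sum_i log q_psi(rt_i, c_i | theta): a Bernoulli choice
    # term plus a log-normal reaction-time term (illustrative densities).
    p1 = 1.0 / (1.0 + np.exp(-theta[0]))
    ll = 0.0
    for rt, c in trials:
        ll += np.log(p1 if c == 1 else 1.0 - p1)
        ll += -((np.log(rt) - theta[1]) ** 2) / 0.5 - np.log(rt * 0.5 * np.sqrt(2 * np.pi))
    return ll

def metropolis(trials, log_prior, n_steps=2000, rng=None):
    # Minimal random-walk Metropolis over theta; the number of trials only
    # changes the likelihood sum, not the trained emulator.
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(2)
    logp = joint_loglik(trials, theta) + log_prior(theta)
    samples = []
    for _ in range(n_steps):
        prop = theta + 0.2 * rng.standard_normal(2)
        logp_prop = joint_loglik(trials, prop) + log_prior(prop)
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        samples.append(theta.copy())
    return np.array(samples)

trials = [(0.8, 1), (1.2, 0), (0.6, 1), (0.9, 1)]  # (rt, choice) pairs
log_prior = lambda th: 0.0 if np.all(np.abs(th) < 5.0) else -np.inf
samples = metropolis(trials, log_prior, rng=np.random.default_rng(1))
```

Adding or removing trials only changes the list passed to the sampler; the likelihood model itself is untouched.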
Similarly, one can use the neural likelihood to perform hierarchical inference with MCMC, all without the need for retraining (see Hermans et al., 2020; Fengler et al., 2021, for examples).
Stimulus and inter-trial dependencies
Simulation-based models in cognitive neuroscience often depend not only on a set of parameters $\mathit{\theta}$, but additionally on (a set of) stimulus variables $s$ which are typically given as part of the experimental conditions. MNLE can be readily adapted to this scenario by generating simulated data with multiple stimulus variables and including them as additional inputs to the network during inference. Similarly, MNLE could be adapted to scenarios in which the i.i.d. assumption across trials used above (see Flexible Bayesian inference with MCMC) does not hold. Again, this would be achieved by adapting the model simulator accordingly. For example, when inferring parameters $\mathit{\theta}$ of a DDM for which the outcome of the current trial $i$ additionally depends on the current stimulus variable ${s}_{i}$ as well as on previous stimuli ${s}_{j}$ and responses ${r}_{j}$, then one would implement the DDM simulator as a function $f(\mathit{\theta};{s}_{i-T},\dots ,{s}_{i};{r}_{i-T},\dots ,{r}_{i-1})$ (where $T$ is a history parameter) and then learn the underlying likelihood by emulating $f$ with MNLE.
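As a sketch of this idea (hypothetical dependency structure: the stimulus scales the drift and the previous response adds a perseveration bias `beta`; the real structure is up to the modeler), a history-dependent simulator simply takes past stimuli and responses as extra conditioning inputs:

```python
import numpy as np

def ddm_trial(v, a, w, tau, dt=1e-3, rng=None):
    # One Euler-Maruyama DDM trial (see text): returns (choice, rt).
    rng = np.random.default_rng() if rng is None else rng
    x, t = w * a, 0.0
    while 0.0 < x < a and t < 10.0:
        x += v * dt + np.sqrt(dt) * rng.standard_normal()
        t += dt
    return (1 if x >= a else 0), tau + t

def history_ddm(theta, stimuli, responses, rng=None):
    # f(theta; s_{i-T}, ..., s_i; r_{i-T}, ..., r_{i-1}): the drift on the
    # current trial scales with the current stimulus and gets a hypothetical
    # perseveration bias beta from the previous response.
    v, a, w, tau, beta = theta
    v_i = v * stimuli[-1] + beta * (2.0 * responses[-1] - 1.0)
    return ddm_trial(v_i, a, w, tau, rng=rng)

choice, rt = history_ddm(
    theta=(1.0, 1.5, 0.5, 0.3, 0.2),
    stimuli=[1.0, -1.0, 1.0],  # s_{i-2}, s_{i-1}, s_i
    responses=[1, 0],          # r_{i-2}, r_{i-1}
    rng=np.random.default_rng(2),
)
```

Training MNLE on simulations from such a wrapped simulator, with the stimulus and response history appended to the conditioning inputs, would then emulate the history-dependent likelihood directly.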
Neural network architecture, training, and hyperparameters
Architecture
For the architecture of the Bernoulli model we chose a feedforward neural network that takes parameters $\mathit{\theta}$ as input and predicts the Bernoulli probability ρ of the corresponding choices. For the normalizing flow we used the NSF architecture (Durkan et al., 2019). NSFs use a standard normal base distribution and transform it using several modules of monotonic rational-quadratic splines whose parameters are learned by invertible neural networks. Using an unbounded base distribution for modeling data with bounded support, for example, reaction time data $rt\in (0,\mathrm{\infty})$, can be challenging. To account for this, we tested two approaches: we either transformed the reaction time data to logarithmic space to obtain an unbounded support ($\mathrm{log}\,rt\in (-\mathrm{\infty},\mathrm{\infty})$), or we used a log-normal base distribution with rectified (instead of linear) tails for the splines (see Durkan et al., 2019, for details, and Architecture and training hyperparameters for the architecture settings used).
Training
The neural network parameters ${\psi}_{c}$ and ${\psi}_{rt}$ were trained using the maximum likelihood loss and the Adam optimizer (Kingma and Ba, 2015). As the proposal distribution for the training dataset we used the prior over DDM parameters. Given a training dataset of parameters, choices, and reaction times $\{{\mathit{\theta}}_{i},({c}_{i},r{t}_{i}){\}}_{i=1}^{N}$ with ${\mathit{\theta}}_{i}\sim p(\mathit{\theta})$ and $({c}_{i},r{t}_{i})\sim \text{DDM}({\mathit{\theta}}_{i})$, we minimized the negative log-probability of the model. In particular, using the same training data, we trained the Bernoulli choice model by minimizing

$$\mathcal{L}({\psi}_{c})=-\sum _{i=1}^{N}\mathrm{log}\,{q}_{{\psi}_{c}}({c}_{i}\mid{\mathit{\theta}}_{i}),$$

and the NSF by minimizing

$$\mathcal{L}({\psi}_{rt})=-\sum _{i=1}^{N}\mathrm{log}\,{q}_{{\psi}_{rt}}(r{t}_{i}\mid{\mathit{\theta}}_{i},{c}_{i}).$$
Training was performed with the code and training hyperparameter settings provided in the sbi toolbox (Tejero-Cantero et al., 2020).
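As a toy version of the first of these losses (a one-parameter logistic "choice network" fitted by plain gradient descent in NumPy; the actual networks and optimizer come from sbi/PyTorch, and the ground-truth weights 1.5 and −0.5 are arbitrary illustration values), minimizing the Bernoulli negative log-likelihood looks like:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic training data: parameters theta and choices drawn from a
# ground-truth logistic model (a stand-in for DDM simulations).
theta = rng.uniform(-2, 2, size=1000)
p_true = 1.0 / (1.0 + np.exp(-(1.5 * theta - 0.5)))
c = (rng.uniform(size=1000) < p_true).astype(float)

# Gradient descent on the mean negative log-likelihood
# -1/N sum_i log q(c_i | theta_i) of a logistic choice model.
w, b, lr = 0.0, 0.0, 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * theta + b)))
    w -= lr * np.mean((p - c) * theta)  # d(mean NLL)/dw
    b -= lr * np.mean(p - c)            # d(mean NLL)/db
```

The fitted `(w, b)` recover roughly the ground-truth values, illustrating how the choice network's parameters are driven toward the data-generating conditional distribution.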
Hyperparameters
MNLE requires a number of hyperparameter choices regarding the neural network architectures, for example, the number of hidden layers, number of hidden units, number of stacked NSF transforms, and kind of base distribution, among others (Durkan et al., 2019). As our implementation builds on the sbi package, we based our hyperparameter choices on the default settings provided there. This resulted in likelihood accuracy similar to LAN, but longer evaluation times due to the complexity of the underlying normalizing flow architecture.
To reduce the evaluation time of MNLE, we further adapted the architecture to the example model (DDM). In particular, we ran a cross-validation over the hyperparameters relevant for evaluation time, that is, the number of hidden layers, hidden units, NSF transforms, and spline bins, and selected those that were optimal in terms of the Huber loss and MSE between the approximate and the analytical likelihoods, as well as evaluation time. This resulted in an architecture with performance and evaluation time similar to LANs (more details in Appendix: Architecture and training hyperparameters). The cross-validation relied on access to the analytical likelihoods, which is usually not given in practice, for example, for simulators with intractable likelihoods. However, we note that in cases without access to analytical likelihoods, a similar cross-validation can be performed using other quality measures, for example, by comparing the observed data with synthetic data and SLs provided by MNLE.
Appendix 1
Code availability
We implemented MNLE as part of the open-source package for SBI, sbi, available at https://github.com/mackelab/sbi (Boelts, 2022a; copy archived at swh:1:rev:d72fc6d790285c7779afbbe9a5f6b640691d4560). Code for reproducing the results presented here, and tutorials on how to apply MNLE to other simulators using sbi, can be found at https://github.com/mackelab/mnle-for-ddms (Boelts, 2022b; copy archived at swh:1:rev:5e6cf714c223ec5c414b76ac70f7dc88d4fbd321). The implementation of MNLE relies on packages developed by the Python (Van Rossum and Drake, 1995) and Julia (Bezanson et al., 2017) communities, including DifferentialEquations.jl (Rackauckas and Nie, 2017), DiffModels.jl (Drugowitsch, 2016), NumPy (Harris et al., 2020), pandas (pandas development team, 2020), Pyro (Bingham et al., 2019), PyTorch (Paszke et al., 2019), sbi (Tejero-Cantero et al., 2020), sbibm (Lueckmann et al., 2021), and Scikit-learn (Pedregosa et al., 2011).
Architecture and training hyperparameters
For the Bernoulli neural network we used three hidden layers with 10 units each and sigmoid activation functions. For the neural spline flow architecture (Durkan et al., 2019), we transformed the reaction time data to the logdomain, used a standard normal base distribution, 2 spline transforms with 5 bins each and conditioning networks with 3 hidden layers and 10 hidden units each, and rectified linear unit activation functions. The neural network training was performed using the sbi package with the following settings: learning rate 0.0005; training batch size 100; 10% of training data as validation data, stop training after 20 epochs without validation loss improvement.
The emulator property of MNLE
Being based on the neural likelihood estimation framework, MNLE naturally returns an emulator of the simulator that can be sampled to generate synthetic data without running the simulator. We found that the synthetic data generated by MNLE accurately matched the data we obtained by running the DDM simulator (Figure 2—figure supplement 1). This has several potential benefits: it can help with evaluating the performance of the density estimator, it enables almost instantaneous data generation (one forward pass through the neural network) even if the simulator is computationally expensive, and it gives full access to the internals of the emulator, for example, to gradients w.r.t. data or parameters.
There is a variant of the LAN approach which allows for sampling synthetic data as well: in the ‘histogram approach’ (Fengler et al., 2021), LANs are trained with a convolutional neural network (CNN) architecture using likelihood targets in the form of two-dimensional empirical histograms. The output of the CNN is a probability distribution over a discretized version of the data space which can, in principle, be sampled to generate synthetic DDM choices and reaction times. However, the accuracy of this emulator property of CNN-LANs is limited by the number of bins used to approximate the continuous data space (e.g., 512 bins for the examples shown in Fengler et al., 2021).
Data availability
We implemented MNLE as part of the open-source package for SBI, sbi, available at https://github.com/mackelab/sbi (copy archived at swh:1:rev:d72fc6d790285c7779afbbe9a5f6b640691d4560). Code for reproducing the results presented here, and tutorials on how to apply MNLE to other simulators using sbi, can be found at https://github.com/mackelab/mnle-for-ddms (copy archived at swh:1:rev:5e6cf714c223ec5c414b76ac70f7dc88d4fbd321).
References

Accelerating Bayesian synthetic likelihood with the graphical lasso. Journal of Computational and Graphical Statistics 28:471–475. https://doi.org/10.1080/10618600.2018.1537928

Julia: A fresh approach to numerical computing. SIAM Review 59:65–98. https://doi.org/10.1137/141000671

Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research 20:973–978.

Software: sbi: simulation-based inference, version swh:1:rev:d72fc6d790285c7779afbbe9a5f6b640691d4560. Software Heritage.

Book: MCMC using Hamiltonian dynamics. In: Brooks S, editors. Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC. pp. 1–2. https://doi.org/10.1201/b10905

ChaRTr: An R toolbox for modeling choices and response times in decision-making tasks. Journal of Neuroscience Methods 328:108432. https://doi.org/10.1016/j.jneumeth.2019.108432

Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics 15:675–692. https://doi.org/10.1198/106186006X136976

Conference: Group equivariant neural posterior estimation. In: International Conference on Learning Representations.

Conference: Neural spline flows. In: Advances in Neural Information Processing Systems. pp. 7511–7522.

Conference: On contrastive learning for likelihood-free inference. In: International Conference on Machine Learning. pp. 2771–2781.

An overcomplete approach to fitting drift-diffusion decision models to trial-by-trial data. Frontiers in Artificial Intelligence 4:531316. https://doi.org/10.3389/frai.2021.531316

Conference: Variational methods for simulation-based inference. In: International Conference on Learning Representations.

Conference: Automatic posterior transformation for likelihood-free inference. In: Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research. pp. 2404–2414.

Bayesian optimization for likelihood-free inference of simulator-based statistical models. The Journal of Machine Learning Research 17:4256–4302.

Revisiting the evidence for collapsing boundaries and urgency signals in perceptual decision-making. The Journal of Neuroscience 35:2476–2484. https://doi.org/10.1523/JNEUROSCI.2410-14.2015

Dynamic models of choice. Behavior Research Methods 51:961–985. https://doi.org/10.3758/s13428-018-1067-y

Conference: Likelihood-free MCMC with approximate likelihood ratios. In: Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research.

The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15:1593–1623.

Efficient acquisition rules for model-based approximate Bayesian computation. Bayesian Analysis 14:595–622. https://doi.org/10.1214/18-BA1121

Conference: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR).

Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review 15:1–15. https://doi.org/10.3758/pbr.15.1.1

How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology 55:1–7. https://doi.org/10.1016/j.jmp.2010.08.013

Book: Bayesian Cognitive Modeling. Cambridge University Press. https://doi.org/10.1017/CBO9781139087759

Learning problem-solving rules as search through a hypothesis space. Cognitive Science 40:1036–1079. https://doi.org/10.1111/cogs.12275

Conference: Revisiting classifier two-sample tests. In: 5th International Conference on Learning Representations (ICLR).

Conference: Likelihood-free inference with emulator networks. In: Proceedings of the 1st Symposium on Advances in Approximate Bayesian Inference, Proceedings of Machine Learning Research. pp. 32–53.

Conference: Benchmarking simulation-based inference. In: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. pp. 343–351.

The place of modeling in cognitive science. Topics in Cognitive Science 1:11–38. https://doi.org/10.1111/j.1756-8765.2008.01003.x

Fast and accurate calculations for first-passage times in Wiener diffusion models. Journal of Mathematical Psychology 53:222–230. https://doi.org/10.1016/j.jmp.2009.02.003

Variational Bayes with synthetic likelihood. Statistics and Computing 28:971–988. https://doi.org/10.1007/s11222-017-9773-3

Some task demands induce collapsing bounds: Evidence from a behavioral analysis. Psychonomic Bulletin & Review 25:1225–1248. https://doi.org/10.3758/s13423-018-1479-9

Conference: Fast ε-free inference of simulation models with Bayesian conditional density estimation. In: Advances in Neural Information Processing Systems.

Conference: Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research. pp. 837–848.

Conference: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8024–8035.

Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics 27:1–11. https://doi.org/10.1080/10618600.2017.1302882

Efficient Bayesian synthetic likelihood with whitening transformations. Journal of Computational and Graphical Statistics 31:50–63. https://doi.org/10.1080/10618600.2021.1979012

DifferentialEquations.jl – a performant and feature-rich ecosystem for solving differential equations in Julia. Journal of Open Research Software 5:15. https://doi.org/10.5334/jors.151

BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems 33:1452–1466. https://doi.org/10.1109/TNNLS.2020.3042395

Modeling response times for two-choice decisions. Psychological Science 9:347–356. https://doi.org/10.1111/1467-9280.00067

Toward a principled Bayesian workflow in cognitive science. Psychological Methods 26:103–126. https://doi.org/10.1037/met0000275

Book: Overview of ABC. In: Sisson SA, Fan Y, Beaumont MA, editors. Handbook of Approximate Bayesian Computation, Chapter 1. CRC Press, Taylor & Francis Group. pp. 1–678. https://doi.org/10.1201/9781315117195

sbi: A toolkit for simulation-based inference. Journal of Open Source Software 5:2505. https://doi.org/10.21105/joss.02505

Conference: Discrete flows: Invertible generative models of discrete data. In: Advances in Neural Information Processing Systems. pp. 14719–14728.

A tutorial on approximate Bayesian computation. Journal of Mathematical Psychology 56:69–85. https://doi.org/10.1016/j.jmp.2012.02.005

Informing cognitive abstractions through neuroimaging: the neural drift diffusion model. Psychological Review 122:312–336. https://doi.org/10.1037/a0038894

Approximating Bayesian inference through model simulation. Trends in Cognitive Sciences 22:826–840. https://doi.org/10.1016/j.tics.2018.06.003

The time course of perceptual choice: the leaky, competing accumulator model. Psychological Review 108:550–592. https://doi.org/10.1037/0033-295X.108.3.550

Unbiased and efficient log-likelihood estimation with inverse binomial sampling. PLOS Computational Biology 16:e1008483. https://doi.org/10.1371/journal.pcbi.1008483

Fast-dm: A free program for efficient diffusion model analysis. Behavior Research Methods 39:767–775. https://doi.org/10.3758/bf03192967

An EZ-diffusion model for response time and accuracy. Psychonomic Bulletin & Review 14:3–22. https://doi.org/10.3758/bf03194023

HDDM: Hierarchical Bayesian estimation of the drift-diffusion model in Python. Frontiers in Neuroinformatics 7:14. https://doi.org/10.3389/fninf.2013.00014
Decision letter

Valentin WyartReviewing Editor; École normale supérieure, PSL University, INSERM, France

Timothy E BehrensSenior Editor; University of Oxford, United Kingdom

Luigi AcerbiReviewer; University of Helsinki, Finland

Jean DaunizeauReviewer; Inserm, France
Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.
Decision letter after peer review:
Thank you for submitting your article "Flexible and efficient simulationbased inference for models of decisionmaking" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by Valentin Wyart as the Reviewing Editor and Timothy Behrens as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Luigi Acerbi (Reviewer #1); Jean Daunizeau (Reviewer #2).
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. As you will see after reading the two reviews (copied at the bottom of this message), the two reviewers have found that the MNLE approach you describe represents a significant advance over LANs. As summarized by Reviewer #1, instead of estimating the likelihood *independently* for each parameter setting and then training a neural network to interpolate between these density estimates, MNLE uses *conditional* density estimation which trains a density network while sharing information across different parameters settings. This approach, and others built on top of it or inspired by it, may have a large impact in the field, even more so since all code has been made available by the authors.
But while this approach is a priori orders of magnitude more efficient than LANs, the reviewers have agreed that the current manuscript does not provide sufficient empirical evidence that this is the case. The reviewers have discussed how in practice the comparison between LANs and MNLEs could be improved to strengthen the manuscript in this respect. Furthermore, the reviewers have also identified limitations of the approach (shared with LANs) that are worth discussing more explicitly in the paper. We hope that you will find the reviews helpful when revising your manuscript. When submitting your revised manuscript, please provide a point-by-point response to the essential revisions that were decided after discussion between the reviewers. The other points made in the separate reviews (provided at the bottom of this message) should also be addressed, but they do not require a point-by-point response.
Essential revisions:
1. While the MNLE approach appears much more sample-efficient than the LAN approach, there is not sufficient empirical evidence that this is indeed the case. The main piece of evidence is described in Section 2.2, where the authors show that MNLE learns the likelihood function as accurately as LAN, but with only a fraction of the simulation budget. The authors show that an MNLE estimator trained with 10^5 model simulations is as accurate as a LAN estimator trained with 10^11 model simulations. However, the authors did not show that a LAN estimator trained with only 10^5 model simulations yields less accurate results than an MNLE estimator trained with the same simulation budget, or that an MNLE estimator trained with 10^11 simulations yields much better results than a LAN estimator trained with the same simulation budget. A more interpretable comparison of the two approaches is needed here: claiming that MNLE is more efficient than LAN is important, but requires additional work. We do not require the authors to systematically vary the number of model simulations (spanning the 10^5 to 10^11 range) and quantify each method's efficiency in terms of the accuracy profile over this range. The authors could compare their MNLE estimator trained with 10^5 model simulations with a LAN estimator trained on 10^8 samples, which is still 1000 times more than their MNLE estimator, and show that the LAN estimator performs much worse than their MNLE estimator. Note that there is an additional technical caveat here, in that training LANs depends on an additional degree of freedom: the split between the number of parameters and the number of samples per parameter. The authors should keep a sensible split. The current split appears to be 10^6 parameters and 10^5 samples per parameter; when going down to 10^8, it would make sense to keep a similar ratio.
2. Additional "accuracy" metrics should be considered when comparing MNLE and LAN estimators. These metrics may include likelihood approximation accuracy, parameter estimation accuracy and estimation uncertainty accuracy. The last two are important in practice, because errors in likelihood approximations may have a very small impact on parameter estimation. Regarding parameter estimation accuracy, the metric used in the manuscript (the "absolute difference in posterior sample mean normalized by reference posterior standard deviation") may not be the most informative. Indeed, significant differences between MNLEs and LANs in this metric may have no practical consequence whatsoever in terms of absolute parameter estimation errors. This is what Figure 4A seems to indicate. It could be useful to use metrics that relate to parameter recovery, e.g., parameter estimation error or pairwise parameter confusion. LAN estimators may show more pairwise parameter confusion than MNLE estimators, for example.
3. The following limitations of MNLE estimators (limitations shared with LAN estimators) should be discussed more explicitly in the manuscript:
– The drift-diffusion model used to benchmark the approach has a very low-dimensional observation structure (one binary observation + one continuous observation per trial). This limitation is not necessarily a problem because many models in cognitive neuroscience have a very low-dimensional observation structure, but it is worth mentioning.
– The approach described works for i.i.d. data. Any additional structure/dependence (e.g., adding parameters to characterize the stimulus shown in the trial) effectively increases the dimensionality of the likelihood approximation the network needs to learn.
– The manuscript does not sufficiently discuss or explore the scalability of the MNLE approach. Given the previous related work by some of the authors, which has shown remarkable results in this respect, it would be useful to discuss the scalability of MNLEs in the discussion.
– Like any neural-network-based approach, selecting the hyperparameters and architecture of the network requires either prior knowledge and/or brute force, and it is currently unclear how in practice the MNLE approach can be applied to different models. Importantly, even if training an MNLE estimator *given the hyperparameters* might require a small number of simulations (10^5), we need to account for the additional cost of finding the correct hyperparameters for training the emulator (presumably requiring an exploration of at least 10-100 hyperparameter settings).
Reviewer #1 (Recommendations for the authors):
I already discussed an earlier version of the manuscript with the authors, so I don't really have much to add to my previous comments and to my public comments (i.e., there are a couple of limitations that the authors could address a bit more explicitly, like dimensionality of the method and the cost of hyperparameter tuning).
Reviewer #2 (Recommendations for the authors):
Let me now explain why I think that the authors' main claim (namely, that MNLE is more efficient than LAN) may not be strongly supported by the results reported in the current version of the manuscript.
The main piece of evidence is summarized in Section 2.2, where the authors claim that MNLE learns the likelihood function as accurately as LAN, but with only a fraction of the simulation budget. In brief, the authors showed that an MNLE estimator trained with 10^5 model simulations is as accurate as a pretrained version of the LAN method for DDMs (which used 10^11 model simulations). The problem here is that they did not show that a LAN estimator trained with only 10^5 model simulations yields less accurate results, or that MNLE trained with 10^11 simulations yields much better results. More generally, what is needed here is a quantitative comparison of the efficiency of the method. One would expect that, for both MNLE and LAN, increasing the number of model simulations increases the accuracy of likelihood approximation. However, this probably follows the law of diminishing marginal utility (i.e., using 10^11 simulations may bring little advantage when compared to, e.g., 10^10). Hence, efficiency here should be measured in terms of the speed at which the accuracy of the results increases. In other terms, claiming that MNLE is more efficient than LAN is important, but requires more work than what is done here. The authors should systematically vary the number of model simulations (spanning, e.g., the 10^5 to 10^11 range) and quantify each method's efficiency in terms of the accuracy profile over this range.
Now, this analysis should be performed with different "accuracy" metrics. In particular, the authors may measure likelihood approximation accuracy, parameter estimation accuracy, and estimation uncertainty accuracy. The latter two are critical, because errors in likelihood approximations may have a very small impact on parameter estimation. I note that, with respect to parameter estimation accuracy and estimation uncertainty accuracy, I don't think the authors chose very informative metrics. For example, the former is defined as the "absolute difference in posterior sample mean normalized by reference posterior standard deviation". If I understood correctly, this metric may show significant differences between MNLE and LAN that would have no practical consequence whatsoever, given that the methods would be very similar in terms of their absolute parameter estimation errors. In fact, this is what Figure 4A seems to indicate. I would rather suggest using accuracy metrics that practically "mean something" for parameter recovery. Measuring parameter estimation error is a possibility (although one may then conclude that MNLE brings no clear advantage when compared to LAN, cf. Figure 4A). But there are other possibilities. For example, likelihood approximation errors may induce pairwise parameter confusions; in turn, for a small simulation budget, LAN estimators may show more pairwise parameter confusion than MNLE. Note: to quantify pairwise parameter confusion in the context of the DDM, the authors may take inspiration from Feltgen 2021 (or any other metric of their choice).
https://doi.org/10.7554/eLife.77220.sa1

Author response
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this decision letter to help you prepare a revised submission. As you will see after reading the two reviews (copied at the bottom of this message), the two reviewers have found that the MNLE approach you describe represents a significant advance over LANs. As summarized by Reviewer #1, instead of estimating the likelihood *independently* for each parameter setting and then training a neural network to interpolate between these density estimates, MNLE uses *conditional* density estimation, which trains a density network while sharing information across different parameter settings. This approach, and others built on top of it or inspired by it, may have a large impact in the field, even more so since all code has been made available by the authors.
But while this approach is a priori orders of magnitude more efficient than LANs, the reviewers have agreed that the current manuscript does not provide sufficient empirical evidence that this is the case. The reviewers have discussed how in practice the comparison between LANs and MNLEs could be improved to strengthen the manuscript in this respect. Furthermore, the reviewers have also identified limitations of the approach (shared with LANs) that are worth discussing more explicitly in the paper. We hope that you will find the reviews helpful when revising your manuscript. When submitting your revised manuscript, please provide a point-by-point response to the essential revisions that were decided after discussion between the reviewers. The other points made in the separate reviews (provided at the bottom of this message) should also be addressed, but they do not require a point-by-point response.
We appreciate that the reviewers view our approach as a significant advance over LANs, with a potentially large impact on the field. We agree that the previous version of the manuscript lacked a direct comparison between MNLE and LANs at a similar simulation budget. We now provide results for these experiments below and show that MNLE is significantly more accurate than LANs across several simulation budgets. We also provide further discussion of the limitations of MNLE.
Essential revisions:
1. While the MNLE approach appears much more sample-efficient than the LAN approach, there is not sufficient empirical evidence that this is indeed the case. The main piece of evidence is described in Section 2.2, where the authors show that MNLE learns the likelihood function as accurately as LAN, but with only a fraction of the simulation budget. The authors show that an MNLE estimator trained with 10^5 model simulations is as accurate as a LAN estimator trained with 10^11 model simulations. However, the authors did not show that a LAN estimator trained with only 10^5 model simulations yields less accurate results than an MNLE estimator trained with the same simulation budget, or that an MNLE estimator trained with 10^11 simulations yields much better results than a LAN estimator trained with the same simulation budget. A more interpretable comparison of the two approaches is needed here: claiming that MNLE is more efficient than LAN is important, but requires additional work. We do not require the authors to systematically vary the number of model simulations (spanning the 10^5 to 10^11 range) and quantify each method's efficiency in terms of the accuracy profile over this range. The authors could compare their MNLE estimator trained with 10^5 model simulations with a LAN estimator trained on 10^8 samples, which is still 1000 times more than their MNLE estimator, and show that the LAN estimator performs much worse than their MNLE estimator. Note that there is an additional technical caveat here, in that training LANs depends on an additional degree of freedom: the split between the number of parameters and the number of samples per parameter. The authors should keep a sensible split. The current split appears to be 10^6 parameters and 10^5 samples per parameter; when going down to 10^8, it would make sense to keep a similar ratio.
The revised manuscript includes additional experiments to show that MNLE is more simulation-efficient than LANs, as requested, with updated main figures. Note that the authors of the LAN paper recently provided an updated PyTorch implementation of LAN, which made it feasible to retrain LANs at smaller budgets locally, instead of relying exclusively on the pretrained network with a budget of 10^{11} simulations, as we had done previously.
We retrained LANs for budgets of 10^{5} and 10^{8} simulations. For the 10^{5} budget we used 10^{2} parameter samples and 10^{3} samples per parameter (in contrast to the split used in the LAN paper, we used more samples per parameter than parameter samples, which resulted in slightly better performance than the other way around). For the 10^{8} budget we used an equal split of 10^{4} parameter samples and 10^{4} samples per parameter.
For the neural network training settings, we used the defaults provided in the new LAN implementation. As training data, we used data from the same DDM simulator as used for MNLE, except for the 10^{11} LAN budget, for which simulation and training were computationally too expensive, so we used the updated pretrained network (its training data still came from the same distribution, the simple DDM, just not from our own simulator).
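For intuition, the kind of training data referred to here (pairs of choices and reaction times simulated for many parameter sets) can be generated with a minimal Euler-Maruyama sketch of a simple DDM. This is an illustrative sketch only: the parameter ranges, step size, and scaled-down budget split below are our own assumptions, not the settings used in the paper.

```python
import numpy as np

def simulate_ddm(v, a, w, t0, dt=1e-3, max_steps=10_000, rng=None):
    """Simulate one trial of a simple drift-diffusion model (Euler-Maruyama).

    v: drift rate, a: boundary separation, w: relative starting point (0-1),
    t0: non-decision time. Returns a (choice, reaction_time) pair.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = w * a  # start between the boundaries at 0 and a
    for step in range(1, max_steps + 1):
        x += v * dt + np.sqrt(dt) * rng.standard_normal()
        if x >= a:
            return 1, t0 + step * dt  # upper boundary crossed -> choice 1
        if x <= 0.0:
            return 0, t0 + step * dt  # lower boundary crossed -> choice 0
    # Rare fallback if no boundary is crossed within max_steps:
    return (1 if x >= a / 2 else 0), t0 + max_steps * dt

# Split a (scaled-down) simulation budget into parameter sets and repeated
# simulations per parameter set, analogous to the split discussed above:
rng = np.random.default_rng(0)
n_params, n_per_param = 100, 5  # illustrative split, 500 simulations total
thetas = rng.uniform([-2.0, 0.5, 0.3, 0.1], [2.0, 2.0, 0.7, 0.5],
                     size=(n_params, 4))  # (v, a, w, t0) per row
data = np.array([simulate_ddm(*th, rng=rng)
                 for th in thetas for _ in range(n_per_param)])
```

Each row of `data` holds one (choice, reaction time) observation; at full scale, these pairs would form the training set for the emulator.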
We found that MNLE with a 10^{5} budget (henceforth referred to as MNLE^{5}) significantly outperformed LANs trained with 10^{5} and 10^{8} budgets (LAN^{5} and LAN^{8}, respectively) on all metrics (see updated Figure 2 below). When compared to LAN^{11}, MNLE^{5} was more accurate on all but one metric, the Huber loss on the log-likelihoods (LAN's training loss). Note that the numbers for MNLE^{5} and LAN^{11} differ slightly from our initial submission because MNLE performance is now averaged across ten random neural network initializations (a subset of which outperformed the single pretrained LAN^{11} network) and because we used new implementations of LAN (provided by the authors) and of MNLE (now fully integrated into the sbi toolbox).
We updated Figure 2 with the new results and the following changes: we added the new results for the smaller LAN simulation budgets in lighter shades of orange; we show the metric results as bar plots with standard error of the mean (SEM) error bars instead of box plots, to improve readability; and we replaced panels D (evaluation time) and E (budgets) with the Huber loss and MSE calculated on log-likelihoods (previously shown in panels B and C), now showing the Huber loss and MSE calculated on likelihoods in panels B and C. The main text is updated accordingly (not copied here due to length). We also still mention the results on evaluation time in the main text, and cover the inference time of LAN and MNLE in general in the discussion as well.
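For reference, the two likelihood-accuracy metrics shown in these panels (the Huber loss, which is LAN's training loss, and the MSE) can be computed between approximate and reference (log-)likelihoods as in this small sketch; the numerical values are illustrative only, not taken from the paper.

```python
import numpy as np

def huber_loss(approx, reference, delta=1.0):
    """Mean Huber loss: quadratic for errors below delta, linear above."""
    err = np.abs(approx - reference)
    quad = 0.5 * err**2
    lin = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quad, lin)))

def mse(approx, reference):
    """Mean squared error between approximate and reference values."""
    return float(np.mean((approx - reference) ** 2))

# Illustrative numbers only: emulator vs. reference log-likelihoods.
ref_ll = np.array([-1.2, -0.8, -2.5, -0.3])
approx_ll = np.array([-1.0, -0.9, -2.0, -0.4])
huber_val = huber_loss(approx_ll, ref_ll)  # all errors in the quadratic regime
mse_val = mse(approx_ll, ref_ll)
# The same metrics on likelihoods (rather than log-likelihoods) would be
# computed on np.exp(ref_ll) and np.exp(approx_ll).
```

The Huber loss downweights large outlier errors relative to the MSE, which is one reason the two metrics can rank methods differently.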
2. Additional "accuracy" metrics should be considered when comparing MNLE and LAN estimators. These metrics may include likelihood approximation accuracy, parameter estimation accuracy and estimation uncertainty accuracy. The last two are important in practice, because errors in likelihood approximations may have a very small impact on parameter estimation. Regarding parameter estimation accuracy, the metric used in the manuscript (the "absolute difference in posterior sample mean normalized by reference posterior standard deviation") may not be the most informative. Indeed, significant differences between MNLEs and LANs in this metric may have no practical consequence whatsoever in terms of absolute parameter estimation errors. This is what Figure 4A seems to indicate. It could be useful to use metrics that relate to parameter recovery, e.g., parameter estimation error or pairwise parameter confusion. LAN estimators may show more pairwise parameter confusion than MNLE estimators, for example.
We thank the reviewers for pointing this out and agree that it is important to check how errors in likelihood approximation accuracy affect inference accuracy in practice, and to report parameter estimation accuracy with a separate metric. To evaluate both posterior accuracy and parameter estimation accuracy, we now additionally report the mean squared error (MSE) between the posterior sample mean and the underlying ground-truth parameters. We updated Figure 3 as follows: panels B and C show relative errors in posterior mean and posterior variance (as before), and panel D now shows the parameter estimation accuracy for all methods, as well as for the reference posterior as a benchmark for the best possible accuracy (this panel showed the dispersion metric before). Note that we average the parameter estimation metric over the four parameter dimensions to keep the figure compact, but this did not affect the results qualitatively (we added a supplementary figure showing the metrics for each parameter dimension separately, Figure 3—figure supplement 1, see end of document). Panel E shows C2ST scores as before. All panels were updated with the corresponding results for inference with LAN budgets of 10^{5} and 10^{8} as well.
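A minimal sketch of the parameter-estimation metric as described above (MSE between the posterior sample mean and the ground-truth parameters, averaged over parameter dimensions); the toy four-parameter "posterior" below is illustrative, not a DDM posterior.

```python
import numpy as np

def parameter_estimation_mse(posterior_samples, theta_true):
    """MSE between the posterior sample mean and ground-truth parameters,
    averaged over parameter dimensions.

    posterior_samples: shape (n_samples, n_params); theta_true: (n_params,).
    """
    posterior_mean = posterior_samples.mean(axis=0)
    return float(np.mean((posterior_mean - theta_true) ** 2))

# Toy 4-parameter posterior concentrated near the ground truth.
rng = np.random.default_rng(42)
theta_true = np.array([0.5, 1.2, 0.6, 0.3])
samples = theta_true + 0.05 * rng.standard_normal((5_000, 4))
err = parameter_estimation_mse(samples, theta_true)  # close to zero
```

Even a reference posterior has nonzero error on this metric, since the posterior mean only coincides with the ground truth in expectation; this is why the reference posterior serves as the benchmark rather than zero.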
We found that the increased errors in likelihood approximation accuracy for smaller LAN budgets reported above were reflected in the inference accuracy as well (see updated Figure 3).
To provide concrete examples of how using a LAN trained on a smaller budget affects the inferred posterior, we added a supplementary figure (Figure 3—figure supplement 2, see end of document) showing posteriors inferred with the smaller LAN budgets. Additionally, we calculated simulation-based calibration (SBC) results for the two smaller LAN budgets and found that calibration quality decreased with decreasing simulation budget; we updated Figure 4B, the figure caption, and the main text accordingly.
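Simulation-based calibration checks whether the rank of each ground-truth parameter among samples from its inferred posterior is uniformly distributed. A toy sketch of the rank computation, using a stand-in Gaussian "posterior" rather than the actual DDM pipeline:

```python
import numpy as np

def sbc_ranks(prior_sample, posterior_sampler, n_datasets=200, n_post=100,
              rng=None):
    """Rank of each ground-truth parameter among its posterior samples.

    For a well-calibrated posterior, ranks are uniform on {0, ..., n_post}.
    """
    rng = np.random.default_rng() if rng is None else rng
    ranks = []
    for _ in range(n_datasets):
        theta = prior_sample(rng)                     # ground truth from prior
        post = posterior_sampler(theta, n_post, rng)  # posterior given data(theta)
        ranks.append(int(np.sum(post < theta)))       # rank of truth in samples
    return np.array(ranks)

# Toy check: if the "posterior" simply equals the prior, ranks are ~uniform.
prior = lambda rng: rng.standard_normal()
posterior = lambda theta, n, rng: rng.standard_normal(n)
ranks = sbc_ranks(prior, posterior, rng=np.random.default_rng(1))
```

In practice one would inspect a histogram of `ranks`: a U-shape indicates an overconfident posterior, an inverted U an underconfident one, and skew a biased one.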
We thank reviewer 2 for the pointer to parameter confusion as an additional metric. However, we did not obtain this additional result, as we believe that the current results on inference accuracy for the smaller simulation budgets convincingly demonstrate the efficiency advantage of MNLE over LAN. Overall, this shows that for budgets smaller than 10^{11} simulations, inference with LANs was not nearly as accurate as with MNLE, whereas MNLE was close to the reference performance on parameter estimation accuracy. We adapted the figure captions and the results reported in the main text accordingly.
3. The following limitations of MNLE estimators (limitations shared with LAN estimators) should be discussed more explicitly in the manuscript:
– The drift-diffusion model used to benchmark the approach has a very low-dimensional observation structure (one binary observation + one continuous observation per trial). This limitation is not necessarily a problem because many models in cognitive neuroscience have a very low-dimensional observation structure, but it is worth mentioning.
– The approach described works for i.i.d. data. Any additional structure/dependence (e.g., adding parameters to characterize the stimulus shown in the trial) effectively increases the dimensionality of the likelihood approximation the network needs to learn.
– The manuscript does not sufficiently discuss or explore the scalability of the MNLE approach. Given the previous related work by some of the authors, which has shown remarkable results in this respect, it would be useful to discuss the scalability of MNLEs in the discussion.
– Like any neural-network-based approach, selecting the hyperparameters and architecture of the network requires either prior knowledge and/or brute force, and it is currently unclear how in practice the MNLE approach can be applied to different models. Importantly, even if training an MNLE estimator *given the hyperparameters* might require a small number of simulations (10^5), we need to account for the additional cost of finding the correct hyperparameters for training the emulator (presumably requiring an exploration of at least 10-100 hyperparameter settings).
We agree that these are four important points and changed the third paragraph in the discussion accordingly (see below).
Regarding the additional model or stimulus parameters, we agree that those will effectively increase the dimensionality of the learning problem, but we note that under the i.i.d. assumption, additional parameter dependencies, e.g., for hierarchical inference scenarios, will not require retraining MNLE but will only affect the inference step with MCMC or variational inference (VI).
Regarding the selection of neural network settings, we agree that this was not pointed out clearly enough in the manuscript. MNLE is now fully integrated into the sbi toolbox and is set up with default settings that we expect to work well in many applications (see, e.g., the SBI benchmark by Lueckmann et al., 2021, who showed that the same settings worked well for NLE across a wide range of tasks). Additionally, with the evaluation techniques available in the sbi toolbox, e.g., posterior predictive checks and simulation-based calibration, users can now evaluate their trained MNLE emulator. Nevertheless, it is important to emphasize that it might be necessary to adapt the MNLE architecture to a given application and to repeat training multiple times to try different settings.
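As an illustration of one such evaluation technique, a posterior predictive check compares a summary statistic of the observed data against its distribution under data simulated from posterior draws. The sketch below uses a stand-in Gaussian simulator in place of the trained emulator, not the sbi API; all names and values are illustrative.

```python
import numpy as np

def posterior_predictive_check(posterior_samples, simulator, x_observed, rng):
    """Two-sided tail probability of the observed mean under the posterior
    predictive distribution; values near 0 flag a poor fit."""
    stats = np.array([simulator(theta, rng).mean()
                      for theta in posterior_samples])
    p = np.mean(stats <= x_observed.mean())
    return 2 * min(p, 1 - p)

# Stand-in Gaussian simulator playing the role of the trained emulator.
rng = np.random.default_rng(3)
simulator = lambda theta, rng: theta + rng.standard_normal(100)
x_observed = simulator(0.5, rng)                          # "observed" data
posterior_samples = 0.5 + 0.1 * rng.standard_normal(500)  # toy posterior draws
pval = posterior_predictive_check(posterior_samples, simulator, x_observed, rng)
```

For the DDM, the summary statistic would typically be something like the choice proportion or reaction-time quantiles rather than a simple mean.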
To address all four points, we adapted the discussion as follows (extending the third paragraph in the discussion):
“[…]. In contrast, MNLE can be applied to any simulation-based model with intractable likelihoods and mixed data-type outputs. Here, we focused on the direct comparison to LANs based on variants of the DDM. We note that these models have a rather low-dimensional observation structure (as is common in many cognitive neuroscience models), and that our examples did not include additional parameter structure, e.g., stimulus encoding parameters, which would increase the dimensionality of the learning problem. However, other variants of neural density estimation have been applied successfully to a variety of problems with higher dimensionality (see, e.g., Gonçalves et al., 2020; Lueckmann et al., 2021; Glöckler et al., 2021; Dax et al., 2022). Therefore, we expect MNLE to be applicable to other simulation-based problems with higher-dimensional observation structure and parameter spaces, and to scale more favorably than LANs.
As for any neural-network-based approach, applying MNLE to different inference problems may require selecting different architecture and training hyperparameter settings, which may induce additional computational training costs. To help with this process, we adopted default settings which have been shown to work well on a large range of SBI benchmarking problems (Lueckmann et al., 2021), and we integrated MNLE into the established sbi Python package, which provides well-documented implementations for training and inference with SBI algorithms.”
Reviewer #2 (Recommendations for the authors):
Let me now explain why I think that the authors' main claim (namely, that MNLE is more efficient than LAN) may not be strongly supported by the results reported in the current version of the manuscript.
The main piece of evidence is summarized in Section 2.2, where the authors claim that MNLE learns the likelihood function as accurately as LAN, but with only a fraction of the simulation budget. In brief, the authors showed that an MNLE estimator trained with 10^5 model simulations is as accurate as a pretrained version of the LAN method for DDMs (which used 10^11 model simulations). The problem here is that they did not show that a LAN estimator trained with only 10^5 model simulations yields less accurate results, or that MNLE trained with 10^11 simulations yields much better results. More generally, what is needed here is a quantitative comparison of the efficiency of the method. One would expect that, for both MNLE and LAN, increasing the number of model simulations increases the accuracy of likelihood approximation. However, this probably follows the law of diminishing marginal utility (i.e., using 10^11 simulations may bring little advantage when compared to, e.g., 10^10). Hence, efficiency here should be measured in terms of the speed at which the accuracy of the results increases. In other terms, claiming that MNLE is more efficient than LAN is important, but requires more work than what is done here. The authors should systematically vary the number of model simulations (spanning, e.g., the 10^5 to 10^11 range) and quantify each method's efficiency in terms of the accuracy profile over this range.
Now, this analysis should be performed with different "accuracy" metrics. In particular, the authors may measure likelihood approximation accuracy, parameter estimation accuracy, and estimation uncertainty accuracy. The latter two are critical, because errors in likelihood approximations may have a very small impact on parameter estimation. I note that, with respect to parameter estimation accuracy and estimation uncertainty accuracy, I don't think the authors chose very informative metrics. For example, the former is defined as the "absolute difference in posterior sample mean normalized by reference posterior standard deviation". If I understood correctly, this metric may show significant differences between MNLE and LAN that would have no practical consequence whatsoever, given that the methods would be very similar in terms of their absolute parameter estimation errors. In fact, this is what Figure 4A seems to indicate. I would rather suggest using accuracy metrics that practically "mean something" for parameter recovery. Measuring parameter estimation error is a possibility (although one may then conclude that MNLE brings no clear advantage when compared to LAN, cf. Figure 4A). But there are other possibilities. For example, likelihood approximation errors may induce pairwise parameter confusions; in turn, for a small simulation budget, LAN estimators may show more pairwise parameter confusion than MNLE. Note: to quantify pairwise parameter confusion in the context of the DDM, the authors may take inspiration from Feltgen 2021 (or any other metric of their choice).
Thank you for the detailed review and explanation of your concerns with the paper. As mentioned above in the point-by-point response to the essential revisions, we were now able to train LANs ourselves and run experiments with smaller simulation budgets. We trained LANs with budgets of 10^{4}, 10^{5}, 10^{6}, 10^{7}, and 10^{8} simulations and observed, as expected, an overall decrease in likelihood approximation accuracy with decreasing budget. As addressed in detail above and now reported in the updated version of Figure 2, we found that the likelihood approximation accuracy of LAN matched that of MNLE (trained with 10^{5} simulations) only for a budget of 10^{11} simulations. Performance decreased monotonically with smaller budgets and resulted in significantly worse likelihood approximations as well as worse posterior accuracy, posterior calibration, and parameter point estimation accuracy. We clarify the algorithmic differences between LAN and MNLE, and the reason for the efficiency difference, in the main text (Methods and Materials—Relation to LAN).
We also thank you for the suggestions regarding performance metrics. We updated the posterior accuracy metrics to be in absolute terms and added the parameter estimation accuracy metric as described above. As our additional results confirmed our expectations with respect to the LAN performance on smaller budgets, we did not add a metric on parameter confusion.
Overall, we hope that our additional experiments alleviate your concerns regarding the efficiency of MNLE compared to LAN.
https://doi.org/10.7554/eLife.77220.sa2

Article and author information
Author details
Funding
Deutsche Forschungsgemeinschaft (SFB 1233)
 Jan-Matthis Lueckmann
 Jakob H Macke
Deutsche Forschungsgemeinschaft (SPP 2041)
 Jan Boelts
 Jakob H Macke
Deutsche Forschungsgemeinschaft (Germany's Excellence Strategy MLCoE)
 Jan Boelts
 Jan-Matthis Lueckmann
 Richard Gao
 Jakob H Macke
Bundesministerium für Bildung und Forschung (ADIMEM)
 Jan-Matthis Lueckmann
 Jakob H Macke
HORIZON EUROPE Marie Skłodowska-Curie Actions (101030918)
 Richard Gao
Bundesministerium für Bildung und Forschung (Tübingen AI Center)
 Jan Boelts
 Jakob H Macke
Bundesministerium für Bildung und Forschung (FKZ 01IS18052 AD)
 Jan-Matthis Lueckmann
 Jakob H Macke
Bundesministerium für Bildung und Forschung (KZ 01IS18039A)
 Jan Boelts
 Jakob H Macke
The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Luigi Acerbi, Michael Deistler, Alexander Fengler, Michael Frank, and Ingeborg Wenger for discussions and comments on a preliminary version of the manuscript. We also acknowledge and thank the Python and Julia communities for developing the tools enabling this work, including DifferentialEquations.jl, DiffModels.jl, NumPy, pandas, Pyro, PyTorch, sbi, sbibm, and Scikit-learn (see Appendix for details).
Senior Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Reviewing Editor
 Valentin Wyart, École normale supérieure, PSL University, INSERM, France
Reviewers
 Luigi Acerbi, University of Helsinki, Finland
 Jean Daunizeau, Inserm, France
Version history
 Preprint posted: December 23, 2021 (view preprint)
 Received: January 25, 2022
 Accepted: July 26, 2022
 Accepted Manuscript published: July 27, 2022 (version 1)
 Version of Record published: August 12, 2022 (version 2)
 Version of Record updated: September 7, 2022 (version 3)
Copyright
© 2022, Boelts et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.