Likelihood approximation networks (LANs) for fast inference of simulation models in cognitive neuroscience
Abstract
In cognitive neuroscience, computational modeling can formally adjudicate between theories and affords quantitative fits to behavioral/brain data. Pragmatically, however, the space of plausible generative models considered is dramatically limited by the set of models with known likelihood functions. For many models, the lack of a closedform likelihood typically impedes Bayesian inference methods. As a result, standard models are evaluated for convenience, even when other models might be superior. Likelihoodfree methods exist but are limited by their computational cost or their restriction to particular inference scenarios. Here, we propose neural networks that learn approximate likelihoods for arbitrary generative models, allowing fast posterior sampling with only a oneoff cost for model simulations that is amortized for future inference. We show that these methods can accurately recover posterior parameter distributions for a variety of neurocognitive process models. We provide code allowing users to deploy these methods for arbitrary hierarchical model instantiations without further training.
eLife digest
Cognitive neuroscience studies the links between the physical brain and cognition. Computational models that attempt to describe the relationships between the brain and specific behaviours quantitatively is becoming increasingly popular in this field. This approach may help determine the causes of certain behaviours and make predictions about what behaviours will be triggered by specific changes in the brain.
Many of the computational models used in cognitive neuroscience are built based on experimental data. A good model can predict the results of new experiments given a specific set of conditions with few parameters. Candidate models are often called ‘generative’: models that can simulate data. However, typically, cognitive neuroscience studies require going the other way around: they need to infer models and their parameters from experimental data. Ideally, it should also be possible to properly assess the remaining uncertainty over the parameters after having access to the experimental data. To facilitate this, the Bayesian approach to statistical analysis has become popular in cognitive neuroscience.
Common software tools for Bayesian estimation require a ‘likelihood function’, which measures how well a model fits experimental data for given values of the unknown parameters. A major obstacle is that for all but the most common models, obtaining precise likelihood functions is computationally costly. In practice, this requirement limits researchers to evaluating and comparing a small subset of neurocognitive models for which a likelihood function is known. As a result, it is convenience, rather than theoretical interest, that guides this process.
In order to provide one solution for this problem, Fengler et al. developed a method that allows users to perform Bayesian inference on a larger number of models without high simulation costs. This method uses likelihood approximation networks (LANs), a computational tool that can estimate likelihood functions for a broad class of models of decision making, allowing researchers to estimate parameters and to measure how well models fit the data. Additionally, Fengler et al. provide both the code needed to build networks using their approach and a tutorial for users.
The new method, along with the userfriendly tool, may help to discover more realistic brain dynamics underlying cognition and behaviour as well as alterations in brain function.
Introduction
Computational modeling has gained traction in cognitive neuroscience in part because it can guide principled interpretations of functional demands of cognitive systems while maintaining a level of tractability in the production of quantitative fits of brainbehavior relationships. For example, models of reinforcement learning (RL) are frequently used to estimate the neural correlates of the exploration/exploitation tradeoff, of asymmetric learning from positive versus negative outcomes, or of modelbased vs. modelfree control (Schönberg et al., 2007; Niv et al., 2012; Frank et al., 2007; Zajkowski et al., 2017; Badre et al., 2012; Daw et al., 2011b). Similarly, models of dynamic decisionmaking processes are commonly used to disentangle the strength of the evidence for a given choice from the amount of that evidence needed to commit to any choice, and how such parameters are impacted by reward, attention, and neural variability across species (Rangel et al., 2008; Forstmann et al., 2010; Krajbich and Rangel, 2011; Frank et al., 2015; Yartsev et al., 2018; Doi et al., 2020). Parameter estimates might also be used as a theoretically driven method to reduce the dimensionality of brain/behavioral data that can be used for prediction of, for example, clinical status in computational psychiatry (Huys et al., 2016).
Interpreting such parameter estimates requires robust methods that can estimate their generative values, ideally including their uncertainty. For this purpose, Bayesian statistical methods have gained traction. The basic conceptual idea in Bayesian statistics is to treat parameters θ and data $\mathbf{\mathbf{x}}$ as stemming from a joint probability model $p(\theta ,\mathbf{\mathbf{x}})$. Statistical inference proceeds by using Bayes’ rule,
to get at $p(\theta \mathbf{\mathbf{x}})$, the posterior distribution over parameters' given data. $p(\theta )$ is known as the prior distribution over the parameters θ, and $p(\mathbf{\mathbf{x}})$ is known as the evidence or just probability of the data (a quantity of importance for model comparison). The term $p(\mathbf{\mathbf{x}}\theta )$, the probability (density) of the dataset $\mathbf{\mathbf{x}}$ given parameters θ, is known as the likelihood (in accordance with usual notation, we will also write $\mathrm{\ell}(\theta \mathbf{\mathbf{x}})$ in the following). It is common in cognitive science to represent likelihoods as $p(\mathbf{\mathbf{x}}\theta ,s)$, where $s$ specifies a particular stimulus. We suppress $s$ in our notation, but note that our approach easily generalizes when explicitly conditioning on trialwise stimuli. Bayesian parameter estimation is a natural way to characterize uncertainty over parameter values. In turn, it provides a way to identify and probe parameter tradeoffs. While we often do not have access to $p(\theta \mathbf{\mathbf{x}})$ directly, we can draw samples from it instead, for example, via Markov Chain Monte Carlo (MCMC) (Robert and Casella, 2013; Diaconis, 2009; Robert and Casella, 2011).
Bayesian estimation of the full posterior distributions over model parameters contrasts with maximum likelihood estimation (MLE) methods that are often used to extract single best parameter values, without considering their uncertainty or whether other parameter estimates might give similar fits. Bayesian methods naturally extend to settings that assume an implicit hierarchy in the generative model in which parameter estimates at the individual level are informed by the distribution across a group, or even to assess within an individual how trialbytrial variation in (for example) neural activity can impact parameter estimates (commonly known simply as hierarchical inference). Several toolboxes exist for estimating the parameters of popular models like the drift diffusion model (DDM) of decisionmaking and are widely used by the community for this purpose (Wiecki et al., 2013; Heathcote et al., 2019; Turner et al., 2015; Ahn et al., 2017). Various studies have used these methods to characterize how variability in neural activity, and manipulations thereof, alters learning and decision parameters that can quantitatively explain variability in choice and response time distributions (Cavanagh et al., 2011; Frank et al., 2015; Herz et al., 2016; Pedersen and Frank, 2020).
Traditionally, however, posterior sampling or MLE for such models required analytical likelihood functions: a closedform mathematical expression providing the likelihood of observing specific data (reaction times and choices) for a given model and parameter setting. This requirement limits the application of any likelihoodbased method to a relatively small subset of cognitive models chosen for sodefined convenience instead of theoretical interest. Consequently, model comparison and estimation exercises are constrained as many important but likelihoodfree models were effectively untestable or required weeks to process a single model formulation. Testing any slight adjustment to the generative model (e.g., different hierarchical grouping or splitting conditions) requires a repeated time investment of the same order. For illustration, we focus on the class of sequential sampling models (SSMs) applied to decisionmaking scenarios, with the most common variants of the DDM. The approach is, however, applicable to any arbitrary domain.
In the standard DDM, a twoalternative choice decision is modeled as a noisy accumulation of evidence toward one of two decision boundaries (Ratcliff and McKoon, 2008). This model is widely used as it can flexibly capture variations in choice, error rates, and response time distributions across a range of cognitive domains and its parameters have both psychological and neural implications. While the likelihood function is available for the standard DDM and some variants including intertrial variability of its drift parameter, even seemingly small changes to the model form, such as dynamically varying decision bounds (Cisek et al., 2009; Hawkins et al., 2015) or multiplechoice alternatives (Krajbich and Rangel, 2011), are prohibitive for likelihoodbased estimation, and instead require expensive Monte Carlo (MC) simulations, often without providing estimates of uncertainty across parameters.
In the last decade and a half, approximate Bayesian computation (ABC) algorithms have grown to prominence (Sisson et al., 2018). These algorithms enable one to sample from posterior distributions over model parameters, where models are defined only as simulators, which can be used to construct empirical likelihood distributions (Sisson et al., 2018). ABC approaches have enjoyed successful application across life and physical sciences (e.g., Akeret et al., 2015), and notably, in cognitive science (Turner and Van Zandt, 2018), enabling researchers to estimate theoretically interesting models that were heretofore intractable. However, while there have been many advances without sacrificing information loss in the posterior distributions (Turner and Sederberg, 2014; Holmes, 2015), such ABC methods typically require many simulations to generate synthetic or empirical likelihood distributions, and hence can be computationally expensive (in some cases prohibitive – it can take weeks to estimate parameters for a single model). This issue is further exacerbated when embedded within a sequential MCMC sampling scheme, which is needed for unbiased estimates of posterior distributions. For example, one typically needs to simulate between 10,000 and 100,000 times (the exact number varies depending on the model) for each proposed combination of parameters (i.e., for each sample along a Markov chain [MC], which may itself contain tens of thousands of samples), after which they are discarded. This situation is illustrated in Figure 1, where the red arrows point at the computations involved in the approach suggested by Turner et al., 2015.
To address this type of issue, the statistics and machine learning communities have increasingly focused on strategies for the amortization of simulationbased computations (Gutmann et al., 2018; Papamakarios and Murray, 2016; Papamakarios et al., 2019a; Lueckmann et al., 2019; Radev et al., 2020b; Radev et al., 2020a; Gonçalves et al., 2020; Järvenpää et al., 2021). The aim is generally to use model simulations upfront and learn a reusable approximation of the function of interest (targets can be the likelihood, the evidence, or the posterior directly).
In this paper, we develop a general ABC method (and toolbox) that allows users to perform inference on a significant number of neurocognitive models without repeatedly incurring substantial simulation costs. To motivate our particular approach and situate it in the context of other methods, we outline the following key desiderata:
The method needs to be easily and rapidly deployable to end users for Bayesian inference on various models hitherto deemed computationally intractable. This desideratum naturally leads us to an amortization approach, where end users can benefit from costs incurred upfront.
Our method should be sufficiently flexible to support arbitrary inference scenarios, including hierarchical inference, estimation of latent covariate (e.g., neural) processes on the model parameters, arbitrary specification of parameters across experimental conditions, and without limitations on dataset sizes. This desideratum leads us to amortize the likelihood functions, which (unlike other amortization strategies) can be immediately applied to arbitrary inference scenarios without further cost.
We desire approximations that do not a priori sacrifice covariance structure in the parameter posteriors, a limitation often induced for tractability in variational approaches to approximate inference (Blei et al., 2017).
End users should have access to a convenient interface that integrates the new methods seamlessly into preexisting workflows. The aim is to allow users to get access to a growing database of amortized models through this toolbox and enable increasingly complex models to be fit to experimental data routinely, with minimal adjustments to the user’s working code. For this purpose, we will provide an extension to the widely used HDDM toolbox (Wiecki et al., 2013; Pedersen and Frank, 2020) for parameter estimation of DDM and RL models.
Our approach should exploit modern computing architectures, specifically parallel computation. This leads us to focus on the encapsulation of likelihoods into neural network (NN) architectures, which will allow batch processing for posterior inference.
Guided by these desiderata, we developed two amortization strategies based on NNs and empirical likelihood functions. Rather than simulate during inference, we instead train NNs as parametric function approximators to learn the likelihood function from an initial set of a priori simulations across a wide range of parameters. By learning (log) likelihood functions directly, we avoid posterior distortions that result from inappropriately chosen (userdefined) summary statistics and corresponding distance measures as applied in traditional ABC methods (Sisson et al., 2018). Once trained, likelihood evaluation only requires a forward pass through the NN (as if it were an analytical likelihood) instead of necessitating simulations. Moreover, any algorithm can be used to facilitate posterior sampling, MLE or maximum a posteriori estimation (MAP).
For generality, and because they each have their advantages, we use two classes of architectures, multilayered perceptrons (MLPs) and convolutional neural networks (CNNs), and two different posterior sampling methods (MCMC and importance sampling). We show proofs of concepts using posterior sampling and parameter recovery studies for a range of cognitive process models of interest. The trained NNs provide the community with a (continually expandable) bank of encapsulated likelihood functions that facilitate consideration of a larger (previously computationally inaccessible) set of cognitive models, with orders of magnitude speedup relative to simulationbased methods. This speedup is possible because costly simulations only have to be run once per model upfront and henceforth be avoided during inference: previously executed computations are amortized and then shared with the community.
Moreover, we develop standardized amortization pipelines that allow the user to apply this method to arbitrary models, requiring them to provide only a functioning simulator of their model of choice.
In the 'Approximate Bayesian Computation' section, we situate our approach in the greater context of ABC, with a brief review of online and amortization algorithms. The subsequent sections describe our amortization pipelines, including two distinct strategies, as well as their suggested use cases. The 'Test beds' section provides an overview of the (neuro)cognitive process models that comprise our benchmarks. The 'Results' section shows proofofconcept parameter recovery studies for the two proposed algorithms, demonstrating that the method accurately recovers both the posterior mean and variance (uncertainty) of generative model parameters, and that it does so at a runtime speed of orders of magnitude faster than traditional ABC approaches without further training. We further demonstrate an application to hierarchical inference, in which our trained networks can be imported into widely used toolboxes for arbitrary inference scenarios. In the'Discussion' and the last section, we further situate our work in the context of other ABC amortization strategies and discuss the limitations and future work.
Approximate Bayesian computation
ABC methods apply when one has access to a parametric stochastic simulator (also referred to as generative model), but, unlike the usual setting for statistical inference, no access to an explicit mathematical formula for the likelihood of observations given the simulator’s parameter setting.
While likelihood functions for a stochastic stimulator can in principle be mathematically derived, this can be exceedingly challenging even for some historically famous models such as the Ornstein–Uhlenbeck (OU) process (Lipton and Kaushansky, 2018) and may be intractable in many others. Consequently, the statistical community has increasingly developed ABC tools that enable posterior inference of such ‘likelihoodfree’ stochastic models while completely bypassing any likelihood derivations (Cranmer et al., 2020).
Given a parametric stochastic simulator model $\mathcal{M}$ and dataset $\mathbf{\mathbf{x}}$, instead of exact inference based on ${p}_{\mathcal{M}}(\theta \mathbf{\mathbf{x}})$, these methods attempt to draw samples from an approximate posterior ${\stackrel{~}{p}}_{\mathcal{M}}(\theta \mathbf{\mathbf{x}})$. Consider the following general equation:
where ${s}_{\mathcal{M}}$ refers to sufficient statistics (roughly, summary statistics of a dataset that retain sufficient information about the parameters of the generative process). ${K}_{h}({s}_{\mathcal{\mathcal{M}}}{s}_{\mathbf{x}})$ refers to a kernelbased distance measure/cost function, which evaluates a probability density function for a given distance between the observed and expected summary statistics ${s}_{\mathcal{M}}{s}_{\mathbf{\mathbf{x}}}$. The parameter $h$ (commonly known as bandwidth parameter) modulates the cost gradient. Higher values of $h$ lead to more graceful decreases in cost (and therefore a worse approximation of the true posterior).
By generating simulations, one can use such summary statistics to obtain approximate likelihood functions, denoted as ${\stackrel{~}{p}}_{\mathcal{M}}(\mathbf{\mathbf{x}}\theta )$, where approximation error can be mitigated by generating large numbers of simulations. The caveat is that the amount of simulation runs needed to achieve a desired degree of accuracy in the posterior can render such techniques computationally infeasible.
With a focus on amortization, our goal is to leverage some of the insights and developments in the ABC community to develop NN architectures that can learn approximate likelihoods deployable for any inference scenario (and indeed any inference method, including MCMC, variational inference, or even MLE) without necessitating repeated training. We next describe our particular approach and return to a more detailed comparison to existing methods in the 'Discussion' section.
Learning the likelihood with simple NN architectures
In this section, we outline our approach to amortization of computational costs of large numbers of simulations required by traditional ABC methods. Amortization approaches incur a oneoff (potentially quite large) simulation cost to enable cheap, repeated inference for any dataset. Recent research has led to substantial developments in this area (Cranmer et al., 2020). The most straightforward approach is to simply simulate large amounts of data and compile a database of how model parameters are related to observed summary statistics of the data (Mestdagh et al., 2019). This database can then be used during parameter estimation in empirical datasets using a combination of nearestneighbor search and local interpolation methods. However, this approach suffers from the curse of dimensionality with respect to storage demands (a problem that is magnified with increasing model parameters). Moreover, its reliance on summary statistics (Sisson et al., 2018) does not naturally facilitate flexible reuse across inference scenarios (e.g., hierarchical models, multiple conditions while fixing some parameters across conditions, etc.).
To fulfill all desiderata outlined in the introduction, we focus on directly encapsulating the likelihood function over empirical observations of a simulation model so that likelihood evaluation is substantially cheaper than constructing (empirical) likelihoods via model simulation online during inference. Such empirical observations could be, for example, trialwise choices and reaction times (RTs). This strategy then allows for flexible reuse of such approximate likelihood functions $\widehat{\mathrm{\ell}}(\theta \mathbf{\mathbf{x}})$ in a large variety of inference scenarios applicable to common experimental design paradigms. Specifically, we encapsulate $\widehat{\mathrm{\ell}}(\theta \mathbf{\mathbf{x}})$ as a feedforward NN, which allows for parallel evaluation by design. We refer to these networks as likelihood approximation networks (LANs).
Figure 1 spells out the setting (A) and usage (B) of such a method. The LAN architectures used in this paper are simple, small in size, and are made available for download for local usage. While this approach does not allow for instantaneous posterior inference, it does considerably reduce computation time (by up to three orders of magnitude; see ‘Run time’ section) when compared to approaches that demand simulations at inference. Notably, this approach also substantially speeds up inference even for models that are not entirely likelihood free but nevertheless require costly numerical methods to obtain likelihoods. Examples are the fullDDM with intertrial variability in parameters (for which likelihoods can be obtained via numerical integration, as is commonly done in software packages such as HDDM but with substantial cost), but also other numerical methods for generalized DDMs (Shinn et al., 2020; Drugowitsch, 2016). At the same time, we maintain the high degree of flexibility with regards to deployment across arbitrary inference scenarios. As such, LANs can be treated as a highly flexible plugin to existing inference algorithms and remain conceptually simple and lightweight.
Before elaborating our LAN approach, we briefly situate it in the context of some related work.
One branch of literature that interfaces ABC with deep learning attempts to amortize posterior inference directly in endtoend NNs (Radev et al., 2020b; Radev et al., 2020a; Papamakarios and Murray, 2016; Papamakarios et al., 2019a; Gonçalves et al., 2020). Here, NN architectures are trained with a large number of simulated datasets to produce posterior distributions over parameters, and once trained, such networks can be applied to directly estimate parameters from new datasets without the need for further simulations. However, the goal to directly estimate posterior parameters from data requires the user to first train a NN for the very specific inference scenario in which it is applied empirically. Such approaches are not easily deployable if a user wants to test multiple inference scenarios (e.g., parameters may vary as a function of task condition or brain activity, or in hierarchical frameworks), and consequently, they do not achieve our second desideratum needed for a userfriendly toolbox, making inference cheap and simple.
We return to discuss the relative merits and limitations of these and further alternative approaches in the 'Discussion' section.
Formally, we use model simulations to learn a function ${f}_{\mathrm{\Phi}}(x,\theta )$, where ${f}_{\mathrm{\Phi}}(.)$ is the output of a NN with parameter vector $\mathrm{\Phi}$ (weights, biases). The function ${f}_{\mathrm{\Phi}}(\mathbf{\mathbf{x}},\theta )$ is used as an approximation $\widehat{\mathrm{\ell}}(\theta \mathbf{\mathbf{x}})$ of the likelihood function $\mathrm{\ell}(\theta \mathbf{\mathbf{x}})$. Once learned, we can use such $f$ as a plugin to Bayes' rule:
To perform posterior inference, we can now use a broad range of existing MC or MCMC algorithms. We note that, fulfilling our second desideratum, an extension to hierarchical inference is as simple as plugging in our NN into a probabilistic model of the form
where α, β are generic grouplevel global parameters, and γ serves as a generic fixed hyperparameter vector.
We provide proofs of concepts for two types of LANs. While we use MLPs and convolutional neural networks (CNNs), of conceptual importance is the distinction between the two problem representations they tackle, rather than the network architectures per se.
The first problem representation, which we call the pointwise approach, considers the functions ${f}_{\mathrm{\Phi}}(x\theta )$, where θ is the parameter vector of a given stochastic simulator model, and $x$ is a single datapoint (trial outcome). The pointwise approach is a mapping from the input dimension $\mathrm{\Theta}+x$, where $.$ refers to the cardinality, to the onedimensional output. The output is simply the loglikelihood of the single datapoint $x$ given the parameter vector θ. As explained in the next section, for this mapping we chose simple MLPs.
The second problem representation, which we will refer to as the histogram approach, instead aims to learn a function ${f}_{\mathrm{\Phi}}(.\theta )$, which maps a parameter vector θ to the likelihood over the full (discretized) dataspace (i.e., the likelihood of the entire RT distributions at once). We represent the output space as an outcome histogram with dimensions $n\times m$ (where in our applications $n$ is the number of bins for a discretization of reaction times, and $m$ refers to the number of distinct choices). Thus, our mapping has input dimension $\mathrm{\Theta}$ and output dimension $m\times n$. Representing the problem this way, we chose CNNs as the network architecture.
In each of the above cases, we pursued the architectures that seemed to follow naturally, without any commitment to their optimality. The pointwise approach operates on lowdimensional inputs and outputs. With network evaluation speed being of primary importance to us, we chose a relatively shallow MLP to learn this mapping, given that it was expressive enough. However, when learning a mapping from a parameter vector θ to the likelihood over the full dataspace, as in the histogram approach, we map a lowdimensional data manifold to a much higher dimensional one. Using an MLP for this purpose would imply that the number of NN parameters needed to learn this function would be orders of magnitude larger than using a CNN. Not only would this mean that forward passes through the network would take longer, but also an increased propensity to overfit on an identical data budget.
The next two sections will give some detail regarding the networks chosen for the pointwise and histogram LANs. Figure 1 illustrates the general idea of our approach while Figure 2 gives a conceptual overview of the exact training and inference procedure proposed. These details are thoroughly discussed in the last section of the paper.
Pointwise approach: learn likelihoods of individual observations with MLPs
As a first approach, we use simple MLPs to learn the likelihood function of a given stochastic simulator. The network learns the mapping $\{\theta ,\mathbf{\mathbf{x}}\}\to \mathrm{log}{\mathrm{\ell}}_{\mathrm{\Phi}}(\theta \mathbf{\mathbf{x}})$, where θ represents the parameters of our stochastic simulator, $\mathbf{\mathbf{x}}$ are the simulator outcomes (in our specific examples below, $\mathbf{\mathbf{x}}$, refers to the tuple $(rt,c)$ of reaction times and choices), and $\mathrm{log}\mathrm{\ell}(\theta \mathbf{\mathbf{x}})$ is the loglikelihood of the respective outcome. The trained network then serves as the approximate likelihood function $\mathrm{log}{\widehat{\mathrm{\ell}}}_{\mathrm{\Phi}}(\theta \mathbf{\mathbf{x}})$. We emphasize that the network is trained as a function approximator to provide us with a computationally cheap likelihood estimate, not to provide a surrogate simulator. Biological plausibility of the network is therefore not a concern when it comes to architecture selection. Fundamentally this approach attempts to learn the loglikelihood via a nonlinear regression, for which we chose an MLP. Loglikelihood labels for training were derived from empirical likelihoods functions (details are given in the 'Materials and methods' section), which in turn were constructed as kernel density estimates (KDEs). The construction of KDEs roughly followed Turner et al., 2015. We chose the Huber loss function (details are given in the 'Materials and methods' section) because it is more robust to outliers and thus less susceptible to distortions that can arise in the tails of distributions.
Histogram approach: learn likelihoods of entire dataset distributions with CNNs
Our second approach is based on a CNN architecture. Whereas the MLP learned to output a single scalar likelihood output for each datapoint (‘trial’, given a choice, reaction time, and parameter vector), the goal of the CNN was to evaluate, for the given model parameters, the likelihood of an arbitrary number of datapoints via one forward pass through the network. To do so, the output of the CNN was trained to produce a probability distribution over a discretized version of the dataspace, given a stochastic model and parameter vector. The network learns the mapping $\theta \to \mathrm{log}{\mathrm{\ell}}_{\mathrm{\Phi}}(\theta .)$.
As a side benefit, in line with the methods proposed by Lueckmann et al., 2019; Papamakarios et al., 2019b, the CNN can in fact act as a surrogate simulator; for the purpose of this paper, we do not exploit this possibility however. Since here we attempt to learn distributions, labels were simple binned empirical likelihoods, and as a loss function we chose the Kullback Leibler (KL) divergence between the network's output distribution and the label distribution (details are given in the 'Materials and methods' section).
Training specifics
Most of the specifics regarding training procedures are discussed in detail in the 'Materials and methods' section; however, we mention some aspects here to aid readability.
We used 1.5 M and 3 M parameter vectors (based on 100k simulations each) to train the MLP and CNN approach, respectively. We chose these numbers consistently across all models and in fact trained on less examples for the MLP only due to RAM limitations we faced on our machines (which in principle can be circumvented). These numbers are purposely high (but in fact quite achievable with access to a computing cluster, simulations for each model was on the order of hours only) since we were interested in a workable proof of concept. We did not investigate the systematic minimization of training data; however, some crude experiments indicate that a decrease by an order of magnitude did not seriously affect performance.
We emphasize that this is in line with the expressed philosophy of our approach. The point of amortization is to throw a lot of resources at the problem once so that downstream inference is made accessible even on basic setups (a usual laptop). In case simulators are prohibitively expensive even for reasonably sized computer clusters, minimizing training data may gain more relevance. In such scenarios, training the networks will be very cheap compared to simulation time, which implies that retraining with progressively more simulations until one observes asymptotic test performance is a viable strategy.
Test beds
We chose variations of SSMs common in the cognitive neurosciences as our test bed (Figure 3). The range of models we consider permits great flexibility in allowable data distributions (choices and response times). We believe that initial applications are most promising for such SSMs because (1) analytical likelihoods are available for the most common variants (and thus provide an upperbound benchmark for parameter recovery) and (2) there exist many other interesting variants for which no analytic solution exists.
We note that there is an intermediate case in which numerical methods can be applied to obtain likelihoods for a broader class of models (e.g., Shinn et al., 2020). These methods are nevertheless computationally expensive and do not necessarily afford rapid posterior inference. Therefore, amortization via LANs is attractive even for these models. Figure 3 further outlines this distinction.
We emphasize that our methods are quite general, and any model that generates discrete choices and response times from which simulations can be drawn within a reasonable amount of time can be suitable to the amortization techniques discussed in this paper (given that the model has parameter vectors of dimension roughly lt_{15}). In fact, LANs are not restricted to models of reaction time and choice to begin with, even though we focus on these as test beds.
As a general principle, all models tested below are based on stochastic differential equations (SDEs) of the following form:
where we are concerned with the probabilistic behavior of the particle (or vector of particles) $\mathbf{\mathbf{X}}$. The behavior of this particle is driven by $a(t,x)$, an underlying drift function, $b(t,x)$, an underlying noise transformation function, ${B}_{t}$, an incremental noise process, and ${X}_{0}=w$, a starting point.
Of interest to us are specifically the properties of the firstpassagetimedistributions (FPTD) for such processes, which are needed to compute the likelihood of a given response time/choice pair $\{rt,c\}$. In these models, the exit region of the particle (i.e., the specific boundary it reaches) determines the choice, and the time point of that exit determines the response time. The joint distribution of choices and reaction times is referred to as a FPTD.
Given some exit region $\mathcal{E}$, such FPTDs are formally defined as
In words, a first passage time, for a given exit region, is defined as the first time point of entry into the exit region, and the FPTD is the probability, respectively, of such an exit happening at any specified time $t$ (Ratcliff, 1978; Feller, 1968). Partitioning the exit region into subsets ${\mathcal{E}}_{1},\mathrm{\dots}.,{\mathcal{E}}_{n}$ (e.g., representing $n$ choices), we can now define the set of defective distributions
where $\theta \in \mathrm{\Theta}$ describes the collection of parameters driving the process. For every ${\mathcal{E}}_{i}$,
$\{{f}_{{\mathcal{E}}_{\U0001d4be}},\mathrm{\dots},{f}_{{\mathcal{E}}_{\U0001d4c3}}\}$ jointly define the FPTD such that
These functions ${f}_{{\mathcal{E}}_{i}}$, jointly serve as the likelihood function s.t.
For illustration, we focus the general model formulation above to the standard DDM. Details regarding the other models in our test bed are relegated to the 'Materials and methods' section.
To obtain the DDM from the general equation above, we set $a(t,x)=v$ (a fixed drift across time), $b(t,x)=1$ (a fixed noise variance across time), and $\mathrm{\Delta}\mathbf{\mathbf{B}}\sim \mathcal{N}(0,\mathrm{\Delta}t)$. The DDM applies to the two alternative decision cases, where decision corresponds to the particle crossings of an upper or lower fixed boundary. Hence, ${\mathcal{E}}_{1}=\{\mathbb{R}\ge a\}$ and ${\mathcal{E}}_{2}=\{\mathbb{R}\le a\}$, where $a$ is a parameter of the model. The DDM also includes a normalized starting point $w$ (capturing potential response biases or priors), and finally a nondecision time τ (capturing the time for perceptual encoding and motor output). Hence, the parameter vector for the DDM is then $\theta =(v,a,w,\tau )$. The SDE is defined as
The DDM serves principally as a basic proof of concept for us, in that it is a model for which we can compute the exact likelihoods analytically (Feller, 1968; Navarro and Fuss, 2009).
The other models chosen for our test bed systematically relax some of the fixed assumptions of the basic DDM, as illustrated in Figure 3.
We note that many new models can be constructed from the components tested here. As an example of this modularity, we introduce inhibition/excitation to the race model, which gives us the leaky competing accumulator (LCA) (Usher and McClelland, 2001). We could then further extend this model by introducing parameterized bounds. We could introduce reinforcement learning parameters to a DDM (Pedersen and Frank, 2020) or in combination with any of the other decision models. Again we emphasize that while these diffusionbased models provide a large test bed for our proposed methods, applications are in no way restricted to this class of models.
Results
Networks learn likelihood function manifolds
Across epochs of training, both training and validation loss decrease rapidly and remain low (Figure 4A) suggesting that overfitting is not an issue, which is sensible in this context. The low validation loss further shows that the network can interpolate likelihoods to specific parameter values it has not been exposed to (with the caveat that it has to be exposed to the same range; no claims are made about extrapolation).
Indeed, a simple interrogation of the learned likelihood manifolds shows that they smoothly vary in an interpretable fashion with respect to changes in generative model parameters (Figure 4B). Moreover, Figure 4C shows that the MLP likelihoods mirror those obtained by KDEs using 100,000 simulations, even though the model parameter vectors were drawn randomly and thus not trained per se. We also note that the MLP likelihoods appropriately filter out simulation noise (random fluctuations in the KDE empirical likelihoods across separate simulation runs of $100K$ samples each). This observation can also be gleaned from Figure 4C, which shows the learned likelihood to sit right at the center of sampled KDEs (note that for each subplot 100 such KDEs were used). As illustrated in the Appendix, these observations hold across all tested models. One perspective on this is to consider the MLP likelihoods as equivalent to KDE likelihoods derived from a much larger number of underlying samples and interpolated. The results for the CNN (not shown to avoid redundancy) mirror the MLP results. Finally, while Figure 4 depicts the learned likelihood for the simple DDM for illustration purposes, the same conclusions apply to the learned manifolds for all of the tested models (as shown in Appendix 1—figures 1–6). Indeed inspection of those manifolds is insightful for facilitating interpretation of the dynamics of the underlying models, how they differ from each other, and the corresponding RT distributions that can be captured.
Parameter recovery
Benchmark: analytical likelihood available
While the above inspection of the learned manifolds is promising, a true test of the method is to determine whether one can perform proper inference of generative model parameters using the MLP and CNN. Such parameter recovery exercises are typically performed to determine whether a given model is identifiable for a given experimental setting (e.g., number of trials, conditions, etc.). Indeed, when parameters are collinear, recovery can be imperfect even if the estimation method itself is flawless (Wilson and Collins, 2019; Nilsson et al., 2011; Daw, 2011a). A Bayesian estimation method, however, should properly assign uncertainty to parameter estimates in these circumstances, and hence it is also important to evaluate the posterior variances over model parameters.
Thus as a benchmark, we first consider the basic DDM for which an arbitrarily close approximation to the analytical likelihood is available (Navarro and Fuss, 2009). This benchmark allows us to compare parameter recovery given (1) the analytical likelihood, (2) an approximation to the likelihood specified by training an MLP on the analytical likelihood (thus evaluating the potential loss of information incurred by the MLP itself), (3) an approximation to the likelihood specified by training an MLP on KDEbased empirical likelihoods (thus evaluating any further loss incurred by the KDE reconstruction of likelihoods), and (4) an approximate likelihood resulting from training the CNN architecture, on empirical histograms. Figure 5 shows the results for the DDM.
For the simple DDM and analytical likelihood, parameters are nearly perfectly recovered given $N=1024$ datapoints (‘trials’) (Figure 5A). Notably, these results are mirrored when recovery is performed using the MLP trained on the analytical likelihood (Figure 5B). This finding corroborates, as visually suggested by the learned likelihood manifolds, the conclusion that globally the likelihood function was well behaved. Moreover, only slight reductions in recoverability were incurred when the MLP was trained on the KDE likelihood estimates (Figure 5C), likely due to the known small biases in KDE itself (Turner et al., 2015). Similar performance is achieved using the CNN instead of MLP (Figure 5D).
As noted above, an advantage of Bayesian estimation is that we obtain an estimate of the posterior uncertainty in estimated parameters. Thus, a more stringent requirement is to additionally recover the correct posterior variance for a given dataset $\mathcal{D}$ and model $\mathcal{M}$. One can already see visually in Figure 5C, D that posterior uncertainty is larger when the mean is further from the ground truth (lighter shades of gray indicate higher posterior variance). However, to be more rigorous one can assess whether the posterior variance is precisely what it should be.
The availability of an analytical likelihood for the DDM, together with our use of sampling methods (as opposed to variational methods that can severely bias posterior variance), allows us to obtain the ‘ground truth’ uncertainty in parameter estimates. Figure 6 shows that the sampling from a MLP trained on analytical likelihoods, an MLP trained on KDEbased likelihoods, and a CNN all yield excellent recovery of the variance. For an additional run that involved datasets of size $n=4096$ instead of $n=1024$, we observed a consistent decrease in posterior variance across all methods (not shown) as expected.
No analytical likelihood available
As a proof of concept for the more general ABC setting, we show parameter recovery results for two nonstandard models, the linear collapse (LC) and Weibull models, as described in the 'Test bed' section. The results are summarized in Figures 7 and 8 and described in more detail in the following two paragraphs.
Parameter recovery
Figure 7 shows that both the MLP and CNN methods consistently yield very good to excellent parameter recovery performance for the LC model, with parameterwise regression coefficients globally above ${R}^{2}>0.9$. As shown in Figure 8, parameter recovery for the Weibull model is less successful, however, particularly for the Weibull collapsing bound parameters. The drift parameter $v$, the starting point bias $w$, and the nondecision time are estimated well; however, the boundary parameters $a$, α, and β are less well recovered by the posterior mean. Judging by the parameter recovery plot, the MLP seems to perform slightly less well on the boundary parameters when compared to the CNN.
To interrogate the source of the poor recovery of α and β parameters, we considered the possibility that the model itself may have issues with identifiability, rather than poor fit. Figure 8 shows that indeed, for two representative datasets in which these parameters are poorly recovered, the model nearly perfectly reproduces the ground truth data in the posterior predictive RT distributions. Moreover, we find that whereas the individual Weibull parameters are poorly recovered, the net boundary $B(t)$ is very well recovered, particularly when evaluated within the range of the observed dataset. This result is reminiscent of the literature on sloppy models (Gutenkunst et al., 2007), where sloppiness implies that various parameter configurations can have the same impact on the data. Moreover, two further conclusions can be drawn from this analysis. First, when fitting the Weibull model, researchers should interpret the bound trajectory as a latent parameter rather than the individual α and β parameters per se. Second, the Weibull model may be considered as viable only if the estimated bound trajectory varies sufficiently within the range of the empirical RT distributions. If the bound is instead flat or linearly declining in that range, the simple DDM or LC models may be preferred, and their simpler form would imply that they would be selected by any reasonable model comparison metric. Lastly, given our results the Weibull model could likely benefit from reparameterization if the desire is to recover individual parameters rather than the bound trajectory $B(t)$. Given the common use of this model in collapsing bound studies (Hawkins et al., 2015) and that the bound trajectories are nevertheless interpretable, we leave this issue for future work.
The Appendix shows parameter recovery studies on a number of other stochastic simulators with nonanalytical likelihoods, described in the 'Test bed' section. The appendices show tables of parameterwise recovery ${R}^{2}$ for all models tested. In general, recovery ranges from good to excellent. Given the Weibull results above, we attribute the less good recovery for some of these models to identifiability issues and specific dataset properties rather than to the method per se. We note that our parameter recovery studies here are in general constrained to the simplest inference setting equivalent to a singlesubject, singlecondition experimental design. Moreover, we use uninformative priors for all parameters of all models. Thus, these results provide a lower bound on parameter recoverability, provided of course that the datasets were generated from the valid parameter ranges on which the networks were trained; see Section 0.10 for how recovery can benefit from more complex experimental designs with additional task conditions, which more faithfully represents the typical inference scenario deployed by cognitive neuroscientists. Lastly, some general remarks about the parameter recovery performance. A few factors can negatively impact how well one can recover parameters. First, if the model generally suffers from identifiability issues, the resulting tradeoffs in the parameters can lead the MCMC chain to get stuck on the boundary for one or more parameters. This issue is endemic to all models and unrelated to likelihoodfree methods or LANs, and should at best be attacked at the level of reparameterization (or a different experimental design that can disentangle model parameters). Second, if the generative parameters of a dataset are too close to (or beyond) the bounds of the trained parameter space, we may also end with a chain that gets stuck on the boundary of the parameter space. We confronted this problem by training on parameter spaces that yield response time distributions that are broader than typically observed experimentally for models of this class, while also excluding obviously defective parameter setups. Defective parameter setups were defined in the context of our applications as parameter vectors that generate data that never allow one or the other choice to occur (as in $p(c)\ll \frac{1}{100,000}$, data that concentrates more than half of the reaction times within a single 1 ms bin and data that generated mean reaction times beyond 10 s). These guidelines were chosen as a mix of basic rationale and domain knowledge regarding usual applications of DDMs to experimental data. As such, the definition of defective data may depend on the model under consideration.
Runtime
A major motivation for this work is the amortization of network training time during inference, affording researchers the ability to test a variety of theoretically interesting models for linking brain and behavior without large computational cost. To quantify this advantage, we provide some results on the posterior sampling runtimes using (1) the MLP with slice sampling (Neal, 2003) and (2) CNN with iterated importance sampling.
The MLP timings are based on slice sampling (Neal, 2003), with a minimum of $n=2000$ samples. The sampler was stopped at some $n\ge 2000$, for which the Geweke statistic (Geweke, 1992) indicated convergence (the statistic was computed once every 100 samples for $n\ge 2000$). Using an alternative sampler, based on differential evolution Markov chain Monte Carlo (DEMCMC) and stopped when the Gelman–Rubin $\widehat{R}<1.1$ (Gelman and Rubin, 1992), yielded very similar timing results and was omitted in our figures.
For the reported importance sampling runs, we used 200K importance samples per iteration, starting with γ values of 64, which was first reduced to 1 where in iteration $i$, ${\gamma}_{i}=\frac{64}{{2}^{i1}}$, before a stopping criterion based on relative improvement of the confusion metric was used.
Figure 9A shows that all models can be estimated in the order of hundreds of seconds (minutes), comprising a speed improvement of at least two orders of magnitude compared to traditional ABC methods using KDE during inference (i.e., the PDA method motivating this work; Turner et al., 2015). Indeed, this estimate is a lower bound on the speed improvement: we extrapolate only the observed difference between network evaluation and online simulations, ignoring the additional cost of constructing and evaluating the KDEbased likelihood. We decided to use this benchmark because it provides a fairer comparison to more recent PDA approaches in which the KDE evaluations can be sped up considerably (Holmes, 2015).
Notably, due to its potential for parallelization (especially on GPUs), our NN methods can even induce performance speedups relative to analytical likelihood evaluations. Indeed, Figure 9B shows that as the dataset grows runtime is significantly faster than even a highly optimized cython implementation of the Navarro Fuss algorithm (Navarro and Fuss, 2009) for evaluation of the analytic DDM likelihood. This is also noteworthy in light of the fullDDM (as described in the 'Test bed' section), for which it is currently common to compute the likelihood term via quadrature methods, in turn based on repeated evaluations of the Navarro Fuss algorithm. This can easily inflate the evaluation time by 1–2 orders of magnitude. In contrast, evaluation times for the MLP and CNN are only marginally slower (as a function of the slightly larger network size in response to higher dimensional inputs). We confirm (omitted as separate figure) from experiments with the HDDM Python toolbox that our methods end up approximately $1050$ times faster for the fullDDM than the current implementation based on numerical integration, maintaining comparable parameter recovery performance. We strongly suspect there to be additional remaining potential for performance optimization.
Hierarchical inference
One of the principal benefits of LANs is that they can be directly extended – without further training – to arbitrary hierarchical inference scenarios, including those in which (1) individual participant parameters are drawn from group distributions, (2) some parameters are pooled and others separated across task conditions, and (3) neural measures are estimated as regressors on model parameters (Figure 10). Hierarchical inference is critical for improving parameter estimation particularly for realistic cognitive neuroscience datasets in which thousands of trials are not available for each participant and/or where one estimates impacts of noisy physiological signals onto model parameters (Wiecki et al., 2013; Boehm et al., 2018; Vandekerckhove et al., 2011; Ratcliff and Childers, 2015).
To provide a proof of concept, we developed an extension to the HDDM Python toolbox (Wiecki et al., 2013), widely used for hierarchical inference of the DDM applied to such settings. Lifting the restriction of previous versions of HDDM to only DDM variants with analytical likelihoods, we imported the MLP likelihoods for all twochoice models considered in this paper. Note that GPUbased computation is supported out of the box, which can easily be exploited with minimal overhead using free versions of Google’s Colab notebooks. We generally observed GPUs to improve speed approximately fivefold over CPUbased setups for the inference scenarios we tested. Preliminary access to this interface and corresponding instructions can be found at https://github.com/lnccbrown/lans/tree/master/hddmnn_tutorial (copy archived at swh:1:rev:e3369b9df138c75d0e490be0c48c53ded3e3a1d6); Fengler, 2021.
Figure 11 shows example results from hierarchical inference using the LC model, applied to synthetic datasets comprising 5 and 20 subjects (a superset of participants). Recovery of individual parameters was adequate even for five participants, and we also observe the expected improvement of recovery of the grouplevel parameters μ and σ for 20 participants.
Figure 12 shows an example that illustrates how parameter recovery is affected when a dataset contains multiple experimental conditions (e.g., different difficulty levels). It is common in such scenarios to allow task conditions to affect a single (or subset)model parameter (in the cases shown: $v$), while other model parameters are pooled across conditions. As expected, for both the fullDDM (A) and the Levy model (B), the estimation of global parameters is improved when increasing the number of conditions from 1 to 5 to 10 (left to right, where the former are subsets of the latter datasets). These experiments confirm that one can more confidently estimate parameters that are otherwise difficult to estimate such as the noise α in the Levy model and $sv$ the standard deviation of the drift in the fullDDM.
Both of these experiments provide evidence that our MLPs provide approximate likelihoods which behave in accordance with what is expected from proper analytical methods, while also demonstrating their robustness to other samplers (i.e., we used HDDM slice samplers without further modification for all models).
We expect that proper setting of prior distributions (uniform in our examples) and further refinements to the slice sampler settings (to help mode discovery) can improve these results even further. We include only the MLP method in this section since it is most immediately amenable to the kind of trialbytriallevel analysis that HDDM is designed for. We plan to investigate the feasibility of including the CNN method into HDDM in future projects.
Discussion
Our results demonstrate the promise and potential of amortized LANs for Bayesian parameter estimation of neurocognitive process models. Learned manifolds and parameter recovery experiments showed successful inference using a range of network architectures and posterior sampling algorithms, demonstrating the robustness of the approach.
Although these methods are extendable to any model of similar complexity, we focused here on a class of SSMs, primarily because the most popular of them – the DDM – has an analytic solution, and is often applied to neural and cognitive data. Even slight departures from the standard DDM framework (e.g., dynamic bounds or changes in the noise distribution) are often not considered for full Bayesian inference due to the computational complexity associated with traditional ABC methods. We provide access to the learned likelihood functions (in the form of network weights) and code to enable users to fit a variety of such models with orders of magnitude speedup (minutes instead of days). In particular, we provided an extension to the commonly used HDDM toolbox (Wiecki et al., 2013) that allows users to apply these models to their own datasets immediately. We also provide access to code that would allow users to train their own likelihood networks and perform recovery experiments, which can then be made available to the community.
We offered two separate approaches with their own relative advantages and weaknesses. The MLP is suited for evaluating likelihoods of individual observations (choices, response times) given model parameters, and as such can be easily extended to hierarchical inference settings and trialbytrial regression of neural activity onto model parameters. We showed that importing the MLP likelihood functions into the HDDM toolbox affords fast inference over a variety of models without tractable likelihood functions. Moreover, these experiments demonstrated that use of the NN likelihoods even confers a performance speedup over the analytical likelihood function – particularly for the fullDDM, which otherwise required numerical methods on top of the analytical likelihood function for the simple DDM.
Conversely, the CNN approach is well suited for estimating likelihoods across parameters for entire datasets in parallel, as implemented with importance sampling. More generally and implying potential further improvements, any sequential MC method may be applied instead. These methods offer a more robust path to sampling from multimodal posteriors compared to MCMC, at the cost of the curse of dimensionality, rendering them potentially less useful for highly parameterized problems, such as those that require hierarchical inference. Moreover, representing the problem directly as one of learning probability distributions and enforcing the appropriate constraints by design endows the CNN approach with a certain conceptual advantage. Finally, we note that in principle (with further improvements) triallevel inference is possible with the CNN approach, and vice versa, importance sampling can be applied to the MLP approach.
In this work, we employed sampling methods (MCMC and importance sampling) for posterior inference because in the limit they are well known to allow for accurate estimation of posterior distributions on model parameters, including not only mean estimates but their variances and covariances. Accurate estimation of posterior variances is critical for any hypothesis testing scenario because it allows one to be confident about the degree of uncertainty in parameter estimates. Indeed, we showed that for the simple DDM we found that posterior inference using our networks yielded nearly perfect estimation of the variances of model parameters (which are available due to the analytic solution). Of course, our networks can also be deployed for other estimation methods even more rapidly: they can be immediately used for MLE via gradient descent or within other approximate inference methods, such as variational inference (see Acerbi, 2020 for a related approach).
Other approaches exist for estimating generalized diffusion models. A recent example, not discussed thus far, is the pyDDM Python toolbox (Shinn et al., 2020), which allows MLE of generalized drift diffusion models. The underlying solver is based on the Fokker–Planck equations, which allow access to approximate likelihoods (where the degree of approximation is traded off with computation time/discretization granularity) for a flexible class of diffusionbased models, notably allowing arbitrary evidence trajectories, starting point, and nondecision time distributions. However, to incorporate trialbytrial effects would severely inflate computation time (on the order of the number of trials) since the solver would have to operate on a trialbytrial level. Moreover, any model that is not driven by Gaussian diffusion, such as the Levy model we considered here or the linear ballistic accumulator, is out of scope with this method. In contrast, LANs can be trained to estimate any such model, limited only by the identifiability of the generative model itself. Finally, pyDDM does not afford full Bayesian estimation and thus quantification of parameter uncertainty and covariance.
We moreover note that our LAN approach can be useful even if the underlying simulation model admits other likelihood approximations, regardless of trialbytrial effect considerations, since a forward pass through a LAN may be speedier. Indeed, we observed substantial speedups in HDDM for using our LAN method to the fullDDM, for which numerical methods were previously needed to integrate over intertrial variability.
We emphasize that our test bed application to SSMs does not delimit the scope of application of LANs. Neither are reasonable architectures restricted to MLPs and CNNs (see Lueckmann et al., 2019; Papamakarios et al., 2019b for related approaches that use completely different architectures). Models with highdimensional (roughly gt_{15}) parameter spaces may present a challenge for our global amortization approach due to the curse of dimensionality. Further, models with discrete parameters of high cardinality may equally present a given network with training difficulties. In such cases, other methods may be preferred over likelihood amortization generally (e.g., Acerbi, 2020); given that this is an open and active area of research, we can expect surprising developments that may in fact turn the tide again in the near future.
Despite some constraints, this still leaves a vast array of models in reach for LANs, of which our test bed can be considered only a small beginning.
By focusing on LANs, our approach affords the flexibility of networks serving as plugins for hierarchical or arbitrarily complex model extensions. In particular, the networks can be immediately transferred, without further training, to arbitrary inference scenarios in which researchers may be interested in evaluating links between neural measures and model parameters, and to compare various assumptions about whether parameters are pooled and split across experimental manipulations. This flexibility in turn sets our methods apart from other amortization and NNbased ABC approaches offered in the statistics, machine learning, and computational neuroscience literature (Papamakarios and Murray, 2016; Papamakarios et al., 2019a; Gonçalves et al., 2020; Lueckmann et al., 2019; Radev et al., 2020b), while staying conceptually extremely simple. Instead of focusing on extremely fast inference for very specialized inference scenarios, our approach focuses on achieving speedy inference while not implicitly compromising modeling flexibility through amortization step.
Closest to our approach is the work of Lueckmann et al., 2019, and Papamakarios et al., 2019b, both of which attempt to target the likelihood with a neural density estimator. While flexible, both approaches imply the usage of summary statistics, instead of a focus on trialwise likelihood functions. Our work can be considered a simple alternative with explicit focus on trialwise likelihoods.
Besides deep learningbased approaches, another major machine learninginspired branch of the ABC literature concerns loglikelihood and posterior approximations via Gaussian process surrogates (GPSs) (Meeds and Welling, 2014; Järvenpää et al., 2018; Acerbi, 2020). A major benefit of GPSs lies in the ability for clever training data selection via active learning since such GPSs allow uncertainty quantification out of the box, which in turn can be utilized for the purpose of targeting highuncertainty regions in parameter space. GPSbased computations scale with the number of training examples, however, which make them much more suitable for minimizing the computational cost for a given inference scenario than facilitating global amortization as we suggest in this paper (for which one usually need larger sets of training data than can traditionally be handled efficiently by GPS). Again when our approach is applicable, it will offer vastly greater flexibility once a LAN is trained.
Limitations and future work
There are several limitations of the methods presented in this paper, which we hope to address in future work. While allowing for great flexibility, the MLP approach suffers from the drawback that we do not enforce (or exploit) the constraint that $\widehat{\mathrm{\ell}}(\theta \mathbf{\mathbf{x}})$ is a valid probability distribution, and hence the networks have to learn this constraint implicitly and approximately. Enforcing this constraint has the potential to improve estimation of tail probabilities (a known issue for KDE approaches to ABC more generally; Turner et al., 2015).
The CNN encapsulation exploits the fact that ${\int}_{\mathcal{X}}\widehat{\mathrm{\ell}}(\theta \mathbf{\mathbf{x}})\mathit{d}\mathbf{\mathbf{x}}=1$, however, makes estimation of trialbytrial effects more resource hungry. We plan to investigate the potential of the CNN for trialbytrial estimation in future research.
A potential solution that combines the strengths of both the CNN and MLP methods is to utilize mixture density networks to encapsulate the likelihood functions. We are currently exploring this avenue. Mixture density networks have been successfully applied in the context of ABC (Papamakarios and Murray, 2016); however, training can be unstable without extra care (Guillaumes, 2017). Similarly, invertible flows (Rezende and Mohamed, 2015) and/or mixture density networks Bishop, 1994 may be used to learn likelihood functions (Papamakarios et al., 2019b; Lueckmann et al., 2019); however, the philosophy remains focused on distributions of summary statistics for single datasets. While impressive improvements have materialized at the intersection of ABC and deep learning methods (Papamakarios et al., 2019a; Greenberg et al., 2019; Gonçalves et al., 2020) (showing some success with posterior amortization for models up to 30 parameters, but restricted to a local region of high posterior density in the resulting parameter space), generally less attention has been paid to amortization methods that are not only of casespecific efficiency but sufficiently modular to serve a large variety of inference scenarios (e.g., Figure 10). This is an important gap that we believe the popularization of the powerful ABC framework in the domain of experimental science hinges upon. A second and shortterm avenue for future work is the incorporation of our presented methods into the HDDM Python toolbox (Wiecki et al., 2013) to extend its capabilities to a larger variety of SSMs. Initial work in this direction has been completed, the alpha version of the extension being available in the form of a tutorial at https://github.com/lnccbrown/lans/tree/master/hddmnn_tutorial.
Our current training pipeline can be further optimized on two fronts. First, no attempt was made to minimize the size of the network needed to reliably approximate likelihood functions so as to further improve computational speed. Second, little attempt was made to optimize the amount of training provided to networks. For the models explored here, we found it sufficient to simply train the networks for a very large number of simulated datapoints such that interpolation across the manifold was possible. However, as model complexity increases, it would be useful to obtain a measure of the networks’ uncertainty over likelihood estimates for any given parameter vector. Such uncertainty estimates would be beneficial for multiple reasons. One such benefit would be to provide a handle on the reliability of sampling, given the parameter region. Moreover, such uncertainty estimates could be used to guide the online generation of training data to train the networks in regions with high uncertainty. At the intersection of ABC and NNs, active learning has been explored via uncertainty estimates based on network ensembles (Lueckmann et al., 2019). We plan to additionally explore the use of Bayesian NNs, which provide uncertainty over their weights, for this purpose (Neal, 1995).
One more general shortcoming of our methods is the reliance on empirical likelihoods for training, which in turn are based on a fixed number of samples across parameter vectors, just as the PDA method proposed by Turner et al., 2015. Recently, this approach has been criticized fundamentally on grounds of producing bias in the generated KDEbased likelihood estimates (van Opheusden et al., 2020). A reduction of the approximate likelihood problem to one of inverse binomial sampling was proposed (van Opheusden et al., 2020), which will generate unbiased likelihood estimates. To address this concern, we will investigate adaptive strategies for the selection of the simulations count $n$. We however highlight two points here that aim to put the promise of unbiased likelihoods in perspective. First, our networks add interpolation to the actual estimation of a likelihood. Likelihoods close in parameter space therefore share information that translates into an effectively higher simulation count than the 100k chosen to construct each empirical likelihood used for training. Quantifying this benefit precisely we leave for future research; however, we suspect, as suggested by Figure 4, that it may be substantial. Second, while we generally acknowledge that bias in the tails remains somewhat of an issue in our approach, resolution is at best partial even in the proposed methods of van Opheusden et al., 2020. For the estimation of parameters for which a given datapoint is extremely unlikely (i.e., the data is generally unlikely under the model), the authors suggest to threshold the simulation count so that their algorithm is guaranteed to stop. This effectively amounts to explicitly allowing for bias again. As another alternative, the authors suggest to introduce a lapse rate in the generative model, which the LAN approach can accommodate as well. However, the introduction of a lapse rate does not deal with tail events directly either, but rather assumes that tail events are unrelated to the process of interest. This in turn will render a lower but fixed number of simulations $N$ feasible for training LANs as well. This is notwithstanding the desirable minimization of simulation times even for high likelihood events, especially when trialwise simulations are in fact necessary (which tends to be in cases where amortization with LANs is a priori not a good computational strategy to begin with). Hence, although the inverse binomial sampling approach is elegant conceptually, excessive computation remains an issue when we need accurate estimates of the probability of actual tail events. Generally, however, we maintain it is desirable and important for future work to make use of the otherwise great potential of adaptive sampling to minimize total computational cost.
Furthermore, we relegate to future research proper exploitation of the fact that LANs are by design differentiable in the parameters. We are currently working on an integration of LANs with Tensorflow probability (Abadi et al., 2016), utilizing autograd to switch our MCMC method to the gradientbased NUTS sampler (Hoffman and Gelman, 2014). The main benefits of this sampler are robust mixing behavior, tolerance for high levels of correlations in the parameter space, while at the same time maintaining the ability to sample from highdimensional posteriors. High level of correlations in posteriors is traditionally an Achilles' heel of the otherwise robust coordinatewise slice samplers. DEMCMC and iterated Importance samplers are somewhat more robust in this regards; however, both may not scale efficiently to highdimensional problems. Robustness concerns aside, initial numerical results additionally show some promising further speedups.
Another important branch for future work lies in the utilization of LANs for model comparison. Initial results are promising in that we obtained satisfactory model recovery using the deviance information criterion (DIC) used for model selection in the standard HDDM package. However, this issue demands much more attention to evaluate other model selection metrics and extensive further numerical experiments, which we relegate to future work.
Lastly, in contrast to the importance sampler driving the posterior inference for the CNN, we believe that some of the performance deficiencies of the MLP are the result of our MCs not having converged to the target distribution. A common problem seems to be that the sampler hits the bounds of the constrained parameter space and does not recover from that. As shown in Figures 7 and 8, even ostensibly bad parameter recoveries follow a conceptual coherence and lead to good posterior predictive performance. We therefore may be underreporting the performance of the MLP and plan to test the method on an even more comprehensive suite of MCMC samplers, moreover including thus far neglected potential for reparameterization.
Materials and methods
Test beds
General information
Request a detailed protocolAll models were simulated using the Euler–Maruyama method, which for some fixed discretization step size $\mathrm{\Delta}t$ evolves the process as
where the definition of $\mathrm{\Delta}\mathbf{\mathbf{B}}$ depends on the noise process. For simple Brownian motion, this translates into Gaussian displacements, specifically $\mathrm{\Delta}\mathbf{\mathbf{B}}\sim \mathcal{N}(0,\mathrm{\Delta}t)$, which is commonly denoted as $d\mathbf{\mathbf{W}}$. More generally, the noise need not be Gaussian, and indeed we later apply our methods to the Levy flight model for which the noise process is an alpha stable distribution, denoted as ${\mathbf{\mathbf{L}}}_{\alpha}$ s.t. $\mathrm{\Delta}{\mathbf{\mathbf{L}}}_{\alpha}\sim {(\mathrm{\Delta}t)}^{\frac{1}{\alpha}}\mathbf{\mathbf{L}}(\alpha ,0,1,0)$.
The models chosen for our test bed systematically vary different aspects of complexity, as illustrated in Figure 3. The DDM provides a benchmark and a sanity check since we can compute its likelihood analytically. The fullDDM provides us with a model for which analytical computations are still based on the analytical likelihood of the DDM; however, evaluation is slowed by the necessity for numerical integration. This forms a first test for the speed of evaluation of our methods. For the Ornstein–Uhlenbeck, Levy, Race, and DDM with parameterized boundary models, we cannot base our calculations on an analytical likelihood, but we can nevertheless perform parameter recovery and compare to other methods that utilize empirical likelihoods. The Ornstein–Uhlenbeck model adds statedependent behavior to the diffusion while the Levy model adds variation in the noise process and the Race models expand the output dimensions according to the number of choices.
FullDDM
Request a detailed protocolThe fullDDM maintains the same specification for the driving SDE, but also allows for trialtotrial variability in three parameters (Ratcliff and McKoon, 2008). We allow the drift rate $v$ to vary trial by trial; according to a normal distribution, $v\sim \mathcal{N}(0,{\sigma}_{v})$, the nondecision time τ to vary according to a uniform distribution $\tau \sim \mathbf{\mathbf{U}}[{\u03f5}_{\tau},{\u03f5}_{\tau}]$ and the starting point $w$ to vary according to a uniform distribution as well $w\sim \mathbf{\mathbf{U}}[{\u03f5}_{w},{\u03f5}_{w}]$. The parameter vector for the fullDDM is then $\theta =(v,a,w,\tau ,{\sigma}_{v},{\u03f5}_{\tau},{\u03f5}_{w})$.
To calculate the FPTD for this model, we can use the analytical likelihood expression from the DDM. However, we need to use numerical integration to take into account the random parameters (Wiecki et al., 2013). This inflates execution time by a factor equivalent to the number of executions needed to compute the numerical integral.
Ornstein–Uhlenbeck model
Request a detailed protocolThe Ornstein–Uhlenbeck model introduces a state dependency on the drift rate $v$. Here, $a(t,x)=v+g*x$, where $g$ is an inhibition/excitation parameter. If $g<0$, it acts as a leak (the particle is mean reverting). If $g>0$, the particle accelerates away from the 0 state, as in an attractor model. At $g=0$, we recover the simple DDM process. This leaves us with a parameter vector $\theta =(v,a,w,\tau ,g)$. The corresponding SDE is defined as
This model does not have an analytical likelihood function that can be employed for cheap inference (Mullowney and Iyengar, 2006). We discuss alternatives, other than our proposed methods, to simple analytical likelihoods later. For our purposes, approximate inference is necessary for this model. The Ornstein–Uhlenbeck model is usually defined only for $g<0$; our parameter space makes it strictly speaking a relaxation.
Levy flights
Request a detailed protocolThe Levy flight (Wieschen et al., 2020; Reynolds and Rhodes, 2009) model dispenses with the Gaussian noise assumption in that the incremental noise process instead follows an alphastable distribution ${\mathcal{L}}_{\alpha}$. Specifically, we consider distributions $\mathcal{L}(\alpha ,0,1,0)$ that are centered at 0, symmetric, and have unitary scale parameter. These distributions have a first moment for $\alpha \in (1,2]$, but infinite variance for $\alpha <2$. An important special case is $\alpha =2$, where $\mathcal{L}(2,0,1,0)=\mathcal{N}(0,2)$. The parameter vector for this process is $\theta =(v,a,w,\tau ,\alpha )$. We fix $a(t,x)=v$ and $b(t,x)=1$. The SDE is defined as
The Levy flight is a flexible model used across disciplines for some of its theoretical optimality properties (Wosniack et al., 2017) despite not possessing closedform FPTDs. We add it here as it is different from the other models under consideration; in principle, it could also capture decisionmaking scenarios in which there are sudden jumps in the accumulation of evidence (e.g., due to internal changes in attention). Its behavior is shaped by altering the properties of the incremental noise process directly.
Parameterized collapsing decision bounds
Request a detailed protocolWe will consider variations of the DDM in which the decision boundary is not fixed but is timevarying (represented by a boundary parameter $a$ with a parameterized boundary function $h(t;{\theta}_{h})$). In such cases, we augment the parameter vector θ with the set ${\theta}_{h}$ and drop $a$. Such variations are optimal in a variety of settings (e.g., when there are response deadlines, Frazier and Angela, 2008, or distributions of trial types with different difficulties, Malhotra et al., 2018; Palestro et al., 2018), and also better reflect the underlying dynamics of decision bounds within biologically inspired neural models (O'Reilly and Frank, 2006; Ratcliff and Frank, 2012; Wiecki and Frank, 2013). The boundary functions considered in the following are the Weibull bound (Weibull),
and the linear collapse bound (LC),
Race models: N > 2
Request a detailed protocolThe Race model departs from previous model formulations in that it has a particle for each of $N$ choice options instead of a single particle representing the evidence for one option over another. The function ${f}_{{E}_{i}}(t,\theta )$ now represents the probability of particle $i$ to be the first of all particle to cross the bound $a$ at time $t$. We consider race models for which the drift and starting point can vary for each particle separately. Treating the boundary as a constant $a$ leaves us with a parameter vector $\theta =({v}_{1},...,{v}_{n},a,{w}_{1},...,{w}_{n},ndt)$. The SDE is defined for each particle separately (or in vector form) as
These models represent the most straightforward extension to a multichoice scenario.
Multilayered perceptron
Network specifics
Request a detailed protocolWe apply the same simple architecture consistently across all example contexts in this paper. Our networks have three hidden layers, $\{{L}_{1},{L}_{2},{L}_{3}\}$, of sizes $\{100,100,120\}$, each using $tanh(.)$ activation functions. The output layer consists of a single node with linear activation function.
Training process
Request a detailed protocolTraining hyperparameters
Request a detailed protocolThe network is trained via stochastic backpropagation using the Adam (Kingma and Ba, 2014) optimization algorithm. As a loss function, we utilize the Huber loss (Huber, 1992) defined as
Training data
Request a detailed protocolWe used the following approach to generate training data across all examples shown below.
First, we generate 100K simulations from the stochastic simulator (or model $\mathcal{\mathcal{M}}$) for each of 1.5 M parameter configurations. Since for the examples we consider the stochasticity underlying the models are in the form of a SDE, all simulations were conducted using the simple Euler–Maruyama method with timesteps $\delta t$ of 0.001 s. The maximum time we allowed the algorithms to run was 20 s, much more than necessary for a normal application of the simulator models under consideration.
Based on these simulations, we then generate empirical likelihood functions using KDEs (Turner et al., 2015). KDEs use atomic datapoints $\{{x}_{0},\mathrm{\dots},{x}_{N}\}$ and reformulate them into a continuous probability distribution $f(y;\mathbf{\mathbf{x}})={\sum}_{i}^{N}K(\frac{yx}{h})$, where we choose $K(.)$ as a standard Gaussian kernel $f(x)=\frac{1}{\sqrt{2\pi}}\mathrm{exp}\frac{{x}^{2}}{2}$, and $h$, the socalled bandwidth parameter, is set by utilizing Silverman’s rule of thumb (Silverman, 1986). Where the data made Silverman’s rule inapplicable, we set a lower bound on h as 10^{−3}. Additionally, we follow Charpentier and Flachaire, 2015 in transforming our KDE to accommodate positive random variables with skewed distributions (in adherence to the properties of data resulting from the response time models forming our examples).
To ensure that the networks accurately learn likelihoods across a range of plausible data, for each parameter set we trained the networks by sampling 1000 datapoints from a mixture distribution with three components (mixture probabilities respectively $\{0.8,0.1,0.1\}$). The first component draws samples directly from the KDE distributions. The second component is uniform on $[0s,20s]$, and the third component samples uniformly on $[1s,0s]$. The aim of this mixture is to allow the network to see, for each parameter setting of the stochastic simulator, training examples of three kinds: (1) ‘Where it matters’, that is, where the bulk of the probability mass is given in the generative model. (2) Regions of low probability to inform the likelihood estimate in those regions (i.e., to prevent distortion of likelihood estimates for datapoints that are unlikely to be generated under the model). (3) Examples on the negative real line to ensure that it is learned to consistently drive likelihood predictions to 0 for datapoints close to 0.
The supervision signal for training has two components. For positive datapoints (reaction times in our examples), we evaluate the loglikelihood according to our KDE. Likelihoods of negative datapoints were set to an arbitrary low value of 10^{−29} (a loglikelihood of −66.79). 10^{−29} also served as the lower bounds on likelihood evaluations. While this constrains our accuracy on the very tails of distributions, extremely low evaluations unduly affect the training procedure. Since the generation of training data can easily be parallelized across machines, we simply frontloaded the data generation accordingly. We refer back to Figure 2 for a conceptual overview.
This procedure yields 1.5B labeled training examples on which we train the network. We applied early stopping upon a lack of loss improvement for more than five epochs of training. All models were implemented using Tensorflow (Abadi et al., 2016).
We note here that this amount of training examples is likely an overshoot by potentially one or more orders of magnitude. We did not systematically test for the minimum amount of training examples needed to train the networks. Minimal experiments we ran showed that roughly onetenth of the training examples lead to very much equivalent training results. Systematic minimization of the training data is left for future numerical experiments since we do not deem it essential for purposes of a proof of concept.
Sampling algorithms
Request a detailed protocolOnce trained, we can now run standard MCMC schemes, where, instead of an analytical likelihood, we evaluate ${f}_{w}(\mathbf{\mathbf{x}},\theta )$ as a forward pass through the MLP. Figure 1B schematically illustrates this approach (following the green arrows) and contrasts with currently applied methods (red arrows). We report multiple soconducted parameter recovery experiments in the 'Results' section and validate the approach first with models with known analytical likelihood functions.
Regarding sampling, we utilized two MCMC algorithms, which showed generally very similar results. In contrast to the importance sampling algorithm used for the CNN (described below), MCMC methods are known for having trouble with multimodal posteriors. Running our experiments across algorithms was a safeguard against incorporating samplerspecific deficiencies into our analysis. We however acknowledge that even more extensive experiments may be necessary for comprehensive guarantees. First, having an ultimate implementation of our method into the HDDM Python toolbox (Wiecki et al., 2013) in view, we use slice sampling (as used by the toolbox), specifically the stepout procedure following Neal, 2003. Second, we used a custom implementation of the DEMCMC algorithm (Braak, 2006), known for being robust in higher dimensional parameter spaces. Our DEMCMC implementation adds reflecting boundaries to counteract problematic behavior when the sampler attempts to move beyond the parameter space, which is truncated by the (broad) range of parameters in which the MLP was trained. The number of chains we use is consistently determined as $5*\theta $, five times the number of parameters of a given stochastic model. Samplers were initialized by using slight perturbations of five maximum likelihood estimates and computed via differential evolution optimization (Storn and Price, 1997; Virtanen et al., 2020). Since results were very similar across samplers, we restrict ourselves mostly to reporting results derived from the slice sampler, given that this sampler forms the backend of the HDDM user interface we envision. Implementations of the MLP method, the samplers we used, as well as the training pipeline can be found at https://github.com/lnccbrown/lans/tree/master/almlp (copy archived at swh:1:rev:e3369b9df138c75d0e490be0c48c53ded3e3a1d6); Fengler, 2021.
Additional notes
Request a detailed protocolNote that we restricted parameter recovery for the MLP to datasets that distributed at least 5% of choices to the less frequently chosen option. This modest filtering accommodates the fact that such datasets were also excluded form the training data for the MLP model since they (1) present difficulties for the KDE estimator and (2) lead to generally less stable parameter estimates (i.e., it is not advisable to use diffusion models when choices are deterministic).
Convolutional neural network
Network specifics
Request a detailed protocolThe CNN takes as an input a parameter vector θ, giving as output a discrete probability distribution over the relevant dataspace. In the context of our examples below, the output space is of dimensions ${\mathcal{R}}^{{N}_{c}}\times {\mathcal{R}}^{{N}_{d}}$, where ${N}_{c}$ is the number of relevant choice alternatives, and ${N}_{d}$ is the number of bins for the reaction time for each choice (${N}_{d}=512$ for all examples below). The network architecture consists of a sequence of three fully connected upsampling layers, $\{{L}_{1}^{FC},{L}_{2}^{FC},{L}_{3}^{FC}\}$, of respectively $\{64,256,1024\}$ nodes. These are followed by a sequence of three convolutional layers $\{{L}_{1}^{C},{L}_{2}^{C},{L}_{3}^{C}\}$ with $1\times 5$ kernels, and a final fully connected layer with softmax activation. The network size was not minimized through architecture search, which, along with other potential further speed improvements, we leave for future research.
Training process
Request a detailed protocolFor the CNN, we use 100K simulations from the stochastic simulator for each of 3 M parameter vectors and bin the simulation outcomes as normalized counts into ${\mathcal{R}}^{{N}_{c}}\times {\mathcal{R}}^{{N}_{d}}$ slots respectively (looking ahead to our examples, ${N}_{c}$ concerns the number of choice outcomes, and ${N}_{d}$ the number of bins into which the reaction time outcomes are split for a given simulator). The resultant relative frequency histograms (empirical likelihood functions) ${\mathrm{\ell}}_{empirical}(\theta \mathbf{\mathbf{x}})\forall \theta \in \mathrm{\Theta},\forall x\in \mathcal{X}$, then serve as the target labels during training, with the corresponding parameters θ serving as feature vectors. For a given parameter vector θ, the CNN gives out a histogram $\widehat{{\mathrm{\ell}}_{\varphi}}(\theta \mathbf{\mathbf{x}})$, where $\varphi $ are the network parameters. The network is then trained by minimizing the KL divergence between observed and generated histograms
Training $\mathcal{D}(\widehat{\mathrm{\ell}}(\theta \mathbf{\mathbf{x}})\parallel {\mathrm{\ell}}_{empirical}(\theta \mathbf{\mathbf{x}}))$ is not the only option. We note that it would have been a valid choice to train on $\mathcal{D}({\mathrm{\ell}}_{empirical}(\theta \mathbf{\mathbf{x}})\parallel \widehat{\mathrm{\ell}}(\theta \mathbf{\mathbf{x}}))$ (Minka, 2013) or the symmetrized Kullback–Leibler divergence instead. Training results, however, were good enough for our present purposes to leave a precise performance comparison across those loss functions for future research, leaving room for further improvements.
As for the MLP, we use the Adam optimizer (Kingma and Ba, 2014) and implemented the network in Tensorflow (Abadi et al., 2016).
Sampling algorithm
Request a detailed protocolOne benefit of using the CNN lies in the enhanced potential for parallel processing across large number of parameter configurations and datapoints. To fully exploit this capability, instead of running a (sequential) MCMC algorithm for our parameter recovery studies, we use iterated importance sampling, which can be done in parallel. Specifically, we use adaptive importance sampling based on mixtures of tdistributions, following a slightly adjusted version of the suggestions in Cappé et al., 2008; Wraith et al., 2009.
While importance sampling is well established, for clarity and the setting in which we apply it, we explain some of the details here. Importance sampling algorithms are driven by the basic equality
which holds for any pair of probability distributions such that $g(\theta )>0$, where $f(\theta )>0$. $f(\theta )$ is our posterior distribution, and $g(\theta )$ is the proposal distribution. We now sample $N$ tuples θ according to $g(\theta )$ and assign each ${\theta}_{i}$ an importance weight ${w}_{i}=\frac{f({\theta}_{\mathbf{\mathbf{i}}})}{g({\theta}_{\mathbf{\mathbf{i}}})}$.
To get samples from the posterior distribution, we sample with replacement the θ from the set $\{{\theta}_{0},\mathrm{\dots},{\theta}_{n}\}$, with probabilities assigned as the normalized weights $\{\stackrel{~}{{w}_{0}},...,\stackrel{~}{{w}_{n}}\}$. We note that importance sampling is exact for $N\to \mathrm{\infty}$. However for finite $N$, the performance is strongly dependent on the quality of the proposal distribution $g(.)$. A bad match of $f(.)$ and $g(.)$ leads to high variance in the importance weights, which drives down performance of the algorithm, as commonly measured by the effective sample size (Liu, 2008)
Iterated importance sampling uses consecutive importance sampling rounds to improve the proposal distribution $g(.)$. A final importance sampling round is used to get the importance sample we use as our posterior sample. Specifically, we start with a mixture of tdistributions ${g}_{0}(.)$, where $\mathcal{M}$ is the number of mixture components. Each component of ${g}_{0}(.)$ is centered at the MAP according to a optimization run (again we used differential evolution). The componentcovariance matrix is estimated by a numerical approximation of the Hessian at the respective MAP. Each round $i$, based on the importance sample ${\{\mathbf{\mathbf{x}},\mathbf{\mathbf{w}}\}}_{i}$, we update the proposal distribution (to a new mixture of tdistributions) using the update equations derived in Cappé et al., 2008.
As suggested by Cappé et al., 2008, convergence is assessed using the normalized perplexity statistic (the exponentiated Shannon entropy of the importance weights). For run $i$, this is computed as ${\mathrm{exp}}^{\frac{{H}^{k,N}}{/N}}$, where ${H}^{k,N}={\sum}_{i=1}^{N}{\overline{w}}_{k,i}\mathrm{log}{\overline{w}}_{k,i}$.
To help convergence, we depart from the basic setup suggested in Cappé et al., 2008 in the following way. We apply an annealing factor $\gamma}_{k}=max{2}^{zk},1\text{}z\in \{1,2,4,...,\$, so that for iteration $k$ of the importance sampler we are operating on the target $f{(x)}^{\frac{1}{{\gamma}_{k}}}$. Smoothing the target during the first iterations helps with successfully adjusting the proposal distribution $g(.)$. Figure 2 visualizes the CNN approach. Again, we emphasize that more numerical experiments using a larger variety of sampling algorithms are desirable, but are out of the scope for this paper. Implementations of the CNN method, the samplers we used, as well as the training pipeline can be found at https://github.com/lnccbrown/lans/tree/master/alcnn.
Strengths and weaknesses
Request a detailed protocolIn this section, we clarify a few strengths and weaknesses of the two presented methods and their respective use cases. First, representing the likelihood function datapointwise as an MLP output, or globally via the CNN output histogram, affects the potential for parallelization. As exploited by the choice of sampler, the CNN is very amenable to parallelization across parameters since inputs are parameter tuples only. Since the output is represented as a global likelihood histogram, the dataset likelihood is computed as the summation of the elementwise multiplied of binloglikelihoods, with a correspondingly binned dataset (counts over bins). This has the highly desirable property of making evaluation cost (time) independent of dataset size. While the MLP in principle allows parallel processing of inputs, the datapointwise representation of input values ($\{\theta ,x\}$) makes the potential for crossparameter parallelization dependent on dataset sizes. While a single evaluation of the CNN is more costly, crossparameter batch processing can make it preferable to the MLP. Second, the CNN has an advantage during training, where the representation of the output as a softmax layer, and corresponding training via minimization of the KL divergence, provides a more robust training signal to ensure probability distributions compared to the purely local one in which the MLP learns a scalar likelihood output as a simple regression problem. Third, and conversely, the MLP formulation is more natural for trialwise parameter estimates since the histogram representations may be redundant in case datapoints are in fact evaluated one by one (given datapointwise parameters induced by trialbytrial effects on parameter vectors). Give equivalent success in learning likelihoods, we see potential for speedup when using the pointwise approach in this case. In principle, both approaches however allow one to estimate the impact of trialwise regressors on model parameters during inference, without further training. It is, for example, common in the cognitive neuroscience literature to allow the crosstrial timecourse of EEG, fMRI, or spike signals to be modeled as a trialbytrial regressor on model parameters of, for example, DDMs (Wiecki et al., 2013; Frank et al., 2015; Cavanagh et al., 2011; Herz et al., 2016; Pedersen and Frank, 2020). Another relevant example is the incorporation of latent learning dynamics. If a subject’s choice behavior is driven by reinforcement learning across stimuli, we can translate this into trialbytrial effects on the parameter vectors of a generative process model (Pedersen and Frank, 2020). These applications are implicitly enabled at no extra cost with the MLP method, while the trialbytrial split multiplies the necessary computations for the CNN by the number $N$ of datapoints when compared to scenarios that only need datasetwise parameters. We stress again, however, that in general both the CNN and the MLP can directly be used for hierarchical inference scenarios. The preceding discussion pertains to further potential for optimization and relative strengths, not categorical potential for application to a given scenario. With respect to the latter, both methods are essentially equal.
Appendix 1
Parameter recovery
Here, we provide additional figures concerning parameter recovery studies. Appendix 1—table 1 summarizes the parameterwise ${R}^{2}$ between ground truth and the posterior mean estimates for each tested model and for each the CNN and MLP (where applicable) methods in turn. For the MLP, results are based on a reference run that used training data constructed from KDE empirical likelihoods utilizing 100k simulations each, and a slice sampler stopped with help of the Geweke diagnostic. Results in the paper are based on slice samplers as well as slice samplers, which explains why not all ${R}^{2}$ values match exactly the ones found in other figures. Our findings were, however, generally robust across samplers.
Manifolds/likelihoods
We show some examples of the likelihood manifolds for the various models that we tested.
DDMSDV
Linear collapse
Weibull
Levy
Ornstein
FullDDM
Data availability
All code is provided freely and is available at the following links: https://github.com/lnccbrown/lans/tree/master/hddmnn_tutorial, https://github.com/lnccbrown/lans/tree/master/almlp and https://github.com/lnccbrown/lans/tree/master/alcnn (copy archived at https://archive.softwareheritage.org/swh:1:rev:e3369b9df138c75d0e490be0c48c53ded3e3a1d6).
References

ConferenceTensorflow: a system for largescale machine learning12th USENIX Symposium on Operating SystemsDesign and Implementation (OSDI 16). pp. 265–283.

ConferenceVariational bayesian monte carlo with noisy likelihoodsAdvances in Neural Information Processing Systems.

Approximate bayesian computation for forward modeling in cosmologyJournal of Cosmology and Astroparticle Physics 2015:043.https://doi.org/10.1088/14757516/2015/08/043

Cython: the best of both worldsComputing in Science & Engineering 13:31–39.https://doi.org/10.1109/MCSE.2010.118

Variational inference: a review for statisticiansJournal of the American Statistical Association 112:859–877.https://doi.org/10.1080/01621459.2017.1285773

Estimating acrosstrial variability parameters of the diffusion decision model: expert advice and recommendationsJournal of Mathematical Psychology 87:46–75.https://doi.org/10.1016/j.jmp.2018.09.004

Adaptive importance sampling in general mixture classesStatistics and Computing 18:447–459.https://doi.org/10.1007/s112220089059x

Subthalamic nucleus stimulation reverses mediofrontal influence over decision thresholdNature Neuroscience 14:1462–1467.https://doi.org/10.1038/nn.2925

Logtransform kernel density estimation of income distributionL’actualité Économique 91:141–159.

Decisions in changing conditions: the urgencygating modelJournal of Neuroscience 29:11560–11571.https://doi.org/10.1523/JNEUROSCI.184409.2009

ConferenceTrialbytrial data analysis using computational modelsDecision Making, Affect, and Learning: Attention and Performance XXIII.

The markov chain monte carlo revolutionBulletin of the American Mathematical Society 46:179–205.https://doi.org/10.1090/S027309790801238X

fMRI and EEG predictors of dynamic decision parameters during human reinforcement learningThe Journal of Neuroscience 35:485–494.https://doi.org/10.1523/JNEUROSCI.203614.2015

ConferenceSequential hypothesis testing under stochastic deadlinesAdvances in Neural Information Processing Systems. pp. 465–472.

Inference from iterative simulation using multiple sequencesStatistical Science 7:457–472.https://doi.org/10.1214/ss/1177011136

Evaluating the accuracy of samplingbased approaches to the calculations of posterior momentsBayesian Statistics 4:641–649.

ConferenceAutomatic posterior transformation for LikelihoodFree inferenceInternational Conference on Machine Learning, PMLR. pp. 2404–2414.

ThesisMixture density networks for distribution and uncertainty estimationUniversitat Politècnica de Catalunya. Facultat d’Informàtica de Barcelona, PhD thesis.

Sloppy models and parameter indeterminancy in systems biologyPLOS Computational Biology 3:30189.https://doi.org/10.1371/journal.pcbi.0030189

Likelihoodfree inference via classificationStatistics and Computing 28:411–425.https://doi.org/10.1007/s1122201797386

Dynamic models of choiceBehavior Research Methods 51:961–985.https://doi.org/10.3758/s134280181067y

The nouturn sampler: adaptively setting path lengths in hamiltonian monte carloJournal of Machine Learning Research : JMLR 15:1593–1623.

A practical guide to the probability density approximation (PDA) with improved implementation and error characterizationJournal of Mathematical Psychology 6869:13–24.https://doi.org/10.1016/j.jmp.2015.08.006

BookRobust estimation of a location parameterIn: Kotz S, Johnson N. L, editors. Breakthroughs in Statistics. Springer. pp. 492–518.https://doi.org/10.1007/9781461243809_35

Computational psychiatry as a bridge from neuroscience to clinical applicationsNature Neuroscience 19:404–413.https://doi.org/10.1038/nn.4238

Gaussian process modelling in approximate bayesian computation to estimate horizontal gene transfer in BacteriaThe Annals of Applied Statistics 12:2228–2251.https://doi.org/10.1214/18AOAS1150

BookMonte Carlo Strategies in Scientific ComputingSpringer Science & Business Media.https://doi.org/10.1007/9780387763712

ConferenceLikelihoodfree inference with emulator networksSymposium on Advances in Approximate Bayesian Inference, PMLR. pp. 32–53.

Timevarying decision boundaries: insights from optimality analysisPsychonomic Bulletin & Review 25:971–996.https://doi.org/10.3758/s1342301713406

Prepaid parameter estimation without likelihoodsPLOS Computational Biology 15:e1007181.https://doi.org/10.1371/journal.pcbi.1007181

ReportMaximum Likelihood Estimation and Computation for the OrnsteinUhlenbeck Processstat.pitt.

Fast and accurate calculations for firstpassage times in Wiener Diffusion modelsJournal of Mathematical Psychology 53:222–230.https://doi.org/10.1016/j.jmp.2009.02.003

Hierarchical bayesian parameter estimation for cumulative prospect theoryJournal of Mathematical Psychology 55:84–93.https://doi.org/10.1016/j.jmp.2010.08.006

Some task demands induce collapsing bounds: evidence from a behavioral analysisPsychonomic Bulletin & Review 25:1225–1248.https://doi.org/10.3758/s1342301814799

ConferenceSequential neural likelihood: fast likelihoodfree inference with autoregressive flows PMLRThe 22nd International Conference on Artificial Intelligence and Statistics. pp. 837–848.

ConferenceFast εfree inference of simulation models with bayesian conditional density estimationAdvances in Neural Information Processing Systems. pp. 1028–1036.

ConferenceBayesFlow: learning complex stochastic models with invertible neural networksIEEE Transactions on Neural Networks and Learning Systems. pp. 1–15.https://doi.org/10.1109/TNNLS.2020.3042395

Towards endtoend likelihoodfree inference with convolutional neural networksBritish Journal of Mathematical and Statistical Psychology 73:23–43.https://doi.org/10.1111/bmsp.12159

A framework for studying the neurobiology of valuebased decision makingNature Reviews Neuroscience 9:545–556.https://doi.org/10.1038/nrn2357

A theory of memory retrievalPsychological Review 85:59–108.https://doi.org/10.1037/0033295X.85.2.59

ConferenceVariational inference with normalizing flowsInternational Conference on Machine Learning, PMLR. pp. 1530–1538.

BookMonte Carlo Statistical MethodsSpringer Science & Business Media.https://doi.org/10.1007/9781475741452

Differential evolution–a simple and efficient heuristic for global optimization over continuous spacesJournal of Global Optimization 11:341–359.https://doi.org/10.1023/A:1008202821328

Informing cognitive abstractions through neuroimaging: the neural drift diffusion modelPsychological Review 122:312–336.https://doi.org/10.1037/a0038894

A generalized, likelihoodfree method for posterior estimationPsychonomic Bulletin & Review 21:227–250.https://doi.org/10.3758/s1342301305300

Approximating bayesian inference through model simulationTrends in Cognitive Sciences 22:826–840.https://doi.org/10.1016/j.tics.2018.06.003

The time course of perceptual choice: the leaky, competing accumulator modelPsychological Review 108:550–592.https://doi.org/10.1037/0033295X.108.3.550

Unbiased and efficient loglikelihood estimation with inverse binomial samplingPLOS Computational Biology 16:e1008483.https://doi.org/10.1371/journal.pcbi.1008483

Hierarchical diffusion models for twochoice response timesPsychological Methods 16:44–62.https://doi.org/10.1037/a0021765

HDDM: hierarchical bayesian estimation of the DriftDiffusion model in PythonFrontiers in Neuroinformatics 7:14.https://doi.org/10.3389/fninf.2013.00014

A computational model of inhibitory control in frontal cortex and basal gangliaPsychological Review 120:329–355.https://doi.org/10.1037/a0031542

Jumping to conclusion? A lévy flight model of decision makingThe Quantitative Methods for Psychology 16:120–132.https://doi.org/10.20982/tqmp.16.2.p120

The evolutionary origins of lévy walk foragingPLOS Computational Biology 13:e1005774.https://doi.org/10.1371/journal.pcbi.1005774
Decision letter

Valentin WyartReviewing Editor; École normale supérieure, PSL University, INSERM, France

Timothy E BehrensSenior Editor; University of Oxford, United Kingdom

Luigi AcerbiReviewer

Jean DaunizeauReviewer; INSERM, France
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
Your work addresses a very timely and important issue in the field: the restricted availability of candidate computational models that can be tested as explanations of brain or behavioral data. Your approach, which uses artificial neural networks for speeding up inference, helps achieving significantly faster solutions for fitting computational models to data, with potential impact for a broad research community. The ability to decouple the slow training of neural networks to learn the likelihood function of a given model from the fast use of previously trained neural networks to fit data is particularly appealing. We congratulate you for this work which should stimulate the exploration of previously untested computational models of brain activity and behavior.
Decision letter after peer review:
Thank you for submitting your article "Likelihood Approximation Networks (LANs) for Fast Inference of Simulation Models in Cognitive Neuroscience" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by Valentin Wyart as the Reviewing Editor and Timothy Behrens as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Luigi Acerbi (Reviewer #1); Jean Daunizeau (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
We would like to draw your attention to changes in our revision policy that we have made in response to COVID19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.
This manuscript describes a novel method based on artificial neural networks to approximate the trial likelihood function of a model whose likelihood is difficult or impossible to evaluate analytically. The authors demonstrate their method on a class of generalized driftdiffusion models, performing Bayesian inference and various forms of hierarchical inference, with good empirical results. The authors discuss the appeal of their method for broadening the family of models that can be employed in computational cognitive neuroscience.
Both reviewers have found your manuscript to be addressing a very timely and important issue in the field, namely: the restricted availability of candidate computational models that can be tested as explanations of observed data. The described work fits very well with recent work in the area which uses (deep) neural networks for speeding up inference. Both reviewers have found that the methods proposed in the manuscript can indeed help achieve faster solutions when fitting computational models to data, with potential impact for the scientific community. Both reviewers felt that the manuscript deserves publication in eLife given that the following points outlined below are addressed to improve the clarity of the presentation and also the limits of the proposed method.
Essential revisions:
1. Clarifications:
For most of the manuscript, the authors make a distinction of "analytical likelihood" vs. "no likelihood" (where one needs to do simulationbased inference). However, there is a middleground of models for which the likelihood is not analytical but can be computed numerically. These models constitute a superset of the analytical models. The authors touch upon this point only in the Discussion (when talking about Shinn et al. 2020 and generalized DDMs), but this point and the reference should come up much earlier as it is quite important. There are still good reasons for wanting to do amortized inference as opposed to using numerical likelihoods (the latter can be very expensive, and they don't exist for all DDMs), but the readers should be made aware of these alternative options – even beyond the DDM example.
Mathematical notations are problematic. First, most notations are not described when they first appear, and the reader often has to guess what the variables are. For example, the KDE notation is first presented when introducing the ABC approach (section 2), but one has to read the Additional Methods (section 10) to find out what they really are. Second, mathematical notations are not fully consistent throughout the paper. For example, the parameters are called either P or theta, the data is sometimes called D and sometimes X, whereas X also sometimes denotes the DDM's accumulated evidence. The variable W sometimes means the ANN's weights, sometimes the DDM's Wiener noise, sometimes the DDM's initial condition. The authors should make sure that notations are selfcontained and consistent throughout the manuscript.
Another concern regarding mathematical notations is that the typical notation for modelling likelihoods in cognitive science is p(xs,theta), where x is the observation/response, s a vector describing the "stimulus" in the current trial, and theta the model parameters. In this paper, the authors merge s into theta, which is fine, but it is important to specify it (e.g., this affects the dimensionality of theta).
Important aspects of the training of the networks should be explained earlier in the manuscript. First, the networks are trained on hundreds of billions of simulations: O(10^{6}) parameter settings times 10^{5} samples per parameter setting. This ballpark should be mentioned early (e.g., in Figure 1 and 2 or related text). More generally, it is important to stress that amortized inference is not free at all – it incurs a very large cost upfront. So, for example, it is likely not worth it in the scenario in which a researcher is iterating through many different models (as opposed to a single model applied many times). Second, the initial introduction of the neural network models (lines 255269) is confusing, due to some unfortunate choice of wording in the explanation (and despite the fact that what the authors are doing is quite simple):
1) In the pointwise representation, the network is learning a mapping from x and theta to p(xtheta).
2) In the histogram representation, the network is learning a mapping from theta to p(xtheta), where p(xtheta) is represented as a histogram over the space of (discretized) observations x.
What is confusing is that the choice of architecture used in the paper (MLP for pointwise, CNN for histogram) seems to be orthogonal to what the network is representing. The histogram representation could use a MLP architecture. In fact, the authors should explain why they state that the histogram representation "naturally leads to CNNs as the network architecture" (e.g., there seems to be no translational invariance in the problem). Why is MLP a priori better for approximating pointwise likelihoods, and CNN better for histogram likelihoods? What, in these architectures, makes them suited for the tasks? For example, what would authors expect, if they used CNNs for pointwise likelihoods and MLPs for histogram likelihoods? Did authors try different architectures and ended concluding that this was the best choice, or is it an educated guess?
Also, the expression "discretized global likelihood of an entire dataset" confusing, because to me that would be learning p(Dtheta). Instead, the authors mean learning the full trial likelihood function p(xtheta), which in their case helps compute the likelihood of the entire datasets (because the trials are assumed to be i.i.d.). Last, the authors use "parameterization" when they mean "parameter vector".
Overall, it would be very helpful if the authors were to rewrite this part more clearly. Figure 2 could be made larger, highlighting the input and output of the networks.
2. Limitations of the method:
Related to an earlier point, it is important to stress explicitly that amortized inference is not free at all – it incurs a very large cost upfront (here, training ANNs on hundreds of billions of simulations: O(10^{6}) parameter settings times 10^{5} samples per parameter setting). So, for example, it is likely not worth it in the scenario in which a researcher is iterating through many different models (as opposed to a single model applied many times).
The method described only applies to models whose likelihood is conditionally independent given the model and trial parameters (theta); that is whose likelihood factorizes as p(Dtheta) = \prod_{i} p(x_{i}theta). Models with a Markovian dependence on previous observations can still be addressed (by putting a limited number of previous observations into theta). However, the likelihood above does not apply to models with latent dynamics. The models that satisfy the requirements of the method are still plenty of models in cognitive science, but still, it is very important to specify the restrictions of the method. The authors can look at the section "Definitions and notation" of van Opheusden et al. (2020) for a more indepth discussion.
The authors should state clearly the scalability of the proposed approach based on the dimensionality of theta (which includes the "stimulus" s, see an earlier point). The bottleneck here seems to be the coverage of the "prior" space. It is already impressive that the method works alright in 67 dimensions; still, it can be expected to worsen quickly due to the curse of dimensionality (not because of the network – but because of the inability of the prior sampling to cover the space). For example (which would be good to mention), Goncalves et al. (2020) have shown that neural networks can reconstruct posteriors in up to ~30 dimensions (with some tricks). However, to do so they cannot use pure, "a priori" amortized inference, but they need to iterate the sampling procedure to zoom into the region of interest.
Relatedly, the authors might want to comment on what happens when s or theta is not a continuous parameter (for example, see again van Opheusden et al., 2020; in one of the models, s takes an integer value of a "board configuration" which is not necessarily part of a nice metric space; e.g. board state 15 is not necessarily close to 16). Clearly the network can still work in principle but effectively we would expect discontinuities (due to learning in a discrete space) which might cause issues. In other words, unless the authors have some evidence that their method also works well for discrete parameters, they might want to mention this limitation of the method.
Relatedly also, the authors should explain more clearly how they chose the range of parameters which they sample from during ANN training. In a sense, and although the goal here is to approximate the likelihood, it seems to me that this step effectively places some form of implicit prior distribution on parameters (at least in terms of admissible parameter values). In turn, this induces some estimation artefacts, which can be eyeballed on some panels of Figure 8 (cf. α and β bound parameters of the Weibull model). Wouldn't the estimation error be much higher, had the authors simulated data under broader parameter ranges (further beyond the sampling range that was used for ANN training)? Does a restricted sampling range not eventually induce some form of potential nonidentifiability issue? These points should be discussed more explicitly in the manuscript.
The authors have shown that the method they propose yields efficient (Bayesian) statistical inference on model parameters. However, one may also want to use the method for comparing models. Can authors demonstrate the model comparison capabilities of their method? If the authors believe this is out of scope, they should at least discuss this possible application of their method. ABC methods have sometimes been criticized for the fact that they induce model section biases. However, these biases may be related to the selection of data sufficient statistics (which may be unfair to some models). How do these points translate to the current approach?
The authors have demonstrated the appeal of their method on DDM variants, which typically predict response times and choices. Practically speaking, this means storing the trained ANNs for these two types of dependent variables. But what if one uses these models to predict another set of dependent behavioral variables (e.g. including choice confidence)? It seems that one would have to reperform all the simulations. Or is there a way to leverage what has already been done? The authors may want to discuss this point somewhere in the manuscript.
3. Figures are not described with a sufficient amount of detail, and the read has at times to guess what captions do not state. For example, what do negative RT mean in Figure 4 (I guess: RT conditional on "the other" choice outcome)? What are the middle panels of Figure 7 showing (I guess: posterior intervals on ANGLE bounds over decision time, as well as empirical and fitted choicedependant RT histograms)? What are the X1024 and X4096 variants of their ANNbased approach? What are the nodes/variables in the DAGs of Figure 10? These ambiguities should be solved in the revised manuscript.
4. The relationship of this work to a few related lines of work may deserve to be discussed. In particular, the approach proposed here appears to share similarities to (variational) autoencoders, which is now an established ANNbased method for solving very similar problems. Also, it may be worth discussing the pros and cons of the authors' approach, when compared to other existing approaches to likelihoodfree inference, which would approximate the likelihood in some other way (using, e.g., Gaussian processes). Anecdotally, for the specific case of 'vanilla' DDM, there are recent advances to the problem of deriving conditional and marginal moments of RT distributions (Srivastava et al., 2016). These are much faster to evaluate than Fuss and Navarro (2009), because they provide analytical expressions that do not require any iterative numerical computations. Embedded within an adhoc likelihood (derived from an observation equation of the form: RT = E[RTtheta] + i.i.d. Gaussian noise), these can provide very fast/efficient parameter estimates. It would be worth discussing these kinds of approaches somewhere in the Discussion section.
https://doi.org/10.7554/eLife.65074.sa1Author response
Essential revisions:
1. Clarifications:
For most of the manuscript, the authors make a distinction of "analytical likelihood" vs. "no likelihood" (where one needs to do simulationbased inference). However, there is a middleground of models for which the likelihood is not analytical but can be computed numerically. These models constitute a superset of the analytical models. The authors touch upon this point only in the Discussion (when talking about Shinn et al. 2020 and generalized DDMs), but this point and the reference should come up much earlier as it is quite important. There are still good reasons for wanting to do amortized inference as opposed to using numerical likelihoods (the latter can be very expensive, and they don't exist for all DDMs), but the readers should be made aware of these alternative options – even beyond the DDM example.
We agree that this is an important distinction that the paper should have emphasized more, beyond alluding to this in Figure 3 (whereby the full DDM was depicted in a different color because it requires numerical integration and is therefore costly to evaluate). We now make it explicit in Section 3 and mention it again in the main text in Section 4 when describing the testbed models. Indeed, powerful numerical approximation methods exist for some likelihoods, but these do not afford straightforward or rapid Bayesian inference. An example of this is already present in the full DDM, given the numerical integration, which typically slows down estimation compared to standard DDM by at least an order of magnitude. Informal tests against the full DDM likelihood in the HDDM python package (Wickie et al. 2013) show that we gain an approximately 40 fold increase in speed for this model when using LANs instead. Notably, the generalized DDMs discussed in Shinn et al. 2020 did not include a method for posterior estimation of parameters (which would likely be quite costly to compute). Moreover, these generalized DDMs are limited to cases with Gaussian noise. Any departures from that, such as Levy flights, are not accommodated. This issue is not a concern for LANs.
Mathematical notations are problematic. First, most notations are not described when they first appear, and the reader often has to guess what the variables are. For example, the KDE notation is first presented when introducing the ABC approach (section 2), but one has to read the Additional Methods (section 10) to find out what they really are. Second, mathematical notations are not fully consistent throughout the paper. For example, the parameters are called either P or theta, the data is sometimes called D and sometimes X, whereas X also sometimes denotes the DDM's accumulated evidence. The variable W sometimes means the ANN's weights, sometimes the DDM's Wiener noise, sometimes the DDM's initial condition. The authors should make sure that notations are selfcontained and consistent throughout the manuscript.
Another concern regarding mathematical notations is that the typical notation for modelling likelihoods in cognitive science is p(xs,theta), where x is the observation/response, s a vector describing the "stimulus" in the current trial, and theta the model parameters. In this paper, the authors merge s into theta, which is fine, but it is important to specify it (e.g., this affects the dimensionality of theta).
We thank the reviewers for pointing this out. We have now improved notational consistencies. We now refer to network parameters as Φ, and datasets and data points are consistently 𝐱 and 𝑥. Model parameters and parameter dimensions are now always 𝜃 and _{Θ}. Wiener noise is mentioned once as a capital bold 𝑑_{𝐖}, which is distinct from (lower case) 𝑤 as the starting point parameter. We also made explicit that we suppress the stimulus 𝑠 in our discussions when first defining the likelihood.
Important aspects of the training of the networks should be explained earlier in the manuscript. First, the networks are trained on hundreds of billions of simulations: O(10^{6}) parameter settings times 10^{5} samples per parameter setting. This ballpark should be mentioned early (e.g., in Figure 1 and 2 or related text). More generally, it is important to stress that amortized inference is not free at all – it incurs a very large cost upfront. So, for example, it is likely not worth it in the scenario in which a researcher is iterating through many different models (as opposed to a single model applied many times).
We have now added additional details to our initial discussion of the ANNs and training procedures to help readability. As we now reiterate in the text, our approach’s primary benefit lies in reusability and flexibility, which we believe will afford practical advantages for end users to evaluate models of interest across arbitrary inference scenarios. Note that we did not attempt a structured minimization of training data. Indeed, informal experiments indicate performance would be quite similar with an order of magnitude less training. The reason we are not so focused on the precise training time is that the cost of training LANs is only incurred once for any theoretically meaningful model. At this point, the trained network can be added to a bank of LANs, which can be made available to the community and reused without further training. In turn, the community will be able to test their data against a broader variety of models (and to easily add their own to the bank) and which can be evaluated for arbitrary inference scenarios (hierarchical estimation, allow some parameters to vary across conditions or to relate to neural measure X, etc.). While it is true that, in principle, we would prefer to further minimize amortization cost (and we may attempt to do so in future), we view in the long run that this component is comparatively negligible for any model that would be reused.
Second, the initial introduction of the neural network models (lines 255269) is confusing, due to some unfortunate choice of wording in the explanation (and despite the fact that what the authors are doing is quite simple):
1) In the pointwise representation, the network is learning a mapping from x and theta to p(xtheta).
2) In the histogram representation, the network is learning a mapping from theta to p(xtheta), where p(xtheta) is represented as a histogram over the space of (discretized) observations x.
What is confusing is that the choice of architecture used in the paper (MLP for pointwise, CNN for histogram) seems to be orthogonal to what the network is representing. The histogram representation could use a MLP architecture. In fact, the authors should explain why they state that the histogram representation "naturally leads to CNNs as the network architecture" (e.g., there seems to be no translational invariance in the problem). Why is MLP a priori better for approximating pointwise likelihoods, and CNN better for histogram likelihoods? What, in these architectures, makes them suited for the tasks? For example, what would authors expect, if they used CNNs for pointwise likelihoods and MLPs for histogram likelihoods? Did authors try different architectures and ended concluding that this was the best choice, or is it an educated guess?
We agree that the proposed network architectures are not hard constraints, and we did not intend to imply otherwise. Indeed, we had discussed this very issue amongst ourselves. We have now clarified in the manuscript that the problem representation is what matters. We pursued the architectures in each case that seemed to follow naturally but without any commitment to their optimality. When learning a mapping from simulation model parameters to a histogram representation of the likelihood, we map a low dimensional data manifold to a much higher dimensional one. Using an MLP for this purpose would imply that the number of neural network parameters needed to learn this function would be orders of magnitude larger than using a CNN. Not only would this mean that “forward passes“ through the network would take longer, but also an increased propensity to overfit on an identical data budget. Given that our networks performed adequately well in the training objective and parameter recovery, we felt that we did not need to pursue a rigorous architecture search for this paper’s main points to be developed. We will optimize these for speed in future work.
Also, the expression "discretized global likelihood of an entire dataset" confusing, because to me that would be learning p(Dtheta). Instead, the authors mean learning the full trial likelihood function p(xtheta), which in their case helps compute the likelihood of the entire datasets (because the trials are assumed to be i.i.d.).
We recognized this as potentially confusing and rewrote sections of the related paragraphs for clarity.
Last, the authors use "parameterization" when they mean "parameter vector".
Overall, it would be very helpful if the authors were to rewrite this part more clearly.
Thank you, we have now fixed in response to the same point elsewhere in the reviewer comments.
Figure 2 could be made larger, highlighting the input and output of the networks.
Thanks for this suggestion. We have now edited Figure 2 to showcase the neural networks more prominently.
2. Limitations of the method:
Related to an earlier point, it is important to stress explicitly that amortized inference is not free at all – it incurs a very large cost upfront (here, training ANNs on hundreds of billions of simulations: O(10^{6}) parameter settings times 10^{5} samples per parameter setting). So, for example, it is likely not worth it in the scenario in which a researcher is iterating through many different models (as opposed to a single model applied many times).
Please see a detailed response to this comment above. We now added these aspects at the end of section 3, and included some discussion that puts the quoted numbers into perspective. We agree that this will likely aid readability and help with a more nuanced perspective on global amortization.
The method described only applies to models whose likelihood is conditionally independent given the model and trial parameters (theta); that is whose likelihood factorizes as p(Dtheta) = \prod_{i} p(x_{i}theta). Models with a Markovian dependence on previous observations can still be addressed (by putting a limited number of previous observations into theta). However, the likelihood above does not apply to models with latent dynamics. The models that satisfy the requirements of the method are still plenty of models in cognitive science, but still, it is very important to specify the restrictions of the method. The authors can look at the section "Definitions and notation" of van Opheusden et al. (2020) for a more indepth discussion.
We thank the reviewers for raising this point. The problems with amortization when one of the “parameters“ may be a discrete space of high cardinality are indeed real. However, in general our approach can in fact incorporate latent processes on the parameters, which can induce sequential dependence. An example is the RLDDM model in Pedersen et al. 2020, which puts a latent reinforcement learning process on some of the DDM parameters, which we can easily accommodate with LANs, and which we note in the discussion.
The authors should state clearly the scalability of the proposed approach based on the dimensionality of theta (which includes the "stimulus" s, see an earlier point). The bottleneck here seems to be the coverage of the "prior" space. It is already impressive that the method works alright in 67 dimensions; still, it can be expected to worsen quickly due to the curse of dimensionality (not because of the network – but because of the inability of the prior sampling to cover the space). For example (which would be good to mention), Goncalves et al. (2020) have shown that neural networks can reconstruct posteriors in up to ~30 dimensions (with some tricks). However, to do so they cannot use pure, "a priori" amortized inference, but they need to iterate the sampling procedure to zoom into the region of interest.
Indeed, we would generally not advise to apply amortization with LANs naively for models of > _{15} parameters (the CNN was tested on models in that ballpark with the LCA4 in the paper) – or at least we haven’t tested that scenario yet. It is correct that in general the curse of dimensionality can hardly be avoided here, if global amortization is the goal. We do cite Goncalves et al. (2020) twice, and we now also explicitly mention the ballpark figure of 30 parameters with their method. However, a major part of our argument is that modularity of the amortized piece of the pipeline is important. While the amortization of a 30 dimensional posterior is impressive, it remains inflexible to any consideration of experimental design and in spirit focuses on ’single model’, ’single dataset’ fits. Even hierarchical estimation of the basic DDM model with 10 subjects, will elevate us to > 40 parameters and render posterior amortization futile. While we emphasize the virtues of our approach, which lies in the flexibility and reuse, we certainly agree that it is not always the right choice to amortize likelihoods and that in some situations one may rightfully prefer the methods of e.g. Goncalves et al. (2020) (and others). Scaling of our approach with respect to a single model of possibly large parameter spaces, should follow the limitations of the approach of Goncalves et al. (2020), when training is restricted to one round (without ’zooming in’).
Relatedly, the authors might want to comment on what happens when s or theta is not a continuous parameter (for example, see again van Opheusden et al., 2020; in one of the models, s takes an integer value of a "board configuration" which is not necessarily part of a nice metric space; e.g. board state 15 is not necessarily close to 16). Clearly the network can still work in principle but effectively we would expect discontinuities (due to learning in a discrete space) which might cause issues. In other words, unless the authors have some evidence that their method also works well for discrete parameters, they might want to mention this limitation of the method.
Thank you, we have added this limitation briefly. In principle discrete parameters can work, however in general it is true that it may not be advisable for parameters which have a lot of discrete states, for which conditioning on the positions on a chessboard is a good example. The resulting manifold will likely not be smooth enough. Also, the amount of training data needed to amortize such scenario fully, may simply render global amortization an unreasonable computational strategy to begin with.
Relatedly also, the authors should explain more clearly how they chose the range of parameters which they sample from during ANN training. In a sense, and although the goal here is to approximate the likelihood, it seems to me that this step effectively places some form of implicit prior distribution on parameters (at least in terms of admissible parameter values). In turn, this induces some estimation artefacts, which can be eyeballed on some panels of Figure 8 (cf. α and β bound parameters of the Weibull model). Wouldn't the estimation error be much higher, had the authors simulated data under broader parameter ranges (further beyond the sampling range that was used for ANN training)? Does a restricted sampling range not eventually induce some form of potential nonidentifiability issue? These points should be discussed more explicitly in the manuscript.
Generally performance will not be optimal if the generative parameters are too close (or outside as one may consider for experimental data) to the boundary of the training region, or if a dataset is defective in a number of ways (e.g., too few choices in one or the other category) which reduces parameter identifiability even for analytic likelihood methods. We now added some clarifying remarks at the end of the section on parameter recovery. The parameter spaces were chosen so as to strongly envelope reasonable expectations for experimental data. In actuality, our LANs were trained to cover a far broader range of RTs compared to the general scope in which DDMlike models are evaluated (usually decisions that take a few seconds; our LANs cover up to 30 seconds, which could always be further expanded but is unlikely to be meaningful). We explain this again in a response to a similar criticism below. We discuss this as well in response to the comment about Line 438489.
The authors have shown that the method they propose yields efficient (Bayesian) statistical inference on model parameters. However, one may also want to use the method for comparing models. Can authors demonstrate the model comparison capabilities of their method? If the authors believe this is out of scope, they should at least discuss this possible application of their method. ABC methods have sometimes been criticized for the fact that they induce model section biases. However, these biases may be related to the selection of data sufficient statistics (which may be unfair to some models). How do these points translate to the current approach?
We now discuss this in section 8. It is indeed an important topic and a feasible application of the method. Initial results in the context of an extension to the HDDM package, show that DIC appears to work adequately to the same degree it does for analytic likelihood. However, to properly develop the model comparison aspect, and to do it justice would demand a substantial addition to the paper, which we indeed intended to leave for future work.
The authors have demonstrated the appeal of their method on DDM variants, which typically predict response times and choices. Practically speaking, this means storing the trained ANNs for these two types of dependent variables. But what if one uses these models to predict another set of dependent behavioral variables (e.g. including choice confidence)? It seems that one would have to reperform all the simulations. Or is there a way to leverage what has already been done? The authors may want to discuss this point somewhere in the manuscript.
Thanks for this point. While there are some models that consider confidence as arising directly from the decision variable in DDMs (e.g., van den Berg et al., 2016), here we conceptualize this question more generally as asking about an approach to a joint model, where confidence is modeled conditioned on reaction times and choices, with some probabilistic model 𝑝_{𝑀𝑐𝑜𝑛𝑓} (𝑐𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒𝑟𝑡, 𝑐, 𝜃_{𝑐𝑜𝑛𝑓} ).
As a whole we would be left with,
𝑝𝑀_{𝑐𝑜𝑛𝑓} (𝑐𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒𝑟𝑡, 𝑐, 𝜃_{𝑐𝑜𝑛𝑓} )𝑝_{𝐷𝐷𝑀}(𝑟𝑡, 𝑐𝜃_{𝐷𝐷𝑀})
In essence, the question becomes how to “integrate out“ reaction times and choices from this model of choice confidence. This can be achieved either by using numerical integration using precomputed LANs, or by attempting to train the joint model directly as a LAN (in principle training data from the DDMLAN could be reused here or the DDMLAN itself, a variety of options would exist). This is, in fact, an important secondorder application, which we hope to address in future work.
3. Figures are not described with a sufficient amount of detail, and the read has at times to guess what captions do not state. For example, what do negative RT mean in Figure 4 (I guess: RT conditional on "the other" choice outcome)? What are the middle panels of Figure 7 showing (I guess: posterior intervals on ANGLE bounds over decision time, as well as empirical and fitted choicedependant RT histograms)? What are the X1024 and X4096 variants of their ANNbased approach? What are the nodes/variables in the DAGs of Figure 10? These ambiguities should be solved in the revised manuscript.
Thank you for pointing out the lack of clarity in the captions. We hope to have resolved the resulting ambiguities. Figure 10 now explains the plate notation used and what 𝜃 and 𝜆 stand for right away. We added explanations regarding the legends in Figure 9. We added information to Figure 7 so that we explicitly describe the left, middle and right panels. We clarified what negative RTs mean in Figure 4 (indeed they are mirrored to display RTs for the alternative choice).
4. The relationship of this work to a few related lines of work may deserve to be discussed. In particular, the approach proposed here appears to share similarities to (variational) autoencoders, which is now an established ANNbased method for solving very similar problems. Also, it may be worth discussing the pros and cons of the authors' approach, when compared to other existing approaches to likelihoodfree inference, which would approximate the likelihood in some other way (using, e.g., Gaussian processes). Anecdotally, for the specific case of 'vanilla' DDM, there are recent advances to the problem of deriving conditional and marginal moments of RT distributions (Srivastava et al., 2016). These are much faster to evaluate than Fuss and Navarro (2009), because they provide analytical expressions that do not require any iterative numerical computations. Embedded within an adhoc likelihood (derived from an observation equation of the form: RT = E[RTtheta] + i.i.d. Gaussian noise), these can provide very fast/efficient parameter estimates. It would be worth discussing these kinds of approaches somewhere in the Discussion section.
1. Variational Autoencoder: While a link can be drawn to autoencoders (see Fengler et al.(2020) Cognitive Science Society conference paper), we consider it not essential and potentially misleading (this reflects an update in our own thinking), in the following sense. Our networks are not attempting to learn latent variables, but rather the ’forward part’ of the problem. We ultimately use the ’forward part’ to infer the latent parameter vectors 𝜃 as part of the Bayesian inference framework, but this has little to do with the amortization strategy. We are not attempting a mapping from data back to data (via a bottleneck), in both strategies and moreover the learned mappings are not stochastic. Hence we decided to leave out any connection to the variational autoencoder framework.
2. Gaussian Processes: We rewrote parts of sections 3 and expanded the discussion on related works to include Gaussian Process approaches and other likelihood based variants, to make sure the contextualization is more thorough early on, instead of relying solely on elaborations in the discussion.
3. ’Vanilla’ DDM alternatives: The approach of Srivastava et al. (2016) aims to capture moments of RT distributions. While one could build adhoc likelihood functions of the form suggested by the reviewer, this seems to defeat the purpose of attempting to be faithful to the actual likelihood (there seems to be no guarantee that basing the likelihood on 𝑅𝑇 = 𝐸[𝑅𝑇𝑡ℎ𝑒𝑡𝑎]+𝑒𝑝𝑠), which may lead to highly inaccurate posteriors when compared to our method (of course this is testable). Notwithstanding the potential general usefulness of the work of Srivastava et al. (2016), this approach may be of questionable value when one is interested in fast and “accurate“ posterior inference, and moreover does not generalize to model variations such as the LCA, Levy Flights, etc.
https://doi.org/10.7554/eLife.65074.sa2Article and author information
Author details
Funding
National Institute of Mental Health (P50 MH11946701)
 Michael J Frank
National Institute of Mental Health (R01 MH08484008A1)
 Michael J Frank
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was funded by NIMH grants P50 MH 11946701 and R01 MH08484008A1. We thank Michael Shvartsman, Matthew Nassar, and Thomas Serre for helpful comments and discussion regarding the earlier versions of this manuscript. Furthermore, we thank Mads Lund Pederson for help with integrating our methods into the HDDM Python toolbox. Lastly, we would like to thank the two reviewers of the manuscript for helpful suggestions, which improved the readability of the manuscript.
Senior Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Reviewing Editor
 Valentin Wyart, École normale supérieure, PSL University, INSERM, France
Reviewers
 Luigi Acerbi
 Jean Daunizeau, INSERM, France
Version history
 Received: November 21, 2020
 Accepted: April 1, 2021
 Accepted Manuscript published: April 6, 2021 (version 1)
 Version of Record published: May 6, 2021 (version 2)
 Version of Record updated: May 21, 2021 (version 3)
Copyright
© 2021, Fengler et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 3,495
 Page views

 521
 Downloads

 19
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Developmental Biology
 Neuroscience
Chimeric RNAs have been found in both cancerous and healthy human cells. They have regulatory effects on human stem/progenitor cell differentiation, stemness maintenance, and central nervous system development. However, whether they are present in human retinal cells and their physiological functions in the retinal development remain unknown. Based on the human embryonic stem cellderived retinal organoids (ROs) spanning from days 0 to 120, we present the expression atlas of chimeric RNAs throughout the developing ROs. We confirmed the existence of some common chimeric RNAs and also discovered many novel chimeric RNAs during retinal development. We focused on CTNNBIP1CLSTN1 (CTCL) whose downregulation caused precocious neuronal differentiation and a marked reduction of neural progenitors in human cerebral organoids. CTCL is universally present in human retinas, ROs, and retinal cell lines, and its lossoffunction biases the progenitor cells toward retinal pigment epithelial cell fate at the expense of retinal cells. Together, this work provides a landscape of chimeric RNAs and reveals evidence for their critical role in human retinal development.

 Neuroscience
Deciphering patterns of connectivity between neurons in the brain is a critical step toward understanding brain function. Imagingbased neuroanatomical tracing identifies areatoarea or sparse neurontoneuron connectivity patterns, but with limited throughput. Barcodebased connectomics maps large numbers of singleneuron projections, but remains a challenge for jointly analyzing singlecell transcriptomics. Here, we established a rAAV2retro barcodebased multiplexed tracing method that simultaneously characterizes the projectome and transcriptome at the single neuron level. We uncovered dedicated and collateral projection patterns of ventromedial prefrontal cortex (vmPFC) neurons to five downstream targets and found that projectiondefined vmPFC neurons are molecularly heterogeneous. We identified transcriptional signatures of projectionspecific vmPFC neurons, and verified Pou3f1 as a marker gene enriched in neurons projecting to the lateral hypothalamus, denoting a distinct subset with collateral projections to both dorsomedial striatum and lateral hypothalamus. In summary, we have developed a new multiplexed technique whose paired connectome and gene expression data can help reveal organizational principles that form neural circuits and process information.