Designing optimal behavioral experiments using machine learning
Abstract
Computational models are powerful tools for understanding human cognition and behavior. They let us express our theories clearly and precisely and offer predictions that can be subtle and often counterintuitive. However, this same richness and ability to surprise means our scientific intuitions and traditional tools are illsuited to designing experiments to test and compare these models. To avoid these pitfalls and realize the full potential of computational modeling, we require tools to design experiments that provide clear answers about what models explain human behavior and the auxiliary assumptions those models must make. Bayesian optimal experimental design (BOED) formalizes the search for optimal experimental designs by identifying experiments that are expected to yield informative data. In this work, we provide a tutorial on leveraging recent advances in BOED and machine learning to find optimal experiments for any kind of model that we can simulate data from, and show how byproducts of this procedure allow for quick and straightforward evaluation of models and their parameters against real experimental data. As a case study, we consider theories of how people balance exploration and exploitation in multiarmed bandit decisionmaking tasks. We validate the presented approach using simulations and a realworld experiment. As compared to experimental designs commonly used in the literature, we show that our optimal designs more efficiently determine which of a set of models best account for individual human behavior, and more efficiently characterize behavior given a preferred model. At the same time, formalizing a scientific question such that it can be adequately addressed with BOED can be challenging and we discuss several potential caveats and pitfalls that practitioners should be aware of. We provide code to replicate all analyses as well as tutorial notebooks and pointers to adapt the methodology to different experimental settings.
Introduction
Computational modeling of behavioral phenomena is currently experiencing rapid growth, in particular with respect to methodological improvements. For instance, seminal work by Wilson and Collins, 2019 has been crucial in raising methodological standards and bringing attention to how computational analyses can add value to the study of human (and animal) behavior. Meanwhile, in most instances, computational analyses are applied to data collected from experiments that were designed based on intuition and convention. That is, while experimental designs are usually motivated by scientific questions in mind, they are often chosen without explicitly and quantitatively considering how informative these data might be for the computational analyses and, finally, the scientific questions being studied. This can, in the worst case, completely undermine the research effort, especially as particularly valuable experimental designs can be counterintuitive. Today, advancements in machine learning (ML) open up the possibility of applying computational methods when deciding how experiments should be designed in the first place, to yield data that are maximally informative with respect to the scientific question at hand.
In this work, we provide an introduction to modern BOED and a stepbystep tutorial on how these advancements can be combined and leveraged to find optimal experimental designs for any computational model that we can simulate data from, and show how byproducts of this procedure allow for quick and straightforward evaluation of models and their parameters against real experimental data. Formalizing the scientific goal of an experiment such that our design aligns with our intention can be difficult. BOED forces the researcher to make explicit, and engage with, various assumptions and constraints that would otherwise remain implicit, which helps in making experimental research more rigorous but can also lead to unexpected optimal designs. We discuss several issues related to experimental design that practitioners may encounter no matter how they design their experiments. Importantly, these issues are made more salient by formalizing the process.
Why optimize experimental designs?
Experiments are the bedrock of scientific data collection, making it possible to discriminate between different theories or models, and discovering what specific commitments a model must make to align with reality by means of parameter estimation. The process of designing experiments involves making a set of complex decisions, where the space of possible experimental designs is typically navigated based on a combination of researchers’ prior experience, intuitions about what designs may be informative, and convention.
This approach has been successful for centuries across various scientific fields, but it has important limitations that are amplified as theories become richer, more nuanced, and more complex. Often, a theory does not make a concrete prediction but rather has free parameters that license a variety of possible predictions of varying plausibility. For example, if we want to model human behavior, it is important to recognize that different participants in a task might have different strategies or priorities, which can be captured by parameters in a model. As we expand models to accommodate more complex effects, it becomes ever more timeconsuming and difficult to design experiments that distinguish between models, or allow for effective parameter estimation. As an example from cognitive science, there are instances where models turned out to be empirically indistinguishable under certain conditions, a fact that was only recognized after a large number of experiments had already been done (Jones and Dzhafarov, 2014). Such problems also play an important role in the larger context of the replication crisis (e.g. Pashler and Wagenmakers, 2012).
Collecting experimental data is typically a costly and timeconsuming process, especially in the case of studies involving resourceintensive recording techniques, such as neuroimaging or eyetracking (e.g. Dale, 1999; Lorenz et al., 2016). There is a strong economic as well as ethical case for conducting experiments that are expected to lead to maximally informative data with respect to the scientific question at hand. For instance, in computational psychiatry, researchers are often interested in estimating parameters describing people’s traits and how they relate, often doing so under constrained resources in terms of the number of participants and their time. Additionally, researchers and practitioners alike require that experiments are sufficiently informative to provide confidence that they would reveal effects if they are there and also that negative results are true negatives (e.g. Huys et al., 2013; Karvelis et al., 2018). Consequently, as theories of natural phenomena become more realistic and complex, researchers must expand their methodological toolkit for designing scientific experiments.
Bayesian optimal experimental design
As a potential solution, Bayesian optimal experimental design (BOED; see Ryan et al., 2016; Rainforth et al., 2023, for reviews) provides a principled framework for optimizing the design of experiments. Broadly speaking, BOED rephrases the task of finding optimal experimental designs to that of solving an optimization problem. That is, the researcher specifies all controllable parameters of an experiment, also known as experimental designs, for which they wish to find optimal settings, which are then determined by maximizing a utility function. Exactly what we might want to optimize depends on our experiment, but can include any aspect of the design that we can specify, such as which stimuli to present, when or where measurements should be taken in a quasiexperiment, or, as an example, how to reward risky choices in a behavioral experiment. The utility function is selected to measure the quality of a given experimental design with respect to the scientific goal at hand, such as model discrimination, parameter estimation or prediction of future observations, with typical choices including expected information gain, uncertainty reduction, and many others (Ryan et al., 2016), as explained below. Importantly, BOED approaches require that theories are formalized via computational models of the underlying natural phenomena.
Computational models
Scientific theories are increasingly formalized via computational models (e.g. Guest and Martin, 2021). These models take many forms, but one desideratum for a model is that, given a set of experimental data, we can compare it to other models based on how well it predicts that data. To that end, many computational models are constructed in a way that allows for the analytic evaluation of the likelihood of the model given data, by means of a likelihood function. However, this imposes strong constraints on the space of models we can consider. Models that are rich enough to accurately describe complex phenomena—like human behavior—often have intractable likelihood functions (e.g. Lintusaari et al., 2017; Cranmer et al., 2020). In other words, we cannot compute likelihoods due to the prohibitive computational cost of doing so, or the inability to express the likelihood function mathematically. This has at least two implications (1) When we develop rich models, we often lack the tools to evaluate and compare them directly, and are left with coarse, qualitative proxy measures–like the “the model curves look like the human curves for some parameter settings!”; (2) when we are intent on conducting a systematic comparison of models, we are often forced to use simplified models that have tractable likelihoods, even when the simplifications lead to consequential departures from the theories we might want our models to capture.
Simulator models: What if the likelihood cannot be computed?
Many realistically complex models have the feature that we can simulate data from them, which allows for using simulationbased inference methods, such as approximate Bayesian computation (ABC) and many others (e.g. Lintusaari et al., 2017; Cranmer et al., 2020). There has been a growing interest in this class of models, often referred to as generative, implicit, or simulator models (e.g. Marjoram et al., 2003; Sisson et al., 2007; Lintusaari et al., 2017; Brehmer et al., 2020; Cranmer et al., 2020). We will use the term simulator model throughout this paper to refer to any model from which we can simulate data. We note that some generative models have tractable likelihood functions, but this class of models is subsumed by simulator models, which do not make any structural assumptions on the likelihood and include important cases where the likelihood is too expensive to be computed, or cannot be computed at all (simulator models also include the common case of a potentially complex deterministic simulator with an additional observation noise component, as is used in many areas of science). Simulator models have become ubiquitous in the sciences, including physics (Schafer and Freeman, 2012), biology (Ross et al., 2017), economics (Gouriéroux et al., 2010), epidemiology (Corander et al., 2017), and cognitive science (Palestro et al., 2018), among others.
The study of human, and animal, perception and cognition is experiencing a surge in the use of computational modeling (e.g. Griffiths, 2015). Here, simulator models can formalize complex theories about latent psychological processes to arrive at quantitative predictions about observable behavior. As such, simulator models appear across cognitive science, including for example, many Bayesian and connectionist models, a wide range of process models, cognitive architectures, or many reinforcement learning models and physics engines (see, e.g. Turner and Van Zandt, 2012; Turner and Sederberg, 2014; Bitzer et al., 2014; Turner et al., 2018; Bramley et al., 2018; Ullman et al., 2018; Palestro et al., 2018; Kangasrääsiö et al., 2019; Gebhardt et al., 2021). The applicability and usefulness of experimental design optimization for parameter estimation and model comparison has been demonstrated in many scientific disciplines, but this has often been restricted to settings with simple and tractable classes of models, for instance in cognitive science (e.g. Myung and Pitt, 2009; Zhang and Lee, 2010; Ouyang et al., 2018).
An overview of this tutorial
In this tutorial, we present a flexible workflow that combines recent advances in ML and BOED in order to optimize experimental designs for any computational model from which we can simulate data. A highlevel overview of the underlying MLbased BOED method is shown in Figure 1, including researchers’ inputs and resulting outputs. The presented approach is applicable to any model from which we can simulate data and scales well to realistic numbers of design variables, experimental trials and blocks. Furthermore, as well as optimized designs, the presented approach also results in tractable amortized posterior inference—that is, it allows researchers to use their actual experimental data to easily compute posterior distributions, which may otherwise be computationally expensive or intractable to compute. Moreover, we note that, to our knowledge, there has been no study that performs BOED for simulator models without likelihoods and then uses the optimal designs to run a real experiment. Instead, most practical applications of BOED have been limited to simple scientific theories of nature (e.g. Liepe et al., 2013). This paucity of realworld applications is partly because the situations where BOED would be most useful are those where it has previously been computationally intractable to use. The MLbased approach presented in this work aims to overcome these limitations.
As with many new computational tools, however, the techniques for using BOED are not trivial to implement from scratch. In an effort to help other scientists understand these methods and use them to unlock the full potential of BOED in their own research, we next present a case study in using BOED to better understand human decisionmaking. Specifically, we (1) describe new simulator models that generalize earlier decisionmaking models that were constrained by the need to have tractable likelihoods; (2) walk through the steps to use BOED to design experiments comparing these simulator models and estimating the psychologically interpretable parameters they contain; and (3) contrast our new, optimized experiments to previous designs and illustrate their advantages and the new lessons they offer. All our steps and analyses are accompanied by code that we have commented and structured to be easily adapted by other researchers.
Case study: Multiarmed bandit tasks
As a case study application, we consider the question of how people balance their pursuit of shortterm reward (‘exploitation’) against learning how to maximize reward in the longer term (‘exploration’). This question has been studied at length in psychology, neuroscience, and computer science, and one of the key frameworks for investigating it is the multiarmed bandit decisionmaking task (e.g. Steyvers et al., 2009; Dezfouli et al., 2019; Schulz et al., 2020; Gershman, 2018). Multiarmed bandits have a long history in statistics and machine learning (e.g. Robbins, 1952; Sutton and Barto, 2018) and formalize a general class of sequential decision problems—repeatedly choosing between a set of options under uncertainty with the goal of maximizing cumulative reward (for more background, see Box 1).
Bandit tasks.
Bandit tasks constitute a minimal reinforcement learning task, in which the agent is faced with the problem of balancing exploration and exploitation. Various algorithms have been proposed for maximizing the reward in bandit tasks, and some of these algorithms have been used directly or in modified form as models of human behavior. A selection of such algorithms indeed captures important psychological mechanisms in solving bandit tasks (Steyvers et al., 2009). However, a richer account of human behavior in bandit tasks would seem to be required to accommodate flexibility and nuance of human behavior (e.g. see Dezfouli et al., 2019; Lee et al., 2011), leading to simulator models that do not necessarily permit familiar likelihoodbased inference, due to, for example unobserved latent states, and which complicate the task of intuiting informative experiments. See Box 1—figure 1 for a schematic of the datagenerating process of multiarmed bandit tasks.
We demonstrate the applicability of our method by optimizing the design of a multiarmed bandit decisionmaking task, considering three flexible extensions of previously proposed models of human behavior. In our experiments, we consider two general scientific tasks: model discrimination and parameter estimation. We then evaluate the optimal designs found using our method empirically with simulations and real human behavioral data collected from online experiments, and compare them to designs commonly used in the literature. We find that, as compared to commonly used designs, our optimal experimental designs yield significantly better model recovery, more informative posterior distributions and improved model parameter disentanglement.
Workflow summary
In the following sections, we describe and walk through our workflow stepbystep, using the aforementioned multiarmed bandit task setting as a concrete case study to further illustrate MLbased BOED. At a high level, our workflow comprises the following steps:
Defining a scientific goal, for example, which model, among a set of models, best describes a natural phenomenon.
Formalizing a theory, or theories, as a computational model(s) that can be sampled from.
Setting up the design optimization problem and deciding which aspects of the experimental design need to be optimized.
Constructing the required ML models and training them with simulated data.
Validating the obtained optimal designs in silico and, potentially, reevaluating design and modeling choices.
After confirming and validating the optimal design, the real experiment can be performed. The trained ML models used for BOED can be used afterwards to easily compute posterior distributions.
We provide a more detailed description of the above workflow in the following sections. The outputs of the MLbased BOED method are optimal experimental designs, an estimate of the (maximum) expected information gain and an amortized posterior distribution, which allows for straightforward computation of posterior distributions from collections of realworld data and saves us a potentially computationally expensive (likelihoodfree) inference step. Readers may find that many of these steps should be part of the design of experiments no matter whether using BOED or not. However, BOED here provides the additional advantage of making all of these steps, which are sometimes carried out heuristically, explicit and formal.
Step 1: Formulate your scientific goal
The first step in setting up the experimental design optimization is to define our scientific goal. Although there may be many different goals for an experiment, they often fall into the broad categories of model discrimination or parameter estimation.
Model discrimination
Computational models provide formal and testable implementations of theories about nature. Model discrimination (also often referred to as model comparison) lies at the core of many scientific questions and amounts to the problem of deciding which of a set of different models best explains some observed data. A rigorous formulation of this problem is given by the Bayesian approach: Starting from a prior belief about the plausibility of different models and their parameters, we consider how well different models can explain the observed data and use this information to update our prior beliefs. Crucially, more flexible theories are penalized automatically via the notion of the Bayesian Ockham’s razor. This ensures that in cases where two theories can explain the data equally well, we should favor the simpler one. Formally, this requires computing the marginal likelihood of our different models, which may mean integrating (or for discrete parameters, summing) over a large number of model parameters, which is often intractable, and thus often tackled via information criteria, which may, however, suffer from various problems (e.g. Pitt and Myung, 2002; Dziak et al., 2020; Schad et al., 2021). This computational complexity of computing marginal model likelihoods is exacerbated when model parameter likelihood is intractable, as is the case for many simulator models.
With respect to our case study, we may, for instance, be interested in comparing which of the three simulator models presented in Box 2 best explains human behavior in multiarmed bandit tasks.
Models of human behavior in bandit tasks.
We study models of human choice behavior in bandit tasks to demonstrate and validate our method. We generalize previously proposed models (see Appendix 3 for details), and consider three novel computational models for which likelihoods are intractable.
WinStay LoseThompsonSample (WSLTS)
Posits that people tend to repeat choices that resulted in a reward, and otherwise to explore by selecting options with a probability proportional to their posterior probability of being the best (Thompson sampling). This model generalizes the WinStay LoseShift model that has been considered in previous work (Steyvers et al., 2009), but accommodates the possibility that people might be more likely to shift to more promising alternatives than uniformly at random.
Autoregressive $\epsilon$Greedy (AEG)
Formalizes the idea that people might greedily choose the option with the highest estimated reward in each trial with a certain probability or otherwise explore randomly, but also permits that people may be ‘sticky’, that is, having some tendency to reselect whatever option was chosen on the previous trial, or ‘antisticky’, that is, preferring to switch to a new option on each trial. This generalizes the $\epsilon$greedy model studied previously (Steyvers et al., 2009), further accommodating the possibility that people preferentially stick to or avoid their previous selection.
Generalized Latent State (GLS)
Captures the idea that people might have an internal latent explorationexploitation state that influences their choices when encountering situations in which they are faced with an exploreexploit dilemma. Transitions between these states may depend on previously encountered rewards and the previous latent state. This generalizes previous latentstate models that assume people shift between states independently of previous states and observed rewards Zhang et al., 2009, or only once per task (Lee et al., 2011). As these models are novel and to illustrate the method, we do not specify strong prior beliefs about any model parameters and describe the priors distribution in Appendix 1.
Parameter estimation
Researchers are often interested in learning about model parameters of a particular, single model. Similar to the model discrimination approach, a Bayesian approach of learning about these parameters would involve starting with a prior distribution and then updating our belief via Bayes’ rule using observed data. For instance, we may be interested in estimating an individual’s (or population’s) working memory capacity given a working memory model, response time distribution in a model for selective attention, choice probability of choosing between different options in a risky choice model, inclination to explore versus exploit in a reinforcement task, or associations between biases in probabilistic visual integration tasks and autistic/schizotypal traits (Karvelis et al., 2018) to name a few examples. Model parameters may here be continuous or discrete, which matters for how we set up up the optimization procedure in Step 4.
As an example, for the Autoregressive $\u03f5$Greedy (AEG) simulator model in our case study, we are, among many other things, interested in estimating peoples’ inclination to reselect previously chosen options as opposed to avoiding repeating their own past behavior, which is characterized by a specific model parameter.
Other goals
While we here focus on the problems of model discrimination and parameter estimation, we point out that the overall BOED approach via mutual information (MI) estimation is more general, and can easily be adapted to other tasks, such as improving future predictions (e.g. Kleinegesse and Gutmann, 2021) and many others (see Ryan et al., 2016).
How do we measure the value of an experimental design?
We formalize the quality of our experimental design via a utility function, which thereby serves as a quantitative way of measuring the value of one experimental design over any another and provides the objective function for our optimization problem. Following recent work in BOED (e.g. Kleinegesse and Gutmann, 2019; Kleinegesse et al., 2020; Kleinegesse and Gutmann, 2021; Foster et al., 2019), we use MI (Lindley, 1956; also known as the expected information gain) as our utility function $U(\mathbf{d})$ for measuring the value of an experimental design $\mathbf{d}$, that is
where $\mathit{v}$ is a variable we wish to estimate and $\mathit{y}$ corresponds to the observed data (actions and rewards in our case study). Note that the probability density function $p(\mathit{v}\mathbf{y},\mathbf{d})$ is the posterior distribution and $p(\mathit{v})$ is the prior distribution of the variable of interest.
For the model discrimination task, the variable of interest $\mathit{v}$ in our utility function corresponds to a scalar model indicator variable $m$ (where $m=1$ corresponds to a model 1, $m=2$ to some model 2 and so on for as many models as we are comparing). For the parameter estimation task, the variable of interest $\mathit{v}$ in our utility function is the vector of model parameters ${\mathit{\theta}}_{m}$ for a given model $m$.
MI has several appealing properties (Paninski, 2005) that make it a popular utility function in BOED (Ryan et al., 2016). Intuitively, MI quantifies the amount of information our experiment is expected to provide about the variable of interest. Furthermore, the MI utility function can easily be adapted to various scientific goals by changing the variable of interest (Kleinegesse and Gutmann, 2021). Optimal designs are then found by maximizing this utility function, that is ${\mathbf{d}}^{\ast}=\mathrm{a}\mathrm{r}\mathrm{g}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{a}\mathrm{x}U(\mathbf{d})$. Although a useful quantity, estimating $\displaystyle U(\mathbf{d})$ is typically extremely difficult, especially for simulator models that have intractable likelihoods $p(\mathbf{y}\mathit{v},\mathbf{d})$ (e.g. Ryan et al., 2016). Nonetheless, in the later steps of our workflow, we shall explain how we can approximate this utility function using MLbased approaches.
Step 2: Cast your theory as a computational model
After defining our scientific goal, we need to ensure that our scientific theories are, or can be, formalized as computational models. This either requires us to formalize them from the ground up, or leverage existing computational models from literature that support our theories. In the former case, this entails making all aspects of the theory explicit, which is therefore a highly useful and established practice; we refer the reader to Wilson and Collins, 2019 for a more general guidance on building computational models. For the presented MLbased BOED approach, we require that our theories are cast as computational models that allow us to simulate data, that is simulator models. Further we need to decide on a general experimental paradigm that needs to be able to interact with our computational models.
Building a computational model
Formally, a computational model defines a generative model $\mathbf{y}\sim p(\mathbf{y}\mathit{\theta},\mathbf{d})$ that allows sampling of synthetic data $\displaystyle \mathbf{y}$, given values of its model parameters $\mathit{\theta}$ and the experimental design $\displaystyle \mathbf{d}$. We assume that sampling from this model is feasible, but make no assumptions about whether evaluating the likelihood function $p(\mathbf{y}\mathit{\theta},\mathbf{d})$ is tractable or intractable. Even if the likelihood is tractable and simple, important quantities such as the marginal likelihood may nevertheless be intractable, since evaluating the marginal likelihood requires integrating over all free parameters. This problem is naturally exacerbated for more complex models that may have intractable likelihood functions.
See Box 2 for a set of simulator models from our case study. These models describe human behavior in a multiarmed bandit setting, where the observed data $\mathit{y}$ corresponds to the sequences of chosen bandit arms along with their observed reward.
What are our prior beliefs?
In addition to formalizing our theories as simulator models, we need to complete this step by defining prior distributions over the (free) model parameters, as would be required for any analysis. We point the reader to Wilson and Collins, 2019; Schad et al., 2021; Mikkola et al., 2021 for guidance on specifying prior distributions, as this is a broad research topic and a detailed treatment is beyond the scope of this work. Generally, prior distributions should reflect what is known about the model parameters prior to running the experiment, based on either empirical data or general domain knowledge. This is crucial for the design of experiments, as we may otherwise spend experimental resources and effort on learning what is already known about our models.
Step 3: Decide which aspect of your experiment to optimize
Having formalized our scientific theories as simulator models and decided on our scientific goals, we need to think about which parts of the experiment we want to optimize. In principle, any controllable aspect of the experimental apparatus could be tuned as part of the experimental design optimization. Concretely, in order to solve an experimental design problem by means of BOED, our experimental designs $\mathbf{d}$ need to be formalized as part of the simulator model $p(\mathbf{y}\mathit{\theta},\mathbf{d})$. Instead of optimizing all controllable aspects of an experiment, it may then help to leverage domain knowledge and prior work to choose a sufficient experimental design set, in order to reduce the dimensionality of the design problem.
In our case study, for instance, we have chosen the reward probabilities of different bandit arms to be the experimental designs to optimize over. This choice posits that certain reward environments yield more informative data than others, in the context of model discrimination and parameter estimation. Our task is then to find the optimal reward environment that yields maximally informative data.
There may be other experimental choices that we will have to make, in particular to limit variability in the observed data, to control for confounding effects, and to reduce nonstationarity, among many other reasons. In our multiarmed bandit setting, for example, we may choose to limit the number of trials in the experiment to avoid participant fatigue, which is usually not accounted for in computational models of behavioral phenomena. Additionally, we may limit the number of available choices in a discrete choice task or impose constraints on any other aspects suggested by domain knowledge, or where we want to ensure comparability with prior research. For example, we may have reason to believe that certain designs are more ecologically valid than others—we discuss such considerations in the Caveats Section.
Static and adaptive designs
We consider the common case of static BOED, with the possibility of having multiple blocks of trials. That is, we find a set of optimal designs prior to actually performing the experiment. In some cases, however, a scientist may wish to optimize the designs sequentially as the participant progresses through the experiment. This is called sequential, or adaptive, BOED (see Ryan et al., 2016; Rainforth et al., 2023) and has been explored previously for tractable models of human behavior (Myung et al., 2013), or in the domain of adaptive testing (e.g. Weiss and Kingsbury, 1984). More recently, sequential BOED has also been extended to deal with implicit models (Kleinegesse et al., 2020; Ivanova et al., 2021). At this point, sequential BOED is an active area of research and may be a consideration for researchers in the future. Generally, sequential designs can be more efficient, as they allow to adapt the experiment ‘onthefly’. However, sequential designs also make building the actual experiment more complicated, because the experiment needs to be updated in almost realtime to provide a seamless experience. Further, in some settings, participants may notice that the experiment adapts to their actions, which may inadvertently alter their behavior. As we discuss below in the Caveats Section, the effects of model misspecification may be especially severe with sequential designs, contributing to our focus on static designs. For future applications, which approach to follow is likely going to be a choice that uniquely depends on the experiment and research questions at hand, while the overall workflow outlined in this tutorial will be relevant to both (Box 3).
Experimental settings in the case study.
Behavioral experiments are often set up with several experimental blocks. We here make the common assumption of exchangeability with respect to the experimental blocks in our analyses and randomize block order in all experiments involving human participants. In our experiments, the bandit task in each experimental block has 3 choice options (bandit arms) and 30 trials (sequential choices). For each block of trials, the data $\displaystyle \mathbf{y}$ consists of 30 actions and 30 rewards, leading to a dimensionality of 60 per block. For a given block, the experimental design $\displaystyle \mathbf{d}$ we wish to optimize are the reward probabilities of the multiarmed bandit, which, in our setting, are threedimensional (corresponding to independent scalar Bernoulli reward probabilities for each bandit arm). The generative process for models of human behavior in bandit tasks is illustrated in Box 2. We demonstrate the optimization of reward probabilities for multiarmed bandit tasks, with the scientific goals of model discrimination and parameter estimation. We consider five blocks of 30 trials per participant, yielding a fairly short but still informative experiment. We divide this into two blocks for model discrimination followed by three blocks for parameter estimation, where the parameter estimation blocks are conditional on the winning model for the participant in the first two blocks. In other words, for the parameter estimation task, a different experimental design (which corresponds to a particular setting of the reward probabilities associated with each bandit arm) is presented to a participant depending on the simulator model that best described them in the first two blocks. Since we consider threearmed bandits, the design space is constituted by the possible combinations of reward probabilities. This is sixdimensional for model discrimination and ninedimensional for PE. Similarly, the data for a single block is 60dimensional, so the dimensionality of the data is 120 for model discrimination and 180 for PE. Our method applies to any prior distribution over model parameters or the model indicator, but for demonstration purposes, we assume uninformative prior distributions on all model parameters (see Appendix 1 for details). As our baseline designs, we sample all reward probabilities from a $\text{Beta}(2,2)$ distribution, following a large and wellknown experiment on bandit problems with 451 participants (Steyvers et al., 2009). More information about the experimental settings and detailed descriptions of the algorithms can be found in the Appendix.
Step 4: Use machine learning to design the experiments
As previously explained, MI shown in Equation 1 is an effective means of measuring the value (i.e. the utility) of an experimental design setting $\displaystyle \mathbf{d}$ in BOED (Paninski, 2005; Ryan et al., 2016). Unfortunately, it is generally an expensive, or intractable, quantity to compute, especially for simulator models. The reason for this is twofold: Firstly, the MI is defined via a highdimensional integral. Secondly, the MI requires density evaluations of either the posterior distribution or the marginal likelihood, both of which are expensive to compute for general statistical models, and intractable for simulator models. Recent advances in BOED for simulator models thus advocate the use of cheaper MI lower bounds instead (e.g. Foster et al., 2019; Kleinegesse et al., 2020, Kleinegesse and Gutmann, 2021; Ivanova et al., 2021).
Many recent innovations in BOED rely on ML to accurately and efficiently estimate functions. ML is currently revolutionizing large parts of the natural sciences, with applications ranging from understanding the spread of viruses (Currie et al., 2020) to discovering new molecules (Jumper et al., 2021), forecasting climate change (Runge et al., 2019), and developing new theories of human decisionmaking (Peterson et al., 2021). Due to its versatility, ML can be integrated effectively into a wide range of scientific workflows, facilitating theory building, data modeling and analysis. In particular, ML allows us to automatically discover patterns in highdimensional data sets. Many recent stateoftheart methods in natural science, therefore, deeply integrate ML into their data analysis pipelines (e.g. Blei and Smyth, 2017). However, there has been little focus on applying ML to improve the data collection process, which ultimately determines the quality of data, and thus the efficacy of downstream analyses and inferences. In this work, we use ML methods as part of our optimization scheme, where they serve as flexible function approximators.
The step of training a good ML model currently requires some previous expertise with (or willingness to learn about) applying ML methods. Fortunately, this step is less experimentspecific than it may first appear. Many different experiments generate similar data, and the architecture we provide can deal with multiple blocks and automatically learns summary statistics, as we explain above. Many excellent resources on the basics of ML exist, and a detailed treatment is beyond the scope of this tutorial. However, we provide pointers to several aspects in the code repository: https://github.com/simonvalentin/boedtutorial (copy archived at Valentin, 2023).
Approximating mutual information with machine learning
In this work, we leverage the MINEBED method (Kleinegesse et al., 2020), which allows us to efficiently estimate the MI using ML. Specifically, for a particular design $\displaystyle \mathbf{d}$, we first simulate data from the computational models under consideration, (shown in Box Box 2). The approach works by constructing a lower bound on the MI that is parameterized via an ML model. In our case study, we use a neural network with an architecture customized for behavioral experiments (shown in Box 4), but any other ML model would work in place. We then train the ML model by using the simulated data as input and the lower bound as an objective function, which gradually tightens the lower bound.
Bespoke neural network architecture for behavioral experiments.
More concretely, this method works by training a neural network ${T}_{\mathit{\psi}}(\mathit{v},\mathbf{y})$, or any other ML model, where $\mathit{\psi}$ are the neural network parameters and the data $\displaystyle \mathbf{y}$ is simulated at design $\displaystyle \mathbf{d}$ with samples from the prior $p(\mathit{v})$. We train this neural network by maximizing the following objective function, which is a lower bound of the MI,
The above lower bound is also known as the NWJ lower bound and has several appealing biasvariance properties (Poole et al., 2019; we note that alternative bounds on MI can be used in the place of the NWJ bound as our loss function, as discussed by Kleinegesse and Gutmann, 2021). The expectations in Equation 2 are usually approximated using (MonteCarlo) sampleaverages.
Once we have trained the neural network ${T}_{\mathit{\psi}}(\mathit{v},\mathbf{y})$ on $\mathit{\psi}$, we estimate the MI at $\displaystyle \mathbf{d}$ by computing Equation 2 using a heldout test set of data simulated from the computational model(s).
Dealing with highdimensional observations
We can efficiently deal with highdimensional data and do not have to construct (approximately sufficient) summary statistics based on domain expertise, as is commonly done (Myung and Pitt, 2009; Zhang and Lee, 2010; Ouyang et al., 2018), and which can be prohibitively difficult for complex simulator models (Chen et al., 2021). In fact, the previously described method can allow us to automatically learn summary statistics as a byproduct, as explained below.
Critical to the performance of our method in our case study, as with any application of neural networks, is the choice of architecture (see Elsken et al., 2019 and Ren et al., 2021 for recent reviews of the challenges and solutions in neural architecture search). In order to develop an effective neural network architecture, we leverage the method of neural approximate sufficient statistics (Chen et al., 2021) and propose them as a novel and practical alternative to handcrafted summary statistics in the context of BOED. Behavioral data are often collected in several blocks that can have different designs but should be analyzed jointly, which can be challenging due to the high dimensionality of the joint data, which typically comprises numerous choices, judgments or other measurements (including highresolution and/or highfrequency neuroimaging data). In order to deal with a multiblock data structure effectively, we propose an architecture devised specifically for application to behavioral experiments. Our architecture, visualized in Box 4—figure 1, incorporates a subnetwork for each block of an experiment. The outputs of these subnetworks are then concatenated and passed as an input to a larger neural network. In doing so, besides being able to reduce the dimensionality of the input data, each subnetwork conveniently learns to approximate sufficient summary statistics of the data from each block in the experiment (Chen et al., 2021).
In this work, we develop a bespoke feedforward neural network architecture, which allows us to efficiently deal with multiple blocks of data. The code we are providing is straightforwardly adaptable to other problems ‘out of the box’, as the architecture can scale to multiple blocks of data and realistic complexity (dimensionality) of the observed data. However, the overall approach is not restricted to this particular ML method; in fact, any sufficiently flexible ML model may be used, as long as the loss function can be specified. In some settings, researchers may be interested in using ML models that only require little tuning, such as treebased ensemble methods like Random Forests (Breiman, 2001) or gradient boosted trees (Friedman, 2001). On the other hand, when dealing with highdimensional but structured data, such as in eyetracking or neuroimaging (or combinations of different types of data, see Turner et al., 2017), one may use other appropriate architectures, such as convolutional (LeCun et al., 2015), recurrent (Rumelhart et al., 1986), transformers, (Vaswani et al., 2017), or graph neural networks (Zhou et al., 2020). For further pointers to practical recommendations, see the code repository: https://github.com/simonvalentin/boedtutorial, (copy archived at Valentin, 2023).
Search over the space of possible experimental designs
Once we have obtained an estimate of $\displaystyle U(\mathbf{d})$ at a candidate design $\displaystyle \mathbf{d}$, we need to update the design appropriately when searching the design space to solve the experimental design problem $\mathbf{d}\ast =\mathrm{a}\mathrm{r}\mathrm{g}\phantom{\rule{thinmathspace}{0ex}}{\mathrm{m}\mathrm{a}\mathrm{x}}_{\mathbf{d}}U(\mathbf{d})$. Unfortunately, in the setting of discrete observed data $\displaystyle \mathbf{y}$ (such as choices between a set of options), gradients with respect to designs are generally unavailable, requiring the use of gradientfree optimization techniques. We here optimize the design $\displaystyle \mathbf{d}$ by means of Bayesian optimization (BO; see Shahriari et al., 2015, for a review), which has been used effectively in the experimental design context before (e.g. MartinezCantin, 2014; Kleinegesse et al., 2020). We specifically use a Gaussian Process (GP) as our probabilistic surrogate model with a Matérn5/2 kernel and Expected Improvement as the acquisition function (these are standard choices). A formal summary of our BOED approach is shown in Appendix 4.
For higher dimensional continuous design spaces, this search may become difficult and necessitate the use of more scalable BO variants (e.g. Overstall and Woods, 2017; Oh et al., 2018; Eduardo and Gutmann, 2023). For discrete design spaces or combinations of discrete and continuous design spaces, one can straightforwardly parallelize the optimization for each level of the discrete variable. If the observed data are continuous, such as participants’ response times, and one is able to compute gradients with respect to the designs (i.e. the simulator is differentiable), the design space may also be explored with gradientbased optimization (Kleinegesse and Gutmann, 2021). In some settings, we may have some nontrivial constraints on permissible experimental designs, and these can be enforced as part of the search, making sure that impermissible designs are rejected and/or not explored. Thus, if we want our designs to have certain properties, we need to encode these constraints. For example, in our bandit tasks, we could enforce a certain minimum degree of randomness associated with each bandit arm’s reward probability (although we do not do so in the present work), such that each reward probability is within in a specified range.
Step 5: Validate the optimal experimental design in silico
Once we have converged in our search over experimental designs, we can simulate the behavior of our theories under our optimal experimental design(s). Irrespective of the final experiment that will be run, model simulations can already provide useful information for theory building, for example by noticing that the simulator models generate implausible data for certain experimental designs. This means that we can use simulations to assess whether we can recover our models, or individual model parameters, given the experimental design, as we did in our simulation study. We highlight this again as a critical step, since issues at this stage warrant revising the models under consideration and, importantly, any corresponding realworld experiment would be expected to yield uninformative data and waste resources (Wilson and Collins, 2019).
As we discuss in the Caveats Section, BOED may sometimes lead to unusual yet effective experimental designs, and we suggest that surprising optimal designs may deserve careful consideration, as they may reveal subtle differences in predicted behaviors or model misspecifications. One appeal of optimal designs is that some of the ways of obtaining information about which models is correct correspond to finding situations in which one of them makes extreme predictions. BOED can thereby be seen as facilitating good prior predictive checking. We have illustrated this step using our case study in Box 5 and Box 6, where we present simulation study results for model discrimination and parameter estimation, respectively.
Model discrimination.
Simulation study Our method reveals that the optimal reward probabilities for the model discrimination task are, approximately, $[0,0,0.6]$ for one block of trials and $[1,1,0]$ for the other block. These optimal designs stand in stark contrast to the usual reward probabilities used in such behavioral experiments, which characteristically use less extreme values (Steyvers et al., 2009). Contrary to our initial intuitions, and, presumably those of previous experiment designers, the extreme probabilities in these optimal designs are effective at distinguishing between particular theoretical commitments of our different models. For instance, the AEG model (see Box 2) can break ties between options with equivalent (observed) reward rates via the ‘stickiness’ parameter. To illustrate this effect, consider the bandit with $[1,1,0]$ reward probabilities, as found above. Here, switching between the two options that always produce a reward is different from switching to the option that never produces a reward. Specifically, switching between the winning options can be seen as ‘antistickiness’, motivated for example, out of boredom or epistemic curiosity about subtle differences in the true reward rates of the winning options. This type of strategy is less apparent when the bandit arms are stochastic. As such, these mechanisms can be investigated most effectively by not introducing additional variability due to stochastic rewards and instead isolating their effects. Hence, we can interpret the, perhaps counterintuitive, optimal experimental designs as effective choices for isolating distinctive mechanisms of our behavioral models or their parameters under the assumed priors.
Using the neural network that was trained at that optimal design, we can cheaply compute approximate posterior distributions of the model indicator (as described in Materials and methods). This yields the confusion matrices in Box 5—figure 1, which show that our optimal designs result in considerably better model recovery than the baseline designs.
Parameter estimation: Simulation study.
We next discuss the parameter estimation results, focusing on the WSLTS model; the results for the AEG and GLS model can be found in Appendix 5. We find that the optimal reward probabilities for the WSLTS model are [0, 1, 0], [0, 1, 1] and [1, 0, 1] for the three blocks. Similar to the model discrimination task, these optimal designs take extreme values, unlike commonly used reward probabilities in the literature. In Box 6—figure 1 we show posterior distributions of the WSLTS model parameters for optimal and baseline designs. We find that our optimal designs yield data that result in considerably improved parameter recovery, as compared to baseline designs.
Step 6: Run the real experiment
Once we are satisfied with the found optimal experimental design, we can run the actual experiment. Here, the ML models we used to approximate MI allow us to cheaply perform posterior inference. After running the realworld experiment, we can use our trained ML models to cheaply compute approximate posterior distributions over models and their parameters. This procedure provides a valuable addition to a Bayesian modeling workflow Schad et al., 2021; Wilson and Collins, 2019, as computing posterior distributions is a computationally expensive step, which is especially difficult for models with intractable likelihoods. We explain theoretical details below, and showcase an analysis of a real experiment with our case study in Box 7, Box 8, Box 9.
Human participant study: Methodology.
To validate our method on empirical data, we collected a sample of $N=326$ participants from Prolific (https://www.prolific.com/; all experiments were certified according to the University of Edinburgh Informatics Research Ethics Process, RT number 2019/58792). Participants were randomly allocated to our optimal designs or to one of ten baseline designs (the same as in the simulation study); we refer to the former as the optimal design group and to the latter as the baseline group. The first two blocks correspond to the model discrimination task, that is they were used to identify the model that matches a participant’s behavior best. This was done by computing the posterior distribution over the model indicator $m$ using the trained neural networks (see Appendix 1) and then selecting the model with the highest posterior probability, that is the maximum a posteriori estimate. This inference process was done online without a noticeable interruption to the experiment. The following three blocks were then used for the parameter estimation task. Participants in the optimal design group were allocated the optimal design according to their inferred model (as provided by the model discrimination task). Participants in the baseline group were again allocated to baseline designs. This data collection process facilitates gathering realworld data for both the model discrimination and parameter estimation tasks, allowing us to effectively compare our optimal designs to baseline designs. See Appendix 2 for more details about the human participant study setup.
Here, we follow the rationale that more useful experimental designs are those that lead to more information (or lower posterior uncertainty) about the variable of interest. The belief about our variable of interest is provided by its posterior distribution, which we can estimate using the neural network output. We assess the quality of these posterior distributions by estimating their entropy, which quantifies the amount of information they encode (Shannon, 1948). This is an effective and prominent metric to evaluate distributions, as it directly relates to the uncertainty about the variable of interest. For the model discrimination task, we specifically consider the Shannon entropy as a metric, since the model indicator $m$ is a discrete random variable. As the continuous analogue of Shannon entropy, we use differential entropy for measuring the entropy of posterior distributions for the parameter estimation task, since the model parameters are continuous random variables. This allows us to quantitatively evaluate our optimal designs and compare their efficacy to that of baseline designs.
Human participant study: Results.
We here briefly present the results of our human participant study. For the model discrimination tasks, we find that our optimal designs yield posterior distributions that tend to have lower Shannon entropy than the baseline designs, as shown in Box 8—figure 1. Our MLbased approach thus allows us to obtain more informative beliefs about which model matches real individual human behavior best, as compared to baselines.
Having determined the best model to explain a given participant’s behavior, we turn to the parameter estimation task, focusing our efforts on estimating the model parameters that explain their specific strategy. Similar to the model discrimination task, we here assess the quality of joint posterior distributions using entropy, where smaller values directly correspond to smaller uncertainty. Box 8—figure 2 shows the distribution of differential posterior entropies across participants that were assigned to the AEG simulator model; we show corresponding plots for the WSLTS and GLS models in Appendix 5. Similar to the model discrimination task, we find that our optimal designs provide more informative data, that is they result in joint posterior distributions that tend to have lower differential entropy, as compared to the ones resulting from baseline designs, as shown in the figure.
Moreover, the proportions of participants allocated to the WSLTS, AEG, and GLS models for the optimal design were 62(37%), 75(45%) and 29(18%), respectively. For the baseline designs, the proportions were 57(36%), 22(14%) and 81(51%), for the WSLTS, AEG, and GLS models, respectively. The finding that the largest proportion of participants for the optimal designs were best described by the AEG model differs from prior work, where most participants were best described by WinStay LoseShift (a special case of our WSLTS model) Steyvers et al., 2009. This suggests that a reinterpretation of previous results is required, where the human tendency to stick with previous choices is less about myopically repeating the last successful choice, and more about balancing reward maximization, exploration, and a drive to be consistent over recent runs of choices. Interestingly, we find different distributions over models for the optimal as compared to the baseline designs (A chisquare test revealed a significant difference between the distributions of WSLTS, AEG, and GLS in the Optimal and Baseline groups, ${\chi}^{2}(2)=53.66$, p < 0.0001.). In particular, while, descriptively, the proportions of participants allocated to the WSLTS model are very close (37%, and 36% for the optimal and baseline designs, respectively), the proportions allocated to the AEG and GLS model differ more substantially. This may suggest that people’s strategy depends on the experimental design, potentially with some participants favoring strategies similar to the AEG model with more reliable reward environments, and the GLS in more unreliable (baseline) environments. The models we have considered in our case study do not account for the possibility of transferring knowledge across experimental blocks. However, learning about the reward probabilities in one (or several) experimental blocks may influence people’s expectations about the reward distribution of future block(s). For instance, having inferred (correctly or not), that at least one arm bandit arm always pays a reward, a learner would do well to quickly identify this arm in a future block (see, e.g. Wang et al., 2018, for a related line of research). Our space of computational models does not include such strategy switching mechanisms, and we discuss the impact of such potential misspecifications below in more detail. While exploring these findings in detail from a cognitive science perspective is beyond the scope of the present work, we view this as a fruitful avenue for future work.
The extreme designs here may be surprising, considering the typically used stochastic rewards in bandit problems. However, these designs allow the experiment to focus on particular aspects of people’s behavior, which turns out to be most informative for distinguishing between the computational models we consider. For instance, while our AEG model performs uniformly random exploration, the WSLTS model performs uncertaintyguided exploration. Meanwhile, our optimal designs raise the question of ecological validity, as it is unclear which kinds of ‘reward environments’ are more reflective of realworld situations. On the one hand, research in causal reasoning suggests that people expect relationships to be deterministic and reliable (e.g. Schulz and Sommerville, 2006). On the other hand, some environments are known to be random and more akin to a classic, stochastic bandit problem, such as the stock market. We view the question of the ecological validity of stochastic rewards as an important direction for future work, and discuss this more below in the Caveats Section.
Posterior estimation
Once the neural network is trained, we can conveniently compute an approximate posterior distribution with a single forwardpass (see Kleinegesse et al., 2020, for a derivation and further explanations). The NWJ lower bound is tight when ${T}_{{\mathit{\psi}}^{\ast}}(\mathit{v},\mathbf{y})=1+\mathrm{log}p(\mathit{v}\mathbf{y},{\mathbf{d}}^{\ast})/p(\mathit{v})$. By rearranging this, we can thus use our trained neural network ${T}_{{\mathit{\psi}}^{\ast}}(\mathit{v},\mathbf{y})$ to compute a (normalized) estimate of the posterior distribution, This can be done with the following equation,
Neural networks are notorious for large variations in performance, mainly due to the random initialization of network parameters. The quality of the posterior distribution estimated using Equation 3 is therefore quite sensitive to the initialization of the neural network parameters. To overcome this limitation and to obtain a more robust estimate of the posterior distribution, we suggest to use ensemble learning to compute Equation 3. This is done by training several neural networks (50 in our case study) and then averaging estimates of the posteriors obtained from each trained neural network.
Summary statistics
Additionally, as another component of our approach, we also obtain automaticallylearned (approximate) sufficient summary statistics for data observed with the optimal experimental designs. Summary statistics are often crafted manually by domain experts, which tends to be a timeconsuming and difficult task. In fact, handcrafted summary statistics are becoming less effective and informative as our models naturally become more complex as science advances (Chen et al., 2021). Our learned summary statistics could be used for various downstream tasks, for example as an input for approximate inference techniques (such as ABC) or as means to interpret and analyze the data more effectively, for example as a lower dimensional representation used to visualize the data.
Exploring model parameter disentanglement.
In addition to measuring the efficacy of our optimal experiments via the differential entropy of posterior distributions, we can ask how well they isolate, or disentangle, the effects of individual model parameters. Specifically, we measure this by means of the (Pearson) correlation between model parameters in the joint posterior distribution, which we present in Box 9—figure 1 for all simulator models and for both optimal designs and baseline designs. Here, a correlation of $r=0$ means that there is no (linear) relationship between the respective pair of inferred model parameters across individuals, thereby providing evidence that the data under this design allow us to separate the natural mechanisms corresponding to those parameters. On the other hand, correlations closer to $r=1$ or $r=1$ would imply that the model parameters essentially encode the same mechanism. As shown descriptively in Box 9—figure 1, our optimal designs are indeed able to isolate the mechanisms of individual model parameters more effectively than baseline designs. Note that these are not confusion matrices, as displayed in Box 5—figure 1, but rather average correlation matrices of the model parameter posterior distributions.
To assess this observation quantitatively, we first $z$transform (Fisher, 1915) the correlation coefficients (i.e. all entries below the main diagonal in the correlation matrices) and compute the average absolute $z$score over the model parameters within each participant. The resulting average of absolute $z$scores is an expression for the average magnitude of linear dependency between the posterior model parameters for each participant. Next, we assess whether the average absolute $z$scores are significantly different between those participants allocated to our optimal designs and those allocated to the baseline designs. We compute this statistical significance by conducting twosided Welch’s $t$tests (Welch, 1947). We find that our optimal designs significantly outperform baseline designs for all three models, that is for the WSLTS model we find a pvalue of p < 0.001 ($t(60.12)=13.89$), for the AEG model we have p = 0.004 ($t(27.30)=3.12$) and for the GLS model we have p < 0.001 ($t(91.01)=8.44$). This provides additional evidence that our optimal designs are better able to disentangle the effects of individual model parameters than baseline designs. As explained earlier, we argue that this isolation of effects is facilitated by the extreme values of our optimal designs.
Model checking
As also emphasized by previous works on computational modeling, a modelbased data analysis (such as via the amortized posterior distributions discussed above) should go handinhand with model checking, also referred to as validation (e.g. Wilson and Collins, 2019; Palminteri et al., 2017). A detailed treatment of methods and techniques for model checking is beyond the scope of this paper. However, it should be noted that model checking is just as important with data collected through optimized experimental designs as through handcrafted designs.
Caveats
We now turn to more highlevel considerations and discuss several categories of potential issues and caveats related to BOED. We first discuss how the optimality of an experimental design is always defined with respect to a specific goal. Second, we consider how posterior inferences always depend on the experimental design and how this impacts BOED considerations. Third, we bring attention to the important issue of model misspecification in the context of BOED. Fourth, we discuss the topic of ecological validity and generalizability of our inferences with BOED. Finally, we cover the interpretability of our optimal designs. While we discuss these issues in light of BOED, they are just as important with handcrafted experiments. Importantly, however, BOED makes these issues more salient, by turning the design of experiments into a more explicit process. Therefore, these points should be relevant to any researcher collecting or analyzing experimental data using computational models.
No single design to rule them all
The process of making all of our assumptions explicit and leaving the process of proposing concrete designs to the optimization procedure can provide us with assurance that our design is optimal with respect to our assumptions. Therefore, this process may also reduce bias in the design of experiments, as the optimization process can easily be replicated. We believe that these properties will help tackle the notorious replication crisis (Ioannidis, 2005; Pashler and Wagenmakers, 2012; Baker, 2016) and potentially setting new standards.
However, it is important to point out that the designs found using our optimization procedure are not assured to be optimal for all scenarios. That is, there is no ‘free lunch’ (Wolpert, 1996): No design suits all situations, as the usefulness of experimental designs depends on the theories under consideration, prior beliefs, and utility function. For instance, if we already have strong prior beliefs about a particular model parameter (such as people being strongly inclined to explore instead of exploit), there may be little to learn about this parameter and the BOED procedure may instead focus on parameters we have weaker prior beliefs about, which may yield different optimal designs. More generally, different scientific goals can lead to different optimal designs.
As an illustrative example, we consider the design of a survey item for political orientation. For simplicity, we assume a onedimensional spectrum with extremes to either side of the scale being equally likely in the population, with most people falling in the middle of the scale. In one scenario the goal may be to estimate a given person’s political belief along the spectrum. Here, an informative question is likely going to be moderate, since extreme items on either side of the spectrum ignore the other side, have a low prior probability and are therefore less useful. However, if the goal is to distinguish ‘extremist’ participants on one side of the spectrum from the rest of the population, or if the scientific question implies that a participant can only belong to either side of the onedimensional political spectrum then polarizing items are likely more informative than moderate items that do not reveal such extremist beliefs. This emphasizes the importance of clearly defining the scientific goal, making sure that the computational models reflect the theories under consideration, as described in Steps 1 and 2.
Inferences are implicitly conditioned on the experimental design
An important fact that is easy to forget is that all parameter inferences, be they Bayesian or frequentist, are inherently conditioned on the experimental designs used to generate the observed data. That is, the inferred belief about some parameter $\theta$ is determined by $\displaystyle p(\theta \mathbf{y},\mathbf{d})$ and not $\displaystyle p(\theta \mathbf{y})$, where $\displaystyle \mathbf{d}$ is the design with which the data $\displaystyle \mathbf{y}$ was observed, even though this is usually not reflected in the notation explicitly, but rather kept implicit.
As we have seen in our case study, different experimental designs may lead to different posterior inferences. For example, for the task of model discrimination, we found different proportions of participants best explained by our optimal designs and our baseline designs. The question of when and why different experimental designs might support different models or theories is an important one that should concern anyone doing empirical research, but contrasting qualitatively different designs, as we do, makes this especially salient. We would expect the simplest and most compelling evidence for some design $\mathbf{d}}_{a$ being ‘better’ than another ($\displaystyle {\mathbf{d}}_{b}$) to be posterior distributions that are in qualitative agreement having conducted experiments with both designs, but with lower posterior variances given design $\displaystyle {\mathbf{d}}_{a}$. This raises the question of what any differences in inferences may imply. In extreme cases, one might observe two different experiments leading to dramatically different conclusions—whatever the provenance of the two experimental designs. These differences may go so far as two experimental designs, $\displaystyle {\mathbf{d}}_{a}$ and $\displaystyle {\mathbf{d}}_{b}$ supporting different theories, such that $\displaystyle {\mathbf{d}}_{a}$ supports theory $A$, but falsifies theory $B$ and $\displaystyle {\mathbf{d}}_{b}$ supports theory $B$, but falsifies theory $A$. Such cases might plausibly be the result of serious model misspecification, where no model under consideration gives a good account of human behavior across different designs. For example, in the present case study, if every model under consideration supposes that people use fixed strategies, but in reality people adapt to the structure or statistics of the specific task at hand, then different designs would support different models. It seems plausible that some participants in bandit tasks might perform such strategy selection (e.g. Lieder and Griffiths, 2017): Having engaged with a block of deterministic bandit arms, some participants may approach the second block differently than if they had just engaged with a more stochastic design, or may even switch strategies within the same block. The problem of model misspecification is thus brought to light very clearly by BOED, as can be seen in the case study.
More generally, in cognitive science, we typically aspire to estimate parameters $\mathit{\theta}$ that describe some stable (with respect to time, but also context) cognitive attribute or process. If two designs do not support the same estimated parameters, then $\mathit{\theta}$ likely does not describe the stable trait that we would like it to describe. Sometimes, this may suggest adopting a higher level model that decides when $\mathit{\theta}$ should have an effect (such as a strategy switching model, as discussed above), such that our new $\mathit{\theta}}^{\mathrm{\prime}$ describes a policy. Alternatively, we can explicitly limit the context (or domain) in which $\mathit{\theta}$ should operate. For example, in the context of our case study, instead of generalizing our theory by proposing strategyswitching mechanisms, we may say that a theory on behavior in bandit problems should only cover stochastic reward environments, where no bandit arm has deterministic (0 or 1) rewards. These considerations depend very much on the concrete phenomena being studied and the theories used to describe them.
Model misspecification and informativeness
Despite best efforts, computational models will suffer from some degree of model misspecification, failing to capture some aspects of people’s behavior. While the misspecification may only be subtle in some cases, it can also be more severe, which poses a very real and important problem for computational modeling. BOED takes the perspective that the scientific question and models should come first and thus puts the model more into central focus than traditional approaches, thereby providing a more formal way to analyze the impact of modeling choices on the utility of different designs and expected results. For instance, computational models may sometimes only provide reasonable predictions in limited regions of the space of possible experimental designs. We can inspect this in Step 5 of our proposed workflow and surprising predictions may motivate iterating on our theory/model building, or if we are committed to the model, can be exploited by BOED to perform powerful tests of competing theories.
While the issue of model misspecification is not specific to BOED, optimizing experimental designs based on misspecified models can have negative consequences. In particular, if the space of models is too narrow, our data may be less useful for later analyses with wellspecified models. Two simple ways of lessening the impact of model misspecification are to enforce diversity constraints and to select qualitatively different highutility designs to cover important behavioral phenomena (Palminteri et al., 2017). As discussed, we can include constraints in our search over experimental designs, such that our designs are not too narrowly focused on a single region of the parameter space. We could, for example, enforce this on a blocklevel, to ensure that every participant is presented with a diverse set of experimental designs. Alternatively, we can perform a more manual selection and choose designs that have high utility, but are likely to elicit qualitatively different phenomena. We present exemplary ways of engaging with the estimated utility function in Appendix 6.
Ecological validity and generalizability
We have seen that BOED may sometimes lead to extreme or atypical designs, as the present results illustrate, and in some cases these designs may be deemed undesirable. Similarly, we may sometimes want to use not only one experimental design for all units of observation, but for example, slightly vary the design across participants. As discussed by Yarkoni, 2020, there is a common policy in experimental research of ignoring stimulus sampling variability, also referred to as the ‘fixedeffect fallacy’ Clark, 1973. This policy can lead to problems with generalizing inferences to broader classes of stimuli, especially when the phenomenon under question is likely sensitive to contextual effects (Van Bavel et al., 2016). Fortunately, the BOED workflow is inherently designed to handle such cases.
First, the optimization procedure over designs can include constraints (e.g. at most one deterministic bandit arm), which may be appropriate depending on the research question and can be viewed as inserting expert knowledge. Second, the approximation of the learned utility surface, allows for exploring the utility of different designs systematically. Rather than simply picking the design with maximal utility, practitioners can select from this surface to capture qualitatively different phenomena, or simply include slight variations of similar designs. For instance, researchers can pick the top$k$ designs, or adopt a more sophisticated policy to ensure that the designs being used cover important parts of the design space to ensure robustness and generalizability of results, as argued for, for example, by Yarkoni, 2020. If the goal is to provide more varied designs, instead of the present approach of optimizing the reward probabilities directly, it would be straightforward to parametrize the design space differently and optimize a design policy. For instance, in the context of our case study, we could optimize the parameters ($\alpha$ and $\beta$) of a $\text{Beta}(\alpha ,\beta )$ distribution from which the reward probabilities for the bandit arms of a given block are drawn. This is clearly only one option, and different ways of parametrizing design policies are possible, as an alternative or complement to choosing designs from the estimated utility surface. The BOED procedure is agnostic to such choices, and decisions should be based on the concrete problem being studied. Researchers may here consider the tradeoff between generalizabiltiy and variance control, which can be studied by simulating data from the model and performing parameter recovery analyses. We illustrate how the utility surface can be explored in Appendix 6, with code that shows how this can be accomplished in the related GitHub repository: https://github.com/simonvalentin/boedtutorial, (copy archived at Valentin, 2023).
Interpretability
Simulating data from the computational models under consideration is of key importance in all stages of the experimental design process. In this work, we treat all models as simulators and, as argued, for example, by Palminteri et al., 2017, simulated data are crucial in assessing and potentially falsifying specific model commitments. Prior predictive analyses are crucial for refining models and their priors, including detecting potential problems that may already be obvious from simulations alone, without having to run a real experiment. Beyond comparing models’ posterior probability, posterior predictive analyses can provide insights into where models are ‘right’ and where they may depart from human behavior, referring back to the previous theme of model misspecification. As proposed by Palminteri et al., 2017, model comparison (through posterior probabilities) and falsification through simulation of behavioral effects thus serve complementary roles. Such simulations can serve as rigorous tests of theories, as a failure of a model to generate an effect of interest can be used as a criterion for rejecting the model (or a particular mechanism encoded in the model) in that form. These simulations should be considered a key part of the workflow presented in this work, beyond just inspecting posterior distributions (over models or their parameters).
Conclusion and outlook
This tutorial provides a flexible stepbystep workflow to finding optimal experimental designs, by leveraging recent advances in ML and BOED. As a first step, the proposed workflow suggests to define a scientific goal, such as model discrimination or parameter estimation, thereby expressing the value of a particular design as a measurable quantity using a utility function. Secondly, the scientific theories under investigation need to be cast as a computational model from which we can sample synthetic data. The methodologies discussed in this tutorial make no assumption, however, on whether or not the likelihood function of the computational model needs be tractable, which opens up the space of scientific theories that can be tested considerably. In the third step, the design optimization problem needs to be set up, which includes deciding which experimental design variables need to be optimized and whether there are any additional constraints that should be included. The fourth step of our proposed workflow then involves using ML to estimate and optimize the utility function. In doing so, we discussed a recent approach that leverages neural networks to learn a mapping between experimental designs and expected information gain, that is MI, and provided a novel extension that efficiently deals with behavioral data. In the fifth step, the discovered optimal designs are then validated in silico using synthetic data. This ensures that the optimal designs found are sound and that no experimental resources are wasted, possibly also if it is found that the computational models need to be revisited after this step. Lastly, in the sixth step, the real experiment can be performed using the discovered optimal design. Using the discussed methodologies, the trained neural networks can then conveniently be used to compute posterior distributions and summary statistics, immediately after the data collection.
This tutorial and the accompanying case study demonstrate the usefulness of modern ML and BOED methods for improving the way in which we design experiments and thus collect empirical data. Furthermore, the methodologies discussed in this tutorial optimize experimental designs for models of cognition without requiring computable likelihoods, or marginal likelihoods. This is critical to provide the methodological support required for scientific theories and models of human behavior that become increasingly realistic and complex. As part of our case study, using simulations and realworld data, we showed that the proposed workflow and the discussed methodologies yield optimal designs that outperform humancrafted designs found in the literature, offering more informative data for model discrimination as well as parameter estimation. We showed that adopting more powerful methodological tools allows us to study more realistic theories of human behavior, and our results provide empirical support for the expressiveness of the simulator models we studied in our case study. BOED is currently an active area of research and future work will likely make some steps easier and more computationally efficient, such as automatic tuning of the ML methods used to estimate MI, variations of the loss function, improved ways of searching over the design space, new ways of dealing with model misspecification or computationally efficient sequential BOED (Rainforth et al., 2023). What will remain despite such technical advancements, however, is the need to formulate sound scientific questions and actively engaging with tools that help to automate designs. We thus believe that the steps and considerations outlined in the workflow presented in this work will stay relevant. More broadly, ML has seen success in behavioral research regarding the analysis of large datasets and discovering new theories with promising results (e.g. Peterson et al., 2021) but its potential for data collection and in particular experimental design has been explored remarkably little. We view the present work as an important step in this direction.
Appendix 1
Experiments
Priors
We use a uniform categorical prior over the model indicator $m$, that is $p(m)=\mathcal{U}(\{1,2,3\})$. We generally use uninformative priors $\mathcal{U}(0,1)$ for all model parameters, except for the temperature parameter of the WSLTS model that has a $\text{LogNorm}(0,1)$ prior, as it acts as an exponent in reshaping the posterior. We generate 50,000 samples from the prior and then simulate corresponding synthetic data $\mathbf{y}\mathit{\theta},\mathbf{d}$ at every design $\displaystyle \mathbf{d}$.
Subnetworks
For all our experiments, we use subnetworks ${\mathbf{S}}_{\mathit{\psi}}(\mathit{v},\mathbf{y})$ that consist of two hidden layers with 64 and 32 hidden units, respectively, and ReLU activation functions. The number of sufficient statistics we wish to learn for each block of behavioral data is given by the number of dimensions in the output layer of the subnetworks. These are 6, 8, 6 and 8 units for the model discrimination, parameter estimation (WSLTS), parameter estimation (AEG) and parameter estimation (GLS) experiments, respectively. The flexibility of a subnetwork is naturally improved when increasing the number of desired summary statistics, but the computational cost increases accordingly. When the number of summary statistics is too low, the summary statistics we learn may not be sufficient. We have found the above numbers of summary statistics to be effective middlegrounds and refer to Chen et al., 2021 for more detailed guidance on how to select the number of summary statistics.
Main network
The main network ${T}_{\mathit{\psi}}(\mathit{v},\mathbf{y})$ consists of the concatenated outputs of the subnetworks for each block of behavioral data and the variable of interest. This is then followed by two fullyconnected layers with ReLU activation functions. For the model discrimination experiment, we use 32 hidden units for the two hidden layers, while we use 64 and 32 hidden units for the parameter estimation experiments. See Box 4—figure 1 for a visualization of this bespoke neural network architecture.
Training
We use the Adam optimizer to maximize the lower bound shown in Equation 2, with a learning rate of 10^{3} and a weight decay of 10^{3} (except for the parameter estimation (WSLTS) experiments where we use a weight decay of 10^{4}). We additionally use a plateau learning rate scheduler with a decay factor of 05 and a patience of 25 epochs. We train the neural network for 200, 400, 300 and 300 epochs for the model discrimination, parameter estimation (WSLTS), parameter estimation (AEG) and parameter estimation (GLS) experiments, respectively. At every design we simulate 50,000 samples from the datagenerating distribution (one for every prior sample) and randomly hold out 10,000 of those as a validation set, which are then used to compute an estimate of the mutual information via Equation 2. During the BO procedure we select an experimental budget of 400 evaluations (80 of which were initial evaluations), which is more than double needed to converge.
Appendix 2
Human participant study
Task
Participants completed the multiarmed bandit tasks in online experiments. After going through instructions on the interface and setup of the task, participants were required to pass a series of five comprehension questions (in a truefalse format). If any of the comprehension questions were answered incorrectly, the participant was sent back to the instructions and could only progress to the task once they answered all questions correctly. An example screenshot of the task interface can be found in Appendix 5.
Twophase design
To allocate participants to the parameter estimation designs for the respective model that best matched their behavior in the model discrimination blocks, we implemented a simple API that uses the ensemble of trained neural networks and performs approximate MAP inference over the model indicator (see the previous subsection on posterior estimation). As we have obtained amortized posterior distributions as byproducts in the BOED procedure, this inference can be done efficiently: Forwardpasses through the neural networks are computationally cheap, only taking a fraction of a second, which means that there was no noticeable delay for the allocation of the optimal model for each participant.
Participants
$N=326$ adults (154 female, mean age $M=35.61}{\displaystyle SD=12.59$) participated in the experiment in return for a basic payment of £0.60 and performance related bonuses of up to £1.00. Participants took, on average, 8.8 ($SD=2.89$) minutes to complete the task.
Appendix 3
Computational models
We here provide further details about the computational models of human behavior in bandit tasks, which we briefly presented in the main text. We consider the multiarmed bandit setting where, at each trial $t=1,\dots ,T$, participants have to make an action $a}^{(t)}\in \{1,\dots ,K\$, which consists of choosing any of the $K$ bandit arms, and subsequently observe a reward $r}^{(t)}\in \{0,1\$. After a participant has gone through all $T$ trials, we summarize their behavior by a set of actions $\mathit{a}=({a}^{(1)},\mathrm{\dots},{a}^{(T)})$ and observed rewards $\mathit{r}=({r}^{(1)},\mathrm{\dots},{r}^{(T)})$.
Whether a participant observes a reward of 0 or 1 when making a particular choice $a}^{(t)$ depends on the specified reward probability. We here assume that each bandit arm $k\in \{1,\dots ,K\}$ is associated with a Bernoulli reward distribution. Each of these reward distributions has a (potentially) different reward probability, which is given by the corresponding entry in the design vector $\mathit{d}\in [0,1{]}^{K}$. In other words, the reward $r}^{(t)$ a participant receives when making a particular choice ${a}^{(t)}=k$ is sampled via ${r}^{(t)}\sim \text{Bernoulli}({d}_{k})$, where $d}_{k$ is the $k$th element of $\displaystyle \mathbf{d}$.
Computational models of human behavior in bandit tasks only differ in how they model the choices of a participant depending on the previous actions and rewards. To help us describe the mechanisms of such computational models, we here define the vectors $\mathit{\alpha}$ and $\mathit{\beta}$, which store the number of observed 0 and 1 rewards for all $K$ arms. That is, $\alpha}_{k$ refers to the number of times the $k$th arm was selected and subsequently generated a reward. Similarly, ${\mathit{\beta}}_{k}$ refers to the number of times a participant did not observe a reward when selecting the $k$th bandit arm. Below, we describe each of the computational models in more detail and include corresponding pseudocode.
WinStay LoseThompsonSample (WSLTS)
Here, we propose WinStay LoseThompsonSample (WSLTS) as an amalgamation of WinStay LoseShift (WSLS; Robbins, 1952) and Thompson Sampling (Thompson, 1933). WSLTS has three model parameters $({\theta}_{0},{\theta}_{1},{\theta}_{2})$ that regulate different aspects of exploration and exploitation behavior. Specifically, $\theta}_{0$ corresponds to the probability of staying after winning, that is the probability of reselecting the previous arm $a}^{(t1)$ after having observed ${r}^{(t1)}=1$. However, with probability $1{\theta}_{0}$ the agent may decide to choose a different bandit arm even after having observed a reward of 1 at the previous trial. In this case, the agent performs Thompson Sampling, using a temperature parameter $\theta}_{2$, to decide which bandit arm to select (see Equation 4). Conversely, $\theta}_{1$ corresponds to the probability of switching to another arm when observing a loss in the previous trial, that is when ${r}^{(t1)}=0$, in which case the agent again performs Thompson Sampling to select another bandit arm. However, the agent may also reselect the previous bandit arm with probability $1{\theta}_{1}$ even after having observed a loss.
During the exploration phases mentioned above, the WSLTS agent performs Thompson Sampling from a reshaped posterior, which is controlled via a temperature parameter ${\mathit{\theta}}_{2}$. The corresponding choice of bandit arm is then given by
where $\alpha}_{k$ and $\beta}_{k$ correspond to the number of times a participant observed a reward of 1 and 0, respectively, for a bandit arm $k$. This stands in contrast to standard WSLS, where the agent shifts to another arm uniformly at random. We note that, for $\theta}_{2$ values close to 1 we recover standard ThompsonSampling (excluding the previous arm), while for $\theta}_{2}\to \mathrm{\infty$ we recover standard WSLS, and for ${\theta}_{2}\to 0$ we obtain a greedy policy. We provide pseudocode for the WSLTS computational model in Algorithm 1.
Algorithm 1 WinStay LoseThompsonSample (WSLTS) 

Input: Parameter $\mathit{\theta}=({\theta}_{0},{\theta}_{1},{\theta}_{2})$, design $\mathit{d}=({d}_{1},\dots ,{d}_{K})$ Output: Actions $\mathit{a}=({a}^{(1)},\mathrm{\dots},{a}^{(T)})$, rewards $\mathit{r}=({r}^{(1)},\mathrm{\dots},{r}^{(T)})$ 1: Initialize pseudocounts $\alpha}_{k$ and $\beta}_{k$ to 1 for all bandit arms $k$. 2: for t = 1, …, T do 3: Sample $u\sim \mathcal{U}(0,1)$. 4: if t=1 then 5: Select the first bandit arm $a}^{(1)$ uniformly at random. 6: else if ${r}^{(t1)}=1$ then 7: if $\displaystyle u<{\theta}_{0}$ then 8: Reselect the previous bandit arm $a}^{(t1)$. 9: else 10: ThomsonSample according to Equation 4. 11: else 12: if $u<{\theta}_{1}$ then 13: ThomsonSample according to equation 4. 14: else 15: Reselect the previous previous bandit arm $a}^{(t1)$. 16: Sample the reward ${r}^{(t)}\sim \text{Bernoulli}({d}_{{a}^{(t)}})$. 17: Increment ${\alpha}_{{a}^{(t)}}\leftarrow {\alpha}_{{a}^{(t1)}}+1$ if ${r}^{(t)}=1$ and ${\beta}_{{a}_{t}}\leftarrow {\beta}_{{a}^{(t1)}}+1$ if ${r}^{(t)}=0$. 
Autoregressive $\epsilon$Greedy (AEG)
Standard $\epsilon$Greedy (e.g. Sutton and Barto, 2018) is a ubiquitous method in reinforcement learning, where the agent selects the arm with the highest expected reward with probability $1\epsilon$, where $\epsilon$ is a model parameter. Conversely, with probability $\epsilon$, the agent performs exploration by uniformly selecting a bandit arm. Here, we propose Autoregressive $\epsilon$Greedy (AEG) as a generalization of $\epsilon$Greedy, which allows for modeling people’s tendency towards autoregressive behavior (e.g. Gershman, 2020).
Specifically, the AEG model has two model parameters $({\theta}_{0},{\theta}_{1})$, where $\theta}_{0$ corresponds to the same $\epsilon$ parameter as in $\epsilon$Greedy. However, as opposed to randomly selecting any bandit arm, the probability of selecting the previous arm, in order to break ties between options with the same expected reward, is specifically controlled via the second model parameter $\theta}_{1$. We provide pseudocode for the AEG computational model in Algorithm 2.
Algorithm 2 Autoregressive $\epsilon$Greedy (AEG) 

Input: Parameter $\mathit{\theta}=({\theta}_{0},{\theta}_{1})$, design $\mathit{d}=({d}_{1},\dots ,{d}_{K})$ Output: Actions $\mathit{a}=({a}^{(1)},\mathrm{\dots},{a}^{(T)})$, rewards $\mathit{r}=({r}^{(1)},\mathrm{\dots},{r}^{(T)})$ 1: Initialize pseudocounts $\alpha}_{k$ and $\beta}_{k$ to 1 for all bandit arms $k$. 2: for t=1, …, T do 3: Sample $u\sim \mathcal{U}(0,1)$ and $v\sim \mathcal{U}(0,1)$. 4: Update the estimated reward probability for all $K$ bandit arms as $\frac{{\alpha}_{k}}{{\alpha}_{k}+{\beta}_{k}}$. 5: Let $M$ be the set of bandit arms with maximal expected reward. 6: if t=1 then 7: Select the first bandit arm $a}^{(1)$ uniformly at random. 8: else if $u<{\theta}_{0}$ then 9: if $v<{\mathit{\theta}}_{1}+\left(1{\mathit{\theta}}_{1}\right)/K$ then 10: Reselect the previous bandit arm $a}^{(t1)$. 11: else 12: Randomly select another bandit arm. 13: else 14: if ${a}^{(t1)}\in M$ and $v<{\mathit{\theta}}_{1}+\left(1{\mathit{\theta}}_{1}\right)/M$ then 15: Reselect the previous bandit arm $a}^{(t1)$. 16: else 17: Randomly select another bandit arm among the set $M$. 18: Sample the reward ${r}^{(t)}\sim \text{Bernoulli}({d}_{{a}^{(t)}})$. 19: Increment ${\alpha}_{{a}^{(t)}}\leftarrow {\alpha}_{{a}^{(t1)}}+1$ if ${r}^{(t)}=1$ and ${\beta}_{{a}_{t}}\leftarrow {\beta}_{{a}^{(t1)}}+1$ if ${r}^{(t)}=0$. 
Generalized Latent State (GLS)
Lee et al., 2011 proposed a latent state model for bandit tasks whereby a learner can be in either an explore or an exploit state and switch between these as they go through the task. Here, we propose the Generalized Latent State (GLS) model, which unifies and extends latentstate and latentswitching models, previously studied in Lee et al., 2011, allowing for more flexible and structured transitions (we thank Patrick Laverty for contributing towards this). The transition distribution over whether the agent is in an exploit state is specified by four parameters, each of which specify the probability of being in an exploit state at trial $t$. This transition distribution is denoted by $\pi (l,{r}_{t1})$, where $l$ is the previous latent state and $r}_{t1$ the reward observed in the previous trial. We here refer to trials where a bandit arm was selected but failed to produce a reward as failures. We provide pseudocode for our GLS computational model in Algorithm 3.
Algorithm 3 Generalized Latent State (GLS) 

Input: Parameter $\mathit{\theta}=({\theta}_{0},{\theta}_{1},{\theta}_{2},{\theta}_{3},{\theta}_{4})$, design $\mathit{d}=({d}_{1},\dots ,{d}_{K})$ Output: Actions $\mathit{a}=({a}^{(1)},\mathrm{\dots},{a}^{(T)})$, rewards $\mathit{r}=({r}^{(1)},\mathrm{\dots},{r}^{(T)})$ 1: Initialize pseudocounts $\alpha}_{k$ and $\beta}_{k$ to 1 for all bandit arms $k$. 2: Sample the initial latent state $l\sim \text{Bernoulli}(0.5)$. 3: for t=1, … do 4: Let $R\leftarrow$ all $k$ such that $k={\mathrm{arg}\mathrm{max}}_{k}\mathit{\alpha}$ (arms with the maximal number of rewards). 5: Let $F\leftarrow$ all $k$ such that $k={\mathrm{arg}\mathrm{min}}_{k}\mathit{\beta}$ (arms with the minimal number of failures). 6: Let $S\leftarrow R\cap F$. 7: Sample $u\sim \mathcal{U}(0,1)$ and $v\sim \mathcal{U}(0,1)$. 8: if t=1 then 9: Select the first bandit arm $a}^{(1)$ uniformly at random. 10: else if $S>1$ then 11: Randomly select a bandit arm $a}^{(t)$ from the set $S$. 12: else if $S=1$ then 13: if $u<{\theta}_{0}$ then 14: Select the bandit arm ${a}^{(t)}\in S$ 15: else 16: Randomly select another bandit arm. 17: else if $v<\pi (l,{r}_{t1})$ then 18: $l\leftarrow 1$ $\u25b9$ Latent state is exploit 19: if $u<{\theta}_{0}$ then 20: Let $R}_{\text{min}$ be the set of bandit arms in $R$ with the minimal number of failures. 21: Randomly select a bandit arm $a}^{(t)$ from the set $R}_{\text{min}$. 22: else 23: Randomly select another bandit arm. 24: else 25: $l\leftarrow 0$ $\u25b9$ Latent state is explore 26: if $u<{\theta}_{0}$ then 27: Let $F}_{\text{max}$ be the set of bandit arms in $F$ with the maximal number of rewards. 28: Randomly select a bandit arm $a}^{(t)$ from the set $F}_{\text{max}$. 29: else 30: Randomly select another bandit arm. 31: 32: Sample the reward ${r}^{(t)}\sim \text{Bernoulli}({d}_{{a}^{(t)}})$. 33: Increment ${\alpha}_{{a}^{(t)}}\leftarrow {\alpha}_{{a}^{(t1)}}+1$ if ${r}^{(t)}=1$ and ${\beta}_{{a}_{t}}\leftarrow {\beta}_{{a}^{(t1)}}+1$ if ${r}^{(t)}=0$. 
Appendix 4
BOED Algorithm
Algorithm 4 describes the BOED procedure used in our work to determine optimal experiments in bandit tasks.
Algorithm 4 BOED 

Input: Simulator model $\mathbf{y}\sim p(\mathbf{y}\mathit{v},\mathbf{d})$, prior distribution $p(\mathit{v})$, neural network (NN) architecture for ${T}_{\mathit{\psi}}(\mathit{v},\mathbf{y})$ Output: Optimal design $\mathbf{d}}^{\ast$, trained NN ${T}_{{\mathit{\psi}}^{\ast}}(\mathit{v},\mathbf{y})$ at $\mathbf{d}}^{\ast$ 1: Randomly initialize the experimental designs $\mathbf{d}}_{n}\leftarrow {\mathbf{d}}_{0$ 2: Initialize the Gaussian Process for Bayesian optimization (BO) 3: while $U({\mathbf{d}}_{n})$ not converged do 4: Sample from the prior: ${\mathit{v}}^{(i)}\sim p(\mathit{v})$ for $i=1,\dots ,M$ 5: Sample from the simulator: ${\mathbf{y}}^{(i)}\sim p(\mathbf{y}{\mathit{v}}^{(i)},\mathbf{d})$ for $i=1,\dots ,M$ 6: Randomly initialize the NN parameters ${\mathit{\psi}}_{n}\leftarrow {\mathit{\psi}}_{0}$ 7: while $U({\mathbf{d}}_{n};{\mathit{\psi}}_{n})$ with fixed $\mathbf{d}}_{n$ not converged do 8: Compute a sample average of the lower bound (see the main text) 9: Estimate gradients of the sample average with respect to ${\mathit{\psi}}_{n}$ 10: Update ${\mathit{\psi}}_{n}$ using any gradientbased optimizer 11: Use $\mathbf{d}}_{n$ and $U({\mathbf{d}}_{n})$ to update the Gaussian Process 12: Use BO to determine which $\mathbf{d}}_{n+1$ to evaluate next 
Appendix 5
Additional figures and results
In this section, we provide additional figures and results that supplement our main text. Appendix 5—figure 1 shows a screenshot of the bandit task that participants were presented with in our human experiment. Appendix 5—figures 2 and 3 show the parameter estimation results for the AEG and GLS simulator models, respectively, of our simulation study. Specifically, these figures show marginal posterior distributions of the model parameters, averaged over several (simulated) observations, for baseline designs and our optimal designs. With regards to the real experiment with human participants, Appendix 5—figures 4 and 5 show distributions of the posterior entropies in the parameter estimation task for the WSLTS and GLS simulator models, respectively. These distributions compare the differential entropy of the posteriors obtained using our optimal designs with those obtained using the baseline designs. Lastly, Appendix 5—figures 6 and 7 show example posteriors of human participants assigned to the WSLTS model, while Appendix 5—figures 8 and 9 show example posteriors of human participants assigned to the GLS model.
Appendix 6
Exploring the utility surface and locally optimal designs
In this section, we provide an example of how our framework allows for exploring the surface of the utility function and (locally) optimal designs. To do so, we consider the example of the model discrimination task where the optimal design was found to be around $[0,0,0.6]$ for the first bandit arm and $[1,1,0]$ for the second bandit arm. Our framework maximizes a probabilistic surrogate model, for example a Gaussian process (GP), over the estimated mutual information (MI) values in order to find optimal designs. It is straightforward to extract the learned GP from the final round of training, which allows us to utilize it for posttraining analysis and exploration. Since the design space is sixdimensional, it is difficult to visualize the entirety of the utility function, provided by the mean function of the learned GP, at once. However, it is possible to explore the utility function by means of slicing. An example of visualizing such a slice is shown in Appendix 6—figure 1. Furthermore, it is possible to systematically search for local optima of the utility function by running stochastic gradient ascent (SGA) over the mean function of the GP, with several restarts. We have done this for 20 restarts for the above example and summarized the (unique) local optima of the utility function in Appendix 6—table 1. Note that instead of using SGA, we could also treat the utility as an unnormalized density and sample designs proportional to this density (see, e.g. Müller, 1999).
Appendix 7
Example participants
We now explore the data from our optimal design group in the human participant study, specifically focusing on participants that were best described by the AEG model as an example. We qualitatively investigate their behavior by visualizing the (marginal) posterior distribution of the AEG model, as shown in Appendix 7—figure 1 and Appendix 7—figure 2 and for two example participants. The posterior distribution for the first participant (Appendix 7—figure 1) indicates small values for the first parameter 0 of the AEG model, that is the ε parameter, which implies that the participant tends to select the arm with the highest probability of receiving a reward as opposed to making random exploration decisions. Larger values of the second parameter 1 of the AEG model, that is the ‘stickiness’ parameter, imply that the participant is biased towards reselecting the previously chosen bandit arm. The second example participant (Appendix 7—figure 1) shows markedly different behavior. As indicated by the marginal posterior of the 0 parameter, this participant engages in considerably more random choices. Moreover, this participant shows ‘antisticky’ behavior, as implied by low values of the second parameter 1. Lastly, we compute the posterior mean for each participant best described by the AEG model and visualize them in a scatter plot in Appendix 7—figure 3. We find that most participants best described by the AEG model show behavior somewhat similar to the first example participant, but as expected, there is interindividual variation, for example, with some participants displaying ‘antisticky’ behavior.
References

Perceptual decision making: driftdiffusion model is equivalent to a Bayesian modelFrontiers in Human Neuroscience 8:102.https://doi.org/10.3389/fnhum.2014.00102

Intuitive experimentation in the physical worldCognitive Psychology 105:9–38.https://doi.org/10.1016/j.cogpsych.2018.05.001

ConferenceNeural approximate sufficient statistics for implicit modelsIn International Conference on Learning Representations (ICLR.

The languageasfixedeffect fallacy: A critique of language statistics in psychological researchJournal of Verbal Learning and Verbal Behavior 12:335–359.https://doi.org/10.1016/S00225371(73)800143

Frequencydependent selection in vaccineassociated pneumococcal population dynamicsNature Ecology & Evolution 1:1950–1960.https://doi.org/10.1038/s415590170337x

How simulation modelling can help reduce the impact of COVID19Journal of Simulation 14:83–97.https://doi.org/10.1080/17477778.2020.1751570

Models that learn how humans learn: The case of decisionmaking and its disordersPLOS Computational Biology 15:e1006903.https://doi.org/10.1371/journal.pcbi.1006903

Sensitivity and specificity of information criteriaBriefings in Bioinformatics 21:553–565.https://doi.org/10.1093/bib/bbz016

Bayesian optimization with informative covarianceTransactions on Machine Learning Research pp. 1–37.

Neural architecture search: A surveyJournal of Machine Learning Research 20:1–21.https://doi.org/10.1007/9783030053185

ConferenceVariational Bayesian optimal experimental designAdvances in Neural Information Processing Systems.

Greedy function approximation: A gradient boosting machineThe Annals of Statistics 29:1232.https://doi.org/10.1214/aos/1013203451

Hierarchical reinforcement learning explains task interleaving behaviorComputational Brain & Behavior 4:284–304.https://doi.org/10.1007/s42113020000939

Indirect inference for dynamic panel modelsJournal of Econometrics 157:68–77.https://doi.org/10.1016/j.jeconom.2009.10.024

How computational modeling can force theory building in psychological sciencePerspectives on Psychological Science 16:789–802.https://doi.org/10.1177/1745691620970585

Mapping anhedonia onto reinforcement learning: a behavioural metaanalysisBiology of Mood & Anxiety Disorders 3:12.https://doi.org/10.1186/20455380312

ConferenceImplicit Deep Adaptive Design: PolicyBased Experimental Design without Likelihoods35th Conference on Neural Information Processing Systems.

ConferenceEfficient Bayesian experimental design for implicit modelsIn Proceedings of the TwentySecond International Conference on Artificial Intelligence and Statistics.

Psychological models of human and optimal performance in bandit problemsCognitive Systems Research 12:164–174.https://doi.org/10.1016/j.cogsys.2010.07.007

Strategy selection as rational metareasoningPsychological Review 124:762–794.https://doi.org/10.1037/rev0000075

Maximizing the information content of experiments in systems biologyPLOS Computational Biology 9:e1002888.https://doi.org/10.1371/journal.pcbi.1002888

On a Measure of the Information Provided by an ExperimentThe Annals of Mathematical Statistics 27:986–1005.https://doi.org/10.1214/aoms/1177728069

Fundamentals and recent developments in approximate bayesian computationSYSTEMATIC BIOLOGY 66:e66–e82.https://doi.org/10.1093/sysbio/syw077

Bayesopt: a bayesian optimization library for nonlinear optimization, experimental design and banditsJournal of Machine Learning Research: JMLR 15:3735–3739.

Simulationbased optimal designBayesian Statistics 6:459–474.https://doi.org/10.1093/oso/9780198504856.001.0001

Optimal experimental design for model discriminationPsychological Review 116:499–518.https://doi.org/10.1037/a0016104

A tutorial on adaptive design optimizationJournal of Mathematical Psychology 57:53–67.https://doi.org/10.1016/j.jmp.2013.05.005

ConferenceBOCK: Bayesian optimization with cylindrical kernelsIn Proceedings of the 35th International Conference on Machine Learning.

Conferencewebpploed: A practical optimal experiment design systemIn Proceedings of the annual meeting of the cognitive science society.

The Importance of falsification in computational cognitive modelingTrends in Cognitive Sciences 21:425–433.https://doi.org/10.1016/j.tics.2017.03.011

Asymptotic theory of informationtheoretic experimental designNeural Computation 17:1480–1507.https://doi.org/10.1162/0899766053723032

Editors’ introduction to the special section on replicability in psychological science: a crisis of confidence?Perspectives on Psychological Science 7:528–530.https://doi.org/10.1177/1745691612465253

When a good fit can be badTrends in Cognitive Sciences 6:421–425.https://doi.org/10.1016/S13646613(02)019642

ConferenceOn variational bounds of mutual informationIn Proceedings of the 36th International Conference on Machine Learning.

A comprehensive survey of neural architecture search: Challenges and solutionsACM Computing Surveys 54:3447582.https://doi.org/10.1145/3447582

Some aspects of the sequential design of experimentsBulletin of the American Mathematical Society 58:527–535.https://doi.org/10.1090/S000299041952096208

Using approximate Bayesian computation to quantify cellcell adhesion parameters in a cell migratory processNPJ Systems Biology and Applications 3:9.https://doi.org/10.1038/s4154001700107

Inferring causation from time series in Earth system sciencesNature Communications 10:2553.https://doi.org/10.1038/s41467019101053

A review of modern computational algorithms for bayesian optimal designInternational Statistical Review 84:128–154.https://doi.org/10.1111/insr.12107

Toward a principled Bayesian workflow in cognitive sciencePsychological Methods 26:103–126.https://doi.org/10.1037/met0000275

BookLikelihoodfree inference in cosmology: potential for the estimation of luminosity functionsIn: Feigelson E, Babu G, editors. Statistical Challenges in Modern Astronomy V. New York: Springer. pp. 3–19.https://doi.org/10.1007/9781461435204

Finding structure in multiarmed banditsCognitive Psychology 119:101261.https://doi.org/10.1016/j.cogpsych.2019.101261

Taking the human out of the loop: a review of bayesian optimizationProceedings of the IEEE 104:148–175.https://doi.org/10.1109/JPROC.2015.2494218

A mathematical theory of communicationBell System Technical Journal 27:379–423.https://doi.org/10.1002/j.15387305.1948.tb01338.x

A Bayesian analysis of human decisionmaking on bandit problemsJournal of Mathematical Psychology 53:168–179.https://doi.org/10.1016/j.jmp.2008.11.002

BookReinforcement Learning: An Introduction. Adaptive Computation and Machine Learning SeriesThe MIT Press.

A tutorial on approximate Bayesian computationJournal of Mathematical Psychology 56:69–85.https://doi.org/10.1016/j.jmp.2012.02.005

A generalized, likelihoodfree method for posterior estimationPsychonomic Bulletin & Review 21:227–250.https://doi.org/10.3758/s1342301305300

Approaches to analysis in modelbased cognitive neuroscienceJournal of Mathematical Psychology 76:65–79.https://doi.org/10.1016/j.jmp.2016.01.001

Competing theories of multialternative, multiattribute preferential choicePsychological Review 125:329–362.https://doi.org/10.1037/rev0000089

Learning physical parameters from dynamic scenesCognitive Psychology 104:57–82.https://doi.org/10.1016/j.cogpsych.2017.05.006

SoftwareBoedTutorial, version swh:1:rev:94f32a04693763ec6af5d31247caf129bfe99142Software Heritage.

ConferenceAttention is all you needAdvances in Neural Information Processing Systems.

Prefrontal cortex as a metareinforcement learning systemNature Neuroscience 21:860–868.https://doi.org/10.1038/s4159301801478

Application of computerized adaptive testing to educational problemsJournal of Educational Measurement 21:361–375.https://doi.org/10.1111/j.17453984.1984.tb01040.x

The lack of a priori distinctions between learning algorithmsNeural Computation 8:1341–1390.https://doi.org/10.1162/neco.1996.8.7.1341

The generalizability crisisThe Behavioral and Brain Sciences 45:e1.https://doi.org/10.1017/S0140525X20001685

Optimal experimental design for a class of bandit problemsJournal of Mathematical Psychology 54:499–508.https://doi.org/10.1016/j.jmp.2010.08.002
Article and author information
Author details
Funding
University of Edinburgh (Principal's Career Development Scholarship)
 Simon Valentin
EPSRC Centre for Doctoral Training in Data Science (EP/L016427/1)
 Steven Kleinegesse
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Fiona Knäussel for her help in creating the schematics used in this paper, Maximilian Harkotte for valuable feedback during preparation of the paper and three anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions. SV was supported by a Principal’s Career Development Scholarship, awarded by the University of Edinburgh. SK was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.
Senior and Reviewing Editor
 Joshua I Gold, University of Pennsylvania, United States
Version history
 Received: January 17, 2023
 Preprint posted: May 12, 2023 (view preprint)
 Accepted: November 19, 2023
 Version of Record published: January 23, 2024 (version 1)
Copyright
© 2024, Valentin, Kleinegesse et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 905
 views

 101
 downloads

 0
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Neuroscience
One wellknown biomarker candidate that supposedly helps capture fluid cognition is Brain Age, or a predicted value based on machinelearning models built to predict chronological age from brain MRI. To formally evaluate the utility of Brain Age for capturing fluid cognition, we built 26 ageprediction models for Brain Age based on different combinations of MRI modalities, using the Human Connectome Project in Aging (n=504, 36–100 years old). First, based on commonality analyses, we found a large overlap between Brain Age and chronological age: Brain Age could uniquely add only around 1.6% in explaining variation in fluid cognition over and above chronological age. Second, the ageprediction models that performed better at predicting chronological age did NOT necessarily create better Brain Age for capturing fluid cognition over and above chronological age. Instead, betterperforming ageprediction models created Brain Age that overlapped larger with chronological age, up to around 29% out of 32%, in explaining fluid cognition. Third, Brain Age missed around 11% of the total variation in fluid cognition that could have been explained by the brain variation. That is, directly predicting fluid cognition from brain MRI data (instead of relying on Brain Age and chronological age) could lead to around a 1/3time improvement of the total variation explained. Accordingly, we demonstrated the limited utility of Brain Age as a biomarker for fluid cognition and made some suggestions to ensure the utility of Brain Age in explaining fluid cognition and other phenotypes of interest.

 Developmental Biology
 Neuroscience
Movement is a key feature of animal systems, yet its embryonic origins are not fully understood. Here, we investigate the genetic basis underlying the embryonic onset of movement in Drosophila focusing on the role played by small noncoding RNAs (microRNAs, miRNAs). To this end, we first develop a quantitative behavioural pipeline capable of tracking embryonic movement in large populations of fly embryos, and using this system, discover that the Drosophila miRNA miR2b1 plays a role in the emergence of movement. Through the combination of spectral analysis of embryonic motor patterns, cell sorting and RNA in situs, genetic reconstitution tests, and neural optical imaging we define that miR2b1 influences the emergence of embryonic movement by exerting actions in the developing nervous system. Furthermore, through the combination of bioinformatics coupled to genetic manipulation of miRNA expression and phenocopy tests we identify a previously uncharacterised (but evolutionarily conserved) chloride channel encoding gene – which we term Movement Modulator (Motor) – as a genetic target that mechanistically links miR2b1 to the onset of movement. Cellspecific genetic reconstitution of miR2b1 expression in a null miRNA mutant background, followed by behavioural assays and target gene analyses, suggest that miR2b1 affects the emergence of movement through effects in sensory elements of the embryonic circuitry, rather than in the motor domain. Our work thus reports the first miRNA system capable of regulating embryonic movement, suggesting that other miRNAs are likely to play a role in this key developmental process in Drosophila as well as in other species.