Control of entropy in neural models of environmental state
 Cited 1
 Views 690
 Annotations
Abstract
Humans and animals construct internal models of their environment in order to select appropriate courses of action. The representation of uncertainty about the current state of the environment is a key feature of these models that controls the rate of learning as well as directly affecting choice behaviour. To maintain flexibility, given that uncertainty naturally decreases over time, most theoretical inference models include a dedicated mechanism to drive up model uncertainty. Here we probe the longstanding hypothesis that noradrenaline is involved in determining the uncertainty, or entropy, and thus flexibility, of neural models. Pupil diameter, which indexes neuromodulatory state including noradrenaline release, predicted increases (but not decreases) in entropy in a neural state model encoded in human medial orbitofrontal cortex, as measured using multivariate functional MRI. Activity in anterior cingulate cortex predicted pupil diameter. These results provide evidence for topdown, neuromodulatory control of entropy in neural state models.
https://doi.org/10.7554/eLife.39404.001Introduction
Faced with a constant stream of complex sensory data, observers construct internal (mental) models of the underlying state of the world that generated that data. These models allow the observer to understand and make predictions about the environment. The current state of the environment can be thought of as a latent variable in such models, that is a variable that cannot be observed directly but must be inferred from sensory observations. For example a rat in a reversal learning task needs to know the underlying state of the world (which of two levers is more likely to deliver a food reward), but cannot observe the probabilities per se; instead the rat must infer the probabilities through repeated sampling of an observable variable (does the lever pay out on this trial or not?).
Theoretical accounts of inference (sometimes called learning, as in reinforcement learning) tend to include two distinct processes for updating beliefs over time. The first process is evidencedriven updating of beliefs – after each new piece of evidence (such as a reward) is observed, the model is updated to take into account the new observation. The second process is one that occurs between observations, as states of the world for which no evidence has necessarily been observed are upweighted to account for the possibility that the environment could change before the next observation. Mathematically this second process is called the transition function as it determines the transition between posterior beliefs on trial t and prior beliefs on trial t + 1.
These two updating processes differ somewhat, both conceptually and algorithmically. Conceptually, evidencedriven updating, as the name suggests, is driven by recently observed evidence and usually takes into account how that evidence differs from predictions under a prior belief (e.g., reward prediction error). In contrast, updating under the transition function is driven by higher order features of the environment such as the inferred level of environmental volatility and structural knowledge about which environmental states tend to follow each other (Behrens et al., 2007; Meyniel and Dehaene, 2017).
Algorithmic examples of evidencedriven updating would be the value update equation of the RescorlaWagner model, or the application of Bayes’ theorem to combine a new observation with a prior belief. In contrast, a simple example of a transition function algorithm is a constant ‘leak’ that upweights all possible future states of the environment equally, and downweights beliefs based on past observations, such that the model’s beliefs constantly decay towards a state of agnosticism (an example would be the leaky accumulator (Usher and McClelland, 2001)). A more complex transition function could determine structured changes in the environment, for example by preferentially predicting that if the environment changes, it changes to a similar state (an example would be the Gaussian transition function on reward probability (Behrens et al., 2007)).
Put another way, the overall level and distribution of uncertainty in an internal model is controlled in part by the transition function. When the environment is believed to be changeable (Behrens et al., 2007), or to have recently changed (Wilson et al., 2010), the transition function adjusts beliefs so that there is high uncertainty about the state of the world on trial t + 1, even given what was known at trial t. The optimal level of uncertainty is determined by two opposing forces: if relatively little uncertainty is added at each transition the model can utilize a longer history of observations to infer the current state of the world; but if too little uncertainty is added, the model becomes inflexible.
Previous studies have addressed the question of how the optimal level of uncertainty, and hence the magnitude of the ‘leak’ or transition function, should be determined by inferring the higher order statistics of the environment such as its volatility or changepoint probabilities. Furthermore, bloodoxygenlevel dependent (BOLD) and pupillometric signals tracking such (optimal) uncertainty have been observed (Behrens et al., 2007; McGuire et al., 2014; Nassar et al., 2012; Preuschoff et al., 2011). A second, open question, however, is how the brain gets from calculating how flexible one’s beliefs should be, to controlling flexibility and uncertainty within a model that represents the current state of the world.
In relation to the first question, how the optimal level of uncertainty is calculated, Behrens and colleagues (Behrens et al., 2007) argued that the learning rate (analogous to the magnitude of the leak) should optimally depend on environmental volatility, and observed that when volatility (hence learning rate) was high, anterior cingulate cortex (ACC) was active. O’Reilly and colleagues (O'Reilly et al., 2013) showed activity in a similar region when participants were instructed that changes in the environment had occurred. McGuire and colleagues (McGuire et al., 2014) found activity in ACC associated with belief uncertainty, another modelbased measure that should determine the magnitude of the transition function (in their case, the probability that the model abandons its prior beliefs entirely in favour of a flat prior).
In relation to the second question, how uncertainty/flexibility is implemented within internal models, there is less evidence, one reason being that it is difficult to measure the uncertainty of an internal neural model directly. An intriguing possibility is that increases in neural model uncertainty are controlled via a neuromodulatory route. In particular noradrenaline has been proposed as a candidate mechanism by which the level of uncertainty in neural models of the world may be controlled (AstonJones and Cohen, 2005; Bouret and Sara, 2005; Iigaya, 2016; Martins and Froemke, 2015; Nassar et al., 2012; Preuschoff et al., 2011; Pulcu and Browning, 2017; Yu and Dayan, 2005), and experimental manipulations increasing noradrenaline levels in cortex tend to lead to increased randomness in behavior (Tervo et al., 2014). One theoretical proposal, that noradrenaline performs a ‘network reset’ (Bouret and Sara, 2005), is particularly reminiscent of the second, transitionfunction updating process described above.
In the current experiment we focussed on the second question above. That is, we attempted to measure the level of uncertainty within a probabilistic world model, and determine signals that predicted changes in the level of this uncertainty. These are key components for any candidate mechanism for the implementation of a transitionfunction that increases uncertainty, such as a leak.
We first established that we could measure uncertainty in a neural representation of a simple task state space using functional MRI. Then we identified predictors of increases in belief uncertainty, in the absence of overt changes in behaviour. In a simple experimental environment (4arm bandit task), human participants formed a probabilistic model of the state of the environment (which bandit has the high payout rate) that could be decoded from the medial orbitofrontal cortex (mOFC). This model represented participants’ certainty about the state of the environment in that the strength of the representation of the currently selected option, relative to other options, depended on the strength of evidence experienced, as captured by an optimal observer model. As well as measuring a neural model with fMRI, we simultaneously measured pupil dilation – an index of neuromodulatory state, commonly used as a measure of noradrenaline release to cortex. We found changes in pupil dilation predicted changes in uncertainty in the mOFC state model, providing evidence that the mechanism for driving up entropy in cortical models of the environment could be neuromodulatory. In circumstances when model entropy should increase, such as after observations likely to indicate a change in environmental state, activity in the ACC was observed, which in turn explained the strength of the pupil dilation response, suggesting ACC may be involved in regulating these neuromodulatory processes.
Results
To establish a task environment in which participants were constantly making inferences about the current state of the environment, we used a 4arm bandit task (Figure 1a, Figure 1—figure supplement 1). At any given time, one of four response options had a high (70 or 90%) probability of delivering reward and the other three had a low (20%) probability. The high reward option switched after a variable number of trials (mean 20 ± SD 5). Participants’ task was therefore to infer the current state of the environment, i.e. which option was the current highreward option. Participants were explicitly instructed about the reward structure before beginning the task. Although there were only four options available for selection (denoted by a different colour), there were eight options on the screen at any given time (see Materials and methods for details and motivation). The individual trials in this task were presented in rapid succession to maximize design efficiency given that we were interested in a subset of trials.
Participants’ behaviour alternated between exploitation and exploration as they formed and revised beliefs about the current state of the environment (Figure 1c). 72% of trials on average were classified as exploit trials (SEM across participants 1.2%); the mean length for an exploitation 'block' was 13.7 trials (across participants) and the withinsubject standard deviation of exploitation block lengths was (mean across participants) 7.1 trials. In order to classify trials as either exploitative or exploratory, we used a simple heuristic. We defined the start of a period of exploitation as the first trial on which participants selected the true high reward probability option and received reward, and the end of the exploit period as the last trial before switching to a different option (thus within exploit periods, by definition, all choices were for the same option). The task therefore provided a clear behavioural marker of the onset of exploration, which was defined as the first trial on which participants switched away from making this selection. As an aside, in Appendix 1 we explore an alternative approach (Ebitz et al., 2018) to defining explore and exploit phases using a hidden Markov model (HMM). This approach yielded a very similar trial classification to the one we used. The relative merits of the two approaches are discussed in Appendix 1.
The possibility for voluntary exploration in the task created a constant pressure to evaluate whether the state of the environment had changed. Certainty about the state of the environment was modelled as the entropy across the state space in a Bayesian ideal observer model. The model, described in the Methods, had a space of possible states encompassing the possibility that each option was the current high reward option, and a range of possible payout rates; it also had a uniform leak.
We defined belief uncertainty as the entropy across the space of four possible states (namely, the four possible highreward options, marginalizing over possible payout rates):
where ${H}_{i}$ is the state that option i is the high reward option. For comparison we defined another measure of belief uncertainty, relative uncertainty (Wilson et al., 2010), which behaved similarly and was highly correlated with entropy (mean correlation across participants r=0.80). The relation between entropy and relative uncertainty is explored further in Appendix 1.
Model entropy was generally high in the exploit phases and low in the explore phases of the task (Figure 1d,e); it also tended to be higher when the payout rate of the high reward option was lower. Both high entropy, and high relative uncertainty, made it more likely that participants would switch from exploitation to exploration following a reward omission (two separate logistic regressions: entropy  t_{18} = 7.60, p=2.57×10^{−7}; relative uncertainty  t_{18} = 13.50, p=3.7×10^{−11}). Furthermore, using an action policy defined by fitting two multinomial softmax functions (one for explore trials, one for exploit) to the option probabilities ${p(H}_{i})$, we found that the particpants’ choices were correctly predicted on 95% of exploit trials and 51% of explore trials (chance would be 25%). The importance of using a model for the study was in fact to have a representation of participants’ latent belief states during phases when their overt behaviour remained stable, rather than to predict overt changes of behaviour; however the fact that the model could predict behaviour relatively well across all phases of the task speaks to its validity.
The overall distribution of brain activity in explore and exploit phases of the task differed (GLM1, see Figure 1—figure supplement 2). The explore phase engaged a frontalparietal action network and the dorsomedial prefrontal cortex (dorsal anterior cingulate dACC and adjacent preSMA). The exploit phase engaged medial orbitofrontal cortex (mOFC) and the hippocampus, two regions known to be engaged in modelbased choice (Boorman et al., 2016; Jones et al., 2012; Noonan et al., 2010; Wikenheiser et al., 2017). In addition to state uncertainty (model entropy), several factors differed between explore and exploit phases, notably the frequency of response switching, and the frequency of reward omission. Further analysis (GLM2, Figure 1—figure supplement 3) indicated that response switching and high model entropy (see below for details of model) predicted activity throughout the frontal parietal network and dorsomedial frontal cortex. Reward omission predicted activity in dorsomedial frontal cortex only, and reward activated ventral striatum. Low model entropy predicted activation of medial OFC and hippocampus (Figure 1—figure supplement 3). Most of the differences in activity observed between exploitation and exploration were explained by these factors (GLM3; Figure 1—figure supplement 4).
However, the main focus of our analysis was the variation in state uncertainty within the exploitation phase of the task, during which the chosen action was by definition constant, and how this uncertainty may be represented and modulated. By limiting our analysis to core exploitation trials (at least five trials, and therefore approximately twenty seconds, after the last action switch and at least five trials before the next action switch; such trials were 25% of all trials (mean across participants = 201/800 trials, SEM = 7.0)), we were able to identify neural activity concerned with changes in participants’ beliefs in the absence of changes in overt behaviour. This dissociated uncertainty about the state of the environment from variability of action. All analyses presented are limited to this subset of trials unless otherwise stated.
During each exploitation block, participants’ certainty about the state of the environment varied due to probabilistic reward omission. To generate trialbytrial estimates of the certainty of participants’ beliefs to regress against neural responses, we constructed a normative Bayesian learning model (see Materials and methods). The model inferred the trialwise probability that each of the four options was the high reward probability option, based on the data observed by each individual participant (Figure 1e). For each trial, we calculated the entropy of the model’s posterior probability distribution over options, based on evidence up to and including that trial (Figure 1d). This was used as a proxy for the entropy in the participants’ internal model, at the point of feedback on each trial. Entropy is low when the probability mass of the posterior is mostly over one option, i.e. participants are certain. Within exploitation phases, model entropy generally decreased over time as participants continued exploiting a single option, but increased following probabilistic omissions of reward, and accordingly depended on the payout rate of the high reward option (70% or 90% reward) (see Figure 1d,e).
Throughout the next sections of this paper, we discuss the relationship between model entropy (in our theoretical model) and the level of uncertainty in a neural model of the state of the world. However, it should be noted that in the current task, the possible states of the environment are simply the alternatives for which option is the high reward option, that is {option A is high reward option… option D is high reward option}. Thus, in the current task, state and option value are inherently interlinked. It is necessarily the case that high confidence in the currently chosen option (low entropy state) correlates with that option having a high expected value (mean correlation ± s.e.m: 0.93 ± 0.005). Thus it should be noted that throughout the subsequent analyses, a low entropy trial, defined by us as a trial on which one particular option is considered very likely to be the high reward option, could equivalently be described as a state in which one particular option (the selected one) has a high expected value and others have a low expected value. In behavioural control more generally, this association would hold for any world model for which the aim was to determine the expected value associated with different candidate actions.
Decoding a probabilistic representation of beliefs about the state of the environment
A probabilistic model of the current state of the world should comprise representations of candidate states, weighted by their probabilities. Recent studies have identified medial OFC as a likely site for such a probabilistic model in the context of modelbased reinforcement learning (Chan et al., 2016; Schuck et al., 2016; Wilson et al., 2014).
In the present task, such a probabilistic internal model could be thought of as a mixture model comprising representations of the possible options, wherein the strength of the representation of a given option on each trial is weighted by the belief that option is the high reward probability option (Figure 1b).
To identify regions that represented the task environment in this probabilistic manner, we used a multivariate searchlight approach over the whole brain as follows: at each searchlight, in one half of the data (training set) we identified activity patterns characteristic of trials on which each option was selected. In the other half of the data (test set) we calculated the trialwise probability of classifying as each option (we then swapped training and test sets; see Materials and methods for details).
Simply searching for regions in which we could correctly classify the selected option, we were, as expected, able to decode at abovechance levels from a broad network of areas including motor and visual cortex (Figure 2a). A plot of the probability of decoding each action from visual cortex (Figure 2c, top) and motor cortex (Figure 2c, middle) reveals a representational structure that is sensitive to spatial and motor factors; the most likely options to be decoded, other than the actually selected option, are those corresponding to the chosen hand (and equivalently, the chosen visual hemifield).
However, the defining feature of a probabilistic model is that the relative weighting of the chosen option should depend on certainty, as shown in schematic Figure 1b. To search for regions that had this property, we defined a trialwise summary measure of representation strength as the ratio of the odds of classifying the trial as the chosen option, to the odds of classifying it as one of the unchosen options. This oddsratio measure captured the extent to which the chosen option was represented more strongly than unchosen alternatives on each trial. Finally, we regressed trialbytrial model entropy (as a proxy for the entropy of participants’ beliefs) against representation strength (GLM4). This identified areas in which representation strength for the chosen option was higher when there was higher certainty that the chosen option was the high reward option.
Using this approach we identified a region in mOFC (peak = 10,40,–22; Figure 2b) in which representation strength for the exploited option was higher when model entropy was lower (corrected for multiple comparisons using permutation testing; cluster mass p<0.05 corrected with a cluster forming threshold of p=0.001). This was the only cluster to survive multiple comparisons correction. The same effect was also detectable in the hippocampus (t_{(18)}=2.11 p<0.05 uncorrected at an ROI centered on the peak voxel 34, 14, 16 for the exploit vs. explore univariate contrast from GLM 1; Figure 1—figure supplement 2). However the effect in hippocampus did not survive whole brain correction for multiple comparisons and should therefore be treated with caution.
The weighted representation of option preferences observed in mOFC is consistent with a probabilistic internal model of the environment, whereby the currently selected option has a higher decoding probability when model entropy is low (Figure 1b and Figure 2c, bottom). It is distinct from an action/policy representation, in which the representation of options is not modulated by entropy, as was observed in visual and motor areas (Figure 2c, top and middle).
To probe the robustness of the mOFC result, we repeated the analysis using two measures other than the model entropy regressor to predict representation strength. Entropy is a summary measure of the uncertainty across options; low entropy could arise from any of the four options being represented more strongly than the others. To check that the result was specific to the current state of the environment, we repeated the above analysis but instead of model entropy, used the probability (belief) assigned by the model that the currently exploited option is the high reward option as the predictor variable (GLM5). This also produced a specific effect in the medial OFC (Figure 2—figure supplement 1). To confirm our result was not dependent on the particular learning model we used, we simply compared representation strength during exploit periods with high and low payout rates. When the payout rate of the exploited, high reward option was 90% vs. 70%, representation strength was higher in mOFC (Figure 2—figure supplement 1).
What is the nature of the representation in mOFC?
The multivariate whole brain analysis showed that the identity of the chosen option could be decoded with greater confidence (representation strength was higher) when model entropy was low. Whilst OFC has been proposed as the site of a representation of the current state of the world (Schuck et al., 2016; Wilson et al., 2014), it is also concerned with value representation. Trials with low model entropy in our task tended to be the same trials on which a high value option was chosen (an option which the participant believed had a high probability of leading to a reward), and trials on which recent reward history contained few reward omissions.
The reason we interpret the multivariate results as evidence for a probabilistic representation of the current state (which option is the current high reward option) is that our ability to decode the identity of the chosen option from mOFC (indexed as representation strength for the identity of that option) was higher when the participant’s uncertainty (model entropy) was lower. Therefore decoding is driven by the relative weightings of representations of specific options rather than decoding a general value representation. Indeed, the training and test sets used in the analysis both included cases where the option value was higher and lower, which was unknown to the classifier. We examined whether our effects still held after regressing out a univariate effect of option value by repeating the analysis after first regressing out the effect of value from univariate activity in each voxel in the mOFC region of interest. The effect of model entropy on representation strength remained significant (t_{(18)}=7.62,p<0.001). This was also true when regressing out recent reward history (whether the current and previous three trials were rewarded) in addition to value (t_{(18)}=7.26,p<0.001) from voxelwise activity before carrying out multivariate analysis. Since these analyses were performed on the residual voxelwise data after the effect of value (or reward history) on voxelwise activity was removed, univariate option value could not drive the classification. Regardless, given the decoding effects were on specific option identities, decoding a general value representation cannot explain our results. Rather the decoding must be driven by different relative weightings of specific option representations, consistent with a probabilistic state representation.
We should reiterate here that in the current task, state and option value are inherently interlinked because the possible states of the environment are simply the alternatives for which is the high reward option, i.e. {option A is high reward option… option D is high reward option}. Whilst it is the representation strength of state identity that depends on uncertainty, the states themselves are defined in terms of option value. Therefore the present results in no way speak against the hypothesis that mOFC represents option values; in fact the opposite is true.
A candidate neuromodulatory mechanism for increasing flexibility of belief representations
Having confirmed that a probabilistic model of the state of the world is represented in the brain, we went on to ask how entropy in that model might be controlled. An intriguing possibility is that increases in neural model uncertainty are controlled via a neuromodulatory route. The neuromodulator noradrenaline has been proposed as a candidate mechanism by which the level of uncertainty in neural models of the world may be controlled (AstonJones and Cohen, 2005; Bouret and Sara, 2005; Iigaya, 2016; Martins and Froemke, 2015; Nassar et al., 2012; Preuschoff et al., 2011; Pulcu and Browning, 2017; Yu and Dayan, 2005). Noradrenaline is produced in the locus coeruleus (LC), and is released very broadly in the cerebral cortex including the mOFC (AstonJones and Cohen, 2005).
We used baseline pupil size (mean pupil area in the 20 ms before outcome presentation, expressed as % signal change relative to the participant’s mean pupil area over the task run) to index neuromodulatory state. Pupil size has primarily been interpreted in the literature as indexing noradrenaline levels, although the muscles controlling pupil size are sensitive to both noradrenaline and acetylcholine. Changes in pupil diameter have been linked to singleunit activity in the (noradrenergic) LC (Joshi et al., 2016). Recent physiological work has demonstrated that both noradrenaline and acetylcholine as measured in cortex, are associated with pupil size (Reimer et al., 2016); levels of both noradrenaline and acetylcholine were correlated with pupil dilation over a time course of up to a few seconds (a similar time course to the interval between trials in the current experiment). Therefore our pupillometry results cannot be interpreted as specific to noradrenaline but rather reflect a complex combination of neuromodulatory and other factors. However, Reimer et al. (2016) also report that noradrenaline, but not acetylcholine, was correlated with changes in pupil dilation (the temporal derivative of pupil size) as well as pupil size itself, which favours a noradrenergic interpretation of our pupillometry results (most of the pupillometry results reported below concern changes in pupil size rather than pupil size per se). However, we must emphasise that pupillometry is a very indirect measure of neuromodulation and although we mostly interpret our results in the context of noradrenaline due to the extensive literature on pupillometry and noradrenaline, as well as influential theories and experimental work linking noradrenaline to uncertainty, model updating and exploration, we note that our pupil size data cannot be specifically linked to noradrenaline over other pupillinked arousal factors such as acetylcholine.
Baseline pupil size was larger in explore than exploit trials (Figure 3a, left), and increased when transitioning into exploration (Figure 3—figure supplement 1). On a trialbytrial basis, reward omission produced a sustained increase in pupil size following feedback (Figure 3a, right).
We hypothesized that pupil dilation would predict changes in the level of uncertainty in the internal model. To test this hypothesis, we carried out a linear regression (GLM6) in which the change in baseline pupil size from one trial to the next was used to predict changes in representation strength in the mOFC from one trial to the next. In other words, we asked whether change in pupil size induced by observation of an outcome predicted decreases in representation strength between that trial and the next, providing evidence for changes in neuromodulatory state applying something analogous to a leak to the posterior belief or partial network reset (Bouret and Sara, 2005). The OFC representation strength was extracted from a region of interest (searchlight center) defined on the peak result shown in Figure 2b (change in model entropy was included as a coregressor of no interest).
This analysis indicated that changes in baseline pupil size indeed explained changes in representation strength in mOFC; as expected, this relationship was negative, such that increases in pupil size predicted decreases in representation strength. Interestingly, this relationship was limited to unrewarded trials (t_{(15)}=2.94, p=0.01); within the subset of rewarded trials there was no relationship (t_{(15)}=0.37, p=0.71; difference between rewarded and unrewarded: t_{(15)}=2.82, p=0.01; Figure 3b). This is significant because (due to the 4arm nature of the task and the fact that we included only exploitation trials in the analysis) increases in uncertainty occurred almost exclusively following reward omissions – on only 0.2% of core exploit trials (10/3833 trials over all participants) did uncertainty increase following a rewarded trial. Therefore the relationship between pupil dilation and change in representation strength seems to hold in cases where uncertainty is increasing, but not when it is decreasing. To test the temporal specificity of the result, we shifted the pupil and OFC signals relative to one another and recomputed the effect. We found that the result was no longer significant when shifting the signals in any direction (p>0.1 for all analyses), suggesting the effect was temporally specific to change in pupil size predicting change in representation strength from the current trial to the next. This effect remained significant when regressing out value from the univariate signal before fitting the multivariate model, as discussed above (unrewarded: t_{(15)}=3.07, p<0.01; rewarded: t_{(15)}=0.56, p=0.58; difference: t_{(15)}=2.95, p<0.01).
Although we focus here on the exploitation phase, we found this result replicated in the mOFC ROI when including only explore trials in the analysis (Figure 3—figure supplement 2; unrewarded: t_{(15)}=2.38, p=0.03; rewarded: t_{(15)}=0.97, p=0.35; difference: t_{(15)}=2.23, p=0.04). Furthermore, performing the analysis on unrewarded exploit trials over the whole brain reveals a region relatively localized to mOFC (Figure 3c). To illustrate this result further, we took the unrewarded exploit trials and looked at the change in representation strength in the mOFC ROI on trials separated according to a median split on change in pupil diameter. As expected, we found change in representation strength was more negative when change in pupil dilation was higher (Figure 3d; t_{(15)} = 2.72, p=0.02).
These results demonstrate that on trials when evidence against the current belief is observed (unrewarded trials), the change in uncertainty in the neural model is predicted by changes in pupil diameter, suggesting a role for neuromodulation in control of uncertainty in neural models. The fact the relationship is specific to unrewarded trials suggests that the relationship between pupil dilation and representation strength in mOFC is directionally specific to the process of increasing, not decreasing, model entropy.
Anterior cingulate cortex is activated when model entropy increases and predicts pupil change
Having provided evidence that neural model uncertainty may be increased via pupillinked neuromodulatory mechanisms, we next asked which brain regions are engaged during increases in uncertainty and may regulate changes in neuromodulatory systems. We used a whole brain GLM analysis to identify regions that were sensitive to increases in model entropy, even in the absence of overt action switches (analysis again limited to exploit trials as defined above; GLM7). This analysis identified activity in the ACC and adjacent preSMA that was directionally sensitive to change in entropy (Figure 3e, left). The greater the increase in entropy from the previous to current trial, the greater the activity in the ACC and surrounding regions of the dmPFC – that is, these regions were more active when entropy was increasing and less active when it was decreasing. These results are consistent with the observation ACC has been associated with the processing of feedback that should lead to changes of belief (O'Reilly et al., 2013).
ACC is one of the few cortical regions reported to have an afferent influence on the noradrenergic LC (Jodo et al., 1998) . This provides at least one anatomical route by which, if ACC is responsible for computing changes in beliefs, it may be able to broadcast increases in uncertainty to other brain regions via a neuromodulatory route. To probe the relationship between ACC and pupillometric effects, we asked where in the brain univariate activity predicts changes in baseline pupil size. We conducted a whole brain GLM analysis, again focused on the exploit period, in which univariate brain activity at each voxel was the predictor variable and change in pupil size the dependent variable; change in model entropy was included as a coregressor. Pupil dilation is globally correlated with neural gain (Eldar et al., 2013), and we observed a strong correlation of pupil size with mean BOLD signal across the entire grey matter. When we controlled for this global effect by including mean activity across the grey matter as a coregressor of no interest (GLM8), we observed a localized effect in ACC such that ACC activity on a given trial predicted the change in baseline pupil size from that trial to the next (Figure 3e, right; more activity in the brain associated with a greater increase in pupil dilation; corrected for multiple comparisons using permutation testing; p<0.05 at cluster forming threshold of p=0.001). Although we focus on the exploit period, this result is also present when conducting the analysis across all trials (Figure 3—figure supplement 3).
Given that ACC activity predicts pupil dilation and pupil dilation predicts changes in representation strength in mOFC, we tested whether there was a direct relationship between ACC activity and changes in representation strength in mOFC. However, there was no evidence for such a relationship (activity extracted from an ROI defined on peak of the ACC effect from either GLM7 or GLM8 did not have a significant relationship with change in representation strength in mOFC; p>0.1 for all trial subsets tested). This may reflect the fact that both brain measures have relatively low SNR compared to the pupillometry data, to which both were significantly related.
Although we focus on the exploitation phase in which overt behavior was stable, it is worth noting that a very similar pattern of activity in medial prefrontal cortex was present when participants switched from exploitation to exploration (GLM2, Figure 3—figure supplement 4). The activity observed in ACC in this analysis is activity over and above that driven by changing response/option selection (GLM2), which (by definition) occurs on the last trial of exploitation. Note that this effect is specific to exploration initiation, and was not observed when switching strategy from exploitation to exploration (Figure 3—figure supplement 4), suggesting a particular role in disengaging from the exploited strategy.
Discussion
We set out to investigate the mechanisms by which entropy in neural models of the world may be controlled, particularly mechanisms by which entropy may be increased. We identified patterns of activity indicative of a probabilistic model of the environment in the medial OFC. The hallmark of this probabilistic model is that it has stronger representation strength for the currently selected option when model entropy is low. This is distinct from sensory and motoric representations in visual and motor cortex that are not sensitive to model entropy. Changes in representation strength in medial OFC, over and above those predicted by our task model, were explained by change in pupil dilation, a proxy measure for neuromodulator activity. Pupil dilation was in turn predicted by activity in the ACC – a region also active when model entropy increased.
That an internal model of the task is represented in mOFC is consistent with the proposal that this region represents the current location in task state space (Wilson et al., 2014), and in particular represents information that is hidden to participants but required to solve the task (Chan et al., 2016; Schuck et al., 2016). Our work extends this proposal to include the representation of the strength of dynamic beliefs in a probabilistic task in which participants infer the hidden optionreward contingencies that define their location in the task state space.
The fact that mOFC is the locus of the probabilistic model in the current task is probably due to the choice of task, as mOFC plays a key role in representing stimulusactionoutcome contingencies during modelbased action (Jones et al., 2012). Different stimulus or task domains could rely on probabilistic models represented in other, task relevant parts of cortex. Furthermore, the exact nature of the representation we are interrogating in mOFC is an interesting avenue for future research. For example, it is possible the representation we observe here reflects the online representation of optionreward contingencies required to guide choice – for example, for online comparison – and is not where the optionreward contingencies are stored offline, which may be in another region such as hippocampus (Boorman et al., 2016) or striatum (Packard and Knowlton, 2002). The hippocampus showed some limited evidence of probabilistic option representations in the current study. The striatum also plays a key role in representing actionoutcome associations, and indeed mechanisms for maintaining flexibility within these representations have been proposed (Franklin and Frank, 2015). In the current study we were unable to detect evidence of a probabilistic model in the striatum. This could be because our multivariate analysis relied upon patterns across many voxels and was therefore relatively less sensitive to effects in small structures. Regardless of its exact nature, we are able to use the probabilistic model of the task space detected in mOFC as an index of the neural representation of participants’ beliefs; that is, a neural measure of an internal model.
We next sought evidence for a role of neuromodulatory systems in increasing uncertainty in such an internal model. Although most theoretical work in this field has focussed on noradrenaline, it should be noted that our results concern pupil dilation, not noradrenaline per se. Pupil dilation depends on at least two neuromodulators (noradrenaline and acetylcholine) (Joshi et al., 2016; Reimer et al., 2016), as well as other arousalrelated factors, so our results cannot be linked directly to noradrenaline.
A large body of theoretical work has ascribed noradrenaline various functions relating to the modification of internal models (changing gain (AstonJones and Cohen, 2005), network resets (Bouret and Sara, 2005), and signalling unexpected uncertainty (Preuschoff et al., 2011; Yu and Dayan, 2005)). However, evidence for the impact of noradrenaline on internal models has been relatively sparse, partly due to the requirement of measuring the internal model. In one significant rodent study, increasing noradrenaline results in broader – less selective – tuning curves of auditory neurons (akin to a network reset) (Martins and Froemke, 2015). Such an impact on representations is what is required for noradrenaline to implement its proposed role in belief updating (Nassar et al., 2012) and increased weighting of current evidence versus prior belief on choice (Krishnamurthy et al., 2017). By simultaneously measuring pupil size and an internal model in mOFC, we observed changes in pupil diameter predict increases in uncertainty in an internal model, demonstrating a potential analog of this rodent work in humans performing complex decisionmaking tasks, and suggesting a broad role of pupillinked neuromodulation in reducing selectivity of cortical representations.
We then asked which brain regions contain signals associated with changes in the certainty of beliefs and may be involved in regulating the pupillinked neuromodulatory system. We found ACC tracked changes in the certainty of beliefs, consistent with observations that it monitors the outcomes of one’s actions (Walton et al., 2004), predicts the updating of beliefs (Behrens et al., 2007), and is specifically engaged when internal models need to be updated (O'Reilly et al., 2013). ACC was particularly involved when uncertainty should increase, consistent with its role in exploratory behavior (Hayden et al., 2011) and information seeking (Stoll et al., 2016). While the present study provides evidence for this role of ACC, it was not designed to dissociate it from other roles ascribed to ACC, such as cognitive control (Shenhav et al., 2013) and choice difficulty (Shenhav et al., 2014). Rather, we double dissociate ACC from other regions representing an internal model (mOFC), and provide evidence ACC may broadcast information to such regions via neuromodulatory systems.
Anatomical work suggests ACC is one of the few cortical regions with descending control on noradrenaline release (Jodo et al., 1998). Further, recent studies have shown that activity of ACC neurons predicts pupil dilation and may temporally lead pupilpredicting LC activity (Ebitz and Platt, 2015; Joshi et al., 2016). Therefore it is anatomically plausible that ACC could influence belief uncertainty via the noradrenaline system, possibly amongst other routes. However, whilst both ACC and noradrenaline have been implicated in belief updating (Karlsson et al., 2012; O'Reilly et al., 2013) and setting the learning rate (Behrens et al., 2007; Nassar et al., 2012), evidence for their suspected (AstonJones and Cohen, 2005; Karlsson et al., 2012; Nassar et al., 2012) coordination during tasks requiring belief updating has been lacking. Here we found ACC activity explained changes in baseline pupil dilation, over and above the effect of changes in uncertainty and mean brain activity. Whilst our pupil results cannot be linked directly to noradrenaline, they are compatible with the hypothesis that the noradrenaline system could offer a mechanism via which the computation of uncertainty of beliefs in ACC may be broadcast to internal models.
An alternative explanation for the relationship between ACC and pupil dilation is that they are associated, not with a change in beliefs, but with a change in the drive to produce exploratory behavioural output. The circumstances in which model entropy should increase are the same circumstances that should drive participants towards exploratory (random) behavior. Indeed, increasing noradrenaline release to the ACC has been associated with the output of stochastic behavior (Tervo et al., 2014). However, an account in which ACC activity and noradrenaline release only track a drive to produce stochastic behavioural output, or to explore, cannot account for the association between pupil dilation and the increase in model uncertainty recorded in mOFC, in the absence of any behavioural change. Further, the animal work described above (Martins and Froemke, 2015) provides causal evidence that noradrenaline reduces selectivity of cortical representations, supporting the proposed role for noradrenaline in reducing the strength of beliefs, in addition to its role of increasing stochastic behaviour (Tervo et al., 2014). This hypothesis can be summarized in terms of economic models of choice as arguing that noradrenaline input to ACC alters the temperature parameter of a softmax choice model whereas noradrenaline input to internal models (mOFC) alters the option values, resulting in behavioural stochasticity (Tervo et al., 2014) and entropic beliefs (present results), respectively.
We started by asking whether there was a mechanism for increasing uncertainty in neural models. The results presented, in which increases in belief uncertainty engage ACC, which explains changes in pupil size, that in turn explain changes to the selectivity of internal models following evidence the underlying state of the world has changed (reward omission), provide a candidate mechanism. These results suggest different neural mechanisms may be engaged when forming as compared to revising beliefs, and that neuromodulatory systems may play a crucial role in implementing flexibility of internal representations during decisionmaking in humans.
Materials and methods
Participants
Twentytwo healthy human volunteers participated in the experiment. Participant identities were anonymized for analyses. All participants were righthanded, aged between 20 and 31 (mean age: 26), and eight were female. All participants had normal or corrected to normal (with contact lenses) vision. Three participants had to be excluded because of technical difficulties with fMRI, and three had to be excluded from pupil analyses because of technical difficulties with eyetracking in the fMRI scanner. Power analyses are as follows. For univariate effects of model updating: Based on a previously published result demonstrating an effect of model updating in the ACC (O'Reilly et al., 2013), we performed a simple apriori power calculation. This effect from O'Reilly et al. (2013) had a Cohen’s d of 0.89; for this effect size, 80% power and an alpha level of 0.05 (one tailed) the minimum sample size was calculated as 17 subjects. For multivariate effects, no a priori power calculations were performed as effect sizes for multivariate analysis are notoriously hard to estimate in advance, depend on the type of information to be decoded and vary across cortical regions – given that we used a whole brain searchlight approach this would make it very difficult to prespecify an effect size (Bhandari et al., 2018). We present a posteriori power calculations because of a reviewer request. The effect size (Cohen’s d) for the key result that representation strength is predicted by model entropy was d = 1.58. For a power of 80% and alpha = 0.05 one tailed, the minimum sample size was calculated at six individuals. Whilst we would not be confident in a result based on such a small sample, we note that the regression analysis did give a negative regression coefficient (high entropy predicts low representation strength) in 18/19 individuals in our group. For comparison, Schuck et al (Schuck et al., 2016) decoded task states from a similar medial OFC region to that found in the present study, in an a priori ROI. Their effect size (accuracy of decoding task state vs. chance) was d = 1.77, which would suggest a minimum sample size of 5. For pupillometry effects, the most relevant study in the literature is Nassar et al. (2012). In this study the effect size for a regression of pupil average (similar to our baseline pupil measure) on relative uncertainty yielded an effect size of d = 0.94, which would suggest a minimum sample size of 16 for this analysis. We collected data from a number of participants satisfying the estimated size effects, and in line with other papers in the field. The study was approved by a local University of Oxford ethics committee (ref: MSDIDRECC12013171), and all participants gave written informed consent.
Task
Participants were presented with eight options (Figure 1a, Figure 1—figure supplement 1), four of which were available for selection in each run of the task (participants completed four runs of the task – see below). The four available options were distinguished from unavailable options by their colour (blue or red, counterbalanced across participants). They were instructed that of the four available options, one had a high and the remainders a low probability of giving rise to reward, if selected. Participants selected, on each trial, any and only one of the four available options by pressing its corresponding button. Following this selection, reward was delivered probabilistically according to whether the option selected was the high probability (70 or 90% probability of reward) or one of the low probability (20%) options, and was displayed on the screen at the time of outcome.
The participants’ objective was to accumulate as much reward as possible. Hence, the optimal strategy was to always select the high rewardprobability option. This high reward option moved randomly to one of the three other options after a variable number of trials (mean 20 trials, SD 5) and independently of the participants’ behaviour – adding to the unpredictability (due to the variable length of the exploratory period). When the high reward option moved, it was randomly selected with equal probability to have a 70 or 90% chance of reward delivery. The task was embedded in a game in which participants controlled a mouse seeking cheese (Figure 1—figure supplement 1). Following selection of an option, the mouse moved to the option and the outcome (reward or no reward; cheese or apple, counterbalanced across participants) was revealed and remained on the screen for 200 ms. The next trial began soon after (jittered with a uniform distribution between 0 and 4 s; mean ITI of 3.44 s). The short spacing between trials was necessary, as we required large numbers of trials to drive participants between many periods of exploitation and exploration and because we focused our analysis on a subset of trials. The close trial spacing was also designed to encourage a sense of consistent engagement with a particular strategy within exploit blocks.
Participants completed two sessions – one behavioural and pupillometry only, and one additionally with fMRI – each containing four runs of 200 trials (1600 trials total per participant). The available (blue/red) stimuli switched across runs (see below). Before the first session, participants were given detailed instructions of the task. These included underlying statistics: they were explicitly told the probabilities of reward, although not the true underlying frequency at which the high reward probability jumped. Participants were told the probabilities so they would have comparable understanding of the task when doing the fMRI session. Selecting each of the eight options corresponds to a button press for each of the eight fingers. Before starting the first session, participants were trained on the mappings between the response buttons and the options by doing 300 trials of a training task, whereby one option would be highlighted and they would be required to press the corresponding response button. After this, they performed the four runs of the task in the behavioural session. The following day, participants completed the fMRI session, which similarly required participants to complete four runs of the task. The task and all instructions were the same across the two sessions. Participants were additionally instructed to centrally fixate the fixation cross throughout the whole experiment in both the behavioural and fMRI session.
The motive for having eight options but only four of which were available for selection was to test whether there was any difference between options available for selection but not selected and those not available for selection. The configuration of available options differed across the runs such that all options were available in two of the runs (to allow for training and testing sets of the multivariate model). Therefore the unavailable options were intended to act as a baseline. Analyses testing for differences between available but not selected and unavailable options did not yield significant results (data not shown).
Modelling of behaviour
In order to generate regressors to analyse neural data, we constructed a normative Bayesian learning model that reflected information communicated to participants – specifically, that one option had a high probability of reward and the remainders a low probability of reward.
We implemented a normative Bayesian learning model to simulate how belief uncertainty should change depending on trial history. The model had information that reflected that which was communicated to participants. We informed participants that there was always and only one location with an either 70 or 90% probability of receiving reward and the remainders only 20%. The model estimated the probability that each of four hypotheses ${H}_{t}\in \mathrm{}\{\mathrm{1,2},\mathrm{3,4}\}$ is correct – one hypothesis for each of the four locations that could be chosen on any given trial ${L}_{t}\in \mathrm{}\{\mathrm{1,2},\mathrm{3,4}\}$ being the high rewardprobability location. In each of these hypotheses, the three low probability locations were assigned p(reward)= 0.2, which was fixed. The other, high rewardprobability location was assigned probability P, where P could take any value in the range 0.3 to 1. We allowed P to be more flexible to allow for possible differences in participants’ estimates of probabilities, but constrained it to be greater than the default payoff of 0.2 by imposing a lower bound of p=0.3.
This information can be thought of as represented in a payout matrix – a matrix whose elements correspond to a probability of reward being delivered. The payout, or outcome, on trial t, ${O}_{t}$, was determined by a binomial distribution parameterised by q:
where q was obtained from the payout matrix M. M depends on which location is the true high rewardprobability location H, what the payout probability P at the high rewardprobability location is, and which location L was selected on that trial.
The posterior probability that each location was the correct one, H, and the payout probability P takes some value p, was found by considering the payout probability q associated with the cell ($h,{L}_{t},p$) in the payout matrix:
The posterior probability over q, $p\left(q=\text{}{\mathrm{M}}_{h,l,p}{O}_{1:t},{L}_{1:t}\right),$ was given using Bayes’ rule:
The likelihood $p\left({O}_{t}{q=\mathrm{M}}_{h,l,p},{L}_{t}\right)$ was simply:
meaning that the likelihood distribution was a 2D matrix over
The prior $p\left(q={\mathrm{M}}_{h,{L}_{t},p}{O}_{1:t1},{L}_{1:t1}s\right)$ was none other than the probability of the values H=h and P=p, given all previous observations and the switch probability s:
This was obtained from the previous trial’s posterior after accounting for the possibility of a switch in H with a leak. The probability of a switch s was fixed and modelled as 1/20. This reflects the true underlying mean frequency of a switch, but was unknown to participants (we could model this flexibly – however, once in the fMRI session, participants likely have a rough gauge of how frequently there is a switch due to having already completed 800 trials of the task in the behavioural session). The prior was therefore given by:
where $U\left(\mathrm{M}\right)\bullet s$ is a uniform leak over the parameter space, weighted by s, allowing the model to ‘forget’ past outcomes (otherwise the estimates in the posterior will just converge to the mean of all estimates).
To obtain the model entropy over Hypotheses (i.e. the spread of belief over which of the four possible locations was the high reward location), we marginalized over p:
This gave the marginalised posterior shown in Figure 1e. We then applied the formula for Entropy:
This was the equation that gave the entropy regressor used in analyses and shown in Figure 1d.
fMRI acquisition and preprocessing
In the fMRI session, four blocks, or runs, of functional data were acquired, with mean ~12 min each. Data were acquired on a 3T Siemens TIM Prisma, using a 64channel head and neck coil. Functional scans were collected using a gradient echoplanar imaging (EPI) sequence, Multiband 4, with TR = 1.35 s, TE = 32 ms, flip angle = 74 degrees, and voxel resolution of 2×2×2 mm. Structural scans were acquired using a T1weighted MPRAGE sequence, 1×1×1 mm voxels. All scans were of axial orientation angled to ACPC, covering the whole brain.
Preprocessing and univariate analysis of fMRI data was performed using tools from fMRI Expert Analysis Tool (FEAT), part of FMRIB’s Software Library (FSL). Data were preprocessed using the FEAT default options. Registration of EPI images to highresolution structural images and to standard (MNI) space was performed using FMRIB’s Linear and NonLinear Registration Tool, respectively (FLIRT and FNIRT). For GLMs run using FEAT, the four runs of the task for each participant were pooled using fixed effects, and then highlevel analyses across participants were performed using mixed effects analyses. For the multivariate and pupilbrain analyses, data was concatenated across runs for each participant and then the GLM was run. These GLMs contained a bias term for each run separately to remove any differences across runs (i.e. the GLM had four bias/main effect regressors, each with ones for one of the runs and zeros for the other three).
fMRI analyses
For univariate brain GLMs, we locked each trial event to the time of outcome presentation and convolved the BOLD signal with the canonical HRF. The same was performed for generating wholebrain beta maps for the multivariate analyses, which were then used to calculate representation strength (see fMRI multivariate measure of representation strength). We then ran the following GLMs that are presented in the main paper, where the dependent variable was either univariate brain activity, multivariate representation strength, or change in baseline pupil diameter. Analyses were carried out either on all trials or only included trials in the exploit period that were at least five trials from the adjacent explore periods so that any variation is not due to changes in action. These GLMs are specified below:
GLM1 included the following regressors: Exploit, explore. Dependent variable: Univariate brain activity. A [1 1] contrast tested for differences between exploitation and exploration, with positive activation indicating regions more active in exploitation than exploration, and vice versa for negative activations. All trials included in the analysis. The goal of this analysis was to identify regions that were differentially active in exploitation and exploration phases of the task.
GLM2 included the following regressors: Main effect (bias; column of ones), entropy, whether reward was delivered at outcome, whether participants changed their choice on the subsequent trial, whether participants transitioned from exploitation to exploration on the subsequent trial (last exploit trial), whether participants transitioned from exploration to exploitation on the subsequent trial (first exploit trial). Dependent variable: univariate brain activity. All trials included in the analysis. The goal of this analysis was to identify regions sensitive to certain task factors, as well as to test for regions with activity at the transitions between exploitation and exploration (over and above that explained by the other task regressors, such as switching options).
GLM3 included the following regressors: Exploit, explore, and all the regressors from GLM2 except the main effect regressor. Dependent variable: Univariate brain activity. A [1 1] contrast tested for differences between exploitation and exploration. All trials included in the analysis. The goal of this analysis was to investigate whether much of the difference between exploitation and exploration was explained by the task factors in GLM2.
GLM4 included the following regressors: Main effect (bias; column of ones), model entropy. Dependent variable: multivariate representation strength. Analysis carried out as a multivariate searchlight analysis. Only exploit trials (>5 trials from adjacent explore phase) included in the analysis. The goal of this analysis was to identify regions in which the representation strength varied with model entropy.
GLM5 was the same as GLM4 except instead of the entropy regressor used the probability (belief) assigned by the model that the currently exploited option is the high reward option. The goal of this analysis was to show representation strength in the region identified in GLM4 varied with the belief over the currently selected option (not just entropy, which captures the certainty of the belief).
GLM6 included the following regressors: Main effect, change in baseline pupil diameter between the current trial and the next, change in entropy. Dependent variable: change in multivariate representation strength in mOFC (also performed as a whole brain searchlight analysis, as in Figure 3c). Analysis performed separately on only exploit trials (>5 from adjacent explore phase), and only explore trials. When this analysis was performed on explore trials, whether participants changed their choice was added as a regressor to the GLM. These analyses were performed separately on trials on which reward was delivered and omitted. The goal of this analysis was to test whether changes in pupil diameter explained changes in representation strength.
GLM7 included the following regressors: Main effect, change in entropy. Dependent variable: Univariate brain activity. Only exploit trials (>5 from adjacent explore phase) included in the analysis. The goal of this analysis was to identify regions in which activity reflected changes in the model entropy, and to test whether they were dissociated from those in which representation strength varied with model entropy (as in GLM4).
GLM8 included the following regressors: Main effect, brain univariate activity, difference in entropy following outcome between current trial and the previous, and wholebrain mean univariate activity. Dependent variable: change in baseline pupil diameter between the current trial and the next. Only exploit trials (>5 from adjacent explore phase) included in the analysis. (When this analysis was performed on all trials, Figure 3—figure supplement 3, whether participants changed their choice on the subsequent trial and whether the current trial was the last trial of exploitation were added as regressors to the GLM.) The goal of this analysis was to identify brain regions in which univariate activity explained changes in baseline pupil diameter.
Multiple comparisons correction was performed with cluster massbased permutation testing with the FSL function randomise. The clusters were thresholded voxelwise at p=0.001 and cluster masses surviving correction at p<0.05 are presented.
fMRI multivariate measure of representation strength
To obtain a measure of representation strength at each grey matter voxel, we ran a multivariate searchlight analysis over the wholebrain grey matter. We trained and tested the model on different runs of the task. The configuration of available options differed across the runs, such that each option appeared in two different runs. We therefore split the runs in to two pairs – each pair containing all eight options. We train the model on one pair and test on the other, and then repeat, training and testing on the converse pairs.
The data used as input to the multivariate analyses are wholebrain, standard space beta maps for each trial. We then focus our analyses on the cortex using a grey matter mask. At each grey matter voxel, a set of voxels for that searchlight centre is defined by including all voxels within a radius of seven voxels from the searchlight centre and that lie within the grey matter mask. After performing dimensionality reduction on the voxels and ensuring a balanced training set (see following paragraphs), we train a multinomial logistic regression model using the training set and then use this model to obtain probabilities each of the eight locations was selected on each trial of the test set (see below). This is then repeated switching the training and test sets. The result of this analysis is a probability each of the eight options was selected on each trial of the task at each grey matter voxel.
At each searchlight of the multivariate analysis, we performed dimensionality reduction and ensured the number of exemplars of each option in the training set was balanced. We performed dimensionality reduction to denoise the data (Mante et al., 2013) and to facilitate training the multivariate model (so that we didn’t have many more dimensions [voxels] than exemplars [trials]). We concatenated the data across runs, and then at each searchlight centre extracted an nVoxels x nTrials matrix. To perform the dimensionality reduction, we carried out principle components analysis on the voxelsvoxels covariance matrix. This resulted in nVoxels principle components (PCs) – each a vector of length nVoxels. We then kept only the top twenty PCs, which captured the 20 strongest spatial modes of variation (shared variance across voxels). The eigenvalues associated with the remainder of the PCs were all near 0, suggesting that the 20 PCs we retained captured most of the spatial variance of activity. Having conducted the PCA, we obtained the contributions of each PC on each trial by linear regression of the voxelwise patterns of activity on each trial against the PC vectors. This resulted in a set of 20 PC contributions (the beta weights from the regression) for each trial – a reduceddimensionality (20 dimensions) representation of the activity on a given trial. These were taken as the dimensions to be used as predictors in fitting the model. The 20 PC weights on each trial were entered as predictor variables in to a multinomial logistic regression with the selected option as the predicted variable to train the model.
We performed dimensionality reduction on the entire dataset (800 trials). Because this data set included both training and test set, a reviewer raised the possibility that the procedure could introduce bias into the pattern classification. Since the PCA is agnostic to trial labels it is not clear how such bias could arise, but to check that the data reduction step did not give rise to spurious results we conducted a permutation analysis in which 1000 sets of null data (obtained by shuffling voxel identities within each trial) were passed through the entire analysis pipeline in the ROI defined on the peak of the OFC effect. The resulting t scores for the regression of representation strength on model entropy (the key result from this analysis) were symmetrically distributed about zero (Figure 2—figure supplement 2). The maximum value of t for these 1000 iterations was 3.09, compared to the peak tscore of 6.8 obtained in mOFC. This suggests that our result is not due to bias introduced by the data reduction procedure.
Because subjects had a free choice as to which option they selected, we had variable numbers of trials on which a given participant selected each option (i.e. an imbalanced training set). To avoid introducing bias in to the model to predict stimuli selected more often, we oversampled the training set such that there were equal numbers of trials of each stimulus type used to train the model. We did this by repeating the trials in the training set for each option until all classes had the same number of trials as the most selected option.
After training the model, we used it to provide probabilities of each option being selected on each trial in the remaining data. For each trial in the test data, we used the PC beta weights on that trial as input to the model to obtain the estimated probability that each option was the selected option. This gave a probability on each trial for each of the eight options. To obtain a summary measure, we computed the ratio of the odds of classifying the trial as the chosen option, to the odds of classifying it as one of the offered but unchosen options (i.e. those options that were available for that run of the task but not selected on that trial). We took the logarithm of this ratio, providing us with a measure of the selectivity to the selected action of each searchlight on each trial, namely the log odds ratio, that is in theory normally distributed, although we used nonparametric statistics (permutation testing) to assess the significance of results at the whole brain level. This trialbytrial scalar measure was then used in subsequent regression analyses to identify regions in which it is explained by participants’ belief entropy or baseline pupil diameter.
Region of interest definitions
The ACC ROIs for the analyses to test whether ACC activity explained changes in representation strength in mOFC were defined by thesholding the activation maps for both GLM7 and GLM8 at p=0.001 and p=0.01 separately. Hence we tried the analysis with four different ROIs separately. For ROIs we extracted the timecourses by applying the inverse of each participant’s registration to the mask to project the mask from standard space in to subject space and extracted the mean time series within this region of interest.
Pupil acquisition and analysis
We simultaneously measured pupil size while participants performed the task, in both the behavioural session and the fMRI session. We used an EyeLink eyetracker, SR Research to acquire pupil size data, sampled at 1000 Hz.
We preprocessed the pupil size data using standard methods. We identified trials on which subjects blinked and removed the blinks by interpolation of values measured just before and just after the identified blinks. Trials on which the pupil measurement was lost were removed from the analyses. Finally, we removed the first 25 trials of each run of the task from analyses to avoid any luminance changes that may be caused by starting a new run of the task. After removal of these different trials, the number of core exploitation trials went from 201 ± 7.6 to 162 ± 9.4 (mean ± SEM).
Pupil size in all analyses is normalised for each run by expressing it in terms of percentage difference from the overall mean pupil size on that run. Baseline pupil size is defined as mean (normalised) pupil size in the 20 ms window preceding feedback.
Appendix 1
Modelling of beliefs and behaviour
This appendix follows directly from the Modelling of behaviour section in the Methods. It includes additional modelling of participants’ behaviour using alternative models.
Relative Uncertainty
As an alternative measure of belief uncertainty, we calculated relative uncertainty (Nassar et al., 2012). Relative uncertainty is defined as the uncertainty about (variance of) the outcome given the most likely state of the world (defined as joint maximum likelihood estimate of $\left\{{H}_{t},{q}_{t}\right\}$ divided by uncertainty about the outcome given the full posterior over $\left\{{H}_{t},{q}_{t}\right\}$.
Over all participants and runs, Relative Uncertainty and Entropy were highly correlated (mean value of Pearson’s r over all participants = 0.80, SEM = 0.0056). This reflects the fact that both measures attempt to quantify the same construct (belief uncertainty). The evolution of entropy, relative uncertainty and change point probability (see below) over a representative run is shown in Appendix 1—figure 1 to give the reader a sense of the similarity between these measures.
Change Point Probability
Change point probability on trial t (CPP_{t}) is defined as the posterior probability that a change occurred between trial t1 and trial t, after observing data point x_{t}. From Bayes’ theorem, this is the probability of observation x_{t} given that a change point has occurred, multiplied by the prior change point probability (i.e. inverse change point frequency), divided by the probability of observation x_{t} regardless of whether a change point has occurred.
In our model this translates as
As per our Bayesian model, $p\left({O}_{t}{L}_{t}\ne {H}_{t}\right)$, the probability of the outcome given that there has been a change point and the observer has chosen a low reward location, is given by a binomial with a constant payout parameter (0.2). The probability of the outcome given that there has been a change point and the observer has chosen the high reward location is given by a binomial with the mean payout rate for high reward locations (0.8).
As per Nassar et al (Nassar et al., 2012), for the purposes of calculating CPP we define the probability of the outcome given no change point, $p\left({O}_{t}{L}_{t}={H}_{t}\right)$, as the probability of the outcome ${O}_{t}$ given that the true state of the world was the joint maximum likelihood values of H and q on trial t1, i.e. the most likely combination of the identity and payout rate of the high reward option.
Over all participants and all runs, CPP moderately correlated with entropy (mean Pearson’s r over all participants 0.37, SEM 0.0055) and weakly correlated with relative uncertainty (mean Pearson’s r 0.20, SEM 0.0053). Conceptually, CPP differs from both entropy and relative uncertainty in that CPP measures only the probability of a change point given the current observation, whilst entropy and relative uncertainty integrate information across multiple trials. This can be seen in the example plot of entropy, relative uncertainty and CPP, in which CPP rises and falls on a single trial basis whilst entropy and relative uncertainty vary more smoothly over trials (Appendix 1—figure 1).
Predicting participants’ choices
Although the purpose of our modelling was to provide trialbytrial regressors to relate to neural responses, rather than to optimally capture behaviour, we checked our model was a reasonable fit to behaviour. To determine whether our model was a reasonable fit to the behaviour of our participants, we defined an action policy by fitting a softmax function for each participant and run, in which the model probabilities assigned to each option (of being the high reward option) were used to predict option choice, with softmax temperature as a free parameter. The model’s predicted choice (defined as the option with the highest probability under the softmax) matched the participants’ actual choices on 94.2% of exploit trials (averaged over all participants; SEM 0.70% across participants) and 31.5% (SEM 1.3%) of explore trials.
Arguably it is more appropriate to fit two separate softmax functions with independent temperature parameters for explore and exploit phases of the task, since behaviour may be qualitatively different in the two cases (for example, behaviour could be more random in the explore phase, which would be reflected in a higher softmax temperature). When we fit softmax functions separately to explore and exploit trials, with temperature as an independent free parameter for each trial category, 96.0% of exploit trials (SEM 0.57%) and 50.9% of explore trials (SEM 1.8%) were predicted correctly. Notably the proportion of explore trials predicted correctly exceeded the expected proportion if people chose at random during explore phases, which would be 25% (one sample ttest comparing the proportion correctly classified to 25% across participants: t_{18} = 13.2, p=5.2×10^{−11}).
We compared the overall model log likelihood for the Bayesian model with a single softmax policy, and two softmax policies (fitted separately to explore and exploit periods) to two further models, which should represent a ceiling in terms of how well the model matches the actual data, in that they are fit to participants’ responses rather than to latent variables reflecting participants’ modelled beliefs. The first comparison model was a model in which the probability of choosing the selected option was always one during exploit periods and always 0.25 during explore periods. We call this the ‘all or none’ model because its behaviour is totally deterministic during exploit periods and totally random during explore periods. The second comparison model was a Hidden Markov Model (HMM) described below in which behaviour was classified as ‘explore’ or ‘exploit’ based on block length, and then the probability of choosing the selected option was always one during exploit periods and always 0.25 during explore periods as for the allornone model. In each case the model log likelihood was defined as the log likelihood of the model making the same choice as the human observer, summed over all trials.
Over all participants, the HMM had the best model log likelihood (mean over all participants = −165), closely followed by the Bayesian model with two separate softmax policies (−179). Both models did much better than the allornone model (−308) and the Bayesian model with a single softmax policy (−473). Model log likelihoods for all models are shown in Appendix 1—figure 2.
Does belief uncertainty predict ExploreExploit transitions?
An important feature of our model based approach was that we were able to model belief uncertainty as entropy or relative uncertainty in the model’s hypothesis space. To confirm the validity of this approach, we asked whether high entropy, relative uncertainty or change point probability made participants more likely to transition from exploiting an option to exploration.
Limiting our analysis to trials classified as ‘exploit’, we used logistic regression to ask whether the probability of switching option on trial t + 1, given that a trial t was unrewarded, depended on the relative uncertainty, entropy or change point probability on trial t (three separate regression analyses). We limited the analysis to cases where trial t was unrewarded, because participants switched chosen option only following unrewarded trials (mean proportion of switch trial following an unrewarded trial = 97.4% across all participants and runs). 21% of all trials over all participants were both classified as exploit and unrewarded and hence used in this analysis.
Both higher entropy and higher relative uncertainty predicted a higher probability of switching (entropy  t_{18} = 7.60, p=2.57×10^{−7}; relative uncertainty  t_{18} = 13.50, p=3.7×10^{−11}).
Interestingly, high change point probability on a given trial t predicted a lower probability of switching (change point probability  t_{18} = 6.45, p=2.28×10^{−6}). The reason for this can be seen in Appendix 1figure 3 in which change point probability for the trials in the run up to an exploitexplore transition is plotted. It is evident that CPP actually peaks 2–3 trials before the transition (Appendix 1figure 3). This lag is reflective of the fact that CPP is driven primarily by reward omission on the current trial t, whilst participants’ behaviour is guided by integrating outcomes over multiple trials, and thus participants were unlikely to switch on an unrewarded trial, unless other recent trials were also unrewarded. The proportion of exploitexplore transitions for which trial t1, t2 and t3 were unrewarded was (mean ±SEM across participants): 13.1% (±2.36%), 57.5% (±6.4%) and 73.9% (±2,6%) respectively.
Both entropy and relative uncertainty are measures of the belief uncertainty that arises from reward omission integrated over multiple trials. To obtain a modelagnostic view of factors predicting exploitexplore transitions, we performed a logistic regression (limited to unrewarded trials in exploit periods) using reward omission on the previous four trials (t1 to t4) as explanatory variables. Reward omission on trials t1 and t2 significantly predicted option switching (i.e. exploitexplore transitions); on trial t3 the effect was just n.s. and on trial t4 the effect was clearly n.s. (p values for trials t1 to t4 respectively: 3.81 × 10^{−6}, 7.63 × 10^{−5}, 0.064, 0.65 – nonparametric sign test used as the regression coefficients were grossly nonnormal across the group). These results further confirm that participants integrated reward omission over 3–4 trials (including the final exploit trial, t) to drive exploitexplore transitions.
Classification of behaviour into Explore and Exploit phases
Throughout the main manuscript we used a heuristic method to define exploitation periods. We defined the start of a period of exploitation as the first trial on which participants selected the true high reward probability option and received reward, and the end of the exploit period as the last trial before switching to a different option (thus within exploit periods, by definition, all choices were for the same option). We furthermore defined ‘core exploit’ trials as trials within an exploit block, more than five trials from the preceding and following changes of action. The primary aim of this heuristic was to identify, in a conservative way, trials that we could be confident represented stable exploitation behaviour. This was important because we wanted to carry out a number of analyses only on ‘core exploit’ trials – trials where overt behaviour was stable but belief certainty varied.
It is possible to take a more principled, modelbased approach to the definition of explore and exploit phases of the task and we present such an analysis here to justify the use of our heuristic.
Ebitz and colleagues (Ebitz et al., 2018) gave careful consideration to how trials may be classified as explore or exploit in a task similar but not identical to the present task (three arm bandit with independently drifting bandit values; the present task is a four arm bandit but the bandit values are tied to each other). They concluded that a Hidden Markov Model (HMM) may be used to classify trials into explore and exploit states, based on block length. They fitted a double exponential distribution to block lengths, with the logic that block lengths are shorter in explore periods and longer in exploit periods. They then fitted a hidden Markov model to participants’ choices, in which the hidden states were exploit (one state per option) or explore.
We used the same approach to classify trials as explore or exploit in the present task. The hidden states were {exploit option 1, exploit option 2, exploit option 3, exploit option 4, explore}. In each exploit state the only possible observation was the selection of the exploited option. In the explore state all options were equally likely to be selected. Transitions were possible (end equally likely) from the explore state to any exploit state. Transitions were also possible from each exploit state to the explore state, but not directly to another exploit state. The probabilities of transitioning between states were determined by the fitted time constants of the block length distributions for explore and exploit states; where ${\beta}_{1}$ was the timeconstant of the fitted exponential distribution of block lengths in the explore condition, $p}_{1}=\frac{1}{{\beta}_{1}$ was the probability of transitioning from exploration to an (any) explore state; as all exploit states were equally likely, the probability of transition to any given exploit state was $\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$4{\beta}_{1}$}\right.$. Similarly where ${\beta}_{2}$ was the timeconstant of the fitted exponential distribution of block lengths in the exploit condition, ${p}_{2}=\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{${\beta}_{2}$}\right.$ was the probability of transitioning from an exploitation state to the explore state. Transitions directly between exploit states had probability zero.
The classification of trials as ‘exploit’ and ‘core exploit’ by our heuristic and the Ebitz approach can be compared in Appendix 1—figure 4. The critical trial group used for the analysis of stable exploit periods, called ‘core exploit’, was defined as exploit trials more than five trials from the nearest explore trial. In this trial group the two models were concordant in that almost all trials classified as ‘core exploit’ by our heuristic were also classified as ‘core exploit’ by the HMM approach (mean ±SEM across participants: 97.3% ± 0.37%).
Conversely, the HMM approach classified some additional trials as core exploit, which were not soclassified by our heuristic (mean ±SEM across participants: 7.8% ± 1.3%). By definition, these excess exploit trials arise when participants erroneously select a low reward option for a long period of time, even though it had never (within that exploit phase) been the high reward option. For this reason, these additional exploit trials could be interpreted as periods of offtask behaviour. During the additional exploit trials, model entropy is likely high (as by definition participants are selecting a lowreward option and thus experiencing many reward omissions), but representational strength in any internal model would likely be low (due to disengagement from the task). We thus felt that the inclusion of these additional trials could artificially inflate the relation between model entropy and representation strength and chose to use our more conservative although heuristic approach throughout.
References

1
An integrative theory of locus coeruleusnorepinephrine function: adaptive gain and optimal performanceAnnual Review of Neuroscience 28:403–450.https://doi.org/10.1146/annurev.neuro.28.061604.135709

2
Learning the value of information in an uncertain worldNature Neuroscience 10:1214–1221.https://doi.org/10.1038/nn1954

3
Just above chance: is it harder to decode information from prefrontal cortex hemodynamic activity patterns?Journal of Cognitive Neuroscience 30:1473–1498.https://doi.org/10.1162/jocn_a_01291
 4

5
Network reset: a simplified overarching theory of locus coeruleus noradrenaline functionTrends in Neurosciences 28:574–582.https://doi.org/10.1016/j.tins.2005.09.002

6
A probability distribution over latent causes, in the orbitofrontal cortexThe Journal of Neuroscience 36:7817–7828.https://doi.org/10.1523/JNEUROSCI.065916.2016
 7
 8

9
The effects of neural gain on attention and learningNature Neuroscience 16:1146–1153.https://doi.org/10.1038/nn.3428
 10

11
Neuronal basis of sequential foraging decisions in a patchy environmentNature Neuroscience 14:933–939.https://doi.org/10.1038/nn.2856
 12

13
Pupil diameter predicts changes in the explorationexploitation tradeoff: evidence for the adaptive gain theoryJournal of Cognitive Neuroscience 23:1587–1596.https://doi.org/10.1162/jocn.2010.21548
 14
 15
 16
 17
 18
 19

20
Coordinated forms of noradrenergic plasticity in the locus coeruleus and primary auditory cortexNature Neuroscience 18:1483–1492.https://doi.org/10.1038/nn.4090
 21
 22

23
Rational regulation of learning dynamics by pupillinked arousal systemsNature Neuroscience 15:1040–1046.https://doi.org/10.1038/nn.3130
 24
 25

26
Learning and memory functions of the Basal GangliaAnnual Review of Neuroscience 25:563–593.https://doi.org/10.1146/annurev.neuro.25.112701.142937

27
Pupil dilation signals surprise: evidence for noradrenaline's role in decision makingFrontiers in Neuroscience 5:1–12.https://doi.org/10.3389/fnins.2011.00115
 28

29
Pupil fluctuations track rapid changes in adrenergic and cholinergic activity in cortexNature Communications, 7, 10.1038/ncomms13289, 27824036.
 30
 31
 32

33
Specific frontal neural dynamics contribute to decisions to checkNature Communications 7:1–14.https://doi.org/10.1038/ncomms11990
 34

35
The time course of perceptual choice: the leaky, competing accumulator modelPsychological Review 108:550–592.https://doi.org/10.1037/0033295X.108.3.550

36
Interactions between decision making and performance monitoring within prefrontal cortexNature Neuroscience 7:1259–1265.https://doi.org/10.1038/nn1339
 37

38
Bayesian online learning of the hazard rate in changepoint problemsNeural Computation 22:2452–2476.https://doi.org/10.1162/NECO_a_00007
 39
 40
Decision letter

Tobias H DonnerReviewing Editor; University Medical Center HamburgEppendorf, Germany

Michael J FrankSenior Editor; Brown University, United States

Samuel J GershmanReviewer; Harvard University, United States

Florent MeynielReviewer; CEA, France
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
[Editors’ note: the authors were asked to provide a plan for revisions before the editors issued a final decision. What follows is the editors’ letter requesting such plan.]
Thank you for sending your article entitled "Control of entropy in neural models of environmental state" for peer review at eLife. Your article is being evaluated by Michael Frank as the Senior Editor, Tobias Donner as the Reviewing Editor and three reviewers. The following individuals involved in review of your submission have agreed to reveal their identity: Samuel J Gershman (Reviewer #2).
Given the list of essential revisions, including new experiments, the editors and reviewers invite you to respond within the next two weeks with an action plan and timetable for the completion of the additional work. We plan to share your responses with the reviewers and then issue a binding recommendation.
Summary:
The authors investigate how learners maintain flexibility, a necessary feature of learning in dynamic environments. They propose that noradrenaline increases posterior uncertainty in the currently held representation, thereby maintaining flexibility and counteracting the decrease in uncertainty that is entailed by accumulating more observations. They provide evidence in favor of this hypothesis by means of pupillometry and fMRI. They further propose that the current representation, whose uncertainty is increased by noradrenaline, is encoded in the medial orbitofrontal cortex and that the anterior cingulate cortex detects when the current model is no longer valid and signals it to the locuscoeruleus – noradrenaline system.
Overall, the reviewers found this work interesting. The reported results provide a potentially exciting link between two rapidly evolving domains in computational and cognitive neuroscience. The authors employ some clever analyses; the dissociation between state uncertainty and action variability, in particular, is a nice feature of the analysis.
But all reviewers also raised substantial concerns, of either conceptual or technical nature, which would need to be addressed before this paper can be reconsidered for publication at eLife. The conceptual concerns hover around the mechanistic interpretation of the results: reviewers did not feel like the modeling, and the link between model and physiology, were as clear as they could have been.
We realize that addressing all the essential points below might be challenging within the relatively short period of time intended for the revision process at eLife, and hence we would like you to prepare a response and action plan.
Essential revisions:
1) Please provide a detailed modelbased analysis of participants' behavior, ideally comparing the current model to other plausible ones.
Currently, none of your analyses shows how well the model accounts for participants' behavior. The model doesn't even define a policy, so it doesn't address the exploreexploit tradeoff directly. Modeling behavior is important both in order to state clearly a mechanistic hypothesis about choice behavior, and to show empirically that such a model is supported by the data.
Modeling behavior would be critical, for example, to determine whether the representational change in OFC reflects some multifaceted entropy measure as might be predicted by a "state" account of OFC representation, or overrepresentation of an error signal during periods of high arousal, which would also be consistent with ideas about value signals in OFC.
By the same token, your primary behavioral figure shows that participants switched more under conditions of high entropy, but this is quite highly correlated with the actual trial outcome. Does the modelbased entropy measure affect participant behavior beyond simple winstay loseshift or reinforcement learning effects? If so, what are the additional aspects of task environment (e.g. feedback validity, prior feedback history) that jointly affect entropy measures and switching behavior?
2) Please unpack your mechanistic reasoning with regards to entropy. You put forward the idea (i) that posterior uncertainty always decreases with more observations, and therefore (ii) that some specific mechanism must "drive up entropy" in dynamic environments, and they look for such a mechanism in this paper. The reviewers understand this can be an interpretation of a leaky accumulation process, in which the leak "drives up entropy" (ii) and the accumulation "decreases posterior uncertainty" (i). But it is not entirely clear from the perspective of Bayesian inference. Specifically, reviewers raised two issues in this context.
2a) It is not generally true that posterior entropy decreases with more observations.
You state (Introduction): "if beliefs about the environment are expressed as a posterior distribution over possible states, the entropy, or uncertainty, of this posterior decreases as the number of concordant data points increases." Whether or not this is a mathematically correct statement depends on what is meant by "concordant" (which does not seem to be a technical concept). There are plenty of cases where posterior entropy will increase as more data arrive. For example, if you initially strongly believe that a coin is biased to produce heads and then you observe tails, then your posterior will be higher entropy than your prior. (This also comes up later on the same page: "If the 'leak' were removed, the model entropy would still decrease with each new observation, as this is a natural consequence of gaining information").
2b) It is not clear why a dedicated mechanism would be required to "drive up" or "inject" entropy.
Reviewers do not understand why a Bayesian model of inference must "counter" inflexibility by "injecting" entropy. There is nothing, mathematically speaking, that requires a model of (presumably human) inference to inject entropy. Also, talking about mechanisms like "leak" seem to be confusing algorithmic/implementational level concerns with the computationallevel analysis of the internal probabilistic model (see point 3a below). The "leak" in the Behrens et al. (2007) model, for example, comes directly from assumptions about volatility; it's not a mechanism that is "injected" into the model to increase entropy. The two computations called (i) and (ii) above are a direct consequence of probability calculus (subsection “Bayesian learning model”). Equation 6 in subsection “Bayesian learning model” shows that the assumption of nonstationarity (s > 0) counterbalances, and may even override, the decrease in posterior uncertainty by constantly injecting a uniform prior in the inference. There is no need to further "drive up entropy". Thus, reviewers feel that the statement in the Introduction: "… injecting uncertainty back into the model's state estimate to make rapid changes of belief possible" needs to be qualified or dropped.
3) There are other mechanistic concepts and related terminology, which should be clarified.
3a) In general, the Introduction tends to conflate the computational level of analysis with the level of the biological mechanisms. Reviewers feel it would help to more carefully separate those two.
3b) In choosing your definition of exploitation, you try to rely on "as few assumptions or arbitrary cutoffs as possible" (Results section). Reviewers don't think that's possible. The definition of exploitation you adopted (the first trial on which participants selected the true high reward probability option and received reward) is arguably just as arbitrary as other sensible definitions. An agent in an exploratory mode will place some probability on this option, and hence its selection does not reliably indicate exploitation. Assumptions are inescapable, and it's better to adopt a modelbased definition. But until the policy is defined explicitly, this is not possible.
3c) Reviewers do not understand the statement that uncertainty increases on unrewarded trials (subsection “Noradrenaline as a candidate mechanism for increasing flexibility of belief representations”). It's mathematically possible (though relatively unlikely) that uncertainty will increase on unrewarded trials, if subjects have low confidence during an exploitation phase.
3d) The following statement (Introduction) may be incorrect depending on how one interprets the terminology: "the better established one's model of the environment, the less flexible one's beliefs are." This depends on what "flexible" and "established" mean. If flexibility means something like the degree to which the posterior can deviate from the prior when new data arrive (e.g., expected KLdivergence), and if "established" means lower prior entropy, then one can conceive situations where the claim is incorrect.
4) Please elaborate on the decoding analysis.
4a) You look for a region that shows "stronger [decoded] representation strength for the currently selected option when model entropy is low", which should be a hallmark of a probabilistic model. It would be more natural to look for the posterior distribution of option values, rather than a representation of the selection option. It actually looks like it is the representation decoded in mOFC here. Indeed, if the representation decoded is that of the selected option, then it does not really make sense that the classifier's probability is actually lowest for the current option (compared to other options) in the high entropy condition (Figure 2C). By contrast, this result is expected for the option value. As seen on Figure 1C, when posterior entropy is highest in exploitation periods (the ones analyzed here), the value of the currently selected option tends to be the lowest.
4b) From the same basic perspective, the representational analyses could better clarify the source of the effects, rather than lumping everything into entropy. A difference in decoding reliability driven by differences in error versus correct trials is a rather different effect than one that emerges across feedbackvalidity conditions when accounting statistically for feedback history. The error versus correct distinction is one where the difference in frequency of these two conditions might lead to artificially low estimates of decoding reliability in error trials, which, given larger pupil responses on error trials, could make interpretation of the OFC pupil results fairly tricky. By this token, one could argue that a number of the findings reported in OFC [general uncertainty, choicespecific belief, and 70% vs 90% in the same regions (subsection “Medial OFC represents probabilistic beliefs about the state of the environment” and Figure 2—supplement 1)] could all be driven by a conjunctive code over feedback and option value, rather than by the tracking of reliability weighted state representations.
5) Please substantiate claims about temporal precedence.
The results are interpreted in terms of temporal precedence with the implicit assumption being that pupil diameter prior to outcome presentation relates to subsequent reliability of OFC representations. However, since pupil diameter, BOLD responses, and entropy all likely contain fairly strong autocorrelations, it would be useful if you could better test this assumption by determining whether the relationships hold even after shifting the pupil and OFC signals in time relative to one another (e.g. does OFC representation reliability predict pupil diameter on subsequent trial?).
6) Please provide a more comprehensive characterization of LC responses, and the methods used to detect them.
You should show the actual response time courses, rather than only statistical maps. You should also show the correlation between the LC and pupil responses. There is a dispute in the literature about the feasibility of LC imaging and the pitfalls and physiological artefacts (e.g. Astafiev, 2010; de Gee et al., 2017). Please also provide more information in Materials and methods section – did you correct for physiological noise (using RETROICOR or related approaches)? If not, it might be better to take the LC findings out altogether and focus on the pupil. Finally, some of your claims seem to require that the effects should specific to the LCNA system. Substantiating this claim would require contrasting LC responses to responses of other brainstem regions.
7) Please elaborate on the representation of the relevant prior literature.
There is a whole literature on this topic, which is only scarcely referred to. The following references (potentially more) should be included:
– Iigaya, 2016 presents a mechanistic view for driving up entropy with noradrenaline, with the idea that a surprise detection module is used to reset (by means of noradrenaline) the representation currently hold and stored in synapses.
– Meyniel and Dehaene, 2017 for waning and waxing of posterior uncertainty and its neural (fMRI) representation
– Pulcu and Browning, 2017 on pupil size variations with volatility
– Krishnamurthy, Nassar and Gold, 2017 on a link between pupil size and the balance between prior and current evidence in uncertain environments
– de Gee et al., 2017 for showing that fMRI responses in the human LC, but also other neuromodulatory brainstem nuclei, correlate with pupil dilation. In that regard, please tone down the statement that pupil dilation is a specific marker of LC activity (how specific this link is, has rarely been tested directly and the few studies that did, in humans, monkey, and rodents, founds links to several brainstem systems).
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your article "Control of entropy in neural models of environmental state" for further consideration by eLife. Your revised article has been favorably evaluated by a Reviewing Editor, a Senior Editor and three peer reviewers.
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below.
Summary:
All three reviewers and the Reviewing Editor were overall happy with the revisions and think that the paper has become much clearer. All remain convinced that the paper addresses an important and interesting topic. One reviewer still voiced three major concerns and suggested ways of addressing them through textual revisions, plus one possible additional analysis. We feel that addressing these points through textual revisions, and toning down certain parts of the conclusions, would be essential for publication. We leave it to the authors to decide whether or not perform the additional analysis suggested.
Essential revisions:
1) Mathematical constructs that motivate the study and analysis.
The authors now clearly decompose the inference into (1) computing a posterior distribution and (2) turning this posterior distribution into a predictive distribution by applying a transitionfunction which, if volatility is assumed, blunts the posterior. Using this (clear) terminology, they then say, "we investigate the hypothesis that the brain has a mechanism for increasing entropy in internal models that is analogous to the transitionfunction update described above". It seems that their analysis cannot be in line with this, since in their experiment, volatility (the rate of change) is constant, therefore the transitionfunction is constant, and it is impossible to correlate a brain/pupil signal with it or claim that a brain mechanism has been found for it. We suggest further editing the Introduction clarify this aspect.
2) Effect of entropy on representation strength.
It remains that the decoding results are not fully in line with the cartoon illustration in Figure 1B, in particular, that decoding is not significant in the high entropy condition. Instead, it looks like an effect of value, and indeed the effect is significant in GLM 4 testing reward probability.
We are concerned that the control (point 4, second part, in the response letter) for the contribution of value may not have been adequate. They regressed out, voxelwise, the linear effect of value. The analysis in the (univariate) residuals does not preclude that there remains multivariate activity related to value, which, if correlated with entropy, will boost the effect of entropy. The conserve approach may be more appropriate: value should be regressed out from entropy, then if the correlation between representation strength (derived from the decoder) and residual entropy remains significant and negative, it will be convincing that the effect is not confounded by value.
We realize that this control analysis may not be possible with the current design (strong negative correlation between value and entropy) – if that is the case, then the claim about a specific involvement of entropy should be toned down.
3) Effect of noradrenaline.
The authors opted for removing the LCfMRI results about which we had technical concerns. So, their claim about noradrenaline now relies entirely on the pupil data, but this overly specific conclusion is not supported by the available data. Specifically, the authors quote Joshi et al., 2016 to relate baseline pupil size (which is central in their analysis) to noradrenaline levels, but the quoted paper in fact shows that LC activity is specifically related to changes in pupil size. Reimer et al., 2016 go further in showing that pupil dynamics is correlated to both, noradrenaline and acetylcholinerelease, and the former is more strongly related to the fast dilations of the pupil whereas the latter is more strongly related to slow variations in pupil size, which would correspond to fluctuations in baseline pupil size in the current paper. The authors should refrain from linking their results specifically to the LCNA system; they should link their results to "pupillinked arousal". They can go on to speculate that the LCNA system might play a key role in their findings, but then they would need to acknowledge that their slow baseline pupil size effects would be more in line with a cholinergic effect. It does not seem to match well with the influential accounts in this field, but we need to take this recent evidence from careful experiments into neuromodulatory correlates of pupil dynamics seriously.
https://doi.org/10.7554/eLife.39404.029Author response
[Editors' note: the authors’ plan for revisions was approved and the authors made a formal revised submission.]
Essential revisions:
1) Please provide a detailed modelbased analysis of participants' behavior, ideally comparing the current model to other plausible ones.
Currently, none of your analyses shows how well the model accounts for participants' behavior. The model doesn't even define a policy, so it doesn't address the exploreexploit tradeoff directly. Modeling behavior is important both in order to state clearly a mechanistic hypothesis about choice behavior, and to show empirically that such a model is supported by the data.
Modeling behavior would be critical, for example, to determine whether the representational change in OFC reflects some multifaceted entropy measure as might be predicted by a "state" account of OFC representation, or overrepresentation of an error signal during periods of high arousal, which would also be consistent with ideas about value signals in OFC.
By the same token, your primary behavioral figure shows that participants switched more under conditions of high entropy, but this is quite highly correlated with the actual trial outcome. Does the modelbased entropy measure affect participant behavior beyond simple winstay loseshift or reinforcement learning effects? If so, what are the additional aspects of task environment (eg. feedback validity, prior feedback history) that jointly affect entropy measures and switching behavior?
We carried out the following further behavioural analysis:
a) We added a soft max policy to our model to predict trialwise choice and calculate model log likelihood. This is reported in Appendix 1 subsection “Predicting participants’ choices”, and Appendix 1—figure 2.
b) We implemented the change point model of Wilson and colleagues (requires adapting from the original continuous hypothesis space to the discrete hypothesis space of the current experimental paradigm) to obtain alternative measures of uncertainty (Relative Uncertainty and Change Point probability). Nassar et al., 2012. This is reported in Appendix 1, subsections “Relative uncertainty” and “Change point probability”, and Appendix 1—figures 1, 2, and 3.
c) We implemented the Hidden Markov Model of Ebitz et al. 2018 to obtain an alternative definition of explore/exploit trials and an alternative prediction of trialwise choice. This is reported in Appendix 1 subsection “Classification of behaviour into Explore and Exploit phases” and Appendix 1—figure 4.
d) We did a non modelbased (GLM) analysis to determine how well patch leaving (exploitexplore switch) is predicted by recent feedback history (reward omission on trials t1, t2 etc) and compared to how well patchleaving is predicted by entropy. This is reported in Appendix 1 subsection “Does belief uncertainty predict ExploreExploit transitions?”.
2) Please unpack your mechanistic reasoning with regards to entropy. You put forward the idea (i) that posterior uncertainty always decreases with more observations, and therefore (ii) that some specific mechanism must "drive up entropy" in dynamic environments, and they look for such a mechanism in this paper. The reviewers understand this can be an interpretation of a leaky accumulation process, in which the leak "drives up entropy" (ii) and the accumulation "decreases posterior uncertainty" (i). But it is not entirely clear from the perspective of Bayesian inference. Specifically, reviewers raised two issues in this context.
2a) It is not generally true that posterior entropy decreases with more observations.
You state (Introduction): "if beliefs about the environment are expressed as a posterior distribution over possible states, the entropy, or uncertainty, of this posterior decreases as the number of concordant data points increases." Whether or not this is a mathematically correct statement depends on what is meant by "concordant" (which does not seem to be a technical concept). There are plenty of cases where posterior entropy will increase as more data arrive. For example, if you initially strongly believe that a coin is biased to produce heads and then you observe tails, then your posterior will be higher entropy than your prior. (This also comes up later on the same page: "If the 'leak' were removed, the model entropy would still decrease with each new observation, as this is a natural consequence of gaining information").
2b) It is not clear why a dedicated mechanism would be required to "drive up" or "inject" entropy.
Reviewers do not understand why a Bayesian model of inference must "counter" inflexibility by "injecting" entropy. There is nothing, mathematically speaking, that requires a model of (presumably human) inference to inject entropy. Also, talking about mechanisms like "leak" seem to be confusing algorithmic/implementational level concerns with the computationallevel analysis of the internal probabilistic model (see point 3a below). The "leak" in the Behrens et al. (2007) model, for example, comes directly from assumptions about volatility; it's not a mechanism that is "injected" into the model to increase entropy. The two computations called (i) and (ii) above are a direct consequence of probability calculus (subsection “Bayesian learning model”). Equation 6 in subsection “Bayesian learning model” shows that the assumption of nonstationarity (s > 0) counterbalances, and may even override, the decrease in posterior uncertainty by constantly injecting a uniform prior in the inference. There is no need to further "drive up entropy". Thus, reviewers feel that the statement in the Introduction: "… injecting uncertainty back into the model's state estimate to make rapid changes of belief possible" needs to be qualified or dropped.
We accept the reviewers’ points and have redrafted the Introduction accordingly. We are not quoting all individual changes here since we did a complete redraft of the Introduction.
3) There are other mechanistic concepts and related terminology, which should be clarified.
3a) In general, the Introduction tends to conflate the computational level of analysis with the level of the biological mechanisms. Reviewers feel it would help to more carefully separate those two.
We accept the reviewers’ points and have redrafted the Introduction accordingly. We are not quoting all individual changes here since we did a complete redraft of the Introduction.
3b) In choosing your definition of exploitation, you try to rely on "as few assumptions or arbitrary cutoffs as possible" (Results section). Reviewers don't think that's possible. The definition of exploitation you adopted (the first trial on which participants selected the true high reward probability option and received reward) is arguably just as arbitrary as other sensible definitions. An agent in an exploratory mode will place some probability on this option, and hence its selection does not reliably indicate exploitation. Assumptions are inescapable, and it's better to adopt a modelbased definition. But until the policy is defined explicitly, this is not possible.
The Ebitz paper cited above gives careful consideration to how trials may be classified as explore or exploit. In the paper we now present results of defining explore/exploit periods using this HMM approach but decided to keep our heuristic approach for the main analyses as it is actually more conservative.
This is reported in Appendix 1 subsection “Classification of behaviour into Explore and Exploit phases” and Appendix 1—figure 4.
3c) Reviewers do not understand the statement that uncertainty increases on unrewarded trials (subsection “Noradrenaline as a candidate mechanism for increasing flexibility of belief representations”). It's mathematically possible (though relatively unlikely) that uncertainty will increase on unrewarded trials, if subjects have low confidence during an exploitation phase.
We have edited this sentence to make clear that it is possible, but unlikely, that uncertainty would increase on rewarded trials.
3d) The following statement (Introduction) may be incorrect depending on how one interprets the terminology: "the better established one's model of the environment, the less flexible one's beliefs are." This depends on what "flexible" and "established" mean. If flexibility means something like the degree to which the posterior can deviate from the prior when new data arrive (e.g., expected KLdivergence), and if "established" means lower prior entropy, then one can conceive situations where the claim is incorrect.
We accept the reviewers’ points and have redrafted the Introduction accordingly. We are not quoting all individual changes here since we did a complete redraft of the Introduction.
4) Please elaborate on the decoding analysis.
4a) You look for a region that shows "stronger [decoded] representation strength for the currently selected option when model entropy is low", which should be a hallmark of a probabilistic model. It would be more natural to look for the posterior distribution of option values, rather than a representation of the selection option.
We said in our response plan that we would attempt to decode the full posterior from mOFC.
However, when we came to do this analysis, we found that within core exploit trials (the trials that went into the fMRI analysis) it was not possible to dissociate the model probability associated with the three unchosen options from each other or from the probability associated with the chosen option. This is because on core exploit trials, by definition none of the unchosen options have been sampled for at least 5 trials. Therefore, most of the evidence informing the probabilities assigned to these options is in fact based on the sampling of the chosen option (i.e., if the chosen option is less likely to be correct based on recent feedback, all other options are upweighted equally). The average correlation matrix between the model probabilities for chosen option and the three other options (ordered from most to least probable under the model) is given in Author response image 1. As can be seen from this matrix, decoding the probability that the chosen option is correct is virtually equivalent, under the model, to decoding the full posterior over all 4 options. This is an inescapable feature of the design of our task.
The correlation matrix shows that the probability assigned to the exploited option is anticorrelated with that assigned to each of the other three, which themselves are all highly correlated.
It actually looks like it is the representation decoded in mOFC here. Indeed, if the representation decoded is that of the selected option, then it does not really make sense that the classifier's probability is actually lowest for the current option (compared to other options) in the high entropy condition (Figure 2C). By contrast, this result is expected for the option value.
To address the question, we repeated the decoding analysis (of option value) after having regressed out the effect of value, and the effect of recent feedback (reward omission on the current trial and previous 4 trials) from the signal at each voxel in the mOFC ROI. We were still able to decode chosen option identity from mOFC and the representation strength was proportional to model entropy.
We would like to point out, and have clarified in the manuscript, that we are not claiming that value/ feedback are not represented in mOFC. Indeed, model entropy is closely related to both option value and feedback history. The point is that we attempted to decode the identity of the chosen option (and hence, the current state of the environment) from mOFC, and found that entropy/value/feedback history modulated the strength with which one state representation dominated the activity pattern. Thus, the pattern of activity in mOFC represents a conjunctive code of state and confidence in that state, or state and value of that state, which are closely correlated in the current paradigm (as they would be in most RL paradigms).
Representation strength in mOFC was better predicted by model entropy than by the value of the chosen option; however, these two measures were rather closely correlated and were both related to recent feedback history. Because model entropy and option value both depended, within our learning model, on feedback history, this is to be expected; we cannot distinguish between these conceptually and statistically related constructs.
As seen on Figure 1C, when posterior entropy is highest in exploitation periods (the ones analyzed here), the value of the currently selected option tends to be the lowest.
Note that in Figure 2C, the probability of decoding the current option is not actually significantly below chance, we think it is around chance and appears below chance due to noise.
4b) From the same basic perspective, the representational analyses could better clarify the source of the effects, rather than lumping everything into entropy. A difference in decoding reliability driven by differences in error versus correct trials is a rather different effect than one that emerges across feedbackvalidity conditions when accounting statistically for feedback history. The error versus correct distinction is one where the difference in frequency of these two conditions might lead to artificially low estimates of decoding reliability in error trials, which, given larger pupil responses on error trials, could make interpretation of the OFC pupil results fairly tricky. By this token, one could argue that a number of the findings reported in OFC [general uncertainty, choicespecific belief, and 70% vs 90% in the same regions (subsection “Medial OFC represents probabilistic beliefs about the state of the environment” and Figure 2—supplement 1)] could all be driven by a conjunctive code over feedback and option value, rather than by the tracking of reliability weighted state representations.
We asked for clarification on this point and it was agreed that the analyses described in response to the previous point should address this point also.
5) Please substantiate claims about temporal precedence.
The results are interpreted in terms of temporal precedence with the implicit assumption being that pupil diameter prior to outcome presentation relates to subsequent reliability of OFC representations. However, since pupil diameter, BOLD responses, and entropy all likely contain fairly strong autocorrelations, it would be useful if you could better test this assumption by determining whether the relationships hold even after shifting the pupil and OFC signals in time relative to one another (eg. does OFC representation reliability predict pupil diameter on subsequent trial?).
We have repeated the pupilrepresentation strength analysis shifting the trials used in the analysis by one trial either way (so pupil on trial t is paired with representation strength on trial t+1 or t1). The analysis does not yield significant results if we make these shifts, suggesting the association is relatively specific to each trial.
This is reported in Results section of the manuscript, statement beginning “To test the temporal specificity of the result…”
We should point out that we are not trying to make a strong claim about temporal precedence. We are simply saying that pupil dilation changes and representation strength changes tend to cooccur on the same trial as each other, as would be expected if internal models are updated on a trialbytrial basis.
6) Please provide a more comprehensive characterization of LC responses, and the methods used to detect them.
You should show the actual response time courses, rather than only statistical maps. You should also show the correlation between the LC and pupil responses. There is a dispute in the literature about the feasibility of LC imaging and the pitfalls and physiological artefacts (e.g. Astafiev, 2010; de Gee et al., 2017). Please also provide more information in Materials and methods section – did you correct for physiological noise (using RETROICOR or related approaches)? If not, it might be better to take the LC findings out altogether, and focus on the pupil. Finally, some of your claims seem to require that the effects should specific to the LCNA system. Substantiating this claim would require contrasting LC responses to responses of other brainstem regions.
We have removed this analysis entirely on the reviewer’s recommendation.
7) Please elaborate on the representation of the relevant prior literature.
There is a whole literature on this topic, which is only scarcely referred to. The following references (potentially more) should be included:
– Iigaya, 2016 presents a mechanistic view for driving up entropy with noradrenaline, with the idea that a surprise detection module is used to reset (by means of noradrenaline) the representation currently hold and stored in synapses.
– Meyniel and Dehaene, 2017 for waning and waxing of posterior uncertainty and its neural (fMRI) representation
– Pulcu and Browning, 2017 on pupil size variations with volatility
– Krishnamurthy, Nassar and Gold, 2017 on a link between pupil size and the balance between prior and current evidence in uncertain environments
– de Gee et al., 2017 for showing that fMRI responses in the human LC, but also other neuromodulatory brainstem nuclei, correlate with pupil dilation. In that regard, please tone down the statement that pupil dilation is a specific marker of LC activity (how specific this link is has rarely been tested directly and the few studies that did, in humans, monkey, and rodents, founds links to several brainstem systems).
Thanks for pointing these references out, we have now included discussion of these in the manuscript.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Essential revisions:
1) Mathematical constructs that motivate the study and analysis.
The authors now clearly decompose the inference into (1) computing a posterior distribution and (2) turning this posterior distribution into a predictive distribution by applying a transitionfunction which, if volatility is assumed, blunts the posterior. Using this (clear) terminology, they then say, "we investigate the hypothesis that the brain has a mechanism for increasing entropy in internal models that is analogous to the transitionfunction update described above". It seems that their analysis cannot be in line with this, since in their experiment, volatility (the rate of change) is constant, therefore the transitionfunction is constant and it is impossible to correlate a brain/pupil signal with it or claim that a brain mechanism has been found for it. We suggest further editing the Introduction clarify this aspect.
We agree with the reviewer’s point here: indeed, we have not manipulated the magnitude of the transition function. This is in contrast to many previous papers investigating uncertainty, which have effectively searched for neural correlates of a control process (that determines the level of uncertainty) (discussed in the Introduction, fifth paragraph). The approach used by these previous papers is to manipulate the environment so that the optimal level of uncertainty in the transition function varies over time. To introduce this idea, we need to explain the concept of the transition function.
However, in the current study our aim was not to manipulate the optimal level of uncertainty in a model and identify brain regions tracking this, but to measure uncertainty as an intrinsic property of a model of the environment (discussed in the Introduction).
We tried to make the distinction between these concepts clear in the part of the Introduction referring to “2 questions” (fifth to final paragraph). However, we agree that the text suggested we had actually manipulated the transition function, rather than studied how uncertainty within a probabilistic world model is controlled – which amongst other things provides a mechanism by which a transition function could be implemented.
To try to clarify this point we have added a paragraph as follows
“In the current experiment we focused on the second question above [i.e. not the strength of the leak but its consequences as manifest in the internal model itself]. […] These are key components for any candidate mechanism for the implementation of a transitionfunction that increases uncertainty, such as a leak.”
2) Effect of entropy on representation strength.
It remains that the decoding results are not fully in line with the cartoon illustration in Figure 1B, in particular, that decoding is not significant in the high entropy condition.
We think the reviewer is referring to the mOFC results in Figure 2C here. This figure illustrates the effect of entropy on decoding accuracy and decoded option probabilities using a median split of trials into high and low entropy. It is true that the figure suggests that decoding would not be significant in this subset of trials. This simply reflects the key result that decoding was (parametrically) more accurate when entropy was lower. This graded effect is the effect we have tested statistically (analysis described in subsection “Decoding a probabilistic representation of beliefs about the state of the environment”).
The cartoon in Figure 1B is just an illustrative cartoon, which is supposed to show that the effect of entropy on representation strength is graded (i.e. we would expect the high reward option to be represented more strongly than other options in general, but that this pattern would be weaker in high entropy conditions). This is compatible with the linear regression showing a relationship between entropy and representation strength (analysis described in subsection “Decoding a probabilistic representation of beliefs about the state of the environment”). We don’t think it would make the cartoon more informative to change it to look like the median split results when in fact the effect of entropy on representation strength is graded.
Instead, it looks like an effect of value, and indeed the effect is significant in GLM 4 testing reward probability. We are concerned that the control (point 4, second part, in the response letter) for the contribution of value may not have been adequate. They regressed out, voxelwise, the linear effect of value. The analysis in the (univariate) residuals does not preclude that there remains multivariate activity related to value, which, if correlated with entropy, will boost the effect of entropy. The conserve approach may be more appropriate: value should be regressed out from entropy, then if the correlation between representation strength (derived from the decoder) and residual entropy remains significant and negative, it will be convincing that the effect is not confounded by value.
We realize that this control analysis may not be possible with the current design (strong negative correlation between value and entropy)  if that is the case, then the claim about a specific involvement of entropy should be toned down.
The reviewer is correct that there is a strong relationship between the expected value of choosing the currently selected option, and the ideal observer’s confidence that this is the correct option. Therefore, it doesn’t make sense to regress out option value from entropy – even if there was a residual relationship between entropy and representation strength, it is not clear what the residual of entropy, after value is regressed out, reflects conceptually in this context. In our opinion this is not so much a bug as a feature – because a relationship between action’s expected value, and the agent’s certainty that is the correct action to take, would be a property of many naturalistic models of value based action. However, we agree that this should be spelled out clearly and have added a paragraph to do so, at an early point in the paper (directly after we introduce the idea of model entropy itself).
This is the text we added (subsection “What is the nature of the representation in mOFC?”):
“Throughout the next sections of this paper, we discuss the relationship between model entropy (in our theoretical model) and the level of uncertainty in a neural model of the state of the world. […] In behavioural control more generally, this association would hold for any world model for which the aim was to determine the expected value associated with different candidate actions.”
We have retained the text about the control analysis (in which we regress out the univariate effect of value) because we think it may be of interest to readers in addition to the text above, and made the rest of the paragraph clearer:
“The reason we interpret the multivariate results as evidence for a probabilistic representation of the current state (which option is the current high reward option) is that our ability to decode the identity of the chosen option from mOFC (indexed as representation strength for the identity of that option) was higher when the participant’s uncertainty (model entropy) was lower. […] Rather the decoding must be driven by different relative weightings of specific option representations, consistent with a probabilistic state representation.”
3) Effect of noradrenaline.
The authors opted for removing the LCfMRI results about which we had technical concerns. So, their claim about noradrenaline now relies entirely on the pupil data, but this overly specific conclusion is not supported by the available data. Specifically, the authors quote Joshi et al., 2016 to relate baseline pupil size (which is central in their analysis) to noradrenaline levels, but the quoted paper in fact shows that LC activity is specifically related to changes in pupil size. Reimer et al., 2016 go further in showing that pupil dynamics is correlated to both, noradrenaline and acetylcholinerelease, and the former is more strongly related to the fast dilations of the pupil whereas the latter is more strongly related to slow variations in pupil size, which would correspond to fluctuations in baseline pupil size in the current paper. The authors should refrain from linking their results specifically to the LCNA system; they should link their results to "pupillinked arousal". They can go on to speculate that the LCNA system might play a key role in their findings, but then they would need to acknowledge that their slow baseline pupil size effects would be more in line with a cholinergic effect. It does not seem to match well with the influential accounts in this field, but we need to take this recent evidence from careful experiments into neuromodulatory correlates of pupil dynamics seriously.
We agree that the pupillometry effects alone can’t be linked directly to noradrenaline as opposed to Ach, for example. We have edited the language throughout to be careful about whether we are reporting (a) our results which speak only to pupilrelated arousal per se, (b) other groups’ results, some of which are in fact specific to noradrenaline (animal studies, anatomical studies, etc) or (c) other researchers’ theories, which are specifically theories about noradrenaline, even if experimental evidence for a specific role of noradrenaline as described in said theories is still lacking. In other words, we have removed any claim that our results specifically index noradrenaline, but we have not in general removed specific references to noradrenaline where discussing the work of others.
We have also added some text to make clear the current state of the literature, as we understand it, about how (non)specifically pupil data can be linked to noradrenaline:
“We used baseline pupil size (mean pupil area in the 20ms before outcome presentation, expressed as% signal change relative to the participant’s mean pupil area over the task run) to index neuromodulatory state. […] However, we must emphasise that pupillometry is a very indirect measure of neuromodulation and although we mostly interpret our results in the context of noradrenaline due to the extensive literature on pupillometry and noradrenaline, as well as influential theories and experimental work linking noradrenaline to uncertainty, model updating and exploration, we note that our pupil size data cannot be specifically linked to noradrenaline over other pupillinked arousal factors such as acetylcholine.”
https://doi.org/10.7554/eLife.39404.030Article and author information
Author details
Funding
Medical Research Council (MR/L019639/1)
 Jill O'Reilly
Wellcome Trust (HMR00560.001)
 Timothy E Behrens
Medical Research Council (BRT00020)
 Timothy H Muller
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Alon Baram, Philipp Schwartenbeck, James Whittington, Emma Lawrence, Anna Shpektor and Jacob Bakermans for very helpful discussions and for comments on an earlier version of this manuscript. We gratefully acknowledge the Medical Research Council (Career Development Award to JXO’R grant MR/L019639/1, and a grant to THM BRT00020) for funding.
Ethics
Human subjects: The study was approved by a local University of Oxford ethics committee (ref: MSDIDRECC12013171), and all participants gave written informed consent.
Senior Editor
 Michael J Frank, Brown University, United States
Reviewing Editor
 Tobias H Donner, University Medical Center HamburgEppendorf, Germany
Reviewers
 Samuel J Gershman, Harvard University, United States
 Florent Meyniel, CEA, France
Publication history
 Received: June 20, 2018
 Accepted: February 6, 2019
 Version of Record published: February 28, 2019 (version 1)
 Version of Record updated: March 15, 2019 (version 2)
Copyright
© 2019, Muller et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 690
 Page views

 130
 Downloads

 1
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.