Striatal action-value neurons reconsidered
Abstract
It is generally believed that during economic decisions, striatal neurons represent the values associated with different actions. This hypothesis is based on studies in which the activity of striatal neurons was measured while the subject was learning to prefer the more rewarding action. Here we show that these publications are subject to at least one of two critical confounds. First, we show that even weak temporal correlations in the neuronal data may result in an erroneous identification of action-value representations. Second, we show that experiments and analyses designed to dissociate action-value representation from the representation of other decision variables cannot do so. We suggest solutions to identifying action-value representation that are not subject to these confounds. Applying one solution to previously identified action-value neurons in the basal ganglia, we fail to detect action-value representations. We conclude that the claim that striatal neurons encode action-values must await new experiments and analyses.
https://doi.org/10.7554/eLife.34248.001
Main text
There is a long history of operant learning experiments, in which a subject, human or animal, repeatedly chooses between actions and is rewarded according to its choices. A popular theory posits that the subject’s decisions in these tasks utilize estimates of the different action-values. These action-values correspond to the expected reward associated with each of the actions, and actions associated with a higher estimated action-value are more likely to be chosen (Sutton and Barto, 1998). In recent years, there has been considerable interest in the neural mechanisms underlying this computation (Louie and Glimcher, 2012; Schultz, 2015). In particular, based on electrophysiological, functional magnetic resonance imaging (fMRI) and intervention experiments, it is now widely accepted that a population of neurons in the striatum represents these action-values, lending support to this action-value theory (Cai et al., 2011; FitzGerald et al., 2012; Funamizu et al., 2015; Guitart-Masip et al., 2012; Her et al., 2016; Ito and Doya, 2009; Ito and Doya, 2015a; Ito and Doya, 2015b; Kim et al., 2013; Kim et al., 2009; Kim et al., 2012; 2007; Lau and Glimcher, 2008; Lee et al., 2015; Samejima et al., 2005; Stalnaker et al., 2010; Tai et al., 2012; Wang et al., 2013; Wunderlich et al., 2009). Here we challenge the evidence for action-value representation in the striatum by describing two major confounds in the interpretation of the data that have not yet been successfully addressed.
To identify neurons that represent the values the subject associates with the different actions, researchers have searched for neurons whose firing rate is significantly correlated with the average reward associated with exactly one of the actions. There are several ways of defining the average reward associated with an action. For example, the average reward can be defined by the reward schedule, that is, the probability of a reward associated with the action. Alternatively, one can adopt the subject’s perspective, and use the subject-specific history of rewards and actions in order to estimate the average reward. In particular, the Rescorla–Wagner model (equivalent to the standard one-state Q-learning model) has been used to estimate action-values (Kim et al., 2009; Samejima et al., 2005). In this model, the value associated with an action $i$ in trial $t$, termed ${Q}_{i}\left(t\right)$, is an exponentially-weighted average of the rewards associated with this action in past trials:
$${Q}_{a\left(t\right)}\left(t+1\right)={Q}_{a\left(t\right)}\left(t\right)+\alpha \left(R\left(t\right)-{Q}_{a\left(t\right)}\left(t\right)\right),\qquad {Q}_{i}\left(t+1\right)={Q}_{i}\left(t\right)\ \text{for}\ i\ne a\left(t\right)\tag{1}$$
where $a\left(t\right)$ and $R\left(t\right)$ denote the choice and reward in trial $t$, respectively, and $\alpha $ is the learning rate.
The model also posits that in a two-alternative task, the probability of choosing an action is a sigmoidal function, typically softmax, of the difference of the action-values (see also [Shteingart and Loewenstein, 2014]):
$$\mathrm{Pr}\left(a\left(t\right)=1\right)=\frac{1}{1+{e}^{-\beta \left({Q}_{1}\left(t\right)-{Q}_{2}\left(t\right)\right)}}\tag{2}$$
where $\beta $ is a parameter that determines the bias towards the action associated with the higher action-value. The parameters of the model, $\alpha $ and $\beta $, can be estimated from the behavior, allowing the researchers to compute ${Q}_{1}$ and ${Q}_{2}$ on a trial-by-trial basis.
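To make the model concrete, the following is a minimal Python sketch of the learning rule (only the chosen action's value moves toward the obtained reward, at rate $\alpha $) and of the softmax choice rule on the difference of the action-values. It is an illustration under simplifying assumptions, not the code used in this paper: the function name, the fixed reward probabilities and the random seed are hypothetical.

```python
import math
import random

def simulate_q_learning(p_reward, n_trials, alpha=0.1, beta=2.5, q_init=0.5, seed=0):
    """Simulate a two-action Q-learning agent with fixed reward probabilities.

    p_reward: (p1, p2), reward probabilities of actions 1 and 2 (assumed fixed here).
    Returns per-trial choices (1 or 2), rewards (0 or 1) and the action-values
    (Q1, Q2) held at the start of each trial.
    """
    rng = random.Random(seed)
    q = [q_init, q_init]
    choices, rewards, q_history = [], [], []
    for _ in range(n_trials):
        q_history.append((q[0], q[1]))
        # Choice rule: logistic function of the difference of the action-values
        p_choose_1 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        a = 0 if rng.random() < p_choose_1 else 1
        r = 1 if rng.random() < p_reward[a] else 0
        # Learning rule: only the chosen action's value is updated
        q[a] += alpha * (r - q[a])
        choices.append(a + 1)
        rewards.append(r)
    return choices, rewards, q_history
```

Because each update is a convex combination of the old value and a binary reward, the simulated action-values remain between 0 and 1.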
In principle, one can identify the neurons that represent an action-value by identifying neurons for which the regression of the trial-by-trial spike count on one of the variables ${Q}_{i}\left(t\right)$ is statistically significant. Using this framework, electrophysiological studies have found that the firing rate of a substantial fraction of striatal neurons (12–40% for different significance thresholds) is significantly correlated with an action-value. These and similar results were considered as evidence that neurons in the striatum represent action-values (Funamizu et al., 2015; Her et al., 2016; Ito and Doya, 2015a; Ito and Doya, 2015b; Kim et al., 2013; Kim et al., 2009; Lau and Glimcher, 2008; Samejima et al., 2005).
In this paper we conduct a systematic literature search and conclude that the literature has, by and large, ignored two major confounds in this and in similar analyses. First, it is well-known that spurious correlations can emerge in correlation analysis if both variables have temporal correlations (Granger and Newbold, 1974; Phillips, 1986). Here we show that neurons can be erroneously classified as representing action-values when their firing rates are weakly temporally correlated. Second, it is also well-known that lack of a statistically significant result in the analysis does not imply lack of correlation. Because in standard analyses neurons are classified as representing action-values if they have a significant regression coefficient on exactly one action-value and because decision variables such as policy are correlated with both action-values, neurons representing other decision variables may be misclassified as representing action-values. We propose different approaches to address these issues. Applying one of them to recordings from the basal ganglia, we fail to identify any action-value representation there. Thus, we conclude that the hypothesis that striatal neurons represent action-values still remains to be tested by experimental designs and analyses that are not subject to these confounds. In the Discussion we address additional conceptual issues with identifying such a representation.
This paper discusses methodological problems that may also be of relevance in other fields of biology in general and neuroscience in particular. Nevertheless, the focus of this paper is a single scientific claim, namely, that action-value representation in the striatum is an established fact. Our criticism is restricted to the representation of action-values, and we do not make any claims regarding the possible representations of other decision variables, such as policy, chosen-value or reward-prediction-error. This we leave for future studies. Moreover, we do not make any claims about the possible representations of action-values elsewhere in the brain, although our results suggest caution when looking for such representations.
The paper is organized in the following way. We commence by describing a standard method for identifying action-value neurons. Next, we show that this method erroneously classifies simulated neurons, whose activity is temporally correlated, as representing action-values. We show that this confound brings into question the conclusion of many existing publications. Then, we propose different methods for identifying action-value neurons that overcome this confound. Applying such a method to basal ganglia recordings, in which action-value neurons were previously identified, we fail to conclusively detect any action-value representations. We continue by discussing the second confound: neurons that encode the policy (the probability of choice) may be erroneously classified as representing action-value, even when the policy is the result of learning algorithms that are devoid of action-value calculation. Then we discuss a possible solution to this confound.
Results
Identifying action-value neurons
We commence by examining the standard methods for identifying action-value neurons using a simulation of an operant learning experiment. We simulated a task, in which the subject repeatedly chooses between two alternative actions, which yield a binary reward with a probability that depends on the action. Specifically, each session in the simulation was composed of four blocks such that the probabilities of rewards were fixed within a block and varied between the blocks. The probabilities of reward in the blocks were (0.1,0.5), (0.9,0.5), (0.5,0.9) and (0.5,0.1) for actions 1 and 2, respectively (Figure 1A). The order of blocks was random and a block terminated when the more rewarding action was chosen more than 14 times within 20 consecutive trials (Ito and Doya, 2015a; Samejima et al., 2005).
To simulate learning behavior, we used the Q-learning framework (Equations 1 and 2 with $\alpha =0.1$ and $\beta =2.5$ (taken from distributions reported in [Kim et al., 2009]) and initial conditions ${Q}_{i}\left(1\right)=0.5$). As demonstrated in Figure 1A, the model learned: the probability of choosing the more rewarding alternative increased over trials (black line). To model the action-value neurons, we simulated neurons whose firing rate is a linear function of one of the two Q-values and whose spike count in a 1 sec trial is randomly drawn from a corresponding Poisson distribution (see Materials and methods). The firing rates and spike counts of two such neurons, representing action-values 1 and 2, are depicted in Figure 1B in red and blue, respectively.
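A simulated action-value neuron of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the baseline rate and the gain of the linear function are hypothetical placeholders (the values actually used are given in Materials and methods), and Poisson counts are drawn with Knuth's multiplication method, which is adequate for the low rates considered here.

```python
import math
import random

def poisson_sample(rate, rng):
    """Draw one Poisson-distributed count (Knuth's method; suitable for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def action_value_neuron(q_values, baseline=2.5, gain=2.0, seed=0):
    """Firing rates and spike counts (1 s trials) of a neuron whose rate is linear in Q.

    baseline and gain (spikes/s) are illustrative assumptions, not the paper's values.
    """
    rng = random.Random(seed)
    # Linear dependence on the action-value, clipped at zero so the rate stays valid
    rates = [max(baseline + gain * (q - 0.5), 0.0) for q in q_values]
    counts = [poisson_sample(rate, rng) for rate in rates]
    return rates, counts
```

Passing the trial-by-trial values of ${Q}_{1}$ or ${Q}_{2}$ as `q_values` yields a neuron that, by construction, represents the corresponding action-value.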
One standard method for identifying action-value neurons is to compare neurons' spike counts after learning, at the end of the blocks (horizontal bars in Figure 1B). Considering the red-labeled Poisson neuron, the spike count in the last 20 trials of the second block, in which the probability of reward associated with action 1 was 0.9, was significantly higher than that count in the first block, in which the probability of reward associated with action 1 was 0.1 (p<0.01; rank sum test). By contrast, there was no significant difference in the spike counts between the third and fourth blocks, in which the probability of reward associated with action 1 was equal (p=0.91; rank sum test). This is consistent with the fact that the red-labeled neuron was an action 1-value neuron: its firing rate was a linear function of the value of action 1 (Figure 1B, red). Similarly for the blue-labeled neuron, the spike counts in the last 20 trials of the first two blocks were not significantly different (p=0.92; rank sum test), but there was a significant difference in the counts between the third and fourth blocks (p<0.001; rank sum test). These results are consistent with the probabilities of reward associated with action 2 and the fact that in our simulations, this neuron’s firing rate was modulated by the value of action 2 (Figure 1B, blue).
This approach for identifying action-value neurons is limited, however, for several reasons. First, it considers only a fraction of the data, the last 20 trials in a block. Second, action-value neurons are not expected to represent the block average probabilities of reward. Rather, they will represent a subjective estimate, which is based on the subject-specific history of actions and rewards. Therefore, it is more common to identify action-value neurons by regressing the spike count on subjective action-values, estimated from the subject’s history of choices and rewards (Funamizu et al., 2015; Ito and Doya, 2015a; Ito and Doya, 2015b; Kim et al., 2009; Lau and Glimcher, 2008; Samejima et al., 2005). Note that when studying behavior in experiments, we have no direct access to these estimated action-values, in particular because the values of the parameters $\alpha $ and $\beta $ are unknown. Therefore, following common practice, we estimated the values of $\alpha $ and $\beta $ from the model’s sequence of choices and rewards using maximum likelihood, and used the estimated learning rate ($\alpha $) and the choices and rewards to estimate the action-values (thin lines in Figure 1C, see Materials and methods). These estimates were similar to the true action-value, which underlay the model’s choice behavior (thick lines in Figure 1C).
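The estimation step can be illustrated with the following Python sketch. It evaluates the likelihood of the observed choices under the Q-learning model and, as a stand-in for whatever optimizer is used in practice, picks the best parameter pair from a user-supplied grid; the grid search, the function names and the initial value of 0.5 are assumptions made for illustration, not a description of the procedure in Materials and methods.

```python
import math

def negative_log_likelihood(alpha, beta, choices, rewards, q_init=0.5):
    """Negative log-likelihood of a choice sequence under the Q-learning model.

    choices: actions coded 1 or 2; rewards: 0 or 1.
    """
    q = [q_init, q_init]
    total = 0.0
    for a, r in zip(choices, rewards):
        p_choose_1 = 1.0 / (1.0 + math.exp(-beta * (q[0] - q[1])))
        p_observed = p_choose_1 if a == 1 else 1.0 - p_choose_1
        total -= math.log(max(p_observed, 1e-300))  # guard against log(0)
        q[a - 1] += alpha * (r - q[a - 1])
    return total

def fit_by_grid_search(choices, rewards, alphas, betas):
    """Return the (alpha, beta) pair on the grid with the lowest negative log-likelihood."""
    return min(((a, b) for a in alphas for b in betas),
               key=lambda ab: negative_log_likelihood(ab[0], ab[1], choices, rewards))

def estimated_action_values(alpha, choices, rewards, q_init=0.5):
    """Replay the update rule with the fitted alpha to obtain per-trial (Q1, Q2) estimates."""
    q = [q_init, q_init]
    history = []
    for a, r in zip(choices, rewards):
        history.append((q[0], q[1]))
        q[a - 1] += alpha * (r - q[a - 1])
    return history
```

Note that only the fitted $\alpha $ enters `estimated_action_values`; $\beta $ affects the likelihood of the choices but not the values themselves.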
Next, we regressed the spike count of each simulated neuron on the two estimated action-values from its corresponding session. As expected, the t-value of the regression coefficient of the red-labeled action 1-value neuron was significant for the estimated ${Q}_{1}$ $\left({t}_{182}\left({Q}_{1}\right)=4.05\right)$ but not for the estimated ${Q}_{2}$ $\left({t}_{182}\left({Q}_{2}\right)=0.27\right)$. Similarly, the t-value of the regression coefficient of the blue-labeled action 2-value neuron was significant for the estimated ${Q}_{2}$ $\left({t}_{182}\left({Q}_{2}\right)=3.05\right)$ but not for the estimated ${Q}_{1}$ $\left({t}_{182}\left({Q}_{1}\right)=0.78\right)$.
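For readers who wish to reproduce this kind of analysis, the following Python sketch computes the t-values of the two coefficients in an ordinary least-squares regression of spike counts on both estimated action-values plus an intercept. It is a plain textbook implementation written for illustration (the helper names are hypothetical), and it uses the standard formula whose validity rests on the assumption that the residuals are independent across trials.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def regression_t_values(spike_counts, q1, q2):
    """t-values of the Q1 and Q2 coefficients in s = b0 + b1*Q1 + b2*Q2 + noise."""
    rows = [(1.0, x1, x2) for x1, x2 in zip(q1, q2)]
    n, k = len(rows), 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, spike_counts)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    residual_ss = sum((y - sum(c * x for c, x in zip(coef, r))) ** 2
                      for r, y in zip(rows, spike_counts))
    sigma2 = residual_ss / (n - k)  # residual variance with n - 3 degrees of freedom
    t_values = []
    for j in (1, 2):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        inv_jj = solve_linear_system(xtx, unit)[j]  # j-th diagonal entry of (X'X)^-1
        t_values.append(coef[j] / math.sqrt(sigma2 * inv_jj))
    return t_values
```

With three fitted parameters, a session of 185 trials leaves 182 degrees of freedom, as in the t-values quoted above.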
A population analysis of the t-values of the two regression coefficients is depicted in Figure 1D,E. As expected, a substantial fraction (42%) of the simulated neurons were identified as action-value neurons. Only 2% of the simulated neurons had significant regression coefficients with both action-values. Such neurons are typically classified as state $\left(\mathrm{\Sigma}Q\right)$ or policy (also known as preference) $\left(\mathrm{\Delta}Q\right)$ neurons, if the two regression coefficients have the same or different signs, respectively (Ito and Doya, 2015a). Note that despite the fact that by construction, all neurons were action-value neurons, not all of them were detected as such by this method. This failure occurred for two reasons. First, the estimated action-values are not identical to the true action-values, which determine the firing rates. This is because of the finite number of trials and the stochasticity of choice (note the difference, albeit small, between the thin and thick lines in Figure 1C). Second and more importantly, the spike count in a trial is only a noisy estimate of the firing rate because of the Poisson generation of spikes.
Several prominent studies have implemented the methods we described in this section and reported that a substantial fraction (10–40% depending on significance threshold) of striatal neurons represent action-values (Ito and Doya, 2015a; Ito and Doya, 2015b; Samejima et al., 2005). In the next two sections we show that these methods, and similar methods employed by other studies (Cai et al., 2011; FitzGerald et al., 2012; Funamizu et al., 2015; Guitart-Masip et al., 2012; Her et al., 2016; Ito and Doya, 2009; Kim et al., 2013; Kim et al., 2009; Kim et al., 2012; 2007; Lau and Glimcher, 2008; Stalnaker et al., 2010; Wang et al., 2013; Wunderlich et al., 2009), are all subject to at least one of two major confounds.
Confound 1 – temporal correlations
Simulated random-walk neurons are erroneously classified as action-value neurons
The red- and blue-labeled neurons in Figure 1D were classified as action-value neurons because their t-values were improbable under the null hypothesis that the firing rate of the neuron is not modulated by action-values. The significance threshold (t = 2) was computed assuming that trials are independent in time. To see why this assumption is essential, we consider a case in which it is violated. Figure 2A depicts the firing rates and spike counts of two simulated Poisson neurons, whose firing rates follow a bounded Gaussian random-walk process:
$$f\left(t+1\right)={\left[f\left(t\right)+z\left(t\right)\right]}_{+}\tag{3}$$
where $f\left(t\right)$ is the firing rate in trial $t$ (we consider epochs of 1 second as ‘trials’), $z\left(t\right)$ is a diffusion variable, randomly and independently drawn from a normal distribution with mean 0 and variance ${\sigma}^{2}=0.01$, and ${\left[x\right]}_{+}$ denotes a linear-threshold function, ${\left[x\right]}_{+}=x$ if $x\ge 0$ and 0 otherwise.
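The bounded random-walk process is simple to simulate. The Python sketch below is a minimal illustration: each trial's rate is the previous rate plus an independent Gaussian step, rectified at zero. The initial rate is a hypothetical placeholder (the value used in our simulations is given in Materials and methods), and the default `sigma=0.1` corresponds to a variance of 0.01.

```python
import random

def random_walk_rates(n_trials, sigma=0.1, initial_rate=2.5, seed=0):
    """Firing rates following a Gaussian random walk rectified at zero.

    initial_rate is an illustrative assumption; sigma is the standard
    deviation of the per-trial diffusion step.
    """
    rng = random.Random(seed)
    rates = [initial_rate]
    for _ in range(n_trials - 1):
        step = rng.gauss(0.0, sigma)
        rates.append(max(rates[-1] + step, 0.0))  # linear-threshold function [x]_+
    return rates
```

Spike counts can then be drawn from a Poisson distribution with these rates. Nothing in this process refers to choices, rewards or action-values.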
These random-walk neurons are clearly not action-value neurons. Nevertheless, we tested them using the analyses depicted in Figure 1. To that end, we randomly matched the trials in the simulation of the random-walk neurons (completely unrelated to the task) to the trials in the simulation depicted in Figure 1A. Then, we considered the spike counts of the random-walk neurons in the last 20 trials of each of the four blocks in Figure 1A (the blocks are defined by the simulation of learning and are unrelated to the activity of the random-walk neurons). Surprisingly, when considering the top neuron in Figure 2A and utilizing the same analysis as in Figure 1B, we found that its spike count differed significantly between the first two blocks (p<0.01, rank sum test) but not between the last two blocks (p=0.28, rank sum test), similarly to the simulated action 1-value neuron of Figure 1B (red). Similarly, the spike count of the bottom random-walk neuron matched that of a simulated action 2-value neuron (compare with the blue-labeled neuron in Figure 1B; Figure 2A).
Moreover, we regressed each vector of spike counts for 20,000 random-walk neurons on randomly matched estimated action-values from Figure 1E and computed the t-values (Figure 2B). This analysis erroneously classified 42% of these random-walk neurons as action-value neurons (see Figure 2C). In particular, the top and bottom random-walk neurons of Figure 2A were identified as action-value neurons for actions 1 and 2, respectively (squares in Figure 2B).
To further quantify this result, we computed the fraction of random-walk neurons erroneously classified as action-value neurons as a function of the diffusion parameter $\mathrm{\sigma}$ (Figure 2D). When $\mathrm{\sigma}$=0, the spike counts of the neurons in the different trials are independent and the number of random-walk neurons classified as action-value neurons is slightly less than 10%, the fraction expected by chance from a significance criterion of 5% and two statistical tests, corresponding to the two action-values. The larger the value of $\mathrm{\sigma}$, the higher the probability that a random-walk neuron will pass the selection criterion for at least one action-value and thus be erroneously classified as an action-value, state or policy neuron.
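The chance level of "slightly less than 10%" can be checked with a short calculation. As a simplifying assumption made only for this illustration, suppose that under the null hypothesis the two tests (one per action-value) are independent and each is significant with probability 0.05. Then the probability that at least one coefficient is significant, and the probability that exactly one is significant (the criterion for an action-value neuron), are

$$\mathrm{Pr}\left(\text{at least one}\right)=1-{0.95}^{2}=0.0975,\qquad \mathrm{Pr}\left(\text{exactly one}\right)=2\times 0.05\times 0.95=0.095.$$

Both values are just below 10%; the second corresponds to the 9.5% chance level for action-value classification.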
The excess action-value neurons in Figure 2 emerged because the significance boundary in the statistical analysis was based on the assumption that the different trials are independent from each other. In the case of a regression of a random-walk process on an action-value related variable, this assumption is violated. The reason is that in this case, both predictor (action-value) and the dependent variable (spike count) slowly change over trials, the former because of the learning and the latter because of the random drift. As a result, the statistic, which relates these two signals, is correlated between trials, violating the independence-of-trials assumption of the test. Because of these dependencies, the expected variance of the statistic (be it average spike count in 20 trials or the regression coefficient), which is calculated under the independence-of-trials assumption, is an underestimate of the actual variance. Therefore, the fraction of random-walk neurons classified as action-value neurons increases with the magnitude of the diffusion, which is directly related to the magnitude of correlations between spike counts in proximate trials (Figure 2D). The phenomenon of spurious significant correlations in time-series with temporal correlations has been described previously in the field of econometrics and a formal discussion of this issue can be found in (Granger and Newbold, 1974; Phillips, 1986).
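The effect can be demonstrated in a stripped-down setting that is deliberately simpler than our simulations. In the Python sketch below, a series is regressed on a second, independently generated series, and the fraction of pairs with |t| > 2 is counted. When both series are independent Gaussian samples this fraction should be close to the nominal 5%; when both are unbounded Gaussian random walks (an assumption chosen here as the simplest example of temporally correlated signals) it is expected to be far higher. All names and parameter values are illustrative.

```python
import math
import random

def t_value(x, y):
    """t-value of the slope in the simple regression y = b0 + b1*x + noise."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    residual_ss = sum((b - mean_y - slope * (a - mean_x)) ** 2 for a, b in zip(x, y))
    return slope / math.sqrt(residual_ss / (n - 2) / sxx)

def series(n, rng, drift):
    """Cumulative sum of Gaussian steps if drift is True, else independent samples."""
    values, level = [], 0.0
    for _ in range(n):
        step = rng.gauss(0.0, 1.0)
        level = level + step if drift else step
        values.append(level)
    return values

def false_positive_rate(drift, n_pairs=300, n_trials=200, seed=0):
    """Fraction of unrelated series pairs whose regression passes |t| > 2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_pairs):
        x = series(n_trials, rng, drift)
        y = series(n_trials, rng, drift)
        if abs(t_value(x, y)) > 2:
            hits += 1
    return hits / n_pairs
```

Comparing `false_positive_rate(drift=False)` with `false_positive_rate(drift=True)` shows how a threshold derived under independence becomes far too lenient once both signals drift slowly.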
Is this confound relevant to the question of action-value representation in the striatum?
Is a random-walk process a good description of striatal neurons’ activity?
The Gaussian random-walk process is just an example of a temporally correlated firing rate and we do not argue that the firing rates of striatal neurons follow such a process. However, any other type of temporal correlations, for example, oscillations or trends, will violate the independence-of-trials assumption, and may lead to the erroneous classification of neurons as representing action-values. Such temporal correlations can also emerge from stochastic learning. For example, in Figure 2—figure supplement 1 we consider a model of operant learning that is based on covariance-based synaptic plasticity (Loewenstein, 2008; Loewenstein, 2010; Loewenstein and Seung, 2006; Neiman and Loewenstein, 2013) and competition (Bogacz et al., 2006). Because such plasticity results in slow changes in the firing rates of the neurons, applying the analysis of Figure 1E to our simulations results in the erroneous classification of 43% of the simulated neurons as representing action-values. This is despite the fact that action-values are not computed as part of this learning, neither explicitly nor implicitly.
Are temporal correlations in neural recordings sufficiently strong to affect the analysis?
To test the relevance of this confound to experimentally-recorded neural activity, we repeated the analysis of Figure 2B,C on neurons recorded in two unrelated experiments: 89 neurons from extracellular recordings in the motor cortex of an awake monkey (Figure 2—figure supplement 2A–B) and 39 auditory cortex neurons recorded intracellularly in anaesthetized rats (Figure 2—figure supplement 2C–D; [Hershenhoren et al., 2014]). We regressed the spike counts on randomly matched estimated action-values from Figure 1E. In both cases we erroneously classified neurons as representing action-value in a fraction comparable to that reported in the striatum (36 and 23%, respectively).
Strong temporal correlations in the striatum
To test the relevance of this confound to striatal neurons, we considered previous recordings from neurons in the nucleus accumbens (NAc) and ventral pallidum (VP) of rats in an operant learning experiment (Ito and Doya, 2009) and regressed their spike counts on simulated, unrelated action-values (using more blocks and trials than in Figure 1E, see Figure legend). Note that although the recordings were obtained during an operant learning task, the action-values that we used in the regression were obtained from simulated experiments and were completely unrelated to the true experimental settings. Again, we erroneously classified a substantial fraction of neurons (43%) as representing action-values, a fraction comparable to that reported in the striatum (Figure 2—figure supplement 3).
Haven't previous publications acknowledged this confound and successfully addressed it?
We conducted an extensive literature search to see whether previous studies have identified this confound and addressed it (see Materials and methods). Two studies noted that processes such as slow drift in firing rate may violate the independence-of-trials assumption of the statistical tests and suggested unique methods to address this problem (Kim et al., 2013; Kim et al., 2009): one method (Kim et al., 2009) relied on permutation of the spike counts within a block (Figure 2—figure supplement 4, see Materials and methods) and another (Kim et al., 2013) used spikes in previous trials as predictors (Figure 2—figure supplement 5). However, both approaches still erroneously classify unrelated recorded and random-walk neurons as action-value neurons (Figure 2—figure supplements 4 and 5). The failure of both these approaches stems from the fact that a complete model of the learning-independent temporal correlations is lacking. As a result, these methods are unable to remove all the temporal correlations from the vector of spike counts.
Our literature search yielded four additional methods that have been used to identify action-value neurons. However, as depicted in Figure 2—figure supplement 6 (corresponding to the analyses in [Ito and Doya, 2009; Samejima et al., 2005]), Figure 2—figure supplement 7 (corresponding to the analysis in [Ito and Doya, 2015a]), Figure 2—figure supplement 8 (corresponding to the analysis in [Wang et al., 2013]) and Figure 2—figure supplement 9 (corresponding to a trial-design experiment in [FitzGerald et al., 2012]), all these additional methods erroneously classify neurons from unrelated recordings and random-walk neurons as action-value neurons in numbers comparable to those reported in the striatum (Figure 2—figure supplements 6–9). The fMRI analysis in (FitzGerald et al., 2012) focused on the difference between action-values rather than on the action-values themselves (see confound 2), and therefore we did not attempt to replicate it (and cannot attest to whether it is subject to the temporal correlations confound). We did, however, conduct the standard analysis on their unique experimental design: a trial-design experiment in which trials with different reward probabilities are randomly intermingled. Surprisingly, we erroneously detect action-value representation even when using this trial design (Figure 2—figure supplement 9). This erroneous detection occurs because in this analysis, the regression’s predictors are estimated action-values, which are temporally correlated. From this example it follows that even trial-design experiments may still be subject to the temporal correlations confound.
Some previous publications used more blocks. Shouldn’t adding blocks solve the problem?
In Figures 1 and 2 we considered a learning task composed of four blocks with a mean length of 174 trials (standard deviation 43 trials). It is tempting to believe that experiments with more blocks and trials (e.g., [Ito and Doya, 2009; Wang et al., 2013]) will be immune to this confound. The intuition is that the larger the number of trials, the less likely it is that a neuron that is not modulated by action-value (e.g., a random-walk neuron) will have a large regression coefficient on one of the action-values. Surprisingly, however, this intuition is wrong. In Figure 2—figure supplement 10 we show that doubling the number of blocks, so that the original blocks are repeated twice, each time in a random order, does not decrease the fraction of neurons erroneously classified as representing action-values. For the case of random-walk neurons, it can be shown that, contrary to this intuition, the fraction of erroneously identified action-value neurons is expected to increase with the number of trials (Phillips, 1986). This is because the expected variance of the regression coefficients under the null hypothesis is inversely proportional to the degrees of freedom, which increase with the number of trials. As a result, the threshold for classifying a regression coefficient as significant decreases with the number of trials.
Possible solutions to the temporal correlations confound
The temporal correlations confound has been acknowledged in the fMRI literature, and several methods have been suggested to address it, such as ‘prewhitening’ (Woolrich et al., 2001). However, these methods require prior knowledge, or an estimate, of the predictor-independent temporal correlations. Both are impractical for the slow timescale of learning and therefore are not applicable in the experiments we discussed.
Another suggestion is to assess the level of autocorrelations between trials in the data and to use it to predict the expected fraction of erroneous classification of action-value neurons. However, using such a measure is problematic in the context of action-value representation because the autocorrelations relevant for the temporal correlations confound are those associated with the timescale relevant for learning, that is, tens of trials. Computing such autocorrelations in experiments of a few hundreds of trials introduces substantial biases (Kohn, 2006; Newbold and Agiakloglou, 1993). Moreover, even when these autocorrelations are computed, it is not clear exactly how they can be used to estimate the expected false positive rate for action-value classification.
Finally, it has been suggested that the temporal correlation confound can be addressed by using repeating blocks and removing neurons whose activity is significantly different in identical blocks (Asaad et al., 2000; Mansouri et al., 2006). We tested this method using a design in which the four blocks of Figure 1 are repeated twice. However, even when this method was applied, a significant number of neurons were erroneously classified as representing action-values (Materials and methods).
We therefore propose two alternative approaches.
Permutation analysis
Trivially, an action-value neuron (or any task-related neuron) should be more strongly correlated with the action-value of the experimental session in which the neuron was recorded than with action-values of other sessions (recorded on different days). We propose to use this requirement in a permutation test, as depicted in Figure 3. We first consider the two simulated action-value neurons of Figure 1B. For each of the two neurons, we computed the t-values of the regression coefficients of the spike counts on each of the estimated action-values in all possible sessions (see Materials and methods). Figure 3A depicts the two resulting distributions of t-values. As a result of the temporal correlations, the 5% significance boundaries (vertical dashed lines), which are defined to be exceeded by exactly 5% of t-values in each distribution, are substantially larger (in absolute value) than 2, the standard significance boundaries. In this analysis, a neuron is significantly correlated with an action-value if the t-value of the regression on the action-value from its corresponding session exceeds the significance boundaries derived from the regression of its spike count on all possible action-values.
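The decision rule of this permutation test can be sketched as follows. The Python code assumes that the t-values of a neuron's regressions on action-values from other sessions have already been computed, and it places the boundaries so that 2.5% of these t-values fall beyond each one. Splitting the 5% equally between the two tails, as well as the function names, are assumptions of this illustration; the exact procedure is described in Materials and methods.

```python
import math

def permutation_bounds(null_t_values, level=0.05):
    """Lower and upper boundaries exceeded by about level/2 of the null t-values each."""
    ordered = sorted(null_t_values)
    n = len(ordered)
    lower = ordered[int(math.floor(level / 2 * n))]
    upper = ordered[int(math.ceil((1 - level / 2) * n)) - 1]
    return lower, upper

def is_significant(own_session_t, null_t_values, level=0.05):
    """True if the t-value from the neuron's own session lies outside the boundaries."""
    lower, upper = permutation_bounds(null_t_values, level)
    return own_session_t < lower or own_session_t > upper
```

For a neuron with strong temporal correlations, the null t-values are widely spread, so a t-value of 3 that would pass the standard threshold of 2 may fall well inside the permutation boundaries.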
Indeed, when considering the top (red) simulated action 1-value neuron, we find that its spike count has a significant regression coefficient on the estimated ${Q}_{1}$ from its session (red arrow) but not on the estimated ${Q}_{2}$ (blue arrow). Importantly, because the significance boundary exceeds 2, this approach is less sensitive than the original one (Figure 1) and indeed, the regression coefficients of the bottom simulated neuron (blue) do not exceed the significance level (red and blue arrows) and thus this analysis fails to identify it as an action-value neuron. Considering the population of simulated action-value neurons of Figure 1, this analysis identified 29% of them as such (Figure 3B, green), demonstrating that this analysis can identify action-value neurons. When considering the random-walk neurons (Figure 2), this method classifies only approximately 10% of the random-walk neurons as action-value neurons, as predicted by chance (Figure 3B, yellow). Similar results were obtained for the motor cortex and auditory cortex neurons (not shown).
Permutation analysis of basal ganglia neurons
Importantly, this permutation method can also be used to reanalyze the activity of previously recorded neurons. To that end, we considered the recordings reported in (Ito and Doya, 2009). The results of their model-free method (Figure 2—figure supplement 6) imply that approximately 23% of the recorded neurons represent action-values at different phases of the experiment. As a first step, we estimated the action-values and regressed the spike counts in the different phases of the experiment on the estimated action-values, as in Figure 1 (activity in each phase is analyzed as if it is a different neuron; see Materials and methods). The results of this analysis implied that 32% of the neurons represent action-values (p<0.01) (Figure 3—figure supplement 1). Next, we applied the permutation analysis. Remarkably, this analysis yielded that only 3.6% of the neurons have a significantly higher regression coefficient on an action-value from their session than on other action-values (Figure 3C). Similar results were obtained when performing a similar model-free permutation analysis (regression of spike counts in the last 20 trials of the block on reward probabilities, not shown). These results raise the possibility that all or much of the apparent action-value representation in (Ito and Doya, 2009) is the result of the temporal correlations confound.
Trialdesign experiments
Another way of overcoming the temporal correlations confound is to use a trial design experiment. The idea is to randomly mix the reward probabilities, rather than use blocks as in Figure 1. For example, we propose the experimental design depicted in Figure 4A. Each trial is presented in one of four clearly marked contexts (color coded). The reward probabilities associated with the two actions are fixed within a context but differ between the contexts. Within each context the participant learns to prefer the action associated with a higher probability of reward. Naively, we can regress the spike counts on the actionvalues estimated from behavior, as in Figure 1. However, because the estimated actionvalues are temporally correlated, this regression is still subject to the temporal correlations confound (Figure 2—figure supplement 9). Alternatively, we can regress the spike counts on the reward probabilities. If the contexts are randomly mixed, then by construction, the reward probabilities are temporally independent. These reward probabilities are the objective actionvalues. After learning, the subjective actionvalues are expected to converge to these reward probabilities. Therefore, the reward probabilities can be used as proxies for the subjective actionvalues after a sufficiently large number of trials. It is thus possible to conduct a regression analysis on the spike counts at the end of the experiment, with reward probabilities as predictors that do not violate the independence assumption.
To demonstrate this method, we simulated learning in a session composed of 400 trials, randomly divided into 4 different contexts (Figure 4). Learning followed the Qlearning equations (Equations 1 and 2), independently for each context. Next, we simulated actionvalue neurons, whose firing rate is a linear function of the actionvalue in each trial (dots in Figure 4A, upper panel). We regressed the spike counts of the neurons in the last 200 trials (approximately 50 trials in each context) on the corresponding reward probabilities (Figure 4B). Indeed, 59% of the neurons were classified this way as actionvalue neurons (Figure 4C, 9.5% is chance level). By contrast, considering randomwalk neurons, only 8.5% were erroneously classified as actionvalue neurons, a fraction expected by chance.
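A minimal sketch of such a simulation is given below. It is not the code used for Figure 4. The reward probabilities of the four contexts, the learning rate `ALPHA`, the choice parameter `BETA`, the standard delta-rule and logistic-choice form of Qlearning, and the firing-rate function of the hypothetical neuron are all assumptions made for illustration.

```python
import math
import random

rng = random.Random(1)

# Assumed reward probabilities (action 1, action 2) of the four cued contexts.
CONTEXTS = [(0.1, 0.5), (0.9, 0.5), (0.5, 0.9), (0.5, 0.1)]
ALPHA, BETA = 0.1, 2.5   # assumed learning rate and choice sensitivity
N_TRIALS = 400

def poisson(lam):
    """Draw a Poisson variate (Knuth's method; adequate for small rates)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def t_value(x, y):
    """t statistic of the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

q = [[0.5, 0.5] for _ in CONTEXTS]   # separate actionvalues for each context
probs, spikes = [], []
for _ in range(N_TRIALS):
    c = rng.randrange(len(CONTEXTS))             # contexts are randomly mixed
    p_choose_1 = 1.0 / (1.0 + math.exp(-BETA * (q[c][0] - q[c][1])))
    a = 0 if rng.random() < p_choose_1 else 1    # index 0 stands for action 1
    reward = 1.0 if rng.random() < CONTEXTS[c][a] else 0.0
    # Hypothetical action 1 value neuron: Poisson rate linear in Q1.
    spikes.append(poisson(2.0 + 6.0 * q[c][0]))
    probs.append(CONTEXTS[c][0])                 # objective value of action 1
    q[c][a] += ALPHA * (reward - q[c][a])        # update the chosen action only

# Regress late spike counts on the reward probability, not on the estimated Q.
late_probs, late_spikes = probs[-200:], spikes[-200:]
t_late = t_value(late_probs, late_spikes)
```

The predictor in the final regression is the reward probability of the randomly drawn context, so it carries no temporal structure even though the underlying actionvalues evolve slowly.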
Three previous studies used trialdesigns to search for actionvalue representation in the striatum (Cai et al., 2011; FitzGerald et al., 2012; Kim et al., 2012). In two of them (Cai et al., 2011; Kim et al., 2012) the reward probabilities were explicitly cued and therefore their results can be interpreted in the framework of cuevalues and not actionvalues (PadoaSchioppa, 2011). Moreover, all these studies focused on significant neural modulation by both actionvalues or by their difference, analyses that support state or policy representations (Ito and Doya, 2015a). As discussed in detail in the next section, policy representation can emerge without actionvalue representation (Darshan et al., 2014; Fiete et al., 2007; Frémaux et al., 2010; Loewenstein, 2008; Loewenstein, 2010; Loewenstein and Seung, 2006; Neiman and Loewenstein, 2013; Seung, 2003; Urbanczik and Senn, 2009). Therefore, the results reported in (Cai et al., 2011; FitzGerald et al., 2012; Kim et al., 2012) cannot be taken as evidence for actionvalue representation in the striatum.
Confound 2 – correlated decision variables
In the previous sections we demonstrated that irrelevant temporal correlations may lead to the erroneous classification of neurons as representing actionvalues, even if their activity is taskindependent. Here we address an unrelated confound. We show that neurons that encode different decision variables, in particular policy, may be erroneously classified as representing actionvalues. For clarity, we will commence by discussing this caveat independently of the temporal correlations confound. Specifically, we show that neurons whose firing rate encodes the policy (probability of choice) may be erroneously classified as representing actionvalues, even when this policy emerged in the absence of any implicit or explicit actionvalue representation. We will conclude by discussing a possible solution that addresses this and the temporal correlations confounds.
Policy without actionvalue representation
It is wellknown that operant learning can occur in the absence of any value computation, for example, as a result of directpolicy learning (Mongillo et al., 2014). Several studies have shown that rewardmodulated synaptic plasticity can implement directpolicy reinforcement learning (Darshan et al., 2014; Fiete et al., 2007; Frémaux et al., 2010; Loewenstein, 2008; Loewenstein, 2010; Loewenstein and Seung, 2006; Neiman and Loewenstein, 2013; Seung, 2003; Urbanczik and Senn, 2009).
For concreteness, we consider a particular reinforcement learning algorithm, in which the probability of choice $\mathrm{P}\mathrm{r}\left(a\left(t\right)=1\right)$ is determined by a single variable $W$ that is learned in accordance with the REINFORCE learning algorithm (Williams, 1992): $\mathrm{P}\mathrm{r}\left(a\left(t\right)=1\right)=\frac{1}{1+{e}^{-W\left(t\right)}}$, where $\Delta W\left(t\right)=\alpha \cdot \left(2\cdot R\left(t\right)-1\right)\cdot \left(a\left(t\right)-\mathrm{P}\mathrm{r}\left(a\left(t\right)=1\right)\right)$, $\alpha $ is the learning rate, $R\left(t\right)$ is the binary reward in trial $t$ and $a\left(t\right)$ is a binary variable indicating whether action 1 was chosen in trial $t$. In our simulations $W\left(t=1\right)=0$, $\alpha =0.17$. For biological implementation of this algorithm see (Loewenstein, 2010; Seung, 2003).
We tested this model in the experimental design of Figure 1 (Figure 5A). As expected, the model learned to prefer the action associated with a higher probability of reward, completing the four blocks within 228 trials on average (standard deviation 62 trials).
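The learning rule can be written out as a short simulation. The sketch below is a simplified illustration: it uses a single pair of fixed reward probabilities (an assumption made here for brevity) rather than the block structure of Figure 1, and the function name `simulate_direct_policy` is hypothetical. The initial condition and learning rate follow the values stated above.

```python
import math
import random

def simulate_direct_policy(p_reward_1, p_reward_2, n_trials, alpha=0.17, seed=0):
    """Simulate the single-variable policy learner; return Pr(a(t)=1) per trial."""
    rng = random.Random(seed)
    w = 0.0                                    # W(t=1) = 0
    policy = []
    for _ in range(n_trials):
        p1 = 1.0 / (1.0 + math.exp(-w))        # Pr(a(t)=1)
        policy.append(p1)
        a = 1 if rng.random() < p1 else 0      # a(t)=1 when action 1 is chosen
        p_reward = p_reward_1 if a == 1 else p_reward_2
        r = 1 if rng.random() < p_reward else 0
        w += alpha * (2 * r - 1) * (a - p1)    # REINFORCE update of W
    return policy

# Assumed example: action 1 is rewarded far more often than action 2.
policy = simulate_direct_policy(0.9, 0.1, n_trials=300)
```

No actionvalue appears anywhere in this simulation: the only learned quantity is $W$, which directly sets the probability of choice.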
Spike counts of neurons representing policy are correlated with estimated $\mathrm{\Delta}\mathit{Q}$
Despite the fact that the learning was valueindependent, we can still fit a Qlearning model to the behavior, extract bestfit model parameters and compute actionvalues (see also Figure 2—figure supplement 1). The computed actionvalues are presented in Figure 5B. Note that according to Equation 2, the probability of choice is a monotonic function of the difference between ${Q}_{1}$ and ${Q}_{2}$. Therefore, we expect that the probability of choice will be correlated with the computed ${Q}_{1}$ and ${Q}_{2}$, with opposite signs (Figure 5C).
We simulated policy neurons as Poisson neurons whose firing rate is a linear function of the policy $\mathrm{P}\mathrm{r}\left(a\left(t\right)=1\right)$ (Materials and methods). Next, we regressed the spike counts of these neurons on the two actionvalues that were computed from behavior (same as in Figures 1D,E and 2B,C, Figure 2—figure supplement 1C,D, Figure 2—figure supplement 2B,D and Figure 2—figure supplement 3). Indeed, as expected, 14% of the neurons were significantly correlated with both action values with opposite signs (chance level for each action value is 5%, naïve chance level for both with opposite signs is 0.125%, see Materials and methods), as depicted in Figure 5D,E. These results demonstrate that neurons representing valueindependent policy can be erroneously classified as representing $\mathrm{\Delta}Q$.
Neurons representing policy may be erroneously classified as actionvalue neurons
Surprisingly, 38% of policy neurons were significantly correlated with exactly one estimated actionvalue, and therefore would have been classified as actionvalue neurons in the standard method of analysis (9.5% chance level).
To understand why this erroneous classification emerged, we note that a neuron is classified as representing an actionvalue if its spike count is significantly correlated with one of the action values, but not with the other. The confound that led to the classification of policy neurons as representing actionvalues is that a lack of statistically significant correlation is erroneously taken to imply lack of correlation. All policy neurons are modulated by the probability of choice, a variable that is correlated with the difference in the two actionvalues. Therefore, this probability of choice is expected to be correlated with both actionvalues, with opposite signs. However, because the neurons are Poisson, the spike count of the neurons is a noisy estimate of the probability of choice. As a result, in most cases (86%), the regression coefficients do not cross the significance threshold for both actionvalues. More often (38%), only one of them crosses the significance threshold, resulting in an erroneous classification of the neurons as representing action values.
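The role of noise in this misclassification can be illustrated with a stripped-down simulation. To isolate this confound from the temporal correlations confound, the sketch below assumes that the two actionvalues are drawn independently in every trial, which is not the design of Figure 5. The firing-rate parameters, the logistic policy with gain 2.5, the threshold `T_CRIT` (approximately the two-sided 5% level for about 200 trials) and all function names are illustrative assumptions.

```python
import math
import random

rng = random.Random(2)
T_CRIT = 1.97                  # approximate two-sided 5% threshold, ~200 trials
N_TRIALS, N_NEURONS = 200, 200

def poisson(lam):
    """Draw a Poisson variate (Knuth's method; adequate for small rates)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def two_predictor_t(x1, x2, y):
    """t statistics of both slopes in the regression y ~ 1 + x1 + x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    syy = sum(v * v for v in cy)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    resid_var = (syy - b1 * s1y - b2 * s2y) / (n - 3)
    return (b1 / math.sqrt(resid_var * s22 / det),
            b2 / math.sqrt(resid_var * s11 / det))

counts = [0, 0, 0]             # neurons with 0, 1 or 2 significant coefficients
for _ in range(N_NEURONS):
    q1 = [rng.random() for _ in range(N_TRIALS)]
    q2 = [rng.random() for _ in range(N_TRIALS)]
    # Policy neuron: the rate depends only on the probability of choice.
    spikes = [poisson(2.0 + 1.5 / (1.0 + math.exp(-2.5 * (a - b))))
              for a, b in zip(q1, q2)]
    t1, t2 = two_predictor_t(q1, q2, spikes)
    counts[(abs(t1) > T_CRIT) + (abs(t2) > T_CRIT)] += 1

frac_exactly_one = counts[1] / N_NEURONS
```

Every simulated neuron here is equally modulated by both actionvalues, yet a sizable fraction passes the significance test for exactly one of them, simply because each test has limited power.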
Is this confound relevant to the question of actionvalue representation in the striatum?
If choice is included as a predictor, is policy representation still a relevant confound?
It is common (although not ubiquitous) to attempt to differentiate actionvalue representation from choice representation by including choice as another regressor in the regression model (Cai et al., 2011; FitzGerald et al., 2012; Funamizu et al., 2015; Her et al., 2016; Ito and Doya, 2015a; Ito and Doya, 2015b; Kim et al., 2013; Kim et al., 2009; Kim et al., 2012; Lau and Glimcher, 2008). Such analyses may be expected to exclude policy neurons, whose firing rate is highly correlated with choice, from being classified as actionvalue neurons. However, repeating this analysis for the policy neurons of Figure 5, we still erroneously classify 36% of policy neurons as actionvalue neurons (Figure 5—figure supplement 1A).
An alternative approach has been to consider only those neurons whose spike count is not significantly correlated with choice (Stalnaker et al., 2010; Wunderlich et al., 2009). Repeating this analysis for the Figure 5 policy neurons, we still find that 24% of the neurons are erroneously classified as actionvalue neurons (8% are classified as policy neurons).
Is this confound the result of an analysis that is biased against policy representation?
The analysis depicted in Figures 1D,E, 2B,C, 4B–E and 5D,E is biased towards classifying neurons as actionvalue neurons, at the expense of state or policy neurons, as noted by (Wang et al., 2013). This is because actionvalue classification is based on a single significant regression coefficient whereas policy or state classification requires two significant regression coefficients. Therefore, (Wang et al., 2013) have proposed an alternative approach. First, compute the statistical significance of the whole regression model for each neuron (using fvalue). Then, classify those significant neurons according to the tvalues corresponding to the two actionvalues (Figure 5—figure supplement 1B). Applying this analysis to the policy neurons of Figure 5 with a detection threshold of 5% we find that indeed, this method is useful in detecting which decision variables are more frequently represented (its major use in [Wang et al., 2013]): 25% of the neurons are classified as representing policy (1.25% expected by chance). Nevertheless, 12% of the neurons are still erroneously classified as actionvalue neurons (2.5% expected by chance; Figure 5—figure supplement 1B).
Additional issues
In many cases, the term actionvalue was used, while the reported results were equally consistent with other decision variables. In some cases, significant correlation with both actionvalues (with opposite signs) or significant correlation with the difference between the actionvalues was used as evidence for ‘actionvalue representations’ (FitzGerald et al., 2012; GuitartMasip et al., 2012; Kim et al., 2012; 2007; Stalnaker et al., 2010). Similarly, other papers did not distinguish between neurons whose activity is significantly correlated with one actionvalue and those whose activity is correlated with both actionvalues (Funamizu et al., 2015; Her et al., 2016; Kim et al., 2013; Kim et al., 2009). Finally, one study used a concurrent variableinterval schedule, in which the magnitudes of rewards associated with each action were anticorrelated (Lau and Glimcher, 2008). In such a design, the two probabilities of reward depend on past choices and therefore, the objective values associated with the actions change on a trialbytrial basis and are, in general, correlated.
A possible solution to the policy confound
The policy confound emerged because policy and actionvalues are correlated. To distinguish between the two possible representations, we should seek a variable that is correlated with the actionvalue but uncorrelated with the policy. Consider the sum of the two actionvalues. It is easy to see that $\mathrm{C}\mathrm{o}\mathrm{r}\mathrm{r}\left({Q}_{1}+{Q}_{2},{Q}_{1}-{Q}_{2}\right)\propto \mathrm{V}\mathrm{a}\mathrm{r}\left({Q}_{1}\right)-\mathrm{V}\mathrm{a}\mathrm{r}\left({Q}_{2}\right)$. Therefore, if the variances of the two actionvalues are equal, their sum is uncorrelated with their difference. An actionvalue neuron is expected to be correlated with the sum of actionvalues. By contrast, a policy neuron, modulated by the difference in actionvalues is expected to be uncorrelated with this sum.
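The proportionality follows from the bilinearity of the covariance. Written out as a short check, with no assumption beyond finite, non-zero variances of the sum and of the difference:

$$\mathrm{Cov}\left({Q}_{1}+{Q}_{2},{Q}_{1}-{Q}_{2}\right)=\mathrm{Var}\left({Q}_{1}\right)-\mathrm{Cov}\left({Q}_{1},{Q}_{2}\right)+\mathrm{Cov}\left({Q}_{2},{Q}_{1}\right)-\mathrm{Var}\left({Q}_{2}\right)=\mathrm{Var}\left({Q}_{1}\right)-\mathrm{Var}\left({Q}_{2}\right)$$

$$\mathrm{Corr}\left({Q}_{1}+{Q}_{2},{Q}_{1}-{Q}_{2}\right)=\frac{\mathrm{Var}\left({Q}_{1}\right)-\mathrm{Var}\left({Q}_{2}\right)}{\sqrt{\mathrm{Var}\left({Q}_{1}+{Q}_{2}\right)\mathrm{Var}\left({Q}_{1}-{Q}_{2}\right)}}$$

The cross terms cancel regardless of how strongly ${Q}_{1}$ and ${Q}_{2}$ are correlated with each other, so equality of the two variances is the only condition needed.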
We repeated the simulations of Figure 4 (which addresses the temporal correlations confound), considering three types of neurons: actionvalue neurons (of Figure 1), randomwalk neurons (of Figure 2), and policy neurons (of Figure 5). As in Figure 4, we considered the spike counts of the three types of neurons in the last 200 trials of the session, but now we regressed them on the sum of reward probabilities (state; in this experimental design the reward probabilities are also the objective actionvalues, which the subject learns). We found that only 4.5 and 6% of the randomwalk and policy neurons, respectively, were significantly correlated with the sum of reward probabilities (5% chance level). By contrast, 47% of the actionvalue neurons were significantly correlated with this sum.
This method is able to distinguish between policy and actionvalue representations. However, it will fail in the case of state representation because both state and actionvalues are correlated with the sum of probabilities of reward. To dissociate between state and actionvalue representations, we can consider the difference in reward probabilities because this difference is correlated with the actionvalues but is uncorrelated with the state. Regressing the spike count on both the sum and difference of the probabilities of reward, a randomwalk neuron is expected to be correlated with neither, a policy neuron is expected to be correlated only with the difference, whereas an actionvalue neuron is expected to be correlated with both (this analysis is inspired by Fig. S8b in (Wang et al., 2013) in which the predictors in the regression model were policy and state). We now classify a neuron that passes both significance tests as an actionvalue neuron. Indeed, for a significance threshold of p<0.05 (for each test), only 0.2% of the randomwalk neurons and 5% of the policy neurons were classified as actionvalue neurons. By contrast, 32% of the actionvalue neurons were classified as such (Figure 6). Note that in this analysis, we have support for the hypothesis that there is actionvalue representation, rather than policy or state representation, only when more than 5% of the neurons are classified as actionvalue neurons.
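The two-test classification can be sketched as follows. This is an illustration under simplifying assumptions, not the simulation behind Figure 6: learning is taken to have fully converged, so firing rates depend directly on the reward probabilities; the four reward-probability pairs are chosen here so that the two actions have equal variance; and the firing-rate functions, the threshold `T_CRIT` and the function names are hypothetical.

```python
import math
import random

rng = random.Random(3)
# Assumed reward probabilities (action 1, action 2); equal variance per action.
CONTEXTS = [(0.1, 0.5), (0.9, 0.5), (0.5, 0.9), (0.5, 0.1)]
T_CRIT = 1.97                  # approximate two-sided 5% threshold, ~200 trials
N_TRIALS, N_NEURONS = 200, 200

def poisson(lam):
    """Draw a Poisson variate (Knuth's method; adequate for small rates)."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def two_predictor_t(x1, x2, y):
    """t statistics of both slopes in the regression y ~ 1 + x1 + x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, cy))
    s2y = sum(b * v for b, v in zip(c2, cy))
    syy = sum(v * v for v in cy)
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    resid_var = (syy - b1 * s1y - b2 * s2y) / (n - 3)
    return (b1 / math.sqrt(resid_var * s22 / det),
            b2 / math.sqrt(resid_var * s11 / det))

def value_rate(p1, p2):
    """Hypothetical action 1 value neuron (converged learning assumed)."""
    return 2.0 + 3.0 * p1

def policy_rate(p1, p2):
    """Hypothetical policy neuron: depends only on the difference."""
    return 2.0 + 3.0 / (1.0 + math.exp(-2.5 * (p1 - p2)))

def fraction_classified(rate_fn):
    """Fraction of neurons significant for BOTH the sum and the difference."""
    hits = 0
    for _ in range(N_NEURONS):
        trials = [CONTEXTS[rng.randrange(len(CONTEXTS))] for _ in range(N_TRIALS)]
        sums = [p1 + p2 for p1, p2 in trials]
        diffs = [p1 - p2 for p1, p2 in trials]
        spikes = [poisson(rate_fn(p1, p2)) for p1, p2 in trials]
        t_sum, t_diff = two_predictor_t(sums, diffs, spikes)
        hits += abs(t_sum) > T_CRIT and abs(t_diff) > T_CRIT
    return hits / N_NEURONS

frac_value = fraction_classified(value_rate)
frac_policy = fraction_classified(policy_rate)
```

In this idealized setting the policy neurons pass the sum test only at about the false-positive rate of that test, which is why a classified fraction near 5% cannot be distinguished from pure policy representation.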
A word of caution is that the analysis should be performed only after the learning converges. This is because stochastic fluctuations in the learning process may be reflected in the activities of neurons representing decisionrelated variables. As a result, policy or staterepresenting neurons may appear correlated with the orthogonal variables. For the same reason, any blockrelated heterogeneity in neural activity could also result in this confound (O'Doherty, 2014).
To conclude, it is worthwhile repeating the key features of the analysis proposed in this section:
Trial design is necessary because otherwise temporal correlations in spike count may inflate the fraction of neurons that pass the significance tests.
Regression should be performed on reward probabilities (i.e., the objective actionvalues) and not on estimated actionvalues. The reason is that because the estimated actionvalues evolve over time, this trial design does not eliminate all temporal correlations between them (Figure 2—figure supplement 9).
Reward probabilities associated with the two actions should be chosen such that their variances are equal. Otherwise policy or state neurons may be erroneously classified as actionvalue neurons.
Discussion
In this paper, we performed a systematic literature search to discern the methods that have been previously used to infer the representation of actionvalues in the striatum. We showed that none of these methods overcome two critical confounds: (1) neurons with temporal correlations in their firing rates may be erroneously classified as representing actionvalues and (2) neurons whose activity covaries with other decision variables, such as policy, may also be erroneously classified as representing actionvalues. Finally, we discuss possible experiments and analyses that can address the question of whether neurons encode actionvalues.
Temporal correlations and actionvalue representations
It is well known in statistics that the regression coefficient between two independent slowlychanging variables is on average larger (in absolute value) than this coefficient when the series are devoid of a temporal structure. If these temporal correlations are overlooked, the probability of a falsepositive is underestimated (Granger and Newbold, 1974). When searching for actionvalue representation in a block design, then by construction, there are positive correlations in the predictor (actionvalues). Positive temporal correlations in the dependent variable (neural activity) will result in an inflation of the falsepositive observations, compared with the naïve expectation.
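This effect is easy to reproduce numerically. The sketch below regresses pairs of independent series on each other and counts how often the slope is nominally significant, once for series without temporal structure and once for random walks. The series length, the number of pairs, the threshold `t_crit` (approximately the two-sided 5% level for 200 samples) and the function names are illustrative assumptions.

```python
import math
import random

rng = random.Random(4)

def slope_t(x, y):
    """t statistic of the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def white_noise(n):
    """Series with no temporal structure."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def random_walk(n):
    """Slowly changing series: cumulative sum of independent steps."""
    x, out = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def fraction_significant(make_series, n_pairs=300, n_trials=200, t_crit=1.97):
    """Fraction of independent pairs whose regression is nominally significant."""
    hits = 0
    for _ in range(n_pairs):
        x, y = make_series(n_trials), make_series(n_trials)
        hits += abs(slope_t(x, y)) > t_crit
    return hits / n_pairs

frac_white = fraction_significant(white_noise)   # expected near 0.05
frac_walk = fraction_significant(random_walk)    # expected far above 0.05
```

The two series in every pair are independent by construction, so any excess of significant regressions for the random walks reflects only their temporal structure.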
This confound occurs only when there are temporal correlations in both the predictor and the dependent variable. In a trial design, in which the predictor is chosen independently in each trial and thus has no temporal structure, we do not expect this confound. However, when studying incremental learning, it is difficult to randomize the predictor in each trial, making the task of identifying neural correlates of learning, and specifically actionvalues, challenging. With respect to the dependent variable (neural activity), temporal correlations in BOLD signal and their consequences have been discussed (Arbabshirani et al., 2014; Woolrich et al., 2001). Considering electrophysiological recordings, there have been attempts to remove these correlations, for example, using previous spike counts as predictors (Kim et al., 2013). However, these are not sufficient because they are unable to remove all taskindependent temporal correlations (see also Figure 2—figure supplements 4–10). When repeating these analyses, we erroneously classified a fraction of neurons as representing actionvalue that is comparable to that reported in the striatum. The probability of a falsepositive identification of a neuron as representing actionvalue depends on the magnitude and type of temporal correlations in the neural activity. Therefore, we cannot predict the fraction of erroneously classified neurons expected in various experimental settings and brain areas.
One may argue that the fact that actionvalue representations are reported mostly in a specific brain area, namely the striatum, is an indication that their identification there is not a result of the temporal correlations confound. However, because different brain regions are characterized by different spiking statistics, we expect different levels of erroneous identification of actionvalue neurons in different parts of the brain and in different experimental settings. Indeed, the fraction of erroneously identified actionvalue neurons differed between the auditory and motor cortices (compare B and D within Figure 2—figure supplement 2). Furthermore, many studies reported actionvalue representation outside of the striatum, in brain areas including the supplementary motor area and presupplementary eye fields (Wunderlich et al., 2009), the substantia nigra/ventral tegmental area (GuitartMasip et al., 2012) and ventromedial prefrontal cortex, insula and thalamus (FitzGerald et al., 2012).
Considering the ventral striatum, our analysis on recordings from (Ito and Doya, 2009) indicates that the identification of actionvalue representations there may have been erroneous, resulting from temporally correlated firing rates (Figure 3 and Figure 2—figure supplement 3). It should be noted that the fraction of actionvalue neurons reported in (Ito and Doya, 2009) is low relative to other publications, a difference that has been attributed to the location of the recording in the striatum (ventral as opposed to dorsal). It would be interesting to apply this method to other striatal recordings (Ito and Doya, 2015a; Samejima et al., 2005; Wang et al., 2013). We were unable to directly analyze these recordings from the dorsal striatum because relevant raw data is not publicly available. However, previous studies have reported that the firing rates of dorsalstriatal neurons change slowly over time (Gouvêa et al., 2015; Mello et al., 2015). As a result, identification of apparent actionvalue representation in dorsalstriatal neurons may also be the result of this confound.
Temporal correlations naturally emerge in experiments composed of multiple trials. Participants become satiated, bored, tired, etc., which may affect neuronal activity. In particular, learning in operant tasks is associated, by construction, with variables that are temporally correlated. If neural activity is correlated with performance (e.g., accumulated rewards in the last several trials) then it is expected to have temporal correlations, which may lead to an erroneous classification of the neurons as representing actionvalues.
Temporal correlations – beyond actionvalue representation
Actionvalues are not the only example of slowlychanging variables. Any variable associated with incremental learning, motivation or satiation is expected to be temporally correlated. Even 'benign' behavioral variables, such as the location of the animal or the activation of different muscles may change at relatively long timescales. When recording neural activity related to these variables, any temporal correlations in the neural recording, be it in fMRI, electrophysiology or calcium imaging may result in an erroneous identification of correlates of these behavioral variables because of the temporal correlation confound.
In general, the temporal correlation confound can be addressed by using the permutation analysis of Figure 3, which can provide strong support to the claim that the activity of a particular neuron or voxel covaries with the behavioral variable. Therefore, the permutation test is a general solution for scientists studying slow processes such as learning. More challenging, however, is precisely identifying what the activity of the neuron represents (for example an actionvalue or policy). There are no easy solutions to this problem and therefore caution should be applied when interpreting the data.
Differentiating actionvalue from other decision variables
Another difficulty in identifying actionvalue neurons is that they are correlated with other decision variables such as policy, state or chosenvalue. Therefore, finding a neuron that is significantly correlated with an actionvalue could be the byproduct of its being modulated by other decision variables, in particular policy. The problem is exacerbated by the fact that standard analyses (e.g., Figure 1D–E) are biased towards classifying neurons as representing actionvalues at the expense of policy or state.
As shown in Figure 6, policy representation can be ruled out by finding a representation that is orthogonal to policy, namely state representation. This solution leads us, however, to a serious conceptual issue. All analyses discussed so far are based on significance tests: we divide the space of hypotheses into the ‘scientific claim’ (e.g., neurons represent actionvalues) and the null hypothesis (e.g., neural activity is independent of the task). An observation that is not consistent with the null hypothesis is taken to support the alternative hypothesis.
The problem we faced with correlated variables is that the null hypothesis and the ‘scientific claim’ were not complementary. A neuron that represents policy is expected to be inconsistent with the null hypothesis that neural activity is independent of the task but it is not an actionvalue neuron. The solution proposed was to devise a statistical test that seeks to identify a representation that is correlated with actionvalue and is orthogonal to the policy hypothesis, in order to also rule out a policy representation.
However, this does not rule out other decisionrelated representations. A ‘pure’ actionvalue neuron is modulated only by ${Q}_{1}$ or by ${Q}_{2}$. A ‘pure’ policy neuron is modulated exactly by ${Q}_{1}-{Q}_{2}$. More generally, we may want to consider the hypothesis that the neuron is modulated by a different combination of the action values, $a\cdot {Q}_{1}+b\cdot {Q}_{2}$, where $a$ and $b$ are parameters. For every such set of parameters $a$ and $b$ we can devise a statistical test to reject this hypothesis by considering the direction that is orthogonal to the vector $\left(a,b\right)$. In principle, this procedure should be repeated for every pair of parameters $a$ and $b$ that is not consistent with the actionvalue hypothesis.
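A short check, using only the bilinearity of the covariance, makes explicit when the combination along the orthogonal direction $\left(b,-a\right)$ is statistically uncorrelated with the hypothesized combination:

$$\mathrm{Cov}\left(a{Q}_{1}+b{Q}_{2},b{Q}_{1}-a{Q}_{2}\right)=ab\left[\mathrm{Var}\left({Q}_{1}\right)-\mathrm{Var}\left({Q}_{2}\right)\right]+\left({b}^{2}-{a}^{2}\right)\mathrm{Cov}\left({Q}_{1},{Q}_{2}\right)$$

The first term vanishes when the two variances are equal. The second term vanishes when ${Q}_{1}$ and ${Q}_{2}$ are uncorrelated or when $\left|a\right|=\left|b\right|$. For the ‘pure’ policy case, $a=1$ and $b=-1$, the orthogonal combination is $-\left({Q}_{1}+{Q}_{2}\right)$ and the second term is zero, so equal variances suffice. For other values of $a$ and $b$, the two actionvalues should in addition be uncorrelated across trials.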
Put differently, in order to find neurons that represent actionvalues, we first need to define the set of parameters $a$ and $b$ such that a neuron whose activity is modulated by $a\cdot {Q}_{1}+b\cdot {Q}_{2}$ will be considered as representing an actionvalue. Only after this (arbitrary) definition is given, can we construct a set of statistical tests that will rule out the competing hypotheses, namely will rule out all values of $a$ and $b$ that are not in this set. The analysis of Figure 6 implicitly defined the set of $a$ and $b$ such that $a\ne b$ and $a\ne -b$ as the set of parameters that defines actionvalue representations. In practice, it is already very challenging to identify actionvalues using the procedure of Figure 6 and going beyond it seems impractical. Therefore, studying the distribution of tvalues across the population of neurons may be more useful when studying representations of decision variables than asking questions about the significance of individual neurons.
Importantly, the regression models described in this paper allow us to investigate only some types of representations, namely, linear combinations of the two actionvalues. However, value representations in learning models may fall outside of this regime. It has been suggested that in decision making, subjects calculate the ratio of actionvalues (Worthy et al., 2008), or that subjects compute, for each action, the probability that it is associated with the highest value (Morris et al., 2014). Our proposed solution cannot support or refute these alternative hypotheses. If these are taken as additional alternative hypotheses, a neuron should be classified as representing an actionvalue if its activity is also significantly modulated in the directions that are correlated with actionvalue and are orthogonal to these hypotheses. Clearly, it is never possible to construct an analysis that can rule out all possible alternatives.
We believe that the confounds that we described have been overlooked because the null hypothesis in the significance tests was not made explicit. As a result, the complementary hypothesis was not explicitly described and the conclusions drawn from rejecting the null hypothesis were too specific. That is, alternative plausible interpretations were ignored. It is important, therefore, to keep the alternative hypotheses explicit when analyzing the data, be it using significance tests or other methods, such as model comparison (Ito and Doya, 2015b).
Are actionvalue representations a necessary part of decision making?
One may argue that the question of whether neurons represent actionvalue, policy, state or some other correlated variable is not an interesting question. This is because all these correlated decision variables implicitly encode actionvalues. Even directpolicy models can be taken to implicitly encode actionvalues because policy is correlated with the difference between the actionvalues. However, we believe that the difference between actionvalue representation and representation of other variables is an important one, because it centers on the question of the computational model that underlies decision making in these tasks. Specifically, the implication of a finding that a population of neurons represents actionvalues is not that these neurons are involved somehow in decision making. Rather, we interpret this finding as supporting the hypothesis that actionvalues are explicitly computed in the brain, and that these actionvalues play a specific role in the decision making process. However, if the results are also consistent with various alternative computational models then this is not the case. Some consider actionvalue computation to be a necessary part of decision making. By contrast, however, we presented here two models of learning and decision making that do not entail this computation (Figure 2—figure supplement 1, Figure 5). Other examples are discussed in (Mongillo et al., 2014; Shteingart and Loewenstein, 2014) and references therein.
Other indications for actionvalue representation
Several trialdesign experiments have associated cues with upcoming rewards and reported representations of expected reward, the upcoming action, or the interaction of action and reward (Cromwell and Schultz, 2003; Cromwell et al., 2005; Hassani et al., 2001; Hori et al., 2009; Kawagoe et al., 1998; Pasquereau et al., 2007). Another trialdesign experiment reported representation of offervalue and chosenvalue in the orbitofrontal cortex (PadoaSchioppa and Assad, 2006). While such studies do not provide direct evidence for actionvalue representation, they do provide evidence for representation of closely related decision variables (but see [O'Doherty, 2014]).
The involvement of the basal ganglia in general and the striatum in particular in operant learning, planning and decisionmaking is well documented (Ding and Gold, 2010; McDonald and White, 1993; O'Doherty et al., 2004; Palminteri et al., 2012; Schultz, 2015; Tai et al., 2012; Thorn et al., 2010; Yarom and Cohen, 2011). However, there are alternatives to the possibility that the firing rate of striatal neurons represents actionvalues. First, as discussed above, learning and decision making do not entail actionvalue representation. Second, it is possible that actionvalue is represented elsewhere in the brain. Finally, it is also possible that the striatum plays an essential role in learning, but that the representation of decision variables there is distributed and neural activity of single neurons could reflect a complex combination of valuerelated features, rather than ‘pure’ decision variables. Such complex representations are typically found in artificial neural networks (Yamins and DiCarlo, 2016).
Action-value representation in the striatum requires further evidence
Considering the literature, both confounds have been partially acknowledged. Moreover, there have been some attempts to address them. However, as discussed above, even when these confounds were acknowledged and solutions were proposed, these solutions do not prevent the erroneous identification of action-value representation (see Figure 2—figure supplements 4, 5 and 10, Figure 5—figure supplement 1). We therefore conclude that, to the best of our knowledge, all studies that have claimed to provide direct evidence that neuronal activity in the striatum is specifically modulated by action-value were either susceptible to the temporal correlations confound (Funamizu et al., 2015; Ito and Doya, 2009; Ito and Doya, 2015a; Ito and Doya, 2015b; Kim et al., 2013; Kim et al., 2009; Lau and Glimcher, 2008; Samejima et al., 2005; Wang et al., 2013), or reported results in a manner indistinguishable from policy (Cai et al., 2011; FitzGerald et al., 2012; Funamizu et al., 2015; Guitart-Masip et al., 2012; Her et al., 2016; Kim et al., 2013; Kim et al., 2009; Kim et al., 2012; Kim et al., 2007; Stalnaker et al., 2010; Wunderlich et al., 2009). Many studies presented action-value and policy representations separately, but were subject to the second confound (Ito and Doya, 2009; Ito and Doya, 2015a; Ito and Doya, 2015b; Lau and Glimcher, 2008; Samejima et al., 2005). Furthermore, it should be noted that not all studies investigating the relation between striatal activity and action-value representation have reported positive results. Several studies have reported that striatal activity is more consistent with direct-policy learning than with action-value learning (FitzGerald et al., 2014; Li and Daw, 2011) and one noted that lesions to the dorsal striatum do not impair action-value learning (Vo et al., 2014).
Finally, we would like to emphasize that we do not claim that there is no representation of action-value in the striatum. Rather, our results show that special caution should be applied when relating neural activity to reinforcement-learning-related variables. Therefore, the prevailing belief that neurons in the striatum represent action-values must await further tests that address the confounds discussed in this paper.
Materials and methods
Literature search
In order to thoroughly examine the finding of action-value neurons in the striatum, we conducted a literature search to find all the different approaches used to identify action-value representation in the striatum and see whether they are subject to at least one of the two confounds we described here.
The key words ‘action-value’ and ‘striatum’ were searched for in Web of Knowledge, PubMed and Google Scholar, returning 43, 21 and 980 results, respectively. In the first screening stage, we excluded all publications that did not report new experimental results (e.g., reviews and theoretical papers), focused on other brain regions, or did not address value representation or learning. In the remaining publications, the abstract of the publication was read and the body of the article was searched for ‘action-value’ and ‘striatum’. After this step, articles in which a description of action-value representation in the striatum could be found were read thoroughly. The search included PhD theses, but none were found to report new relevant data not already found in papers. We identified 22 papers that directly related neural activity in the striatum to action-values. These papers included reports of single-unit recordings, fMRI experiments and manipulations of striatal activity.
Of these, two papers have used the term action-value to refer to the value of the chosen action (also known as chosen-value) (Day et al., 2011; Seo et al., 2012) and therefore we do not discuss them.
An additional study (Pasquereau et al., 2007) used the expected reward and the chosen action as predictors of the neuronal activity and found neurons that were modulated by the expected reward, the chosen action and their interaction. The authors did not claim that these neurons represent action-values, but it is possible that these neurons were modulated by the values of specific actions. However, the representation of the value of the action when the action is not chosen is a crucial part of action-value representation which differentiates it from the representation of expected reward, and the values of the actions when they were not chosen were not analyzed in this study. Therefore, the results of this study cannot be taken as an indication for action-value representation, rather than other decision variables.
A second group of 11 papers did not distinguish between action-value and policy representations (Cai et al., 2011; Funamizu et al., 2015; Her et al., 2016; Kim et al., 2013; Kim et al., 2009; Wunderlich et al., 2009), or reported policy representation (FitzGerald et al., 2012; Guitart-Masip et al., 2012; Kim et al., 2012; Kim et al., 2007; Stalnaker et al., 2010) in the striatum and therefore their findings do not necessarily imply action-value representation, rather than policy representation in the striatum (see confound 2).
In two additional papers, it was shown that the activation of striatal neurons changes animals’ behavior, and the results were interpreted in the action-value framework (Lee et al., 2015; Tai et al., 2012). However, a change in policy does not entail an action-value representation (see, for example, Figure 5 and Figure 2—figure supplement 1). Therefore, these papers were not taken as strong support for the striatal action-value representation hypothesis.
Finally, six papers correlated action-values, separately from other decision variables, with neuronal activity in the striatum (Ito and Doya, 2009; Ito and Doya, 2015a; Ito and Doya, 2015b; Lau and Glimcher, 2008; Samejima et al., 2005; Wang et al., 2013). All of them used electrophysiological recordings of single units in the striatum. Of these papers, only one utilized an analysis which is not biased towards identifying action-value neurons at the expense of policy and state neurons (Wang et al., 2013). All papers used block-design experiments where action-values are temporally correlated.
Taken together, we concluded that previous reports on action-value representation in the striatum could reflect the representation of other decision variables or temporal correlations in the spike count that are not related to action-value learning.
The action-value neurons model (Figure 1, Figure 4)
To model neurons whose firing rate is modulated by an action-value, we considered neurons whose firing rate changes according to:
where $f\left(t\right)$ is the firing rate in trial $t$, $B=2.5$ Hz is the baseline firing rate, ${Q}_{i}\left(t\right)$ is the action-value associated with one of the actions $i\in \left\{1,2\right\}$, $K=2.35$ Hz is the maximal modulation and $r$ denotes the neuron-specific level of modulation, drawn from a uniform distribution, $r\sim U\left[-1,1\right]$. The spike count in a trial was drawn from a Poisson distribution, assuming a 1-sec-long trial.
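As a rough illustration of this kind of model, the following Python sketch draws Poisson spike counts from a firing rate that is modulated by an action-value. The linear form `B + K * r * (q - 0.5)`, the function names and the Knuth-style Poisson sampler are assumptions made here for illustration; this is not the authors' MATLAB code and the exact modulation function may differ.

```python
import math
import random

B = 2.5   # baseline firing rate (Hz), value taken from the text
K = 2.35  # maximal modulation (Hz), value taken from the text


def poisson(lam, rng):
    """Draw one Poisson sample with mean lam (Knuth's method, fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def firing_rate(q, r):
    """ASSUMED linear modulation of the rate by an action-value q in [0, 1]."""
    return B + K * r * (q - 0.5)


def simulate_spike_counts(q_values, rng):
    """Return (r, counts): one spike count per trial for a single simulated neuron.

    Trials are assumed to last 1 s, so the rate in Hz equals the Poisson mean.
    """
    r = rng.uniform(-1.0, 1.0)  # neuron-specific modulation, r ~ U[-1, 1]
    counts = [poisson(firing_rate(q, r), rng) for q in q_values]
    return r, counts
```

For example, `simulate_spike_counts([0.1, 0.5, 0.9], random.Random(0))` returns the sampled modulation level together with three spike counts, one per trial.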
The policy neurons model (Figure 5)
To model neurons whose firing rate is modulated by the policy, we considered neurons whose firing rate changes according to:
where $f\left(t\right)$ is the firing rate in trial $t$, $B=2.5$ Hz is the baseline firing rate, and $\mathrm{Pr}\left(a\left(t\right)=1\right)$ is the probability of choosing action 1 in trial $t$, which changes in accordance with REINFORCE (Williams, 1992) (see also Figure 5 and corresponding text). $K=3$ Hz is the maximal modulation and $r$ denotes the neuron-specific level of modulation, drawn from a uniform distribution, $r\sim U\left[-1,1\right]$. The spike count in a trial was drawn from a Poisson distribution, assuming a 1-sec-long trial.
The covariance neurons model (Figure 2—figure supplement 1)
In the covariance-based plasticity model the decision-making network is composed of two populations of Poisson neurons: each neuron is characterized by its firing rate and the spike count of a neuron in a trial (1 sec) is randomly drawn from a Poisson distribution. The chosen action corresponds to the population that fires more spikes in a trial (Loewenstein, 2010; Loewenstein and Seung, 2006). At the end of the trial, the firing rate of each of the neurons (in the two populations) is updated according to $f\left(t+1\right)=f\left(t\right)+\eta \cdot R\left(t\right)\cdot \left(s\left(t\right)-f\left(t\right)\right)$, where $f\left(t\right)$ is the firing rate in trial $t$, $\eta =0.07$ is the learning rate, $R\left(t\right)$ is the reward delivered in trial $t$ ($R\left(t\right)\in \left\{0,1\right\}$ in our simulations) and $s\left(t\right)$ is the measured (realized) firing rate in that trial, that is, the spike count in the trial. The initial firing rate of all simulated neurons is 2.5 Hz. The network model was tested in the operant learning task of Figure 1. A session was terminated (without further analysis) if the model was not able to choose the better option more than 14 out of 20 consecutive times for at least 200 trials in the same block. This occurred in 20% of the sessions. We simulated two populations of 1,000 neurons in 500 successful sessions. Note that because, on average, the empirical firing rate is equal to the true firing rate, $f\left(t\right)=\langle s\left(t\right)\rangle$, changes in the firing rate are driven, on average, by the covariance of reward and the empirical firing rate: $\langle \Delta f\left(t\right)\rangle \equiv \langle f\left(t+1\right)-f\left(t\right)\rangle =\eta \cdot \mathrm{cov}\left(R\left(t\right),s\left(t\right)\right)$ (Loewenstein and Seung, 2006).
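The update rule above can be sketched in a few lines of Python. This is a minimal illustration of a single trial, not the authors' implementation: the function names, the random tie-breaking when both populations fire the same number of spikes, and the Knuth-style Poisson sampler are assumptions, and the session-termination criterion is omitted.

```python
import math
import random

ETA = 0.07  # learning rate, value taken from the text


def poisson(lam, rng):
    """Draw one Poisson sample with mean lam (Knuth's method, fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def run_trial(rates, reward_probs, rng):
    """Simulate one trial and update the firing rates in place.

    rates: two lists of firing rates (Hz), one list per population.
    reward_probs: reward probability of action 0 and of action 1.
    Returns the chosen action (0 or 1) and the reward (0 or 1).
    """
    # Spike counts in a 1-s trial: one Poisson draw per neuron.
    counts = [[poisson(f, rng) for f in pop] for pop in rates]
    totals = [sum(c) for c in counts]
    if totals[0] == totals[1]:
        action = rng.randrange(2)  # ASSUMPTION: ties are broken at random
    else:
        action = 0 if totals[0] > totals[1] else 1
    reward = 1 if rng.random() < reward_probs[action] else 0
    # f(t+1) = f(t) + eta * R(t) * (s(t) - f(t)), applied to every neuron.
    for pop, pop_counts in zip(rates, counts):
        for j, s in enumerate(pop_counts):
            pop[j] += ETA * reward * (s - pop[j])
    return action, reward
```

Because the update equals $(1-\eta R)\cdot f+\eta R\cdot s$ with $\eta<1$ and $s\ge 0$, the simulated rates stay non-negative, and they do not change at all on unrewarded trials.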
The estimated action-values in Figure 2—figure supplement 1 were computed from the actions and rewards of the covariance model by assuming the Q-learning model (Equations 1 and 2).
The motor cortex recordings (Figure 2—figure supplement 2)
The data in Figure 2—figure supplement 2A–B were recorded by Oren Peles in Eilon Vaadia's lab. They were recorded from one female monkey (Macaca fascicularis) at 3 years of age, using a 10 × 10 microelectrode array (Blackrock Microsystems) with 0.4 mm interelectrode distance. The array was implanted in the arm area of M1, under anesthesia and aseptic conditions.
Behavioral Task: The monkey sat in a behavioral setup, awake and performing a combined Brain Machine Interface (BMI) and sensorimotor task. Spikes and Local Field Potentials (LFP) were extracted from the raw signals of 96 electrodes. The BMI was provided through real-time communication between the data acquisition system and custom-made software, which obtained the neural data, analyzed it and provided the monkey with the desired visual and auditory feedback, as well as the food reward. Each trial began with a visual cue, instructing the monkey to make a small hand movement to express alertness. The monkey was conditioned to enhance the power of beta band frequencies (20–30 Hz) extracted from the LFP signal of 2 electrodes, receiving visual feedback from the BMI algorithm. When a required threshold was reached, the monkey received one of 2 visual cues and, following a delay period, had to report which of the cues it saw by pressing one of two buttons. Food reward and auditory feedback were delivered based on correctness of report. The duration of a trial was on average 14.2 s. The inter-trial interval was 3 s following a correct trial and 5 s after error trials. The data used in this paper consist of the spiking activity of 89 neurons recorded during the last second of inter-trial intervals, taken from 600 consecutive trials in one recording session. Pairwise correlations were comparable to those previously reported (Cohen and Kohn, 2011), ${r}_{SC}=0.047\pm 0.17$ (SD) (${r}_{SC}=0.037\pm 0.21$ for pairs of neurons recorded from the same electrode).
Animal care and surgical procedures complied with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and with guidelines defined by the Institutional Committee for Animal Care and Use at the Hebrew University.
The auditory cortex recordings (Figure 2—figure supplement 2)
The auditory cortex recordings appearing in Figure 2—figure supplement 2C–D are described in detail in (Hershenhoren et al., 2014). In short, membrane potential was recorded intracellularly in the auditory cortex of halothane-anesthetized rats. The data consist of 125 experimental sessions recorded from 39 neurons. Each session consisted of 370 pure tone bursts. Tone duration was 50 ms with 5 ms linear rise/fall ramps. In the data presented here, trials began 50 ms prior to the onset of the tone burst. For each session, all trials were either 300 ms or 500 ms long. Trial length remained identical throughout a session and depended on the smallest interval between two tones in each session. Spike events were identified following high-pass filtering with a corner frequency of 30 Hz. Local maxima that were larger than 60 times the median of the absolute deviation from the median (MAD) were classified as spikes. The data presented here consist only of the spike counts in each trial, rather than the full membrane potential trace.
The basal ganglia recordings (Figure 3 and Figure 2—figure supplement 3)
The basal ganglia recordings that are analyzed in Figure 3 and Figure 2—figure supplement 3 are described in detail in (Ito and Doya, 2009). In short, rats performed a combination of a tone discrimination task and a reward-based free-choice task. Extracellular voltage was recorded in the behaving rats from the NAc and VP using an electrode bundle. Spike sorting was done using principal component analysis. In total, 148 NAc and 66 VP neurons across 52 sessions were used for analyses (in 18 of the 70 behavioral sessions there were no neural recordings).
Estimation of action-values from choices and rewards
To imitate experimental procedures, we regressed the spike counts on estimates of the action-values, rather than the subjective action-values that underlay model behavior (to which the experimentalist has no direct access). To that end, for each session, we assumed that ${Q}_{i}\left(1\right)=0.5$ and found the set of parameters $\widehat{\alpha}$ and $\widehat{\beta}$ that yielded the estimated action-values that best fit the sequences of actions in each experiment by maximizing the likelihood of the sequence. Action-values were estimated from Equation 1, using these estimated parameters and the sequence of actions and rewards. Overall, the estimated values of the parameters $\alpha $ and $\beta $ were comparable to the actual values used: on average, $\widehat{\alpha}=0.12\pm 0.09$ (standard deviation) and $\widehat{\beta}=2.6\pm 0.7$ (compare with $\alpha =0.1$ and $\beta =2.5$).
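The estimation procedure can be sketched as follows. Equations 1 and 2 are not reproduced in this section, so the sketch assumes the standard forms: a Q-learning update of the chosen action only, $Q_a \leftarrow Q_a+\alpha (R-Q_a)$, and a softmax choice rule with inverse temperature $\beta$. The grid search stands in for whatever optimizer the authors used, and all function names are hypothetical. Actions are coded 0 and 1 here rather than 1 and 2.

```python
import math


def estimate_action_values(actions, rewards, alpha, q0=0.5):
    """Return the estimated (Q1, Q2) at the start of every trial, with Q_i(1) = q0."""
    q = [q0, q0]
    history = []
    for a, r in zip(actions, rewards):
        history.append((q[0], q[1]))
        q[a] += alpha * (r - q[a])  # ASSUMED Equation 1: only the chosen action is updated
    return history


def log_likelihood(actions, rewards, alpha, beta):
    """Log-likelihood of the action sequence under the assumed softmax rule."""
    ll = 0.0
    history = estimate_action_values(actions, rewards, alpha)
    for (q1, q2), a in zip(history, actions):
        p_first = 1.0 / (1.0 + math.exp(-beta * (q1 - q2)))  # ASSUMED Equation 2
        ll += math.log(p_first if a == 0 else 1.0 - p_first)
    return ll


def fit(actions, rewards):
    """Maximum-likelihood (alpha, beta) on a coarse grid (an illustrative choice)."""
    grid_alpha = [i / 100 for i in range(1, 100)]  # 0.01 ... 0.99
    grid_beta = [i / 10 for i in range(1, 101)]    # 0.1 ... 10.0
    return max(
        ((a, b) for a in grid_alpha for b in grid_beta),
        key=lambda ab: log_likelihood(actions, rewards, ab[0], ab[1]),
    )
```

Calling `estimate_action_values(actions, rewards, alpha_hat)` with the fitted learning rate then yields the per-trial estimates that serve as regressors.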
Exclusion of neurons
Following standard procedures (Samejima et al., 2005), a sequence of spike counts, either simulated or experimentally measured, was excluded due to low firing rate if the mean spike count in all blocks was smaller than 1. This procedure excluded 0.02% (4/20,000) of the random-walk neurons and 0.03% (285/1,000,000) of the covariance-based plasticity neurons. Considering the auditory cortex recordings, we assigned each of the 125 spike counts to 40 randomly selected sessions. 23% of the neural recordings (29/125) were excluded in all 40 sessions. Because blocks are defined differently in different sessions, some neural recordings were excluded only when assigned to some sessions but not others. Of the remaining 96 recordings, 14% of the recordings × sessions were also excluded. Similarly, considering the basal ganglia neurons, we assigned each of the 642 recordings (214 × 3 phases) to 40 randomly selected sessions. 11% (74/(214 × 3)) of the recordings were excluded in all 40 sessions. Of the remaining 568 recordings, 9% of the recordings × sessions were also excluded. None of the simulated action-value neurons (0/20,000) or the motor cortex neurons (0/89) were excluded.
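A minimal sketch of this exclusion criterion is given below. It reads "in all blocks" as "in every block separately", which is an interpretation rather than something the text states explicitly; the function name and the argument layout are hypothetical.

```python
def exclude_low_firing(spike_counts, block_ids, threshold=1.0):
    """Return True if the mean spike count is below threshold in every block.

    spike_counts: one spike count per trial.
    block_ids: the block label of each trial, taken from the assigned session.
    """
    blocks = {}
    for s, b in zip(spike_counts, block_ids):
        blocks.setdefault(b, []).append(s)
    # ASSUMPTION: the criterion is applied to each block's mean separately.
    return all(sum(v) / len(v) < threshold for v in blocks.values())
```

Because the block boundaries come from the session to which a recording is assigned, the same spike-count sequence can be excluded under one session and kept under another, as noted above.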
Statistical analyses
The computation of the t-values of the regression of the spike counts on the estimated action-values (as in Figures 1, 2 and 5 and Figure 2—figure supplements 1–3) was done using the following regression model:
$$s\left(t\right)={\beta}_{0}+{\beta}_{1}{Q}_{1}\left(t\right)+{\beta}_{2}{Q}_{2}\left(t\right)+\epsilon \left(t\right)$$
where $s\left(t\right)$ is the spike count in trial $t$, ${Q}_{1}\left(t\right)$ and ${Q}_{2}\left(t\right)$ are the estimated action-values in trial $t$, $\epsilon \left(t\right)$ is the residual error in trial $t$ and ${\beta}_{0-2}$ are the regression parameters.
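For readers who want to see what such a regression computes, here is a self-contained ordinary-least-squares sketch in Python that returns the t-values of the intercept and of the two action-value coefficients. It is an illustration only, written for this three-column design; the analyses in the paper used `regstats` in MATLAB, and the function names below are hypothetical.

```python
import math


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regression_t_values(s, q1, q2):
    """t-values of beta_0, beta_1, beta_2 in s(t) = b0 + b1*Q1(t) + b2*Q2(t) + eps(t)."""
    X = [[1.0, a, b] for a, b in zip(q1, q2)]
    n, k = len(X), 3
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(X, s)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y - sum(b * x for b, x in zip(beta, row)) for row, y in zip(X, s)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance, n - 3 degrees of freedom
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return [b / e for b, e in zip(beta, se)]
```

Note that these t-values assume independent residuals across trials, which is precisely the assumption that temporally correlated spike counts violate.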
The computation of the t-values of the regression of the spike counts on the reward probabilities in the trial design experiment (as in Figure 4) was done using the following regression model:
$$s\left(t\right)={\beta}_{0}+{\beta}_{1}{RP}_{1}\left(t\right)+{\beta}_{2}{RP}_{2}\left(t\right)+\epsilon \left(t\right)\qquad (7)$$
where $t$ denotes the trial. Only the last 200 trials of the session were analyzed. $s\left(t\right)$ is the mean spike count, ${RP}_{1}\left(t\right)$ and ${RP}_{2}\left(t\right)$ are the reward probabilities corresponding to action 1 or action 2, respectively (in this experimental design $RP$ could be 0.1, 0.5 or 0.9), $\epsilon \left(t\right)$ is the residual error and ${\beta}_{0-2}$ are the regression parameters.
The computation of the t-values of the regression of the spike counts on state and policy in a trial design experiment (as in Figure 6) was done using the following regression model:
All variables and parameters are the same as in Equation 7.
All regression analyses were done using regstats in MATLAB (version 2016A).
To compare the spike counts of the example neurons in the last 20 trials of each block (Figure 1B; Figure 2—figure supplement 1B; Figure 2—figure supplement 2A; Figure 2—figure supplement 2C; Figure 2A), we executed the Wilcoxon rank sum test, using ranksum in MATLAB. All tests were two-tailed.
Significance of t-values slightly depends on session length. For the session lengths we considered, 0.05 significance bounds varied between 1.962 and 1.991. For consistency, we chose a single conservative bound of 2. Similarly, 0.025 and 0.01 significance bounds were chosen to be 2.3 and 2.64, respectively.
For all significance boundaries the false positive thresholds were computed naively, that is, assuming the analysis is not confounded in any way and that the two predictors are not correlated with each other. For example, assuming the false positive rate from a single t-test for a significant regression coefficient is $P$, for the standard analysis, the false positive rate for each action-value classification was defined as $P\cdot \left(1-P\right)$, and the false positive rate was equal for state and policy classification and was defined as ${P}^{2}/2$. In Figure 6 the false positive rate computed for random-walk neurons was ${P}^{2}/2$ for each action-value classification, and the false positive rate computed for state or policy neurons was $P/2$ for each action-value classification.
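As a worked illustration of these naive rates, take $P=0.05$. This value is chosen here only to make the arithmetic concrete; it corresponds to the 0.05 significance bound mentioned above and is not a result reported in the paper.

$$\begin{aligned} P\cdot \left(1-P\right) &= 0.05\times 0.95 = 0.0475,\\ {P}^{2}/2 &= 0.0025/2 = 0.00125,\\ P/2 &= 0.025. \end{aligned}$$

Under these assumptions, about 4.75% of neurons would be expected to be classified by chance as representing each action-value in the standard analysis, and about 0.125% as representing state or policy.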
Permutation test (Figure 3)
For each action-value and random-walk neuron, we computed the t-values of the regressions of its spike count on estimated action-values from the sessions of Figure 1E. Because the number of trials can affect the distribution of t-values, we only considered in our analysis the first 170 trials of the 504 sessions that were at least 170 trials long. This number, which is approximately the median of the distribution of the number of trials per session, was chosen as a compromise between the number of trials per session and the number of sessions. When performing the permutation test on the basal ganglia data we included all recordings and only the first 332 trials in each session, which is the smallest number of trials used in a session in this dataset.
Two points are noteworthy. First, the distribution of the t-values of the regression of the spike count of a neuron on all action-values depends on the neuron (see difference between distributions in Figure 3A). Similarly, the distribution of the t-values of the regression of the spike counts of all neurons on an action-value depends on the action-value (not shown). Therefore, the analysis could be biased in favor of (or against) finding action-value neurons if the number of neurons analyzed from each session (which are therefore associated with the same action-values) differs between sessions. Second, this analysis does not address the correlated decision variables confound.
Finally, we would like to point out that there is an alternative way of performing the permutation test, which is applicable when the number of sessions is small, while the number of neurons recorded in a session is large. Instead of comparing the t-values from the regression of a neuron on different action-values, one can compare the t-values from different neurons on the same action-value. However, this method is only applicable under the assumption that the temporal correlations that are not related to action-value in the neuronal activity are similar between sessions.
Comparison with permuted spike counts (Figure 2—figure supplement 4)
In Figure 2—figure supplement 4 we considered the experiment and analysis described in (Kim et al., 2009). That experiment consisted of four blocks, each associated with a different pair of reward probabilities, (0.72, 0.12), (0.12, 0.72), (0.21, 0.63) and (0.63, 0.21), appearing in a random order, with the better option changing location with each block change. The number of trials in a block was preset, ranging between 35 and 45 with a mean of 40 (this is unlike the experiment described in Figure 1, in which termination of a block depended on performance).
First, we used Equations 1 and 2 to model learning behavior in this protocol. Then, we estimated the action-values according to choice and reward sequences, as in Figure 1. These estimated action-values were used for regression of the spike counts of the random-walk, motor cortex, auditory cortex, and basal ganglia neurons in the following way: each spike count sequence was randomly assigned to a particular pair of estimated action-values from one session. The spike count sequence was regressed on these estimated action-values. The resultant t-values were compared with the t-values of 1000 regressions of the spike count, permuted within each block, on the same action-values. The p-value of this analysis was computed as the percentage of t-values from the permuted spike counts that were higher in absolute value than the t-value from the regression of the original spike count. The significance boundary was set at p<0.025 (Kim et al., 2009). Neurons with at least one significant regression coefficient (rather than exactly one significant regression coefficient) were classified as action-value modulated neurons (Kim et al., 2009).
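The within-block permutation procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the paper: the function names are hypothetical, and `statistic` is a placeholder for any function that maps a spike-count sequence to a single value, such as the t-value of one regression coefficient on fixed action-values.

```python
import random


def permute_within_blocks(spike_counts, block_ids, rng):
    """Shuffle spike counts among trials of the same block, keeping block membership."""
    permuted = list(spike_counts)
    positions = {}
    for idx, b in enumerate(block_ids):
        positions.setdefault(b, []).append(idx)
    for idxs in positions.values():
        values = [spike_counts[i] for i in idxs]
        rng.shuffle(values)
        for i, v in zip(idxs, values):
            permuted[i] = v
    return permuted


def permutation_p_value(spike_counts, block_ids, statistic, n_perm=1000, seed=0):
    """Fraction of permuted statistics that exceed the observed one in absolute value."""
    rng = random.Random(seed)
    observed = abs(statistic(spike_counts))
    exceed = sum(
        abs(statistic(permute_within_blocks(spike_counts, block_ids, rng))) > observed
        for _ in range(n_perm)
    )
    return exceed / n_perm
```

Shuffling within blocks preserves differences in mean spike count between blocks and removes only the trial order inside each block, so slow drifts in firing rate that span blocks survive the permutation.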
ANOVA tests for comparisons between blocks, excluding ‘drifting’ neurons
Following (Asaad et al., 2000) we conducted an additional analysis with repeating blocks. We simulated learning behavior in the same experiment as in Figure 2—figure supplement 10. This experiment is composed of 8 blocks: the 4 blocks of Figure 1, repeated twice, in random permutation. We restricted our analysis to the 438 sessions with 332 trials or fewer (332 trials is the shortest session in the basal ganglia recording). Each spike count was analyzed 40 times, using 40 randomly assigned sessions. For each block, we restricted the analysis to the neuronal activity in the last 20 trials of the block.
First, we conducted four one-way ANOVAs (using MATLAB’s anova1) to compare the neuronal activities in blocks associated with the same action-values (e.g., the neuronal activity in the two blocks in which reward probabilities were (0.1,0.5)). Neurons were excluded from further analysis if we found a significant difference in their firing rates in at least one of these comparisons (df(columns)=1, df(error)=38, p<0.1). This procedure excludes from further analysis ‘drifting’ neurons, whose spike count significantly varied in the session.
Next, for each action-value we conducted a one-way ANOVA (using MATLAB’s anova1), which compared the neuronal activity between the two blocks in which the action-value was 0.1 and the two blocks in which the action-value was 0.9 (df(columns)=1, df(error)=78, p<0.01). We classified neurons as representing action-values if there was a significant difference between their firing rates for one action-value but not for the other.
Despite the removal of ‘drifting’ neurons, this analysis yielded an erroneous classification of action-value neurons in all datasets: random-walk neurons, 18%; motor cortex neurons, 12%; auditory cortex neurons, 5%; basal ganglia neurons, 9%. This is despite the fact that the expected false positive rate is only 2%. These results indicate that the exclusion of ‘drifting’ neurons as in (Asaad et al., 2000) does not solve the temporal correlations confound.
Data from the motor cortex, auditory cortex, and basal ganglia were the same as in Figure 2—figure supplements 2–3. Data for random-walk neurons included 1000 newly simulated neurons, using the same parameters as in Figure 2 (this was done to create enough trials in each simulated spike count).
Data and code availability
The data of the basal ganglia recordings from (Ito and Doya, 2009) is available online at https://groups.oist.jp/ncu/data and was analyzed with permission from the authors. Motor cortex data (recorded by Oren Peles in Eilon Vaadia's lab) and auditory cortex data (taken from the recordings in (Hershenhoren et al., 2014)) is available at https://github.com/lotemelber/striatalactionvalueneuronsreconsideredcodes (Elber-Dorozko and Loewenstein, 2018). The custom MATLAB scripts used to create simulated neurons and to analyze simulated and recorded neurons are also available at https://github.com/lotemelber/striatalactionvalueneuronsreconsideredcodes (Elber-Dorozko and Loewenstein, 2018; copy archived at https://github.com/elifesciencespublications/striatalactionvalueneuronsreconsideredcodes).
References

Task-specific neural activity in the primate prefrontal cortex. Journal of Neurophysiology 84:451–459. https://doi.org/10.1152/jn.2000.84.1.451

Measuring and interpreting neuronal correlations. Nature Neuroscience 14:811–819. https://doi.org/10.1038/nn.2842

Relative reward processing in primate striatum. Experimental Brain Research 162:520–525. https://doi.org/10.1007/s002210052223z

Effects of expectations for different reward magnitudes on neuronal activity in primate striatum. Journal of Neurophysiology 89:2823–2838. https://doi.org/10.1152/jn.01014.2002

Interference and shaping in sensorimotor adaptations with rewards. PLoS Computational Biology 10:e1003377. https://doi.org/10.1371/journal.pcbi.1003377

Nucleus accumbens neurons encode predicted and ongoing reward costs in rats. European Journal of Neuroscience 33:308–321. https://doi.org/10.1111/j.14609568.2010.07531.x

Caudate encodes multiple computations for perceptual decisions. Journal of Neuroscience 30:15747–15759. https://doi.org/10.1523/JNEUROSCI.289410.2010

Model of birdsong learning based on gradient estimation by dynamic perturbation of neural conductances. Journal of Neurophysiology 98:2038–2057. https://doi.org/10.1152/jn.01311.2006

Action-specific value signals in reward-related regions of the human brain. Journal of Neuroscience 32:16417–16423. https://doi.org/10.1523/JNEUROSCI.325412.2012

Functional requirements for reward-modulated spike-timing-dependent plasticity. Journal of Neuroscience 30:13326–13337. https://doi.org/10.1523/JNEUROSCI.624909.2010

Spurious regressions in econometrics. Journal of Econometrics 2:111–120. https://doi.org/10.1016/03044076(74)900347

Influence of expectation of different rewards on behavior-related neuronal activity in the striatum. Journal of Neurophysiology 85:2477–2489. https://doi.org/10.1152/jn.2001.85.6.2477

Intracellular correlates of stimulus-specific adaptation. Journal of Neuroscience 34:3303–3319. https://doi.org/10.1523/JNEUROSCI.216613.2014

Neuronal encoding of reward value and direction of actions in the primate putamen. Journal of Neurophysiology 102:3530–3543. https://doi.org/10.1152/jn.00104.2009

Validation of decision-making models and analysis of decision variables in the rat basal ganglia. Journal of Neuroscience 29:9861–9874. https://doi.org/10.1523/JNEUROSCI.615708.2009

Parallel representation of value-based and finite state-based strategies in the ventral and dorsal striatum. PLoS Computational Biology 11:e1004540. https://doi.org/10.1371/journal.pcbi.1004540

Expectation of reward modulates cognitive signals in the basal ganglia. Nature Neuroscience 1:411–416. https://doi.org/10.1038/1625

Role of striatum in updating values of chosen actions. Journal of Neuroscience 29:14701–14712. https://doi.org/10.1523/JNEUROSCI.272809.2009

Prefrontal and striatal activity related to values of objects and locations. Frontiers in Neuroscience 6:108. https://doi.org/10.3389/fnins.2012.00108

Encoding of action history in the rat ventral striatum. Journal of Neurophysiology 98:3548–3556. https://doi.org/10.1152/jn.00310.2007

Autocorrelation and cross-correlation methods. In: Akay M, editor. Wiley Encyclopedia of Biomedical Engineering. John Wiley & Sons. pp. 260–283. https://doi.org/10.1002/9780471740360.ebs0094

Signals in human striatum are appropriate for policy update rather than value prediction. Journal of Neuroscience 31:5504–5511. https://doi.org/10.1523/JNEUROSCI.631610.2011

Robustness of learning that is based on covariance-driven synaptic plasticity. PLoS Computational Biology 4:e1000007. https://doi.org/10.1371/journal.pcbi.1000007

Synaptic theory of replicator-like melioration. Frontiers in Computational Neuroscience 4:17. https://doi.org/10.3389/fncom.2010.00017

Efficient coding and the neural representation of value. Annals of the New York Academy of Sciences 1251:13–32. https://doi.org/10.1111/j.17496632.2012.06496.x

A triple dissociation of memory systems: hippocampus, amygdala, and dorsal striatum. Behavioral Neuroscience 107:3–22. https://doi.org/10.1037/07357044.107.1.3

A scalable population code for time in the striatum. Current Biology 25:1113–1122. https://doi.org/10.1016/j.cub.2015.02.036

The misbehavior of reinforcement learning. Proceedings of the IEEE 102:528–541. https://doi.org/10.1109/JPROC.2014.2307022

The problem with value. Neuroscience & Biobehavioral Reviews 43:259–268. https://doi.org/10.1016/j.neubiorev.2014.03.027

Neurobiology of economic choice: a good-based model. Annual Review of Neuroscience 34:333–359. https://doi.org/10.1146/annurevneuro061010113648

Shaping of motor responses by incentive values through the basal ganglia. Journal of Neuroscience 27:1176–1183. https://doi.org/10.1523/JNEUROSCI.374506.2007

Understanding spurious regressions in econometrics. Journal of Econometrics 33:311–340. https://doi.org/10.1016/03044076(86)900011

Neuronal reward and decision signals: from theories to data. Physiological Reviews 95:853–951. https://doi.org/10.1152/physrev.00023.2014

Reinforcement learning and human behavior. Current Opinion in Neurobiology 25:93–98. https://doi.org/10.1016/j.conb.2013.12.004

Neural correlates of stimulus-response and response-outcome associations in dorsolateral versus dorsomedial striatum. Frontiers in Integrative Neuroscience 4:12. https://doi.org/10.3389/fnint.2010.00012

Reinforcement learning in populations of spiking neurons. Nature Neuroscience 12:250–252. https://doi.org/10.1038/nn.2264

Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience 19:356–365. https://doi.org/10.1038/nn.4244
Decision letter

Timothy E Behrens, Reviewing Editor; University of Oxford, United Kingdom
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Striatal action-value neurons reconsidered" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
Elber-Dorozko and Loewenstein examine issues using trial-by-trial spike data to determine whether neural activity is associated with action-values. First, they note that standard regression inference can lead to false detection of action-value correlations when samples are not independent. They illustrate the prevalence of false detections using simulation and data with no plausible relation to action-values. Second, they note that different reinforcement learning models without action-value representations can yield significant action-value coding when analyzed using prior approaches. The authors present well-thought-out analyses and data that highlight analytic, experimental and conceptual difficulties in identifying action-value coding. Although both issues the authors examine have been acknowledged in the literature, and different approaches have been made to deal with or minimize their potential effects, the authors' paper represents an important synthesis and analysis that will likely lead to clearer future experiments.
However, there were several key concerns.
Firstly, the reviewers thought that there should be a more careful attention to the wider literature. In the reviewer discussion, this was thought to be of particular importance as you are making a technical point in a journal with a broad readership. It is particularly important that you are clear about which parts of the literature your results speak to.
For example:
1) The authors conclude that current methods of studying actionvalues are confounded, and propose that experiments should use actions whose reward values are not learned over time during the experiment, and instead are indicated by sensory cues with their values picked randomly on each trial.
However, the authors leave unmentioned the large literature of studies taking an alternate approach: recording striatal activity during planning and execution of instructed actions for cued reward outcomes, in which each trial randomizes the instructed action, the reward, or both. The authors should discuss what implications this literature has for whether the striatum encodes actionvalues vs. policies vs. other variables. Examples include the work from the labs of Schultz (e.g. Hassani et al., 2001; Cromwell and Schultz, 2003; Cromwell et al., 2005), Hikosaka (e.g. Kawagoe et al., 1998; Lauwereyns et al., Neuron 2004), and Kimura (e.g. Hori et al., 2009).
Most of these studies did not specifically claim that the activity they reported represents "actionvalues" in the reinforcement learning sense (and hence the present authors shouldn't feel obligated to try to debunk them), but they do seem highly relevant to the larger question the present authors raise. These studies did attempt to test whether neurons represented actions, values, and notably, their interaction (e.g. a cell whose activity scales more with action A's value than action B's value), which resembles the concept of "actionvalue".
Also, these studies may be somewhat resistant to the authors' criticisms about confounds from temporal correlations (since rewards were either explicitly cued, or kept deterministic and stable in welllearned blocks of trials, rather than slowly fluctuating during extended learning) and confounds with action probability (since the actions were instructed and hence a priori equally probable on each trial).
Of particular interest, a paper by Pasquereau et al. (2007) seems to fulfill all the requirements the present authors set for a test of striatal actionvalue coding; if so, this seems worthy of mention. That study manipulated the reward value of four actions (up, down, left, right), randomly assigning their reward probabilities on each trial and indicating them with visual cues. Unfortunately, as the present authors note, the study did not explicitly analyze their results in the actionvalue vs. chosenvalue framework. However, the paper did report that some neurons had significant action x value interactions – for example, a cell that is more active when planning rightward movements (action), with stronger activity when the planned movement was more valuable (value), and with this valuemodulation greater for rightward movements than other movements (action x value). This is not a pure chosenvalue signal as the present authors seem to claim that paper reported. One could argue that it contains a key feature of an actionvalue signal as the value modulation is strongest for one specific action.
2) The authors correctly pointed out that some earlier studies of action value used a suboptimal task design and their conclusions need to be more rigorously validated. However, in the broader field, the potential risk of "drift" in neural recording has been well recognized. For example, "Neurophysiological experiments that compare activity across different blocks of trials must make efforts to be confident that any neural effects are not the result of artifacts of that design, such as slowwave changes in neural activity over time." (Asaad et al., 2000). In the same Asaad et al. paper, a better design with repeated, alternated block types was used, similar in concept to randomized block design that the authors proposed here. Such designs have also been used in many neural studies of cognition – to name just two examples: value manipulations (Lauwereyns et al., 2002), rule manipulations (Mansouri et al., 2006). The problem thus seems relatively limited to one type of analysis that introduces temporal correlation across trials in an effort to estimate Q values. By the authors' account, this amounts to 5 papers from 3 different labs.
3) What about previous results arguing for prominence of a specific type of value representation? The authors touch on this, but it would be helpful to discuss specific results. In particular, the cited study of Wang et al., Nat Neuro 2013 reported that their unbiased angular measure of DMS value coding was distributed significantly nonuniformly, with net value (ΣQ) coding more prevalent than other types (their Figure 7). In contrast, the null hypothesis simulations in this paper predict very different results, either a uniform distribution (Figure 2—figure supplement 8) or a dearth of ΣQ neurons (Figure 5—figure supplement 1). The authors should discuss whether this previous result can therefore still be interpreted as evidence of value coding (at least, netvalue coding), rather than strictly policy coding, in the striatum.
(Also, it is odd that the authors cite Wang et al. as a study that "claimed to provide direct evidence that neuronal activity in the striatum is specifically modulated by actionvalue", since the main result was specifically finding prevalence of netvalue, not actionvalue coding).
4) I found the authors' choice of basal ganglia data misleading (Ito and Doya, 2009). First, because these data are recordings from the nucleus accumbens and ventral pallidum, which are not the first basal ganglia structures one thinks of as encoding action values. Second, because the original authors of the study from which the data was obtained noted that action value coding was low in these structures, leading them to suggest that action value coding was not the primary function of the nucleus accumbens and ventral pallidum. This is mentioned in the Results subsection “Permutation analysis of basal ganglia neurons”, but should be noted in the Discussion (the current text in the Results could probably just be moved).
5) Previous methods dealing with trial correlations have different success at reducing false positive rate of detecting actionvalues. In particular, the method of Wang et al. (2013) comes very close to attaining the correct size of the test for actionvalues. Indeed, it appears to be the only existing method from which one would reasonably conclude that the ventral striatal data set analyzed probably does not exhibit much actionvalue coding (23% above the expected 5%). I think it would be useful to have a figure in the main text comparing the different methods to the authors' permutation test (using for example, just the basal ganglia data set). In addition, Wang et al.’s method is also pretty good at identifying policy neurons, which is important because it could be applied retrospectively to existing data sets.
6) The authors' biggest suggestion for rigorously detecting a neural action value representation is "Don't use a task with learning, use a trialbased design where subjects associate contexts with welllearned sets of action values". That is perfectly fine for scientists whose goal is specifically to test whether a brain area encodes action values. However, what about the many scientists whose explicit goal is to study neural representation of timevarying values, and hence need to use learning tasks? Many scientists are studying (1) the neural basis of value learning, (2) brain areas specifically involved in early learning (not welltrained performance), (3) motivational variables specifically present for timevarying action values not welltrained ones (e.g. volatility, certain forms of uncertainty, etc.). If the authors can give an approach that will let these scientists make accurate estimates of neural timevarying value coding during learning tasks, that would certainly be valuable to the field.
I feel like their methods could potentially be used to achieve this in a straightforward way (by combining their novel permutation test from Part 1 of their paper with their method of testing for correlation with both sum and difference of values from Part 2 of their paper). But they don't lay this out explicitly in their paper at present, since they are more focused on the narrower implication ("Do striatal neurons encode action values?") rather than the broader implication of their results ("In general, how can one properly measure timevarying action values?").
Secondly, the reviewers had some particular concerns about the action value vs. policy representation issue. For example:
1) Regarding the second confound of policy vs. action value:
 The authors seem to be arguing against a strawman version of how to relate neural activity to behavior. Typically we infer the underlying computations by testing how well different hypothesized models can fit the behavior and then search for correlates of the most likely computation in the brain. The authors seem to test only how well the neural activity correlates with different hypothesized models.
 The proposed solution for distinguishing policy from action value also has a very high rate of false negative (78%).
2) I feel that the point the authors make about actionvalue vs. policy representations may actually be underselling the true extent of the confound, and so their proposed solution may not be sufficient. However, this all depends on how the authors want to define a 'policy neuron' vs. a 'value neuron', as I explain below. I think they should clarify this.
2.1) Their arguments seem to assume that neural policy representations are in the form of action probabilities, which can then be identified by the key signature that they relate to actionvalues in a 'relative' manner (e.g. an 'action 1' neuron that is correlated positively with Q_{1} must be correlated negatively with Q_{2}), and hence will be best fit as encoding ΔQ (Q_{1} – Q_{2}). However, depending on how they define 'policy', this may not be the case.
Notably, even for reinforcement learning agents that do not explicitly represent actionvalues, few of them directly learn a policy in its most raw form of a set of action probabilities. Instead, they represent the policy in a more abstract parameter space. The simplest parameterization is a vector of action strengths, one for each possible action. Then during a choice, the probability of each action is determined by applying a choice function (e.g. softmax) to the action strengths of the set of actions that are currently available. The choice's outcome is then used to do learning on the action strengths. This method is used by some traditional actorcritic agents (which represent state values and action strengths, but not actionvalues). My impression is that the authors' covariancebased model is similar, in that the variables that it updates when it learns are the input weights W_{1} and W_{2} to each pool (with one input weight for each action, thus being analogous to action strengths).
Note that in these models, the action strengths are not explicitly represented in a 'relative' manner; only the resulting action probabilities are (since the probabilities must sum to 1). It's not clear to me whether a neuron encoding an action strength would be classified as a 'policy neuron' or an 'actionvalue neuron' by the authors' current framework, nor is it clear to me which outcome the authors would prefer. I believe the dynamics of actorcritic learning would cause the action strengths to be somewhat 'relative' (e.g. the best action is nudged toward positive strength while all others are nudged to negative strength), but I'm not sure how big this effect would be, or whether this would also occur for the authors' covariancebased model, or whether this would occur if > 2 actions are available. It is possible that these types of learning tasks can't discriminate between action strengths (e.g. from an actorcritic) versus actionvalues (e.g. from a Qlearner). So, the authors should clarify whether they believe this is an important distinction for the present study.
2.2) Suppose we agree that neurons only count as coding the policy if they encode action probability (and not strength). Their proposed solution still seems modeldependent because it assumes that the policy is such that the probability of choosing an action is a function of the difference in actionvalues (Q_{1} – Q_{2}) and hence policy neurons can be identified as encoding ΔQ and not ΣQ. However, there is data suggesting that humans and animals are also influenced by ratios of reward rates rather than just differences (e.g. "Ratio vs. difference comparators in choice", Gibbon and Fairhurst, J Exp Anal Behav 1994; "Ratio and Difference Comparisons of Expected Reward in Decision Making Tasks", Worthy, Maddox, and Markman, 2008). If so, then policy neuron activity could be related to a ratio (e.g. Q_{1} / Q_{2}), which is correlated with both ΔQ and ΣQ.
Here is my proposed solution. It seems to me that if 'policy neurons' are equated to action probabilities, then the proper method of distinguishing policy from value coding would be to design a task that explicitly dissociates between the probability of choosing an action (encoded by policy neurons) and the action's value (encoded by actionvalue neurons). For instance, suppose an animal is found to choose based on the ratio of the reward rates, such that it chooses A 80% of the time when V(A) = 4*V(B). Then we can set up the following three trial types:
V(A), V(B), p(choose A)
8, 2, 80%
4, 1, 80%
4, 4, 50%
A neuron encoding V(A) should be twice as active on the first trial type as the other two trial types (since V(A) is twice as high), while a neuron encoding the policy p(choose A) should be equally active on the first two trial types (since p(choose A) = 80%). Of course, more trial types might be desired to further dissociate this from encoding of ΣQ and ΔQ. Also, note that this approach is modeldependent, because it requires a model of behavior to estimate the true p(action) on each trial (or else careful psychophysics to find two pairs of actionvalues that make the subject have the same action probabilities).
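The three trial types above can be checked with a short sketch. It assumes one hypothetical choice rule that reproduces the stated probabilities, p(choose A) = V(A) / (V(A) + V(B)); the reviewer specifies only that choice follows the ratio of values, so this particular form is an assumption.

```python
def p_choose_a(v_a, v_b):
    """Hypothetical ratio-based choice rule (an assumption, not taken from the letter)."""
    return v_a / (v_a + v_b)

# (V(A), V(B)) for the three trial types listed above
trial_types = [(8, 2), (4, 1), (4, 4)]

# Each row: V(A), V(B), p(choose A), value difference, value sum
rows = [
    (v_a, v_b, p_choose_a(v_a, v_b), v_a - v_b, v_a + v_b)
    for v_a, v_b in trial_types
]

for v_a, v_b, p, delta_q, sum_q in rows:
    print(f"V(A)={v_a}  V(B)={v_b}  p(choose A)={p:.2f}  dQ={delta_q}  sumQ={sum_q}")
```

Under this rule the first two trial types share the same choice probability while differing in V(A), in the value difference and in the value sum, which is the dissociation the proposal relies on.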
In general, to use this approach in a regressionbased manner, one would (1) fit a model to behavior, (2) use the model to derive p(action,t) and V(action,t) for each action and each trial t, (3) fit neural activity as a function of those variables (and possibly others, such as the actually performed action, ΣQ, etc.), (4) test whether the neuron is significantly modulated by p(action), V(action), or both, controlling for temporal correlation using the authors' proposed method that uses task trajectories from other sessions as a control. Of course, if the model says that choice is indeed based on the value difference ΔQ as the authors currently assume, then this approach would simplify to the same one the authors currently propose.
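Step (4) above, controlling for temporal correlation with task trajectories from other sessions, might be sketched as follows. This is an illustrative reading rather than the authors' exact procedure: it assumes all sessions have the same number of trials, uses a single regressor, and simulates both activity and value trajectories as unrelated random walks. All names are hypothetical.

```python
import numpy as np

def slope_t_value(x, y):
    """t-statistic of the slope when y is regressed on x by least squares."""
    x = x - x.mean()
    y = y - y.mean()
    slope = (x @ y) / (x @ x)
    resid = y - slope * x
    se = np.sqrt((resid @ resid) / (len(x) - 2) / (x @ x))
    return slope / se

def permutation_p_value(activity, own_q, other_qs):
    """Rank |t| for the neuron's own-session values among |t| for values from other sessions."""
    t_own = abs(slope_t_value(own_q, activity))
    t_null = np.array([abs(slope_t_value(q, activity)) for q in other_qs])
    # The +1 terms count the own-session value as one member of the reference set
    return (1 + np.sum(t_null >= t_own)) / (1 + len(t_null))

rng = np.random.default_rng(1)
n_trials, n_other, n_neurons = 200, 99, 200

def random_walk():
    # Assumed stand-in for any slowly drifting quantity (activity or estimated value)
    return np.cumsum(rng.normal(size=n_trials))

# Every simulated neuron is unrelated to its own-session values by construction
p_values = np.array([
    permutation_p_value(random_walk(), random_walk(), [random_walk() for _ in range(n_other)])
    for _ in range(n_neurons)
])
false_positive_rate = float(np.mean(p_values <= 0.05))
print(f"false-positive rate with the other-session control: {false_positive_rate:.3f}")
```

Because the own-session and other-session trajectories are exchangeable under the null hypothesis, the fraction of neurons flagged at the 5% level should stay close to 5% despite the strong temporal correlations.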
Thirdly, the reviewers raised some questions about the corrections proposed and whether there in fact remained evidence for action value coding in the Basal Ganglia.
1) A critical assumption is that there exists temporal correlation strong enough to contaminate the analysis. It would be helpful to report the degree of this temporal correlation in the basal ganglia data set vs. the motor/auditory cortex data and the random walk model.
A figure, in the format of Figure 1D, showing the distribution of tvalues for the actual basal ganglia data set analyzed with trialmatched Q estimates should be presented. This information is critical for effective comparisons to other data sets.
2) The authors proposed two possible solutions for this type of study. The first is to use a more stringent (and appropriate) criterion for significance, given the often wrongly assumed variance due to correlation. The permutation test is definitely in the right direction, particularly for reducing false positives. However, I am concerned by the really high rate of false negatives (~70% misses). "Considering the population of simulated actionvalue neurons of Figure 1, this analysis identified 29% of the actionvalue neurons of Figure 1 as such". Considering other unaccountable variables in typical experiments, particularly that basal ganglia neurons may have mixed selectivity both at the population and singleneuron level, such a high false negative rate seems to carry high risks of missing a true representation.
3) The authors suggested randomized blocks as the second solution. In addition to my earlier point, by their own account, such a design is not new and has been implemented in three separate studies >5 years ago. The authors pointed out some issues with those studies, which will need to be addressed in the future, but did not suggest any solutions.
4) The authors stated that the detrending analysis does not resolve the confound. However, judging from Figure 2—figure supplement 7, the detrending analysis resulted in ~29% significant Q modulation in the basal ganglia, in contrast to ~14% for random walk, ~12% for motor cortex and 10% for auditory cortex. Compared to other figures, which showed similar percentage for all four datasets, it seems that the basal ganglia data set is most robust to this analysis. Doesn't this support the idea of an action value representation in the basal ganglia?
5) The authors focus on statistical significance. Does examining the magnitude of the effects distinguish erroneous from "real" action value coding? It seems incomplete to only plot the tvalues, which are important for understanding parameter precision, without presenting the parameters effect sizes. Can real action value coding be distinguished by effect sizes that were meaningfully large (i.e., substantive versus statistical significance)?
6) Along related lines, it seems like examining the pattern of effects is also useful. When comparing Figure 1D and Figure 2B, one can see that the erroneous detections included positive and negative ΔQ and ΣQ neurons, whereas for real detections (Figure 1D), there are much fewer of these neurons (by definition). All the erroneous detections generate spherical tvalue plots, indicating that combinations of one or the other action value are independent. This seems not to be the case for real detections (in the authors' simulations), nor in real data (Samejima et al., 2005). This suggests that any nonuniformity in detecting combinations of action value coding would be evidence that it is not erroneous (even if the type I error is not properly controlled).
7) The simulations in Figure 2 are useful, but it would be useful to translate the diffusion parameter (σ) of the random walk into an (auto) correlation. This would make it easier for a reader to interpret how this relates to real data.
8) Is the M1 data a proper control? It is hard to tell from the task description here. I wouldn't be able to replicate the task that was used given the description here. If that M1 data is published, a citation would be helpful. My concerns are whether it might have had unusually large temporal correlations and thus exaggerated the degree to which such correlations might confound actionvalue studies, due to either (1) having blocks of trials (as opposed to randomly interleaved trial types), (2) being a BMI task in which animals were trained to induce the recorded ensemble to emit specific longduration activity patterns.
https://doi.org/10.7554/eLife.34248.028
Author response
However, there were several key concerns.
Firstly, the reviewers thought that there should be a more careful attention to the wider literature. In the reviewer discussion, this was thought to be of particular importance as you are making a technical point in a journal with a broad readership. It is particularly important that you are clear about which parts of the literature your results speak to.
For example:
1) The authors conclude that current methods of studying actionvalues are confounded, and propose that experiments should use actions whose reward values are not learned over time during the experiment, and instead are indicated by sensory cues with their values picked randomly on each trial.
However, the authors leave unmentioned the large literature of studies taking an alternate approach: recording striatal activity during planning and execution of instructed actions for cued reward outcomes, in which each trial randomizes the instructed action, the reward, or both. The authors should discuss what implications this literature has for whether the striatum encodes actionvalues vs. policies vs. other variables. Examples include the work from the labs of Schultz (e.g. Hassani et al., 2001; Cromwell and Schultz, 2003; Cromwell et al., 2005), Hikosaka (e.g. Kawagoe et al., 1998; Lauwereyns et al., Neuron 2004), and Kimura (e.g. Hori et al., 2009).
Most of these studies did not specifically claim that the activity they reported represents "actionvalues" in the reinforcement learning sense (and hence the present authors shouldn't feel obligated to try to debunk them), but they do seem highly relevant to the larger question the present authors raise. These studies did attempt to test whether neurons represented actions, values, and notably, their interaction (e.g. a cell whose activity scales more with action A's value than action B's value), which resembles the concept of "actionvalue".
Also, these studies may be somewhat resistant to the authors' criticisms about confounds from temporal correlations (since rewards were either explicitly cued, or kept deterministic and stable in welllearned blocks of trials, rather than slowly fluctuating during extended learning) and confounds with action probability (since the actions were instructed and hence a priori equally probable on each trial).
Of particular interest, a paper by Pasquereau et al. (2007) seems to fulfill all the requirements the present authors set for a test of striatal actionvalue coding; if so, this seems worthy of mention. That study manipulated the reward value of four actions (up, down, left, right), randomly assigning their reward probabilities on each trial and indicating them with visual cues. Unfortunately, as the present authors note, the study did not explicitly analyze their results in the actionvalue vs. chosenvalue framework. However, the paper did report that some neurons had significant action x value interactions – for example, a cell that is more active when planning rightward movements (action), with stronger activity when the planned movement was more valuable (value), and with this valuemodulation greater for rightward movements than other movements (action x value). This is not a pure chosenvalue signal as the present authors seem to claim that paper reported. One could argue that it contains a key feature of an actionvalue signal as the value modulation is strongest for one specific action.
We agree with the reviewers that such trial designs, when trials are temporally independent, are not subject to the temporal correlation confound. We have added a paragraph about these papers and explained there why their findings cannot be used as support for the striatal actionvalue representation hypothesis. In short, we do not doubt that the striatum plays an important role in decision making and learning. However, this finding, as well as the evidence in support of representation of other decision variables in the basal ganglia, does not entail actionvalue representation in the striatum, as there are alternatives that are consistent with these findings. These points are clarified in the Discussion (Section “Other indications for actionvalue representation”).
Specifically regarding Pasquereau et al. (2007), we agree that the results are not consistent with pure chosenvalue representation and changed the text accordingly. The finding that neurons are comodulated by action and expected reward is indeed very interesting. However, it cannot be taken as evidence for actionvalue representation for several reasons. First, a policy neuron is also expected to be comodulated by these two variables. Second, the example neurons in Figure 6 in that paper are clearly modulated by the value of other actions, which is inconsistent with the actionvalue hypothesis (no such quantitative analysis was performed at the population level). Finally, an essential test of actionvalue representation is that the value of the action is represented even when this action is not chosen. This was not tested in that paper (although in principle, it can be tested using existing data: the prediction of actionvalue representation is that the activity of that neuron is modulated by the value of the left target even when this target is not chosen). This is briefly clarified in the “Literature search” section in the Materials and methods.
2) The authors correctly pointed out that some earlier studies of action value used a suboptimal task design and their conclusions need to be more rigorously validated. However, in the broader field, the potential risk of "drift" in neural recording has been well recognized. For example, "Neurophysiological experiments that compare activity across different blocks of trials must make efforts to be confident that any neural effects are not the result of artifacts of that design, such as slowwave changes in neural activity over time." (Asaad et al., 2000). In the same Asaad et al. paper, a better design with repeated, alternated block types was used, similar in concept to randomized block design that the authors proposed here. Such designs have also been used in many neural studies of cognition – to name just two examples: value manipulations (Lauwereyns et al., 2002), rule manipulations (Mansouri et al., 2006). The problem thus seems relatively limited to one type of analysis that introduces temporal correlation across trials in an effort to estimate Q values. By the authors' account, this amounts to 5 papers from 3 different labs.
In response to this comment, we examined the papers proposed by the reviewer. We found that this method does not resolve the temporal correlations confound, as described in the Results section about possible solutions to the first confound (section “Possible solutions to the temporal correlations confound”) and in the Materials and methods section (the section “ANOVA tests for comparisons between blocks, excluding ‘drifting’ neurons”).
3) What about previous results arguing for prominence of a specific type of value representation? The authors touch on this, but it would be helpful to discuss specific results. In particular, the cited study of Wang et al., Nat Neuro 2013 reported that their unbiased angular measure of DMS value coding was distributed significantly nonuniformly, with net value (ΣQ) coding more prevalent than other types (their Figure 7). In contrast, the null hypothesis simulations in this paper predict very different results, either a uniform distribution (Figure 2—figure supplement 8) or a dearth of ΣQ neurons (Figure 5—figure supplement 1). The authors should discuss whether this previous result can therefore still be interpreted as evidence of value coding (at least, netvalue coding), rather than strictly policy coding, in the striatum.
(Also, it is odd that the authors cite Wang et al. as a study that "claimed to provide direct evidence that neuronal activity in the striatum is specifically modulated by actionvalue", since the main result was specifically finding prevalence of netvalue, not actionvalue coding).
We do not discuss the issue of nonuniform results in the paper but we agree that nonuniform results may be an indication of a true modulation by some variable. For example, if only neurons that are positively correlated with actionvalues are found (rather than negatively correlated with them) – this would be a strong indication for a modulation that is not caused by random fluctuations.
However, it is important to point out that small changes in the analysis may bias it in unexpected ways. In Author response image 1 we repeated the analysis of Wang et al., 2013 for the randomwalk neurons. This analysis is slightly different from the one presented in Figure 2—figure supplement 8. There, we analyzed only the last 20 trials in each block, following Samejima et al. (2005); we have now added a clarification in the figure legend. Wang et al. (2013) analyzed all the trials in a block except the first 10 and utilized 59 blocks. Analyzing all the trials in a block except the first 10 and utilizing 8 blocks (order of blocks as in Figure 2—figure supplement 10), surprisingly, we find a small, but significant bias towards representation of ΣQ (p=2.9%), as in Wang et al., 2013.
Importantly, we have not fully followed the experimental setting in Wang et al. (2013). Specifically, we were not sure what their rule for terminating a block was, and we used the Samejima et al. (2005) rule. Therefore, we are unsure about the consequences of the bias we now found for their conclusions. However, this analysis shows that a biased result is not always an indication of true modulation.
With respect to the second point, we agree that Wang et al.’s (2013) main point is that the dorsomedial striatum represents netvalue (i.e., ΣQ). However, they do report that "in the DMS, all categories of neuron types were represented above chance" (p. 645). Nevertheless, we added this point in the legend of Figure 2—figure supplement 8, where the Wang et al. (2013) analysis is repeated.
(Computation of pvalue: The pvalue for the probability of receiving this fraction of state neurons was computed under the assumption that the significant neurons were distributed uniformly between classifications. If classification is uniform, the expected fraction of neurons in each category will be 10.11%. Here we classified 11.93% of the neurons as representing state. We used 20,000 neurons in 1000 different sessions. Taking 1000 sessions as the sample size, we calculated the probability of a binomial distribution with prob. 10.11% to yield more than 119 classifications in 1000 sessions).
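The tail probability described in this parenthesis can be recomputed with a short sketch. It reads "more than 119 classifications in 1000 sessions" as the upper tail P(X > 119) of a Binomial(1000, 0.1011) variable; the function name is hypothetical, and only the standard library is used.

```python
from math import comb

def binomial_upper_tail(n, p, k):
    """P(X > k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1, n + 1))

# n = 1000 sessions, chance rate 10.11% per category, observed 11.93% (about 119 of 1000)
p_value = binomial_upper_tail(n=1000, p=0.1011, k=119)
print(f"P(X > 119) = {p_value:.4f}")
```

Under this reading, the result should be close to the 2.9% quoted above.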
4) I found the authors' choice of basal ganglia data misleading (Ito and Doya, 2009). First, because these data are recordings from the nucleus accumbens and ventral pallidum, which are not the first basal ganglia structures one thinks of as encoding action values. Second, because the original authors of the study from which the data was obtained noted that action value coding was low in these structures, leading them to suggest that action value coding was not the primary function of the nucleus accumbens and ventral pallidum. This is mentioned in the Results subsection “Permutation analysis of basal ganglia neurons”, but should be noted in the Discussion (the current text in the Results could probably just be moved).
We moved the text to the Discussion (section “Temporal correlations and actionvalue representations”, fourth paragraph).
5) Previous methods dealing with trial correlations have different success at reducing false positive rate of detecting actionvalues. In particular, the method of Wang et al. (2013) comes very close to attaining the correct size of the test for actionvalues. Indeed, it appears to be the only existing method from which one would reasonably conclude that the ventral striatal data set analyzed probably does not exhibit much actionvalue coding (23% above the expected 5%). I think it would be useful to have a figure in the main text comparing the different methods to the authors' permutation test (using for example, just the basal ganglia data set). In addition, Wang et al.’s method is also pretty good at identifying policy neurons, which is important because it could be applied retrospectively to existing data sets.
In an attempt to make our analyses as similar as possible to the original analyses, we used different significance thresholds for different methods. Specifically, in the analysis of Wang et al. we find that 7% – 8% of the basal ganglia neurons represent an action-value, whereas only 0.25% are expected by chance. To make this difference clear, we added the significance threshold to the different figures.
Regarding the analysis in Wang et al. (2013) on policy neurons, we address this question in the section “Is this confound the result of an analysis that is biased against policy representation?”. This analysis indeed yields more policy than action-value neurons, but the fraction of policy neurons classified as action-value neurons is still much larger than expected by chance.
With regard to the suggestion of adding the figure, we are unsure about the added value of such a figure. In the supplementary figures we demonstrate that all these methods erroneously classify neurons in the basal ganglia recordings as representing unrelated action-values. In view of these findings, we fear that using them to identify true action-values in those recordings may mislead the readers.
6) The authors' biggest suggestion for rigorously detecting a neural action value representation is "Don't use a task with learning, use a trial-based design where subjects associate contexts with well-learned sets of action values". That is perfectly fine for scientists whose goal is specifically to test whether a brain area encodes action values. However, what about the many scientists whose explicit goal is to study neural representation of time-varying values, and hence need to use learning tasks? Many scientists are studying (1) the neural basis of value learning, (2) brain areas specifically involved in early learning (not well-trained performance), (3) motivational variables specifically present for time-varying action values not well-trained ones (e.g. volatility, certain forms of uncertainty, etc.). If the authors can give an approach that will let these scientists make accurate estimates of neural time-varying value coding during learning tasks, that would certainly be valuable to the field.
I feel like their methods could potentially be used to achieve this in a straightforward way (by combining their novel permutation test from Part 1 of their paper with their method of testing for correlation with both sum and difference of values from Part 2 of their paper). But they don't lay this out explicitly in their paper at present, since they are more focused on the narrower implication ("Do striatal neurons encode action values?") rather than the broader implication of their results ("In general, how can one properly measure time-varying action values?").
The paper addresses two confounds that are somewhat orthogonal. The temporal correlation confound can be addressed using the permutation analysis of Figure 3, which can provide strong support for the claim that the activity of a particular neuron covaries with learning. This is a general solution for scientists studying slow processes such as learning.
Precisely defining or interpreting what the activity of the neuron represents (for example, an action-value or policy) is more challenging. In general, there are no easy solutions, and caution should be applied when interpreting the data. We now discuss these points in the 'Temporal correlations – beyond action-value representation' section of the Discussion.
With respect to the proposed solution, to rule out policy representation, the analysis in Figure 6 includes a regression on an orthogonal variable – state. For the two variables to be orthogonal, it is mathematically required that the two action-values have the same variance (section “A possible solution to the policy confound”). This can be achieved in a controlled experiment where reward probabilities are used, but we cannot control the variance of the action-values when we estimate them from behavior. Therefore, we could not find a way to combine the solution from Figure 3 with the regression analysis from Figure 6. However, in other cases, this may not be an issue, depending on the specific variable and question.
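The equal-variance requirement follows from the identity Cov(Q1 + Q2, Q1 − Q2) = Var(Q1) − Var(Q2). The sketch below checks it numerically on made-up values; the helper names and the toy ranges of `q1` and `q2` are assumptions chosen only for illustration.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def sum_diff_covariance(q1, q2):
    """Covariance between state (Q1 + Q2) and policy (Q1 - Q2) regressors."""
    state = [a + b for a, b in zip(q1, q2)]
    policy = [a - b for a, b in zip(q1, q2)]
    return cov(state, policy)

rng = random.Random(0)
# Hypothetical estimated action-values with unequal variances.
q1 = [rng.uniform(0.0, 1.0) for _ in range(500)]  # wide range
q2 = [rng.uniform(0.4, 0.6) for _ in range(500)]  # narrow range

lhs = sum_diff_covariance(q1, q2)
rhs = cov(q1, q1) - cov(q2, q2)  # Var(Q1) - Var(Q2)
print(lhs, rhs)
```

When the two variances differ, as they generally will for action-values estimated from behavior, the state and policy regressors are correlated and the regression on state no longer isolates a direction orthogonal to policy.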
Secondly, the reviewers had some particular concerns about the action value vs. policy representation issue. For example:
1) Regarding the second confound of policy vs. action value:
 The authors seem to be arguing against a strawman version of how to relate neural activity to behavior. Typically we infer the underlying computations by testing how well different hypothesized models can fit the behavior and then search for correlates of the most likely computation in the brain. The authors seem to test only how well the neural activity correlates with different hypothesized models.
We respectfully disagree with the reviewer for two reasons:
First, the reviewer hints that because action-value-based models best describe behavior, we should search for action-value representations. We would like to note that while the view that action-value-based models best describe behavior is widespread, there is strong evidence that favors other models (e.g., Erev et al., Economic Theory, 2007; see also Shteingart and Loewenstein, 2014 for a review). Therefore, it is still an open question whether action-value representation exists in the brain.
Second, policy representation (representation of the probability of choice) is likely to exist even if the brain computes action-values. If neurons represent policy, then they may be misclassified as representing action-values.
 The proposed solution for distinguishing policy from action value also has a very high rate of false negatives (78%).
We agree with this point and we revised the analysis to decrease its false negative rate. For true action-value neurons, the rate of correct detection vs. false negatives depends on the strength of their modulation by action-value, together with the power of the analysis.
We used neurons whose correct detection rate in the original analyses was comparable to the literature (~40%). The analysis in the previous version of the manuscript decreased this rate to 22%. It indeed suffered from limited power, in part because it employed only 80 trials. To increase the power of the analysis, we repeated it using 400 trials in total (rather than the original 280 trials) and conducted the analysis on the last 200 trials. We now correctly classify 32% of action-value neurons as such (see Figure 6). Considering that the original analysis in Figure 1 was biased towards classifying neurons as representing action-value, rather than policy or state, and that our new analysis requires passing two significance tests, we take this correct detection rate to be reasonable.
We changed Figures 4 and 6, together with their figure legends and descriptions of the analysis accordingly.
2) I feel that the point the authors make about action-value vs. policy representations may actually be underselling the true extent of the confound, and so their proposed solution may not be sufficient. However, this all depends on how the authors want to define a 'policy neuron' vs. a 'value neuron', as I explain below. I think they should clarify this.
2.1) Their arguments seem to assume that neural policy representations are in the form of action probabilities, which can then be identified by the key signature that they relate to action-values in a 'relative' manner (e.g. an 'action 1' neuron that is correlated positively with Q_{1} must be correlated negatively with Q_{2}), and hence will be best fit as encoding ΔQ (Q_{1} – Q_{2}). However, depending on how they define 'policy', this may not be the case.
Notably, even for reinforcement learning agents that do not explicitly represent action-values, few of them directly learn a policy in its most raw form of a set of action probabilities. Instead, they represent the policy in a more abstract parameter space. The simplest parameterization is a vector of action strengths, one for each possible action. Then during a choice, the probability of each action is determined by applying a choice function (e.g. softmax) to the action strengths of the set of actions that are currently available. The choice's outcome is then used to do learning on the action strengths. This method is used by some traditional actor-critic agents (which represent state values and action strengths, but not action-values). My impression is that the authors' covariance-based model is similar, in that the variables that it updates when it learns are the input weights W_{1} and W_{2} to each pool (with one input weight for each action, thus being analogous to action strengths).
Note that in these models, the action strengths are not explicitly represented in a 'relative' manner; only the resulting action probabilities are (since the probabilities must sum to 1). It's not clear to me whether a neuron encoding an action strength would be classified as a 'policy neuron' or an 'action-value neuron' by the authors' current framework, nor is it clear to me which outcome the authors would prefer. I believe the dynamics of actor-critic learning would cause the action strengths to be somewhat 'relative' (e.g. the best action is nudged toward positive strength while all others are nudged to negative strength), but I'm not sure how big this effect would be, or whether this would also occur for the authors' covariance-based model, or whether this would occur if > 2 actions are available. It is possible that these types of learning tasks can't discriminate between action strengths (e.g. from an actor-critic) versus action-values (e.g. from a Q-learner). So, the authors should clarify whether they believe this is an important distinction for the present study.
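The distinction between action strengths and action probabilities can be illustrated with a softmax choice function. This is a minimal sketch: the `temperature` parameter and the strength values are hypothetical, not taken from any model in the paper.

```python
import math

def softmax(strengths, temperature=1.0):
    """Map a list of action strengths to choice probabilities."""
    m = max(strengths)  # subtract the maximum for numerical stability
    exps = [math.exp((s - m) / temperature) for s in strengths]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical strength vectors that differ only by a common offset.
probs = softmax([1.0, 0.2])
shifted = softmax([6.0, 5.2])
print(probs, shifted)
```

Both strength vectors yield the same probabilities, so the probabilities depend only on differences between strengths, while each strength on its own is free to drift. This is the sense in which only the probabilities are 'relative'.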
The reviewer is making an interesting and important point. An initial requirement for a neuron to be considered an action-value neuron, a policy neuron or any decision-variable neuron is that it is significantly more correlated with these decision variables than with decision variables that are unrelated to the current task. The permutation analysis of Figure 3 can be used to find such neurons.
The question of which decision variable the neuron represents (assuming that it passed the permutation test) is a more difficult one. The reason is that the different decision variables are correlated. Moreover, because these variables are all some function of past actions and rewards, and relate to future choice, many existing and future decision-making models are expected to have modules whose activity correlates with these variables. One may argue that the question of whether neurons represent action-value, policy, state or some other correlated variable is not an interesting question. This is because all these correlated decision variables implicitly encode action-value. Even direct-policy models can be taken to implicitly encode action-value, because policy is correlated with the difference between the action-values. However, we believe that the difference between action-value representation and representation of other variables is an important one, because it centers on the question of the computational model that underlies decision-making in these tasks.
Often, reports of action-value representation are taken to support the hypothesis that action-values are explicitly computed in the brain, and that these action-values play a specific role in the decision-making process. While other models may include no such calculation, they can still include neuronal activity that correlates with action-value, as in the covariance-based plasticity model (at the level of the population). One proper way of ruling out competing hypotheses about the variables the neuronal activity correlates with is to test for significant correlations in directions that are correlated with action-value but are orthogonal to each of the competing hypotheses.
Clearly, one cannot attempt to rule out all possible hypotheses. However, even in the restricted framework of value-based Q-learning, a necessary condition for a neuron to be considered as representing an action-value is that it is not representing other decision variables of that model such as policy. Regarding alternative models for learning, clearly the more restrictive the characterization of the response properties of a neuron in the task, the more informative it is about the underlying neural computation.
We added a section in the Discussion titled “Are action-value representations a necessary part of decision making?” that addresses these issues.
2.2) Suppose we agree that neurons only count as coding the policy if they encode action probability (and not strength). Their proposed solution still seems model-dependent because it assumes that the policy is such that the probability of choosing an action is a function of the difference in action-values (Q_{1} – Q_{2}) and hence policy neurons can be identified as encoding ΔQ and not ΣQ. However, there is data suggesting that humans and animals are also influenced by ratios of reward rates rather than just differences (e.g. "Ratio vs. difference comparators in choice", Gibbon and Fairhurst, J Exp Anal Behav 1994; "Ratio and Difference Comparisons of Expected Reward in Decision Making Tasks", Worthy, Maddox, and Markman, 2008). If so, then policy neuron activity could be related to a ratio (e.g. Q_{1} / Q_{2}), which is correlated with both ΔQ and ΣQ.
We agree, but any analysis can only consider and compare the hypotheses that are explicitly acknowledged. We added a paragraph in the Discussion addressing this point (section “Differentiating action-value from other decision variables”, fifth paragraph).
Here is my proposed solution. It seems to me that if 'policy neurons' are equated to action probabilities, then the proper method of distinguishing policy from value coding would be to design a task that explicitly dissociates between the probability of choosing an action (encoded by policy neurons) and the action's value (encoded by action-value neurons). For instance, suppose an animal is found to choose based on the ratio of the reward rates, such that it chooses A 80% of the time when V(A) = 4*V(B). Then we can set up the following three trial types:
V(A), V(B), p(choose A)
8, 2, 80%
4, 1, 80%
4, 4, 50%
A neuron encoding V(A) should be twice as active on the first trial type as the other two trial types (since V(A) is twice as high), while a neuron encoding the policy p(choose A) should be equally active on the first two trial types (since p(choose A) = 80%). Of course, more trial types might be desired to further dissociate this from encoding of ΣQ and ΔQ. Also, note that this approach is model-dependent, because it requires a model of behavior to estimate the true p(action) on each trial (or else careful psychophysics to find two pairs of action-values that make the subject have the same action probabilities).
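The three trial types can be reproduced with one simple ratio-based choice rule. The rule p(choose A) = V(A) / (V(A) + V(B)) is an assumption made here for illustration; the reviewer only specifies that A is chosen 80% of the time when V(A) = 4*V(B).

```python
def ratio_choice_probability(v_a, v_b):
    """Hypothetical ratio rule: p(choose A) depends only on V(A) / V(B)."""
    return v_a / (v_a + v_b)

# The three trial types from the table: (V(A), V(B)).
trial_types = [(8, 2), (4, 1), (4, 4)]

for v_a, v_b in trial_types:
    p_a = ratio_choice_probability(v_a, v_b)
    print(f"V(A)={v_a}, V(B)={v_b}, p(choose A)={p_a:.0%}")
```

Under this rule the first two trial types share a policy but differ in V(A), while the last two share V(A) but differ in policy, which is the dissociation the design relies on.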
In general, to use this approach in a regression-based manner, one would (1) fit a model to behavior, (2) use the model to derive p(action,t) and V(action,t) for each action and each trial t, (3) fit neural activity as a function of those variables (and possibly others, such as the actually performed action, ΣQ, etc.), (4) test whether the neuron is significantly modulated by p(action), V(action), or both, controlling for temporal correlation using the authors' proposed method that uses task trajectories from other sessions as a control. Of course, if the model says that choice is indeed based on the value difference ΔQ as the authors currently assume, then this approach would simplify to the same one the authors currently propose.
This is an elegant experimental design and not unlike the one we consider in Figure 6. However, with respect to the proposed analysis, there are two important differences. One is the question of whether behavior is modulated by the ratio of reward rates, the difference of reward rates or a different function. In the paper we posited that it is the difference in the reward rates that modulates behavior when analyzing the data in the value-based framework. We agree that it is possible that the ratio is a better predictor of behavior. Our choice followed that of the previous publications and is based on the assumption of the Q-learning model that the probability of choice is a monotonic function of the difference between action-values.
Second, in point 4, the reviewers propose to test the type of representation by looking for significant modulation or the lack of it. However, a non-significant result for one variable is not an indication that it was not the modulator. As described in Figure 5, this can lead to confounds. Furthermore, policy and action-value will have shared variance, and so some of the modulation of the neuronal activity cannot be conclusively attributed to either of them. Therefore, it is better to use model comparison (likelihood) when considering the results of this analysis. In our manuscript we focus on significance tests that can rule out specific possibilities under the null hypothesis.
Note that the design suggested by the reviewers can also be used to reject the hypothesis that neurons are policy neurons. For neurons whose activity differs significantly between the first two cases (p(choose A)=80%) the null hypothesis that they represent policy can be rejected. In the experimental design we simulate in the paper (Figure 6) this is like comparing the activity of neurons at the end of two blocks where the policy is similar (this is an assumption which can be tested empirically). We can compare the neural activity in (0.1, 0.5) with (0.5, 0.9), and the activity in (0.5, 0.1) with (0.9, 0.5). To rule out the possibility of state representation we should compare the activity at the end of the following blocks: compare (0.1, 0.5) with (0.5, 0.1), and (0.5, 0.9) with (0.9, 0.5). As the reviewers note, this is in fact exactly what we do in the analysis in Figure 6. We regress neuronal activity on state – sum(0.1, 0.5)=0.6, sum(0.5, 0.9)=1.4, sum(0.5, 0.1)=0.6, sum(0.9, 0.5)=1.4. This effectively compares activity in cases with the same policies in a regression model.
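The block comparison above can be checked numerically. The sketch below takes the four pairs of reward probabilities from the text and treats their sum as the state regressor and their difference as a stand-in for policy; the variable and function names are illustrative.

```python
# Reward probabilities (action 1, action 2) in the four block types from the text.
blocks = [(0.1, 0.5), (0.5, 0.9), (0.5, 0.1), (0.9, 0.5)]

state = [p1 + p2 for p1, p2 in blocks]   # sum of reward probabilities
policy = [p1 - p2 for p1, p2 in blocks]  # difference, a stand-in for policy

def centered_dot(xs, ys):
    """Dot product of two mean-centered vectors (proportional to covariance)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

print("state :", state)
print("policy:", policy)
print("centered dot product:", centered_dot(state, policy))
```

Each policy value is paired with both state values (0.6 and 1.4), so the two regressors are orthogonal across blocks. A neuron that is modulated by state therefore cannot be a pure policy neuron under this design.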
Thirdly, the reviewers raised some questions about the corrections proposed and whether there in fact remained evidence for action value coding in the basal ganglia.
1) A critical assumption is that there exists temporal correlation strong enough to contaminate the analysis. It would be helpful to report the degree of this temporal correlation in the basal ganglia data set vs. the motor/auditory cortex data and the random walk model.
Author response image 2 shows a plot of the autocorrelation of the spike counts in each trial for the different data sets (averaged over the spike counts in each group; light colors denote SEM; computed using MATLAB’s ‘autocorr’ function).
We believe that it is better to refrain from including this figure in the paper for two reasons: (1) The autocorrelations relevant for the temporal correlations confound are those associated with the timescale relevant for learning, tens of trials. Computing such autocorrelations in experiments of a few hundred trials introduces substantial biases (Newbold and Agiakloglou, 1993; Kohn, 2006). This is also demonstrated by the negative autocorrelation of the random-walk spike counts, computed using sessions of 151–379 trials. Alternative measures for autocorrelation are also problematic when applied to small samples, see (Kohn, 2006). (2) We are not aware of a theoretical mapping from the autocorrelation function to the temporal correlations confound. For example, considering the autocorrelations below, it is not clear how to compare the basal ganglia and the motor cortex datasets with respect to the temporal correlations confound when considering their autocorrelation functions. For these reasons, computing autocorrelation functions to quantify the temporal correlations confound may be misleading rather than useful.
We added a paragraph to the manuscript, describing the potential problems with the autocorrelation measure (section “Possible solutions to the temporal correlations confound”, second paragraph).
A figure, in the format of Figure 1D, showing the distribution of t-values for the actual basal ganglia data set analyzed with trial-matched Q estimates should be presented. This information is critical for effective comparisons to other data sets.
We added Figure 3—figure supplement 1, which reports this information.
2) The authors proposed two possible solutions for this type of study. The first is to use a more stringent (and appropriate) criterion for significance, given the often wrongly assumed variance due to correlation. The permutation test is definitely in the right direction, particularly for reducing false positives. However, I am concerned by the really high rate of false negatives (~70% misses). "Considering the population of simulated action-value neurons of Figure 1, this analysis identified 29% of the action-value neurons of Figure 1 as such". Considering other unaccountable variables in typical experiments, particularly that basal ganglia neurons may have mixed selectivity both at the population and single-neuron level, such a high false negative rate seems to carry high risks of missing a true representation.
The rate of misses of action-value neurons in our analysis depends on the parameters that we used to model these neurons. We used parameters such that the "standard" methods miss approximately 60% of the action value neurons. With the permutation test we miss approximately 70%. Other parameters would yield different rates of misses. If selectivity is weak then indeed, it will be more difficult to identify such neurons. However, a necessary condition for a neuron to be classified as a task-related neuron is that it is more correlated with decision variables in its corresponding session than with these decision variables in other sessions. We do not see a way around it even if this requirement is associated with a substantial rate of false identifications.
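The necessary condition stated above can be sketched as a session-permutation test. This is a simplified illustration rather than the analysis of Figure 3: it assumes the absolute Pearson correlation as the test statistic (the paper reports regression t-values), and the data are toy random walks generated here, not recordings.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def session_permutation_p(spikes, own_q, other_qs):
    """Compare the statistic for the neuron's own session with other sessions."""
    n = len(spikes)
    own_stat = abs(pearson(spikes, own_q[:n]))
    # Null distribution: the same neuron against decision variables
    # taken from sessions in which it was not recorded.
    null = [abs(pearson(spikes, q[:n])) for q in other_qs if len(q) >= n]
    exceed = sum(1 for s in null if s >= own_stat)
    return (exceed + 1) / (len(null) + 1)

def toy_walk(n, rng):
    """Hypothetical slowly drifting signal (Gaussian random walk)."""
    x, out = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, 0.1)
        out.append(x)
    return out

rng = random.Random(0)
own_q = toy_walk(200, rng)
other_qs = [toy_walk(200, rng) for _ in range(99)]
related = list(own_q)           # toy neuron that tracks its own session exactly
unrelated = toy_walk(200, rng)  # toy neuron that drifts independently

p_related = session_permutation_p(related, own_q, other_qs)
p_unrelated = session_permutation_p(unrelated, own_q, other_qs)
print(p_related, p_unrelated)
```

Because the null distribution is built from equally autocorrelated decision variables, a slowly drifting but unrelated neuron gains no advantage from its temporal correlations, which is the point of the control.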
One approach to increase the power of any analysis will be to use as many trials as possible, as can be seen from the increase in the correct detection rate in Figure 6, caused by the addition of trials (we could not add trials in this analysis because we analyzed the original neurons of Figure 1). Another alternative is to consider population coding rather than to focus on individual neurons. This analysis is, however, beyond the scope of this paper.
3) The authors suggested randomized blocks as the second solution. In addition to my earlier point, by their own account, such a design is not new and has been implemented in three separate studies >5 years ago. The authors pointed out some issues with those studies, which will need to be addressed in the future, but did not suggest any solutions.
We are not sure that we understand this comment. In our second solution we proposed randomized trials and not randomized blocks. If the reviewer refers to the similarity of our second solution to (Fitzgerald, Friston and Dolan, 2012) then crucially, we used reward probabilities in the analysis and not estimated action-values. This removes temporal correlations which are present when estimated action-values are used (see Figure 2—figure supplement 9). In addition, our analysis in Figure 6 rules out policy and state representations, which was not done in (Fitzgerald, Friston and Dolan, 2012). This last point is also relevant to (Cai et al., 2011 and Kim et al., 2012).
4) The authors stated that the detrending analysis does not resolve the confound. However, judging from Figure 2—figure supplement 7, the detrending analysis resulted in ~29% significant Q modulation in the basal ganglia, in contrast to ~14% for random walk, ~12% for motor cortex and 10% for auditory cortex. Compared to other figures, which showed similar percentage for all four datasets, it seems that the basal ganglia data set is most robust to this analysis. Doesn't this support the idea of an action value representation in the basal ganglia?
Originally, we were not clear enough on this issue. We’ve added clarifying sentences in the figure legends. The analysis of the basal ganglia data in Figure 2—figure supplement 7 erroneously identified unrelated action-values from simulations. In fact, this analysis indicates that detrending is even less useful there than in other datasets.
5) The authors focus on statistical significance. Does examining the magnitude of the effects distinguish erroneous from "real" action value coding? It seems incomplete to only plot the t-values, which are important for understanding parameter precision, without presenting the parameter effect sizes. Can real action value coding be distinguished by effect sizes that were meaningfully large (i.e., substantive versus statistical significance)?
To address this comment, we compared the explained variance of the action-value and random-walk neurons used in our paper. Surprisingly, the explained variance of the random-walk neurons is higher than that of the true action-value neurons.
One may argue that very high explained variance (say R^{2} > 0.25) can be used as conclusive evidence of action-value representation. However, we find that if the diffusion coefficient of the random-walk neurons is sufficiently large then a substantial fraction of the neurons will exhibit high values of R^{2}. For example, with a diffusion coefficient of 0.5, 31% of the random-walk neurons exhibit R^{2} > 0.25.
6) Along related lines, it seems like examining the pattern of effects is also useful. When comparing Figure 1D and Figure 2B, one can see that the erroneous detections included positive and negative ΔQ and ΣQ neurons, whereas for real detections (Figure 1D), there are far fewer of these neurons (by definition). All the erroneous detections generate spherical t-value plots, indicating that combinations of one or the other action value are independent. This seems not to be the case for real detections (in the authors' simulations), nor in real data (Samejima et al., 2005). This suggests that any non-uniformity in detecting combinations of action value coding would be evidence that it is not erroneous (even if the type I error is not properly controlled).
We partially answer this question (question 3 in the first set of comments above). Some non-uniformities may indeed indicate that the results are not due to random modulations. However, even when dealing with random modulations we may see certain biases that are caused by the design of the analysis. Another example is Figure 2. There we find that in the random-walk dataset, the fraction of state neurons is larger than that of policy neurons. We briefly address the fact that the results may be biased towards a specific classification in some experimental designs in Figure 2—figure supplements 4, 5, and Figure 3—figure supplement 1.
7) The simulations in Figure 2 are useful, but it would be useful to translate the diffusion parameter (σ) of the random walk into an (auto) correlation. This would make it easier for a reader to interpret how this relates to real data.
As discussed above, we fear that presenting autocorrelations may be misleading. In particular, the autocorrelation of the random-walk function for a finite (and small) number of trials, which is the case relevant for experiments, is very different from the function obtained when the number of trials is large. This is depicted in Author response image 4, where we compare the autocorrelation of the random-walk sessions of the paper with the autocorrelation function of the same process, computed using 5,000 trials.
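The finite-sample effect described above can be illustrated with a small simulation. This sketch makes several assumptions that are not from the paper: a plain Gaussian random walk (not the paper's random-walk spike counts), a lag of 50 trials, and short sessions of 200 trials. Only the 5,000-trial comparison length comes from the text.

```python
import random

def random_walk(n, sigma, rng):
    """Gaussian random walk of length n with step standard deviation sigma."""
    x, walk = 0.0, []
    for _ in range(n):
        x += rng.gauss(0.0, sigma)
        walk.append(x)
    return walk

def sample_autocorr(series, lag):
    """Standard sample autocorrelation at a given lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return num / denom

rng = random.Random(1)
lag = 50  # assumed lag, on the order of tens of trials

# Average over many short sessions versus one long realization.
short = [sample_autocorr(random_walk(200, 0.1, rng), lag) for _ in range(200)]
short_mean = sum(short) / len(short)
long_value = sample_autocorr(random_walk(5000, 0.1, rng), lag)
print(short_mean, long_value)
```

The same process yields a much lower estimated autocorrelation in short sessions than in a long one, because the sample mean is subtracted from a series that has not had time to explore its range. This is why the estimate from a few hundred trials is hard to interpret.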
8) Is the M1 data a proper control? It is hard to tell from the task description here. I wouldn't be able to replicate the task that was used given the description here. If that M1 data is published, a citation would be helpful. My concerns are whether it might have had unusually large temporal correlations and thus exaggerated the degree to which such correlations might confound action-value studies, due to either (1) having blocks of trials (as opposed to randomly interleaved trial types), (2) being a BMI task in which animals were trained to induce the recorded ensemble to emit specific long-duration activity patterns.
The motor cortex data was recorded in Eilon Vaadia’s lab and has not been published yet. We agree that the specific task the subject is performing may influence the overall firing rate or the temporal correlations in the neural activity and hence the false positive rates in the detection of action-value representation. However, we think it is unlikely that the recordings in this data set are an outlier in terms of autocorrelations. First, the monkey was extensively trained and all trials were identical, so there is nothing in the design of the task that suggests long-term correlations between trials. Second, the monkey was conditioned to enhance the power of beta band frequencies (20–30 Hz). The time scale of this frequency band differs by two orders of magnitude from the time scale separating different trials (on average 14.2 seconds). Finally, we considered the spike count prior to the beginning of the trials, while the monkey was still waiting for a GO signal.
https://doi.org/10.7554/eLife.34248.029
Article and author information
Author details
Funding
Israel Science Foundation (757/16)
 Yonatan Loewenstein
Deutsche Forschungsgemeinschaft (CRC1080)
 Yonatan Loewenstein
Gatsby Charitable Foundation
 Yonatan Loewenstein
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We are extremely grateful to Oren Peles, Eilon Vaadia and Uri Werner-Reiss for providing us with their motor cortex recordings; Bshara Awwad, Itai Hershenhoren and Israel Nelken for providing us with their auditory cortex recordings; Kenji Doya and Makoto Ito for providing us with their basal ganglia recordings; Mati Joshua, Gianluigi Mongillo, Jonathan Roiser and Roey Schurr for careful reading of the manuscript and helpful comments; and Inbal Goshen, Hanan Shteingart and Wolfram Schultz for discussions.
Reviewing Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Version history
 Received: December 11, 2017
 Accepted: May 13, 2018
 Accepted Manuscript published: May 31, 2018 (version 1)
 Version of Record published: June 19, 2018 (version 2)
Copyright
© 2018, Elber-Dorozko et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.