Robust and distributed neural representation of action values
Abstract
Studies in rats, monkeys, and humans have found action-value signals in multiple regions of the brain. These findings suggest that action-value signals encoded in these brain structures bias choices toward higher expected rewards. However, previous estimates of action-value signals might have been inflated by serial correlations in neural activity and also by activity related to other decision variables. Here, we applied several statistical tests based on permutation and surrogate data to analyze neural activity recorded from the striatum, frontal cortex, and hippocampus. The results show that previously identified action-value signals in these brain areas cannot be entirely accounted for by concurrent serial correlations in neural activity and action value. We also found that neural activity related to action value is intermixed with signals related to other decision variables. Our findings provide strong evidence for broadly distributed neural signals related to action value throughout the brain.
Introduction
Reinforcement learning theory provides a general theoretical framework for understanding the neural basis of value-based decision making (Corrado and Doya, 2007; Dayan and Niv, 2008; Glimcher, 2011; Lee et al., 2012a; Mars et al., 2012; O'Doherty et al., 2007). In algorithms based on this theory, an agent selects an action based on a set of action values (i.e., values associated with potential actions) in a given state (Sutton and Barto, 1998). Human and animal choice behaviors are parsimoniously accounted for by such algorithms. Furthermore, a large body of studies in rats, monkeys, and humans has found neural or hemodynamic signals correlated with action value in multiple regions of the brain, especially in the frontal cortex-basal ganglia loop (Chase et al., 2015; Ito and Doya, 2011; Lee, 2006; Lee et al., 2012a; Rushworth et al., 2009). These findings led to the view that multiple brain structures contribute to biasing choices toward relatively valuable options during decision making by representing a set of action values.
It is often difficult to rigorously demonstrate that neural activity is genuinely correlated with action value, because both neural activity and action value tend to fluctuate slowly over time and thus are serially correlated. Recently, for example, Elber-Dorozko and Loewenstein, 2018 made two lines of argument to suggest that action-value neurons had not been clearly demonstrated in the striatum. First, with a permutation test in which behavioral data from different sessions are used to determine significance of action-value-related neural activity, they found that the number of neurons encoding action value in the ventral striatum (VS) and ventral pallidum (VP; Ito and Doya, 2009) was reduced to a chance level. A more recent simulation study has also shown that naïve applications of the conventional F-test for multiple linear regression can suffer from false positives and hence a ‘nonsense correlation’ between a behavioral variable and autocorrelated neural activity (Harris, 2020). Second, Elber-Dorozko and Loewenstein, 2018 argued that neural activity related to action value may reflect other decision variables correlated with action value, such as a choice probability or policy. Even though Elber-Dorozko and Loewenstein, 2018 focused on striatal action-value signals, these problems might also be relevant to action-value signals reported in other brain areas.
Given the significance of these statistical issues concerning value-related signals throughout the brain, we decided to reanalyze the data we have collected in our previous studies using the methods designed to strictly account for temporal correlations in the data. In addition to the permutation test used in Elber-Dorozko and Loewenstein, 2018, we also used surrogate behavioral and neural data to determine the statistical significance of value signals. We also tested whether action-value neurons identified in our previous studies merely encode policy or state value rather than action value. Overall, the results from these analyses demonstrate that neural activity in many areas of the brain, including the striatum, robustly encodes action values.
Results
Neuronal and behavioral database
We analyzed neural activity related to action value as well as chosen value (value of the chosen action in a given trial). Included in this analysis are the neural data recorded from the dorsomedial striatum (DMS, 466 neurons), dorsolateral striatum (DLS, 206 neurons), VS (165 neurons), lateral orbitofrontal cortex (OFC, 1148 neurons), anterior cingulate cortex (ACC, 673 neurons), medial prefrontal cortex (mPFC, 854 neurons), secondary motor cortex (M2, 411 neurons), and dorsal CA1 region (508 neurons) in rats (n = 27; 383 sessions) performing a dynamic foraging task in a modified T-maze (Figure 1, see Materials and methods; Kim et al., 2009; Kim et al., 2013; Sul et al., 2010; Sul et al., 2011; Lee et al., 2012b; Lee et al., 2017). We also analyzed neural data recorded from the dorsolateral prefrontal cortex (DLPFC, 164 neurons), caudate nucleus (CD, 93 neurons), and VS (90 neurons) in three monkeys performing an intertemporal choice task (see Materials and methods; Kim et al., 2008; Cai et al., 2011). In these monkey experiments, temporally discounted values (DVs) of alternative choices were randomized across trials, so that all decision variables were devoid of temporal correlation. We included in the analysis only those neurons with mean firing rates ≥1 Hz during a given analysis time window. To assess action-value-related neural activity in rats, we analyzed neural spike data during the last 2 s of the delay period, immediately before the central bridge is lowered so that the animal is allowed to run forward and head toward the left or right goal location (Figure 1A; 196 DMS, 123 DLS, 68 VS, 782 OFC, 405 ACC, 431 mPFC, 301 M2, and 307 CA1 neurons). To assess action-value-related neural activity in monkeys, we analyzed neural spike data during the 1 s time window before the onset of sensory cues signaling two choice options (75 CD, 66 VS, and 105 DLPFC neurons).
To assess chosen-value-related neural activity in rats, we analyzed neural spike data during the 2 s time period centered around the outcome onset (±1 s since the choice outcome was revealed; 241 DMS, 139 DLS, 80 VS, 808 OFC, 401 ACC, 446 mPFC, 334 M2, and 326 CA1 neurons). In the following, we first describe the results from simulations to test false positive rates of several different statistical tests used in the present study in identifying action-value and chosen-value neurons. We then show the results of these tests applied to the actual neural data collected from rats performing the block-designed dynamic foraging task. We then address the issue of potentially misidentifying other decision-variable signals as action-value signals using the data from both rats and monkeys.
Validation of permutation and surrogate data-based tests
We first assessed false positive rates of different analysis methods using actual behavioral data and simulated null neural data whose autocorrelation was chosen to match that of the actual neural data. The simulated neural data were generated as follows:
$$x\left(t\right)=\alpha x\left(t-1\right)+\epsilon \left(t\right)$$
where x(t) is the firing rate of the simulated neuron at trial t, α is the autoregressive (AR) coefficient, and ε is a standard normal deviate. We then generated time series for spike counts assuming the simulated neuron is a Poisson process. We set α = 0.8 and 0.83 to test false positive rates of action-value and chosen-value signals, respectively, which were chosen to match the distributions of the first-order AR coefficient, AR(1), and mean firing rate to those of the actual neural data used to analyze action-value signals (neural activity during the last 2 s of the delay period; AR(1) = 0.19 ± 0.18 and mean firing rate = 6.14 ± 7.61 Hz, n = 2613 neurons) and chosen-value signals (neural activity during the 2 s time period centered around the outcome onset; AR(1) = 0.21 ± 0.20 and mean firing rate = 5.90 ± 6.72 Hz, n = 2775 neurons; mean ± SD).
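The null-neuron simulation described above can be sketched as follows. This is a minimal illustration, not the original analysis code: the way the latent AR(1) process is scaled, shifted, and clipped into a non-negative firing rate is an assumption, and the names `simulate_null_spike_counts`, `mean_rate_hz`, `rate_sd_hz`, and `lag1_autocorr` are hypothetical.

```python
import numpy as np


def simulate_null_spike_counts(n_trials, alpha, mean_rate_hz,
                               window_s=2.0, rate_sd_hz=1.0, seed=None):
    """Spike counts of a null neuron whose rate follows an AR(1) process."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)
    for t in range(1, n_trials):
        # x(t) = alpha * x(t-1) + eps(t), with eps a standard normal deviate
        x[t] = alpha * x[t - 1] + rng.standard_normal()
    # Assumption: map the latent process onto a firing rate around
    # mean_rate_hz and clip at zero so the Poisson rate is valid.
    rate = np.clip(mean_rate_hz + rate_sd_hz * x, 0.0, None)
    # Poisson spike counts in an analysis window of window_s seconds
    return rng.poisson(rate * window_s)


def lag1_autocorr(y):
    """Lag-1 autocorrelation of a series (a simple AR(1) estimate)."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(np.dot(y[:-1], y[1:]) / np.dot(y, y))


# Example with alpha = 0.8 and a mean rate near the value reported in the text
counts = simulate_null_spike_counts(n_trials=160, alpha=0.8,
                                    mean_rate_hz=6.14, seed=0)
print(lag1_autocorr(counts))
```

Because Poisson noise is added on top of the latent rate, the lag-1 autocorrelation of the spike counts is expected to be lower than α itself, which is in line with the AR(1) values reported for the spike data being much smaller than the α values used.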
We used these simulated neural data to test false positive rates of different analysis methods. Throughout the study, we identified action-value neurons as those whose activity is significantly related to either of the left and right action values (p<0.025 for Q_{L} or Q_{R}). A conventional t-test (linear regression analysis, model 1, Equation 5) yielded >10% action-value neurons, which is significantly greater than expected by chance (binomial test, p=3.7 × 10^{−22}). Adding potentially confounding variables to the regression (choice and chosen value; model 2, Equation 6) reduced the number of action-value neurons, but it was still significantly greater than expected by chance (binomial test, p=2.6 × 10^{−9}; Figure 2A). We used two different methods to handle false positive identification of action-value neurons in our previous studies. One method (within-block permutation; see Materials and methods; Kim et al., 2009) reduced the false positive rate further, but it was still significantly above the chance level (binomial test, p=5.9 × 10^{−5}). The other method (adding AR terms to the regression; see Materials and methods; Kim et al., 2013; Sul et al., 2010; Sul et al., 2011; Lee et al., 2012b; Lee et al., 2017) reduced the number of action-value neurons to the chance level (binomial test, p=0.191; Figure 2A). The use of the same tests was less problematic for the analysis of chosen-value signals. A conventional t-test (model 4, Equation 8) yielded ~9% chosen-value neurons and it was significantly greater than expected by chance (binomial test, p=3.7 × 10^{−8}). However, the number of chosen-value neurons was reduced to the chance level by adding confounding variables to the regression (model 5, Equation 9) and also by other methods used in our previous studies (applying within-block permutation or adding AR terms to model 5; Figure 2B).
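As an illustration of the binomial tests reported above, the sketch below computes the one-sided probability of observing at least a given number of significant neurons by chance. It is a generic upper-tail calculation under assumptions: the chance level of 0.05 and the count of 170 neurons are hypothetical placeholders, and whether the original tests were one-sided is not stated here. Only the population size of 2613 neurons comes from the text.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p**i * (1 - p)**(n - i); lgamma avoids overflow
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical example: 170 of 2613 simulated neurons pass the criterion,
# tested against an assumed chance level of 0.05.
p_value = binomial_upper_tail(170, 2613, 0.05)
print(p_value)
```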
We then tested four different methods based on surrogate data to determine statistical significance of value terms in multiple regression models (models 2 and 5; Equations 6 and 9). The first two methods used surrogate behavioral data. Specifically, we tested session permutation (Elber-Dorozko and Loewenstein, 2018) and pseudosession (Harris, 2020) methods. In the session permutation test, surrogate behavioral data was drawn from other behavioral sessions. In the pseudosession test, surrogate behavioral data for a given session was generated based on a reinforcement learning model using the model parameters estimated for the same animal (see Materials and methods). The other two methods used surrogate neural data generated with Fourier phase randomization (FPR). For this, we tested the conventional FPR method and the amplitude-adjusted Fourier transform (AAFT) method (Theiler et al., 1992; see Materials and methods). Both methods generate surrogate neural data with the same Fourier amplitude as the original data, but with randomized phase. The two methods differ in that the surrogate neural data has a normalized (i.e., normally distributed) spike count distribution (FPR) or maintains the original spike count distribution (AAFT). In all of these methods, the p-value for a regression coefficient was determined by the frequency with which the magnitude of the t-value obtained using surrogate data exceeds that of the original t-value. When tested using the simulated neural data, all four methods yielded ~5% of false positive action-value and chosen-value neurons, and none of them was significantly higher than expected by chance (binomial test, p>0.05; Figure 2A,B). Therefore, these tests are unlikely to suffer from an inflated false positive rate when applied to our actual neural data.
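The two neural-surrogate methods and the surrogate-based p-value can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' implementation: the AAFT steps follow the general recipe of Theiler et al. (1992), the function names are hypothetical, and a plain correlation coefficient stands in for the regression t-value used in the study.

```python
import numpy as np


def fpr_surrogate(x, rng):
    """Fourier phase randomization: keep the amplitude spectrum, draw new phases."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spec = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, spec.size)
    phases[0] = 0.0                 # the zero-frequency term stays real
    if n % 2 == 0:
        phases[-1] = 0.0            # the Nyquist term stays real for even n
    surrogate = np.fft.irfft(np.abs(spec) * np.exp(1j * phases), n=n)
    return surrogate + x.mean()


def aaft_surrogate(x, rng):
    """Amplitude-adjusted variant: the output is a reordering of the original values."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))
    # Gaussian series arranged in the same rank order as the data
    gauss = np.sort(rng.standard_normal(x.size))[ranks]
    randomized = fpr_surrogate(gauss, rng)
    # Reorder the original values to follow the ranks of the randomized series
    return np.sort(x)[np.argsort(np.argsort(randomized))]


def surrogate_p_value(stat_fn, neural, n_surrogates, make_surrogate, rng):
    """Fraction of surrogates whose |statistic| exceeds the observed |statistic|."""
    observed = abs(stat_fn(neural))
    null = np.array([abs(stat_fn(make_surrogate(neural, rng)))
                     for _ in range(n_surrogates)])
    return float(np.mean(null > observed))


# Stand-in data (not real recordings): Poisson counts and a drifting regressor
rng = np.random.default_rng(0)
counts = rng.poisson(6.0, size=160)
value = np.cumsum(rng.standard_normal(160))


def corr_with_value(y):
    # Assumption: correlation used here in place of a regression t-value
    return np.corrcoef(y, value)[0, 1]


p_fpr = surrogate_p_value(corr_with_value, counts, 200, fpr_surrogate, rng)
p_aaft = surrogate_p_value(corr_with_value, counts, 200, aaft_surrogate, rng)
print(p_fpr, p_aaft)
```

In this sketch the FPR surrogate is real-valued and no longer integer-valued, whereas the AAFT surrogate is a permutation of the original spike counts, which mirrors the difference between the two methods described above.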
For the session permutation method, we found that trial-by-trial action values are substantially correlated between the original and resampled behavioral sessions. We used four different combinations of reward probabilities (left:right = 0.72:0.12, 0.63:0.21, 0.21:0.63 and 0.12:0.72) in our previous studies and, even though their sequence was randomized, there was a constraint that the option with the higher reward probability always changes its location at the beginning of a new block. The number of trials per block was also similar across studies (40.1 ± 3.1; mean ± SD). Hence, the relative reward probability tended to be correlated or anticorrelated between two different sessions depending on whether the first blocks of the two sessions had the same or different locations for the higher-reward-probability target (Figure 2C). Likewise, in the pseudosession method, which generates simulated behavioral data according to the same block structure of a given behavioral session, trial-by-trial action values tended to be positively correlated between actual and simulated behavioral sessions (Figure 2C). This raises the possibility that for the neural data collected during the experiments with a block design, the session permutation and pseudosession methods might be too stringent (high false negative rate) for the identification of action-value neurons. Unlike action values, trial-by-trial chosen values were only weakly correlated between the original and resampled behavioral sessions (Figure 2C).
Activity related to action value and chosen value
We applied the above methods to the actual neural data obtained from rats. We analyzed the neural data during the last 2 s of the delay period to assess action-value-related neural activity. As expected, the conventional t-tests yielded high levels of action-value signals and they were reduced substantially by employing the within-block permutation procedure or adding AR terms. All of these methods yielded significant (binomial test, p<0.05) fractions of action-value neurons in all tested brain structures except the DLS (Figure 3A, top). The pseudosession, FPR, and AAFT methods also yielded significant action-value signals in all of these brain structures except the DLS. The proportion of action-value-coding neurons tended to be somewhat lower when they were determined with the session permutation method, but this was still significantly above the chance level in several brain areas, including the striatum, OFC, and hippocampus (Figure 3A, bottom). When applied to neural data during the 2 s time period centered around the outcome onset, all of these methods yielded significant chosen-value signals in all tested brain structures (Figure 3B). These results show significant encoding of action-value and chosen-value signals in multiple areas of the rat brain that cannot be explained by slowly drifting and serially correlated neural activity.
Transformation of value signals
In reinforcement learning theory, action values are monotonically related to the probability of choosing the corresponding actions, referred to as policy, making it hard to distinguish the neural activity related to either of these quantities. In addition, the activity of individual neurons is likely to encode multiple variables simultaneously (Rigotti et al., 2013). Despite these difficulties, it has been argued that neural signals related to action value might actually represent policy exclusively (Elber-Dorozko and Loewenstein, 2018). To address this issue quantitatively, we used the difference in action values (ΔQ) and their sum (ΣQ) as proxies for policy and state value, respectively, and tested how signals for action value, policy, and state value are related in a population of neurons in different brain structures.
If the activity of a given neuron is strongly related to policy, then its activity would be related to the difference in action values, ΔQ, but not their sum, ΣQ. To test whether this is the case, we analyzed the same neural data used to assess action-value-related neural activity in rats (neural spikes during the last 2 s of the delay period). As shown in Figure 4B, some of the action-value-responsive neurons showed activity correlated with ΣQ (25.8–62.5% across different brain areas), some with ΔQ (13.2–42.9%), and others with both ΣQ and ΔQ (6.3–25%). There were also neurons that were exclusively responsive to action value (0–22.7%). Conversely, some of the ΣQ- and ΔQ-responsive neurons were also responsive to action value (ΣQ, 59.1–100%; ΔQ, 11.5–38.5%) and some were exclusively responsive to ΣQ (0–40.9%) or ΔQ (61.5–88.5%). These results indicate that populations of neurons in many brain areas tend to represent all of these variables rather than exclusively representing only one type of decision variable. For comparison, we also analyzed neural activity recorded during the 1 s time window before cue onset from the CD (a part of the dorsal striatum), VS, and DLPFC of monkeys performing an intertemporal choice task (Cai et al., 2011; Kim et al., 2008). The results from this analysis were similar to those obtained from rats (Figure 4C), suggesting that DLPFC and striatal neurons in monkeys also represent all of these variables rather than exclusively representing only one type of value signal.
Even though all the brain regions tested in this study represented multiple types of value signals in parallel, their relative signal strengths varied across brain regions. If multiple types of value signals are represented equally often and strongly, then the points in Figure 4 would be rotationally invariant. By contrast, the pattern of anisotropy in these plots would change if the neurons in a given brain area tend to encode a specific type of value signal. For example, if neurons in a given brain area mostly encode ΣQ, the points representing individual neurons would be clustered along the identity line, since the regression coefficients for Q_{L} and Q_{R} would be similar for such neurons (Figure 4A). As shown in Figure 4, the distribution pattern of Q_{L}-versus-Q_{R} regression coefficients varied substantially across regions.
To quantify this further, we computed the mean resultant vectors after multiplying the angle of the vector defined by the regression coefficients for action values, θ (see Materials and methods), by a specific factor. First, we compared the vertical component of the mean resultant vector calculated after doubling these angles to test whether neurons in each area might be biased for coding ΣQ or ΔQ (Figure 4A). The results from this analysis showed that the vertical component of the mean resultant vector was significantly positive in all regions in the rat (Wilcoxon rank-sum test, statistical test results summarized in Supplementary file 1; see also Figure 4B), indicating stronger encoding of ΣQ than ΔQ signals. In addition, the vertical component of this resultant varied in magnitude across regions. In the striatum, it was significantly larger in the VS than in the DMS and DLS. In the cortical areas, it was significantly larger in the OFC, mPFC, and ACC than in the M2 (one-way ANOVA followed by Bonferroni post hoc tests, statistical test results summarized in Supplementary file 1). In the monkey, the vertical component of the mean resultant vector was significantly negative in the DLPFC (Wilcoxon rank-sum test; Figure 4C), indicating stronger encoding of ΔDV than ΣDV signals in the DLPFC. In addition, the vertical component of the resultant in the VS was significantly different from those in the DLPFC and CD, suggesting that the VS neurons tended to encode ΣDV signals more strongly than DLPFC and CD neurons (statistical test results summarized in Supplementary file 1). Collectively, these results showed that ΣQ signals are generally stronger than ΔQ signals in the rat brain areas examined in this study, and ΔDV signals are particularly strong in the monkey DLPFC.
Next, we examined the horizontal component of the mean resultant vector after multiplying the angles of the regression coefficient vector by four in order to test whether neurons in each brain area tend to favor coding action values of individual choices or whether they tend to combine action values to encode policy or state value. After this transformation, Q_{L}- and Q_{R}-coding neurons together would form vectors along the x-axis with positive horizontal components, whereas ΣQ- and ΔQ-coding neurons together would form vectors along the x-axis in the negative domain (Figure 4A). The results from this analysis showed that the horizontal component of the mean resultant vector was significantly negative in the VS, OFC, mPFC, ACC, and M2, but not in the other regions of the rat brain (Wilcoxon rank-sum test; Figure 4B and C; statistical test results summarized in Supplementary file 1), indicating that signals related to ΣQ and ΔQ were more strongly represented than action-value signals of individual choices in the rat VS, OFC, mPFC, ACC, and M2.
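The angle-doubling and angle-quadrupling analyses described above can be illustrated with a short sketch. It rests on assumptions that are not confirmed by this section: θ is taken as the angle of the (Q_{L}, Q_{R}) coefficient vector computed with `atan2`, every neuron contributes a unit vector regardless of coefficient magnitude, and the function name `mean_resultant` is hypothetical.

```python
import math


def mean_resultant(b_left, b_right, factor):
    """Mean resultant vector of coefficient angles multiplied by `factor`.

    b_left, b_right: regression coefficients for Q_L and Q_R, one per neuron.
    Returns (horizontal, vertical) components of the mean unit vector.
    """
    # Assumption: theta is the angle of the (Q_L, Q_R) coefficient vector
    angles = [factor * math.atan2(br, bl) for bl, br in zip(b_left, b_right)]
    n = len(angles)
    horizontal = sum(math.cos(a) for a in angles) / n
    vertical = sum(math.sin(a) for a in angles) / n
    return horizontal, vertical


# Toy coefficients: two sum-coding neurons (same sign for Q_L and Q_R)
sum_left, sum_right = [1.0, -1.0], [1.0, -1.0]
# Doubling the angles maps sum coding to the positive vertical direction
print(mean_resultant(sum_left, sum_right, 2))

# Toy coefficients: one Q_L-coding and one Q_R-coding neuron
single_left, single_right = [1.0, 0.0], [0.0, -1.0]
# Quadrupling the angles maps single-action coding to the positive x-axis
print(mean_resultant(single_left, single_right, 4))
```

With the angles doubled, neurons at 45° and 225° (ΣQ coding) both land at 90°, and neurons at 135° and 315° (ΔQ coding) both land at 270°, so the sign of the vertical component separates the two. With the angles quadrupled, all four axis-aligned directions map to 0° and all four diagonal directions map to 180°, so the sign of the horizontal component separates single-action coding from combined coding.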
Discussion
Neural signals related to action value have been found in widespread regions of the brain, especially in the frontal cortex-basal ganglia loop (Chase et al., 2015; Ito and Doya, 2011; Lee, 2006; Lee et al., 2012a; Rushworth et al., 2009), suggesting the involvement of multiple brain structures in value-based decision making. However, the potential confounding of concurrent autocorrelations in value signals and neural activity and the possible superposition of different types of value signals have not been clearly resolved. The results in the present study confirm significant action-value signals in most of the brain regions tested previously. We also found that action-value-related neural activity coexists with that related to policy and state values. These results support previous conclusions that action values are represented in many regions of the brain. Below, we discuss these two issues along with the significance of anatomical variation in value signals.
Concurrence of autocorrelation in behavioral and neural data
Neural spikes are often correlated across trials, as we demonstrated for all brain structures examined in the present study. Note that serial correlation in neural activity could be due to intrinsic nonstationarity and/or its relationship with action value. In the present study, we reanalyzed our previous neural data to rigorously test whether action-value neurons identified in previous block-design studies might result from serial correlation in neural spikes unrelated to action value. Recently, this issue was examined with simulated neural and behavioral data (Elber-Dorozko and Loewenstein, 2018; Harris, 2020), but the nature of serial correlations in simulated neural spikes and action value might deviate substantially from those of actual neural and behavioral data. In the present study, using actual behavioral data and simulated neural data whose level of autocorrelation was matched to that of the actual neural data, we first established that four different analysis methods, namely the session permutation, pseudosession, FPR and AAFT methods (Elber-Dorozko and Loewenstein, 2018; Harris, 2020; Theiler et al., 1992), do not inflate action-value signals. Applying these methods to actual neural data, we still found significant action-value signals in multiple areas of the rat brain. These findings indicate that action-value signals in our previous block-design studies cannot be entirely attributed to concurring autocorrelations in behavioral data and neural spikes unrelated to action value.
It should be noted that the optimal methods to assess value-related neural activity might vary depending on exact structures of neural and behavioral data. In our studies, because of similarity in block structure across sessions, trial-by-trial action values were substantially correlated across sessions. This suggests that session permutation might be excessively stringent for testing action-value-related neural activity. This is similarly problematic for the pseudosession test because simulated behavioral sessions have the same block structure as the original behavioral session. Indeed, both methods yielded somewhat lower fractions of action-value neurons compared to the FPR and AAFT tests. The pseudosession method yielded somewhat higher fractions of action-value neurons than the session permutation test in most tested regions, which suggests that some variability in the animal’s behavior shared across different sessions (e.g., a slow change in motivation) might not be captured by the models used to estimate action values. For blocked behavioral sessions, therefore, the FPR and AAFT methods are likely to estimate action-value signals more accurately than the session permutation and pseudosession methods. Our results also suggest that the problem arising from serial correlation in neural activity can be ameliorated by adding AR terms in the regression model. In our study, the results obtained with the FPR and AAFT methods were similar. Nevertheless, the simulated neural data obtained with the FPR lose their discrete properties and become normally distributed, whereas the AAFT maintains the original distribution of spike counts. Therefore, the AAFT method might be more reliable when the neural signals of interest are influenced by a non-Gaussian or discrete nature of neural data.
Neural activity is almost always serially correlated, and this makes it difficult to select appropriate statistical methods to identify how sensory, motor, or other cognitive variables are encoded in the brain when they are also serially correlated. For each candidate analysis method, therefore, it would be prudent to examine the rates of potential false positivity and negativity using a null data set that captures important features of the data set under investigation.
Multiple types of value signals
Neural activity seemingly representing action value might in fact represent other decision variables, such as policy, that are correlated with action value (Elber-Dorozko and Loewenstein, 2018). To test this, we compared neuronal responses to action value with those related to the sum of two action values and their difference as proxies for neuronal responses related to state value and policy, respectively. We found neurons carrying diverse combinations of value-related signals in the striatum, frontal cortical areas, and hippocampus. The majority of action-value-coding neurons also coded state value and/or policy and, conversely, the majority of state-value- and/or policy-coding neurons also coded action value. Also, a small number of neurons encoded action value without state value or policy, and some neurons encoded state value or policy without action value. Similarly, using a task in which values associated with specific colors and locations of sensory cues can be dissociated, we have shown previously that partially overlapping populations of neurons represent values associated with target colors and locations in the striatum and DLPFC in monkeys (Kim et al., 2012). Collectively, these results suggest that neurons in the striatum, frontal cortical areas, and hippocampus might not represent multiple types of value signals categorically, but instead show random mixed selectivity (Hirokawa et al., 2019; Raposo et al., 2014). Namely, the results from our analyses suggest that the relative weights given to different types of value signals vary continuously across individual neurons in most brain areas.
In addition to this heterogeneity in value coding within each brain region, how different types of value signals are encoded by individual neurons also varied across the brain structures examined in the present study. In the rat, all the tested regions, especially the OFC, mPFC, ACC, and VS, tended to overrepresent signals related to the sum of action values, but this tendency was weaker in the M2. These results suggest that the OFC, mPFC, ACC, and VS might mainly process signals related to the expected rewards that can be obtained in a given state (Bari et al., 2019), whereas the M2 might be concerned more with policy and action selection (Sul et al., 2011). The primate DLPFC conveyed relatively strong signals correlated with the difference between action values, suggesting its function might be more strongly related to policy and action selection than state value. These findings are at odds with functional homology between the rodent mPFC and monkey DLPFC (Uylings et al., 2003; Vertes, 2006). As in the rat striatum, we found a stronger representation of signals related to the sum of action values in the VS than in the CD in monkeys. This is consistent with the proposal that subdivisions of the striatum correspond to distinct cortico-basal ganglia loops serving different functions (Devan et al., 2011; Ito and Doya, 2011; Redgrave et al., 2010; Yin and Knowlton, 2006). Further studies are needed to clarify relative strengths of different decision variables in different brain structures and how they are related to the functions served by individual brain structures.
Materials and methods
Behavioral and neural data
We analyzed single-neuron activity recorded from the dorsomedial (DMS, n = 466), dorsolateral (DLS, n = 206), and ventral (VS, n = 165) striatum of six rats performing a dynamic foraging task (a total of 81 sessions) in our previous studies (Kim et al., 2013; Kim et al., 2009), as well as activity recorded from the lateral OFC (n = 1148, three rats), ACC (n = 673, five rats), mPFC (n = 854, six rats), M2 (n = 411, three rats), and dorsal CA1 (n = 508, 11 rats) in our previous studies (total 302 sessions; Sul et al., 2010; Sul et al., 2011; Lee et al., 2012b; Lee et al., 2017). For the analysis of action-value signals, we focused on neural activity during the last 2 s interval of the delay period and included only the neurons with mean discharge rates ≥1 Hz during the analysis window. For the analysis of chosen-value signals, we analyzed the activity during the 2 s time period centered around the outcome onset for the neurons with mean discharge rates ≥1 Hz during the analysis window. We also analyzed neural activity previously recorded in the CD, VS, and DLPFC of three monkeys performing an intertemporal choice task (Cai et al., 2011; Kim et al., 2008). This analysis was based on the activity during the 1 s cue period of the neurons with mean discharge rates ≥1 Hz.
Behavioral task
Details of behavioral tasks have been published previously (Cai et al., 2011; Kim et al., 2008; Kim et al., 2013; Kim et al., 2009; Lee et al., 2012a; Lee et al., 2017; Sul et al., 2011; Sul et al., 2010). Briefly, each rat performed one of two different dynamic foraging tasks. Each trial began as the rat returned to the central stem (detected by a photobeam sensor; green arrow in Figure 1A) of a modified T-maze from either target location (orange circles in Figure 1A). After a delay of 2–3 s, the central bridge was lowered (delay offset) allowing the rat to navigate forward and choose freely between the two goal locations to obtain water reward. The rats performed four blocks of trials with each block associated with one of four different reward probability pairs (left:right = 0.72:0.12, 0.63:0.21, 0.21:0.63 or 0.12:0.72). The sequence of blocks was randomly determined with the constraint that the higher-probability target changes its location at the beginning of each block. In the two-armed bandit (TAB) task (n = 215 sessions, n = 17 rats; Kim et al., 2009; Lee et al., 2012b; Sul et al., 2010), water was delivered probabilistically only at the chosen location in a given trial, whereas in the dual assignment with hold (DAWH) task (n = 168 sessions, n = 10 rats; Kim et al., 2013; Lee et al., 2017; Sul et al., 2011), water was delivered probabilistically at both locations according to a concurrent variable-ratio reinforcement schedule. Water delivered at the unvisited goal remained available until the rat’s next visit without additional water delivery. This implies that reward probability for a given target increases with the number of consecutive choices for the other target during the DAWH task. Mean (± SD) trial duration was 17.64 ± 13.35 s in the TAB task and 16.25 ± 14.82 s in the DAWH task.
Monkeys performed an intertemporal choice task (Cai et al., 2011; Kim et al., 2008). A trial began with the monkey’s fixation of gaze on a white square presented at the center of a computer screen. Following a 1 s foreperiod, two peripheral targets were presented. One target was green and delivered a small reward (0.26 ml of apple juice) when it was chosen, whereas the other target was red and delivered a large reward (0.4 ml of apple juice). The number of yellow disks (n = 0, 2, 4, 6, or 8) around each target indicated the delay (1 s/disk) between the animal’s choice and reward delivery (0 or 2 s for a small reward; 0, 2, 4, 6, or 8 s for a large reward). Each of the 10 possible delay pairs for the two targets was displayed four times in alternating blocks of 40 trials in a pseudorandom manner with the position of the large-reward target counterbalanced.
Reinforcement learning models
We used the Q-learning model (Sutton and Barto, 1998) to calculate the action values (Q_{L} and Q_{R} for left-target and right-target choices, respectively) for the TAB task, and the stacked-probability model (Huh et al., 2009) for the DAWH task. In the Q-learning model, action values (${Q}_{a}\left(t\right)$) were computed in each trial as follows:
where α is the learning rate, $R\left(t\right)$ denotes the reward in the t-th trial (1 if rewarded and 0 otherwise), and $a$ indicates the selected action (left or right goal choice). In the stacked-probability model, values were computed considering that the reward probability of the unchosen target increases as a function of the number of consecutive alternative choices (see Huh et al., 2009 for details).
For the intertemporal choice task (Cai et al., 2011; Kim et al., 2008), the temporally discounted value (DV) was computed using a hyperbolic discount function as follows:
where ${A}_{x}$ and ${D}_{x}$ indicate the magnitude and the delay of the reward from target x, and the parameter k determines the steepness of the discount function. We indicate action value as DV_{x} instead of Q_{a} to denote the temporally discounted value in the monkey studies. Actions were chosen according to the softmax action selection rule in all models as follows:
where ${P}_{L}\left(t\right)$ is the probability of choosing the left goal, $\beta $ is the inverse temperature that defines the degree of randomness in action selection, b is a bias term for selecting the left target, and Q_{L} and Q_{R} (or ${DV}_{L}$ and ${DV}_{R}$) are the values associated with the two alternative actions of choosing the left and right targets, respectively, in trial t. All model parameters were estimated using a maximum likelihood method.
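The model components described above can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes the standard Q-learning update (only the chosen action's value moves toward the obtained reward), the standard hyperbolic form A/(1 + kD), a two-action softmax with the bias term added inside the exponent, and initial action values of zero. All function names are hypothetical.

```python
import numpy as np

def q_learning_values(alpha, choices, rewards, q0=0.0):
    """Return [Q_L(t), Q_R(t)] held at the start of each trial.

    choices: 1 = left, 0 = right; rewards: 1 = rewarded, 0 = unrewarded.
    q0 (initial value) is an assumption of this sketch.
    """
    q_left, q_right = q0, q0
    values = np.empty((len(choices), 2))
    for t, (c, r) in enumerate(zip(choices, rewards)):
        values[t] = (q_left, q_right)
        # Only the chosen action's value is updated toward the reward.
        if c == 1:
            q_left += alpha * (r - q_left)
        else:
            q_right += alpha * (r - q_right)
    return values

def hyperbolic_dv(magnitude, delay, k):
    """Temporally discounted value under hyperbolic discounting."""
    return magnitude / (1.0 + k * delay)

def p_left(value_left, value_right, beta, b):
    """Softmax probability of choosing left (assumed placement of the bias b)."""
    return 1.0 / (1.0 + np.exp(-(beta * (value_left - value_right) + b)))

def neg_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of the choices; minimizing it gives the ML fit."""
    alpha, beta, b = params
    q = q_learning_values(alpha, choices, rewards)
    pl = p_left(q[:, 0], q[:, 1], beta, b)
    p_chosen = np.where(np.asarray(choices) == 1, pl, 1.0 - pl)
    return -np.sum(np.log(np.clip(p_chosen, 1e-12, None)))
```

Passing `neg_log_likelihood` to a general-purpose numerical optimizer would be one way to obtain maximum likelihood estimates of the learning rate, inverse temperature, and bias.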
Regression analysis
We used multiple linear regression models to identify neurons related to action value or chosen value. For action-value-related neural activity, we analyzed neural spikes during the delay period (before action selection) using several different regression models. The simplest contained only the left and right action values as explanatory variables as follows:
where S(t) is the spike count in a given analysis time window in trial t, ${Q}_{L}\left(t\right)$ and ${Q}_{R}\left(t\right)$ are the action values for the left and right target choices, respectively, and $\epsilon \left(t\right)$ is the error. The majority of the analysis was based on the following model that contained the animal’s choice (C, 1 if left and 0 if right) and chosen value (Q_{c}) as additional explanatory variables to control for effects of these variables on action values:
We subjected this model to various resampling-based tests to identify action-value neurons. To compare the results with those from our previous analysis method (Sul et al., 2010; Sul et al., 2011; Lee et al., 2012b; Kim et al., 2013; Lee et al., 2017), we added autoregressive (AR) terms, namely neural spikes during the same analysis time window in the previous three trials, to model 2.
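The regression step can be sketched as an ordinary least-squares fit that returns a t-value for each coefficient. This is a minimal sketch using NumPy only; the function name and the column order of the design matrix are assumptions, and the t-values it produces are the quantities that the resampling-based tests would then compare against their null distributions.

```python
import numpy as np

def ols_t_values(y, X):
    """Fit y = a0 + X @ a + error by OLS; return coefficients and their t-values.

    An intercept column is prepended, so index 0 refers to a0.
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]                 # residual degrees of freedom
    sigma2 = resid @ resid / dof              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance of the coefficients
    return coef, coef / np.sqrt(np.diag(cov))
```

For the model with choice and chosen value, a design matrix could be assembled as `np.column_stack([choice, q_left, q_right, q_chosen])`, where the four arrays are hypothetical per-trial vectors; AR terms would be added as further columns holding the spike counts of the previous three trials.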
To investigate how multiple types of value signals are encoded in the activity of neurons across different brain areas, we tested a regression model that includes the sum of action values, ΣQ(t)=Q_{L}(t)+Q_{R}(t), and their difference, ΔQ(t)=Q_{L}(t)−Q_{R}(t), which roughly correspond to state value and policy, respectively.
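As a worked identity, the two parameterizations can be related directly. Substituting $Q_L = (\Sigma Q + \Delta Q)/2$ and $Q_R = (\Sigma Q - \Delta Q)/2$ into a model with coefficients $a_L$ and $a_R$ on the two action values (symbols used here for illustration only) gives:

$$a_L Q_L(t) + a_R Q_R(t) = \frac{a_L + a_R}{2}\,\Sigma Q(t) + \frac{a_L - a_R}{2}\,\Delta Q(t)$$

Under this identity, a neuron with $a_L = a_R$ loads only on ΣQ, and a neuron with $a_L = -a_R$ loads only on ΔQ.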
This regression model would fit the data as well as the model containing action values (Q_{L} and Q_{R}) because ΔQ and ΣQ are linear combinations of action values. For chosen-value-related neural activity recorded in rats at the time the choice outcome was revealed, the following two regression models were used:
where R(t) is reward (1 if rewarded and 0 if unrewarded) and X(t) is the interaction between choice and reward.
Action-value-related neural activity in the monkey was analyzed using the following regression model:
where ${DV}_{L}\left(t\right)$ and ${DV}_{R}\left(t\right)$ are temporally discounted values for the left and right target choices, respectively, and ${DV}_{chosen}\left(t\right)$ and ${DV}_{unchosen}\left(t\right)$ are temporally discounted values for the chosen and unchosen target choices, respectively. Neural activity related to the sum of and difference between temporally discounted values ($\sum DV$ and $\mathrm{\Delta}DV$, respectively) was assessed with the following regression model:
Permutation and surrogate data-based tests
For the session permutation and pseudo-session tests, value-related neural activity was assessed using spike data of the original session. In the session permutation test, the original neural data were paired with the 382 remaining behavioral sessions. The results did not differ qualitatively when we paired the neural data only with the same type of behavioral sessions as the original one (214 TAB and 167 DAWH remaining sessions). In the pseudo-session test, we generated 500 simulated behavioral sessions based on the Q-learning (for TAB-task sessions) or stacked-probability (for DAWH-task sessions) model using model parameters estimated for a given animal. For the Fourier phase randomization (FPR) and amplitude-adjusted Fourier transform (AAFT) tests (Theiler et al., 1992), value-related neural activity was assessed using the original behavioral data and 1000 samples of surrogate neural data. In the FPR test, each surrogate neural data set was generated with the same Fourier amplitudes as the original data but with random phases. In the AAFT test, the same number of elements as the number of trials in the original neural data was drawn randomly from a Gaussian distribution, and these elements were then sorted according to the rank of the neural data (Gaussianization). All zero values (no spikes) of the neural data were replaced with small (<1) randomly chosen nonzero values in order to avoid artifacts in sorting consecutive zero values. The FPR method was then applied to the sorted Gaussian data. Finally, the original neural data were reordered according to the rank of the phase-randomized Gaussian data (de-Gaussianization), and these reordered neural data were used as surrogate neural data. For comparison, we also tested the within-block permutation procedure we used in our previous study (Kim et al., 2009). For this, we randomly shuffled spike data 1000 times across different trials within each block while preserving the original block sequence.
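The two surrogate-generation procedures described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the function names are hypothetical, the zero-frequency (mean) term is kept fixed, and the range used for the tie-breaking values is an assumption.

```python
import numpy as np

def phase_randomize(x, rng):
    """FPR surrogate: same Fourier amplitudes as x, random phases."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=len(spectrum))
    surrogate = np.abs(spectrum) * np.exp(1j * phases)
    surrogate[0] = spectrum[0]            # keep the mean unchanged
    if n % 2 == 0:
        surrogate[-1] = spectrum[-1]      # Nyquist term stays real for even n
    return np.fft.irfft(surrogate, n=n)

def aaft_surrogate(x, rng):
    """AAFT surrogate: Gaussianize by rank, phase-randomize, de-Gaussianize."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Replace zeros with small (<1) random values so runs of zeros are not
    # ranked in their original trial order.
    jittered = np.where(x == 0, rng.uniform(1e-6, 1.0, size=n), x)
    ranks = np.argsort(np.argsort(jittered))
    # Gaussianization: Gaussian draws arranged in the rank order of the data.
    gauss = np.sort(rng.standard_normal(n))[ranks]
    randomized = phase_randomize(gauss, rng)
    # De-Gaussianization: original values reordered by the surrogate's ranks.
    new_ranks = np.argsort(np.argsort(randomized))
    return np.sort(x)[new_ranks]
```

A call such as `aaft_surrogate(spike_counts, np.random.default_rng())` returns a reordering of the original spike counts, so the surrogate preserves the distribution of spike counts exactly while approximately preserving the autocorrelation structure.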
Statistical analysis
Significance (p-value) of a regression coefficient was determined with the t-test or by the frequency with which the absolute magnitude of the t-value for the regression coefficient obtained using a permutation test or surrogate data exceeded that of the original t-value (resampling-based tests). Statistical significance of the fraction of action-value or chosen-value neurons in a given brain area was determined based on the binomial test.
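Both steps can be sketched in a few lines. This sketch rests on assumptions not stated in the text: ties are not counted as exceeding the original value, the binomial test is taken as one-sided against a chance level of 0.05, and the function names are hypothetical.

```python
import math
import numpy as np

def resampling_p_value(t_original, t_null):
    """Fraction of null |t| values that exceed the original |t|."""
    t_null = np.abs(np.asarray(t_null, dtype=float))
    return float(np.mean(t_null > abs(t_original)))

def binomial_upper_tail(k, n, p0=0.05):
    """P(X >= k) for X ~ Binomial(n, p0), computed in log space for large n."""
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(p0) + (n - i) * math.log1p(-p0)
        for i in range(k, n + 1)
    ]
    return min(1.0, sum(math.exp(v) for v in log_terms))
```

Here `k` would be the number of neurons in an area whose resampling-based p-value falls below the criterion, and `n` the number of neurons analyzed in that area.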
To examine how different types of value signals are represented across different brain areas, we exploited the fact that the neurons encoding specific types of value signals, such as action values or policy, would be distributed along an oriented line through the origin in a complex plane defined by $z = R\cdot {e}^{i\theta } = {a}_{L}+{a}_{R}\cdot i$, where $R = \sqrt{{a}_{L}^{2}+{a}_{R}^{2}}$, $\theta = \mathrm{atan}\left({a}_{R}/{a}_{L}\right)$, $i=\sqrt{-1}$, and ${a}_{L}$ and ${a}_{R}$ are regression coefficients for left and right action values, respectively (namely, ${a}_{1}$ and ${a}_{2}$ in Equations 6 and 10). In this plane, neurons encoding state value or policy would display twofold rotational symmetry, since they would be distributed mainly along the lines defined by y = x or y = −x. When the angles are doubled, Q_{L}- and Q_{R}-coding neurons would form vectors along the x-axis (Q_{L}, positive; Q_{R}, negative), while ΣQ- and ΔQ-coding neurons would form vectors along the y-axis (ΣQ, positive; ΔQ, negative). Therefore, we examined the vertical component of the mean resultant vector after multiplying the angle of the vector z by a factor of 2 in order to test whether neurons in a given area tended to encode policy or state value more strongly. By contrast, the neurons encoding action values would show fourfold rotational symmetry, since they would be clustered around x = 0 or y = 0. Therefore, we examined the horizontal component of the mean resultant vector after multiplying the angle of z by a factor of 4 in order to test whether neurons tended to encode action values of individual choices or combine them for policy or state value. We used the Wilcoxon rank-sum test to determine whether the horizontal or vertical component of the mean vector was significantly different from 0, and one-way ANOVA and Bonferroni post hoc tests to test whether they varied significantly across regions.
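The angle-multiplication step can be illustrated with a short sketch. It assumes that each neuron contributes a unit vector (the magnitude R is ignored) and that the full four-quadrant angle is used; whether the original analysis weighted neurons by R is not specified here, and the function name is hypothetical.

```python
import numpy as np

def resultant_components(a_left, a_right, multiplier):
    """Mean resultant vector after multiplying each neuron's angle.

    Returns (horizontal, vertical) components of the mean unit vector.
    """
    theta = np.arctan2(a_right, a_left)   # angle of z = a_L + i * a_R
    scaled = multiplier * np.asarray(theta, dtype=float)
    return float(np.mean(np.cos(scaled))), float(np.mean(np.sin(scaled)))

# Idealized populations of coefficient pairs (a_L, a_R).
state_value = (np.array([1.0, -1.0]), np.array([1.0, -1.0]))   # a_L = a_R
policy = (np.array([1.0, -1.0]), np.array([-1.0, 1.0]))        # a_L = -a_R
action_value = (np.array([1.0, 0.0, -1.0, 0.0]),
                np.array([0.0, 1.0, 0.0, -1.0]))               # one coefficient only
```

With these idealized inputs, doubling the angle places state-value neurons on the positive vertical axis and policy neurons on the negative vertical axis, whereas quadrupling it places pure action-value neurons on the positive horizontal axis and both state-value and policy neurons on the negative horizontal axis.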
Throughout the paper, p=0.05 was used as the criterion for a significant statistical difference unless noted otherwise. Data are expressed as mean ± SEM unless noted otherwise. Raw data of this work is archived at Dryad (https://doi.org/10.5061/dryad.gtht76hj0).
Data availability
All data generated or analyzed during this study are included in the manuscript and supporting files. Raw data to reproduce this work is archived at Dryad https://doi.org/10.5061/dryad.gtht76hj0.

Dryad Digital Repository. Data from: Robust and distributed neural representation of action values. https://doi.org/10.5061/dryad.gtht76hj0
References

Reinforcement learning models and their neural correlates: an activation likelihood estimation meta-analysis. Cognitive, Affective, & Behavioral Neuroscience 15:435–459. https://doi.org/10.3758/s1341501503387

Understanding neural coding through the model-based analysis of decision making. Journal of Neuroscience 27:8178–8180. https://doi.org/10.1523/JNEUROSCI.159007.2007

Reinforcement learning: the good, the bad and the ugly. Current Opinion in Neurobiology 18:185–196. https://doi.org/10.1016/j.conb.2008.08.003

Parallel associative processing in the dorsal striatum: segregation of stimulus-response and cognitive control subregions. Neurobiology of Learning and Memory 96:95–120. https://doi.org/10.1016/j.nlm.2011.06.002

Validation of decision-making models and analysis of decision variables in the rat basal ganglia. Journal of Neuroscience 29:9861–9874. https://doi.org/10.1523/JNEUROSCI.615708.2009

Multiple representations and algorithms for reinforcement learning in the cortico-basal ganglia circuit. Current Opinion in Neurobiology 21:368–373. https://doi.org/10.1016/j.conb.2011.04.001

Role of striatum in updating values of chosen actions. Journal of Neuroscience 29:14701–14712. https://doi.org/10.1523/JNEUROSCI.272809.2009

Prefrontal and striatal activity related to values of objects and locations. Frontiers in Neuroscience 6:108. https://doi.org/10.3389/fnins.2012.00108

Neural basis of quasi-rational decision making. Current Opinion in Neurobiology 16:191–198. https://doi.org/10.1016/j.conb.2006.02.001

Neural basis of reinforcement learning and decision making. Annual Review of Neuroscience 35:287–308. https://doi.org/10.1146/annurevneuro062111150512

Hippocampal neural correlates for values of experienced events. Journal of Neuroscience 32:15053–15065. https://doi.org/10.1523/JNEUROSCI.280612.2012

Neural signals related to outcome evaluation are stronger in CA1 than CA3. Frontiers in Neural Circuits 11:40. https://doi.org/10.3389/fncir.2017.00040

Model-based analyses: promises, pitfalls, and example applications to the study of cognitive control. Quarterly Journal of Experimental Psychology 65:252–267. https://doi.org/10.1080/17470211003668272

Model-based fMRI and its application to reward learning and decision making. Annals of the New York Academy of Sciences 1104:35–53. https://doi.org/10.1196/annals.1390.022

A category-free neural population supports evolving demands during decision-making. Nature Neuroscience 17:1784–1792. https://doi.org/10.1038/nn.3865

Goal-directed and habitual control in the basal ganglia: implications for Parkinson's disease. Nature Reviews Neuroscience 11:760–772. https://doi.org/10.1038/nrn2915

General mechanisms for making decisions? Current Opinion in Neurobiology 19:75–83. https://doi.org/10.1016/j.conb.2009.02.005

Role of rodent secondary motor cortex in value-based action selection. Nature Neuroscience 14:1202–1208. https://doi.org/10.1038/nn.2881

Testing for nonlinearity in time series: the method of surrogate data. Physica D: Nonlinear Phenomena 58:77–94. https://doi.org/10.1016/01672789(92)90102S

Do rats have a prefrontal cortex? Behavioural Brain Research 146:3–17. https://doi.org/10.1016/j.bbr.2003.09.028

The role of the basal ganglia in habit formation. Nature Reviews Neuroscience 7:464–476. https://doi.org/10.1038/nrn1919
Decision letter

Timothy E Behrens, Senior and Reviewing Editor; University of Oxford, United Kingdom
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
Action values are important components in reinforcement learning. Single neurons in the brain have been reported to signal these values, but recent work has suggested that problems with these analyses bring these data into question. This paper performs rigorous analysis to show the action value signals are robust. In doing so it contributes to an important line of technical research that is finding safer ways to analyse neuronal data.
Decision letter after peer review:
Thank you for submitting your article "Further evidence for neural representation of action value" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Timothy Behrens as the Senior Editor and Reviewing Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
The manuscript is a response to recent work (Elber-Dorozko and Loewenstein 2018) showing that inferences about action value coding in neural populations can be distorted by two mechanisms: (1) serial correlation from trial to trial in both neural activity and action values, which causes statistical analyses that assume independence of each trial to overestimate significance; and (2) correlation between action values and other behavioural/decision variables, which can cause incorrect inference that neurons coding for other variables represent action values.
The present study uses simulations and reanalysis of neuronal recordings to address methodological and scientific questions raised by Elber-Dorozko and Loewenstein's work. Broadly, they do so convincingly (although not elegantly) with respect to serial correlations. The portion of the manuscript that deals with correlated variables is less convincing, but this is an issue of narrower significance.
The manuscript is of interest to eLife for the following reasons. The previous paper raised an important methodological concern that had been uniformly ignored in the analysis of neuronal activity across many fields. In doing so, however, it also cast doubts on previous results that had studied one particular computation performed by neuronal activity – the computation of action value. These were important studies revealing a fundamental computation underlying how the brain controls behaviour. The current manuscript acknowledges the methodological concern but allays the doubts over action-value computations. It is therefore of potential significant interest.
Essential revisions:
The methods used to investigate and account for serial autocorrelations are coarse and ad hoc, and little is presented to verify that they do what they say they do. This raises 2 issues. The first relates to how robust the findings will be in the face of future related criticisms. The second relates to setting methodological standards for how the field should deal with serial correlations in the future.
We will deal with the second issue first, as dealing with it properly may make the first problem redundant.
Strategy for accounting for autocorrelations:
The issue is that the authors propose 3 techniques for dealing with serial autocorrelations in the noise (simulating neurons with serial correlations, permuting "within blocks" only, and including a few trials back as co-regressors). None of them are elegant or general techniques, and in all cases the behaviour of these techniques in the null case is poorly characterised (see below for more on this point). This is particularly surprising because this is an old problem in the statistics literature, and there are off-the-shelf techniques for addressing the problem elegantly and rigorously. We acknowledge the requirement to adhere to the spirit of the EDL paper, but we think it would be extremely advantageous to move the argument forward using general well-validated techniques.
Options include
1. Estimating a whitening kernel for the residuals and refitting the model using this kernel (technically nontrivial). For an example from fMRI, see Woolrich et al. Neuroimage 2001.
2. Fitting autoregressive noise models using standard off-the-shelf software (e.g., gls in R; see review comment below).
3. Permutation tests after transformations that render the data exchangeable.
E.g., Fourier transform the data, permute only the phases, then reconstruct the data.
E.g., wavelet transform the data, permute the wavelet coefficients, then reconstruct the data.
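To make the first two options concrete, one simple way to whiten residuals under an AR(1) error model is an iterated Cochrane-Orcutt style fit. The sketch below is only an illustration of that general idea under an assumed AR(1) error structure; it is not the procedure used by the authors or requested by the reviewers, and the function name is hypothetical.

```python
import numpy as np

def ar1_whitened_ols(y, X, n_iter=10):
    """Estimate regression coefficients with AR(1)-whitened residuals.

    Alternates between estimating the lag-1 residual correlation (rho)
    and refitting OLS on quasi-differenced data.
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rho = 0.0
    for _ in range(n_iter):
        resid = y - X @ coef
        # Lag-1 regression of the residuals on themselves.
        rho = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
        # Quasi-differencing removes AR(1) structure from the errors.
        y_w = y[1:] - rho * y[:-1]
        X_w = X[1:] - rho * X[:-1]
        coef, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
    return coef, rho
```

If the residuals show structure beyond lag 1, a higher-order error model or one of the permutation strategies in option 3 would be the more appropriate choice.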
Here are the reviewer comments that led to this discussion, which contain other related comments that may be useful. A similar discussion was had at triage. We note that, whilst the reviewer suggests an autoregressive model, which would be fine, it would also be fine to use one of the other techniques above, which may be more appropriate if the residuals are not well described by a limited AR model such as AR(1). The permutation strategies described above are trivial to implement and effective.
"I think the paper could do a better job technically unpacking the issues with temporal correlations, which in my view weren't diagnosed as precisely as they could have been in the original article either. The senior author knows more econometrics than I do, but as I understand it, all of the estimation issues in OLS here are due to the assumption of uncorrelated errors. Autocorrelation in the dependent variables, and the explanatory variables, is perfectly OK (indeed, one would result from the other) so long as they cancel each other out when the model is fit, in which case the residuals would be uncorrelated. Changing the autocorrelation in either y or X affects this only indirectly, since both appear in the residual. I think the article's focus on the autocorrelation in the explanatory variables and spike rates – both in rhetoric and analysis and results – is a piece of the puzzle but tends to obscure this deeper point. It would be helpful to also focus on visualizing and decorrelating the residuals. For the same reason, the lack of regressor autocorrelation in the monkey experiment is less of a solution than it is made out to be, I think."
"I find the estimation strategy of Figure 3 and sporadically onward a bit frustrating and less convincing (or at least more roundabout) than it could be. I'm sympathetic to the overall conclusion, but the overall strategy comes off as piling up multiple fixes, even though they are shown not to work adequately on simulated data. To compensate for this, the simulations are used to define a new, inflated false positive rate that is, finally, in a follow-up test, compared to the obtained rate of nominal positives. Frankly: yuck. How about figuring out why the fixes don't work, and finding a test that does work? For the nonparametric fix (bootstrap) the issue is presumably within-session correlations, as discussed later; but for the lagged AR terms, I assume the problem is there aren't enough of them to handle longer-timescale correlations. But this is itself kind of a hack; a more orthodox parametric approach would be to use a nonlinear, generalized least squares (e.g., gls() in R) to estimate a full AR(1) model or whatever other error covariance form is supported by the actual data. (Note that even an AR(1) process predicts correlations at arbitrary lags so adding individual lag terms is not sufficient.)"
Characterisation of performance in the null case.
If the authors change their strategy as recommended above, this section of the review may be rendered redundant. However, given the current approach the review team did not think that the paper did a good job in presenting diagnostics that adequately evaluate the performance of their strategies.
Minimally, given how much work the random walk neuron model does, we think that the authors should try harder to evaluate the performance with a model that looks more like the data. The model was set up only to match the neuronal autocorrelations at lag 1 trial and likely has a very different autocorrelation structure from real neurons at lags greater than 1. The autocorrelation structure of the control 'random' neuron model should be matched to that of the neuronal data. This may need a generative model that is more expressive than AR(1). Without this, the authors are susceptible to future criticism that simply shows that the authors' techniques do not do well in the face of realistic data.
We also think that instead of simply reporting the number of false positives at the p<0.05 threshold, the authors should construct the P-P plot (Wikipedia), which plots observed false positives in empirical data against the nominal threshold. This will make it useful to future researchers who would like to use the same techniques with a different threshold.
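The requested P-P plot amounts to computing, for each nominal threshold, the fraction of null-case p-values at or below it. A minimal sketch follows, with a hypothetical function name and an assumed default grid of thresholds:

```python
import numpy as np

def pp_curve(p_values, thresholds=None):
    """Observed fraction of p-values at or below each nominal threshold."""
    p_values = np.asarray(p_values, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.01, 1.0, 100)   # assumed default grid
    thresholds = np.asarray(thresholds, dtype=float)
    observed = np.array([np.mean(p_values <= th) for th in thresholds])
    return thresholds, observed
```

For a well-calibrated test applied to null data, the observed fractions should lie close to the identity line; points above the line indicate an inflated false positive rate at that threshold.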
All three reviewers made the same point. I include all 3 here to encourage the authors that it is an important point that will likely be shared by many readers.
"Generation of random-walk neurons. How is it possible to create the same autocorrelation kernel as the one observed in the neural data (essentially flat – at least for the shown scale of 5 trials) through a random-walk process – for which the correlation should intrinsically decrease over time? The authors mentioned that they have matched autocorrelation at lag 1 only, which may be good enough as an approximation for what the authors intend to do with random-walk neurons, but it is not a tight match and the authors may want to mention this somewhere in the manuscript."
"The random walk neuron model does a lot of work as a control against which real neurons are compared. However, the model was set up only to match the neuronal autocorrelations at lag 1 trial and likely has a very different autocorrelation structure from real neurons at lags greater than 1. The autocorrelation structure of the control 'random' neuron model should be matched to that of the neuronal data. "
"Either way, everything comes down here to the simulated spike trains under the null model, and it would be good to have more argument that these are actually a good simulation for the data. Among other things, I wasn't clear if their timescale is individually fit per brain area or experiment or just roughly chosen; if multiple timescales of correlation are detectable in the actual data, rather than just rectified AR(1) as here; and again if the autocorrelative structure of the residuals is similar between data and simulation. "
Action values vs policy etc.
There was broad scepticism amongst the reviewers as to whether it was possible to dissociate policies from values, and whether it was really relevant to do so, particularly if policy is (confusingly) used to refer to a difference in Q-values. This is reflected in the comments below. Whilst we acknowledge the authors' ambitions to address the critiques raised in EDL, we encourage great care in the interpretation of this whole section. Again, related points were made by all 3 reviewers, highlighting that this is likely also to be an issue for many readers.
"I find the second half of the article, on alternative decision variables, a little bit of a red herring. One thing is that the relationship between a Q value and a policy (as the term is normally used in RL, and was used by Elber-Dorozko) is nonlinear. Calling the difference in Q values a "policy" is just not using the term accurately. On the other hand, my view is that this example shows that the whole critique is ill founded, and the only useful question is what is the (linear or nonlinear) relationship between decision variables and brain activity. Neural representations of values are likely to be nonlinear for reasons other than policy (e.g., there is plenty of work by Glimcher and others on gain control or divisive normalization) and may also be differential (e.g., activity which is related to the relative value, chosen minus unchosen, which is nevertheless in units of value and not normalized/softmaxed etc. into a policy). Telling the difference between divisive and subtractive normalization is not really viable, especially in the linear setting; and even so, the same (softmax) algebraic form could describe either policy or (gain controlled) value. There's just not a meaningful categorical distinction to be made. I suppose there might be some way of recasting this section to focus on the distinction between summation vs difference as being representative (in a linear framework) of state values vs. relativized, or normalized, or post-choice policy values. But I think it's giving too much away to frame this as actually distinct variables confounding one another; and also unfair to call a difference a policy."
"Correlation with sum(Q) and diff(Q). I don't understand the exact graphical description on Figure 8 and Figure 8. The authors label gray neurons as 'only Q', but many of them are probably not coding anything (non-significant, corresponding to black neurons in Figure 6 of the article by EDL). Also, I expected that the neurons coding selectively for one action value (QL or QR) should be found on Figure 8A for x > threshold and y ~ 0 and vice versa. However, it is clearly not the case given the labelling of neurons provided by the authors for this graph. Could the authors clarify this and explain the apparent discrepancy with the analyses performed by EDL (Figure 6 from their article)? I have a similar concern regarding Figure 8B: pure action-value neurons seem to be located only at the center of the graphs (for x ~ 0 and y ~ 0), which is where non-selective neurons should be found."
"Figures 7 – 9 attempt to dissociate action value coding from coding of policy (difference in action values) and state value (approximated as sum of action values). As these variables are linearly dependent, it is formally impossible to say whether a neuron represents one of them, or a linear combination of the others. Mixed linear selectivity is ubiquitous (e.g. Kobak et al. eLife 2016;5:e10989), so it is not that interesting to ask which of this degenerate set of variables is most 'purely' represented by each neuron. This said, having chosen a given non-degenerate pair of these variables to work with, it is interesting to know how representation of one variable correlates with representation of the other across the population, and how this varies across regions. This is shown nicely in figure 7A and the top two panels of 9A, but I felt the remaining panels of figures 7–9 did not add additional value."
When the authors assess the extent of chosen-value coding (Figure 5, 6B), they include the individual action values in their regression model, which is important as these variables are correlated. However, when they assess action value coding (Figures 4, 6A) they do not include chosen value in the models. I think the rationale is that the analyses are of different trial epochs: pre-choice for 4, 6A, and post-outcome for 5, 6B. However, chosen-value coding is certainly possible before the choice is executed and hence chosen value should be included in the model when assessing action value coding.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for submitting your article "Further evidence for neural representation of action value" for consideration by eLife. Your revised article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Timothy Behrens as the Senior Editor and Reviewing Editor. The reviewers have opted to remain anonymous.
I have written a summary of our opinion after the discussion directly below here. I have also left the reviews below for emphasis and detail, but please don't feel that you need to address all the points in the reviews. If you can address the central issues in the summary directly below here, we will be happy.
This revision has raised some complications in the reviewers' minds that have led to a lot of discussion. In brief, we are not happy with the 2-stage approach that does not lead to p-values for individual neurons that can be trusted.
Whilst we agree that this approach goes some way to rebutting the EDL finding, it is a narrow rebuttal which does not provide a good way forward for scientists faced with similar problems in the future. The combination of GLS modelling that does not accurately deal with the autocorrelations, with nonstandard application of circular permutations to demonstrate control performance is, in our view, dangerous, and not useful to the community. We are not keen on publishing such an approach in eLife.
If the GLS approach does not lead to good corrections for autocorrelations, then we think it incumbent upon you guys to find a nonparametric approach that does.
We are slightly bemused by your assertion that you cannot do the Fourier permutation test because the timepoints are not equally sampled. There are two reasons we are bemused. The first is that the same argument applies to the circular permutation that you do use. The second is that there are well-established methods for computing Fourier transforms for nonuniformly sampled data. However, we think that you don't even need to use them. We are happy for you to line the trials up in a matrix and do the Fourier transform across trials. We think the danger of introducing errors by this approximation is small.
It is also possible that a correct application of the circular method you propose would work (i.e. ignore GLS and just do the full circular permutation test). The three likely issues with this are: (1) the circular permutations might correlate with the original design matrix, leading to low sensitivity; (2) edge effects will mean that the circular-shifted test will have different autocorrelation properties than the original test, which can lead to false positives (and actually may be a problem in your control analysis in the paper); and (3) the small number of possible permutations will prevent accurate inference in the tail (and possibly prevent, e.g., corrections for multiple comparisons).
Note that although reviewer 2 below has an issue with the >10 trial threshold for the circular test, this was not thought a problem after discussion.
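The limit on the number of permutations in the circular method can be seen from a short sketch. It enumerates every circular shift of a trial sequence that moves the data by at least a minimum number of trials in either direction; the function name and the default minimum shift are assumptions made for illustration.

```python
import numpy as np

def circular_shifts(x, min_shift=10):
    """All circular shifts of x by at least min_shift trials in either direction."""
    x = np.asarray(x)
    n = len(x)
    return [np.roll(x, s) for s in range(min_shift, n - min_shift + 1)]
```

For a session of 100 trials and a minimum shift of 10, only 81 distinct surrogates are available, so the resolution of a p-value estimated from them is coarse compared with the 1000 samples that can be drawn with phase randomization.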
In order to assess these effects, one of the reviewers prepared a Jupyter Notebook comparing the different approaches. It looks as though I can only attach a single file to the letter and the notebook comes in both .html and .py forms. I have attached the .html. If you email me when you get this, I will forward the (anonymised) .py.
You can see that, whilst both Fourier and Circular approaches suffer slightly from edge effects, the circular test is much more severely affected. There are readily available techniques in the literature for removing edge effects. For example you could window the data before the regression (eg using a Tukey window). We think it would likely be profitable for you to investigate these approaches whichever permutation method you choose. You can see, however, that even without dealing with these details, the Fourier method does a pretty good job.
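The tapering idea mentioned above can be sketched as follows. This is a minimal illustration and not the reviewers' notebook: the function names `tukey_window` and `taper`, the default `alpha=0.5`, and the choice to subtract the mean before tapering are all assumptions made here.

```python
import numpy as np

def tukey_window(n, alpha=0.5):
    """Tukey (tapered cosine) window of length n.

    alpha is the fraction of the window covered by the two cosine tapers;
    alpha=1 reduces to a Hann window. Assumes n >= 2 and 0 < alpha <= 1.
    """
    if n < 2 or not 0 < alpha <= 1:
        raise ValueError("need n >= 2 and 0 < alpha <= 1")
    t = np.arange(n) / (n - 1)          # positions from 0 to 1
    w = np.ones(n)                      # flat middle section
    rise = t < alpha / 2                # left cosine taper
    fall = t > 1 - alpha / 2            # right cosine taper
    w[rise] = 0.5 * (1 - np.cos(2 * np.pi * t[rise] / alpha))
    w[fall] = 0.5 * (1 - np.cos(2 * np.pi * (1 - t[fall]) / alpha))
    return w

def taper(series, alpha=0.5):
    """Remove the mean (an assumption of this sketch) and taper both ends."""
    x = np.asarray(series, dtype=float)
    return (x - x.mean()) * tukey_window(x.size, alpha)
```

Tapering pulls both ends of the trial series toward zero, so a circular shift or a Fourier surrogate no longer introduces a jump where the end of the session meets its start.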
You can see in the reviews below, that the reviewers also remain concerned about the distinction in the second half of the paper between action values, chosen values, policies etc, which interacts heavily with questions about the linearity of neuronal responses. We remain concerned that there is a danger this will confuse more than clarify the issue for the community. However, we realise that there is a similar section in the EDL paper, and that you guys need to address this. This may be TB's fault (along with the original reviewers) for not flagging this in the original paper. We would appreciate a clear statement in this section in the paper that states that this section is a narrow rebuttal of the EDL paper, and discusses the difficulties in differentiating policy from action values etc. (see reviews below).
Reviewer #1:
The authors have substantially modified their manuscript, including the general statistical approach for assessing the neural encoding of value signals. In doing so, they have addressed several of the concerns I had regarding the original manuscript. The GLS regression approach appears to support the main claims made in the original manuscript. Although I found at times the description of the results (including their illustration) to be less clear than before, I do not have important concerns that remain to be addressed.
Reviewer #2:
The authors have taken some welcome steps to address reviewer comments. They now use a regression model which explicitly models autocorrelation in the residuals (GLS model), they compare models with different order autoregressive structure using BIC, and show PP plots for real and circularly permuted data. These steps do improve the manuscript, but unfortunately the statistical approach still has real problems.
The main request of the reviews was: 1. Use a principled method which would fix the problem of P value inflation due to correlations. 2. Show that it works (i.e. gives correct P values) using PP plots on simulated or otherwise generated 'null' data. 3. Use P values derived from this method directly as the statistics reported.
Doing this properly solves the issue once and for all, and both answers the specific question (do neurons encode action values) and provides a method the field can use to avoid problems in future.
What the authors have actually done is: 1. Use a principled method (GLS model with autocorrelated residuals) to obtain P values. 2. Provide diagnostics based on null data generated via a poorly implemented circular permutation approach (see below), which nonetheless suggests that the GLS model has not fixed the problem (P values for null data are still inflated). 3. Test whether the fraction of significant neurons in the original data is significantly different from the average fraction of significant neurons across the permuted data.
This is really not good, as it does not provide a way of calculating accurate P values for individual neurons, nor does it take into account variability across permutations when asking if the fraction of significant neurons in the real data is significantly higher than that expected by chance.
Circular permutation of neuronal data relative to behavioral data by a random number of trials is a good way of generating data under the null hypothesis that there is no relationship between activity and behavior, while preserving the autocorrelations in both. However, the authors did their permutations "with the constraint that the minimum difference of trial number between the original and shifted data is > 10". This makes the permuted data meaningless as it no longer comes from any welldefined null distribution. Additionally, to accurately estimate the distribution of the measure of interest under the null hypothesis, a large number of permutations is needed (thousands), while the current work used only 10. As the authors only used the mean across permutations rather than the distribution when calculating their statistics, the number of permutations is perhaps less problematic, but this is a very nonstandard way to use permutations.
In my understanding, the correct permutation test to generate accurate P values in a regression analysis of neuronal activity is as follows: 1. Calculate the measure of interest on the real data. This could be a β weight for a particular neuron, or a summary measure across the population (e.g. average β squared or coefficient of partial determination across neurons). Summary measures across the population will have much more statistical power once you have more than a few neurons. 2. Generate an ensemble of e.g. 5000 permuted datasets. To make each permuted dataset, circularly permute the neuronal data relative to the behavioral data by a random number of trials between 1 and the number of trials in the session. If multiple sessions contribute to your measure of interest, draw the circular shift separately for each session for each permuted dataset. 3. Calculate the measure of interest for each permuted dataset. The distribution of the measure across permuted datasets is an estimate of its distribution under the null hypothesis that there is no relationship between behavior and neural activity. Calculate a P value by comparing the value of the measure for the real data with its distribution across the permutations. For a two-tailed test the P value is min(X, 1 − X), where X is the fraction of permutations for which the measure on true data is greater than that on permuted data. By construction, the P values generated by this method for circularly permuted data are uniformly distributed between 0 and 1.
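The three steps above can be sketched for a single neuron in a single session as follows. This is a minimal illustration under stated assumptions, not the reviewer's or the authors' code: the function names are hypothetical, the measure of interest is taken to be a plain Pearson correlation rather than a regression β, and the two-tailed value doubles min(X, 1 − X) so that it spans 0 to 1, a common convention that goes beyond the wording of the text.

```python
import random

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def circular_shift(values, shift):
    """Rotate a list to the right by `shift` positions, wrapping around."""
    shift %= len(values)
    return values[-shift:] + values[:-shift] if shift else list(values)

def circular_permutation_test(neural, behavior, n_perm=5000, seed=0):
    """Return (observed statistic, two-tailed permutation P value)."""
    neural, behavior = list(neural), list(behavior)
    rng = random.Random(seed)
    n = len(neural)
    # Step 1: measure of interest on the real data.
    observed = pearson_r(neural, behavior)
    # Step 2: shift the neural data by a random number of trials in 1..n.
    # Step 3: recompute the measure on every shifted dataset.
    null = []
    for _ in range(n_perm):
        shift = rng.randint(1, n)  # a shift of n reproduces the original order
        null.append(pearson_r(circular_shift(neural, shift), behavior))
    # X: fraction of permutations on which the real data give the larger value.
    x = sum(observed > s for s in null) / n_perm
    return observed, 2 * min(x, 1 - x)
```

With several sessions, the same loop would draw one independent shift per session for each permuted dataset and compute a population summary in place of `pearson_r`.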
Reviewer #3:
This article is improved but many of the core problems we identified in the original are unchanged, and overall I still feel it has promise but is not yet in publishable shape.
On the two sets of results separately:
Serial autocorrelation: The rhetoric is much more precise and improved, and I appreciate the move toward GLS and toward deemphasizing the problematic 'null' simulations (which, though not improved are probably sufficient for the specific use to which they are now put).
However, I just think the methods presented here still haven't convincingly solved the problem at hand. The bottom line (from what I can see) is they still don't have a test for significance that actually produces correct p values, as shown by all the p/p plots. In particular, GLS with the chosen AR structure also produces inflated false positives, when run on the shifted null control data. (That said, it is hard to rule out that the problem might be due at least in part to the shifted null being overconservative for the same reason sessionlevel permutations are, i.e. the block structure of conditions.)
The article thus continues to resort to the twostage procedure of testing significance, criticized previously: using a demonstrably flawed method, then testing whether the proportion of significant neurons exceeds a measure of the proportion expected due to inflation. This is arguably valid, for the narrow job of rejecting the claim that previous results, in the aggregate, are due to p inflation, but it simply isn't a viable procedure going forward for conducting inference neuron by neuron. Just for instance, the very next section of the paper (on policy and value coding) contains extensive discussion counting and comparing the number of nominally significant neurons of different types, but none of these numbers can be taken seriously given what immediately preceded.
My view is that for the work to be useful to the field, it needs to present a method that gives a demonstrably trustworthy p value at the single neuron level. If GLS doesn't work, and they must resort to augmenting it with a further nonparametric stage, then they may as well just instead go ahead and define a proper nonparametric test, e.g. a full permutation test based on the circular shift. In this case the GLS is moot and correlation would work fine as the test statistic. The main question to my eye here is the validity / independence / sensitivity of the circular shifts as the unit of permutation. Given very long autocorrelation, many such shifts will be nonindependent from each other, and furthermore I don't completely understand why the same issue argued to plague the across-session permutation control (i.e. nonindependence due to structure in the trial blocking) doesn't also apply here. Thus, I think to go this way would ideally require more work to validate the control, but I fear this means they are back with the problem of designing convincing null simulations.
Value vs policy coding: I continue to think this chunk of the paper is mostly built on confusing conceptual foundations, admittedly mostly inherited from the earlier paper. First, I still think the whole motivating framing that activity related to action value isn't bona fide action value activity if it is negatively modulated by the alternative value ("policy") is just plain wrong, for many reasons I rehearsed before: related for instance to normalization and efficient coding. Second, this linear approach neglects chosen value altogether (but is confounded by it), which other results in the paper suggest are a key factor.
Finally, apart from the fact that the p values themselves are dubious (see above) the exercise of counting neurons that are significant or nonsignificant on different sets of correlated tests does not clearly lend itself to any formal conclusion or even informal interpretation. What results would be expected under different hypotheses? What are the hypotheses? The idea of "mixed selectivity" is neither defined nor tested, and I don't see that this would be a viable way to do it: ultimately, interpreting the Venn diagram of positive and negative results on correlated tests flirts with a combination of the fallacies of frequentist reasoning including affirming nulls and double dipping. The analysis of angles is much less problematic in this respect (since it represents a single test with a welldefined null hypothesis rather than a family of nonindependent tests), though again it is a bit less than one might hope for since it is at the population, rather than neuron level. I'd still vote to axe this section entirely, or narrow it way down to the angle thing if necessary.
https://doi.org/10.7554/eLife.53045.sa1
Author response
Essential revisions:
The methods used to investigate and account for serial autocorrelations are coarse, adhoc, and little is presented to verify that they do what they say they do. This raises 2 issues. The first relates to how robust the findings will be in the face of future related criticisms. The second relates to setting methodological standards for how the field should deal with serial correlations in the future.
Two major concerns raised by the reviewers were (1) the lack of a systematic approach to evaluating the statistical significance of value signals in neural activity in the presence of residual autocorrelation, and (2) the conflation of signals related to the contrast of action values and policy. To address these concerns, we have performed an almost completely new series of analyses on the neural data and rewritten much of the manuscript. In particular, based on the suggestions from the reviewers and advice from an expert on applied statistics (a new coauthor of the manuscript), we have adopted the generalized least squares (GLS) regression model to evaluate the nature of residual autocorrelation in neural activity, and dramatically simplified the permutation tests used to evaluate the statistical significance of value signals in neural activity. We also added the results obtained from the monkey dorsolateral prefrontal cortex. These and other major changes in the revised manuscript are summarized below.
We will deal with the second issue first, as dealing with it properly may make the first problem redundant.
Strategy for accounting for autocorrelations:
The issue is that the authors propose 3 techniques for dealing with serial autocorrelations in the noise (simulating neurons with serial correlations, permuting "within blocks" only, and including a few trials back as coregressors). None of them are elegant or general techniques, and in all cases the behaviour of these techniques in the null case is poorly characterised (see below for more on this point). This is particularly surprising because this is an old problem in the statistics literature, and there are offtheshelf techniques for addressing the problem elegantly and rigorously. We acknowledge the requirement to adhere to the spirit of the EDL paper, but we think it would be extremely advantageous to move the argument forward using general wellvalidated techniques.
Options include
1. Estimating a whitening kernel for the residuals and refitting the model using this kernel (technically nontrivial). For an example from fMRI, see Woolrich et al. Neuroimage 2001.
2. Fitting autoregressive noise models using standard off-the-shelf software (e.g. gls in R; see review comment below).
3. Permutation tests after transformations that render the data exchangeable.
E.g. Fourier transform the data and permute only the phases, then reconstruct the data.
E.g. wavelet transform the data and permute wavelet coefficients, then reconstruct the data.
Here are the reviewer comments that led to this discussion, which contain other related comments that may be useful. A similar discussion was had at triage. We note that, whilst the reviewer suggests an autoregressive model, which would be fine, it would also be fine to use one of the other techniques above, which may be more appropriate if the residuals are not well described by a limited AR model such as AR(1). The permutation strategies described above are trivial to implement and effective.
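The Fourier option in the list above can be sketched as follows. This is a hedged illustration rather than the reviewers' implementation: the function name `phase_randomize` is hypothetical, trials are treated as equally spaced samples, and fresh phases are drawn uniformly at random (the usual phase-randomization variant) instead of literally permuting the existing phases.

```python
import numpy as np

def phase_randomize(signal, rng):
    """Return a surrogate series with the same amplitude spectrum as `signal`.

    The surrogate keeps the power spectrum, and hence the linear
    autocorrelation, of the original while destroying its alignment with
    any other trial-by-trial variable.
    """
    x = np.asarray(signal, dtype=float)
    n = x.size
    spectrum = np.fft.rfft(x)
    # Draw a new phase for every frequency component.
    phases = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    randomized = np.abs(spectrum) * np.exp(1j * phases)
    # The zero-frequency term carries the mean and must stay untouched.
    randomized[0] = spectrum[0]
    # For even lengths the Nyquist term must remain real, so keep it as is.
    if n % 2 == 0:
        randomized[-1] = spectrum[-1]
    return np.fft.irfft(randomized, n=n)
```

A permutation test would regress each of many such surrogates on the behavioral regressors and compare the statistic from the real data with the resulting null distribution.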
"I think the paper could do a better job technically unpacking the issues with temporal correlations, which in my view weren't diagnosed as precisely as they could have been in the original article either. The senior author knows more econometrics than I do, but as I understand it, all of the estimation issues in OLS here are due to the assumption of uncorrelated errors. Autocorrelation in the dependent variables, and the explanatory variables, is perfectly OK (indeed, one would result from the other) so long as they cancel each other out when the model is fit, in which case the residuals would be uncorrelated. Changing the autocorrelation in either y or X affects this only indirectly, since both appear in the residual. I think the article's focus on the autocorrelation in the explanatory variables and spike rates – both in rhetoric and analysis and results – is a piece of the puzzle but tends to obscure this deeper point. It would be helpful to also focus on visualizing and decorrelating the residuals. For the same reason, the lack of regressor autocorrelation in the monkey experiment is less of a solution than it is made out to be, I think."
"I find the estimation strategy of Figure 3 and sporadically onward a bit frustrating and less convincing (or at least more roundabout) than it could be. I'm sympathetic to the overall conclusion, but the overall strategy comes off as piling up multiple fixes, even though they are shown not to work adequately on simulated data. To compensate for this, the simulations are used to define a new, inflated false positive rate that is, finally, in a followup test, compared to the obtained rate of nominal positives. Frankly: yuck. How about figuring out why the fixes don't work, and finding a test that does work? For the nonparametric fix (bootstrap) the issue is presumably within session correlations, as discussed later; but for the lagged AR terms, I assume the problem is there aren't enough of them to handle longertimescale correlations. But this is itself kind of a hack; a more orthodox parametric approach would be to use a nonlinear, generalized least squares (eg gls() in R) to estimate a full AR(1) model or whatever other error covariance form is supported by the actual data. (Note that even an AR(1) process predicts correlations at arbitrary lags so adding individual lag terms is not sufficient.)"
We agree with the editor and reviewers that we should have dealt with the statistical issues resulting from serial correlation in neural data more systematically. We are also grateful to the reviewers for suggesting 3 specific approaches we could take, including the use of a whitening kernel, the generalized least squares method, and an appropriate transformation (e.g., wavelet). As the reviewers pointed out, some of these methods have been well established in the field of fMRI and are broadly applied. Unfortunately, it is difficult to adapt these methods to the analysis of spike data in our current study, because unlike the fMRI data (which is sampled periodically) spike counts in successive trials during our experiment are separated by highly variable intertrial intervals. Also, the number of trials in our experiments is much smaller than the number of data points analyzed in typical fMRI experiments. Accordingly, we have decided to combine the generalized least squares (GLS) regression analysis models with a circular permutation test to better assess the nature of residual autocorrelation and to develop the most robust method to circumvent the difficulty in evaluating the statistical significance of value-related signals in neural activity. We agree with the sentiment expressed by the reviewers that this would be most beneficial for future studies facing the same statistical issues. For comparison with the previous study (i.e., Elber-Dorozko and Loewenstein, 2018, EDL), however, we also kept the results obtained with the OLS method and the permutation test proposed by EDL but moved them to Figure 3 supplements (Figure 3-supplement 1 and Figure 3-supplement 2). We also show residual autocorrelation in Figure 2 of the revised manuscript as suggested by the reviewers.
Characterisation of performance in the null case.
If the authors change their strategy as recommended above, this section of the review may be rendered redundant. However, given the current approach the review team did not think that the paper did a good job in presenting diagnostics that adequately evaluate the performance of their strategies.
Minimally, given how much work the random walk neuron model does, we think that the authors should try harder to evaluate the performance with a model that looks more like the data. The model was set up only to match the neuronal autocorrelations at lag 1 trial and likely has a very different autocorrelation structure from real neurons at lags greater than 1. The autocorrelation structure of the control 'random' neuron model should be matched to that of the neuronal data. This may need a generative model that is more expressive than AR(1). Without this, the authors are susceptible to future criticism that simply shows that the authors' techniques do not do well in the face of realistic data.
We also think that instead of simply reporting the number of false positives at p<0.05 threshold, the authors should construct the pp plot (Wikipedia), which plots observed false positives in empirical data against the nominal threshold. This will make it useful to future researchers who would like to use the same techniques with different threshold.
All three reviewers made the same point. I include all 3 here to encourage the authors that it is an important point that will likely be shared by many readers.
"Generation of random-walk neurons. How is it possible to create the same autocorrelation kernel as the one observed in the neural data (essentially flat – at least for the shown scale of 5 trials) through a random-walk process – for which the correlation should intrinsically decrease over time? The authors mentioned that they have matched autocorrelation at lag 1 only, which may be good enough as an approximation for what the authors intend to do with random-walk neurons, but it is not a tight match and the authors may want to mention this somewhere in the manuscript."
"The random walk neuron model does a lot of work as a control against which real neurons are compared. However, the model was set up only to match the neuronal autocorrelations at lag 1 trial and likely has a very different autocorrelation structure from real neurons at lags greater than 1. The autocorrelation structure of the control 'random' neuron model should be matched to that of the neuronal data."
"Either way, everything comes down here to the simulated spike trains under the null model, and it would be good to have more argument that these are actually a good simulation for the data. Among other things, I wasn't clear if their timescale is individually fit per brain area or experiment or just roughly chosen; if multiple timescales of correlation are detectable in the actual data, rather than just rectified AR(1) as here; and again if the autocorrelative structure of the residuals is similar between data and simulation. "
As we mentioned above, we now use the GLS regression to better address the problems resulting from autocorrelation in the residual from the regression model. In particular, we have used the BIC to determine the order of residual autocorrelation in the GLS method. In addition, we have performed extensive analysis to develop high-order random-walk neuron models to reproduce the shape of the autocorrelation function. Unfortunately, despite extensive simulation and data analyses, we failed to develop robust models that could accurately reproduce the main features of autocorrelation in the neural data. There are at least two reasons for this difficulty. First, we came to hypothesize that autocorrelation in neural data occurs on many different time scales, which is a topic we are currently pursuing in another manuscript. Therefore, to reproduce the observed autocorrelation, we were frequently required to include a large number of autoregressive terms, which made our analysis of value signals unnecessarily complicated. Second, there is a potential gap between the “nonlinear” random-walk neuron model and the framework of linear models, because the first-order autocorrelation in the latent variable (rate parameter) can potentially be disguised as higher-order autocorrelation when only the outputs (counts) of such models are considered. We think these are important problems for future investigations and we are planning to pursue them, but they are clearly beyond the scope of our current study. Therefore, in the revised manuscript, we now use circular permutation and trial-shifted data as a way to generate the null data and bypass the potential issues mentioned above. We show the results obtained with AR(1) random-walk neurons only in Figure 3-supplement 2 for comparison with the results shown in the EDL paper. We also show PP plots in Figures 3-5 as suggested by the reviewers.
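The PP plot requested by the reviewers compares each nominal threshold with the fraction of null-data tests that fall at or below it. A minimal sketch of the underlying computation follows; the function name `pp_curve` and the example P values are hypothetical and are not taken from the manuscript.

```python
def pp_curve(p_values, thresholds):
    """Pair each nominal threshold with the observed fraction of P values <= it.

    For a well-calibrated test applied to null data the two numbers in
    each pair should be close; observed fractions above the threshold
    indicate inflated false positives.
    """
    n = len(p_values)
    return [(t, sum(p <= t for p in p_values) / n) for t in thresholds]

# Made-up P values standing in for tests run on null data.
example_null_p = [0.01, 0.02, 0.20, 0.50, 0.90]
curve = pp_curve(example_null_p, [0.05, 0.50])
```

Plotting the observed fraction against the nominal threshold over a fine grid of thresholds, together with the identity line, gives the PP plot.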
Action values vs policy etc.
There was broad scepticism amongst the reviewers as to whether it was possible to dissociate policies from values, and whether it was really relevant to do so, particularly if policy is (confusingly) used to refer to a difference in Q-values. This is reflected in the comments below. Whilst we acknowledge the authors' ambitions to address the critiques raised in EDL, we encourage great care in the interpretation of this whole section. Again, related points were made by all 3 reviewers, highlighting that this is likely also to be an issue for many readers.
"I find the second half of the article, on alternative decision variables, a little bit of a red herring. One thing is that the relationship between a Q value and a policy (as the term is normally used in RL, and was used by Elber-Dorozko) is nonlinear. Calling the difference in Q values a "policy" is just not using the term accurately. On the other hand, my view is that this example shows that the whole critique is ill-founded, and the only useful question is what is the (linear or nonlinear) relationship between decision variables and brain activity. Neural representations of values are likely to be nonlinear for reasons other than policy (e.g., there is plenty of work by Glimcher and others on gain control or divisive normalization) and may also be differential (e.g., activity which is related to the relative value, chosen minus unchosen, which is nevertheless in units of value and not normalized/softmaxed etc. into a policy). Telling the difference between divisive and subtractive normalization is not really viable, especially in the linear setting; and even so, the same (softmax) algebraic form could describe either policy or (gain-controlled) value. There's just not a meaningful categorical distinction to be made. I suppose there might be some way of recasting this section to focus on the distinction between summation vs difference as being representative (in a linear framework) of state values vs. relativized, or normalized, or post-choice policy values. But I think it's giving too much away to frame this as actually distinct variables confounding one another; and also unfair to call a difference a policy."
As the reviewers pointed out, the relationship between differential action values and policy is nonlinear. Nevertheless, in the literature (including EDL), neurons with significant effects of value difference (ΔQ) have been frequently referred to as policy-coding neurons. We have clarified the text throughout the revised manuscript to avoid unnecessary confusion on this issue.
"Correlation with sum(Q) and diff(Q). I don't understand the exact graphical description on Figure 8 and Figure 8. The authors label gray neurons as 'only Q', but many of them are probably not coding anything (nonsignificant, corresponding to black neurons in Figure 6 of the article by EDL). Also, I expected that the neurons coding selectively for one action value (QL or QR) should be found on Figure 8A for x > threshold and y ~ 0 and vice versa. However, it is clearly not the case given the labelling of neurons provided by the authors for this graph. Could the authors clarify this and explain the apparent discrepancy with the analyses performed by EDL (Figure 6 from their article)? I have a similar concern regarding Figure 8B: pure action-value neurons seem to be located only at the center of the graphs (for x ~ 0 and y ~ 0), which is where nonselective neurons should be found."
"Figures 7-9 attempt to dissociate action value coding from coding of policy (difference in action values) and state value (approximated as sum of action values). As these variables are linearly dependent, it is formally impossible to say whether a neuron represents one of them, or a linear combination of the others. Mixed linear selectivity is ubiquitous (e.g. Kobak et al. eLife 2016;5:e10989), so it is not that interesting to ask which of this degenerate set of variables is most 'purely' represented by each neuron. This said, having chosen a given nondegenerate pair of these variables to work with, it is interesting to know how representation of one variable correlates with representation of the other across the population, and how this varies across regions. This is shown nicely in Figure 7A and the top two panels of 9A, but I felt the remaining panels of Figures 7-9 did not add additional value."
When the authors assess the extent of chosen value coding (Figures 5, 6B), they include the individual action values in their regression model, which is important as these variables are correlated. However, when they assess action value coding (Figures 4, 6A) they do not include chosen value in the models. I think the rationale is that the analyses concern different trial epochs: pre-choice for 4, 6A; post-outcome for 5, 6B. However, chosen value coding is certainly possible before the choice is executed and hence chosen value should be included in the model when assessing action value coding.
We apologize for the confusion caused by some of the figures in the original manuscript. Some of this was due to the fact that gray circles in the original manuscript represented those neurons coding only Q (Figure 7A) or DV (Figure 9) and empty squares represented those that do not encode any of these value terms, but these two symbols could not be clearly distinguished in the figures of the original manuscript. To avoid this confusion, we replaced gray circles in Figure 6 (Figure 7A and 9 in the original manuscript) with red circles. We also agree that some of the results shown in these original figures are not essential for the main conclusion of our manuscript, so the results other than the scatter plots in original Figure 7A and Figure 9 were either removed or moved to a supplementary figure (Figure 6-supplement 1). Instead, we added a new analysis result (examining distributions of action-value coefficients after doubling or quadrupling their angles to examine how relative strengths of different value signals vary across different brain regions; Figure 6 in the revised manuscript). With respect to the possibility of chosen-value signals before action selection, we obtained essentially the same conclusions with and without including chosen value in the regression model assessing action-value signals, which is consistent with our previous finding that chosen-value signals are generally weak before action selection. As suggested, we show the results obtained with the model including chosen value (model 3, Equation 5) in the revised manuscript.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
I have written a summary of our opinion after the discussion directly below here. I have also left the reviews below for emphasis and detail, but please don't feel that you need to address all the points in the reviews. If you can address the central issues in the summary directly below here, we will be happy.
This revision has raised some complications in the reviewers' minds that have led to a lot of discussion. In brief, we are not happy with the 2-stage approach that does not lead to p-values for individual neurons that can be trusted.
Whilst we agree that this approach goes some way to rebutting the EDL finding, it is a narrow rebuttal which does not provide a good way forward for scientists faced with similar problems in the future. The combination of GLS modelling that does not accurately deal with the autocorrelations, with nonstandard application of circular permutations to demonstrate control performance is, in our view, dangerous, and not useful to the community. We are not keen on publishing such an approach in eLife.
If the GLS approach does not lead to good corrections for autocorrelations, then we think it incumbent upon you guys to find a nonparametric approach that does.
We fully agree that a better method to evaluate the statistical significance of value signals at the level of individual neurons was needed in order to firmly establish the extent to which neurons in different anatomical areas encode value signals throughout the brain. Thus, we very much appreciate the reviewer’s effort to provide us with a set of sample codes and illustrate how more appropriate analyses should be carried out. We have followed these suggestions closely in the revised manuscript. Specifically, we evaluated the false-positive rate of several different methods by using actual behavioral and simulated neural data while maintaining the same distribution of firing rates and AR(1) coefficients as in the actual neural data. In the revised manuscript, we accordingly report the results from the four procedures that yield a chance-level false-positive rate (~5%) for action-value neurons (Figure 2 of the revised manuscript) when tested using the simulated null data. Two of these methods are based on resampling of behavioral data, and include the ‘session permutation’ proposed in the previous eLife paper by Elber-Dorozko and Loewenstein (2018) and the ‘pseudosession’ method proposed in a recent bioRxiv paper (Harris, 2020). The other two are based on Fourier phase randomization of neural data suggested by the reviewer, namely the conventional Fourier phase randomization (FPR) method and a modified version of the amplitude-adjusted Fourier transform method (Theiler 1992 Physica D). Using these 4 new methods, we still found significant action-value and chosen-value signals in multiple areas of the rat brain (Figure 3 of the revised manuscript). Therefore, we believe that our findings robustly demonstrate the presence of action-value signals in the brain.
You can see in the reviews below, that the reviewers also remain concerned about the distinction in the second half of the paper between action values, chosen values, policies etc, which interacts heavily with questions about the linearity of neuronal responses. We remain concerned that there is a danger this will confuse more than clarify the issue for the community. However, we realise that there is a similar section in the EDL paper, and that you guys need to address this. This may be TB's fault (along with the original reviewers) for not flagging this in the original paper. We would appreciate a clear statement in this section in the paper that states that this section is a narrow rebuttal of the EDL paper, and discusses the difficulties in differentiating policy from action values etc. (see reviews below).
We also agree with the reviewer that approaches based on a simple linear regression model cannot adequately classify individual neurons as coding different types of value signals, such as action values, policy, and chosen values. Therefore, as suggested by the reviewers and editor, we shortened this section significantly, but kept it as a rebuttal to the previous eLife paper by Elber-Dorozko and Loewenstein. In particular, we added the following text in the revised manuscript: “In reinforcement learning theory, action values are monotonically related to the probability of choosing the corresponding actions, referred to as policy, making it hard to distinguish the neural activity related to either of these quantities. In addition, the activity of individual neurons is likely to encode multiple variables simultaneously (Rigotti et al., 2013). Despite these difficulties, it has been argued that neural signals related to action value might actually represent policy exclusively (Elber-Dorozko and Loewenstein, 2018). To address this issue quantitatively, using the difference in action values (∆Q) and their sum (ΣQ) as proxies for policy and state value, respectively, we tested how signals for action values, policy and state value are related in a population of neurons in different brain structures.”
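As an illustration of the two proxies named in the quoted text, the sketch below computes ∆Q and ΣQ from a pair of action values. It is not taken from the manuscript: the left-minus-right sign convention, the softmax choice rule, the inverse temperature `beta`, and the numerical values are all assumptions made for the example.

```python
import numpy as np


def value_proxies(q_left, q_right):
    """Return (delta_q, sigma_q), used here as proxies for policy and state value."""
    q_left = np.asarray(q_left, dtype=float)
    q_right = np.asarray(q_right, dtype=float)
    delta_q = q_left - q_right  # assumed sign convention: left minus right
    sigma_q = q_left + q_right
    return delta_q, sigma_q


def softmax_p_left(delta_q, beta=1.0):
    """Probability of choosing left under an assumed two-action softmax rule."""
    return 1.0 / (1.0 + np.exp(-beta * np.asarray(delta_q, dtype=float)))


# Hypothetical action values on three trials.
q_left = [0.2, 0.5, 0.8]
q_right = [0.6, 0.5, 0.1]
delta_q, sigma_q = value_proxies(q_left, q_right)
p_left = softmax_p_left(delta_q, beta=3.0)
```

Under this choice rule the probability of choosing left increases monotonically with ∆Q, which is why activity correlated with one action value is difficult to separate from activity correlated with policy.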
https://doi.org/10.7554/eLife.53045.sa2
Article and author information
Author details
Funding
Institute for Basic Science (IBS-R002-A1)
 Min Whan Jung
National Institute of Mental Health (DA 029330)
 Daeyeol Lee
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by the Research Center Program of Institute for Basic Science (IBS-R002-A1; MWJ) and the National Institute of Health grants (DA 029330; DL).
Senior and Reviewing Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Version history
 Received: October 28, 2019
 Accepted: April 19, 2021
 Accepted Manuscript published: April 20, 2021 (version 1)
 Version of Record published: May 7, 2021 (version 2)
Copyright
© 2021, Shin et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.