Subpopulations of neurons in lOFC encode previous and current rewards at time of choice

Abstract
Editor's evaluation
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Studies of neural dynamics in lateral orbitofrontal cortex (lOFC) have shown that subsets of neurons that encode distinct aspects of behavior, such as value, may project to common downstream targets. However, it is unclear whether reward history, which may subserve lOFC’s well-documented role in learning, is represented by functional subpopulations in lOFC. Previously, we analyzed neural recordings from rats performing a value-based decision-making task, and we documented trial-by-trial learning that required lOFC (Constantinople et al., 2019). Here, we characterize functional subpopulations of lOFC neurons during behavior, including their encoding of task variables. We found five distinct clusters of lOFC neurons, either based on clustering of their trial-averaged peristimulus time histograms (PSTHs), or a feature space defined by their average conditional firing rates aligned to different task variables. We observed weak encoding of reward attributes, but stronger encoding of reward history, the animal’s left or right choice, and reward receipt across all clusters. Only one cluster, however, encoded the animal’s reward history at the time shortly preceding the choice, suggesting a possible role in integrating previous and current trial outcomes at the time of choice. This cluster also exhibits qualitatively similar responses to identified corticostriatal projection neurons in a recent study (Hirokawa et al., 2019), and suggests a possible role for subpopulations of lOFC neurons in mediating trial-by-trial learning.

Editor's evaluation

This study used advanced statistical methods to classify the responses of neurons in the lateral orbito-frontal cortex (lOFC) of rats performing economic choices. The authors report that lOFC neurons can be classified into a small number of distinct types based on their characteristic temporal dynamics and preferences for task variables (e.g., choice, success, reward history). The results will be of interest to neuroscientists studying prefrontal cortex, as they demonstrate order amidst what would appear as disorganized diversity, and suggest that some neuronal types play specific functional roles; for instance, in integrating previous trial outcomes into a current choice.

https://doi.org/10.7554/eLife.70129.sa0

Introduction

Previous experience can profoundly influence subsequent behavior and choices. In trial-based tasks, the effects of previous choices and outcomes on subsequent ones are referred to as ‘sequential effects,’ and while they are advantageous in dynamic environments, they produce suboptimal biases when outcomes on each trial are independent. The orbitofrontal cortex (OFC) has been implicated in updating behavior based on previous experience, particularly when task contingencies are partially observable (Rushworth et al., 2011; Stalnaker et al., 2014; Wallis, 2007; Wilson et al., 2014). However, it is unclear whether behavioral flexibility in OFC is mediated by dedicated subpopulations of neurons exhibiting distinct encoding and/or connectivity. We previously trained rats on a value-based decision-making task, in which they chose between explicitly cued, guaranteed and probabilistic rewards on each trial (Constantinople et al., 2019b; Constantinople et al., 2019a). Despite the fact that outcomes were independent on each trial, we observed several distinct sequential effects that contributed to behavioral variability. Optogenetic perturbations of the lateral orbitofrontal cortex (lOFC) eliminated one particular sequential effect, an increased willingness to take risks following risky wins, but spared other types of trial-by-trial learning, such as spatial ‘win-stay/lose-switch’ biases. We interpreted this data as evidence that (1) different sequential effects may be mediated by distinct neural circuits and (2) lOFC promotes learning of abstract biases that reflect the task structure (here, biases for the risky option), but not spatial ones (Constantinople et al., 2019a).

Electrophysiological recordings from lOFC during this task revealed encoding of reward history and reward outcomes on each trial at the population level, which could in principle support sequential effects (Constantinople et al., 2019a). Recent studies of rodent OFC have suggested that despite the apparent heterogeneity of neural responses in prefrontal cortex, neurons can be grouped into distinct clusters that exhibit similar task-related responses and, in some cases, project to a common downstream target (Hirokawa et al., 2019; Namboodiri et al., 2019). In light of these results, we hypothesized that reward history might be encoded by a distinct cluster of neurons in lOFC. This would suggest that the lOFC-dependent sequential effect we observed (an increased willingness to take risks following risky wins) may derive from a subpopulation of neurons encoding reward history, and may potentially project to a common downstream target.

Here, we analyzed an electrophysiological dataset from lOFC during a task in which independent and variable sensory cues conveyed dissociable reward attributes on each trial (reward probability and amount; Constantinople et al., 2019b; Constantinople et al., 2019a). Our previous analysis investigated single unit responses for sensitivity to task parameters, and here we found that clustering of lOFC neurons based on either their trial-averaged peristimulus time histograms (PSTHs), or a feature space defined by their average conditional firing rates aligned to different task variables (Hirokawa et al., 2019), both revealed five clusters of neurons with different response profiles. We exploited the temporal variability of task events to fit a generalized linear model (GLM) to lOFC firing rates, generating a rich description of the encoding of various task variables over time in individual neurons. All the clusters exhibited weak encoding of reward attributes, and stronger encoding of reward history, the animal’s left or right choice, and reward receipt. Encoding of these variables was broadly distributed across clusters. Only one cluster, however, encoded the animal’s reward history at the time shortly preceding the choice. This distinct encoding was observable by three independent metrics (coefficient of partial determination, mutual information, and discriminability or $d^{'}$ ) and two separate clustering methods. Intriguingly, this cluster exhibited qualitatively similar responses to identified corticostriatal cells in a previous study (Hirokawa et al., 2019), and also exhibited the strongest encoding of reward outcomes, suggesting that these neurons are well-situated to integrate previous and current reward experience. We hypothesize that this subpopulation of neurons, which were identifiable based on their temporal response profiles alone, may mediate sequential learning effects by integrating previous and current trial outcomes at the time of choice.

Results

Rats’ behavior on this task has been previously described (Constantinople et al., 2019b; Constantinople et al., 2019a). Briefly, rats initiated a trial by poking their nose in a central nose port (Figure 1A–C). They were then presented with a series of pseudo-randomly timed light flashes, the number of which conveyed information about reward probability on each trial. Simultaneously, they were presented with randomly timed auditory clicks, the rate of which conveyed the volume of water reward (6–48 µl) baited on each side. After a cue period of $\sim 2.5 - 3 s$ , rats reported their choice by poking in one of the two side ports. Reward volume and probability of reward ( $p$ ) were randomly chosen on each trial. On each trial, the left or right port was randomly designated as risky ( $p §lt; 1$ ) or safe ( $p = 1$ ). Well-trained rats tended to choose the side offering the greater subjective value, and trial history effects contributed to behavioral variability (Constantinople et al., 2019b; Constantinople et al., 2019a). We analyzed 659 well-isolated single-units in the lOFC with minimum firing rates of 1 Hz, obtained from tetrode recordings (Constantinople et al., 2019a).

Figure 1 with 1 supplement see all

Download asset Open asset

Single units in lOFC belong to clusters with distinct temporal response profiles

To characterize the temporal dynamics of the neural responses during this task, we performed k-means clustering on trial-averaged PSTHs (‘PSTH clustering’), aligned to trial initiation. We used the gap statistic (Tibshirani et al., 2001), which quantifies the improvement in cluster quality with additional clusters relative to a null distribution (Materials and methods), to choose a principled number of clusters, and identified five distinct clusters of responses (Figure 1D). Each cluster has a stereotyped period of elevated activity during the task: at trial start, during the cue period, and at or near reward delivery, with the largest cluster (cluster 3) having activity just before the rat entered the reward port (Figure 1—figure supplement 1). Cluster two is the only cluster that exhibited persistent activity during the cue period. These data indicate that despite the well-documented heterogeneity of neural responses in prefrontal cortex, responses in lOFC belonged to one of a relatively small, distinct set of temporal response profiles aligned to different task events.

A recent study in rat lOFC similarly found discrete clusters of neural responses by clustering the conditional firing rates of neurons for different task-related variables (‘conditional clustering’) (Hirokawa et al., 2019). We sought to compare results from clustering on PSTHs and conditional firing rates. Therefore, we generated conditional response profiles for each neuron (Figure 2B; Materials and methods), and performed clustering on these conditional responses via the same procedure. This also revealed five distinct clusters that exhibited qualitatively similar temporal response profiles as the clusters that were based on the average PSTHs (Figure 2C–E, Figure 2—figure supplement 3). Moreover, clusters 2 and 3 were comprised of highly similar groups of neurons across both procedures, where similarity was quantified by the consistency of being labeled in a given cluster across clustering approaches (Figure 2E, 66% overlap of cluster 2 neurons and 70% of cluster three neurons). This indicates that neurons in these two clusters are robustly identified by either their temporal response profiles or their encoding of task variables (Figure 2E).

Figure 2 with 6 supplements see all

Download asset Open asset

Conditional feature space clustering.

(A) Feature-conditioned PSTHs of an example neuron. The average z-scored firing rate in each gray time window comprises an element in the feature space. (B) Feature space for conditional clustering for the sample neuron from A. The average firing rate for each condition is concatenated to yield a 19-dimensional feature space. Here, features are grouped into three qualitative types (for presentation purposes only): responses for choice and trial outcome (left); responses to expected value of left and right offers (middle); and responses to received reward volume (right). (C) Cluster-averaged PSTHs, aligned to trial initiation, using this conditional feature space. Error bars denote s.e.m. (D) Z-scored PSTHs for all neurons, sorted by time to peak within each cluster. (E) Consistency of cluster labeling, calculated as the conditional probability $P (C_{conditional} | C_{PSTH})$ of belonging in any ‘conditional’ cluster ( $C_{conditional}$ ), given that a unit belongs to a certain PSTH-defined cluster ( $C_{PSTH}$ ).

We evaluated the quality of clustering using two additional methods. First, we visualized clustering using t-SNE projections of data into two dimensions, and further quantified cluster separation using the Mahalanobis distance among clusters, which in this context can be interpreted as an average z-score for data in a cluster from a given cluster mean (Figure 2—figure supplement 2). We found that lOFC responses obtained by both procedures were qualitatively clustered in feature space, and demonstrated Mahalahobis distances with smaller intracluster distances than across-cluster distances. Second, we used the PAIRS statistic to more directly test whether lOFC responses belong to distinct clusters (Raposo et al., 2014). PAIRS examines the angle between neural responses in feature space, and obtains an average angle for the data points closest to it. Intuitively, a small average angle among neighboring points implies that the data is tightly packed together into clusters, whereas larger angles suggest broadly distributed, unclustered data. We found that our neural responses were more tightly packed together than would be observed for an unclustered, Gaussian reference distribution (Figure 2—figure supplement 4).

Finally, we also compared the estimated the number of clusters in the data to those obtained using other clustering metrics, in particular the silhouette score and adjusted Rand Index (Figure 2—figure supplement 5). The silhouette score quantifies cluster quality with two components: it provides high scores when points within a cluster are close in feature space, and penalizes small inter-cluster distances that arise when clusters are nearby in feature space. The adjusted Rand Index (ARI) does not assess cluster quality through distances among datapoints in feature space, but instead is a measure of reproducibility of a clustering result. Specifically, it quantifies the probability that two labels from separate clustering results would arise by chance. We generated labels for ARI by subsampling our data (90% sampling without replacement) and calculated ARI on the subsampled dataset and the full dataset. ARI results demonstrated a broad range of possible numbers of clusters with reproducible cluster labels (Figure 2—figure supplement 5A). In contrast, the silhouette score gave inconclusive results on our lOFC data (Figure 2—figure supplement 5B). The dimensionality and covariance of our lOFC responses lies in a regime in which clusters are tightly packed in feature space, a regime in which silhouette score is known to underestimate the number of clusters in other contexts (Garcia-Dias et al., 2020; Garcia-Dias et al., 2018). To assess this more directly, we compared the results from different clustering metrics on artificial data with known statistics (deriving from five clusters). We found that in the data-like regime, the silhouette score substantially underestimated the optimal number of clusters (two clusters, Figure 2—figure supplement 6, see Discussion and Materials and methods). The gap statistic underestimated cluster number on synthetic data, but to a lesser degree. Taken together, these analyses suggest that the gap statistic is a conservative but principled metric for determining the number of clusters in the statistical regime of the lOFC data.

Neural responses in lOFC are well-captured by a generalized linear model that includes attributes of offered rewards, choice, and reward history

We next sought to characterize the encoding properties of neurons in each cluster. To that end, we modeled each neuron’s trial-by-trial spiking activity with a generalized linear model with Poisson noise (Figure 3). This approach has been previously used in sensory areas to model the effects of external stimuli upon neural firing (Pillow et al., 2005; Pillow et al., 2008), and more recently has been extended to higher order areas of cortex during decision-making tasks (Park et al., 2014). In this model, the time-dependent firing rate ( $λ_{t}$ ) of a neuron is the exponentiated sum of a set of linear terms.

λ_{t} = e x p [\sum_{s = 1}^{S} (X_{t - τ : t}^{(s)} * k_{s} (τ))],

Figure 3 with 2 supplements see all

Download asset Open asset

GLM analysis.

(A) Model schematic: the timing of external stimuli conveying reward attributes, as well as timing of choices and outcomes on each trial are model inputs. Nonlinear response kernels convolved with each input generate time dependent responses that are summed and exponentiated to give a mean firing rate, $λ_{t}$ , in each time bin. Spikes are generated from a Poisson process with mean firing rate $λ_{t}$ . (B) Task schematic illustrating the key choice and outcome inputs to the model. (C) Kernel fits for a sample neuron. Kernels are grouped by the aspects of the task that they model. Error bars denote estimated kernel standard deviation (Materials and methods). (D) Timing of each kernel’s contribution in an example trial. The kernels in bold from panel C are the kernels that are active in this trial. The resulting model-predicted firing rate is shown in the bottom row. (E) Representative PSTHs to held-out testing data from five different neurons (black) and model prediction (red). (F). Variance explained for each neuron, with sample neurons from E denoted by correspondingly colored crosses. The distribution of $R^{2}$ values is presented along edge of the panel.

where the variables $X_{t - τ : t}^{(s)}$ are the stimuli and behavioral events against which neural activity is regressed; $k_{s} (τ)$ denote the response kernels that are convolved with the task variables to model time-dependent responses to each variable (Figure 3C). The probability that a given number of spikes would occur in a discrete time bin of $Δ t = 50 m s$ is given by a homogeneous Poisson process with mean $λ_{t} Δ t$ .

The task variables and example kernels for our model are shown for a sample neuron in Figure 3C. Variables were binarized such that 1 (0) denoted the occurrence (absence) of an event. The task variables included reward (win or loss on the current or previous trial), choice (left or right), and the timing of cues indicating reward attributes for the left and right offers (reward volume conveyed by auditory clicks and reward probability conveyed by visual flashes). We also included the average reward rate prior to starting each trial and the location of the trial within the session (not shown). Given the large number of model parameters, we used a smooth, log-raised cosine temporal basis and L2 regularization to prevent overfitting (Materials and methods). Based on model comparison, we found that including a spike history term as in other GLM approaches did not improve our model, presumably due to the fact that we are modeling longer timescale responses.

The chosen model parameters were selected by model comparison against several alternative models using cross-validation. Model comparison favored a simpler binary win/loss representation of rewarded outcomes over richer representations of reward volume on the current or previous trial (Figure 3—figure supplement 1). The model reproduced the trial-averaged PSTHs of individual neurons (Figure 3E). To quantify model performance, we calculated the proportion of variance explained ( $R^{2}$ ) for held-out testing data (Figure 3F). The model captured a high percentage of variance for most of the neurons in our dataset. A small fraction of neurons exhibited negative $R^{2}$ values (69 units, Figure 3—figure supplement 2), indicating that the model produced a worse fit of the test data than the data average. Our liberal inclusion criteria did not require neurons to exhibit task-modulation of their firing rates, so these neurons were likely not task-modulated, and were excluded from subsequent analyses.

Clustering reveals distributed encoding of most task variables across subpopulations of lOFC neurons

We next sought to characterize the extent to which neurons that belonged to different clusters might be ‘functionally distinct,’ and encode different task-related variables (Hirokawa et al., 2019; Namboodiri et al., 2019). To address this question, we computed two complementary but independent metrics, both based on the GLM fit: the coefficient of partial determination and mutual information. The coefficient of partial determination (CPD) quantifies the contribution of a single covariate (e.g. left/right choice) to a neuron’s response by comparing the goodness-of-fit of the encoding model with and without that covariate. In other words, the CPD quantifies the amount of variance that is explained by kernels representing different aspects of the model. We computed the CPD in a sliding time window throughout the trial, and compared CPD values to a shuffle test, in which trial indices were randomly shuffled relative to the design matrix before generating model predictions. CPD values that were within the 95% confidence intervals of the shuffled distribution were set to 0 before averaging CPD values over neurons in a cluster. The average CPD values for different task variables is shown for clusters based on each clustering procedure in Figure 4A and C. Note that CPD plots are aligned to different events, depending on the covariate.

Figure 4 with 1 supplement see all

Download asset Open asset

Coefficient of partial determination (CPD) and mutual information (MI) reveal broad encoding of all task parameters in each cluster.

(A) CPD for choice and reward outcome encoding: Reward history panels quantify encoding of outcome on the previous trial (previous win/loss/opt out), choice panels convey encoding of choices on the current trial (left/right), and reward panels quantify encoding of outcomes on the current trial (win/loss). (B) Mutual information for same task parameters. (C) CPD of flashes (left panels) and clicks (right panels) that encode reward probability and reward volume, respectively. CPD and MI values are averaged over neurons in each cluster; error bars show s.e.m. Results for PSTH clustering are in the left columns, and results for conditional clustering are in the right columns.

We also computed the mutual information (MI) between neural spike trains and different model covariates (Materials and methods). Our approach relates the statistics of trial-level events to firing rates, which allows us to assess the information content for each stimulus throughout the entire time course of a trial. Such an approach does not easily generalize to the statistics of within-trial events such as the information represented in stimulus clicks and flashes, so we restricted the MI analysis to the other covariates: reward history, choice, and reward outcome. The average MI for these variables is shown for clusters based on the trial-averaged PSTHs in Figure 4B.

In general, regardless of the metric (CPD or MI) or clustering procedure (PSTH clustering or conditional clustering), task variables appear to be broadly encoded across neural subpopulations, with similar temporal dynamics and average CPD/MI values across clusters. All clusters encoded cues conveying reward volume and probability during the cue period (Figure 4C), although it is worth noting that the magnitude of CPD for clicks and flashes was an order of magnitude lower than for the other task variables. Therefore, encoding of cues representing reward attributes was substantially weaker than encoding of reward history, choice, and outcome. It may be surprising that neurons with strikingly different PSTHs appear to encode task variables with similar time courses. However, PSTHs marginalize out all conditional information, so a PSTH carries no information about encoding of task or behavioral variables, per se. Additionally, to assess whether different clusters may have a different composition across neuron types, we classified single units based on their waveforms into regular spiking units with broad action potential (AP) and after-hyperpolarization (AFP) widths, or fast-spiking units with narrow AP and AHP widths. We found the distribution of these two unit types to be similar across clusters (Figure 4—figure supplement 1).

Reward history was most strongly encoded at the time of trial initiation and decayed over the course of the trial, consistent with previous analyses (Constantinople et al., 2019a). Neurons belonging to cluster 2, which strongly overlap in both clustering procedures (Figure 2E), seem to exhibit slightly more pronounced encoding of reward history during the trial, compared to the other clusters, although these neurons were also persistently active during this time period. Encoding of choice and reward outcomes were phasic, peaking as the animal made his choice and received outcome feedback, respectively, and were broadly distributed across clusters (Figure 4A–B).

Reward history re-emerges before choice in a distinct subset of lOFC neurons

The CPD and MI analysis revealed one aspect of encoding that is unique to cluster 3. Both clustering methods identified largely overlapping populations of neurons in this cluster (Figure 2E), indicating that these neurons exhibited similar temporal response profiles as well as encoding. Like neurons in the other clusters, neurons in cluster 3 encoded reward history at trial initiation, and that encoding decreased through the cue period. However, for neurons in this cluster, encoding of reward history re-emerged at the time preceding the animal’s choice on the current trial (Figure 5A–B, green lines). This ‘bump’ of reward history encoding late in the trial was unique to cluster 3 regardless of the clustering method, and was observable in both CPD and MI (Figure 4A–B). This result was further corroborated by computing the average discriminability index ( $d^{'}$ ) for reward history, which is a model-agnostic metric that quantifies the difference in mean firing rate in each condition, accounting for response variance over trials. Cluster 3 was unique from the other clusters for having a subset of neurons with high $d^{'}$ values for reward history at the time preceding the animal’s choice (Figure 5C). We additionally found that encoding of previous wins and losses during this time period only extended to the previous trial, and did not encode the outcome of additional past trials (Figure 5—figure supplement 1). Clustering of PSTHs aligned to later events in the trial showed more detailed structure of this late-in-trial encoding of reward history, with some neurons encoding the previous trial just before the choice, and others at the time of choice (Figure 5—figure supplement 2C). We note that clustering on PSTHs aligned to later events produced similar results, in terms of the number of clusters and the within-cluster responses (Figure 5—figure supplement 2B).

Figure 5 with 2 supplements see all

Download asset Open asset

Encoding of reward history emerges late in the trial for cluster 3.

(A) CPD for reward history, aligned to leaving the center poke after the cue period. Note the peak in encoding that is isolated to cluster 3 (black arrow). (B) CPD similarly aligned to choice, demonstrates that this encoding occurs before reward delivery on the current trial. (C) Sensitivity, $d^{'}$ , across time and neurons for reward history has activity isolated to cluster 3 (magenta circle). $d^{'}$ is sorted within the PSTH-based clusters by time-to-peak.

Notably, cluster 3 also exhibited the most prominent encoding of reward outcome compared to the other clusters (Figure 4A–B, green, bottom row). This suggests that this subset of neurons may be specialized for representing or even integrating information about reward outcomes on previous and current trials. We wondered if this might reflect adaptive value coding, in which value representations in OFC have been observed to dynamically reflect the statistics of offered rewards (Kobayashi et al., 2010; Webb et al., 2014; Yamada et al., 2018; Padoa-Schioppa, 2009; Rustichini et al., 2017; Conen and Padoa-Schioppa, 2019). Adaptive value coding, which is thought to reflect a divisive normalization or range adaptation mechanism, allows the circuit to efficiently represent values in diverse contexts in which their magnitudes may differ substantially. As such, it provides key computational advantages, such as efficient coding, or the maximization of mutual information between neural signaling and the statistics of the environment (Webb et al., 2014; Yamada et al., 2018; Padoa-Schioppa, 2009; Rustichini et al., 2017; Simoncelli and Olshausen, 2001; Carandini and Heeger, 2011; Tymula and Glimcher, 2020; Louie and Glimcher, 2012).

According to divisive normalization models of subjective value, the value of an option or outcome is divided by a recency-weighted average of previous rewards (Tymula and Glimcher, 2020; Louie and Glimcher, 2012; Webb et al., 2014). Therefore, if neurons in OFC exhibit adaptive value coding, we would predict that they would exhibit stronger reward responses following unrewarded trials, and weaker responses following rewarded trials (Figure 6C). Put another way, regressing the firing rate against current and previous rewards should reveal coefficients with opposite signs (Kennerley et al., 2011). Neurons with regression coefficients for current and previous rewards with the same sign would be modulated by current and previous rewards, but not in a way that is consistent with the adaptive value coding hypothesis (Figure 6D).

Figure 6 with 2 supplements see all

Download asset Open asset

Adaptive value coding analysis using linear regression of firing rates for the 1s period following reward delivery against current and previous reward outcomes.

(A) Regression coefficients for the current reward outcome, $c_{win/loss}$ (parameterized as win = 1, and loss=-1), and previous trial outcome, $c_{rewhist}$ . 180 neurons had significant coefficients for both regressors ( $p §lt; 0.05$ , t-test), and 81 neurons had coefficients with opposite signs, consistent with adaptive value coding (blue dots). The remaining neurons have differential responses due to reward history, but inconsistent with adaptive value coding (red crosses). (B) Probability that a model with significant regressors for both current and past reward outcome would come from a given cluster. Shaded regions denote all models from panel A, and blue bars show the probability for adaptive neurons only. Error bars are the 95% confidence interval of the mean for a binomial distribution with observed counts from each cluster. (C) Example cell demonstrating adaptive value coding. Shaded gray region denotes time window used to compute mean firing rate for the regression. (D) Sample cell demonstrating significant modulation due to reward history, but with a relationship inconsistent with adaptive value coding.

To test this hypothesis, we regressed the firing rates of neurons in a 1s window after reward receipt against reward win/loss outcomes on the current trial, as well as the win/loss outcome from the previous trial (Figure 6, Materials and methods). We found that 180 neurons (27% of population) exhibited significant coefficients for both current reward and reward history regressors, and 45% of these units demonstrated adaptive value coding indicated by the opposite signs of regression coefficients for current and previous reward outcomes (81 units, Figure 6A, blue). To determine if these adaptive neurons preferentially resided in a particular cluster, we calculated the cluster-specific probability of a neuron demonstrating either significant coefficients for rewarded volume and reward history, or having coefficients with opposite signs, consistent with adaptive value coding (Figure 6B). We found that no cluster preferentially contained adaptive units with higher probability (comparison of 95% binomial confidence intervals). We also investigated whether adaptive value coding was present for rewarded volume representations, and similarly found that neurons did not preferentially reside in any one cluster (30 adaptive units out of 79 significant units, Figure 6—figure supplement 2). Notably, activity during the reward epoch did not appear to encode a reward prediction error (Figure 6—figure supplement 1), defined as the difference between the expected value and the outcome of the chosen option, as has been previously reported in rat and primate OFC (McDannald et al., 2011; Takahashi et al., 2009; Takahashi et al., 2013; Kennerley et al., 2011; Miller et al., 2018).

Neurons in Cluster 3 have similar responses to previously reported striatum projection neurons

Studies of OFC indicate that long-range projections to different downstream circuits exhibit distinct encoding and/or make distinct contributions to behavior (Groman et al., 2019; Hirokawa et al., 2019; Namboodiri et al., 2019). For instance, previous work from Hirokawa et al., 2019 used optogenetic tagging to record from OFC neurons that project to the ventral striatum, and found that these cells exhibited responses that seemed to correspond to a distinct cluster with stereotyped responses. This cluster encoded the trial outcome just after reward delivery, with larger responses following unrewarded trials, and also encoded the negative integrated value of the chosen option during the inter-trial interval, until the start of the next trial (Figure 7A). We examined the average PSTHs of each cluster aligned to trial outcome and the start of the next trial, to see if any cluster resembled the corticostriatal responses described in Hirokawa et al., 2019. Specifically, we plotted the cluster-averaged firing rates for trials classified as a loss, a low-reward trial (6 or 12 µl), or a high-reward trial (24 or 48 µl) (Figure 7B). We found that several clusters prominently encoded the reward outcome after reward (clusters 1, 3, 5); however, only cluster 3 encoded the negative (i.e., inverted) value of the reward during the inter-trial interval until the start of the next trial. Therefore, one of the clusters that we identified in our data exhibits qualitatively similar responses to the corticostriatal projection neurons identified by clustering in Hirokawa et al., 2019.

Figure 7

Download asset Open asset

Rewarded volume encoding by cluster.

(A) Schematic of the encoding properties of a cluster of putative corticostriatal projection neurons from Hirokawa et al., 2019. Neurons from this cluster encode the outcome of a trial just after choice, as well as negative (i.e. inverted) integrated value during the intertrial interval, until the start of the next trial. (B) Cluster-averaged response to rewarded volume magnitude. Each sub-panel shows the average response to either a trial loss (blue), a low reward (6 or 12 µl, green), or a high reward (24 or 48 µl, orange).

Discussion

We analyzed neural responses in lOFC from rats performing a value-based decision-making task, in which they chose between left and right offers with explicitly cued reward probabilities and volumes. Despite the apparent response heterogeneity in our dataset, two independent clustering methods revealed that neurons belonged to one of a small number of distinct clusters. We clustered based on the marginalized PSTHs over all trials, and found that subpopulations of neurons exhibited characteristic temporal response profiles. To our knowledge, using temporal responses that do not explicitly contain conditional information is a novel approach for identifying distinct neural response profiles in prefrontal cortex. We also clustered based on a conditional feature space for each neuron, consistent with previous work (Namboodiri et al., 2019; Hirokawa et al., 2019). The feature space for each neuron corresponded to its average firing rate on different trial types, in select time windows that most closely corresponded to the differentiating covariate on each trial (e.g. wins vs. losses at the time of reward feedback). Notably, these independent clustering methods identified robustly identifiable groups of neurons for two of the clusters, indicating that neurons in these clusters can be identified by either their temporal response profiles or their task variable encoding, as defined by the feature space.

Previous studies that clustered using a conditional feature space identified a larger number of clusters than we did here (Namboodiri et al., 2019; Hirokawa et al., 2019). This could reflect the different metrics used to select the number of clusters (the gap statistic in this study; compared to PCA and silhouette scores [Namboodiri et al., 2019], or adjusted Rand index [Hirokawa et al., 2019]). Another difference is the time window that was used to generate the conditional feature space. In Hirokawa et al., 2019, for instance, the authors used the same time window for each feature, which was after the rat made a choice but before it received feedback about the outcome of that choice. There was no such epoch in our behavioral paradigm, precluding a more direct comparison. However, a principled choice of clusters using the gap statistic still provided a useful tool for investigating the encoding of task variables in lOFC.

Rat lOFC weakly encodes reward attributes

Our task design allowed us to isolate neural responses to sensory cues that conveyed information about distinct reward attributes, because these cues were presented independently and variably in time. However, our GLM revealed weak encoding of these reward attributes – reward probability and volume – across all clusters, regardless of the clustering method. This was observable by examination of the relative magnitude of the flash and click kernels in individual neurons, and also by the CPD metric. The average flash and click CPD values were an order of magnitude smaller than for the other covariates, indicating that flashes and clicks did not contribute substantially to neural firing. Therefore, while behavioral analyses have shown that rats used the flashes and clicks to guide their choices (Constantinople et al., 2019b), these cues were not strongly represented in lOFC. This weak encoding of reward attributes, whose combination would specify the subjective value of each offer (Constantinople et al., 2019b), is potentially consistent with a recent study of rat lOFC during a multi-step decision-making task that enabled dissociation of choice and outcome values (Miller et al., 2017; Miller et al., 2018). Recordings in that study revealed weak encoding of choice values, but strong encoding of outcome values, and optogenetic perturbations suggested a critical role for lOFC in guiding learning but not choice (Miller et al., 2018). Other studies in rat lOFC have reported strong encoding of reward and outcome values specifically following action selection, but not preceding choice (Sul et al., 2010; Miller et al., 2018; Steiner and Redish, 2014; Steiner and Redish, 2012; Mainen and Kepecs, 2009; Hirokawa et al., 2019). However, recordings from rat lOFC in sensory preconditioning paradigms have revealed cue-evoked responses that may reflect inferred value (Sadacca et al., 2018; Takahashi et al., 2013). Studies in non-human primates have also reported strong encoding of offer values in OFC after presentation of a stimulus, and before choice (Padoa-Schioppa and Assad, 2006; Kennerley et al., 2011; Lin et al., 2020; Rich and Wallis, 2016). A recent study in mouse OFC used olfactory cues to convey reward attributes (juice identity and volume) and reported encoding of offer values before choice (Kuwabara et al., 2020). It is unclear what factors may account for the pronounced encoding of offer value before choice in Kuwabara et al., 2020, but minimal encoding of offer value before choice in our data. One important difference is that the cues that conveyed reward attributes in the present study needed to be integrated over time. The integration process itself likely does not occur in OFC (Lin et al., 2020). It is possible that OFC would only represent the offer value once the animal had accumulated enough sensory evidence to infer it. If that inference occurred at a different time on each trial, depending on the strength of the sensory evidence and stochasticity of the accumulation process, then representations of offer value would not be observable in the trial-averaged neural response.

Adaptive coding in rat lOFC

Previous studies in primate medial (Yamada et al., 2018) and central-lateral (Kennerley et al., 2011; Tremblay and Schultz, 1999; Padoa-Schioppa, 2009; Rustichini et al., 2017) OFC have reported subsets of neurons that adjust the gain of their firing rates to reflect the range of offered rewards, a phenomenon referred to as adaptive value coding. This type of coding is efficient because it would allow OFC to accurately encode values in diverse contexts that may vary substantially in reward statistics (Webb et al., 2014), analogous to divisive normalization in sensory systems (Carandini and Heeger, 2011; Simoncelli and Olshausen, 2001).

According to divisive normalization models of subjective value, the value of an option is divided by a recency-weighted average of previous rewards (Louie and Glimcher, 2012; Tymula and Glimcher, 2020; Webb et al., 2014). Therefore, we would predict that neurons implementing adaptive value coding would exhibit stronger responses following unrewarded trials, and weaker responses following rewarded trials. Consistent with this hypothesis, we identified a subset of neurons that had significant coefficients for rewards on current and previous trials, with opposite signs, and while this fraction was somewhat modest (∼15%), it was comparable to the proportion of adaptive value coding neurons observed in central-lateral primate OFC (Kennerley et al., 2011). Notably, we did not find any evidence of encoding of a reward prediction error during this epoch, or the difference between the expected value of the chosen lottery and the outcome. Therefore, the differential encoding of reward outcomes depending on reward history reflected a discrepancy between reward outcomes and recent experience, not a discrepancy between reward outcomes and expectations (as in a reward prediction error). These were dissociable in our task, due to the sensory cues that explicitly indicated expected value on each trial. We emphasize that adaptive value coding is likely not a specialization of lOFC, but probably occurs broadly in brain areas that represent subjective value.

Mixed selectivity in OFC

The complexity and diversity of responses found in the prefrontal cortex, including the OFC, has led to questions about whether or not neurons with mixed selectivity or specialized responses represent behavioral and cognitive variables (Rigotti et al., 2013; Hirokawa et al., 2019; Raposo et al., 2014; Mante et al., 2013; Namboodiri et al., 2019). We tested if OFC neural responses belonged to clusters in our task using two separate methods: the gap statistic, and the PAIRS statistic. Both methods confirm that there is statistically significant clustering in our data. We emphasize, however, that mixed selectivity is fundamentally a property of individual neurons. We did not directly investigate whether individual lOFC neurons in our dataset exhibited mixed selectivity, but instead focused on encoding at the level of clusters. The broad encoding of task variables in each cluster suggests that lOFC neurons probably exhibit mixed selectivity, but a formal assessment was beyond the scope of our study.

Representations of reward history in OFC

In studies that require animals to learn and update the value of actions and outcomes from experience (i.e. in accordance with reinforcement learning), values are often manipulated or changed over the course of the experiment to assess behavioral flexibility, value-updating, and goal-directed behavior. In rodents and primates, lesion and perturbation studies have shown that the OFC is critical for inferring value in these contexts, suggesting an important role in learning and dynamically updating value estimates for model-based reinforcement learning (Gallagher et al., 1999; Izquierdo et al., 2004; Pickens et al., 2005; Noonan et al., 2010; West et al., 2011; Gremel and Costa, 2013; Miller et al., 2017; Miller et al., 2018). Neural recordings in these dynamic paradigms have revealed activity in OFC that reflects reward history and outcome values, which could subserve evaluative processing and learning (Nogueira et al., 2017; Sul et al., 2010; Miller et al., 2018).

Other studies, including this one, have used sensory cues (e.g. static images, odors) to convey information about reward attributes. Once learned, the mapping between these cues and reward attributes is fixed, and the subject must choose between options with explicitly cued values. Neurons in the central-lateral primate OFC and rat lOFC have been shown to represent the values associated with these sensory cues in their firing rates, as well as reward outcomes and outcome values (Thorpe et al., 1983; Schoenbaum et al., 1998; Tremblay and Schultz, 1999; Wallis and Miller, 2003; Padoa-Schioppa and Assad, 2006; Sul et al., 2010; Padoa-Schioppa, 2011; Levy and Glimcher, 2012; Rudebeck and Murray, 2014; Sadacca et al., 2018; Hirokawa et al., 2019; Lin et al., 2020). However, the extent to which OFC is causally required for these tasks is a point of contention, and may differ across species (Ballesta et al., 2020; Gardner et al., 2020; Kuwabara et al., 2020). Notably, even when reward contingencies are fixed over trials, animals often show sequential learning effects (Lak et al., 2020; Constantinople et al., 2019b), and reward history representations in OFC have been reported in tasks with stable reward contingencies (Constantinople et al., 2019a; Padoa-Schioppa, 2013; Kennerley et al., 2011).

In this study, we have described dynamic trial-by-trial changes in firing rates that reflected reward history just preceding the choice. These reward history representations might influence ongoing neural dynamics supporting the upcoming choice (Mochol et al., 2021), and/or mediate learning. In contrast to broadly distributed adaptive value coding, this activity was restricted to a particular subset of neurons that were identifiable by two independent clustering methods, and that exhibited the strongest encoding of reward outcomes. Intriguingly, the responses of neurons in this cluster (cluster 3) are qualitatively similar to the responses of identified corticostriatal cells in Hirokawa et al., 2019, raising the possibility that cluster 3 neurons also project to the striatum. lOFC neurons that project to the striatum likely mediate learning: ablating OFC projections to the ventral striatum during a reversal learning task demonstrated that this projection specifically supports learning from negative outcomes (Groman et al., 2019).

In reinforcement learning accounts of basal ganglia function, it is thought that cortical inputs to the striatum encode information about the state the animal is in, and corticostriatal synapses store the learned values of performing actions in particular states in their synaptic weights (i.e. Q-values), which can be updated and modulated by dopamine-dependent plasticity (Averbeck and Costa, 2017; Doya, 2000). Coincident activation of cortical inputs and striatal spiking can tag corticostriatal synapses for plasticity in the presence of dopamine (Yagishita et al., 2014). One reason that lOFC neurons might encode reward history at the time of choice is so that contextual inputs reflecting the animal’s state would include its recent reward history. We have previously shown that optogenetic perturbations of lOFC in this task, triggered when rats exited the center port, disrupted sequential trial-by-trial learning effects (Constantinople et al., 2019a). These sequential learning effects are not optimal for decision making in this task, as trials were independent and contained explicitly cued reward contingencies. Suboptimal sequential biases are observed across species and domains (Blanchard et al., 2014; Croson and Sundali, 2005; Neiman and Loewenstein, 2011), and reflect an innate difficulty in judging true randomness in an environment (Nickerson, 2002). The neural circuits supporting these sequential biases could potentially derive from the subpopulation of lOFC neurons in cluster 3, which may mediate the effect of previous trials on the animal’s behavior either by influencing lOFC dynamics supporting the upcoming choice (Mochol et al., 2021), or via projections to the striatum that mediate learning.

Share this article

Cite this article

Task description and PSTH clustering.

Conditional feature space clustering.

GLM analysis.

Coefficient of partial determination (CPD) and mutual information (MI) reveal broad encoding of all task parameters in each cluster.

Encoding of reward history emerges late in the trial for cluster 3.

Adaptive value coding analysis using linear regression of firing rates for the 1s period following reward delivery against current and previous reward outcomes.

Rewarded volume encoding by cluster.

Author details

David L Hocker

Contribution

For correspondence

Competing interests

Carlos D Brody

Contribution

Competing interests

Cristina Savin

Contribution

Competing interests

Christine M Constantinople

Contribution

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism

Further reading