Abstract
In probabilistic reversal learning, the choice option yielding reward at higher probability switches at a random trial. To perform optimally in this task, one has to accumulate evidence across trials to infer the probability that a reversal has occurred. In this study, we investigated how this reversal probability is represented in cortical neurons by analyzing neural activity in the prefrontal cortex of monkeys and in recurrent neural networks trained on the task. We found that neural trajectories encoding reversal probability had substantial dynamics associated with intervening behaviors necessary to perform the task. Furthermore, the neural trajectories were translated systematically in response to whether outcomes were rewarded, and their position in the neural subspace captured information about reward outcomes. These findings suggested that separable dynamic trajectories, rather than fixed points on a line attractor, provide a better description of the neural representation of reversal probability. Near the behavioral reversal, in particular, the trajectories shifted monotonically across trials with stable ordering, representing varying estimates of reversal probability around the reversal point. Perturbing the neural trajectory of trained networks biased the trial at which the behavioral reversal occurred, demonstrating the role of reversal probability activity in decision-making. In sum, our study shows that cortical neurons encode reversal probability in a family of dynamic neural trajectories that accommodate flexible behavior while maintaining separability to represent distinct probabilistic values.
Introduction
To survive in a dynamically changing world, animals must interact with the environment and learn from their experience to adjust their behavior. Reversal learning has been used to assess the ability to adapt one's behavior in such environments [1–6]. For instance, in two-armed bandit tasks with probabilistic reward, the subject learns from initial trials that one option has higher reward probability than the other. When the reward probabilities of the two options are reversed at a random trial, the subject must learn to reverse its preferred choice to maximize reward outcome. In these tasks, there is uncertainty in when to reverse one's choice, as reward is received stochastically even when the less favorable option is chosen. Therefore, it is essential that reward outcomes are integrated over multiple trials before the initial choice preference is reversed. Although neural mechanisms for accumulating evidence within a trial have been studied extensively [7–10], it remains unclear if a recurrent neural circuit uses a similar neural mechanism for accumulating evidence across multiple trials while performing intervening behavior during each trial. In this study, we merged two classes of computational models, behavioral and neural, to investigate the neural basis of multi-trial evidence accumulation. The behavioral models capture the subject's strategies for performing the reversal learning task. For instance, model-free reinforcement learning (RL) [11–13] assumes that the subject learns only from choices and reward outcomes without specific knowledge about the task structure. Model-based Bayesian inference [2, 14, 15], in contrast, assumes that the task structure is known to the subject, and one can infer reversal points statistically, resulting in abrupt switches in choice preference. Model-based and model-free RL models are formal models that do not specify an implementation in a network of neurons. On the other hand, neural models implemented with recurrent neural networks (RNNs) can be trained to use recurrent activity to perform the reversal learning task. In particular, attractor dynamics, in which the network state moves towards discrete [9, 16] or along continuous [8, 17] attractor states, have been studied extensively as a potential neural mechanism for decision-making and evidence accumulation [18, 19].
Here, we trained RNNs that learned from a Bayesian inference model to mimic the behavioral strategies of monkeys performing the reversal learning task [2, 4]. We found that, in the prefrontal cortex of monkeys and in trained RNNs, neural activity during a baseline hold period encoded reversal probability in a one-dimensional subspace, similar to a line attractor. However, intervening behavior during a trial, including making decisions and receiving feedback, produced substantial non-stationary neural dynamics. This observation made the attractor dynamics, which require the network state to stay close to attractor states [8, 9, 16, 17], ill-suited for explaining the neural activity associated with evidence accumulation in reversal learning.
Instead, we found that reversal probability was encoded in dynamic neural trajectories that shifted systematically across trials. Reward outcome pushed the entire trajectory in a positive (without reward) or negative (with reward) direction, separating trajectories of adjacent trials. Moreover, integrating reward outcomes across trials captured the position of a trajectory. These results suggested a neural mechanism where separable dynamic trajectories encode accumulated evidence. Around the behavioral reversal trial, reversal probabilities were represented by a family of rank-ordered trajectories that shifted monotonically. Perturbation experiments in trained RNNs demonstrated a causal link between reversal probability activity and choice outcomes.
In sum, our results show that, in a probabilistic reversal learning task that requires evidence integration across trials and execution of intervening behavior in-between trials, reversal probability is encoded in separable dynamic trajectories that allow for temporally flexible representation of accumulated evidence.
Results
1 Trained RNN’s choices are consistent with monkey behavior
In the reversal learning task, in each trial, two options were available. The subject (either the monkeys or the network) chose one of the options. Rewards were delivered stochastically. The initial high-value option was rewarded 70% of the time when chosen, and the initial low-value option was rewarded 30% of the time when chosen. The task was executed in blocks of trials. On a randomly chosen trial, the reward probabilities of the two options were switched. Because reward delivery was stochastic, the agent had to infer the reversal by accumulating evidence that a reversal had occurred. In this study, we will focus on this reversal inference process.
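To make the task structure concrete, below is a minimal sketch of one block's reward schedule in Python/NumPy. The function names, the block length of 36 trials, and the sampling window for the scheduled reversal are illustrative assumptions; only the 70%/30% reward probabilities and the random reversal are taken from the task description.

```python
import numpy as np

def make_block(n_trials=36, p_high=0.7, p_low=0.3, rng=None):
    """Generate one block: a scheduled reversal trial and per-trial reward
    probabilities for the two options (0 and 1)."""
    rng = np.random.default_rng() if rng is None else rng
    # The reward schedule reverses at a randomly chosen trial near the middle of the block
    # (the exact sampling window is an assumption for illustration).
    reversal = int(rng.integers(n_trials // 2 - 5, n_trials // 2 + 5))
    p_reward = np.full((n_trials, 2), p_low)
    p_reward[:reversal, 0] = p_high   # option 0 starts as the high-value option
    p_reward[reversal:, 1] = p_high   # after the reversal, option 1 is high-value
    return reversal, p_reward

def deliver_reward(choice, trial, p_reward, rng=None):
    """Stochastic reward: the chosen option pays off with its current reward probability."""
    rng = np.random.default_rng() if rng is None else rng
    return bool(rng.random() < p_reward[trial, choice])
```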
We began by training an RNN on the reversal learning task and comparing the performance of the network to the monkeys. This allowed us to study the solutions adopted by the network and to generate hypotheses that we could test in neural data. Therefore, we trained an RNN to choose from two options in each trial when triggered by a cue. Following the choice, feedback was provided to the network, signaling the choice it made and the reward outcome (Fig. 1A). The reward schedule was probabilistic and identical to the task monkeys performed. This reward schedule was reversed at a random trial, and the RNN learned to reverse its decision by mimicking the outputs of a Bayesian inference model that captures the monkey's reversal behavior (see Methods Section 2.3 for the RNN training scheme). In a typical block consisting of 36 trials, a trained RNN selected the initial high-reward option, despite occasional unrewarded trials, but abruptly switched its choice when consecutive no-reward trials persisted (Fig. 1B).
The reversal behavior of trained RNNs was similar to the monkey’s behavior on the same task. RNNs selected the high reward option with high probability before the behavioral reversal, at which time they abruptly switched their choice (Fig. 1C). The behavioral reversal was preceded by a gradually increasing number of no-reward trials (Fig. 1D). The distribution of behavioral reversal trials (i.e., trial at which preferred choice was reversed) relative to the scheduled reversal trial (i.e., trial at which reward schedule was reversed) was similar to the distribution of monkey’s reversal trials (Fig. 1E).
2 Task-relevant neural activity evolves dynamically
Next we examined the temporal dynamics of task-relevant neural activity, in particular activity encoding the choice and reversal probability. This analysis focused on trials around the reversal point in each block. To capture task-relevant neural activity, we first identified population vectors that encoded the task variables using a method called targeted dimensionality reduction [20]. It regresses the activity of individual neurons onto task variables and, for each task variable, identifies the population vector of regression coefficients with maximal norm. Then, neural activity representing the task variable is obtained by projecting the population activity onto the identified task vectors (see Methods Section 3 for details).
When averaged over blocks, the neural activity associated with choices and inferred reversal probability, denoted as xchoice and xrev, respectively, produced non-stationary dynamics in each trial (Fig. 2A). Their activity level reached a maximum around the time of cue onset (black squares in Fig. 2A), when the monkey and RNN were about to make a choice. These rotational neural dynamics were found both in the prefrontal cortex (PFC) of monkeys and in trained RNNs.
The orientation of rotational trajectories shifted as trials progressed, indicating systematic changes in the choice and reversal probability activity across trials. When the task-relevant activity at cue onset was analyzed, we found that reversal probability activity, xrev, peaked at the reversal trial in the PFC and RNN (Fig. 2B). On the other hand, choice activity, xchoice, decreased gradually over trials reflecting the changes in choice preference (Fig. 2C). The inverted-V shape of xrev and the monotonic decrease in xchoice over trials explained the counter-clockwise shift in the rotational trajectories observed in the two-dimensional phase space (Fig. 2A).
3 Integration of reward outcomes drives reversal probability activity
We asked if the changes in reversal probability activity xrev across trials, as shown in Fig. 2B, can be explained by integrating reward outcomes. In particular, we wondered if the reward outcomes from each trial would drive the shifts in reversal probability activity. To investigate this question, we set up a reward integration equation that predicts the next trial's reversal probability activity from the current trial's reversal probability activity and reward outcome, xrev^{k+1} = xrev^k + Δk, therefore predicting across-trial reversal probability by integrating reward outcomes. Here, xrev^k is the reversal probability activity at the time of cue onset tonset at trial k, and Δk is an estimate of the shift in reversal probability activity driven by trial k's reward outcome (negative if rewarded and positive if not rewarded; see Methods Section 4 for details).
The predicted reversal probability activity was in good agreement with the actual activity of PFC and RNN (example blocks shown in Figs. 3A, C; prediction accuracy of all blocks shown in Fig. 3E). Moreover, we found that xrev, the neural activity encoding reversal probability, responded to reward outcomes consistently with how reversal probability itself would be updated. In other words, receiving no reward at trial k increased the reversal probability activity in the next trial k + 1 (Figs. 3B, D; no reward), while receiving a reward at trial k decreased it (Figs. 3B, D; reward). At the behavioral reversal trial (k = 0), however, the reversal probability activity in the following trial (k = 1) decreased regardless of the reward outcome at the reversal trial. When the reward integration equation was fitted to the reversal probability activity at other time points (i.e., Δk was estimated at each time t), the prediction accuracy remained stable in time (Fig. 3F).
These findings show that neural activity encoding reversal probability exhibits structured responses to reward outcomes, consistent with how reversal probability itself would respond: increase with no reward and decrease with reward. In addition, the reversal probability activity can be predicted by integrating reward outcomes, supporting that it encodes accumulation of decision-related evidence.
4 Dynamic neural trajectories encoding reversal probability are separable
Previous works have shown that accumulation of decision-related evidence can be represented as a line attractor in a stable subspace of network activity [20, 21]. One might hypothesize that reversal probability in the reversal learning task could be similarly characterized by such line attractor dynamics. A direct application of the line attractor model would imply that, across a trial when no decision-related evidence is presented, the reversal probability activity should remain constant (Fig. 4A, line attractor).
However, we found that there was substantial activity in this neural subspace (Fig. 2A). In particular, the non-stationary neural activity was associated with intervening behaviors during a trial. The time derivative of reversal probability activity increased rapidly at the time of cue onset, when a decision is made, followed by a sharp decrease until the time of reward (dxrev/dt in Fig. 4B). So, instead of a static view of evidence accumulation, we explored the hypothesis that different levels of reversal probability could be encoded in dynamic neural trajectories (Fig. 4A, dynamic trajectory).
To encode distinct values of reversal probability in dynamic trajectories, the trajectories representing the values must remain separated as they evolve in time. We compared trajectories at adjacent trials to examine if the reward outcome drives the next trial’s trajectory away from the current trial’s trajectory, thus separating them, and, if so, to what extent the trajectories are separated.
Analysis of PFC activity showed that not receiving a reward increased the next trial's trajectory xrev^{k+1}(t) compared to the current trial's trajectory xrev^k(t). Within a trial, this positive shift was observed over the entire trial duration until the next trial's reward was revealed, as shown in the difference of adjacent trials' trajectories (Fig. 4C, R−). Moreover, across trials, the same trend was observed in all the trials except at the behavioral reversal trial, at which the reversal probability activity reached its maximum value and decreased in the following trial (Fig. 4D, R−). On the other hand, when a reward was received, the next trial's trajectory was decreased compared to the current trial's trajectory. This negative shift persisted until the next trial's reward, similarly to the case when no reward was received (Fig. 4C, R+). Across trials, the same trend was observed in all the trials except at the trial preceding the behavioral reversal trial, at which the trajectory increased to the maximum value at the reversal trial (Fig. 4D, R+). Additional analysis of R− and R+ beyond the next trial's reward time can be found in Supp. Figure S1.
We examined what type of activity mode the dynamic trajectories exhibited when separating away from the previous trial’s trajectory. Ramping activity is often observed in cortical neurons of animals engaged in decision-making [22–26]. We found that, when no rewards were received, trajectories were separated from the previous trial’s trajectory by increasing their ramping rates towards the decision time (dR−/dt > 0 in Fig. 4E). On the other hand, when rewards were received, trajectories were separated by decreasing their ramping rate (dR+/dt < 0 in Fig. 4E). The increase (or decrease) in the ramping rates was observed in consecutive no reward (or reward) trials around the reversal trial (Fig. 4E, left).
Consistent with the PFC activity, the trained RNN exhibited similar activity responses to reward outcomes: neural trajectories encoding reversal probability increased when reward was not received and decreased when reward was received. The shift in trajectories persisted throughout the trial duration (Fig. 4G), and ramping rates changed in agreement with the PFC findings (Fig. 4H).
Since the dynamics of trained RNNs are fully known, we sought to examine the circuit dynamic motif that separates neural trajectories. We projected the differential equation governing the network dynamics onto a one-dimensional subspace and analyzed the contribution of recurrent and external inputs to the reversal probability dynamics, dxrev/dt = xrec + xext (see Methods Section 1 for details). We found that the external input xext was positive, while the recurrent input xrec was negative and curtailed the external input (Fig. 4F, external and recurrent). When no reward was received, xext and xrec were both amplified by approximately the same factor, resulting in an increased total input, xrec + xext → γnorew (xrec + xext) with γnorew > 1 (Fig. 4F, amplification). On the other hand, when reward was received, they were both suppressed, resulting in a decreased total input with γreward < 1. This suggested a circuit dynamic motif in which external feedback balanced by recurrent inhibition drives the reversal probability dynamics. The total drive is amplified or suppressed, depending on reward outcomes, resulting in a trajectory that separates from the previous trial's trajectory.
In sum, our findings show that dynamic neural trajectories encoding reversal probability are separated from the previous trial’s trajectory in response to reward outcomes, allowing them to represent distinct values of reversal probability across a trial.
5 Monotonic shift of reversal probability trajectories across trials
So far, we showed that neural trajectories of two adjacent trials, encoding reversal probability, were separable. In this section, we investigated if trajectories exhibited systematic changes across multiple trials. Specifically, we quantified the mean behavior of trajectories in each trial (referred to as mean trajectory of a trial) and looked for consistent trends in the changes of mean trajectories across trials.
Since a mean trajectory was obtained by averaging over all reward outcomes, we compared how reward and no-reward blocks contributed to modifying the next trial's mean trajectory. This analysis amounted to comparing the weighted responses q+R+ and q−R− shown in Fig. 4D, where q+ and q− denote the fractions of reward and no-reward blocks at trial k (see Methods Section 4). Since q−R− > 0 and q+R+ < 0 throughout a trial, we flipped the sign of q+R+ to −q+R+ and compared the magnitudes of two positive traces, q−R− and −q+R+. We found that, before the behavioral reversal trial (k < 0), the contribution of no-reward was larger than that of reward. This is shown as q−R− lying above −q+R+ during a trial and across pre-reversal trials (see relative trials k = −5 to −1 in Fig. 5A). Their temporal averages over the trial duration also captured this finding: the temporal average of q−R− was larger than that of −q+R+ before the behavioral reversal trial (Fig. 5B, bottom). This analysis showed that the sum q−R− + q+R+, i.e., the difference of mean trajectories between two adjacent trials, stayed positive during the pre-reversal trials (Fig. 5B, top). We confirmed that the fraction of trials for which q−R− + q+R+ was positive was close to 0.8 in the pre-reversal phase (Fig. 5C, top).
The sum q−R− + q+R+ being positive meant that the next trial's mean trajectory was increased relative to the current trial's mean trajectory. Furthermore, the sum being positive across pre-reversal trials was equivalent to the mean trajectories increasing monotonically across trials towards the behavioral reversal trial (Fig. 5C, bottom). The monotonicity of trajectories implied that a topological structure was present in pre-reversal trajectories; namely, the rank order of the trajectories was preserved throughout the trial duration. Consistent with this observation, we found that the Spearman rank correlation of pre-reversal trajectories was stable in time (Fig. 5E, pre).
After the reversal trial (k ≥ 0), on the other hand, the contributions of no-reward and reward were the opposite of those in the pre-reversal trials. The traces of −q+R+ were positioned above q−R− (see relative trials k = 0 to 4 in Fig. 5A), and the temporal average of −q+R+ was larger than that of q−R− (Fig. 5B, bottom). This showed that q−R− + q+R+ was mostly negative during post-reversal trials. The fraction of trials for which q−R− + q+R+ was negative was close to 0.8 in the post-reversal phase (Fig. 5D, top). The negativity of q−R− + q+R+ across post-reversal trials meant that the post-reversal trajectories were monotonically decreasing (Fig. 5D, bottom). Similarly to the pre-reversal trajectories but in reversed order, the rank order of post-reversal trajectories was stable over the trial duration (Fig. 5E, post).
Consistent with the PFC findings, in the trained RNNs, the effects of reward outcomes on mean trajectories were characterized by the sum q−R− + q+R+ being positive before and negative after the reversal trial (Fig. 5F). Consequently, trained RNNs exhibited a monotonic increase and decrease in the pre- and post-reversal phases, respectively (Figs. 5G, H). Also, the rank order of trajectories was stable over the trial duration (Fig. 5I).
Our analyses show that the mean behavior of dynamic neural trajectories, encoding reversal probability, is to shift monotonically across trials near the behavioral reversal. It suggests that a family of graded neural trajectories, with a temporally stable rank order, could represent varying estimates of the probability that a reversal has occurred.
6 Perturbing neural activity encoding reversal probability biases choice outcomes
Next we turned to the RNN to see if we could perturb activity within the reversal probability space, and consequently perturb the network's choice preference. Previous experimental works demonstrated that perturbing neural activity of medial frontal cortex [27], specific cell types [28, 29] or neuromodulators [5, 6] affects the performance of reversal learning. In our study, the perturbation was tailored to lie within the reversal probability space by applying an external stimulus aligned with (v+) or opposite to (v−) the reversal probability vector. An external stimulus in a random direction was also applied as a control (vrnd). All the stimuli were applied before the time of choice at the reversal trial or at preceding trials (Fig. 6A).
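As a rough illustration, a perturbation of this kind can be implemented by adding a stimulus along the reversal probability vector to the network's external input during a window before the choice. The code below is a sketch under assumed variable names (I, v_rev, eps, t_window); it is not the exact protocol or parameterization used in the study.

```python
import numpy as np

def perturb_input(I, v_rev, eps, t_window, direction="aligned", rng=None):
    """Return a copy of the external input with a stimulus added along (v+),
    against (v-), or in a random control direction (vrnd) of the reversal
    probability vector during a time window before the choice.

    I        : (T, N) external input over one trial (time steps x neurons)
    v_rev    : (N,) vector spanning the reversal probability subspace
    eps      : stimulus amplitude
    t_window : slice of time steps during which the stimulus is applied
    """
    rng = np.random.default_rng() if rng is None else rng
    if direction == "aligned":            # v+
        v = v_rev
    elif direction == "opposite":         # v-
        v = -v_rev
    else:                                 # vrnd: random direction with matching norm
        v = rng.standard_normal(v_rev.shape)
        v *= np.linalg.norm(v_rev) / np.linalg.norm(v)
    I_pert = I.copy()
    I_pert[t_window] += eps * v           # broadcast over the perturbation window
    return I_pert
```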
We found that the deviation of perturbed reversal probability activity from the unperturbed activity peaked at the end of perturbation duration and decayed gradually (Fig. 6B, red solid). The perturbed choice activity, however, deviated more slowly and peaked during the choice duration (Fig. 6B, black solid). This showed that perturbation of the reversal probability activity had its maximal effect on the choice activity when the choice was made. The strong perturbative effects on the reversal probability and choice activity were not observed in the control (Fig. 6B, dotted).
The perturbation in the aligned (v+) and opposite (v−) directions shifted the reversal probability activity along the same directions as the perturbation vector, as expected (Fig. 6C, left). The choice activity, on the other hand, increased when the perturbation was in the opposite direction (v−) and decreased when the perturbation was in the aligned direction (v+) (Fig. 6C, right). This finding showed that the choice activity could be biased (1) towards pre-reversal choices if the perturbation decreases the reversal probability activity and (2) towards the post-reversal choices if the perturbation increases the reversal probability activity.
We further analyzed if perturbing within the reversal probability space could affect the choice outcomes, specifically the behavioral reversal trial. We found that the reversal trial was delayed when the v− stimulus was applied to reduce the reversal probability activity (Fig. 6D, left). The effect of the v− stimulus increased gradually with stimulus strength and was significantly stronger than the v+ or vrnd stimuli in delaying the reversal trial. Perturbation had the strongest effect when applied at the reversal trial, while perturbations on trials preceding the reversal showed appreciable but reduced effects (Fig. 6D, right). When the v+ stimulus was applied to trials preceding the reversal trial, the reversal was accelerated (Fig. 6E, left). The effect of the v+ stimulus also increased with stimulus strength and was significantly stronger than the v− or vrnd stimuli in accelerating the reversal trial (Fig. 6E, right).
We asked if perturbation of neural activity in PFC could exhibit similar responses. In other words, does an increase (or decrease) in reversal probability activity lead to a decrease (or increase) in choice activity in PFC? Although PFC activity was not perturbed by external inputs, we considered the residual activity of single trials, i.e., the deviation of single-trial neural activity around the trial-averaged activity, to be "natural" perturbation responses. We fitted a linear model to the residual activity of reversal probability and choice and found that they were strongly negatively correlated (i.e., negative slope in Fig. 6F) at the trial preceding the behavioral reversal trial. This analysis demonstrated the correlation between perturbation responses of reversal probability and choice activity. However, it remains to be investigated, through perturbation experiments, whether reversal probability activity is causally linked to choice activity in PFC and, moreover, to the animal's choice outcomes.
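This residual analysis can be sketched as follows (illustrative NumPy code with assumed array names): subtract the trial-averaged activity to obtain residuals, then fit a line to the paired residuals of reversal probability and choice activity.

```python
import numpy as np

def residual_slope(x_rev, x_choice):
    """x_rev, x_choice: (n_blocks,) single-trial activity at a fixed trial and time
    (e.g., the trial preceding the behavioral reversal), one value per block.
    Returns the slope of a linear fit to the mean-subtracted (residual) activity;
    a negative slope indicates anti-correlated residuals."""
    r_rev = x_rev - x_rev.mean()
    r_choice = x_choice - x_choice.mean()
    slope, _intercept = np.polyfit(r_rev, r_choice, deg=1)
    return slope
```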
Discussion
Reversal learning
Reversal learning has been a behavioral framework for investigating how the brain supports flexible behavior [1–6] and for elucidating neural mechanisms underlying mental health issues [30, 31]. It has been shown that multiple brain regions (cortical [3, 4, 14, 27–29, 32] and subcortical [3, 33]), neuromodulators [2, 5, 6] and different inhibitory neuron types [28, 29] are involved in reversal learning.
Our results
Despite these recent advances, the dynamics of neural activity in cortical areas during a reversal learning task have not been well characterized. In this study, we investigated how reversal probability is represented in cortical neurons by analyzing neural activity in the prefrontal cortex of monkeys and recurrent neural networks performing the reversal learning task. Reversal probability was encoded in dynamically evolving neural trajectories that shifted in response to reward outcomes. Neural trajectories were translated in the direction consistent with how reversal probability would be updated by reward outcomes, and their position could be estimated by integrating reward outcomes across trials. These suggested a neural mechanism where separable dynamic trajectories represent reversal probability by accumulating reward outcomes. Around the behavioral reversal, the average effects of reward outcomes became monotonic, resulting in graded neural representation of reversal probabilities. Perturbation experiments in trained networks demonstrated a potential causal link between reversal probability activity and choice outcomes.
Attractor dynamics
RNNs with attractor dynamics have been investigated in various contexts as a neural implementation of normative models of decision-making and evidence integration [34–38]. One perspective is to consider decision variables as discrete or continuous attractor states of an RNN. Then, the network activity converges to an attracting state as a decision is made. Biologically plausible network models [7, 16] and neural recordings in cortical areas have been shown to exhibit discrete [9, 10, 39] and continuous [40] attractor dynamics. Another perspective, more closely related to our study, is to consider evidence integration as a movement of network state along a one-dimensional continuous attractor, as demonstrated in [8, 21, 41] (see also continuous attractor dynamics in spatial mapping [17, 42–44]).
In most of these studies, decision-related evidence was presented without significant interruption until the decision point [9, 10, 20, 39]. However, this was not the case in a reversal learning task with probabilistic rewards, as reward outcomes were revealed intermittently over multiple trials while intervening behavior had to be performed between trials. We showed that such multi-trial evidence integration promoted substantial non-stationary activity in the neural subspace encoding reversal probability. Therefore, the continuous attractor dynamics, in which the network state stays close to the attracting states, did not fully account for the observed neural dynamics. Instead, our findings suggest that separable dynamic trajectories could serve as a neural mechanism for representing accumulated evidence in a temporally flexible way.
Related work
Recent studies showed that intervening behaviors, such as introducing an intruder [21] or accumulating reward across trials [41], could produce neural trajectories that deviate from and retract to a line attractor. In our study, we focused on characterizing the neural representation of reversal probability but did not investigate it from a dynamical systems perspective. It remains future work to characterize whether and how the separable dynamic trajectories observed in our study could be incorporated into the continuous attractor model and to compare this with previous works [21, 41].
In a related work [45], RNNs were trained to perform a change point detection task designed by the International Brain Laboratory [46]. The authors showed that trained RNNs exhibited behavioral outputs consistent with an ideal Bayesian observer, as found in our study. However, their trained RNNs exhibited line attractor dynamics, in contrast to ours. One possible reason for this discrepancy is that their network model stepped through only a few time points in a trial, which limited the range of temporal dynamics the RNNs could exhibit. This suggests that the setup of the task that RNNs learn can shape the trained RNN dynamics. Moreover, it remains to be investigated whether such attractor dynamics are present in neural recordings from mice performing the change point detection task.
Although RNNs in our study were trained via supervised learning, animals learn a reversal learning task from reward feedback, making it into a reinforcement learning (RL) problem. Neuromodulators play a key role in mediating RL in the brain. In a recent study, dopamine-based RL was used to train artificial RNNs to conduct reversal learning tasks. It was shown that neural activity in RNNs and mice performing the same tasks were in good agreement [47]. In addition, projections of serotonin from dorsal raphe nuclei [6, 48] and norepinephrine from the locus coeruleus [5] to the cortical areas were shown to be involved in reversal learning. Further studies with biologically plausible network models including neuromodulatory effects [49, 50] or formal RL theories incorporating neuromodulators [51] could provide further insights into the role of neuromodulators in reversal learning.
Conclusion
Our findings show that, when performing a reversal learning task that requires evidence integration across trials, a cortical circuit adopts a dynamic neural representation of accumulated evidence to accommodate non-stationary activity associated with intervening behaviors. Such a neural mechanism demonstrates the temporal flexibility of cortical computation and opens the opportunity to extend existing neural models of evidence accumulation by augmenting their temporal dynamics.
Methods
1 Recurrent neural network
Network model
We trained a recurrent neural network with purely inhibitory synaptic connections. A baseline excitatory external input was applied to neurons, without which the network activity became quiescent. Such an inhibitory network operated in a balanced regime, where the recurrent inhibitory inputs were balanced by the external excitatory inputs [52]. Neurons were connected sparsely with connection probability p. Throughout network training, the signs of synaptic weights were preserved, resulting in a trained network that had only inhibitory synaptic connections.
The network dynamics were governed by the following equation
du/dt = −u + Wrec ϕ(u) + Ibase + Icue + Ifeedback,
and the network readout was
z = wout · ϕ(u).
Here, u ∈ ℝ^N is the neural activity of the population of N neurons, and Wrec is an N × N recurrent connectivity matrix with inhibitory synaptic weights: (Wrec)ij is the connection from neuron j to neuron i. The activation function was sigmoidal, ϕ(x) = 1/(1 + exp[ax + b]), and was applied to u elementwise in ϕ(u). The baseline input Ibase was constant in time and the same for all neurons, the cue Icue was turned on to signal the RNN to make a choice, and the feedback Ifeedback provided information about the previous trial's choice and reward outcome (see Table 1).
The duration of a trial was T = 500 ms. The feedback Ifeedback was applied on the time interval [0, Tfeedback] with Tfeedback = 300 ms, and the cue Icue was applied on a subsequent time interval within the trial. The network choice was defined using the average of the readout z over a time window following cue onset.
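The following is a minimal forward-Euler sketch of the dynamics and readout written above. The integration time step, parameter values, and the exact form of the readout are assumptions for illustration.

```python
import numpy as np

def phi(x, a=1.0, b=0.0):
    """Sigmoidal activation, phi(x) = 1 / (1 + exp[a x + b]), applied elementwise."""
    return 1.0 / (1.0 + np.exp(a * x + b))

def simulate_trial(u0, W_rec, w_out, I, dt=1.0):
    """Forward-Euler integration of  du/dt = -u + W_rec phi(u) + I(t).

    u0    : (N,) initial state
    W_rec : (N, N) inhibitory recurrent weights (non-positive entries)
    w_out : (N,) readout weights
    I     : (T, N) external input (baseline + cue + feedback) at each time step
    Returns the state trajectory u(t) and the readout z(t)."""
    u = u0.copy()
    us, zs = [], []
    for I_t in I:
        u = u + dt * (-u + W_rec @ phi(u) + I_t)
        us.append(u.copy())
        zs.append(float(w_out @ phi(u)))
    return np.array(us), np.array(zs)
```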
Reduced model
One-dimensional reduction of the network dynamics in a subspace defined by a task vector, v, was derived as follows (see Fig. 4). The projection of network activity onto the task vector was
x = v · u.
Then, the dynamics of the projected activity are governed by
dx/dt = xrec + xext,
where
xrec = v · (−u + Wrec ϕ(u)),   xext = v · (Ibase + Icue + Ifeedback).
Here xrec includes both the decay and recurrent terms, and xext accounts for all external inputs I = Ibase + Icue + Ifeedback.
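A sketch of this one-dimensional reduction, under the dynamics written above: projecting both sides onto the task vector v separates the decay-plus-recurrent contribution from the external drive.

```python
import numpy as np

def project_dynamics(u, W_rec, I_t, v, phi):
    """Decompose the instantaneous drive of x = v . u into recurrent and external parts,
    so that  dx/dt = x_rec + x_ext  under  du/dt = -u + W_rec phi(u) + I_t."""
    x_rec = float(v @ (-u + W_rec @ phi(u)))   # decay + recurrent inhibition
    x_ext = float(v @ I_t)                     # baseline + cue + feedback input
    return x_rec, x_ext
```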
2 Reversal learning task
Overview. Each block consisted of T = 24 trials during network training. The reversal trial r was sampled randomly and uniformly from the 10 trials around the mid-trial.
The network made a choice in each trial: A or B. To model which choice was rewarded, we generated a “rewarded” choice for each trial. One of the choices was more likely to be rewarded than the other. The network’s choice was compared to the rewarded choice, and the network received a feedback that signaled its choice and reward outcome (e.g., chose A and received a reward). The option that yielded higher reward prior to the reversal trial was switched to the other option at the reversal trial.
To train the network to reverse its preferred choice, we used the output of an ideal Bayesian observer model as the teaching signal. Specifically, we first inferred the scheduled reversal trial (i.e., the trial at which reward probability switched) using the Bayesian model. Then, the network was trained to flip its preferred choice a few trials after the inferred scheduled reversal trial, such that the network's behavioral reversal trial occurred a few trials after the scheduled reversal trial.
Note that, although we refer to “rewarded” choices, there were no actual rewards in our network model. The “rewarded” choices were set up to define feedback inputs that mimic the reward outcomes monkey received.
2.1 Experiment variables
The important variables for training the RNN were the network choice, the rewarded choice and the feedback.
Network choice. To define the network choice, we symmetrized the readout, zsym = (z, −z), and computed its log-softmax, (z − log s, −z − log s), where s = e^z + e^−z. The network choice was the option with the larger softmax probability,
choice = A if z > 0, and B otherwise,
where z denotes the trial-averaged readout over the choice window.
Rewarded choice. To model stochastic rewards, rewarded choices were generated probabilistically for each trial k: the rewarded choice was the current high-value option with target probability p = 0.7 and the other option with probability 1 − p = 0.3.
The reversal of reward schedule was implemented by switching the target probability at the scheduled reversal trial of the block, denoted by rsch.
Feedback. We considered that a reward was delivered when the network choice agreed with the rewarded choice, and no reward was delivered when they disagreed. This led to the four types of feedback inputs shown in Table 1.
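A compact sketch of these variables is given below. The argmax choice rule and the function names are assumptions; the 70%/30% target probability and the four feedback types follow the task description.

```python
import numpy as np

def network_choice(z_mean):
    """Choice from the (trial-averaged) readout z: symmetrize to (z, -z),
    apply softmax, and take the more probable option (0 = A, 1 = B)."""
    z_sym = np.array([z_mean, -z_mean])
    p = np.exp(z_sym) / np.exp(z_sym).sum()       # softmax over the two options
    return int(np.argmax(p))

def rewarded_choice(trial, reversal_trial, p=0.7, rng=None):
    """Rewarded option on this trial: the current high-value option with target
    probability p, the other option otherwise; identities swap at the reversal."""
    rng = np.random.default_rng() if rng is None else rng
    high = 0 if trial < reversal_trial else 1
    return high if rng.random() < p else 1 - high

def feedback_type(choice, rewarded):
    """One of the four feedback inputs (cf. Table 1): which option was chosen,
    and whether the choice matched the rewarded option."""
    return choice, choice == rewarded
```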
2.2 Bayesian inference model
Here we formulate Bayesian models that infer the scheduled reversal trial or the behavior reversal trial.
Ideal observer model
The ideal observer model, developed previously [2, 4], inferred the scheduled reversal trial and assumed that (a) the target probability was known and (b) it switched at the scheduled reversal trial (see Section 2.1).
The data available to the ideal observer were the choices yk ∈ {A, B} and the reward outcomes zk ∈ {0, 1} at all trials k ∈ [1, T]. We inferred the posterior distribution of the scheduled reversal trial r ∈ [1, T]. By Bayes' rule,
p(r | y1:t, z1:t) = p(y1:t, z1:t | r) p(r) / p(y1:t, z1:t).
We evaluated the posterior distribution of r when data were available up to any trial t ≤ T. The likelihood function fIO(r) = p(y1:t, z1:t | r) of the ideal observer was defined by
fIO(r) ∝ ∏_{k=1}^{t} p(zk | yk, r).
For k < r, p(zk = 1 | yk) equals 0.7 if yk is the initial high-value option and 0.3 otherwise. For k ≥ r, these reward probabilities are reversed.
To obtain the posterior distribution of r, the likelihood function fIO(r) was evaluated for all r ∈ [1, t], assuming a flat prior p(r) and normalizing by the choice and reward data p(y1:t, z1:t).
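The posterior can be computed with a brute-force sweep over candidate reversal trials, as sketched below (NumPy; the 0/1 option coding and the function name are illustrative, and the likelihood is the reconstructed form above).

```python
import numpy as np

def ideal_observer_posterior(choices, rewards, p_high=0.7, p_low=0.3):
    """Posterior over the scheduled reversal trial r = 1..t, given choices y_1:t and
    reward outcomes z_1:t, with a flat prior.

    choices : (t,) array with 0 = initially high-value option, 1 = the other option
    rewards : (t,) array with 1 = rewarded, 0 = not rewarded"""
    t = len(choices)
    log_lik = np.zeros(t)
    for r in range(1, t + 1):
        ll = 0.0
        for k in range(t):
            high = 0 if (k + 1) < r else 1            # high-value option before/after r
            p_rew = p_high if choices[k] == high else p_low
            ll += np.log(p_rew if rewards[k] else 1.0 - p_rew)
        log_lik[r - 1] = ll
    post = np.exp(log_lik - log_lik.max())            # flat prior; subtract max for stability
    return post / post.sum()
```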
Behavioral model
To infer the trial at which choice reversed, i.e., behavior reversal, we used a likelihood function that assumed the preferred choice probability switched at the behavior reversal. Here, the reward schedule was not known.
The data available to the behavioral model were the choices yk ∈ {A, B} at all trials k ∈ [1, T]. We inferred the posterior distribution of the behavioral reversal trial r ∈ [1, T]. By Bayes' rule,
p(r | y1:t) = p(y1:t | r) p(r) / p(y1:t).
The likelihood function for the behavioral model was
fB(r) = p(y1:t | r) = ∏_{k=1}^{t} p(yk | r).
For k < r, the preferred-choice probability favors the initially preferred option, and for k ≥ r, it favors the other option.
To obtain the posterior distribution of r, we assumed flat prior p(r), as in the ideal observer, and normalized by the choice data p(y1:t).
2.3 Training scheme
Overview. The ideal observer successfully inferred the scheduled reversal trial, which occurred randomly around the mid-trial. To learn to switch its preferred choice, the network was trained to learn from the scheduled reversal trials inferred by the ideal observer. In other words, in a block consisting of T trials, the network choices and reward outcomes were fed into the ideal observer model to infer the randomly chosen scheduled reversal trial. Then, the network was trained to switch its preferred choice a few trials after the inferred reversal trial. This delay of the behavioral reversal relative to the scheduled reversal was observed in the monkeys' reversal behavior [4] and in a running estimate of the maximum a posteriori (MAP) reversal trial (see Step 3 below). As the inferred scheduled reversal trial varied across blocks, the network learned to reverse its choice in a block-dependent manner.
Below we describe the specific steps taken to train the network.
Step 1. Simulate the network and store the network choices and reward outcomes.
Step 2. Apply the ideal observer model to network’s choice and reward data to infer the scheduled reversal.
Step 3. Identify the trial t* at which network choice should be reversed.
The main observation is that the running MAP estimate of the scheduled reversal trial converges a few trials past the MAP estimate itself. In other words, let r̂(t) = argmax_r p(r | y1:t, z1:t) denote the running MAP estimate given data up to trial t; then r̂(t) = r̂(T) for all t ≥ t*, where the convergence trial t* occurs a few trials after r̂(T). The network choice sequence was reversed at t* (a minimal sketch of this step is given after Step 6 below).
Step 4. Construct the choice sequences the network will learn.
Step 5. Define the loss function of a block.
Step 6. Train the recurrent connectivity weights Wrec and the readout weights wout with backpropagation using the Adam optimizer with learning rate 10^−2. The learning rate was decayed by a factor of 0.9 every 3 epochs. The batch size (i.e., the number of networks trained) was 256. Training continued until the fraction of rewarded trials was close to the reward probability p of the preferred option.
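A minimal sketch of Step 3, using the ideal observer posterior sketched in Section 2.2: track the running MAP estimate of the scheduled reversal trial as data accumulate, and take t* as the trial from which that estimate no longer changes. The convergence criterion and function names below are assumptions made for illustration.

```python
import numpy as np

def reversal_target(choices, rewards, posterior_fn):
    """Step 3 (sketch): the trial t* at which the running MAP estimate of the
    scheduled reversal trial stops changing; the target choice sequence is
    reversed from t* onward.

    posterior_fn(choices[:t], rewards[:t]) returns the posterior over r
    given data up to trial t (e.g., ideal_observer_posterior above)."""
    n = len(choices)
    running_map = [int(np.argmax(posterior_fn(choices[:t], rewards[:t]))) + 1
                   for t in range(1, n + 1)]
    for t_star in range(n):
        if len(set(running_map[t_star:])) == 1:   # MAP estimate is stable from here on
            return t_star
    return n - 1
```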
3 Targeted dimensionality reduction
Targeted dimensionality reduction (TDR) identifies population vectors that encode task variables explicitly or implicitly utilized in the experiment the subject or RNN performs [20]. In this study, we were interested in identifying population vectors that encode choice preference and reversal probability. Once those task vectors were identified, we analyzed the neural activity projected to those vectors to investigate neural representation of task variables.
We describe how TDR was performed in our study (see [20] for the original reference). First, we regressed the neural activity of each neuron at each time point onto the task variables of interest. Then we used the matrix of regression coefficients (i.e., neuron by time) to identify the task vector. Let yit(k) be the spiking rate of neuron i at time t on trial k, where we have N neurons and M time points. We regressed the spiking activity on task variables of interest zv(k), where the task variables were v ∈ {reversal probability, choice preference, direction, object, block type, reward outcome, trial number}. For each neuron-time pair (i, t), we performed linear regression over all trials k ∈ [0, T] with a bias:
yit(k) = ∑v βit^v zv(k) + βit^0.
This regression analysis yielded an N × M coefficient matrix for each task variable v. We considered this coefficient matrix as a population vector evolving in time, βt^v = (β1t^v, …, βNt^v). Then, a task vector was defined as the population vector wv ∈ ℝ^N at which the L2-norm achieved its maximum:
wv = βtmax^v,   tmax = argmax_t ‖βt^v‖2.
We performed QR decomposition on the matrix of task vectors W = [wrev, wchoice, …] to orthogonalize the task vectors. Then, the population activity was projected onto each (orthogonalized) task vector to obtain the neural activity encoding each task variable:
xv,t(k) = wv · yt(k),
where yt(k) = (y1t(k), …, yNt(k)) is the population activity at time t on trial k.
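The procedure can be summarized in a compact NumPy sketch (array shapes and variable names are illustrative; the original analysis may differ in details such as normalization of the activity).

```python
import numpy as np

def targeted_dimensionality_reduction(Y, Z):
    """TDR sketch.

    Y : (K, N, M) spiking rates, trials x neurons x time points
    Z : (K, V) task variables per trial
    Returns orthogonalized task vectors Q (N, V) and projected activity (K, V, M)."""
    K, N, M = Y.shape
    V = Z.shape[1]
    X = np.column_stack([Z, np.ones(K)])                  # regressors plus a bias term
    beta = np.zeros((V, N, M))                            # regression coefficients
    for i in range(N):
        for t in range(M):
            coef, *_ = np.linalg.lstsq(X, Y[:, i, t], rcond=None)
            beta[:, i, t] = coef[:V]                      # drop the bias coefficient
    # Task vector: the coefficient population vector with maximal L2-norm over time.
    W = np.stack([beta[v][:, np.argmax(np.linalg.norm(beta[v], axis=0))]
                  for v in range(V)], axis=1)             # (N, V)
    Q, _ = np.linalg.qr(W)                                # orthogonalize task vectors
    proj = np.einsum('nv,knm->kvm', Q, Y)                 # project population activity
    return Q, proj
```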
4 Reward integration equation
To derive the reward integration equation shown in Figure 3, we considered the neural activity in the subspace encoding the reversal probability, xrev,t(k) = wrev · yt(k). We analyzed the neural activity at the time of cue onset t = tonset and obtained a sequence of reversal probability activity across trials, xrev^k ≡ xrev,tonset(k). To set up the reward integration equation
xrev^{k+1} = xrev^k + Δk,
we estimated the update Δk driven by the reward outcome at each trial k. Specifically, the update term was defined as the block-average of the difference of reversal probability activity at adjacent trials, computed separately for rewarded and unrewarded trials:
Δk^rew = ⟨xrev^{k+1} − xrev^k⟩Bk,rew,   Δk^norew = ⟨xrev^{k+1} − xrev^k⟩Bk,norew.
Here, Bk,rew denotes all the blocks across sessions (or networks) in which reward was received at trial k. Similarly, Bk,norew denotes all the blocks in which reward was not received at trial k.
To predict xrev^k, we set the initial value xrev^0 at trial 0 and sequentially predicted the following trials using the reward integration equation with the estimated update terms. The same analysis was performed at different time points t: we derived an integration equation for each time point and assessed its prediction accuracy, as shown in Figure 3F.
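A sketch of this fitting and prediction procedure, with assumed array names (activity aligned across blocks at cue onset):

```python
import numpy as np

def fit_update_terms(x_rev, rewarded):
    """Estimate the reward-dependent update terms as block-averaged differences of
    reversal probability activity between adjacent trials.

    x_rev    : (B, K) activity at cue onset, blocks x trials
    rewarded : (B, K) boolean reward outcome of each trial
    Returns (d_rew, d_norew), each of length K-1."""
    diff = np.diff(x_rev, axis=1)                         # x^{k+1} - x^k for each block
    d_rew = np.array([diff[rewarded[:, k], k].mean() for k in range(diff.shape[1])])
    d_norew = np.array([diff[~rewarded[:, k], k].mean() for k in range(diff.shape[1])])
    return d_rew, d_norew

def predict_block(x0, rewarded_block, d_rew, d_norew):
    """Sequentially predict one block's reversal probability activity from its
    initial value and the trial-by-trial reward outcomes."""
    x = [x0]
    for k, was_rewarded in enumerate(rewarded_block[:-1]):
        x.append(x[-1] + (d_rew[k] if was_rewarded else d_norew[k]))
    return np.array(x)
```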
To evaluate the contributions of reward and no-reward outcomes to the average response of xrev over blocks, we computed the change of the block-averaged (mean) trajectory between adjacent trials,
⟨xrev^{k+1}(t)⟩ − ⟨xrev^k(t)⟩ = q+R+(t) + q−R−(t),
where R+(t) and R−(t) are the mean differences of adjacent trials' trajectories over rewarded and unrewarded blocks, respectively, and q+ and q− denote the fractions of reward and no-reward blocks at trial k. In Figure 4D and Figure 5A, the weighted responses q+R+ and q−R− were shown.
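The weighted responses can be computed directly from trial-aligned trajectories, as in the following sketch (array names are assumptions):

```python
import numpy as np

def weighted_responses(x_rev_traj, rewarded, k):
    """Weighted reward / no-reward contributions to the change of the mean
    trajectory between trials k and k+1.

    x_rev_traj : (B, K, M) reversal probability trajectories, blocks x trials x time
    rewarded   : (B, K) boolean reward outcomes
    Returns (q_plus * R_plus, q_minus * R_minus), each of shape (M,); their sum
    equals the change of the block-averaged trajectory from trial k to k+1."""
    diff = x_rev_traj[:, k + 1] - x_rev_traj[:, k]        # (B, M)
    rew = rewarded[:, k]
    q_plus, q_minus = rew.mean(), (~rew).mean()           # fractions of reward / no-reward blocks
    R_plus = diff[rew].mean(axis=0)                       # mean shift following a rewarded trial
    R_minus = diff[~rew].mean(axis=0)                     # mean shift following an unrewarded trial
    return q_plus * R_plus, q_minus * R_minus
```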
5 Decoding monkey’s behavioral reversal trial
The PFC activity encoding reversal probability was used to decode the behavioral reversal trial at which the monkey reversed its preferred choice (see Supp. Fig. S2). Our analysis is similar to the linear discriminant analysis (LDA) performed in a previous study [4] at a fixed time point; here, we applied LDA to time points across a trial.
For training, 90% of the blocks were randomly selected to train the decoder, and the remaining 10% of the blocks were used for testing. This was repeated 20 times. The input data to the LDA were the reversal probability activity of Δk trials around the reversal trial, i.e., k ∈ [krev − Δk, …, krev + Δk] with Δk = 10. At each trial k, we took the activity vector on the time interval [t0 − Δt, t0 + Δt] around time t0 with Δt = 160 ms. The target output of the LDA was a one-hot vector ytarget, whose element was 1 at the reversal trial krev and 0 at other trials. The following input-output pair shows the dataset of a block used for training:
(X, ytarget),   X = [xrev^{krev−Δk}, …, xrev^{krev+Δk}].
Here, xrev^k ∈ ℝ^{Tdec}, where Tdec = 2Δt/Δh + 1 denotes the number of time points around t0 with time increment Δh = 20 ms, and ytarget ∈ ℝ^{Kdec} is the one-hot vector, where Kdec = 2Δk + 1 denotes the number of trials around the reversal trial. As mentioned above, this analysis was repeated for time points t0 across a trial.
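For illustration, the decoder can be implemented with an off-the-shelf LDA, as in the sketch below; scikit-learn and the per-trial binary labeling (equivalent to the one-hot target described above) are assumptions, not the study's own implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_reversal_decoder(X_train, y_train):
    """X_train : (n_trials_total, T_dec) activity snippets around t0, pooled over training blocks
    y_train : (n_trials_total,) label, 1 for the behavioral reversal trial and 0 otherwise."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    return lda

def decode_reversal_trial(lda, X_block):
    """X_block : (K_dec, T_dec) snippets from one held-out block.
    The decoded reversal trial is the trial with the highest posterior
    probability of the 'reversal' class."""
    p_reversal = lda.predict_proba(X_block)[:, 1]
    return int(np.argmax(p_reversal))
```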
Acknowledgements
This research was supported by the Intramural Research Program of the National Institutes of Health: the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and the National Institute of Mental Health (NIMH).
Supplementary material
References
- [1] Perseveration in extinction and in discrimination reversal tasks following selective frontal ablations in Macaca mulatta. Physiology & Behavior 4:163–171
- [2] Reversal learning and dopamine: a Bayesian perspective. Journal of Neuroscience 35:2407–2416
- [3] Orbitofrontal circuits control multiple reinforcement-learning processes. Neuron 103:734–746
- [4] Prefrontal cortex predicts state switches during reversal learning. Neuron 106:1044–1054
- [5] Two types of locus coeruleus norepinephrine neurons drive reinforcement learning. bioRxiv 2022–12
- [6] Serotonin in the orbitofrontal cortex enhances cognitive flexibility. bioRxiv
- [7] Probabilistic decision making by slow reverberation in cortical circuits. Neuron 36:955–968
- [8] Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503:78–84
- [9] Discrete attractor dynamics underlies persistent activity in the frontal cortex. Nature 566:212–217
- [10] Transitions in dynamical regime and neural mode underlie perceptual decision-making. bioRxiv 2023–10
- [11] Learning to predict by the methods of temporal differences. Machine Learning 3:9–44
- [12] A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory :64–99
- [13] 2017 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE :1–5
- [14] The role of frontal cortical and medial-temporal lobe brain areas in learning a Bayesian prior belief on reversals. Journal of Neuroscience 35:11751–11760
- [15] Bayesian online learning of the hazard rate in change-point problems. Neural Computation 22:2452–2476
- [16] A recurrent network mechanism of time integration in perceptual decisions. Journal of Neuroscience 26:1314–1328
- [17] How the brain keeps the eyes still. Proceedings of the National Academy of Sciences 93:13339–13344
- [18] Neural dynamics of choice: single-trial analysis of decision-related activity in parietal cortex. Journal of Neuroscience 32:12684–12701
- [19] Neural underpinnings of the evidence accumulator. Current Opinion in Neurobiology 37:149–157
- [20] Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503:78–84
- [21] An approximate line attractor in the hypothalamus encodes an aggressive state. Cell 186:178–193
- [22] A motor cortex circuit for motor planning and movement. Nature 519:51–56
- [23] Robust neuronal dynamics in premotor cortex during motor planning. Nature 532:459–464
- [24] Attractor dynamics gate cortical information flow during decision-making. Nature Neuroscience 24:843–850
- [25] Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science 349:184–187
- [26] Discrete stepping and nonlinear ramping dynamics underlie spiking responses of LIP neurons during decision-making. Neuron 102:1249–1258
- [27] Change point estimation by the mouse medial frontal cortex during probabilistic reward learning. bioRxiv 2022–5
- [28] Selective engagement of prefrontal VIP neurons in reversal learning. bioRxiv 2024–4
- [29] Distinct roles of parvalbumin- and somatostatin-expressing neurons in flexible representation of task variables in the prefrontal cortex. Progress in Neurobiology 187
- [30] Reinforcement learning detuned in addiction: integrative and translational approaches. Trends in Neurosciences 45:96–105
- [31] Orbitofrontal cortex, decision-making and drug addiction. Trends in Neurosciences 29:116–124
- [32] Orbitofrontal cortex as a cognitive map of task space. Neuron 81:267–279
- [33] Reinforcement-learning in fronto-striatal circuits. Neuropsychopharmacology 47:147–162
- [34] A theory of memory retrieval. Psychological Review 85
- [35] The effect of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision 5:1–1
- [36] Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology 86:1916–1936
- [37] A role for neural integrators in perceptual decision making. Cerebral Cortex 13:1257–1269
- [38] A comparison of macaque behavior and superior colliculus neuronal activity to predictions from models of two-choice decisions. Journal of Neurophysiology 90:1392–1407
- [39] The dynamics and geometry of choice in premotor cortex. bioRxiv
- [40] Bump attractor dynamics in prefrontal cortex explains behavioral precision in spatial working memory. Nature Neuroscience 17:431–439
- [41] Cell-type-specific population dynamics of diverse reward computations. Cell 185:3568–3587
- [42] Toroidal topology of population activity in grid cells. Nature 602:123–128
- [43] A unified theory for the computational and mechanistic origins of grid cells. Neuron 111:121–137
- [44] Mechanisms underlying the neural computation of head direction. Annual Review of Neuroscience 43:31–54
- [45] Advances in Neural Information Processing Systems. Curran Associates, Inc. 33:4584–4596
- [46] Brain-wide representations of prior information in mouse decision-making. bioRxiv 2023–7
- [47] Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience 21:860–868
- [48] Activity patterns of serotonin neurons underlying cognitive flexibility. eLife 6
- [49] Temporal derivative computation in the dorsal raphe network revealed by an experimentally driven augmented integrate-and-fire modeling framework. eLife 12
- [50] Dopamine and serotonin interplay for valence-based spatial learning. Cell Reports 39
- [51] Serotonin predictively encodes value. bioRxiv 2023–9
- [52] Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science 274
Copyright
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.