Abstract
In probabilistic reversal learning, the choice option yielding reward at higher probability switches at a random trial. To perform optimally in this task, one has to accumulate evidence across trials to infer the probability that a reversal has occurred. In this study, we investigated how this reversal probability is represented in cortical neurons by analyzing neural activity in the prefrontal cortex of monkeys and in recurrent neural networks trained on the task. We found that neural trajectories encoding reversal probability had substantial dynamics associated with intervening behaviors necessary to perform the task. Furthermore, the neural trajectories were translated systematically in response to whether outcomes were rewarded, and their position in the neural subspace captured information about reward outcomes. These findings suggested that separable dynamic trajectories, rather than fixed points on a line attractor, provide a better description of the neural representation of reversal probability. Near the behavioral reversal, in particular, the trajectories shifted monotonically across trials with stable ordering, representing varying estimates of reversal probability around the reversal point. Perturbing the neural trajectory of trained networks biased the trial at which the behavioral reversal occurred, demonstrating the role of reversal probability activity in decision-making. In sum, our study shows that cortical neurons encode reversal probability in a family of dynamic neural trajectories that accommodate flexible behavior while maintaining separability to represent distinct probabilistic values.
Introduction
To survive in a dynamically changing world, animals must interact with the environment and learn from their experience to adjust their behavior. Reversal learning has been used to assess the ability to adapt one's behavior in such environments [1–6]. For instance, in two-armed bandit tasks with probabilistic reward, the subject learns from initial trials that one option has higher reward probability than the other. When the reward probabilities of the two options are reversed at a random trial, the subject must learn to reverse its preferred choice to maximize reward outcome. In these tasks, there is uncertainty in when to reverse one's choice, as reward is received stochastically even when the less favorable option is chosen. Therefore, it is essential that reward outcomes are integrated over multiple trials before the initial choice preference is reversed. Although neural mechanisms for accumulating evidence within a trial have been studied extensively [7–10], it remains unclear if a recurrent neural circuit uses a similar neural mechanism for accumulating evidence across multiple trials while performing intervening behavior during each trial. In this study, we merged two classes of computational models, behavioral and neural, to investigate the neural basis of multi-trial evidence accumulation. The behavioral models capture the subject's strategies for performing the reversal learning task. For instance, model-free reinforcement learning (RL) [11–13] assumes that the subject learns only from choices and reward outcomes without specific knowledge about the task structure. Model-based Bayesian inference [2, 14, 15], in contrast, assumes that the task structure is known to the subject, and one can infer reversal points statistically, resulting in abrupt switches in choice preference. Model-based and model-free RL models are formal models that do not specify an implementation in a network of neurons. On the other hand, neural models implemented with recurrent neural networks (RNNs) can be trained to use recurrent activity to perform the reversal learning task. In particular, attractor dynamics, in which the network state moves towards discrete [9, 16] or along continuous [8, 17] attractor states, have been studied extensively as a potential neural mechanism for decision-making and evidence accumulation [18, 19].
Here, we trained RNNs that learned from a Bayesian inference model to mimic the behavioral strategies of monkeys performing the reversal learning task [2, 4]. We found that, in the prefrontal cortex of monkeys and in trained RNNs, neural activity during a baseline hold period encoded reversal probability in a one-dimensional subspace, similar to a line attractor. However, intervening behavior during a trial, including making decisions and receiving feedback, produced substantial non-stationary neural dynamics. This observation made the attractor dynamics, which require the network state to stay close to attractor states [8, 9, 16, 17], ill-suited for explaining the neural activity associated with evidence accumulation in reversal learning.
Instead, we found that reversal probability was encoded in dynamic neural trajectories that shifted systematically across trials. Reward outcome pushed the entire trajectory in a positive (without reward) or negative (with reward) direction, separating trajectories of adjacent trials. Moreover, integrating reward outcomes across trials captured the position of a trajectory. These results suggested a neural mechanism where separable dynamic trajectories encode accumulated evidence. Around the behavioral reversal trial, reversal probabilities were represented by a family of rank-ordered trajectories that shifted monotonically. Perturbation experiments in trained RNNs demonstrated a causal link between reversal probability activity and choice outcomes.
In sum, our results show that, in a probabilistic reversal learning task that requires evidence integration across trials and execution of intervening behavior in-between trials, reversal probability is encoded in separable dynamic trajectories that allow for temporally flexible representation of accumulated evidence.
Results
1 Trained RNN’s choices are consistent with monkey behavior
In the reversal learning task, in each trial, two options were available. The subject (either the monkeys or the network) chose one of the options. Rewards were delivered stochastically. The initial high-value option was rewarded 70% of the time when chosen, and the initial low-value option was rewarded 30% of the time when chosen. The task was executed in blocks of trials. On a randomly chosen trial, the reward probabilities of the two options were switched. Because reward delivery was stochastic, the agent had to infer the reversal by accumulating evidence that a reversal had occurred. In this study, we will focus on this reversal inference process.
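To make the task structure concrete, below is a minimal sketch of one block's reward schedule in Python/NumPy. The function names, the block length of 36 trials, and the sampling window for the scheduled reversal are illustrative assumptions; only the 70%/30% reward probabilities and the random reversal are taken from the task description.

```python
import numpy as np

def make_block(n_trials=36, p_high=0.7, p_low=0.3, rng=None):
    """Generate one block: a scheduled reversal trial and per-trial reward
    probabilities for the two options (0 and 1)."""
    rng = np.random.default_rng() if rng is None else rng
    # The reward schedule reverses at a randomly chosen trial near the middle of the block
    # (the exact sampling window is an assumption for illustration).
    reversal = int(rng.integers(n_trials // 2 - 5, n_trials // 2 + 5))
    p_reward = np.full((n_trials, 2), p_low)
    p_reward[:reversal, 0] = p_high   # option 0 starts as the high-value option
    p_reward[reversal:, 1] = p_high   # after the reversal, option 1 is high-value
    return reversal, p_reward

def deliver_reward(choice, trial, p_reward, rng=None):
    """Stochastic reward: the chosen option pays off with its current reward probability."""
    rng = np.random.default_rng() if rng is None else rng
    return bool(rng.random() < p_reward[trial, choice])
```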
We began by training an RNN on the reversal learning task and comparing the performance of the network to the monkeys. This allowed us to study the solutions adopted by the network and to generate hypotheses that we could test in neural data. Therefore, we trained an RNN to choose from two options in each trial when triggered by a cue. Following the choice, feedback was provided to the network, signaling the choice it made and the reward outcome (Fig. 1A). The reward schedule was probabilistic and identical to the task monkeys performed. This reward schedule was reversed at a random trial, and the RNN learned to reverse its decision by mimicking the outputs of a Bayesian inference model that captures the monkey's reversal behavior (see Methods Section 2.3 for the RNN training scheme). In a typical block consisting of 36 trials, a trained RNN selected the initial high-reward option, despite occasional unrewarded trials, but abruptly switched its choice when consecutive no-reward trials persisted (Fig. 1B).
The reversal behavior of trained RNNs was similar to the monkey’s behavior on the same task. RNNs selected the high reward option with high probability before the behavioral reversal, at which time they abruptly switched their choice (Fig. 1C). The behavioral reversal was preceded by a gradually increasing number of no-reward trials (Fig. 1D). The distribution of behavioral reversal trials (i.e., trial at which preferred choice was reversed) relative to the scheduled reversal trial (i.e., trial at which reward schedule was reversed) was similar to the distribution of monkey’s reversal trials (Fig. 1E).
2 Task-relevant neural activity evolves dynamically
Next we examined the temporal dynamics of task-relevant neural activity, in particular activity encoding the choice and reversal probability. This analysis focused on trials around the reversal point in each block. To capture task-relevant neural activity, we first identified population vectors that encoded the task variables using a method called targeted dimensionality reduction [20]. It regresses the activity of individual neurons onto task variables and, for each task variable, identifies the population vector of regression coefficients with maximal norm. Then, neural activity representing the task variable is obtained by projecting the population activity onto the identified task vectors (see Methods Section 3 for details).
When averaged over blocks, the neural activity associated with choices and inferred reversal probability, denoted as xchoice and xrev, respectively, produced non-stationary dynamics in each trial (Fig. 2A). Their activity level reached a maximum around the time of cue onset (black squares in Fig. 2A), when the monkey and RNN were about to make a choice. These rotational neural dynamics were found both in the prefrontal cortex (PFC) of monkeys and in trained RNNs.
The orientation of rotational trajectories shifted as trials progressed, indicating systematic changes in the choice and reversal probability activity across trials. When the task-relevant activity at cue onset was analyzed, we found that reversal probability activity, xrev, peaked at the reversal trial in the PFC and RNN (Fig. 2B). On the other hand, choice activity, xchoice, decreased gradually over trials reflecting the changes in choice preference (Fig. 2C). The inverted-V shape of xrev and the monotonic decrease in xchoice over trials explained the counter-clockwise shift in the rotational trajectories observed in the two-dimensional phase space (Fig. 2A).
3 Integration of reward outcomes drives reversal probability activity
We asked if the changes in reversal probability activity xrev across trials, as shown in Fig. 2B, can be explained by integrating reward outcomes. In particular, we wondered if the reward outcomes from each trial would drive the shifts in reversal probability activity. To investigate this question, we set up a reward integration equation that predicts the next trial's reversal probability activity from the current trial's reversal probability activity and reward outcome, xrev^{k+1} = xrev^k + Δk, therefore predicting across-trial reversal probability by integrating reward outcomes. Here, xrev^k is the reversal probability activity at the time of cue onset tonset at trial k, and Δk is an estimate of the shift in reversal probability activity driven by trial k's reward outcome (negative if rewarded and positive if not rewarded; see Methods Section 4 for details).
The predicted reversal probability activity was in good agreement with the actual activity of PFC and RNN (example blocks shown in Figs. 3A, C; prediction accuracy of all blocks shown in Fig. 3E). Moreover, we found that xrev, the neural activity encoding reversal probability, responded to reward outcomes consistently with how reversal probability itself would be updated. In other words, receiving no reward at trial k increased the reversal probability activity in the next trial k + 1 (Figs. 3B, D; no reward), while receiving a reward at trial k decreased it (Figs. 3B, D; reward). At the behavioral reversal trial (k = 0), however, the reversal probability activity in the following trial (k = 1) decreased regardless of the reward outcome at the reversal trial. When the reward integration equation was fitted to the reversal probability activity at other time points (i.e., Δk was estimated at each time t), the prediction accuracy remained stable in time (Fig. 3F).
These findings show that neural activity encoding reversal probability exhibits structured responses to reward outcomes, consistent with how reversal probability itself would respond: increase with no reward and decrease with reward. In addition, the reversal probability activity can be predicted by integrating reward outcomes, supporting that it encodes accumulation of decision-related evidence.
4 Dynamic neural trajectories encoding reversal probability are separable
Previous works have shown that accumulation of decision-related evidence can be represented as a line attractor in a stable subspace of network activity [20, 21]. One might hypothesize that reversal probability in the reversal learning task could be similarly characterized by such line attractor dynamics. A direct application of the line attractor model would imply that, across a trial when no decision-related evidence is presented, the reversal probability activity should remain constant (Fig. 4A, line attractor).
However, we found that there was substantial activity in this neural subspace (Fig. 2A). In particular, the non-stationary neural activity was associated with intervening behaviors during a trial. The time derivative of reversal probability activity increased rapidly at the time of cue onset, when a decision is made, followed by a sharp decrease until the time of reward (dxrev/dt in Fig. 4B). So, instead of a static view of evidence accumulation, we explored the hypothesis that different levels of reversal probability could be encoded in dynamic neural trajectories (Fig. 4A, dynamic trajectory).
To encode distinct values of reversal probability in dynamic trajectories, the trajectories representing the values must remain separated as they evolve in time. We compared trajectories at adjacent trials to examine if the reward outcome drives the next trial’s trajectory away from the current trial’s trajectory, thus separating them, and, if so, to what extent the trajectories are separated.
Analysis of PFC activity showed that not receiving a reward increased the next trial's trajectory xrev^{k+1}(t) compared to the current trial's trajectory xrev^k(t). Within a trial, this positive shift was observed over the entire trial duration until the next trial's reward was revealed, as shown in the difference of adjacent trials' trajectories (Fig. 4C, R−). Moreover, across trials, the same trend was observed in all the trials except at the behavioral reversal trial, at which the reversal probability activity reached its maximum value and decreased in the following trial (Fig. 4D, R−). On the other hand, when a reward was received, the next trial's trajectory was decreased compared to the current trial's trajectory. This negative shift persisted until the next trial's reward, similarly to the case when no reward was received (Fig. 4C, R+). Across trials, the same trend was observed in all the trials except at the trial preceding the behavioral reversal trial, at which the trajectory increased to the maximum value at the reversal trial (Fig. 4D, R+). Additional analysis of R− and R+ beyond the next trial's reward time can be found in Supp. Figure S1.
We examined what type of activity mode the dynamic trajectories exhibited when separating away from the previous trial’s trajectory. Ramping activity is often observed in cortical neurons of animals engaged in decision-making [22–26]. We found that, when no rewards were received, trajectories were separated from the previous trial’s trajectory by increasing their ramping rates towards the decision time (dR−/dt > 0 in Fig. 4E). On the other hand, when rewards were received, trajectories were separated by decreasing their ramping rate (dR+/dt < 0 in Fig. 4E). The increase (or decrease) in the ramping rates was observed in consecutive no reward (or reward) trials around the reversal trial (Fig. 4E, left).
Consistent with the PFC activity, the trained RNN exhibited similar activity responses to reward outcomes: neural trajectories encoding reversal probability increased when reward was not received and decreased when reward was received. The shift in trajectories persisted throughout the trial duration (Fig. 4G), and ramping rates changed in agreement with the PFC findings (Fig. 4H).
Since the dynamics of trained RNNs are fully known, we sought to examine the circuit dynamic motif that separates neural trajectories. We projected the differential equation governing the network dynamics onto a one-dimensional subspace and analyzed the contribution of recurrent and external inputs to the reversal probability dynamics, dxrev/dt = xrec + xext (see Methods Section 1 for details). We found that the external input xext was positive, while the recurrent input xrec was negative and curtailed the external input (Fig. 4F, external and recurrent). When no reward was received, xext and xrec were both amplified by approximately the same factor, resulting in an increased total input, xrec + xext → γnorew (xrec + xext) with γnorew > 1 (Fig. 4F, amplification). On the other hand, when reward was received, they were both suppressed, resulting in a decreased total input with γreward < 1. This suggested a circuit dynamic motif in which external feedback balanced by recurrent inhibition drives the reversal probability dynamics. The total drive is amplified or suppressed, depending on reward outcomes, resulting in a trajectory that separates from the previous trial's trajectory.
In sum, our findings show that dynamic neural trajectories encoding reversal probability are separated from the previous trial’s trajectory in response to reward outcomes, allowing them to represent distinct values of reversal probability across a trial.
5 Monotonic shift of reversal probability trajectories across trials
So far, we showed that neural trajectories of two adjacent trials, encoding reversal probability, were separable. In this section, we investigated if trajectories exhibited systematic changes across multiple trials. Specifically, we quantified the mean behavior of trajectories in each trial (referred to as mean trajectory of a trial) and looked for consistent trends in the changes of mean trajectories across trials.
Since a mean trajectory was obtained by averaging over all reward outcomes, we compared how reward and no-reward blocks contributed to modifying the next trial's mean trajectory. This analysis amounted to comparing the weighted responses q+R+ and q−R− shown in Fig. 4D, where q+ and q− denote the fractions of reward and no-reward blocks at trial k (see Methods Section 4). Since q−R− > 0 and q+R+ < 0 throughout a trial, we flipped the sign of q+R+ to −q+R+ and compared the magnitudes of two positive traces, q−R− and −q+R+. We found that, before the behavioral reversal trial (k < 0), the contribution of no-reward was larger than that of reward. This is shown as q−R− lying above −q+R+ during a trial and across pre-reversal trials (see relative trials k = −5 to −1 in Fig. 5A). Their temporal averages over the trial duration also captured this finding: the temporal average of q−R− was larger than that of −q+R+ before the behavioral reversal trial (Fig. 5B, bottom). This analysis showed that the sum q−R− + q+R+, i.e., the difference of mean trajectories between two adjacent trials, stayed positive during the pre-reversal trials (Fig. 5B, top). We confirmed that the fraction of trials for which q−R− + q+R+ was positive was close to 0.8 in the pre-reversal phase (Fig. 5C, top).
The sum q−R− + q+R+ being positive meant that the next trial's mean trajectory was increased relative to the current trial's mean trajectory. Furthermore, the sum being positive across pre-reversal trials was equivalent to the mean trajectories increasing monotonically across trials towards the behavioral reversal trial (Fig. 5C, bottom). The monotonicity of trajectories implied that a topological structure was present in pre-reversal trajectories; namely, the rank order of the trajectories was preserved throughout the trial duration. Consistent with this observation, we found that the Spearman rank correlation of pre-reversal trajectories was stable in time (Fig. 5E, pre).
After the reversal trial (k ≥ 0), on the other hand, the contributions of no-reward and reward were the opposite of those in the pre-reversal trials. The traces of −q+R+ were positioned above q−R− (see relative trials k = 0 to 4 in Fig. 5A), and the temporal average of −q+R+ was larger than that of q−R− (Fig. 5B, bottom). This showed that q−R− + q+R+ was mostly negative during post-reversal trials. The fraction of trials for which q−R− + q+R+ was negative was close to 0.8 in the post-reversal phase (Fig. 5D, top). The negativity of q−R− + q+R+ across post-reversal trials meant that the post-reversal trajectories were monotonically decreasing (Fig. 5D, bottom). Similarly to the pre-reversal trajectories but in reversed order, the rank order of post-reversal trajectories was stable over the trial duration (Fig. 5E, post).
Consistent with the PFC findings, in the trained RNNs, the effects of reward outcomes on mean trajectories were characterized by the sum q−R− + q+R+ being positive before and negative after the reversal trial (Fig. 5F). Consequently, trained RNNs exhibited a monotonic increase and decrease in the pre- and post-reversal phases, respectively (Figs. 5G, H). Also, the rank order of trajectories was stable over the trial duration (Fig. 5I).
Our analyses show that the mean behavior of dynamic neural trajectories, encoding reversal probability, is to shift monotonically across trials near the behavioral reversal. It suggests that a family of graded neural trajectories, with a temporally stable rank order, could represent varying estimates of the probability that a reversal has occurred.
6 Perturbing neural activity encoding reversal probability biases choice outcomes
Next we turned to the RNN to see if we could perturb activity within the reversal probability space, and consequently perturb the network's choice preference. Previous experimental works demonstrated that perturbing neural activity of medial frontal cortex [27], specific cell types [28, 29] or neuromodulators [5, 6] affects the performance of reversal learning. In our study, the perturbation was tailored to lie within the reversal probability space by applying an external stimulus aligned with (v+) or opposite to (v−) the reversal probability vector. An external stimulus in a random direction was also applied as a control (vrnd). All the stimuli were applied before the time of choice at the reversal trial or at preceding trials (Fig. 6A).
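As a rough illustration, a perturbation of this kind can be implemented by adding a stimulus along the reversal probability vector to the network's external input during a window before the choice. The code below is a sketch under assumed variable names (I, v_rev, eps, t_window); it is not the exact protocol or parameterization used in the study.

```python
import numpy as np

def perturb_input(I, v_rev, eps, t_window, direction="aligned", rng=None):
    """Return a copy of the external input with a stimulus added along (v+),
    against (v-), or in a random control direction (vrnd) of the reversal
    probability vector during a time window before the choice.

    I        : (T, N) external input over one trial (time steps x neurons)
    v_rev    : (N,) vector spanning the reversal probability subspace
    eps      : stimulus amplitude
    t_window : slice of time steps during which the stimulus is applied
    """
    rng = np.random.default_rng() if rng is None else rng
    if direction == "aligned":            # v+
        v = v_rev
    elif direction == "opposite":         # v-
        v = -v_rev
    else:                                 # vrnd: random direction with matching norm
        v = rng.standard_normal(v_rev.shape)
        v *= np.linalg.norm(v_rev) / np.linalg.norm(v)
    I_pert = I.copy()
    I_pert[t_window] += eps * v           # broadcast over the perturbation window
    return I_pert
```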
We found that the deviation of perturbed reversal probability activity from the unperturbed activity peaked at the end of perturbation duration and decayed gradually (Fig. 6B, red solid). The perturbed choice activity, however, deviated more slowly and peaked during the choice duration (Fig. 6B, black solid). This showed that perturbation of the reversal probability activity had its maximal effect on the choice activity when the choice was made. The strong perturbative effects on the reversal probability and choice activity were not observed in the control (Fig. 6B, dotted).
The perturbation in the aligned (v+) and opposite (v−) directions shifted the reversal probability activity along the same directions as the perturbation vector, as expected (Fig. 6C, left). The choice activity, on the other hand, increased when the perturbation was in the opposite direction (v−) and decreased when the perturbation was in the aligned direction (v+) (Fig. 6C, right). This finding showed that the choice activity could be biased (1) towards pre-reversal choices if the perturbation decreases the reversal probability activity and (2) towards the post-reversal choices if the perturbation increases the reversal probability activity.
We further analyzed if perturbing within the reversal probability space could affect the choice outcomes, specifically the behavioral reversal trial. We found that the reversal trial was delayed when the v− stimulus was applied to reduce the reversal probability activity (Fig. 6D, left). The effect of the v− stimulus increased gradually with stimulus strength and was significantly stronger than the v+ or vrnd stimuli in delaying the reversal trial. Perturbation had the strongest effect when applied at the reversal trial, while perturbations on trials preceding the reversal showed appreciable but reduced effects (Fig. 6D, right). When the v+ stimulus was applied to trials preceding the reversal trial, the reversal was accelerated (Fig. 6E, left). The effect of the v+ stimulus also increased with stimulus strength and was significantly stronger than the v− or vrnd stimuli in accelerating the reversal trial (Fig. 6E, right).
We asked if perturbation of neural activity in PFC could exhibit similar responses. In other words, does an increase (or decrease) in reversal probability activity lead to a decrease (or increase) in choice activity in PFC? Although PFC activity was not perturbed by external inputs, we considered the residual activity of single trials, i.e., the deviation of single-trial neural activity around the trial-averaged activity, to be "natural" perturbation responses. We fitted a linear model to the residual activity of reversal probability and choice and found that they were strongly negatively correlated (i.e., negative slope in Fig. 6F) at the trial preceding the behavioral reversal trial. This analysis demonstrated the correlation between perturbation responses of reversal probability and choice activity. However, it remains to be investigated, through perturbation experiments, whether reversal probability activity is causally linked to choice activity in PFC and, moreover, to the animal's choice outcomes.
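This residual analysis can be sketched as follows (illustrative NumPy code with assumed array names): subtract the trial-averaged activity to obtain residuals, then fit a line to the paired residuals of reversal probability and choice activity.

```python
import numpy as np

def residual_slope(x_rev, x_choice):
    """x_rev, x_choice: (n_blocks,) single-trial activity at a fixed trial and time
    (e.g., the trial preceding the behavioral reversal), one value per block.
    Returns the slope of a linear fit to the mean-subtracted (residual) activity;
    a negative slope indicates anti-correlated residuals."""
    r_rev = x_rev - x_rev.mean()
    r_choice = x_choice - x_choice.mean()
    slope, _intercept = np.polyfit(r_rev, r_choice, deg=1)
    return slope
```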
Discussion
Reversal learning
Reversal learning has been a behavioral framework for investigating how the brain supports flexible behavior [1–6] and for elucidating neural mechanisms underlying mental health issues [30, 31]. It has been shown that multiple brain regions (cortical [3, 4, 14, 27–29, 32] and subcortical [3, 33]), neuromodulators [2, 5, 6] and different inhibitory neuron types [28, 29] are involved in reversal learning.
Our results
Despite these recent advances, the dynamics of neural activity in cortical areas during a reversal learning task have not been well characterized. In this study, we investigated how reversal probability is represented in cortical neurons by analyzing neural activity in the prefrontal cortex of monkeys and recurrent neural networks performing the reversal learning task. Reversal probability was encoded in dynamically evolving neural trajectories that shifted in response to reward outcomes. Neural trajectories were translated in the direction consistent with how reversal probability would be updated by reward outcomes, and their position could be estimated by integrating reward outcomes across trials. These suggested a neural mechanism where separable dynamic trajectories represent reversal probability by accumulating reward outcomes. Around the behavioral reversal, the average effects of reward outcomes became monotonic, resulting in graded neural representation of reversal probabilities. Perturbation experiments in trained networks demonstrated a potential causal link between reversal probability activity and choice outcomes.
Attractor dynamics
RNNs with attractor dynamics have been investigated in various contexts as a neural implementation of normative models of decision-making and evidence integration [34–38]. One perspective is to consider decision variables as discrete or continuous attractor states of an RNN. Then, the network activity converges to an attracting state as a decision is made. Biologically plausible network models [7, 16] and neural recordings in cortical areas have been shown to exhibit discrete [9, 10, 39] and continuous [40] attractor dynamics. Another perspective, more closely related to our study, is to consider evidence integration as a movement of network state along a one-dimensional continuous attractor, as demonstrated in [8, 21, 41] (see also continuous attractor dynamics in spatial mapping [17, 42–44]).
In most of these studies, decision-related evidence was presented without significant interruption until the decision point [9, 10, 20, 39]. However, this was not the case in a reversal learning task with probabilistic rewards, as reward outcomes were revealed intermittently over multiple trials while intervening behavior had to be performed between trials. We showed that such multi-trial evidence integration promoted substantial non-stationary activity in the neural subspace encoding reversal probability. Therefore, the continuous attractor dynamics, in which the network state stays close to the attracting states, did not fully account for the observed neural dynamics. Instead, our findings suggest that separable dynamic trajectories could serve as a neural mechanism for representing accumulated evidence in a temporally flexible way.
Related work
Recent studies showed that intervening behaviors, such as introducing an intruder [21] or accumulating reward across trials [41], could produce neural trajectories that deviate from and retract to a line attractor. In our study, we focused on characterizing the neural representation of reversal probability but did not investigate it from a dynamical systems perspective. It remains future work to characterize whether and how the separable dynamic trajectories observed in our study could be incorporated into the continuous attractor model and to compare this with previous works [21, 41].
In a related work [45], RNNs were trained to perform a change point detection task designed by the International Brain Laboratory [46]. The authors showed that trained RNNs exhibited behavioral outputs consistent with an ideal Bayesian observer, as found in our study. However, their trained RNNs exhibited line attractor dynamics, in contrast to ours. One possible reason for this discrepancy is that their network model stepped through only a few time points in a trial, which limited the range of temporal dynamics the RNNs could exhibit. This suggests that the setup of the task that RNNs learn can shape the trained RNN dynamics. Moreover, it remains to be investigated whether such attractor dynamics are present in neural recordings from mice performing the change point detection task.
Although RNNs in our study were trained via supervised learning, animals learn a reversal learning task from reward feedback, making it into a reinforcement learning (RL) problem. Neuromodulators play a key role in mediating RL in the brain. In a recent study, dopamine-based RL was used to train artificial RNNs to conduct reversal learning tasks. It was shown that neural activity in RNNs and mice performing the same tasks were in good agreement [47]. In addition, projections of serotonin from dorsal raphe nuclei [6, 48] and norepinephrine from the locus coeruleus [5] to the cortical areas were shown to be involved in reversal learning. Further studies with biologically plausible network models including neuromodulatory effects [49, 50] or formal RL theories incorporating neuromodulators [51] could provide further insights into the role of neuromodulators in reversal learning.
Conclusion
Our findings show that, when performing a reversal learning task that requires evidence integration across trials, a cortical circuit adopts a dynamic neural representation of accumulated evidence to accommodate non-stationary activity associated with intervening behaviors. Such a neural mechanism demonstrates the temporal flexibility of cortical computation and opens the opportunity to extend existing neural models of evidence accumulation by augmenting their temporal dynamics.
Methods
1 Recurrent neural network
Network model
We trained a recurrent neural network with purely inhibitory synaptic connections. A baseline excitatory external input was applied to neurons, without which the network activity became quiescent. Such an inhibitory network operated in a balanced regime, where the recurrent inhibitory inputs were balanced by the external excitatory inputs [52]. Neurons were connected sparsely with connection probability p. Throughout network training, the signs of synaptic weights were preserved, resulting in a trained network that had only inhibitory synaptic connections.
The network dynamics were governed by the following equation
du/dt = −u + Wrec ϕ(u) + Ibase + Icue + Ifeedback,
and the network readout was
z = wout · ϕ(u).
Here, u ∈ ℝ^N is the neural activity of the population of N neurons, and Wrec is an N × N recurrent connectivity matrix with inhibitory synaptic weights: (Wrec)ij is the connection from neuron j to neuron i. The activation function was sigmoidal, ϕ(x) = 1/(1 + exp[ax + b]), and was applied to u elementwise in ϕ(u). The baseline input Ibase was constant in time and the same for all neurons, the cue Icue was turned on to signal the RNN to make a choice, and the feedback Ifeedback provided information about the previous trial's choice and reward outcome (see Table 1).
The duration of a trial was T = 500 ms. The feedback Ifeedback was applied on the time interval [0, Tfeedback] with Tfeedback = 300 ms, and the cue Icue was applied on a subsequent time interval within the trial. The network choice was defined using the average of the readout z over a time window following cue onset.
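The following is a minimal forward-Euler sketch of the dynamics and readout written above. The integration time step, parameter values, and the exact form of the readout are assumptions for illustration.

```python
import numpy as np

def phi(x, a=1.0, b=0.0):
    """Sigmoidal activation, phi(x) = 1 / (1 + exp[a x + b]), applied elementwise."""
    return 1.0 / (1.0 + np.exp(a * x + b))

def simulate_trial(u0, W_rec, w_out, I, dt=1.0):
    """Forward-Euler integration of  du/dt = -u + W_rec phi(u) + I(t).

    u0    : (N,) initial state
    W_rec : (N, N) inhibitory recurrent weights (non-positive entries)
    w_out : (N,) readout weights
    I     : (T, N) external input (baseline + cue + feedback) at each time step
    Returns the state trajectory u(t) and the readout z(t)."""
    u = u0.copy()
    us, zs = [], []
    for I_t in I:
        u = u + dt * (-u + W_rec @ phi(u) + I_t)
        us.append(u.copy())
        zs.append(float(w_out @ phi(u)))
    return np.array(us), np.array(zs)
```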
Reduced model
One-dimensional reduction of the network dynamics in a subspace defined by a task vector, v, was derived as follows (see Fig. 4). The projection of network activity onto the task vector was
x = v · u.
Then, the dynamics of the projected activity are governed by
dx/dt = xrec + xext,
where
xrec = v · (−u + Wrec ϕ(u)),   xext = v · (Ibase + Icue + Ifeedback).
Here xrec includes both the decay and recurrent terms, and xext accounts for all external inputs I = Ibase + Icue + Ifeedback.
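A sketch of this one-dimensional reduction, under the dynamics written above: projecting both sides onto the task vector v separates the decay-plus-recurrent contribution from the external drive.

```python
import numpy as np

def project_dynamics(u, W_rec, I_t, v, phi):
    """Decompose the instantaneous drive of x = v . u into recurrent and external parts,
    so that  dx/dt = x_rec + x_ext  under  du/dt = -u + W_rec phi(u) + I_t."""
    x_rec = float(v @ (-u + W_rec @ phi(u)))   # decay + recurrent inhibition
    x_ext = float(v @ I_t)                     # baseline + cue + feedback input
    return x_rec, x_ext
```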
2 Reversal learning task
Overview. Each block consisted of T = 24 trials during network training. The reversal trial r was sampled randomly and uniformly from the 10 trials around the mid-trial.
The network made a choice in each trial: A or B. To model which choice was rewarded, we generated a “rewarded” choice for each trial. One of the choices was more likely to be rewarded than the other. The network’s choice was compared to the rewarded choice, and the network received a feedback that signaled its choice and reward outcome (e.g., chose A and received a reward). The option that yielded higher reward prior to the reversal trial was switched to the other option at the reversal trial.
To train the network to reverse its preferred choice, we used the output of an ideal Bayesian observer model as the teaching signal. Specifically, we first inferred the scheduled reversal trial (i.e., the trial at which reward probability switched) using the Bayesian model. Then, the network was trained to flip its preferred choice a few trials after the inferred scheduled reversal trial, such that the network's behavioral reversal trial occurred a few trials after the scheduled reversal trial.
Note that, although we refer to “rewarded” choices, there were no actual rewards in our network model. The “rewarded” choices were set up to define feedback inputs that mimic the reward outcomes monkey received.
2.1 Experiment variables
The important variables for training the RNN were the network choice, the rewarded choice and the feedback.
Network choice. To define the network choice, we symmetrized the readout, zsym = (z, −z), and computed its log-softmax, (z − log s, −z − log s), where s = e^z + e^−z. The network choice was the option with the larger softmax probability,
choice = A if z > 0, and B otherwise,
where z denotes the trial-averaged readout over the choice window.
Rewarded choice. To model stochastic rewards, rewarded choices were generated probabilistically for each trial k: the rewarded choice was the current high-value option with target probability p = 0.7 and the other option with probability 1 − p = 0.3.
The reversal of reward schedule was implemented by switching the target probability at the scheduled reversal trial of the block, denoted by rsch.
Feedback. We considered that a reward was delivered when the network choice agreed with the rewarded choice, and no reward was delivered when they disagreed. This led to the four types of feedback inputs shown in Table 1.
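A compact sketch of these variables is given below. The argmax choice rule and the function names are assumptions; the 70%/30% target probability and the four feedback types follow the task description.

```python
import numpy as np

def network_choice(z_mean):
    """Choice from the (trial-averaged) readout z: symmetrize to (z, -z),
    apply softmax, and take the more probable option (0 = A, 1 = B)."""
    z_sym = np.array([z_mean, -z_mean])
    p = np.exp(z_sym) / np.exp(z_sym).sum()       # softmax over the two options
    return int(np.argmax(p))

def rewarded_choice(trial, reversal_trial, p=0.7, rng=None):
    """Rewarded option on this trial: the current high-value option with target
    probability p, the other option otherwise; identities swap at the reversal."""
    rng = np.random.default_rng() if rng is None else rng
    high = 0 if trial < reversal_trial else 1
    return high if rng.random() < p else 1 - high

def feedback_type(choice, rewarded):
    """One of the four feedback inputs (cf. Table 1): which option was chosen,
    and whether the choice matched the rewarded option."""
    return choice, choice == rewarded
```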
2.2 Bayesian inference model
Here we formulate Bayesian models that infer the scheduled reversal trial or the behavior reversal trial.
Ideal observer model
The ideal observer model, developed previously [2, 4], inferred the scheduled reversal trial and assumed that (a) the target probability was known and (b) it switched at the scheduled reversal trial (see Section 2.1).
The data available to the ideal observer were the choices yk ∈ {A, B} and the reward outcomes zk ∈ {0, 1} at all trials k ∈ [1, T]. We inferred the posterior distribution of the scheduled reversal trial r ∈ [1, T]. By Bayes' rule,
p(r | y1:t, z1:t) = p(y1:t, z1:t | r) p(r) / p(y1:t, z1:t).
We evaluated the posterior distribution of r when data were available up to any trial t ≤ T. The likelihood function fIO(r) = p(y1:t, z1:t | r) of the ideal observer was defined by
fIO(r) ∝ ∏_{k=1}^{t} p(zk | yk, r).
For k < r, p(zk = 1 | yk) equals 0.7 if yk is the initial high-value option and 0.3 otherwise. For k ≥ r, these reward probabilities are reversed.
To obtain the posterior distribution of r, the likelihood function fIO(r) was evaluated for all r ∈ [1, t], assuming a flat prior p(r) and normalizing by the choice and reward data p(y1:t, z1:t).
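The posterior can be computed with a brute-force sweep over candidate reversal trials, as sketched below (NumPy; the 0/1 option coding and the function name are illustrative, and the likelihood is the reconstructed form above).

```python
import numpy as np

def ideal_observer_posterior(choices, rewards, p_high=0.7, p_low=0.3):
    """Posterior over the scheduled reversal trial r = 1..t, given choices y_1:t and
    reward outcomes z_1:t, with a flat prior.

    choices : (t,) array with 0 = initially high-value option, 1 = the other option
    rewards : (t,) array with 1 = rewarded, 0 = not rewarded"""
    t = len(choices)
    log_lik = np.zeros(t)
    for r in range(1, t + 1):
        ll = 0.0
        for k in range(t):
            high = 0 if (k + 1) < r else 1            # high-value option before/after r
            p_rew = p_high if choices[k] == high else p_low
            ll += np.log(p_rew if rewards[k] else 1.0 - p_rew)
        log_lik[r - 1] = ll
    post = np.exp(log_lik - log_lik.max())            # flat prior; subtract max for stability
    return post / post.sum()
```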
Behavioral model
To infer the trial at which choice reversed, i.e., behavior reversal, we used a likelihood function that assumed the preferred choice probability switched at the behavior reversal. Here, the reward schedule was not known.
The data available to the behavioral model were the choices yk ∈ {A, B} at all trials k ∈ [1, T]. We inferred the posterior distribution of the behavioral reversal trial r ∈ [1, T]. By Bayes' rule,
p(r | y1:t) = p(y1:t | r) p(r) / p(y1:t).
The likelihood function for the behavioral model was
fB(r) = p(y1:t | r) = ∏_{k=1}^{t} p(yk | r).
For k < r, the preferred-choice probability favors the initially preferred option, and for k ≥ r, it favors the other option.
To obtain the posterior distribution of r, we assumed flat prior p(r), as in the ideal observer, and normalized by the choice data p(y1:t).
2.3 Training scheme
Overview. The ideal observer successfully inferred the scheduled reversal trial, which occurred randomly around the mid-trial. To learn to switch its preferred choice, the network was trained to learn from the scheduled reversal trials inferred by the ideal observer. In other words, in a block consisting of T trials, the network choices and reward outcomes were fed into the ideal observer model to infer the randomly chosen scheduled reversal trial. Then, the network was trained to switch its preferred choice a few trials after the inferred reversal trial. This delay of the behavioral reversal relative to the scheduled reversal was observed in the monkeys' reversal behavior [4] and in a running estimate of the maximum a posteriori (MAP) reversal trial (see Step 3 below). As the inferred scheduled reversal trial varied across blocks, the network learned to reverse its choice in a block-dependent manner.
Below we describe the specific steps taken to train the network.
Step 1. Simulate the network and store the network choices and reward outcomes.
Step 2. Apply the ideal observer model to network’s choice and reward data to infer the scheduled reversal.
Step 3. Identify the trial t* at which network choice should be reversed.
The main observation is that the running MAP estimate of the scheduled reversal trial converges a few trials past the MAP estimate itself. In other words, let r̂(t) = argmax_r p(r | y1:t, z1:t) denote the running MAP estimate given data up to trial t; then r̂(t) = r̂(T) for all t ≥ t*, where the convergence trial t* occurs a few trials after r̂(T). The network choice sequence was reversed at t* (a minimal sketch of this step is given after Step 6 below).
Step 4. Construct the choice sequences the network will learn.
Step 5. Define the loss function of a block.
Step 6. Train the recurrent connectivity weights Wrec and the readout weights wout with backpropagation using the Adam optimizer with learning rate 10^−2. The learning rate was decayed by a factor of 0.9 every 3 epochs. The batch size (i.e., the number of networks trained) was 256. Training continued until the fraction of rewarded trials was close to the reward probability p of the preferred option.
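A minimal sketch of Step 3, using the ideal observer posterior sketched in Section 2.2: track the running MAP estimate of the scheduled reversal trial as data accumulate, and take t* as the trial from which that estimate no longer changes. The convergence criterion and function names below are assumptions made for illustration.

```python
import numpy as np

def reversal_target(choices, rewards, posterior_fn):
    """Step 3 (sketch): the trial t* at which the running MAP estimate of the
    scheduled reversal trial stops changing; the target choice sequence is
    reversed from t* onward.

    posterior_fn(choices[:t], rewards[:t]) returns the posterior over r
    given data up to trial t (e.g., ideal_observer_posterior above)."""
    n = len(choices)
    running_map = [int(np.argmax(posterior_fn(choices[:t], rewards[:t]))) + 1
                   for t in range(1, n + 1)]
    for t_star in range(n):
        if len(set(running_map[t_star:])) == 1:   # MAP estimate is stable from here on
            return t_star
    return n - 1
```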
3 Targeted dimensionality reduction
Targeted dimensionality reduction (TDR) identifies population vectors that encode task variables explicitly or implicitly utilized in the experiment the subject or RNN performs [20]. In this study, we were interested in identifying population vectors that encode choice preference and reversal probability. Once those task vectors were identified, we analyzed the neural activity projected to those vectors to investigate neural representation of task variables.
We describe how TDR was performed in our study (see [20] for the original reference). First, we regressed the neural activity of each neuron at each time point onto the task variables of interest. Then we used the matrix of regression coefficients (i.e., neuron by time) to identify the task vector. Let yit(k) be the spiking rate of neuron i at time t on trial k, where we have N neurons and M time points. We regressed the spiking activity on task variables of interest zv(k), where the task variables were v ∈ {reversal probability, choice preference, direction, object, block type, reward outcome, trial number}. For each neuron-time pair (i, t), we performed linear regression over all trials k ∈ [0, T] with a bias:
yit(k) = ∑v βit^v zv(k) + βit^0.
This regression analysis yielded an N × M coefficient matrix for each task variable v. We considered this coefficient matrix as a population vector evolving in time, βt^v = (β1t^v, …, βNt^v). Then, a task vector was defined as the population vector wv ∈ ℝ^N at which the L2-norm achieved its maximum:
wv = βtmax^v,   tmax = argmax_t ‖βt^v‖2.
We performed QR decomposition on the matrix of task vectors W = [wrev, wchoice, …] to orthogonalize the task vectors. Then, the population activity was projected onto each (orthogonalized) task vector to obtain the neural activity encoding each task variable:
xv,t(k) = wv · yt(k),
where yt(k) = (y1t(k), …, yNt(k)) is the population activity at time t on trial k.
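The procedure can be summarized in a compact NumPy sketch (array shapes and variable names are illustrative; the original analysis may differ in details such as normalization of the activity).

```python
import numpy as np

def targeted_dimensionality_reduction(Y, Z):
    """TDR sketch.

    Y : (K, N, M) spiking rates, trials x neurons x time points
    Z : (K, V) task variables per trial
    Returns orthogonalized task vectors Q (N, V) and projected activity (K, V, M)."""
    K, N, M = Y.shape
    V = Z.shape[1]
    X = np.column_stack([Z, np.ones(K)])                  # regressors plus a bias term
    beta = np.zeros((V, N, M))                            # regression coefficients
    for i in range(N):
        for t in range(M):
            coef, *_ = np.linalg.lstsq(X, Y[:, i, t], rcond=None)
            beta[:, i, t] = coef[:V]                      # drop the bias coefficient
    # Task vector: the coefficient population vector with maximal L2-norm over time.
    W = np.stack([beta[v][:, np.argmax(np.linalg.norm(beta[v], axis=0))]
                  for v in range(V)], axis=1)             # (N, V)
    Q, _ = np.linalg.qr(W)                                # orthogonalize task vectors
    proj = np.einsum('nv,knm->kvm', Q, Y)                 # project population activity
    return Q, proj
```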
4 Reward integration equation
To derive the reward integration equation shown in Figure 3, we considered the neural activity in the subspace encoding the reversal probability, xrev,t(k) = wrev · yt(k). We analyzed the neural activity at the time of cue onset t = tonset and obtained a sequence of reversal probability activity across trials, xrev^k ≡ xrev,tonset(k). To set up the reward integration equation
xrev^{k+1} = xrev^k + Δk,
we estimated the update Δk driven by the reward outcome at each trial k. Specifically, the update term was defined as the block-average of the difference of reversal probability activity at adjacent trials, computed separately for rewarded and unrewarded trials:
Δk^rew = ⟨xrev^{k+1} − xrev^k⟩Bk,rew,   Δk^norew = ⟨xrev^{k+1} − xrev^k⟩Bk,norew.
Here, Bk,rew denotes all the blocks across sessions (or networks) in which reward was received at trial k. Similarly, Bk,norew denotes all the blocks in which reward was not received at trial k.
To predict xrev^k, we set the initial value xrev^0 at trial 0 and sequentially predicted the following trials using the reward integration equation with the estimated update terms. The same analysis was performed at different time points t: we derived an integration equation for each time point and assessed its prediction accuracy, as shown in Figure 3F.
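A sketch of this fitting and prediction procedure, with assumed array names (activity aligned across blocks at cue onset):

```python
import numpy as np

def fit_update_terms(x_rev, rewarded):
    """Estimate the reward-dependent update terms as block-averaged differences of
    reversal probability activity between adjacent trials.

    x_rev    : (B, K) activity at cue onset, blocks x trials
    rewarded : (B, K) boolean reward outcome of each trial
    Returns (d_rew, d_norew), each of length K-1."""
    diff = np.diff(x_rev, axis=1)                         # x^{k+1} - x^k for each block
    d_rew = np.array([diff[rewarded[:, k], k].mean() for k in range(diff.shape[1])])
    d_norew = np.array([diff[~rewarded[:, k], k].mean() for k in range(diff.shape[1])])
    return d_rew, d_norew

def predict_block(x0, rewarded_block, d_rew, d_norew):
    """Sequentially predict one block's reversal probability activity from its
    initial value and the trial-by-trial reward outcomes."""
    x = [x0]
    for k, was_rewarded in enumerate(rewarded_block[:-1]):
        x.append(x[-1] + (d_rew[k] if was_rewarded else d_norew[k]))
    return np.array(x)
```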
To evaluate the contributions of reward and no-reward outcomes to the average response of xrev over blocks, we computed the change of the block-averaged (mean) trajectory between adjacent trials,
⟨xrev^{k+1}(t)⟩ − ⟨xrev^k(t)⟩ = q+R+(t) + q−R−(t),
where R+(t) and R−(t) are the mean differences of adjacent trials' trajectories over rewarded and unrewarded blocks, respectively, and q+ and q− denote the fractions of reward and no-reward blocks at trial k. In Figure 4D and Figure 5A, the weighted responses q+R+ and q−R− were shown.
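The weighted responses can be computed directly from trial-aligned trajectories, as in the following sketch (array names are assumptions):

```python
import numpy as np

def weighted_responses(x_rev_traj, rewarded, k):
    """Weighted reward / no-reward contributions to the change of the mean
    trajectory between trials k and k+1.

    x_rev_traj : (B, K, M) reversal probability trajectories, blocks x trials x time
    rewarded   : (B, K) boolean reward outcomes
    Returns (q_plus * R_plus, q_minus * R_minus), each of shape (M,); their sum
    equals the change of the block-averaged trajectory from trial k to k+1."""
    diff = x_rev_traj[:, k + 1] - x_rev_traj[:, k]        # (B, M)
    rew = rewarded[:, k]
    q_plus, q_minus = rew.mean(), (~rew).mean()           # fractions of reward / no-reward blocks
    R_plus = diff[rew].mean(axis=0)                       # mean shift following a rewarded trial
    R_minus = diff[~rew].mean(axis=0)                     # mean shift following an unrewarded trial
    return q_plus * R_plus, q_minus * R_minus
```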
5 Decoding monkey’s behavioral reversal trial
The PFC activity encoding reversal probability was used to decode the behavioral reversal trial at which the monkey reversed its preferred choice (see Supp. Fig. S2). Our analysis is similar to the linear discriminant analysis (LDA) performed in a previous study [4] at a fixed time point; here, we applied LDA to time points across a trial.
For training, 90% of the blocks were randomly selected to train the decoder, and the remaining 10% of the blocks were used for testing. This was repeated 20 times. The input data to the LDA were the reversal probability activity of Δk trials around the reversal trial, i.e., k ∈ [krev − Δk, …, krev + Δk] with Δk = 10. At each trial k, we took the activity vector on the time interval [t0 − Δt, t0 + Δt] around time t0 with Δt = 160 ms. The target output of the LDA was a one-hot vector ytarget, whose element was 1 at the reversal trial krev and 0 at other trials. The following input-output pair shows the dataset of a block used for training:
(X, ytarget),   X = [xrev^{krev−Δk}, …, xrev^{krev+Δk}].
Here, xrev^k ∈ ℝ^{Tdec}, where Tdec = 2Δt/Δh + 1 denotes the number of time points around t0 with time increment Δh = 20 ms, and ytarget ∈ ℝ^{Kdec} is the one-hot vector, where Kdec = 2Δk + 1 denotes the number of trials around the reversal trial. As mentioned above, this analysis was repeated for time points t0 across a trial.
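For illustration, the decoder can be implemented with an off-the-shelf LDA, as in the sketch below; scikit-learn and the per-trial binary labeling (equivalent to the one-hot target described above) are assumptions, not the study's own implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_reversal_decoder(X_train, y_train):
    """X_train : (n_trials_total, T_dec) activity snippets around t0, pooled over training blocks
    y_train : (n_trials_total,) label, 1 for the behavioral reversal trial and 0 otherwise."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)
    return lda

def decode_reversal_trial(lda, X_block):
    """X_block : (K_dec, T_dec) snippets from one held-out block.
    The decoded reversal trial is the trial with the highest posterior
    probability of the 'reversal' class."""
    p_reversal = lda.predict_proba(X_block)[:, 1]
    return int(np.argmax(p_reversal))
```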
Acknowledgements
This research was supported by the Intramural Research Program of the National Institutes of Health: the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and the National Institute of Mental Health (NIMH).
Supplementary material
References
- [1] Perseveration in extinction and in discrimination reversal tasks following selective frontal ablations in Macaca mulatta. Physiology & Behavior 4:163–171
- [2] Reversal learning and dopamine: a Bayesian perspective. Journal of Neuroscience 35:2407–2416
- [3] Orbitofrontal circuits control multiple reinforcement-learning processes. Neuron 103:734–746
- [4] Prefrontal cortex predicts state switches during reversal learning. Neuron 106:1044–1054
- [5] Two types of locus coeruleus norepinephrine neurons drive reinforcement learning. bioRxiv 2022–12
- [6] Serotonin in the orbitofrontal cortex enhances cognitive flexibility. bioRxiv
- [7] Probabilistic decision making by slow reverberation in cortical circuits. Neuron 36:955–968
- [8] Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503:78–84
- [9] Discrete attractor dynamics underlies persistent activity in the frontal cortex. Nature 566:212–217
- [10] Transitions in dynamical regime and neural mode underlie perceptual decision-making. bioRxiv 2023–10
- [11] Learning to predict by the methods of temporal differences. Machine Learning 3:9–44
- [12] A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory :64–99
- [13] 2017 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE :1–5
- [14] The role of frontal cortical and medial-temporal lobe brain areas in learning a Bayesian prior belief on reversals. Journal of Neuroscience 35:11751–11760
- [15] Bayesian online learning of the hazard rate in change-point problems. Neural Computation 22:2452–2476
- [16] A recurrent network mechanism of time integration in perceptual decisions. Journal of Neuroscience 26:1314–1328
- [17] How the brain keeps the eyes still. Proceedings of the National Academy of Sciences 93:13339–13344
- [18] Neural dynamics of choice: single-trial analysis of decision-related activity in parietal cortex. Journal of Neuroscience 32:12684–12701
- [19] Neural underpinnings of the evidence accumulator. Current Opinion in Neurobiology 37:149–157
- [20] Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature 503:78–84
- [21] An approximate line attractor in the hypothalamus encodes an aggressive state. Cell 186:178–193
- [22] A motor cortex circuit for motor planning and movement. Nature 519:51–56
- [23] Robust neuronal dynamics in premotor cortex during motor planning. Nature 532:459–464
- [24] Attractor dynamics gate cortical information flow during decision-making. Nature Neuroscience 24:843–850
- [25] Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science 349:184–187
- [26] Discrete stepping and nonlinear ramping dynamics underlie spiking responses of LIP neurons during decision-making. Neuron 102:1249–1258
- [27] Change point estimation by the mouse medial frontal cortex during probabilistic reward learning. bioRxiv 2022–5
- [28] Selective engagement of prefrontal VIP neurons in reversal learning. bioRxiv 2024–4
- [29] Distinct roles of parvalbumin- and somatostatin-expressing neurons in flexible representation of task variables in the prefrontal cortex. Progress in Neurobiology 187
- [30] Reinforcement learning detuned in addiction: integrative and translational approaches. Trends in Neurosciences 45:96–105
- [31] Orbitofrontal cortex, decision-making and drug addiction. Trends in Neurosciences 29:116–124
- [32] Orbitofrontal cortex as a cognitive map of task space. Neuron 81:267–279
- [33] Reinforcement-learning in fronto-striatal circuits. Neuropsychopharmacology 47:147–162
- [34] A theory of memory retrieval. Psychological Review 85
- [35] The effect of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision 5:1–1
- [36] Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology 86:1916–1936
- [37] A role for neural integrators in perceptual decision making. Cerebral Cortex 13:1257–1269
- [38] A comparison of macaque behavior and superior colliculus neuronal activity to predictions from models of two-choice decisions. Journal of Neurophysiology 90:1392–1407
- [39] The dynamics and geometry of choice in premotor cortex. bioRxiv
- [40] Bump attractor dynamics in prefrontal cortex explains behavioral precision in spatial working memory. Nature Neuroscience 17:431–439
- [41] Cell-type-specific population dynamics of diverse reward computations. Cell 185:3568–3587
- [42] Toroidal topology of population activity in grid cells. Nature 602:123–128
- [43] A unified theory for the computational and mechanistic origins of grid cells. Neuron 111:121–137
- [44] Mechanisms underlying the neural computation of head direction. Annual Review of Neuroscience 43:31–54
- [45] Advances in Neural Information Processing Systems. Curran Associates, Inc. 33:4584–4596
- [46] Brain-wide representations of prior information in mouse decision-making. bioRxiv 2023–7
- [47] Prefrontal cortex as a meta-reinforcement learning system. Nature Neuroscience 21:860–868
- [48] Activity patterns of serotonin neurons underlying cognitive flexibility. eLife 6
- [49] Temporal derivative computation in the dorsal raphe network revealed by an experimentally driven augmented integrate-and-fire modeling framework. eLife 12
- [50] Dopamine and serotonin interplay for valence-based spatial learning. Cell Reports 39
- [51] Serotonin predictively encodes value. bioRxiv 2023–9
- [52] Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science 274
Copyright
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.