Neural dynamics of reversal learning in the prefrontal cortex and recurrent neural networks

Christopher M Kim
Carson C Chow
Bruno B Averbeck

Laboratory of Biological Modeling, NIDDK/NIH, Bethesda, United States
Laboratory of Neuropsychology, NIMH/NIH, Bethesda, United States

https://doi.org/10.7554/eLife.103660.2

Open access

Figures and data

Comparison of the behavior of trained RNNs and monkeys.
(A) Schematic of RNN training setup. In a trial, the network makes a choice in response to a cue. Then, a feedback input, determined by the choice and reward outcome, is injected into the network. This procedure is repeated across trials. The panel on the right shows this sequence of events unfolding in time within a trial. (B) Left: Example of a trained RNN’s choice outcomes. Vertical bars show the RNN’s choice in each trial and the reward outcome (magenta: choice A, blue: choice B, light: rewarded, dark: not rewarded). Horizontal bars at the top show the reward schedule (magenta: choice A is rewarded with probability 70% and choice B with probability 30%; blue: the reward schedule is reversed). The black curve shows the RNN output. Green horizontal bars show the posterior of reversal probability at each trial, inferred using a Bayesian model. Right: Schematic of the RNN training scheme. The scheduled reversal is the trial at which the reward probabilities of the two options switch (the color codes for magenta and cyan are the same as in the left panel). The inferred reversal is the scheduled reversal trial as inferred by the Bayesian model. The behavioral reversal is determined by adding a few delay trials to the inferred reversal trial. The target output, on which the RNN outputs are trained, switches at the behavioral reversal trial. (C) Probability of choosing the initial best (i.e., high-value) option. Relative trial indicates the trial number relative to the behavioral reversal trial inferred from the Bayesian model. Relative trial number 0 is the trial at which the choice was reversed. Shaded region shows the S.E.M. (standard error of the mean) over blocks in all the sessions (monkeys) or networks (RNNs). (D) Fraction of no-reward blocks as a function of relative trial. Dotted lines show 0.3 and 0.7. Shaded region shows the S.E.M. over blocks in all the sessions (monkeys) or networks (RNNs). (E) Distribution of the RNNs’ and monkeys’ reversal trials, relative to the experimentally scheduled reversal trial.
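
The caption above refers to a posterior over the reversal trial inferred with a Bayesian model, but does not spell the model out. The following is a minimal sketch of one way such a posterior could be computed, assuming a uniform prior over the reversal trial and a 70%/30% reward schedule that swaps at the reversal. The function name `reversal_posterior` and its arguments are hypothetical, and the authors' model may differ in its details.

```python
import math


def reversal_posterior(choices, rewards, p_high=0.7):
    """Posterior over the reversal trial r (0..T), assuming a uniform prior.

    choices: list of "A"/"B"; rewards: list of 0/1, one entry per trial.
    Before trial r, option A is rewarded with probability p_high and B with
    1 - p_high; from trial r onward the two probabilities are swapped.
    r == T means that no reversal occurred within the block.
    """
    n = len(choices)
    log_like = []
    for r in range(n + 1):
        ll = 0.0
        for t, (choice, rewarded) in enumerate(zip(choices, rewards)):
            a_is_best = t < r
            chose_best = (choice == "A") == a_is_best
            p_reward = p_high if chose_best else 1.0 - p_high
            ll += math.log(p_reward if rewarded else 1.0 - p_reward)
        log_like.append(ll)
    # Normalize in a numerically stable way (subtract the maximum first).
    top = max(log_like)
    weights = [math.exp(v - top) for v in log_like]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative data: option A is chosen throughout; rewards stop at trial 10.
choices = ["A"] * 20
rewards = [1] * 10 + [0] * 10
posterior = reversal_posterior(choices, rewards)
inferred_reversal = max(range(len(posterior)), key=lambda r: posterior[r])
```

Under the training scheme described in panel (B), the behavioral reversal would then be placed a few trials after `inferred_reversal`.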

Neural trajectories encoding choice and reversal probability variables.
(A) Neural trajectories of PFC (top) and RNN (bottom) obtained by projecting population activity onto task vectors encoding choice and reversal probability. Trial numbers indicate the trial position relative to the behavioral reversal trial. Neural trajectories in each trial were averaged over 8 experimental sessions and 23 blocks for the PFC, and over 40 networks and 20 blocks for the RNNs. The black square indicates the time of cue onset. (B-C) Neural activity encoding reversal probability and choice in PFC (top) and RNN (bottom) at the time of cue onset (black squares in panel A) around the behavioral reversal trial. Shaded region shows the S.E.M. over sessions (or networks) and blocks.
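
The projection step mentioned in panel (A) can be illustrated with a short sketch. It assumes the task vector is already available (the captions do not say how it is estimated) and that it is normalized to unit length before projecting. The name `project` and the toy numbers are hypothetical.

```python
def project(activity, task_vector):
    """Project population activity (time x neurons) onto a task vector.

    The task vector is normalized to unit length, so each output value is
    the component of the population state along that direction.
    """
    norm = sum(v * v for v in task_vector) ** 0.5
    unit = [v / norm for v in task_vector]
    return [sum(r * u for r, u in zip(rates, unit)) for rates in activity]


# Toy example: 3 time points, 2 neurons, one task vector.
activity = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
v_rev = [3.0, 4.0]  # hypothetical reversal-probability vector
x_rev = project(activity, v_rev)
```

Projecting the same activity onto a second vector (for example one encoding choice) gives the second coordinate of the two-dimensional trajectories shown in the figure.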

Integration of reward outcomes drives reversal probability activity.
(A) The reversal probability activity of PFC (orange) and prediction by the reward integration equation (blue) at the time of cue onset across trials around the behavioral reversal trial. Three example blocks are shown. Pearson correlation between the actual and predicted PFC activity is shown on each panel. Relative trial number indicates the trial position relative to the behavioral reversal trial. (B) of PFC estimated from the reward integration equation at cue onset. and correspond to no-reward (red) and reward trials (blue), respectively. The shaded region shows the S.E.M. over blocks and sessions. (C-D) Same as in panels (A) and (B) but for trained RNNs. (E) Prediction accuracy, quantified with Pearson correlation, of the reward integration equation for all 8 PFC recording sessions and all 40 trained RNNs at cue onset. (F) Average prediction accuracy, quantified with Pearson correlation, of the reward integration equation across time. The value at each time point shows the prediction accuracy averaged over all blocks in PFC recording sessions (top) or trained RNNs (bottom).
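
The reward integration equation itself is not reproduced in this caption. As a rough sketch only, one simple form consistent with the description is a trial-to-trial update in which the activity at cue onset changes by one fixed increment after a rewarded trial and by another after an unrewarded trial. The functions below (`fit_increments`, `integrate_rewards`, `pearson`), the update rule, and the numbers are all assumptions made for illustration, not the authors' implementation.

```python
def fit_increments(x, rewards):
    """Mean trial-to-trial change of x after reward and after no reward."""
    steps = [x[k + 1] - x[k] for k in range(len(x) - 1)]
    after_reward = [s for s, r in zip(steps, rewards) if r]
    after_no_reward = [s for s, r in zip(steps, rewards) if not r]
    d_plus = sum(after_reward) / len(after_reward)
    d_minus = sum(after_no_reward) / len(after_no_reward)
    return d_plus, d_minus


def integrate_rewards(x0, rewards, d_plus, d_minus):
    """Predict x across trials by accumulating outcome-dependent increments."""
    pred = [x0]
    for r in rewards[:-1]:  # the last outcome affects a trial outside the window
        pred.append(pred[-1] + (d_plus if r else d_minus))
    return pred


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5


# Illustrative block: three unrewarded trials, three rewarded, one unrewarded.
rewards = [0, 0, 0, 1, 1, 1, 0]
x_rev = [0.0, 0.5, 1.0, 1.5, 1.25, 1.0, 0.75]  # made-up activity at cue onset
d_plus, d_minus = fit_increments(x_rev, rewards)
predicted = integrate_rewards(x_rev[0], rewards, d_plus, d_minus)
accuracy = pearson(x_rev, predicted)
```

In this toy block the activity follows the update rule exactly, so the prediction matches the data; with recorded activity the correlation would quantify how well reward integration alone accounts for the trial-to-trial changes.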

Augmented model for reversal probability activity.
(A) Schematic of two activity modes of the reversal probability activity. Left: Stationary mode (line attractor), where x_rev(t) remains constant during a trial, and non-stationary mode, where x_rev(t) is dynamic. Right: Augmentation of the stationary and non-stationary activity modes, where the stationary mode leads the non-stationary mode in time. The time derivative dx_rev/dt is shown to demonstrate the (non-)stationarity of the activity. (B) Left: Block-averaged dx_rev/dt of PFC across trial and time. Dotted red lines indicate the onset times of fixation (−0.5 s), cue (0 s), and reward (0.8 s); the same lines are shown on the right. Right: dx_rev/dt averaged over all trials (white), together with the trajectories of 5 trials around the reversal trial (colored). (C) Left: Contraction factor of x_rev of PFC at different time points. The dotted line at 1 indicates the threshold between contraction and expansion. Right: Contraction factor of PFC x_rev of individual trials in the time interval between −2.5 s and −1 s. (D) Block-averaged dx_rev/dt of RNNs at the pre-reversal (left) and post-reversal (right) trials. Note that the sign of the post-reversal trial trajectories was flipped to match the shape of the pre-reversal trajectories. Dotted red lines indicate the times of fixation, cue-off, and reward. (E) Contraction factor of x_rev of RNNs, showing results similar to those in panel (C). (F) Generating PFC non-stationary reversal probability trajectories from the stationary activity using support vector regression (SVR) models. Top: Trajectories generated from the SVR compared to the PFC reversal probability trajectories in trials around the reversal trial in an example block. The initial state (green) is the input to the SVR model, which then predicts the rest of the trajectory. The normalized mean-squared error (MSE) between the SVR trajectory (prediction, red) and the PFC trajectory (data, black) is shown in each trial. Bottom: Trajectories generated from the null SVR compared to the PFC reversal probability trajectories. The initial states of trials in a block were shuffled randomly prior to training the null SVR model. The trajectories predicted from the null SVR model (blue) are compared to the PFC reversal probability trajectories (black). (G) The normalized MSE of all trials in the test dataset. (H) Difference between the normalized MSE of the SVR and the null models. The difference in normalized MSE between the two models was calculated for each trial.
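
The distinction between stationary and non-stationary modes in panel (A) rests on the time derivative dx_rev/dt. A minimal sketch of how this could be checked on a sampled trajectory is shown here, using a forward finite difference and a tolerance on its magnitude. The function names, the sampling step, and the tolerance are hypothetical choices for illustration.

```python
def time_derivative(x, dt):
    """Forward finite-difference estimate of dx/dt for a sampled trajectory."""
    return [(x[i + 1] - x[i]) / dt for i in range(len(x) - 1)]


def stationary_mask(x, dt, tol):
    """True wherever |dx/dt| stays below tol, i.e. the activity is flat."""
    return [abs(v) < tol for v in time_derivative(x, dt)]


# Toy trajectory: flat at first (stationary), then ramping (non-stationary).
x_rev = [1.0, 1.0, 1.0, 1.5, 2.0]
dxdt = time_derivative(x_rev, dt=0.5)
flat = stationary_mask(x_rev, dt=0.5, tol=0.1)
```

A trajectory of the augmented type described in the caption would show a run of `True` values early in the trial followed by `False` values once the activity starts to change.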

Dynamic neural trajectories encoding reversal probability are separated in response to reward outcomes.
(A) Left: x_rev(t) of PFC at the current trial (black) is compared to x_rev(t) in the next trial when reward is received (top, red) and not received (bottom, blue). Right: The difference of x_rev(t) between the current and next trials shown on the left panels. Shaded region shows the S.E.M. across all trials, blocks and sessions. (B) Difference of x_rev of two adjacent trials when reward is not received (top, R₋) or received (bottom, R₊). The approximate time of reward outcome is shown. Relative trial number indicates the trial position relative to the behavioral reversal trial. (C) Left: x_rev(t) of PFC of consecutive no-reward trials before the behavioral reversal trial (top) and consecutive reward trials after the behavioral reversal (bottom). The initial value was subtracted to compare the ramping rates of x_rev(t). Right: Difference in the ramping rates of trajectories of adjacent trials, when reward was received (blue) and not received (red). (D-E) Same as the right panels in (A) and (C) but for trained RNNs. (F) Left, Middle: External (left) and recurrent (middle) inputs to the RNN reversal probability dynamics, when reward was not received (red, magenta) or was received (blue, cyan). Right: Amplification factor shows the ratio of the total input when no reward (or reward) was received to the total input of reference input. The amplification factors for both the external (red, blue) and recurrent (magenta, cyan) inputs are shown. The red and magenta curves overlap, as do the blue and cyan curves.

Mean trajectories encoding reversal probability shift monotonically across trials.
(A) Traces of and around the behavioral reversal trial. Note the sign flip in , which was introduced to compare the magnitudes of. and . Relative trial number indicates the trial position relative to the behavioral reversal trial. (B) Top: across trial and time. Bottom: Temporal averages of and. over the trial duration. (C) Top: Traces of of pre-reversal trials (relative trial k = −5 to −1), and the fraction of trials at each time point that satisfy. Bottom: Mean PFC reversal probability trajectories of pre-reversal trials. (D) Same as in panel (C), but for post-reversal trials (relative trial k = 0 to 4). (E) Spearman rank correlation between trial numbers and the mean PFC reversal probability trajectories across pre-reversal (red) and post-reversal (blue) trials at each time point. For the post-reversal trials, Spearman rank correlation was calculated with the trial numbers in reversed order to capture the descending order. (F) of trained RNNs across trial and time. (G-I) Trained RNNs’ block-averaged x_rev before and after the reversal trial and their average Spearman correlation at each time point.
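
The monotonicity test in panel (E) can be sketched as a Spearman rank correlation, computed at each time point, between the trial numbers and the trajectory values of those trials. The code here is a plain-Python illustration with hypothetical function names and toy data; for post-reversal trials the trial numbers are negated, which reverses their order as described in the caption.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((u - ma) * (v - mb) for u, v in zip(ra, rb))
    va = sum((u - ma) ** 2 for u in ra)
    vb = sum((v - mb) ** 2 for v in rb)
    return cov / (va * vb) ** 0.5


def spearman_over_time(trial_numbers, trajectories, reverse=False):
    """Correlation between trial order and trajectory value at each time point.

    trajectories[k][t] is the value of trial k at time point t.
    """
    order = [-k for k in trial_numbers] if reverse else list(trial_numbers)
    n_time = len(trajectories[0])
    return [spearman(order, [traj[t] for traj in trajectories])
            for t in range(n_time)]


# Toy pre-reversal trials (k = -5..-1) whose trajectories rise with k.
pre = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
rho_pre = spearman_over_time([-5, -4, -3, -2, -1], pre)
# Toy post-reversal trials (k = 0..4) whose trajectories fall with k.
post = list(reversed(pre))
rho_post = spearman_over_time([0, 1, 2, 3, 4], post, reverse=True)
```

A correlation near 1 at a given time point indicates that the trajectories are ordered monotonically across trials at that time.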

Perturbing RNN’s neural activity encoding reversal probability biases choice outcomes.
(A) RNN perturbation scheme. Three perturbation stimuli were used: v₊, the population vector encoding the reversal probability; v₋, the negative of v₊; and v_rnd, a control stimulus in a random direction. Perturbation stimuli were applied at the reversal trial (0) and the two preceding trials (−2, −1). (B) Deviation of reversal probability activity Δx_rev and choice activity Δx_choice from the unperturbed activity. Perturbation was applied at the reversal trial during the time interval in which the cue was presented (shaded red). Choice was made after a short delay (shaded gray). Perturbation responses along the reversal probability vector v₊ (solid) and the random vector v_rnd (dotted) are shown. (C) Perturbation of reversal probability activity (left) and choice activity (right) in response to the three types of perturbation stimuli. Each dot shows the response of a perturbed network. Two perturbation strengths (multiplicative factors of 3 and 4, shown in panels D and E) were applied to 40 RNNs. Δx_rev shows the activity averaged over the duration of perturbation, and Δx_choice shows the activity averaged over the duration of choice. Δx_choice of v₊ is significantly smaller than Δx_choice of v₋ (KS-test, p-value = 0.007). (D-E) Fraction of blocks in all 40 trained RNNs that exhibited delayed or accelerated reversal trials in response to perturbations of the reversal probability activity. Perturbations at trial number −1 by the three stimulus types are shown on the left panels, and perturbations at all three trials by the stimulus of interest (v₋ in D and v₊ in E) are shown on the right panels. The multiplicative factor on the perturbation stimuli is shown as stimulus strength. (F) Left: The slope of the linear regression model fitted to the residual activity of reversal probability and choice. The residual activity at each trial over the time interval [0, 500] ms was used to fit the linear model. The red dot indicates the slope at trial number −1. Relative trial number indicates the trial position relative to the behavioral reversal trial. Right: Each dot is the residual activity of a block at trial number −1. The red line shows the fitted linear model, and its slope (−0.34) is shown.
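
The regression in panel (F) can be illustrated with an ordinary least-squares slope. This sketch assumes that the residual activity is the deviation of each block from the across-block mean and that choice residuals are regressed on reversal probability residuals; neither detail is stated in the caption. The function names and data values are made up.

```python
def residuals(values):
    """Deviation of each block's activity from the across-block mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def ols_slope(x, y):
    """Slope of the least-squares line y = a + b * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Toy per-block activity at one trial (illustrative values only).
x_rev_blocks = [1.0, 2.0, 3.0, 4.0]
x_choice_blocks = [1.5, 1.0, 0.5, 0.0]
slope = ols_slope(residuals(x_rev_blocks), residuals(x_choice_blocks))
```

A negative slope, as reported in the caption for trial number −1, means that blocks with higher-than-average reversal probability activity tend to have lower-than-average choice activity.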

Four types of feedback inputs

Breakdown of R⁺ and R⁻ by the reward outcomes of two consecutive trials.
(A) R⁺ was decomposed into two components, R⁺ = R⁺⁺ + R⁺⁻, where R⁺⁺ indicates two consecutive reward trials and R⁺⁻ indicates a reward followed by no reward. Left: R⁺⁺ across trial and time (top). Traces of R⁺⁺ at individual trials and the fraction of trials whose traces are negative (bottom). Middle: Same as the left panel but for R⁺⁻. Right: Same as the other panels but for R⁺. (B) R⁻ was decomposed into two components, R⁻ = R⁻⁺ + R⁻⁻, where R⁻⁺ indicates no reward followed by a reward and R⁻⁻ indicates two consecutive no-reward trials. The same analysis as in panel (A) was performed.

Decoding reward outcome and the behavioral reversal trial using neural trajectories encoding reversal probability.
(A) Left: Decoding the reward outcome (i.e., reward or no reward) of every trial at each time point, given the difference of neural trajectories of two adjacent trials. At each time point, a 300 ms segment of the trajectories was used for decoding. Right: Decoding accuracy averaged over all trials shown on the left panel. The red dotted line shows the approximate time of the next trial’s reward. The gray dotted line shows the chance-level performance. (B) Left: Decoding the behavioral reversal trial using neural trajectories of 20 trials around the reversal trial. Decoding error shows the position of the predicted reversal trial relative to the actual reversal trial. At each time point, a 300 ms segment of each trajectory was used for decoding. Black shows the decoding error when single-trial trajectories were used, and green shows the result when 5 randomly chosen blocks of trajectories were averaged before decoding. The gray dotted line shows the chance-level performance. Right: Distance between trajectories was measured by taking the average of the normalized mean-squared error of adjacent trajectories at all trials. Each dot corresponds to a time point shown on the left panel.

Comparison of RNNs trained with and without fixation.
(A) RNNs trained without fixation. Right: The choice output of the RNNs oscillates. Left, Middle: The derivative of reversal probability activity dx_rev/dt does not converge to 0 during the early part of a trial (start to cue-off). As the cue is turned on, dx_rev/dt fluctuates with the cue. The white line shows dx_rev/dt averaged over all pre-reversal (left) and post-reversal (middle) trials. (B) RNNs trained with the choice output fixed at 0 before making a choice. Specifically, during the time interval between the fixation and cue-off lines shown in the left and middle panels, the choice output was trained to be fixed at 0. Right: The choice output of the RNNs is flat when they are not making choices. Left, Middle: The derivative of reversal probability activity dx_rev/dt converges to 0 during the early part of a trial (fixation to cue-on). As the cue is turned on, dx_rev/dt fluctuates more mildly than in RNNs trained without fixation. The white line shows dx_rev/dt averaged over all pre-reversal (left) and post-reversal (middle) trials.
