Comparison of the behavior of trained RNNs and monkeys. (A) Schematic of the RNN training setup. In each trial, the network makes a choice in response to a cue. A feedback input, determined by the choice and the reward outcome, is then injected into the network. This procedure is repeated across trials. (B) Example choice outcomes of a trained RNN. Vertical bars show the RNN's choice and the reward outcome in each trial (magenta: choice A, blue: choice B, light: rewarded, dark: not rewarded). Horizontal bars at the top show the reward schedule (magenta: choice A is rewarded with probability 0.7 and choice B with probability 0.3; blue: the reward schedule is reversed). The black curve shows the RNN output. Green horizontal bars show the posterior reversal probability at each trial inferred with a Bayesian model. (C) Probability of choosing the initial best option. Relative trial indicates the trial number relative to the behavioral reversal trial inferred from the Bayesian model; relative trial 0 is the trial at which the choice was reversed. (D) Fraction of no-reward blocks as a function of relative trial. Dotted lines show 0.3 and 0.7. (E) Distribution of the RNNs' and monkeys' reversal trials, relative to the experimentally scheduled reversal trial.
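
The Bayesian model referenced in panel (B) is not specified in detail here; the sketch below shows one way such a per-trial reversal posterior could be computed, assuming a uniform prior over the reversal trial and Bernoulli reward likelihoods with the 0.7/0.3 schedule. The function name and model details are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reversal_posterior(choices, rewards, p_hi=0.7, p_lo=0.3):
    """Posterior over the reversal trial given choices (0 = A, 1 = B) and
    rewards (0/1), assuming a uniform prior and Bernoulli reward likelihoods.
    Before the reversal, choice A is rewarded with p_hi and B with p_lo;
    after the reversal the probabilities are swapped."""
    n = len(choices)
    log_post = np.zeros(n + 1)  # hypothesis r: the reversal occurs before trial r
    for r in range(n + 1):
        for t, (c, rew) in enumerate(zip(choices, rewards)):
            p_reward = (p_hi if c == 0 else p_lo) if t < r else (p_lo if c == 0 else p_hi)
            log_post[r] += np.log(p_reward if rew else 1.0 - p_reward)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Probability that the reversal has occurred at or before trial t:
# p_reversed = np.cumsum(reversal_posterior(choices, rewards)[:len(choices)])
```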

Neural trajectories encoding choice and reversal probability variables. (A) Neural trajectories of PFC (top) and RNN (bottom) obtained by projecting population activity onto task vectors encoding choice and reversal probability. Trial numbers indicate position relative to the behavioral reversal trial. Neural trajectories in each trial were averaged over 8 experimental sessions and 23 blocks for the PFC, and 40 networks and 20 blocks for the RNNs. The black square indicates the time of cue onset. (B-C) Neural activity encoding reversal probability and choice in the PFC (top) and RNN (bottom) at the time of cue onset (black squares in panel A) around the behavioral reversal trial. Shaded blue shows the standard error of the mean over sessions (or networks) and blocks.
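
As a concrete illustration of the projection used in panel (A), the sketch below projects population activity onto unit-norm task vectors. How the task vectors themselves are estimated (e.g., by regressing activity against choice and reversal probability) is left out, and all names and array shapes are assumptions.

```python
import numpy as np

def project_onto_task_vectors(rates, v_choice, v_rev):
    """Project population activity onto task vectors.

    rates    : array of shape (n_trials, n_timepoints, n_neurons)
    v_choice : (n_neurons,) population vector encoding choice
    v_rev    : (n_neurons,) population vector encoding reversal probability
    Returns two arrays of shape (n_trials, n_timepoints): x_choice(t) and x_rev(t).
    """
    v_choice = v_choice / np.linalg.norm(v_choice)  # normalize to unit length
    v_rev = v_rev / np.linalg.norm(v_rev)
    x_choice = rates @ v_choice                     # projection onto choice axis
    x_rev = rates @ v_rev                           # projection onto reversal axis
    return x_choice, x_rev
```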

Integration of reward outcomes drives reversal probability activity. (A) The reversal probability activity of PFC (orange) and the prediction of the reward integration equation (blue) at the time of cue onset across trials around the behavioral reversal trial. Three example blocks are shown. The Pearson correlation between the actual and predicted PFC activity is shown on each panel. (B) R− and R+ of PFC estimated from the reward integration equation at cue onset; R− and R+ correspond to no-reward (red) and reward (blue) trials, respectively. (C-D) Same as in panels (A) and (B) but for trained RNNs. (E) Prediction accuracy of the reward integration equation for all 8 PFC recording sessions and all 40 trained RNNs at cue onset. (F) Average prediction accuracy of the reward integration equation across time. The value at each time point shows the prediction accuracy averaged over all blocks in PFC recording sessions (top) or trained RNNs (bottom).
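
One simple form of a reward integration equation consistent with this caption is xrev(k+1) = xrev(k) + R+ on rewarded trials and xrev(k+1) = xrev(k) + R− on unrewarded trials. The sketch below fits the two shifts by least squares and scores the prediction with a Pearson correlation; the exact equation and fitting procedure used by the authors may differ.

```python
import numpy as np
from scipy.stats import pearsonr

def fit_reward_integration(x_rev, rewarded):
    """Fit x_rev(k+1) ~ x_rev(k) + R_plus * rewarded(k) + R_minus * (1 - rewarded(k)).

    x_rev    : (n_trials,) reversal probability activity at cue onset
    rewarded : (n_trials,) 0/1 reward outcomes
    Returns the fitted shifts and the Pearson correlation between the
    predicted and actual activity on the next trial.
    """
    dx = np.diff(x_rev)                                  # observed trial-to-trial shifts
    X = np.column_stack([rewarded[:-1], 1 - rewarded[:-1]])
    (r_plus, r_minus), *_ = np.linalg.lstsq(X, dx, rcond=None)
    x_pred = x_rev[:-1] + X @ np.array([r_plus, r_minus])
    rho, _ = pearsonr(x_pred, x_rev[1:])                 # prediction accuracy
    return r_plus, r_minus, rho
```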

Dynamic neural trajectories encoding reversal probability are separated in response to reward outcomes. (A) Two neural models for the reversal probability dynamics. Left: Line attractor model, in which xrev(t) remains constant during a trial. Right: Dynamic trajectory model, in which xrev(t) is non-stationary. In both models, the trajectories of adjacent trials are separable if the shift due to reward (R+) or no reward (R−) is negative or positive, respectively, throughout the trial. (B) Left: Block-averaged dxrev/dt of PFC across trial and time. Dotted red lines indicate the onset times of fixation (−0.5 s), cue (0 s) and reward (0.8 s); the same lines are shown on the right. Right: Average of dxrev/dt over the trials shown on the left panel. (C) Left: xrev(t) of PFC at the current trial (black) compared to xrev(t) in the next trial when reward is received (red) and not received (blue). Right: The difference of xrev(t) between the current and next trials shown on the left panels. (D) Difference of xrev of two adjacent trials when reward is not received (R−) or received (R+). The approximate time of reward outcome is shown. (E) Left: xrev(t) of PFC for consecutive no-reward trials before the behavioral reversal trial (top) and consecutive reward trials after the behavioral reversal (bottom). The initial value was subtracted to compare the ramping rates of xrev(t). Right: Difference in the ramping rates of trajectories of adjacent trials when reward was received (blue) and not received (red). (F) External (left) and recurrent (middle) inputs to the reversal probability dynamics when reward was not received (red) and received (blue). The amplification factor (right) is the ratio of the no-reward (or reward) input to the reference input; amplification factors for both the external and recurrent inputs are shown. (G-H) Same as the right panels in (C) and (E) but for trained RNNs.
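
A minimal sketch of the quantities compared in panels (B)-(D): the within-trial derivative dxrev/dt and the shift of xrev(t) between adjacent trials, split by the reward outcome of the current trial. Array shapes, the time step, and the averaging are illustrative assumptions.

```python
import numpy as np

def trajectory_shift_by_outcome(x_rev, rewarded, dt=0.01):
    """Compute the within-trial derivative of x_rev(t) and the shift of x_rev(t)
    between adjacent trials, split by the reward outcome of the current trial.

    x_rev    : (n_trials, n_timepoints) reversal probability trajectories
    rewarded : (n_trials,) 0/1 reward outcomes
    dt       : time step in seconds
    """
    dxdt = np.gradient(x_rev, dt, axis=1)                   # within-trial ramping
    diff = x_rev[1:] - x_rev[:-1]                           # next trial minus current trial
    shift_reward = diff[rewarded[:-1] == 1].mean(axis=0)    # shift after a rewarded trial (R+)
    shift_noreward = diff[rewarded[:-1] == 0].mean(axis=0)  # shift after an unrewarded trial (R-)
    return dxdt, shift_reward, shift_noreward
```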

Mean trajectories encoding reversal probability shift monotonically across trials. (A) Traces of R− and −R+ around the behavioral reversal trial. Note the sign flip in −R+, which was introduced to compare the magnitudes of R− and R+. (B) Top: R− and −R+ across trial and time. Bottom: Temporal averages of R− and −R+ over the trial duration. (C) Top: Traces of the difference between the reversal probability trajectories of adjacent pre-reversal trials (relative trial k = −5 to −1), and the fraction of trials at each time point for which this difference is positive. Bottom: Mean PFC reversal probability trajectories of pre-reversal trials. (D) Same as in panel (C), but for post-reversal trials (relative trial k = 0 to 4). (E) Spearman rank correlation between trial numbers and the mean PFC reversal probability trajectories across pre-reversal (red) and post-reversal (blue) trials at each time point. For the post-reversal trials, the Spearman rank correlation was calculated with the trial numbers in reversed order to capture the descending order. (F) R− and −R+ of trained RNNs across trial and time. (G-I) Trained RNNs' block-averaged xrev before and after the reversal trial and their average Spearman correlation at each time point.
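
The Spearman analysis in panels (E) and (I) can be sketched as follows: the rank correlation between trial order and the block-averaged trajectory is computed at each time point, with the order reversed for post-reversal trials. The exact preprocessing is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def monotonic_ordering(mean_traj, descending=False):
    """Spearman rank correlation between trial number and the mean reversal
    probability trajectory at each time point.

    mean_traj  : (n_trials, n_timepoints) block-averaged x_rev trajectories,
                 ordered by relative trial number
    descending : if True, reverse the trial order (used for post-reversal trials)
    """
    n_trials, n_time = mean_traj.shape
    order = np.arange(n_trials)[::-1] if descending else np.arange(n_trials)
    rho = np.array([spearmanr(order, mean_traj[:, t])[0] for t in range(n_time)])
    return rho
```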

Perturbing RNN's neural activity encoding reversal probability biases choice outcomes. (A) RNN perturbation scheme. Three perturbation stimuli were used: v+, the population vector encoding reversal probability; v−, the negative of v+; and vrnd, a control stimulus in a random direction. Perturbation stimuli were applied at the reversal trial (0) and the two preceding trials (−2, −1). (B) Deviation of the reversal probability activity Δxrev and the choice activity Δxchoice from the unperturbed activity. The perturbation was applied at the reversal trial during the time interval in which the cue was presented (shaded red). The choice was made after a short delay (shaded gray). Perturbation responses along the reversal probability vector v+ (solid) and the random vector vrnd (dotted) are shown. (C) Perturbation of the reversal probability activity (left) and choice activity (right) in response to the three types of stimulus. Δxrev shows the activity averaged over the duration of the perturbation, and Δxchoice shows the activity averaged over the duration of the choice. (D-E) Fraction of blocks in all 40 trained RNNs that exhibited delayed or accelerated reversal trials in response to perturbations of the reversal probability activity. Perturbations at trial number −1 by the three stimulus types are shown on the left panels, and perturbations at all three trials by the stimulus of interest (v− in D and v+ in E) are shown on the right panels. (F) Left: The slope of a linear regression model fitted to the residual activity of reversal probability and choice. The residual activity at each trial over the time interval [0, 500] ms was used to fit the linear model. The red dot indicates the slope at trial number −1. Right: Each dot is the residual activity of a block at trial number −1. The red line shows the fitted linear model.
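
The perturbation protocol in panel (A) can be illustrated with a generic rate RNN in which a stimulus along v+, v−, or vrnd is injected during the cue period. The network equations, parameter names, and amplitudes below are assumptions rather than the trained RNN used here.

```python
import numpy as np

def run_rnn_with_perturbation(W, W_in, u, v_pert, pert_window, amp=1.0, dt=0.01, tau=0.1):
    """Simulate a generic rate RNN, tau * dx/dt = -x + W @ tanh(x) + W_in @ u + stim,
    injecting a perturbation along v_pert during pert_window = (start, end) steps.

    W, W_in : recurrent and input weight matrices
    u       : (n_steps, n_inputs) external input (cue, feedback)
    v_pert  : (n_neurons,) perturbation direction, e.g. +v_rev, -v_rev, or a random vector
    """
    n_steps = u.shape[0]
    x = np.zeros(W.shape[0])
    xs = np.zeros((n_steps, W.shape[0]))
    for t in range(n_steps):
        stim = amp * v_pert if pert_window[0] <= t < pert_window[1] else 0.0
        x = x + dt / tau * (-x + W @ np.tanh(x) + W_in @ u[t] + stim)  # Euler step
        xs[t] = x
    return xs
```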

Four types of feedback inputs
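
Given the training setup described in the first caption (feedback determined by the choice and the reward outcome), the four feedback types presumably correspond to the 2 × 2 combinations of choice and reward. The one-hot encoding below is an illustrative assumption.

```python
import numpy as np

def feedback_input(choice, rewarded):
    """One-hot feedback over the four choice-by-outcome combinations:
    (A, reward), (A, no reward), (B, reward), (B, no reward).
    choice: 0 = A, 1 = B; rewarded: 0/1."""
    fb = np.zeros(4)
    fb[2 * choice + (1 - rewarded)] = 1.0
    return fb
```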

Breakdown of R+ and R− by the reward outcomes of two consecutive trials. (A) R+ was decomposed into two components, R+ = R++ + R+−, where R++ indicates two consecutive reward trials and R+− indicates a reward followed by no reward. Left: R++ across trial and time (top). Traces of R++ at individual trials and the fraction of trials whose traces are negative (bottom). Middle: Same as the left panel but for R+−. Right: Same as the other panels but for R+. (B) R− was decomposed into two components, R− = R−+ + R−−, where R−+ indicates no reward followed by a reward and R−− indicates two consecutive no-reward trials. The same analysis as in panel (A) was performed.
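
A sketch of this decomposition by consecutive reward outcomes: trial-to-trial shifts of xrev(t) are grouped by the reward outcomes of the current and next trial. Names and array shapes are assumptions.

```python
import numpy as np

def decompose_shifts(x_rev, rewarded):
    """Group trial-to-trial shifts of x_rev(t) by the reward outcomes of the
    current and next trial, e.g. R++ (reward -> reward) and R+- (reward -> no reward).

    x_rev    : (n_trials, n_timepoints) reversal probability trajectories
    rewarded : (n_trials,) 0/1 reward outcomes
    """
    diff = x_rev[1:] - x_rev[:-1]            # shift from trial k to trial k+1
    cur, nxt = rewarded[:-1], rewarded[1:]
    masks = {"R++": (cur == 1) & (nxt == 1),
             "R+-": (cur == 1) & (nxt == 0),
             "R-+": (cur == 0) & (nxt == 1),
             "R--": (cur == 0) & (nxt == 0)}
    return {label: diff[mask] for label, mask in masks.items()}  # traces at individual trials
```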

Decoding reward outcome and the behavioral reversal trial using neural trajectories encoding reversal probability. (A) Left: Decoding the reward outcome (i.e., reward or no reward) of every trial at each time point from the difference of the neural trajectories of two adjacent trials. At each time point, a 300 ms segment of the trajectories was used for decoding. Right: Decoding accuracy averaged over all trials shown on the left panel. The red dotted line shows the approximate time of the next trial's reward. The gray dotted line shows chance-level performance. (B) Left: Decoding the behavioral reversal trial using neural trajectories of the 20 trials around the reversal trial. The decoding error shows the position of the predicted reversal trial relative to the actual reversal trial. At each time point, a 300 ms segment of each trajectory was used for decoding. Black shows the decoding error when single-trial trajectories were used, and green shows the result when trajectories from 5 randomly chosen blocks were averaged before decoding. The gray dotted line shows chance-level performance. Right: Distance between trajectories was measured by averaging the normalized mean-squared error of adjacent trajectories over all trials. Each dot corresponds to a time point shown on the left panel.
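
A hedged sketch of the reward-outcome decoder in panel (A): a cross-validated linear classifier applied to a 300 ms segment of the adjacent-trial trajectory differences. The choice of classifier, the window size in samples, and the cross-validation scheme are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def decode_reward_from_trajectory_diff(x_rev, rewarded, t_idx, win=30):
    """Decode the reward outcome of each trial from the difference of the
    reversal probability trajectories of adjacent trials, using a segment of
    `win` samples (300 ms at the assumed sampling rate) starting at time index t_idx.

    x_rev    : (n_trials, n_timepoints) trajectories
    rewarded : (n_trials,) 0/1 reward outcomes (label of the earlier trial)
    """
    diff = x_rev[1:] - x_rev[:-1]                     # adjacent-trial differences
    X = diff[:, t_idx:t_idx + win]                    # 300 ms segment as features
    y = rewarded[:-1]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()    # cross-validated decoding accuracy
```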