Figures and data

Comparison of the behavior of trained RNNs and monkeys.
(A) Schematic of the RNN training setup. In a trial, the network makes a choice in response to a cue. Then, a feedback input, determined by the choice and reward outcome, is injected into the network. This procedure is repeated across trials. The panel on the right shows this sequence of events unfolding in time within a trial. (B) Left: Example of a trained RNN’s choice outcomes. Vertical bars show the RNN’s choice in each trial and the reward outcome (magenta: choice A, blue: choice B, light: rewarded, dark: not rewarded). Horizontal bars on the top show the reward schedules (magenta: choice A is rewarded with probability 70% and choice B with probability 30%; blue: the reward schedule is reversed). The black curve shows the RNN output. Green horizontal bars show the posterior of the reversal probability at each trial, inferred using a Bayesian model. Right: Schematic of the RNN training scheme. The scheduled reversal indicates the trial at which the reward probabilities of the two options switch (color codes for magenta and cyan are the same as in the left panel). The inferred reversal is the scheduled reversal trial inferred from the Bayesian model. The behavioral reversal is determined by adding a few delay trials to the inferred reversal trial. The target output, on which the RNN outputs are trained, switches at the behavioral reversal trial. (C) Probability of choosing the initial best (i.e., high-value) option. Relative trial indicates the trial number relative to the behavioral reversal trial inferred from the Bayesian model. Relative trial number 0 is the trial at which the choice was reversed. Shaded region shows the S.E.M. (standard error of the mean) over blocks in all the sessions (monkeys) or networks (RNNs). (D) Fraction of no-reward blocks as a function of relative trial. Dotted lines show 0.3 and 0.7. Shaded region shows the S.E.M. over blocks in all the sessions (monkeys) or networks (RNNs). (E) Distribution of the RNNs’ and monkeys’ reversal trials, relative to the experimentally scheduled reversal trial.
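To make the idea of inferring a reversal trial from a block of outcomes concrete, the following is a minimal sketch, not the Bayesian model used in the study. It assumes a uniform prior over the reversal trial, a likelihood based only on the reward outcomes given the choices, and the 70%/30% schedule described in panel (B). The function name `reversal_posterior` and the encoding of choices and rewards are hypothetical.

```python
import numpy as np

def reversal_posterior(choices, rewards, p_high=0.7):
    """Posterior over the reversal trial r in {0, ..., T} under a uniform prior.

    Assumptions (illustrative only): choices are coded 0 (A) or 1 (B),
    rewards are coded 0/1, option A is rewarded with probability p_high
    on trials before r, and the probabilities are swapped from trial r on.
    """
    choices = np.asarray(choices)
    rewards = np.asarray(rewards)

    # Reward probability of the chosen option under each schedule.
    p_pre = np.where(choices == 0, p_high, 1.0 - p_high)
    p_post = 1.0 - p_pre

    # Per-trial log-likelihood of the observed outcome under each schedule.
    ll_pre = np.log(np.where(rewards == 1, p_pre, 1.0 - p_pre))
    ll_post = np.log(np.where(rewards == 1, p_post, 1.0 - p_post))

    # log p(data | r) = sum(ll_pre[:r]) + sum(ll_post[r:]) for every r.
    cum_pre = np.concatenate([[0.0], np.cumsum(ll_pre)])
    cum_post = np.concatenate([[0.0], np.cumsum(ll_post)])
    log_lik = cum_pre + (cum_post[-1] - cum_post)

    # Normalize in a numerically stable way.
    post = np.exp(log_lik - log_lik.max())
    return post / post.sum()

# Example: option A is always chosen; it is rewarded for 10 trials, then not.
posterior = reversal_posterior([0] * 20, [1] * 10 + [0] * 10)
inferred_reversal = int(np.argmax(posterior))
```

Under this sketch, the behavioral reversal described in panel (B) would be obtained by adding a few delay trials to `inferred_reversal`; the number of delay trials is not specified here.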

Neural trajectories encoding choice and reversal probability variables.
(A) Neural trajectories of PFC (top) and RNN (bottom) obtained by projecting population activity onto task vectors encoding choice and reversal probability. Trial numbers indicate positions relative to the behavioral reversal trial. Neural trajectories in each trial were averaged over 8 experimental sessions and 23 blocks for the PFC, and over 40 networks and 20 blocks for the RNNs. Black squares indicate the time of cue onset. (B-C) Neural activity encoding reversal probability and choice in PFC (top) and RNN (bottom) at the time of cue onset (black squares in panel A) around the behavioral reversal trial. Shaded region shows the S.E.M. over sessions (or networks) and blocks.
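The projection step in panel (A) can be sketched as follows. This is an illustrative example that assumes the two task vectors are already available (how they are estimated is not described in this caption) and that population activity is stored as an array of shape (trials, time, neurons). The function name and the unit-norm scaling of the vectors are assumptions.

```python
import numpy as np

def project_onto_task_vectors(activity, v_choice, v_rev):
    """Project population activity onto choice and reversal-probability vectors.

    activity: array of shape (trials, time, neurons).
    v_choice, v_rev: task vectors of length `neurons` (assumed given).
    Returns an array of shape (trials, time, 2) holding xchoice(t) and xrev(t).
    """
    activity = np.asarray(activity, dtype=float)
    # Stack the two vectors as columns and scale each to unit norm (assumption).
    V = np.stack([v_choice, v_rev], axis=1).astype(float)
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    return activity @ V

# Example with synthetic activity: 3 trials, 5 time points, 4 neurons.
rng = np.random.default_rng(0)
activity = rng.standard_normal((3, 5, 4))
traj = project_onto_task_vectors(activity, [1, 0, 0, 0], [0, 2, 0, 0])
x_choice, x_rev = traj[..., 0], traj[..., 1]
```

Averaging `traj` over sessions and blocks, as described in the caption, would then give one two-dimensional trajectory per relative trial.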

Integration of reward outcomes drives reversal probability activity.
(A) The reversal probability activity of PFC (orange) and the prediction of the reward integration equation (blue) at the time of cue onset across trials around the behavioral reversal trial. Three example blocks are shown. The Pearson correlation between the actual and predicted PFC activity is shown on each panel. Relative trial number indicates the trial position relative to the behavioral reversal trial. (B)

Augmented model for reversal probability activity.
(A) Schematic of two activity modes of the reversal probability activity. Left: Stationary mode (line attractor), where xrev(t) remains constant during a trial, and non-stationary mode, where xrev(t) is dynamic. Right: Augmentation of stationary and non-stationary activity modes, where the stationary mode leads the non-stationary mode in time. The time derivative dxrev/dt is shown to demonstrate the (non-)stationarity of the activity. (B) Left: Block-averaged dxrev/dt of PFC across trial and time. Dotted red lines indicate the onset times of fixation (−0.5 s), cue (0 s) and reward (0.8 s); the same lines are shown on the right. Right: dxrev/dt averaged over all trials (white), together with the trajectories of 5 trials around the reversal trial (colored). (C) Left: Contraction factor of xrev of PFC at different time points. The dotted line at 1 indicates the threshold between contraction and expansion. Right: Contraction factor of PFC xrev of individual trials in the time interval between −2.5 s and −1 s. (D) Block-averaged dxrev/dt of RNNs at the pre-reversal (left) and post-reversal (right) trials. Note that the sign of the post-reversal trial trajectories was flipped to match the shape of the pre-reversal trajectories. Dotted red lines indicate the times of fixation, cue-off and reward. (E) Contraction factor of xrev of the RNNs. The results are similar to those in panel (C). (F) Generating PFC non-stationary reversal probability trajectories from the stationary activity using support vector regression (SVR) models. Top: Trajectories generated from the SVR compared to the PFC reversal probability trajectories in trials around the reversal trial in an example block. The initial state (green) is the input to the SVR model, which then predicts the rest of the trajectory. The normalized mean-squared error (MSE) between the SVR trajectory (prediction, red) and the PFC trajectory (data, black) is shown in each trial. Bottom: Trajectories generated from the null SVR compared to the PFC reversal probability trajectories. The initial states of trials in a block were shuffled randomly prior to training the null SVR model. The trajectories predicted from the null SVR model (blue) are compared to the PFC reversal probability trajectories (black). (G) The normalized MSE of all trials in the test dataset. (H) Difference between the normalized MSE of the SVR and null models, calculated for each trial.
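The comparison in panels (F) to (H) rests on three simple operations: shuffling the initial states within a block to build the null model's inputs, scoring each predicted trajectory with a normalized MSE, and taking the per-trial difference between the two models. The sketch below illustrates these steps only; it does not fit an SVR. The normalization by the variance of the data trajectory is an assumption, since the caption does not define it, and all function names are hypothetical.

```python
import numpy as np

def normalized_mse(pred, data):
    """Per-trial MSE divided by the variance of the data trajectory.

    pred, data: arrays of shape (trials, time). The variance normalization
    is an assumed convention, not taken from the caption.
    """
    pred = np.asarray(pred, dtype=float)
    data = np.asarray(data, dtype=float)
    mse = np.mean((pred - data) ** 2, axis=-1)
    return mse / np.var(data, axis=-1)

def shuffle_initial_states(initial_states, rng):
    """Null-model inputs: permute the initial states across trials of a block."""
    initial_states = np.asarray(initial_states)
    return initial_states[rng.permutation(len(initial_states))]

def mse_difference(pred_model, pred_null, data):
    """Per-trial difference; negative values mean the model beats the null."""
    return normalized_mse(pred_model, data) - normalized_mse(pred_null, data)

# Example: a perfect prediction versus a flat prediction at the trial mean.
data = np.array([[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0]])
flat = np.repeat(data.mean(axis=1, keepdims=True), data.shape[1], axis=1)
diff = mse_difference(data, flat, data)
shuffled = shuffle_initial_states(data[:, 0], np.random.default_rng(0))
```

With this normalization, a prediction that is constant at the mean of the data trajectory scores exactly 1, which gives a natural reference point for reading panel (G).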

Dynamic neural trajectories encoding reversal probability are separated in response to reward outcomes.
(A) Left: xrev(t) of PFC at the current trial (black) is compared to xrev(t) in the next trial when reward is received (top, red) and not received (bottom, blue). Right: The difference of xrev(t) between the current and next trials shown on the left panels. Shaded region shows the S.E.M. across all trials, blocks and sessions. (B) Difference of xrev of two adjacent trials when reward is not received (top, R−) or received (bottom, R+). The approximate time of reward outcome is shown. Relative trial number indicates the trial position relative to the behavioral reversal trial. (C) Left: xrev(t) of PFC of consecutive no-reward trials before the behavioral reversal trial (top) and consecutive reward trials after the behavioral reversal (bottom). The initial value was subtracted to compare the ramping rates of xrev(t). Right: Difference in the ramping rates of trajectories of adjacent trials, when reward was received (blue) and not received (red). (D-E) Same as the right panels in (A) and (C) but for trained RNNs. (F) Left, Middle: External (left) and recurrent (middle) inputs to the RNN reversal probability dynamics, when reward was not received (red, magenta) or was received (blue, cyan). Right: The amplification factor is the ratio of the total input when no reward (or reward) was received to the total reference input. The amplification factors for both the external (red, blue) and recurrent (magenta, cyan) inputs are shown. The red and magenta curves overlap, as do the blue and cyan curves.

Mean trajectories encoding reversal probability shift monotonically across trials.
(A) Traces of

Perturbing RNN’s neural activity encoding reversal probability biases choice outcomes.
(A) RNN perturbation scheme. Three perturbation stimuli were used: v+, the population vector encoding the reversal probability; v−, the negative of v+; and vrnd, a control stimulus in a random direction. Perturbation stimuli were applied at the reversal trial (0) and the two preceding trials (−2, −1). (B) Deviation of reversal probability activity Δxrev and choice activity Δxchoice from the unperturbed activity. The perturbation was applied at the reversal trial during the time interval in which the cue was presented (shaded red). The choice was made after a short delay (shaded gray). Perturbation responses along the reversal probability vector v+ (solid) and the random vector vrnd (dotted) are shown. (C) Perturbation of reversal probability activity (left) and choice activity (right) in response to the three types of perturbation stimuli. Each dot shows the response of a perturbed network. Two perturbation strengths (multiplicative factors of 3 and 4, shown in panels D and E) were applied to 40 RNNs. Δxrev shows the activity averaged over the duration of the perturbation, and Δxchoice shows the activity averaged over the duration of the choice. Δxchoice of v+ is significantly smaller than Δxchoice of v− (KS-test, p-value = 0.007). (D-E) Fraction of blocks in all 40 trained RNNs that exhibited delayed or accelerated reversal trials in response to perturbations of the reversal probability activity. Perturbations at trial number −1 by the three stimulus types are shown on the left panels, and perturbations at all three trials by the stimulus of interest (v− in D and v+ in E) are shown on the right panels. The multiplicative factor on the perturbation stimuli is shown as the stimulus strength. (F) Left: The slope of the linear regression model fitted to the residual activity of reversal probability and choice. The residual activity at each trial over the time interval [0, 500] ms was used to fit the linear model. The red dot indicates the slope at trial number −1. Relative trial number indicates the trial position relative to the behavioral reversal trial. Right: Each dot is the residual activity of a block at trial number −1. The red line shows the fitted linear model, and its slope (−0.34) is shown.
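The regression in panel (F) can be illustrated with a short sketch. It assumes that the residual activity of a block is its deviation from the mean across blocks at the same trial, and that each block contributes one value per variable (activity averaged over the [0, 500] ms interval). Both are assumptions made for illustration, and `residual_slope` is a hypothetical name.

```python
import numpy as np

def residual_slope(x_rev, x_choice):
    """Slope of a line fitted to residual choice vs. residual reversal activity.

    x_rev, x_choice: one time-averaged value per block at a given trial.
    Residuals are taken as deviations from the across-block mean (assumption).
    """
    x_rev = np.asarray(x_rev, dtype=float)
    x_choice = np.asarray(x_choice, dtype=float)
    r_rev = x_rev - x_rev.mean()
    r_choice = x_choice - x_choice.mean()
    slope, _intercept = np.polyfit(r_rev, r_choice, deg=1)
    return slope

# Example with noise-free synthetic data built to have a slope of -0.34.
x_rev = np.linspace(-1.0, 1.0, 11)
x_choice = -0.34 * x_rev + 2.0
slope = residual_slope(x_rev, x_choice)
```

Repeating this fit at each relative trial would give one slope per trial, which is the quantity plotted in the left panel under these assumptions.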

Four types of feedback inputs

Breakdown of R+ and R− by the reward outcomes of two consecutive trials.
(A) R+ was decomposed into two components, R+ = R++ + R+−, where R++ indicates two consecutive rewarded trials and R+− indicates a reward followed by no reward. Left: R++ across trial and time (top). Traces of R++ at individual trials and the fraction of trials whose traces are negative (bottom). Middle: Same as the left panel but for R+−. Right: Same as the other panels but for R+. (B) R− was decomposed into two components, R− = R−+ + R−−, where R−+ indicates no reward followed by a reward and R−− indicates two consecutive no-reward trials. The same analysis as in panel (A) was performed.

Decoding reward outcome and the behavioral reversal trial using neural trajectories encoding reversal probability.
(A) Left: Decoding the reward outcome (i.e., reward or no reward) of every trial at each time point, given the difference between the neural trajectories of two adjacent trials. At each time point, a 300 ms segment of the trajectories was used for decoding. Right: Decoding accuracy averaged over all trials shown on the left panel. The red dotted line shows the approximate time of the next trial’s reward. The gray dotted line shows chance-level performance. (B) Left: Decoding the behavioral reversal trial using the neural trajectories of 20 trials around the reversal trial. Decoding error shows the position of the predicted reversal trial relative to the actual reversal trial. At each time point, a 300 ms segment of each trajectory was used for decoding. Black shows the decoding error when single-trial trajectories were used, and green shows the result when the trajectories of 5 randomly chosen blocks were averaged before decoding. The gray dotted line shows chance-level performance. Right: Distance between trajectories, measured as the average normalized mean-squared error of adjacent trajectories over all trials. Each dot corresponds to a time point shown on the left panel.
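The sliding-segment decoding in panel (A) can be sketched as follows. The caption does not name the decoder, so this example swaps in a nearest-centroid classifier with leave-one-out cross-validation purely for illustration. The window length in samples (3 samples, which would correspond to 300 ms at an assumed 100 ms bin width), the label coding and the function name are all assumptions.

```python
import numpy as np

def sliding_window_accuracy(diff_traj, labels, window):
    """Decode a binary label from short segments of difference trajectories.

    diff_traj: array (trials, time), the difference of xrev(t) between
               adjacent trials.
    labels:    array (trials,), 0 = no reward, 1 = reward (assumed coding).
    window:    segment length in samples.
    Uses a nearest-centroid classifier with leave-one-out cross-validation
    (a stand-in decoder); each class needs at least two trials.
    """
    diff_traj = np.asarray(diff_traj, dtype=float)
    labels = np.asarray(labels)
    n_trials, n_time = diff_traj.shape
    n_windows = n_time - window + 1
    acc = np.zeros(n_windows)
    for s in range(n_windows):
        X = diff_traj[:, s:s + window]
        correct = 0
        for i in range(n_trials):
            train = np.arange(n_trials) != i  # hold out trial i
            c0 = X[train & (labels == 0)].mean(axis=0)
            c1 = X[train & (labels == 1)].mean(axis=0)
            pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
            correct += int(pred == labels[i])
        acc[s] = correct / n_trials
    return acc

# Synthetic example: the two classes separate only after time index 10.
rng = np.random.default_rng(0)
labels = np.arange(40) % 2
diff_traj = 0.1 * rng.standard_normal((40, 20))
diff_traj[labels == 1, 10:] += 1.0
acc = sliding_window_accuracy(diff_traj, labels, window=3)
```

In this synthetic example the accuracy reaches its ceiling only for windows that fall after the separation point, which mirrors the way the caption relates decoding accuracy to the time of the reward outcome.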

Comparison of RNNs trained with and without fixation.
(A) RNNs trained without fixation. Right: The choice output of the RNNs oscillates. Left, Middle: The derivative of the reversal probability activity, dxrev/dt, does not converge to 0 during the early part of a trial (start to cue-off). As the cue is turned on, dxrev/dt fluctuates with the cue. The white line shows dxrev/dt averaged over all pre-reversal (left) and post-reversal (middle) trials. (B) RNNs trained with the choice output fixed at 0 before making a choice. Specifically, during the time interval between the fixation and cue-off lines shown in the left and middle panels, the choice output was trained to be fixed at 0. Right: The choice output of the RNNs is flat when they are not making choices. Left, Middle: The derivative of the reversal probability activity, dxrev/dt, converges to 0 during the early part of a trial (fixation to cue-on). As the cue is turned on, dxrev/dt fluctuates more mildly than in the RNNs trained without fixation. The white line shows dxrev/dt averaged over all pre-reversal (left) and post-reversal (middle) trials.