Figures and data

Two-stage decision task performance.
A, Timeline of events. Eye fixation was required while a red fixation cue was shown; otherwise, subjects could saccade freely and indicate their decision (arrow as an example) by moving a manual joystick in the direction of the chosen stimulus. Once the second-stage choice had been made, the nature of the outcome was revealed by a secondary reinforcer cue (here, the pause symbol represents high outcome). Once the latter cue was off the screen, there was a fixed 500 ms delay and the possibility of a further delay (for both medium and low outcomes) before juice was provided (for both high and medium outcomes). B, The state-transition structure (kept fixed throughout the experiment). Each second-stage stimulus had an independent reward structure: the outcome level (defined by the magnitude of the reward and the delay to its delivery) remained the same for a minimum number of trials (a uniformly distributed pseudorandom integer between 5 and 9) and then either stayed at the same level (with one-third probability) or changed randomly to one of the other two possible outcome levels. C, Likelihood of first-stage choice repetition, averaged across sessions, as a function of reward and transition on the previous trial. Error bars depict SEM. D-E, Logistic regression results on first-stage choice with the contributions of the reward main effect (D) and reward × transition (E) from the five previous trials. Dots represent fixed-effects coefficients for each session (red when p < 0.05, grey otherwise).
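The outcome schedule described in panel B can be illustrated with a minimal sketch. The function name, the level labels, the starting level, and the assumption that a switch picks either of the other two levels with equal probability are illustrative choices, not details taken from the task code.

```python
import random

LEVELS = ("low", "medium", "high")  # hypothetical labels for the three outcome levels


def simulate_outcome_levels(n_trials, rng, start="medium"):
    """Simulate the outcome level of one second-stage stimulus on each trial."""
    levels = []
    current = start  # assumed starting level
    while len(levels) < n_trials:
        # The level is held for a pseudorandom minimum of 5 to 9 trials (inclusive).
        hold = rng.randint(5, 9)
        levels.extend([current] * hold)
        # It then stays with probability 1/3; otherwise it switches to one of
        # the other two levels (assumed here to be equally likely).
        if rng.random() >= 1 / 3:
            current = rng.choice([lv for lv in LEVELS if lv != current])
    return levels[:n_trials]


schedule = simulate_outcome_levels(200, random.Random(0))
```

Because each stimulus has an independent reward structure, one such schedule would be drawn separately for every second-stage stimulus.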

The contents of GLM1. Far right column indicates which figure panels each regressor was used for.

The contents of GLM2. Far right column indicates which figure panels each regressor was used for.

The contents of GLM3. Far right column indicates which figure panels each regressor was used for.

ACC is the strongest reward coding region
A, Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the reward received on the previous trial. Feedback is shown both for the previous trial (left) and the current trial (right). Solid horizontal lines represent periods where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, cluster-based permutation test). Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. All epochs are 0-500 ms, with the exception of the first feedback epoch, which is 0-1000 ms. B, Percent of neurons within each region that significantly encoded reward in the feedback epoch and epochs of the subsequent trial (p<0.05, cluster-based permutation test). Solid horizontal lines represent periods where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, chi-squared test). C, The time (relative to feedback onset) at which the peak encoding of the secondary reinforcer occurred, across significant encoding neurons. ***, p<0.001, independent t-test. D, Average coefficients for encoding the value of the secondary reinforcer. Solid horizontal lines represent periods where the coefficients in a particular region differed from 0 (p<0.05, 1-sample t-test). E, Same as in D but for the percentage of significant neurons. Dashed lines indicate the net percentage of significant neurons (positive – negative percentages). F, Average coefficients across Caudate neurons for the reward received on the current trial and the previous two trials. Inset, the average activity for each of the three conditions at the time point of peak reward encoding. **, ***, p<0.01, 0.001, 1-sample t-test against 0. G, Same as F but for Putamen neurons. H, Same as in A, but for encoding of the prediction error at feedback.
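For readers unfamiliar with the CPD, the following sketch uses its standard definition: the fractional drop in residual sum of squares when a regressor is added to a model that already contains all the others. The helper names and the toy data are hypothetical, and the fit is a plain least-squares regression for one neuron in one time bin rather than the analysis pipeline behind these panels.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sse(X, y):
    """Residual sum of squares of an ordinary least-squares fit of y on X."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


def cpd(X, y, col):
    """CPD of regressor `col`: (SSE_reduced - SSE_full) / SSE_reduced."""
    reduced = [[v for j, v in enumerate(row) if j != col] for row in X]
    sse_reduced = sse(reduced, y)
    return (sse_reduced - sse(X, y)) / sse_reduced


# Toy data (hypothetical): firing rate depends on reward but not on transition.
reward = [0, 0, 1, 1, 0, 0, 1, 1]
transition = [0, 1, 0, 1, 0, 1, 0, 1]
noise = [1, -1, -1, 1, -1, 1, 1, -1]
rate = [2 + 3 * r + e for r, e in zip(reward, noise)]
X = [[1.0, r, t] for r, t in zip(reward, transition)]  # intercept, reward, transition

cpd_reward = cpd(X, rate, 1)
cpd_transition = cpd(X, rate, 2)
```

In this toy example the CPD for reward is large and the CPD for transition is essentially zero, which is the pattern a purely reward-coding neuron would show.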

ACC encodes transition information until feedback
A, Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the transition that occurred. Solid horizontal lines represent periods where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, cluster-based permutation test). Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. All epochs are 0-1000 ms. B, Percent of neurons within each region that significantly encoded transition (p<0.05, cluster-based permutation test). C, D, Same as A, B, but for encoding of the interaction of previous reward and transition. E, ACC coefficients for transition and feedback were correlated (coefficients taken from 300 ms post onset of each epoch). In A, B, and D, solid horizontal lines indicate periods where ACC was significantly greater than all three other regions (A: permutation test, B and D: chi-squared test, p<0.05/3).

Value estimate encoding was predominantly model-based
A, Percentage of cells in each region that encoded the model-based (MB) derived estimates of the value of each of the choice 1 options. Solid horizontal lines indicate periods where the percentage of neurons was significant (p<0.05, binomial test). B, Same as in A, but for model-free (MF) derived estimates of each option’s value. C, D, Same as in A, B but for the average CPD across neurons in a region. Dashed lines represent the 95% confidence interval determined by permutation testing. Solid lines indicate periods where the strength of encoding was significant (p<0.05, cluster-based permutation test). E, Distribution of the peak CPD values (i.e., the highest CPD value observed over the epoch shown in A-D for either PicA or PicB cues) for each neuron, for the MB (blue, as in C) and MF (orange, as in D) value estimates. Horizontal lines indicate median and extrema. *, ***, p<0.05, 0.001, paired t-test. F, The percentage of neurons in each region that significantly encoded a MB estimate of the chosen option’s value (left), a MF estimate (middle), or both (right), assessed using cluster-length permutation testing. Coloured asterisks indicate that the population is significantly greater than 10% (blue and orange) or 1% (green), binomial test. Asterisks between bars indicate a difference in size between the two populations (*, ***, p<0.05, 0.001, chi-squared test).
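The binomial tests on the percentage of encoding neurons can be illustrated with a one-sided tail probability. This is a sketch only: the function name and the neuron counts are hypothetical, and the chance level must be supplied by the user (10% and 1% are the levels named for panel F; the 5% used below is an assumption).

```python
from math import comb


def binomial_p_upper(k, n, p0):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Hypothetical example: 18 of 120 neurons are significant in one time bin,
# tested against an assumed chance level of 5%.
p_value = binomial_p_upper(18, 120, 0.05)
is_significant = p_value < 0.05
```

The same function covers the panel F comparisons by passing 0.10 or 0.01 as `p0`.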

Direction of the encoding of chosen and unchosen choice 1 options, depending on the transition that occurred.
A, Average coefficients across neurons in the ACC with respect to the value of chosen (blue) and unchosen (orange) options of choice 1 in common trials. Value estimates were calculated using a hybrid of MB and MF estimates derived from each monkey’s behaviour. Red horizontal line indicates portions where the coefficients in the two conditions differed significantly from one another (p<0.05, paired t-test). B, Same as in A, but for trials where a rare transition occurred. C-H, Same as in A-B but for DLPFC, Caudate, and Putamen, respectively.

Choice 1 was encoded by neurons sensitive to reward and transition
A, Average coefficient of partial determination (CPD) across neurons in each region for the encoding of choice 1. Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. B, Same as A but for encoding of the interaction of reward, transition, and choice 1, all from the previous trial. C, A support vector machine was used to decode from each neural population which cue the monkeys would choose at choice 1 on each trial. Dashed horizontal line indicates the 95% confidence interval (permutation test) and solid horizontal lines indicate periods of significant decoding (p<0.05, cluster-based permutation test). Dashed vertical line indicates the time at which the subjects made their choice (0 ms). D, E, Same as in C but neurons were median split into two groups depending on the strength with which they encoded reward at feedback and transition at transition. F, Difference in decoder strength between A and B. Solid horizontal line indicates periods of significant difference assessed using permutation test.

Locations of each neuron recorded from subject C

Locations of each neuron recorded from subject J

Temporal autocorrelations of neuronal activity did not confound analyses.
A, To check for the effect of autocorrelations over time, we repeated the CPD analysis in Figures 3, 4, and 7 but using trial data from a different session to the neural data. Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the reward received on the previous trial (of a different session). Feedback is shown both for the previous trial (left) and the current trial (right). Solid horizontal lines represent p<0.05, assessed using permutation testing against the null distribution (indicated by dashed horizontal line). B-E, Same as in A, but for the encoding of transition, the interaction of reward and transition, choice 1, and the interaction of previous reward, previous transition, and previous choice 1. F, The percentage of neurons that were found to significantly encode each parameter in its respective epoch, assessed using cluster-based permutation testing (p<0.05). n.s., not significant, binomial test against 0.05, Bonferroni-corrected for the 4 regions tested.

A model-based strategy was best to solve the task.
The amount of reward attained by an agent that either always chose the option with the highest MB-derived estimate of value (‘MB policy’) or the highest MF-derived estimate of value (‘MF policy’). This was compared to an agent that chose randomly (‘Random policy’). The average reward per trial over all recording sessions was used (n=57). ***, p<0.001, paired t-test.

Reward coding and transition coding are correlated
A-D, Correlation of coefficients for transition at the transition epoch (x-axis) with reward at the feedback epoch (y-axis) for ACC, DLPFC, Caudate, and Putamen neurons, respectively. Non-significant points have been removed for clarity (white). Top right numbers indicate the extreme values of each plot (Pearson r).
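The statistic reported in each panel is a Pearson correlation between two per-neuron coefficient vectors. A minimal sketch, with hypothetical coefficient values standing in for the regression output:

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-neuron coefficients (one entry per neuron in a region).
transition_coef = [0.10, -0.25, 0.40, 0.05, -0.30]
reward_coef = [0.20, -0.10, 0.55, 0.00, -0.35]
r = pearson_r(transition_coef, reward_coef)
```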

Coding of the interaction of choice 1, transition and reward from the previous trial.
A, Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the interaction of reward and transition from the previous trial at the time the subjects made their first choice on the next trial. Dashed horizontal lines indicate the 95th percentile of the null distribution and solid horizontal lines indicate periods of significant coding (p<0.05, cluster-based permutation test). B-D, Same as in A but for the interaction between reward and choice 1, transition and choice 1, and the triple interaction between all three variables.

ACC encoded the upcoming choice during the preceding fixation period
A, A support vector machine was used to decode from each neural population which cue the monkeys would choose at choice 1 on each trial. Decoding was restricted to trials where they had received a high reward on the previous two trials. Dashed lines represent the 95% confidence interval (permutation test). B, Same as in A but only for trials following two consecutive low/medium reward trials.