Neural signatures of model-based and model-free reinforcement learning across prefrontal cortex and striatum
Figures
Two-stage decision task performance.
(A) Timeline of events. Eye fixation was required while a red fixation cue was shown; otherwise subjects could saccade freely and indicate their decision (arrow as an example) by moving a manual joystick in the direction of the chosen stimulus. Once the second-stage choice had been made, the nature of the outcome was revealed by a secondary reinforcer cue (here, the pause symbol represents high outcome). Once the latter cue was off the screen, there was a fixed 500 ms delay and the possibility of a further delay (for both medium and low outcomes) before juice was provided (for both high and medium outcomes). (B) The state-transition structure (kept fixed throughout the experiment). Each second-stage stimulus had an independent reward structure: the outcome level (defined by the magnitude of the reward and the delay to its delivery) remained the same for a minimum number of trials (a uniformly distributed pseudorandom integer between 5 and 9) and then either stayed at the same level (with one-third probability) or changed randomly to one of the other two possible outcome levels. (C) Likelihood of first-stage choice repetition, averaged across sessions, as a function of reward and transition on the previous trial. (D–E) Logistic regression results on first-stage choice with the contributions of the reward main effect (D) and reward×transition (E) from the five previous trials. *, **, p<0.05, 0.01, respectively. In C–E, error bars depict standard error of the mean (n=30 and 27 for subjects C and J, respectively).
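The reward schedule described in panel B can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function name `simulate_outcome_levels`, the random starting level, and the choice to draw a fresh minimum block length after a "stay" are all hypothetical, since the legend does not specify them.

```python
import random

LEVELS = ("high", "medium", "low")

def simulate_outcome_levels(n_trials, rng, stay_prob=1 / 3):
    """Return one outcome level per trial for a single second-stage stimulus."""
    level = rng.choice(LEVELS)  # assumption: the starting level is random
    schedule = []
    while len(schedule) < n_trials:
        # Minimum number of trials at this level: uniform integer in [5, 9].
        block_len = rng.randint(5, 9)
        schedule.extend([level] * block_len)
        # With probability 1/3 the level stays; otherwise it changes to one
        # of the other two levels, chosen at random.
        if rng.random() >= stay_prob:
            level = rng.choice([lv for lv in LEVELS if lv != level])
    return schedule[:n_trials]

rng = random.Random(0)
schedule = simulate_outcome_levels(200, rng)
```

Because a stay simply starts another block at the same level, every completed run of identical outcomes in this sketch lasts at least five trials.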
A model-based (MB) strategy was best to solve the task.
The amount of reward attained by an agent that either always chose the option with the highest MB-derived estimate of value (‘MB policy’) or the highest MF-derived estimate of value (‘MF policy’). This was compared to an agent that chose randomly (‘random policy’). The average reward per trial over all recording sessions was used (n=57). ***, p<0.001, paired t-test.
Anterior cingulate cortex (ACC) is the strongest reward coding region.
(A) Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the reward received on the previous trial. Feedback is shown both for the previous trial (left) and the current trial (right). Solid horizontal lines represent periods where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, cluster-based permutation test). Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. All epochs are 0–500 ms with the exception of 0–1000 ms for the first feedback epoch. (B) The percentage of neurons within each region that significantly encoded reward during the feedback epoch and epochs of the subsequent trial (p<0.05, cluster-based permutation test). Solid horizontal line represents period where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, chi-squared test). (C) The time (relative to feedback onset) at which the peak encoding of the secondary reinforcer occurred, across significant encoding neurons. ***, p<0.001, independent t-test. (D) Average coefficients for encoding the value of the secondary reinforcer. Solid horizontal lines represent periods where the coefficients in a particular region differed from 0 (p<0.05, one-sample t-test). (E) Same as in D but for the percentage of significant neurons. Dashed lines indicate the net percentage of significant neurons (positive – negative percentages). (F) Average coefficients across ACC neurons for the reward received on the current trial and the previous two trials. Inset, the average activity for each of the three conditions at the time point of peak reward encoding. **, ***, p<0.01, 0.001, one-sample t-test against 0. (G–I) Same as F but for dorsolateral prefrontal cortex (DLPFC), caudate, and putamen neurons, respectively. In all cases, error bars depict standard error of the mean across neurons (n=240, 187, 115, and 119 for ACC, DLPFC, caudate, and putamen, respectively).
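For readers unfamiliar with the CPD, the following sketch shows one common way to compute it: the proportional reduction in residual sum of squares when a regressor is added to a model containing all the others. The legend does not give the formula, so this definition, the helper names `sse` and `cpd`, and the toy data are assumptions for illustration only.

```python
import numpy as np

def sse(X, y):
    """Sum of squared residuals from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def cpd(X, y, col):
    """Share of otherwise-unexplained variance accounted for by column `col`."""
    sse_full = sse(X, y)
    sse_reduced = sse(np.delete(X, col, axis=1), y)  # model without that regressor
    return (sse_reduced - sse_full) / sse_reduced

# Toy data (hypothetical): a firing rate that depends on reward but not transition.
rng = np.random.default_rng(0)
n_trials = 500
reward = rng.choice([-1.0, 0.0, 1.0], size=n_trials)
transition = rng.choice([-1.0, 1.0], size=n_trials)
X = np.column_stack([np.ones(n_trials), reward, transition])
rate = 5.0 + 2.0 * reward + rng.normal(0.0, 1.0, size=n_trials)

cpd_reward = cpd(X, rate, col=1)
cpd_transition = cpd(X, rate, col=2)
```

In this toy case the CPD for reward is large and the CPD for transition is close to zero, which is the pattern expected when only reward drives the response.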
Temporal autocorrelations of neuronal activity did not confound analyses.
(A) To check for the effect of autocorrelations over time, we repeated the coefficient of partial determination (CPD) analysis in Figures 3, 4, and 7 but using trial data from a different session to the neural data. Average CPD across neurons in each region for the encoding of the reward received on the previous trial (of a different session). Feedback is shown both for the previous trial (left) and the current trial (right). Solid horizontal lines represent p<0.05 assessed using permutation testing against the null distribution (indicated by dashed horizontal line). Error bars depict standard error of the mean. (B–E) Same as in A but for the encoding of transition, the interaction of reward and transition, choice 1, and the interaction of previous reward, previous transition, and previous choice 1. (F) The percentage of neurons that were found to significantly encode each parameter in its respective epoch assessed using cluster-based permutation testing (p<0.05). n.s., not significant, binomial test against 0.05 Bonferroni corrected for the four regions tested.
Anterior cingulate cortex (ACC) encodes transition information until feedback.
(A) Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the transition that occurred. Solid horizontal lines represent periods where that area’s CPD differed significantly from all other areas’ CPDs (p<0.05, cluster-based permutation test). Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. All epochs are 0–1000 ms. Error bars depict standard error of the mean. (B) The percentage of neurons within each region that significantly encoded transition (p<0.05, cluster-based permutation test). (C, D) Same as A, B but for encoding of the interaction of previous reward and transition. (E) ACC coefficients for transition and feedback were correlated (coefficients taken from 300 ms post onset of each epoch). In A, B, and D, solid horizontal line indicates periods where ACC was significantly greater than all three other regions (A: permutation test, B and D: chi-squared test, p<0.05/3).
Reward coding and transition coding are correlated.
(A–D) Correlation of coefficients for transition at the transition epoch (x-axis) with reward at the feedback epoch (y-axis) for anterior cingulate cortex (ACC), dorsolateral prefrontal cortex (DLPFC), caudate, and putamen neurons, respectively. Nonsignificant points have been removed for clarity (white). Top right numbers indicate the extreme values of each plot (Pearson’s r).
Value estimate encoding was predominantly model-based (MB).
(A) Percentage of cells in each region that encoded the MB-derived estimates of the value of each of the choice 1 options. Solid horizontal lines indicate periods where the percentage of neurons was significant (p<0.05, binomial test). (B) Same as in A but for model-free (MF)-derived estimates of each option’s value. (C, D) Same as in A, B but for the average coefficient of partial determination (CPD) across neurons in a region. Dashed lines represent the 95% confidence interval determined by permutation testing. Solid lines indicate periods where the strength of encoding was significant (p<0.05, cluster-based permutation test). Error bars depict standard error of the mean. (E) Distribution of the peak CPD values (i.e. the highest CPD value observed over the epoch shown in A–D for either PicA or PicB cues) for each neuron during the epoch shown in C (MB, blue) and D (MF, orange) values. Horizontal lines indicate median and extrema. *, ***, p<0.05, 0.001 paired t-test. (F) The percentage of neurons in each region that significantly encoded an MB estimate of the chosen option’s value (left), an MF estimate (middle), or both (right) assessed using cluster-length permutation testing. Coloured asterisks indicate that population is significantly greater than 10% (blue and orange) or 1% (green), binomial test. Asterisks between bars indicate a difference in size between the two populations (*, ***, p<0.05, 0.001, chi-square test).
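The binomial tests mentioned above ask whether the observed number of significant neurons exceeds what a given chance rate would produce. A minimal sketch of a one-sided version follows; whether the original tests were one-sided is an assumption, and the count of 36 neurons is a hypothetical value chosen purely for illustration.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): a one-sided binomial test p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_neurons = 240      # ACC population size reported in the legends
n_significant = 36   # hypothetical count, not taken from the data
chance_rate = 0.10   # the 10% reference level used for the MB and MF populations

p_value = binomial_tail(n_significant, n_neurons, chance_rate)
```

The same function can be reused with a 1% reference level for the population encoding both estimates.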
Dynamic coding of model-based (MB) Q-values in anterior cingulate cortex (ACC).
Coefficient of partial determination (CPD, z-axis) across significant neurons in ACC for encoding the MB-derived estimates of the value of each of the choice 1 options (p<0.05, permutation test). Neurons have been sorted by the onset of their coding, revealing a dynamic pattern that tiles the entire epoch, with different neurons active at different times.
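Sorting neurons by coding onset can be sketched as follows. The definition of onset used here (the first significant time bin) and the function names are assumptions; the legend does not state how onset was determined.

```python
def coding_onset(significant):
    """Index of the first significant time bin, or None if there is none."""
    for t, is_sig in enumerate(significant):
        if is_sig:
            return t
    return None

def sort_by_onset(sig_matrix):
    """Return neuron indices ordered by onset; neurons with no onset are dropped."""
    onsets = [(coding_onset(row), i) for i, row in enumerate(sig_matrix)]
    return [i for onset, i in sorted(o for o in onsets if o[0] is not None)]

# Toy significance mask (hypothetical): rows are neurons, columns are time bins.
sig_matrix = [
    [False, False, True, True, False],
    [True, True, False, False, False],
    [False, False, False, False, False],
    [False, True, True, False, False],
]
order = sort_by_onset(sig_matrix)
```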
Direction of the encoding of chosen and unchosen choice 1 options, depending on the transition that occurred.
(A) Average coefficients across neurons in the anterior cingulate cortex (ACC) with respect to the value of chosen (blue) and unchosen (orange) options of choice 1 in common trials. Value estimates were calculated using a hybrid of model-based (MB) and model-free (MF) estimates derived from each monkey’s behaviour. Red horizontal line indicates portions where the coefficients in the two conditions differed significantly from one another (p<0.05, paired t-test). Error bars depict standard error of the mean (n=240). (B) Same as in A but for trials where a rare transition occurred. (C–H) Same as in A–B but for dorsolateral prefrontal cortex (DLPFC, n=187), caudate (n=115), and putamen (n=119), respectively.
Choice 1 was encoded by neurons sensitive to reward and transition.
(A) Average coefficient of partial determination (CPD) across neurons in each region for the encoding of choice 1. Dashed horizontal line indicates the confidence interval for each region derived from the null distribution. (B) Same as A but for encoding of the interaction of reward, transition, and choice 1, all from the previous trial. (C) A support vector machine was used to decode from each neural population which cue the monkeys would choose at choice 1 on each trial. Dashed horizontal line indicates the 95% confidence interval (permutation test) and solid horizontal lines indicate periods of significant decoding (p<0.05, cluster-based permutation test). Dashed vertical line indicates the time at which the subjects made their choice (0 ms). (D, E) Same as in C but neurons were median split into two groups depending on the strength with which they encoded reward at feedback and transition at transition. (F) Difference in decoder strength between A and B. Solid horizontal line indicates periods of significant difference assessed using permutation test. In all cases, error bars depict standard error of the mean (n=240, 187, 115, and 119 for ACC, DLPFC, caudate, and putamen, respectively).
Coding of the interaction of choice 1, transition, and reward from the previous trial.
(A) Average coefficient of partial determination (CPD) across neurons in each region for the encoding of the interaction of reward and transition from the previous trial at the time the subjects made their first choice on the next trial. Dashed horizontal lines indicate the 95th percentile of the null distribution and solid horizontal lines indicate periods of significant coding (p<0.05, cluster-based permutation test). Error bars depict standard error of the mean (n=240, 187, 115, and 119 for ACC, DLPFC, caudate, and putamen, respectively). (B–D) Same as in A but for the interaction between reward and choice 1, transition and choice 1, and the triple interaction between all three variables.
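The combination of a pointwise 95th-percentile threshold and a cluster-based permutation test can be sketched as below. This is an illustrative cluster-length variant, not the authors' procedure: the threshold estimator, the function names, and the toy null data (uniform noise standing in for statistics recomputed on permuted data) are all assumptions.

```python
import random

def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best

def percentile_95(values):
    """Simple order-statistic estimate of the 95th percentile."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

def cluster_length_test(observed, null_series):
    """Return (longest observed cluster, permutation p-value for that length)."""
    n_bins = len(observed)
    # Pointwise threshold: 95th percentile of the null at each time bin.
    thresholds = [percentile_95([s[t] for s in null_series]) for t in range(n_bins)]

    def cluster_len(series):
        return longest_run(v > th for v, th in zip(series, thresholds))

    observed_len = cluster_len(observed)
    null_lens = [cluster_len(s) for s in null_series]
    n_as_long = sum(1 for length in null_lens if length >= observed_len)
    return observed_len, (n_as_long + 1) / (len(null_lens) + 1)

# Toy example (hypothetical): 200 null time courses and one observed time
# course with a sustained effect in bins 10 to 19.
rng = random.Random(0)
null_series = [[rng.uniform(-1.0, 1.0) for _ in range(40)] for _ in range(200)]
observed = [0.0] * 10 + [5.0] * 10 + [0.0] * 20
observed_len, p_value = cluster_length_test(observed, null_series)
```

Comparing cluster lengths rather than individual bins controls for the many time bins tested, because isolated supra-threshold bins are expected by chance but long runs are not.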
Encoding of choice 1 was modulated by reward stability (explore/exploit strategy).
(A) Trials were split into those following two consecutive high rewards (‘exploit’) and those following two consecutive low/medium rewards (‘explore’). A support vector machine was used to decode from each region which cue the monkeys would choose at choice 1 on explore trials. Shaded error bars represent standard error of the mean over the different permutations of the data (n=20). Horizontal dashed lines indicate the 95% confidence interval (permutation test) and solid horizontal lines indicate periods of significant coding (p<0.05, permutation test). (B) Same as in A but only for exploit trials. (C) Difference in decoder strength between A and B. Solid horizontal line indicates periods of significant difference assessed using permutation test. (D–F) Same as in A–C but for decoding the direction of choice 1, which was randomised across trials and therefore orthogonal to the cue that was chosen. Note that a choice was indicated via manual joystick movement.
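The explore/exploit split in panel A can be sketched as a simple labelling step over the outcome history. The treatment of mixed histories (one high and one low/medium outcome) as belonging to neither group is an assumption, as is the function name.

```python
def label_trials(outcomes):
    """Label each trial by the outcomes of the two preceding trials."""
    labels = []
    for t in range(len(outcomes)):
        if t < 2:
            labels.append(None)  # not enough history
            continue
        prev_two = outcomes[t - 2:t]
        if all(o == "high" for o in prev_two):
            labels.append("exploit")
        elif all(o in ("low", "medium") for o in prev_two):
            labels.append("explore")
        else:
            labels.append(None)  # mixed history: assumed to be left out
    return labels

# Hypothetical outcome sequence for illustration.
outcomes = ["high", "high", "low", "medium", "high", "high", "high"]
labels = label_trials(outcomes)
```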
Anterior cingulate cortex (ACC) encoded the upcoming choice during the preceding fixation period.
(A) A support vector machine was used to decode from each neural population which cue the monkeys would choose at choice 1 on each trial. Decoding was restricted to trials where they had received a high reward on the previous two trials. Dashed lines represent the 95% confidence interval (permutation test). Shaded error bars represent standard error of the mean over the different permutations of the data (n=20). (B) Same as in A but only for trials following two consecutive low/medium reward trials.
Tables
The contents of GLM1.
Far-right column indicates which figure panels each regressor was used for.
| Regressor | Description | Figure |
|---|---|---|
| Constant | To account for the mean effect | |
| Linear effects | A ramping regressor to account for any monotonic autocorrelations across the session | |
| (Linear effects) ^ 2 | To account for any polynomial autocorrelations across the session | |
| (Linear effects) ^ 3 | ||
| (Linear effects) ^ 4 | ||
| Sin 2 | Sin curves of varying frequency (approximately 2, 4, 6, 9 cycles per session) to account for any non-monotonic autocorrelations across the session | |
| Sin 4 | ||
| Sin 6 | ||
| Sin 9 | ||
| Previous reward (t–3) | Value of the secondary reinforcer shown three trials ago | |
| Previous reward (t–2) | Value of the secondary reinforcer shown two trials ago | Figure 2F and G |
| Previous reward (t–1) | Value of the secondary reinforcer shown on previous trial | Figure 2A–C Figure 2F and G |
| Previous transition | Transition that occurred on the previous trial | |
| Previous choice 1 | The first choice from the previous trial | |
| Previous reward * previous transition | Pairwise interactions between the events from the previous trial | |
| Previous reward * previous choice 1 | ||
| Previous transition * previous choice 1 | ||
| Previous reward * previous transition * previous choice 1 | Three-way interaction of the events from the previous trial (equivalent to previous second-stage state * previous reward, as choice 1 * transition = second-stage state) | Figure 6B |
| Reward | Value of the secondary reinforcer (current trial) | Figure 2D and E Figure 3E Figure 5E and F |
| Transition | Transition that occurred (current trial) | Figure 3A–E |
| Choice 1 | The cue the subjects chose (current trial) | Figure 6A |
| Previous reward * transition | Pairwise interactions between the current trial’s events with reward from the previous trial | Figure 3C and D |
| Previous reward * choice 1 | ||
| Transition * choice 1 | ||
| Previous reward * transition * choice 1 | Three-way interaction of the current trial’s events with the previous trial’s reward | |
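
The equivalence noted in the table (choice 1 * transition = second-stage state) can be checked by enumeration. The sketch below assumes ±1 coding for choice, transition, and state, a hypothetical −1/0/+1 coding for reward, and a hypothetical mapping in which each first-stage option commonly leads to a different second-stage state; none of these codings is given in the table.

```python
from itertools import product

# Hypothetical mapping: choice code -> code of the commonly reached state.
COMMON_STATE = {+1: +1, -1: -1}

def second_stage_state(choice, transition):
    """State reached after `choice` given a common (+1) or rare (-1) transition."""
    common = COMMON_STATE[choice]
    return common if transition == +1 else -common

rows = []
for choice, transition, reward in product((+1, -1), (+1, -1), (-1, 0, +1)):
    state = second_stage_state(choice, transition)
    three_way = reward * transition * choice
    rows.append((choice, transition, reward, state, three_way))
```

Under this coding, the three-way product equals state * reward on every row, so the three-way interaction regressor is identical to a state-by-reward regressor.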
The contents of GLM2.
Far-right column indicates which figure panels each regressor was used for.
| Regressor | Description | Figure |
|---|---|---|
| Constant | To account for the mean effect | |
| Linear effects | A ramping regressor to account for any monotonic autocorrelations across the session | |
| (Linear effects) ^ 2 | To account for any polynomial autocorrelations across the session | |
| (Linear effects) ^ 3 | ||
| (Linear effects) ^ 4 | ||
| Sin 2 | Sin curves of varying frequency (approximately 2, 4, 6, 9 cycles per session) to account for any non-monotonic autocorrelations across the session | |
| Sin 4 | ||
| Sin 6 | ||
| Sin 9 | ||
| Model-based PicA | Model-based derived estimates of the value of each option at choice 1 | Figure 4 |
| Model-based PicB | ||
| Model-free PicA | Model-free-derived estimates of the value of each option at choice 1 | |
| Model-free PicB | ||
| Choice 1 picture chosen | Which cue the subject chose on each trial | |
The contents of GLM3.
Far-right column indicates which figure panels each regressor was used for.
| Regressor | Description | Figure |
|---|---|---|
| Constant | To account for the mean effect | |
| Linear effects | A ramping regressor to account for any monotonic autocorrelations across the session | |
| (Linear effects) ^ 2 | To account for any polynomial autocorrelations across the session | |
| (Linear effects) ^ 3 | ||
| (Linear effects) ^ 4 | ||
| Sin 2 | Sin curves of varying frequency (approximately 2, 4, 6, 9 cycles per session) to account for any non-monotonic autocorrelations across the session | |
| Sin 4 | ||
| Sin 6 | ||
| Sin 9 | ||
| Hybrid-based chosen option | Hybrid-based derived estimates of the value of each option at choice 1 | Figure 5 |
| Hybrid-based unchosen option | | |
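
The nuisance regressors shared by the three GLMs (constant, linear ramp, its powers up to 4, and four sine curves) could be built along the following lines. The normalisation of the ramp to the range 0 to 1, the zero phase of the sine curves, and the function name are assumptions; the tables state only the approximate number of cycles per session.

```python
import numpy as np

def drift_regressors(n_trials, cycles=(2, 4, 6, 9)):
    """Constant, polynomial, and sinusoidal regressors spanning one session."""
    t = np.linspace(0.0, 1.0, n_trials)  # assumed: position within the session
    cols = [np.ones(n_trials)]                                # constant
    cols += [t ** power for power in (1, 2, 3, 4)]            # ramp and its powers
    cols += [np.sin(2 * np.pi * c * t) for c in cycles]       # slow oscillations
    return np.column_stack(cols)

X_drift = drift_regressors(300)  # 300 trials is a hypothetical session length
```

These nine columns would then be concatenated with the task regressors listed in each table before fitting.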