Mice were sensitive to block length and leveraged this information during the two-armed bandit task

(A) The mouse makes a left or right choice via tongue lick after the go cue. Depending on the reward probabilities, the choice might lead to water. (B) Trials were organized into blocks, each with distinct reward probabilities: “70:10” (70% chance to receive water for left choice; 10% for right) or “10:70” (10% for left; 70% for right). The block switches after the animal chooses the high-reward-probability side ten times (LCriterion) plus an additional random number of trials (LRandom, drawn from an exponential distribution, up to 30 trials). (C) Performance of a mouse in one example session. The top row shows reward probabilities for left and right options. The bottom row shows the animal’s choices and the outcomes. (D) Choice behavior around block switches. Thin lines, mean values for individual animals. Thick line, mean values and SEM for all animals. (E) Histogram of LRandom for all blocks with LCriterion ≤ 20. Colors indicate the 4 ranges of LRandom for subsequent analyses. (F) Choice behavior around block switches, plotted separately for the 4 ranges of LRandom. Mean values and SEM for all animals. (G) The probability of choosing the better option on the trial immediately preceding the switch, as a function of LRandom for the block preceding the switch. Mean values and SEM for all animals. (H) The number of trials to reach midpoint (when the animal is equally likely to choose either option) as a function of LRandom for the block preceding the switch. Mean values and SEM for all animals. n = 31 mice, 617 sessions.
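The block-switching rule in (B) can be illustrated with a minimal simulation sketch. It is not the authors' task code: the mean of the exponential distribution, the clipping of LRandom at 30 (rather than redrawing), and the names `draw_l_random`, `run_session` and `policy` are all assumptions made for illustration. LCriterion is read here as the number of trials the animal takes to accumulate ten better-side choices.

```python
import random

P_HIGH, P_LOW = 0.7, 0.1   # reward probabilities from the legend
N_BETTER_REQUIRED = 10     # better-side choices needed to meet the criterion
L_RANDOM_MAX = 30          # cap on LRandom from the legend
L_RANDOM_MEAN = 11.0       # ASSUMPTION: the legend does not give the mean


def draw_l_random(rng):
    """Draw LRandom from an exponential distribution, clipped at the cap."""
    return min(int(rng.expovariate(1.0 / L_RANDOM_MEAN)), L_RANDOM_MAX)


def run_session(policy, n_trials, first_better="L", seed=0):
    """Simulate one session; `policy(history)` returns "L" or "R".

    Returns the trial list [(better_side, choice, reward), ...] and the
    lengths of all completed blocks.
    """
    rng = random.Random(seed)
    better = first_better
    l_random = draw_l_random(rng)
    n_better = 0    # better-side choices so far in this block
    extra = 0       # trials elapsed since the criterion was met
    block_len = 0
    trials, block_lengths = [], []
    for _ in range(n_trials):
        choice = policy(trials)
        reward = rng.random() < (P_HIGH if choice == better else P_LOW)
        trials.append((better, choice, reward))
        block_len += 1
        if n_better >= N_BETTER_REQUIRED:
            extra += 1                      # counting the LRandom trials
        elif choice == better:
            n_better += 1                   # still working toward criterion
        if n_better >= N_BETTER_REQUIRED and extra >= l_random:
            block_lengths.append(block_len)  # block length = LCriterion + LRandom
            better = "R" if better == "L" else "L"
            l_random = draw_l_random(rng)
            n_better = extra = block_len = 0
    return trials, block_lengths
```

For example, `run_session(lambda history: "L", 200)` gives an agent that always licks left: it completes the first left-better block and then stays in the right-better block, because the criterion is never met there.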

Unilateral lesion of ACAd/MOs altered block-length-dependent choice behavior and impaired overall performance

(A) Schematic representation of the unilateral excitotoxic lesion via injection of ibotenic acid. (B) Lesion blocks refer to blocks in which the lesioned side is the better option. Contra blocks refer to blocks in which the lesioned side is contralateral to the better option. (C, D) Post hoc histology with cresyl violet staining to confirm the loss of neurons in ACAd/MOs. (E) Choice behavior around block switches, plotted separately for the 4 ranges of LRandom. Black, pre-lesion. Green, post-lesion. Left, switches from lesion block to contra block. Right, switches from contra block to lesion block. Mean values and SEM for all animals. (F) The probability of choosing the better option on the trial immediately preceding the switch, as a function of LRandom for the block preceding the switch. Black, pre-lesion. Green, post-lesion. Mean values and SEM for all animals. (G) Similar to (F) for number of trials to reach midpoint (when the animal is equally likely to choose either option). (H) Similar to (F) for hit rate (probability that the animal chooses the better option). For (F) – (H), significant main effects and interactions from three-way ANOVA are indicated (P < 0.05). n = 9 mice, 200 pre-lesion sessions and 142 post-lesion sessions.

A hybrid model of beliefs and choice kernels to explain the behavior

(A) Schematic representation of the belief with choice kernel model (belief-CK). The model has four parameters: H (hazard rate), β (inverse temperature for belief), αK (learning rate for choice kernel) and βK (inverse temperature for choice kernel). (B – E) An example session along with the fits from the belief-CK model, including reward probabilities for left and right options (B), the running average of the probability of choosing right for the animal (black) and model (purple) (C), the belief that the left option is associated with a reward probability of 10% (pL10, blue) or 70% (pL70, red) (D), and the choice kernels for left (blue) and right options (red) (E). (F) Model comparison between the belief-CK model and 7 other models. Lower log BIC values indicate a better fit. (G) The tally of the best-fitting model for each animal. (H) The probability of choosing the better option on the trial immediately preceding the switch, as a function of LRandom for the block preceding the switch. Black, mice. Purple, simulated performance using the belief-CK model with best-fitting parameters. Mean values and SEM for all animals. (I) Similar to (H) for number of trials to reach midpoint (when the animal is equally likely to choose either option). (J) Similar to (H) for the tendency to win-stay on the 5 trials preceding the switch. (K) Similar to (H) for the tendency to lose-switch on the 5 trials preceding the switch. n = 31 mice, 617 sessions.
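To make the roles of the four parameters concrete, here is a hedged sketch of one way a belief-plus-choice-kernel update could be written. The legend names the parameters (H, β, αK, βK) and the two beliefs (pL10, pL70) but does not give the equations. The Bayesian update, the way H enters it, the delta-rule kernel, and the logistic combination below are therefore assumptions, not the authors' implementation. The function names are hypothetical.

```python
import math

P_HIGH, P_LOW = 0.7, 0.1  # reward probabilities from the task


def belief_ck_step(p_l70, kernel, choice, reward, H, alpha_k):
    """Update the belief that left is the 70% side, and the choice kernel.

    ASSUMED form: Bayes rule on the observed outcome, followed by a
    state flip with probability H before the next trial.
    """
    # Probability of reward for the chosen side under each hidden state.
    p_rew_if_l70 = P_HIGH if choice == "L" else P_LOW
    p_rew_if_l10 = P_LOW if choice == "L" else P_HIGH
    like_l70 = p_rew_if_l70 if reward else 1.0 - p_rew_if_l70
    like_l10 = p_rew_if_l10 if reward else 1.0 - p_rew_if_l10
    # Posterior belief after seeing this outcome.
    post = like_l70 * p_l70 / (like_l70 * p_l70 + like_l10 * (1.0 - p_l70))
    # Hazard rate H: chance that the block reverses before the next trial.
    p_l70_next = post * (1.0 - H) + (1.0 - post) * H
    # Choice kernel drifts toward the side just chosen at rate alpha_k.
    new_kernel = {
        side: k + alpha_k * ((1.0 if side == choice else 0.0) - k)
        for side, k in kernel.items()
    }
    return p_l70_next, new_kernel


def p_choose_left(p_l70, kernel, beta, beta_k):
    """ASSUMED softmax over belief-weighted values plus the choice kernel."""
    v_left = p_l70 * P_HIGH + (1.0 - p_l70) * P_LOW
    v_right = p_l70 * P_LOW + (1.0 - p_l70) * P_HIGH
    logit = beta * (v_left - v_right) + beta_k * (kernel["L"] - kernel["R"])
    return 1.0 / (1.0 + math.exp(-logit))
```

In this sketch, pL10 is simply `1 - p_l70`. A larger H pulls the belief back toward 0.5 after every trial, so the agent behaves as if reversals are frequent.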

Effects of unilateral lesion of ACAd/MOs are consistent with a side-specific increase in hazard rate

(A) The hazard rates, before and after lesion, extracted by fitting the belief-CK model on a per-animal basis. Square, hazard rate for side ipsilateral to lesion. Cross, hazard rate for side contralateral to lesion. Inset, violin plot of the same data. (B) The hazard rates, before and after lesion, on a per-session basis. Mean and SEM. (C – D) Similar to (A – B) for learning rate for choice kernel. (E) The inverse temperature sum, before and after lesion, on a per-animal basis. (F) The inverse temperature sum, before and after lesion, on a per-session basis. (G – H) Similar to (E – F) for inverse temperature ratio. *, P < 0.05. n.s., not significant. n = 9 mice, 190 pre-lesion sessions and 140 post-lesion sessions.

Effects of bilateral and sham lesions of ACAd/MOs

(A) The number of trials performed in each session, before and after bilateral lesion, on a per-session basis. Mean and SEM. (B) Similar to (A) for the number of block switches in each session. (C) The probability of choosing the better option on the trial immediately preceding the switch, as a function of LRandom for the block preceding the switch, before and after bilateral lesion, on a per-session basis. Mean and SEM. Significant main effects and interactions from three-way ANOVA are indicated (P < 0.05). (D) Similar to (C) for number of trials to reach midpoint (when the animal is equally likely to choose either option). (E) The hazard rates, before and after bilateral lesion, extracted by fitting the belief-CK model on a per-session basis. Mean and SEM. (F) Similar to (E) for learning rate for choice kernel. (G) Similar to (E) for inverse temperature sum. (H) Similar to (E) for inverse temperature ratio. (I – P) Similar to (A – H) for sham controls with unilateral saline injection. n.s., not significant. For bilateral lesion, n = 4 mice, 105 pre-lesion sessions and 61 post-lesion sessions. For saline control, n = 4 mice, 117 pre-lesion sessions and 53 post-lesion sessions.

Optogenetic inactivation in the pre-choice, but not the post-choice, period reproduced the deficit in change-point estimation

(A) Schematic representation of the experimental setup. (B) CCD image of a mouse with a cleared skull cap. The two blue crosses indicate the locations of the photostimulation, i.e. left and right ACAd/MOs. (C – D) The trial and block structures, and the timing of the photostimulation. (E) The probability of choosing the better option on the trial immediately preceding the switch, as a function of LRandom for the block preceding the switch, for pre-choice inactivation. Black, control blocks. Light blue, stimulated blocks. Blue, contralateral to stimulated blocks. Mean values and SEM for all animals. (F) The hazard rates extracted by fitting a modified belief-CK model, for pre-choice inactivation, on a per-animal basis. (G) Similar to (F) for learning rate for choice kernel. (H) Similar to (F) for inverse temperature sum. (I) Similar to (F) for inverse temperature ratio. (J – N) Similar to (E – I) for post-choice inactivation. n = 6 animals.

Animal’s choice behavior around block switches with different performance criteria

(A) The equation for the switching condition, or block length (BL), which is the sum of LCriterion and LRandom. (B) Choice behavior around block switches, plotted separately for 4 ranges of BL. Mean values and SEM for all animals. All data were included. (C) Similar to (B), including data in which LCriterion ≤ 20 trials. (D) Choice behavior around block switches, plotted separately for 4 ranges of LRandom. Mean values and SEM for all animals. All data were included. (E) Similar to (C), including data in which LCriterion ≤ 50 trials.
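The switching condition described for panel (A) can be written out as follows. The subscripted symbols are a typesetting choice for this sketch, and reading the criterion term as the number of trials needed to accumulate ten better-side choices is an interpretation of the task description rather than a quoted definition.

```latex
\[
  \mathrm{BL} = L_{\mathrm{Criterion}} + L_{\mathrm{Random}},
  \qquad 0 \le L_{\mathrm{Random}} \le 30
\]
```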

Animal’s choice behavior in a task variant without LCriterion

(A) Choice behavior around block switches in a task variant without LCriterion. The trial timing and reward probabilities are identical to the main task, except that the switching condition consists of only LRandom. Thin lines, mean values for individual animals. Thick line, mean values and SEM for all animals.

(B) Histogram of LRandom for all blocks. Colors indicate the 4 ranges of LRandom for subsequent analyses. (C) Choice behavior around block switches, plotted separately for the 4 ranges of LRandom. Mean values and SEM for all animals. (D) The probability of choosing the better option on the trial immediately preceding the switch, as a function of LRandom for the block preceding the switch (in 2 datapoints bin; main effect of LRandom: F(14, 204) = 1.8068, P = 0.0394, one-way ANOVA). Mean values and SEM for all animals. (E) The number of trials to reach midpoint (when the animal is equally likely to choose either option) as a function of LRandom for the block preceding the switch (in 2 datapoints bin; main effect of LRandom: F(14, 120) = 1.0734, P = 0.3883, one-way ANOVA). Mean values and SEM for all animals. n = 10 mice, 48 sessions, 312 blocks.

No decrease in overall performance after unilateral lesion of ACAd/MOs

The total number of trials and block switches per session before (pre) and after (post) the unilateral lesion.

No motor deficits after unilateral lesion of ACAd/MOs

Mean left and right lick density for each possible combination of choice (left or right) and outcome (reward or no reward). No significant difference was detected between pre- and post-unilateral lesion.

Belief-CK model: effect of varying the hazard rate

The belief-CK model was used to simulate an agent’s choice behavior in the two-armed bandit task with probabilistic reward reversal. Parameters were selected based on the best-fitting values from an animal. Each column shows the results using a different hazard rate (H = 0.01, 0.25, 0.5, 0.75, 1) while all other parameters were kept constant (n = 300,000 trials, β = 1.387, αK = 0.468, βK = 2.543). The top row shows the mean fraction of trials choosing the better and worse options for 4 different LRandom ranges for 10 trials before and after the block switch. The middle row shows P(better option) pre-switch as a function of LRandom. Mean and SEM. The bottom row shows the mean number of trials to reach midpoint as a function of LRandom.

Belief-CK model: effect of varying choice kernel learning rate

Similar to Supplementary Figure 3.1, with different choice kernel learning rates (αK = 0.01, 0.25, 0.5, 0.75, 1) while all other parameters were kept constant (n = 300,000 trials, H = 0.320, β = 1.387, βK = 2.543).

Belief-CK model: effect of varying beta sum

Similar to Supplementary Figure 3.1, with different beta sums (0, 1, 3, 5, 10) while all other parameters were kept constant (n = 300,000 trials, H = 0.320, αK = 0.468). β was set equal to βK.

Belief-CK model: effect of varying beta ratio

Similar to Supplementary Figure 3.1, with different beta ratios (0.01, 0.25, 0.5, 0.75, 1) while all other parameters were kept constant (n = 300,000 trials, H = 0.320, αK = 0.468, βK = 2.543). βK was fixed and β was calculated based on the beta ratio values.

The DF-Q-RPE algorithm cannot reproduce the LRandom-dependent trends in the experimental data

(A) The probability of choosing the better option on the trial immediately preceding the switch, as a function of LRandom for the block preceding the switch. Black, mice. Purple, simulated performance using the DF-Q-RPE model with best-fitting parameters. Mean values and SEM for all animals. (B) Similar to (A) for number of trials to reach midpoint (when animal is equally likely to choose either option). (C) Similar to (A) for the tendency to win-stay on the 5 trials preceding the switch. (D) Similar to (A) for the tendency to lose-switch on the 5 trials preceding the switch. Mean and SEM. n = 31 mice, 617 sessions.
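The win-stay and lose-switch tendencies in (C) and (D) can be computed from a sequence of choices and outcomes along the following lines. This is a minimal sketch, not the authors' analysis code: the function name and input format are assumptions, and restricting the input to the 5 trials preceding each switch is left to the caller.

```python
def win_stay_lose_switch(choices, rewards):
    """Return (P(stay | win), P(switch | lose)) over consecutive trial pairs.

    `choices` is a list of "L"/"R"; `rewards` is a list of booleans of the
    same length. A probability is NaN if its conditioning event never occurs.
    """
    win_n = win_stay = lose_n = lose_switch = 0
    # Pair each trial's choice and outcome with the choice on the next trial.
    for prev_c, prev_r, cur_c in zip(choices, rewards, choices[1:]):
        if prev_r:
            win_n += 1
            win_stay += cur_c == prev_c      # repeated a rewarded choice
        else:
            lose_n += 1
            lose_switch += cur_c != prev_c   # abandoned an unrewarded choice
    p_win_stay = win_stay / win_n if win_n else float("nan")
    p_lose_switch = lose_switch / lose_n if lose_n else float("nan")
    return p_win_stay, p_lose_switch
```

A pure win-stay, lose-switch agent yields `(1.0, 1.0)`. Values closer to 0.5 indicate that the previous outcome had less influence on the next choice.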

Fewer trials but similar performance after bilateral lesion of ACAd/MOs

The total number of left- and right-responding trials, reward rates, and hit rates before (pre) and after (post) the lesion.

No motor deficits after bilateral lesion of ACAd/MOs

Mean left and right lick density for each possible combination of choice (left or right) and outcome (reward or no reward). No significant difference was detected between pre- and post-bilateral lesion.

Validation of the laser steering system for optogenetic manipulation: characterization and c-Fos staining

(A) Optical transmission of the clear skull cap preparation was measured by illuminating with a laser and recording intensity using a power meter. Mean and SEM. n = 5. (B) Linearity of the galvanometers in the x and y directions. (C) Beam profile was measured at the sample plane by inserting and moving a razor blade across the plane using a micromanipulator. (D - F) In CaMKIIaCre;Ai32 animals, cortical excitatory neurons express ChR2. After unilateral photostimulation of the left ACAd/MOs region (40 Hz, 1.5 mW, 1 min on then 1 min off repeatedly for 20 min), immunohistostaining with a c-Fos antibody showed elevated signals.

Inactivating left and right ALM during two-armed bandit task

(A) In PvalbCre;Ai32 animals, parvalbumin-expressing neurons, including fast-spiking interneurons in the neocortex, express ChR2. Photostimulation of a brain region drives spiking in the interneurons, which in turn suppresses excitatory activity. Lick raster recorded in an example session, in which trials were sorted based on the photostimulation (None: no stimulation; ALM-L: left anterior lateral motor cortex, AP=2.5 mm, ML=-1.5 mm; ALM-R: right anterior lateral motor cortex, AP=2.5 mm, ML=1.5 mm; V1-L: left primary visual cortex, AP=-2.7 mm, ML=-2.5 mm; V1-R: right primary visual cortex, AP=-2.7 mm, ML=2.5 mm). (B) The number of trials of each type per session. (C) Percent of trials that resulted in a miss, as a function of trial type. (D) Percent of trials that resulted in a left response, as a function of trial type. (E) Percent of trials that resulted in a right response, as a function of trial type. These results show that transient inactivation of ALM increased ipsilateral responses at the expense of contralateral responses. 9 sessions from 3 animals.

The results of three-way between-subjects ANOVA with factors of lesion (pre- and post-lesion), side (lesion blocks and contra blocks), and LRandom (4 LRandom ranges) for P(better option) pre-switch, trials to reach midpoint and hit rates. p < 0.05 in bold. All dependent variables were calculated for each block across sessions. (Error = 3285; 2399; 3285; 3285; 2816; 3187 for P(better option) pre-switch, trials to reach midpoint and hit rates, respectively.)

The results of two-way between-subjects ANOVA for bilaterally injected animals with factors of lesion (pre- and post-lesion) and LRandom (4 LRandom ranges) for P(better option) pre-switch, trials to reach midpoint and hit rates. p < 0.05 in bold. All dependent variables were calculated for each block across sessions. (Error = 1356; 911; 1356 for P(better option) pre-switch, trials to reach midpoint and hit rates, respectively.)

The results of two-way between-subjects ANOVA for saline-injected animals with factors of lesion (pre- and post-lesion) and LRandom (4 LRandom ranges) for P(better option) pre-switch, trials to reach midpoint, hit rates, P(lose | switch), P(win | stay). p < 0.05 in bold. All dependent variables were calculated for each block across sessions. (Error = 1871; 1217; 1871 for P(better option) pre-switch, trials to reach midpoint and hit rates, respectively.)