System setup for HABITS.

(A), Front (left) and side (right) views of HABITS, showing the components for stimulus presentation (LEDs & buzzers), reward delivery (water tanks and pumps), behavioral reporting (lickports) and health monitoring (weight platform). These components are coordinated by a controller unit and integrated into a mouse home-cage with a tray for bedding change. (B), HABITS installed on a standard mouse cage rack. (C), A mouse living in the home-cage with food, bedding, nesting material (cotton) and enrichment (tube), performing the task on the weight platform. (D), System architecture for high-throughput behavioral training, showing different tasks running in parallel groups of HABITS, which connect wirelessly to a single PC through Wi-Fi to stream real-time data to the graphical user interface (GUI).

HABITS performance in d2AFC task.

(A), Task structure for d2AFC based on sound frequency. (B), Example licks for correct (blue), error (red) and earlylick (gray) trials. Choice is the first lick after response onset. (C), Correct rate (black line) and earlylick rate (grey line) of an example mouse during the first 13 days of training in HABITS. Shaded blocks indicate trials that occurred in the dark cycle. Trials with earlylick inhibition occur only after the blue vertical line. Red vertical dashed lines mark advancements of the delay duration from 0.2 s to 1.2 s. (D), Averaged correct rate (left) and earlylick rate (right) for all mice trained in d2AFC. Criterion level (75%) and chance level (50%) are shown as gray and red dashed lines, respectively. (E), Same as (D) but for manual training (1∼3 hours/day in home-cage). (F), Averaged correct rate (left), earlylick rate (middle) and no-response rate (right) of expert mice trained with the two protocols. (G), Averaged number of trials (left) and days (right) to reach the criterion performance for the two training protocols. Circles, individual mice. Error bars, mean and 95% confidence interval (CI) across mice. (H), Left, number of trials performed per day throughout the training schedule for three different protocols. Error bars indicate the mean and 95% CI across mice. Middle, volume of water harvested per day. Right, relative body weights of mice on days 0, 8, 16 and 26. Bold lines and shading indicate mean and 95% CI across mice. (I), Behavioral performance of all mice trained in d2AFC tasks based on sound orientation (left), light orientation (middle), and light color (right). (J), Box plots of the average number of trials (left) and days (right) to reach the criterion performance for d2AFC tasks with different sensory modalities. (K), Left, percentage of trials performed as a function of time of day for the four modalities trained autonomously (thick black line shows the average). Shaded area indicates the dark cycle. Top right, averaged correct rate of grouped mice in the dark cycle versus the light cycle. Error bars show 95% CI across mice. n.s., not significant; **, p<0.01, two-sided Wilcoxon signed-rank tests. Bottom right, box plot of the averaged proportion of trials performed in the dark cycle for the four modalities. Data collected from expert mice. (L), Top, percentage of trials falling in trial blocks of different sizes for automated training in home-cage. Left, cumulative proportion of inter-block intervals. Left inset, averaged duration of trial blocks. Right, correct rate and earlylick rate as functions of trial block size. Grey dashed line, criterion performance; red dashed line, chance performance level. Data collected from trials of expert mice. For significance levels not otherwise specified in the Figures: n.s., not significant, p>0.05; *, p<0.05; **, p<0.01 (two-sided Wilcoxon rank-sum tests).

All tasks trained in HABITS.

Representative cognitive tasks performed in HABITS.

(A), Contingency reversal task. (A1), Task structure. (A2), Correct rate of example mice with different learning rates. Grey vertical lines indicate contingency reversals. (A3), Relative number of trials to reach the criterion as a function of the number of reversals. Grey lines, individual mice. Black lines, linear fit. (A4), Number of trials in the first reversal learning versus the average number of trials in the remaining contingency reversals for each mouse (each dot). Black line, linear regression. Red dashed line, diagonal. (B), Working memory task with sound frequency modality. (B1), Task structure. (B2), Stimulus generation matrix (SGM) for left (orange) and right (green) trials. (B3), Left, averaged correct rate for each stimulus combination tested. Right, averaged correct rate for each (S1+S2) stimulus combination across mice. Black line and shading, linear regression and 95% CI. (B4), Averaged psychometric curves, i.e., percentage of right choices as a function of the frequency difference between sample1 and sample2. (B5), Averaged correct rate as a function of delay duration. (C), Evidence accumulation task with spatial cues. (C1), Task structure. (C2), Averaged psychometric curves, i.e., performance as a function of the difference between right and left click rates. (C3), Averaged correct rate across all mice as a function of sample duration for different Poisson rates (different colors). Error bars represent 95% CI. (D), Multimodal integration task. (D1), Task structure. (D2), Averaged correct rate across all mice as a function of sample duration for different stimulus modalities (different colors). (D3), Averaged event rates during the sample period for left (red) and right (blue) choice trials. (D4), Averaged weights (black line) of the logistic regression fitted to trial choices across expert mice tested in > 1000 trials (N = 11 mice), from the first bin (40 ms) to the last bin (1000 ms) of the sample period. Gray dashed line represents the null hypothesis. Grey dots indicate significance, p<0.05, two-sided t-tests. (D5), Psychometric curve for trials with multimodal stimuli. (E), Confidence probing task. (E1), Task structure. (E2), Psychometric curve, i.e., right choice rate as a function of relative contrast (log-scaled relative frequency). (E3), Histogram of time invested (TI) for correct and error trials. (E4), Averaged correct rate across all mice as a function of TI. (E5), Averaged TI as a function of absolute relative contrast for correct and error trials. Circles, individual mice; *, p<0.05; **, p<0.01, two-sided Wilcoxon rank-sum tests.
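
The psychometric curves in (B4) and (E2) plot the percentage of right choices against a signed stimulus difference. A minimal sketch of that tabulation is given here, assuming trials are stored as two parallel lists; the function and argument names are hypothetical and are not taken from the HABITS code.

```python
from collections import defaultdict


def psychometric_curve(stim_diffs, chose_right):
    """Percentage of right choices at each signed stimulus difference.

    stim_diffs  : one value per trial (hypothetical encoding: for example,
                  the frequency difference between sample1 and sample2)
    chose_right : one value per trial, 1 for a right choice, 0 for a left choice
    """
    counts = defaultdict(lambda: [0, 0])  # difference -> [right choices, trials]
    for diff, right in zip(stim_diffs, chose_right):
        counts[diff][0] += int(right)
        counts[diff][1] += 1
    # Sort by stimulus difference so the result reads left to right like the plot.
    return {d: 100.0 * k / n for d, (k, n) in sorted(counts.items())}


if __name__ == "__main__":
    diffs = [-1, -1, 0, 1, 1, 1]
    right = [0, 1, 0, 1, 1, 1]
    print(psychometric_curve(diffs, right))
```

Averaging such per-mouse curves across animals would give the group curves described in the legend.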

Challenging mouse tasks innovated in HABITS.

(A), Continuous learning task. (A1), Task structure showing mice learning five subtasks one by one. (A2), Left, averaged correct rate of all mice performing the five tasks (different colors) continually. All task schedules are normalized to their maximum number of trials and divided equally into 10 stages. Right, box plot of the number of trials to criterion for each task. (A3), Left, averaged reaction time of all mice performing the five tasks continually. Right, averaged median reaction time across the five tasks during the early (perf. < 0.55), middle (perf. < 0.75) and trained (perf. > 0.75) stages. Error bars indicate 95% CI. (A4), Same as (A3) but for absolute performance bias. n.s., p>0.05; **, p<0.01, two-sided Wilcoxon signed-rank tests. (B), Double delayed match-to-sample task (dDMS) with sound frequency modality. (B1), Task structure. (B2), Averaged correct rate across all mice during training (left) and averaged number of days to reach the criterion (right). (B3), Averaged earlylick rate across all mice. (B4), Averaged correct rate (black) and earlylick rate (gray) for all combinations of sample and test stimuli. (B5), Heatmaps of error rate (left) and earlylick rate (right) for different combinations of delay1 and delay2 durations. (C), Delayed 3-alternative forced choice (d3AFC). (C1), Task structure. (C2), Averaged correct rate across all mice during training (left, colors indicate trial types) and averaged number of days to reach the criterion performance (right). (C3), Averaged correct rate (colors indicate trial types) and earlylick rate (gray) for different trial types. (C4), Averaged error rate of choices conditioned on trial type. In each subplot, the position of the bars corresponds to the different choices. ****, p<0.0001, n.s., p>0.05, two-sided t-tests. (C5), Averaged choice rates for the three lickports (colors) as a function of sample frequency. Data collected from trained mice. (D), Context-dependent attention task. (D1), Task structure. (D2), Averaged correct rate across all mice during training (left, data only from multimodal trials with conflict) and averaged number of days to reach the criterion (right). (D3), Correct rate (left) and reaction time (right) conditioned on modality. (D4), Averaged psychometric curve and partitioned linear regression for the multimodal conditions with and without conflict, respectively. (D5), Performance bias toward the sound orientation modality as a function of pre-cue contrast, for the two multimodal conditions. (D6), Averaged correct rate as a function of delay duration.
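
The stage normalization described in (A2) can be sketched as follows. This is only an illustration, assuming each task's training history is a list of per-trial outcomes (1 for correct, 0 for error); the function name is hypothetical, and the near-equal integer split is one possible way to divide a schedule into 10 stages.

```python
def correct_rate_by_stage(outcomes, n_stages=10):
    """Split one task's trial outcomes into n_stages near-equal parts
    and return the correct rate within each part.

    Dividing by position rather than by absolute trial count puts schedules
    of different lengths on a common axis (stage 1 to stage n_stages).
    """
    n = len(outcomes)
    if n < n_stages:
        raise ValueError("need at least one trial per stage")
    rates = []
    for s in range(n_stages):
        start = s * n // n_stages
        stop = (s + 1) * n // n_stages
        chunk = outcomes[start:stop]
        rates.append(sum(chunk) / len(chunk))
    return rates


if __name__ == "__main__":
    # Toy schedule: all errors in the first half, all correct in the second half.
    print(correct_rate_by_stage([0] * 10 + [1] * 10, n_stages=4))
```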

MT enabled faster learning with higher quality.

(A), Framework of the machine teaching (MT) algorithm (see text for details). (B), Working memory task as in Fig. 4A, but with the full stimulus generation matrix. (C), Averaged number of trials needed to reach the criterion for MT-based and random trial type selection strategies. **, p<0.01, two-sided Wilcoxon rank-sum test. (D), Absolute difference between the contrast (contr.) of sample1 (S1) and sample2 (S2) during the training process for the two strategies. (E), Same as (D) but for correct rate. (F), MT-based d2AFC task training. Box plots of the correct rate of expert mice (left) and the number of trials needed to reach the criterion (right) for different training strategies (MT, anti-bias, and random). n.s., p>0.05, Kruskal–Wallis tests. (G), Left, averaged absolute performance bias for the three strategies during different training stages. Right, averaged across training stages. (H), Same as (G) but for absolute trial type bias. (I), Percentage of trials showing significance for different regressors during task learning. (J-K), Box plots of correct rate (J) and of the difference in prediction performance between the full model and the partial model excluding the current stimulus (S0) (K) for different training stages, including early (perf. > 75%), middle (perf. > 80%), and well (perf. > 85%) trained. *, p<0.05, **, p<0.01, ***, p<0.001, n.s., p>0.05, two-sided Wilcoxon rank-sum tests with Bonferroni correction.

MT manifested distinct learning path with faster forgetting and higher learning rate.

(A), Task structure. (B), Chart of the training path in latent decision space following three goals one by one. (C), Top, averaged correct rate across grouped mice during training (color, machine teaching; black, random). Bottom, same as top but showing performance for the non-relative cue. (D), Top, slopes of the linear regression between trial number and correct rate. Bottom, same as top but between trial number and performance for the non-relative cue. **, p<0.01; n.s., p>0.05; two-sided Wilcoxon rank-sum tests. (E), Learning paths of mice (lines) in latent decision space for the machine teaching and random training strategies. Light dots represent model weights fitted to individual mice’s behavioral data. Shaded dots, averages across mice. (Square dots, testing protocol; cross dots, the first or the last half of trials in the learning protocol; circle dots, all trials in the learning protocol.) (F), Left, averaged absolute trial type bias between stay and switch conditions across grouped mice for the MT and random strategies from L1 to L3. Right, same as left but for the bias between left and right trials. (G), Same as (F) but for absolute performance bias in the T1 and T2 protocols. L1, the first 500 trials of the frequency learning protocol; L2, intermediate trials of the frequency learning protocol; L3, the last 200 trials of the frequency learning protocol; T1, testing orientation protocol; T2, testing frequency protocol. *, p<0.05; n.s., p>0.05; two-sided t-tests.

HABITS system.

(A) Block diagram of the control system of HABITS, showing peripherals connected to the microcontroller through digital input/output (DIO) or serial port. (B) Graphical user interface (GUI) of a specific cage (left, magnified) and the data plot window (right) shown when the ‘plot’ button in the GUI is clicked, displaying daily performance on all previous days, trial performance (green for correct and red for error trials) in the last 24 hours, and body weight data in the last 24 hours. (C) Example protocol programs for HABITS. (D) Around 100 HABITS are packed on standard racks for large-scale mouse behavioral testing. (E) Workflow pipeline for HABITS, showing fully autonomous mouse behavioral training after initialization of HABITS and before data harvest from the SD card for analysis.

Autonomous versus manual training in home-cage.

(A), Flow chart of the task training protocol in home-cage (Materials and Methods). (B), Logistic regression model. (C), Top, behavioral performance of an example mouse in autonomous training. Bottom, significance of individual regressors; circle size corresponds to p values. The significance of a regressor is evaluated by comparing the prediction of the full model to that of a partial model with the regressor of interest excluded. P values are based on a cross-validation t-test (Materials and Methods). (D), Percentage of trials significantly relying on different regressors during task learning. Circles and light lines, individual mice; bars and bold lines, average across mice; shading and error bars, 95% CI. *, p<0.05, n.s., p>0.05, two-sided Wilcoxon rank-sum tests. (E), Averaged water harvested per day (left) and number of trials per day (right) when changing from manual to autonomous training in home-cage. Circles, individual mice; bars and error bars, mean and 95% CI across mice. (F), Averaged relative body weights as a function of training days for free-water mice (blue) and all d2AFC training mice (black). Shaded area shows 95% CI. (G), Performance of all 6 female mice performing the d2AFC task in home-cage automatically.
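
The full-versus-partial model comparison described in (C) can be illustrated with a short sketch. It is not the authors' implementation: it assumes a plain logistic regression fitted by gradient descent, interleaved 5-fold cross-validation, and held-out log-likelihood as the prediction score, and it reports only a paired t statistic across folds rather than a p value. All function names and the simulated data are hypothetical.

```python
import math
import random


def fit_logistic(X, y, lr=0.5, epochs=150):
    """Logistic regression by batch gradient descent; the last weight is the intercept."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w


def mean_log_likelihood(w, X, y):
    """Average log-likelihood of the observed choices under weights w."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
        p = min(max(1.0 / (1.0 + math.exp(-z)), 1e-9), 1.0 - 1e-9)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total / len(X)


def regressor_contribution(X, y, k, n_folds=5):
    """Held-out gain of the full model over the model without regressor k.

    Returns (mean gain across folds, paired t statistic across folds).
    """
    X_partial = [[v for j, v in enumerate(row) if j != k] for row in X]
    gains = []
    for f in range(n_folds):
        test = [i for i in range(len(X)) if i % n_folds == f]
        train = [i for i in range(len(X)) if i % n_folds != f]
        scores = []
        for data in (X, X_partial):
            w = fit_logistic([data[i] for i in train], [y[i] for i in train])
            scores.append(mean_log_likelihood(
                w, [data[i] for i in test], [y[i] for i in test]))
        gains.append(scores[0] - scores[1])
    mean_gain = sum(gains) / n_folds
    var = sum((g - mean_gain) ** 2 for g in gains) / (n_folds - 1)
    t_stat = mean_gain / math.sqrt(var / n_folds) if var > 0 else float("nan")
    return mean_gain, t_stat


def simulate_trials(n=200, seed=0):
    """Toy data: choice depends on regressor 0 (stimulus) but not on regressor 1."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        stimulus = rng.choice([-1.0, 1.0])
        X.append([stimulus, rng.gauss(0.0, 1.0)])
        p_right = 1.0 / (1.0 + math.exp(-3.0 * stimulus))
        y.append(1 if rng.random() < p_right else 0)
    return X, y


if __name__ == "__main__":
    X, y = simulate_trials()
    for k, name in enumerate(["stimulus", "unrelated regressor"]):
        print(name, regressor_contribution(X, y, k))
```

In this toy setting, removing the stimulus regressor should cost far more held-out likelihood than removing the unrelated one, which is the pattern a significant regressor would show.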

Reaction time based 2AFC task training in home-cage automatically.

(A), Task structure of the RT-based 2AFC task. (B), Flow chart of the training protocol in home-cage. (C), Conditioned behavioral data of example trials for correct (blue block) and error (red block) choices. (D), Performance of an example mouse performing the task in home-cage. Background colors correspond to the stages in (B). Grey blocks indicate the dark cycle. Grey dashed line, criterion performance. Red horizontal dashed line, chance performance level. (E), Correct rate of all mice. (F), Reaction time of all mice. Black line, fit to all mice from the onset to the end of training. (G), Histogram of reaction time. Data collected from all mice. The bold vertical line represents the median RT. (H), Conditioned histogram of inter-trial interval (ITI) for correct (blue) and error (red) trials.

Value-based dynamic foraging task.

(A), Task structure. (B), Example performance of a mouse in the early (top, first 6000 trials) and late (bottom, last 6000 trials) training stages with block size 500. Blue lines represent the moving average of the behavioral probability of left choice within 40 trials. Purple lines show the assignment probability for left reward. (C), Averaged probability of choosing the lickport with the higher assignment probability (P(high)) across mice gradually increases with the number of trials. Black line indicates that the assignment probabilities for the left and right lickports are 60% (grey line, 52.5%) and 10% (grey line, 17.5%), respectively. Dots and error bars, mean and 95% CI. (D), Left, averaged P(high) across mice across training sub-protocols with different block sizes. Right, number of days to complete all training protocols from block size 500 to 100. Square dots indicate individual mice. (E), Same as (B) but with data collected from the sub-protocol with block size 100.
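
The moving-averaged left-choice probability in (B) can be computed along the following lines. This is a sketch under one assumption the legend does not settle: the 40-trial window is taken as a trailing window ending at the current trial (a centered window would be equally plausible). The function name is hypothetical.

```python
def moving_left_choice_probability(choices, window=40):
    """Trailing moving average of left choices.

    choices : one value per trial, 1 for a left choice, 0 for a right choice
    window  : number of most recent trials averaged (40 in the legend)

    Early trials use however many trials are available so far.
    """
    averaged = []
    running = 0
    for i, choice in enumerate(choices):
        running += choice
        if i >= window:
            running -= choices[i - window]  # drop the trial leaving the window
        averaged.append(running / min(i + 1, window))
    return averaged


if __name__ == "__main__":
    print(moving_left_choice_probability([1, 1, 0, 0], window=2))
```

Plotting this trace against the assigned left-reward probability reproduces the kind of comparison shown by the blue and purple lines.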

Other complex cognitive behavioral tasks training in home-cage automatically.

(A), Left, stimulus generation matrix of the working memory task. Middle, number of days to train. Right, correct rate for the SGM. Values on the diagonal correspond to the correct rate of probe trials. (B), Top, d3AFC task according to sound orientation and number of days to reach the criterion performance. Dots indicate individual mice. Performance (middle) and earlylick rate (bottom) of all mice performing the d3AFC task. Red dashed line, chance performance level; grey dashed line, criterion performance. (C), Left, contingency reversal of the d3AFC task according to sound frequency (top) and performance of an example mouse (bottom). Right, averaged correct rate across all mice for different numbers of reversals (top). Number of trials needed to learn as a function of the number of reversals (bottom). Dots, individual mice. Line and shading, linear regression.

Simulation of machine teaching algorithm in decision-making scenario.

(A), Weights of the regressors in an ideal learner vary while learning a 2AFC task. Note that the initial weights of the bias and S1 regressors are not zero. (B), Presented trial types generated by the random (black) and MT (red) strategies during the entire training process. (C), Same as (A) but with the weights of all regressors beginning at zero.

Details of behavioral analysis for multi-dimensional tasks.

(A), Left, linear regression between trial number and correct rate in the task requiring mice to attend to sound frequency. Right, R-squared of each individual linear regression. (B), Same as (A) but for performance following the non-relative cue. (C), Number of trials to reach criterion performance for the MT and random groups. (D), Performance of both groups of mice in the T1 and T2 protocols. n.s., not significant, two-sided Wilcoxon rank-sum tests. (E), Presented individual trials with Stay/Switch (top) and Left/Right (bottom) trial types generated by MT (L3) and Random (T2). (F, G), After mice were trained by MT as in Fig. 6A, their training protocol was immediately reset to the beginning and they were retrained with a randomly generated trial sequence. (F) compares the correct rate of trials with sound frequency stimulus in the first and the second training. (G) shows the learning rate (left) and training efficiency (right) of the first and the second training processes. **, p<0.01; two-sided Wilcoxon signed-rank tests. (H), Correct rate of both groups of mice for stay and switch trials in the T2 protocol. n.s., not significant, two-sided Wilcoxon rank-sum tests.

Building materials of HABITS.