Figures and data

Monkeys rapidly learn various abstract object concept classification tasks.
(A) Object drag task. In each trial, after the monkey held a fixation dot, an object image and two gray target boxes appeared. The monkey had to touch and drag the image to the correct box to receive a reward. The two boxes were associated with different object concepts. (B) We used grayscale photographs of objects with no background in this first set of experiments. We used ∼100 training images to train monkeys and measured their generalization performance with test images. (C) Example reach trajectories showed that monkeys tended to choose targets randomly on Day 1, but by Day 3, they chose correct targets in many trials. (D) All monkeys’ correct rates rapidly improved during the first three days of the animate vs. inanimate task, reaching 85-90% before plateauing (chance level: 50%). Before the performance plateaued, monkeys saw each training image 10-20 times (inset). (E) After learning the animate vs. inanimate task (post-Day 3), monkeys made more than 90% correct choices for many images, but performed poorly on certain images. The plot is the average of the three monkeys, with images ordered by correct rates. The shading indicates S.E.M. across trials. (F) Reaction times (RTs), measured as the time from stimulus onset to reaching a target, also varied across images. The plot is the average of the three monkeys, with images ordered by RTs. (G) We tested a variety of classification rules in succession. Each task used 60-96 grayscale training images. (H) Monkeys learned all tasks with similar learning rates. The plot shows the average of the three monkeys. (I) After training on each task, the monkeys could generalize the rule to new stimuli they saw for the first time. The error bars represent S.E.M. across images. All images were used either under the CC0 license or the Pixabay content license.

Behavior cannot be explained by exemplar-based strategies or some accidental features.
(A) We assessed monkey performance in three tasks using large-scale image sets from the THINGS database (Hebart et al. 2019). We used 100 images to train monkeys (top row) and then presented 1,500-3,600 generalization images, containing “old” object categories present in training images (second row) and “new” categories that were never shown (bottom two rows). Each task used different training images. (B) All monkeys performed well on generalization images (75-93%) for both the “old” and “new” object categories. (C) Performance was generally high across a range of object categories. (D) Monkeys also performed well on control images that lacked certain visual features, such as cartoons without naturalistic textures, silhouettes without internal structures such as faces, and grayscale images without color. (B-D) Error bars indicate S.E.M. across object categories. All images were used either under CC0 license or Pixabay content license.

Absence of a common conceptual rule results in poorer performance.
(A-B) We assessed how well monkeys could learn stimulus-response associations in the absence of shared conceptual rules in two control tasks. In one task (A), two targets were associated with randomly selected images from our grayscale cropped image set (48 images per target). Thus, the monkeys had to remember the associations for individual images. In the other task (B), two targets were randomly associated with concrete object concepts (e.g., apple, horse, bin; referred to here as the “concrete category”; 16 concrete categories for each target). During the generalization test, different images of the same concrete categories were shown. (C) Although the monkeys could gradually learn both control tasks, performance was consistently lower than on the main concept classification tasks (the average across the six tasks shown in Fig. 1; p < 10⁻¹⁰ for all days, aggregated across monkeys, binomial test). (D) Their generalization performance for the concrete category randomized task (B) was also lower than that of the six main tasks (p < 0.016 for all three monkeys, binomial test). Error bars are S.E.M. across images. Asterisks indicate statistical significance: ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001.

Monkey performance is better explained by DNNs than by low- or mid-level vision models.
(A) Category classification accuracy of various vision models, including deep neural networks (DNNs). We constructed a linear classifier using the output of each model based on the training images and plotted their performance on the generalization images we used for the monkeys. The orange dashed lines represent the average classification performance of the three monkeys. (B) Accuracy of predicting monkey choices. A linear classifier was trained to classify monkey choices instead. P (match) indicates the probability that the monkey and model choices matched. DNNs outperformed lower-level vision models. (C) Accuracy of category and monkey-choice classification, based on neural responses in ventral pathway areas, was higher at later stages along the pathway. We employed neural responses collected by Papale et al. (2025), who used the same THINGS database.
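
The choice-prediction analysis in (B) can be sketched as follows. This is only an illustration under assumptions, not the authors' pipeline: the plain gradient-descent logistic regression, the function names (`fit_logistic`, `predict`, `p_match`), the learning rate, and the one-dimensional toy features are all hypothetical stand-ins for a linear classifier fitted to model outputs on the training images and scored on generalization images.

```python
import math

def fit_logistic(features, labels, lr=0.5, epochs=500):
    """Fit a linear (logistic) classifier by batch gradient descent.

    features: list of feature vectors (e.g., reduced model outputs; assumed).
    labels:   list of 0/1 targets (category labels or monkey choices).
    """
    n_dim = len(features[0])
    n = len(features)
    w = [0.0] * n_dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(n_dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return the classifier's binary choice (0 or 1) for one feature vector."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def p_match(model_choices, monkey_choices):
    """Fraction of images on which model and monkey choices agree."""
    agree = sum(m == k for m, k in zip(model_choices, monkey_choices))
    return agree / len(monkey_choices)

# Toy data (made up): fit on "training" choices, score on "generalization" choices.
w, b = fit_logistic([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
model_choices = [predict(w, b, x) for x in [[-1.5], [1.5], [0.5]]]
match = p_match(model_choices, [0, 1, 0])
```

Fitting the same classifier to category labels instead of monkey choices would give the classification accuracy in (A); only the target vector changes.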

Human behavioral responses were correlated with monkey performance.
(A) Humans quickly learned the tasks in Fig. 1 within 20-30 trials. The plots are the average of seven participants per task. (B) Reaction times (RTs), averaged across participants, differed across images. (C) We fitted the drift-diffusion model (DDM) to the choices and RTs. The model makes a decision based on a stochastic state biased by the category sensitivity of each image. (D) The DDM accurately fit choices, RTs, and RT distributions. The plots are from the animate vs. inanimate task. Image IDs were sorted by category sensitivity. RT distributions were generated using the trials aggregated across the top (easy) and bottom (difficult) 10 images. See Supplementary Fig. 6 for all tasks. (E) Example comparison of category sensitivity between humans and monkeys. Each dot represents an image, whose absolute category sensitivity was averaged across participants. The line is a linear regression, and shading indicates the standard error of the prediction line. (F) Positive correlations between humans and monkeys. Error bars indicate S.E.M. across images. (G) The category sensitivity for the control images (cartoon, silhouette, and outline) followed the same order in monkeys and humans. (H) Positive correlations were also evident within each control image set.
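
The decision process described in (C) can be illustrated with a minimal simulation. This is a generic drift-diffusion sketch under stated assumptions, not the fitted model from the paper: the symmetric bounds, unit noise, non-decision time, time step, and the function name `simulate_ddm_trial` are all hypothetical, and the per-image `drift` stands in for the category sensitivity.

```python
import math
import random

def simulate_ddm_trial(drift, bound=1.0, noise=1.0, non_decision=0.3,
                       dt=0.001, max_t=5.0, rng=random):
    """Simulate one trial of a simple drift-diffusion process.

    drift: signed sensitivity of the image (positive favors choice 1).
    Returns (choice, reaction_time_in_seconds).
    """
    x = 0.0  # accumulated evidence (the stochastic state)
    t = 0.0  # elapsed decision time
    step_sd = noise * math.sqrt(dt)
    while abs(x) < bound and t < max_t:
        x += drift * dt + step_sd * rng.gauss(0.0, 1.0)
        t += dt
    # If max_t is reached before a bound, choose by the sign of the state.
    choice = 1 if x >= 0 else 0
    return choice, t + non_decision

# An "easy" image (large drift) versus an ambiguous one (zero drift).
rng = random.Random(0)
easy = [simulate_ddm_trial(3.0, rng=rng) for _ in range(200)]
hard = [simulate_ddm_trial(0.0, rng=rng) for _ in range(200)]
easy_rate = sum(c for c, _ in easy) / len(easy)
hard_rate = sum(c for c, _ in hard) / len(hard)
```

In such a model, images with larger absolute drift produce both more consistent choices and shorter RTs, which is why a single sensitivity parameter per image can account for choices and RTs together.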

Triangular comparison across monkeys, humans, and artificial networks.
(A) To determine the limit of monkeys’ ability to classify concepts, we tested more concept tasks. As with the other tasks in Fig. 1, we trained the three monkeys with 90-120 images and assessed their generalization performance with 90-120 new images. (B) The monkeys (n = 3) performed well on tasks such as big vs. small objects and indoor vs. outdoor scenes, but they failed on some abstract tasks, such as fire- vs. water-related objects and Western vs. Eastern objects. Visual deep neural networks (DNNs) such as ResNet-50 also showed poorer performance on these tasks, whereas humans (n = 9) performed almost perfectly across all tasks. The monkey performance for the three THINGS tasks in Fig. 2 was replotted using only the first 100 test stimuli to facilitate comparison with the other tasks. See Supplementary Fig. 5A for the performance of other DNNs. Error bars indicate S.E.M. across images. (C) Finally, we concatenated the generalization performances of images across all tasks to generate the matrix of dissimilarity in behavioral performance across humans, monkeys, and models. (D) A closer look at the dissimilarity revealed that the monkey behavior was most similar to visual DNNs without language input, whereas the human behavior was most similar to language-informed DNNs. (E) t-SNE visualization of the dissimilarity matrix showed a spectrum from low-level visual models to language-informed DNNs, with the monkey behavior placed in the middle.

Pretraining of monkeys on the task structure using shape stimuli.
(A) Behavioral testing system attached to the monkeys’ home cage. Monkeys reached for a touchscreen through an arm hole in the front panel and received a juice reward from a cone-shaped reward socket (Kawaguchi et al. 2019). While performing tasks, they placed their mouth on the reward socket and looked at the screen through a viewing hole. (B) Prior to the main concept classification tasks, we trained the monkeys to perform the classification of amorphous shape stimuli. The task sequence followed the main design (Fig. 1A), but the monkeys had to categorize shape stimuli according to their shape, color, or texture. We created nine groups of stimuli as shown in this panel and chose two of them to classify for each task. In each group, there were 50 images with random bump positions. (C) The number of sessions (days) required for the monkeys to learn each shape classification task plotted in the trained order. Prior to the first task, monkeys were trained to touch a shape stimulus and move it to a target box. Then, in the first “Basic” vs. “Red” task, they were presented with two target boxes and had to learn to select one of them based on a stimulus feature. This initial stimulus-response association training took 9-11 days (see Methods). The monkeys were subsequently trained to classify different pairs of shapes in 2-5 days each. In the “Others” vs. “Red” task, the monkeys had to choose one target for red stimuli and the other target for stimuli randomly selected from the remaining eight stimulus categories.

Monkey reach trajectories reflected their decision-making process.
(A) Monkeys moved a stimulus to a target along a straight path in most trials, but they occasionally made detours as if they changed their mind (red lines; Kaufman et al. 2015). These trials were detected by looking for trajectories whose average position was on the opposite side of the midline from the chosen target. The right target box corresponded to the animate category in this example. (B) Monkeys exhibited these detour trajectories much more often for difficult images. For each task, we selected the 10 easiest and the 10 most difficult images based on correct rates and plotted the probability of trials in which the monkeys exhibited detour trajectories. The plots aggregate the data from the three monkeys.

Monkeys failed to generalize a rule to texform images.
It has been shown that animacy information is partially preserved even in images that retain only mid-level textural visual features (texform; Long et al. 2018). We found that the monkeys failed to generalize the learned animate vs. inanimate rule to texform images. Thus, it is unlikely that the monkeys relied primarily on mid-level textural features in our main task. Error bars are S.E.M. across images. (A) We first confirmed that monkeys could generalize the rule to the original source images (object photographs) used to generate texform images by Long et al. (2018). (B) However, when we showed texform images, the monkey performance decreased to the chance level (p = 0.26-0.82, BF₀₁ = 4.7-6.3, binomial test). We used 60 texform images (Set 1). (C) We then attempted to train the monkeys to discriminate animate vs. inanimate using the same set of texform images (Set 1). Although the monkeys gradually improved, their performance was consistently lower than that of the main concept tasks for all monkeys (p ≤ 9.5 × 10⁻⁴ for the first 4 days). In particular, monkey El showed a much slower learning curve. (D) Monkeys Ju and Ol were able to generalize the learned classification to other texform images (p < 2.1 × 10⁻⁵), but the performance of monkey El was statistically indistinguishable from chance level (p = 0.078, BF₀₁ = 1.9, binomial test).
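
For readers unfamiliar with the chance-level comparisons reported here, an exact binomial test against 50% can be sketched as below. This is a generic two-sided test written for illustration, not the authors' analysis code; the function name and the trial counts in the example are made up, and the Bayes factors in the caption are not computed here.

```python
from math import comb

def binomial_test_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test.

    k: number of correct trials, n: total trials, p: chance level.
    The p-value sums the probabilities of all outcomes that are no more
    likely than the observed one under the null hypothesis.
    """
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    # A small relative tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-7))
    return min(1.0, total)

# Hypothetical counts: near-chance versus clearly above-chance performance.
p_near_chance = binomial_test_two_sided(33, 60)
p_above_chance = binomial_test_two_sided(50, 60)
```

A large p-value alone does not establish chance-level performance, which is presumably why the caption reports Bayes factors alongside it.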

Model performance did not depend on classification or dimension reduction methods.
(A) We confirmed that the classification and behavioral fitting performances of the models (Fig. 4) did not strongly depend on the number of dimensions of the model parameters reduced by principal component analysis (PCA). In each panel, the darker line represents the model performance in correctly classifying object concepts, and the lighter line represents the model performance in fitting monkey choices. In both cases, the fitting was performed using the training image sets, and the performance was evaluated using the generalization image sets used in the monkey experiments shown in Fig. 2. (B, C) Model performance was not strongly dependent on the methods used for dimension reduction and classification. We tested multi-dimensional scaling for dimension reduction (B) and used a support vector machine instead of logistic regression for classification (C) and found that the results did not change qualitatively.

Model classification and behavioral fitting performance for all tasks.
(A) Summary of model performance for all tasks shown in Figs. 1, 2, and 6. For the tasks using large-scale THINGS images (Fig. 2), we used only the first 100 generalization stimuli to compute the performance for proper comparison; thus, the performances do not exactly match those shown in Fig. 4. The classification performances of ResNet-50 and CLIP are the same as those shown in Fig. 6A. (B) DNNs could also classify modified stimuli outside the range of natural images (cartoon, silhouette, and outline images), which we tested with monkeys (Fig. 5G). The performance tended to be lower for the outline images, which is consistent with the monkey and human behavioral results.

The drift-diffusion model (DDM) could accurately fit choices and reaction times.
(A, B) Fitting of the DDM (Fig. 5C) to the behavioral data of individual tasks for humans (A) and monkeys (B). The conventions are the same as those in Fig. 5D. Model fit was performed for the tasks using grayscale image sets (Fig. 1), in which the monkeys repeated many trials for the same stimuli. See Methods for the details of the model fit. The image IDs were sorted by the sensitivity parameter (dri in Eq. 2) of the model fit averaged across participants for each task. The fit lines are not smooth because the plots are the average across participants who had different rank orders of sensitivity across image IDs.