Gaze patterns and brain activations in humans and marmosets in the Frith-Happé theory-of-mind animation task

  1. Audrey Dureux (corresponding author)
  2. Alessandro Zanini
  3. Janahan Selvanayagam
  4. Ravi S Menon
  5. Stefan Everling
  1. Centre for Functional and Metabolic Mapping, Robarts Research Institute, University of Western Ontario, Canada
  2. Department of Physiology and Pharmacology, University of Western Ontario, Canada

Abstract

Theory of Mind (ToM) refers to the cognitive ability to attribute mental states to other individuals. This ability extends even to the attribution of mental states to animations featuring simple geometric shapes, such as the Frith-Happé animations in which two triangles move either purposelessly (Random condition), exhibit purely physical movement (Goal-directed condition), or move as if one triangle is reacting to the other triangle’s mental states (ToM condition). While this capacity in humans has been thoroughly established, research on nonhuman primates has yielded inconsistent results. This study explored how marmosets (Callithrix jacchus), a highly social primate species, process Frith-Happé animations by examining gaze patterns and brain activations of marmosets and humans as they observed these animations. We revealed that both marmosets and humans exhibited longer fixations on one of the triangles in ToM animations, compared to other conditions. However, we did not observe the same pattern of longer overall fixation duration on the ToM animations in marmosets as identified in humans. Furthermore, our findings reveal that both species activated extensive and comparable brain networks when viewing ToM versus Random animations, suggesting that marmosets differentiate between these scenarios similarly to humans. While marmosets did not mimic human overall fixation patterns, their gaze behavior and neural activations indicate a distinction between ToM and non-ToM scenarios. This study expands our understanding of nonhuman primate cognitive abilities, shedding light on potential similarities and differences in ToM processing between marmosets and humans.

Editor's evaluation

Dureux and colleagues provide important evidence regarding the capacity for mental state attribution in a highly social non-human primate species, the marmoset. Their findings suggest that marmosets and humans visually track abstract stimuli more closely during ToM animations and display differential activation of large-scale networks implicated in social processing. These findings will be of wide interest to scientists interested in social cognition.

https://doi.org/10.7554/eLife.86327.sa0

eLife digest

In our daily life, we often guess what other people are thinking or intending to do, based on their actions. This ability to ascribe thoughts, intentions or feelings to others is known as Theory of Mind.

While we often use our Theory of Mind to understand other humans and interpret social interactions, we can also apply our Theory of Mind to assign feelings and thoughts to animals and even inanimate objects. For example, people watching a movie where the characters are represented by simple shapes, such as triangles, can still see a story unfold, because they infer the triangles’ intentions based on what they see on the screen.

While it is clear that humans have a Theory of Mind, how the brain manages this capacity and whether other species have similar abilities remain open questions. Dureux et al. used animations showing abstract shapes engaging in social interactions and advanced brain imaging techniques to compare how humans and marmosets – a type of monkey that is very social and engages in shared childcare – interpret social cues. By comparing the eye movements and brain activity of marmosets to human responses, Dureux et al. wanted to uncover common strategies used by both species to understand social signals, and gain insight into how these strategies have evolved.

Dureux et al. found that, like humans, marmosets seem to perceive a difference between shapes interacting socially and moving randomly. Not only did their gaze linger longer on certain shapes in the social scenario, but their brain activity also mirrored that of humans viewing the same scenes. This suggests that, like humans, marmosets possess an inherent ability to interpret social scenarios, even when they are presented in an abstract form, providing a fresh perspective on primates’ abilities to interpret social cues.

The findings of Dureux et al. have broad implications for our understanding of human social behavior and could lead to the development of better communication strategies, especially for individuals with social cognitive conditions, such as Autism Spectrum Disorder. However, further research will be needed to understand the neural processes underpinning the interpretation of social interactions. Dureux et al.’s research indicates that the marmoset monkey may be the ideal organism to perform this research on.

Introduction

Theory of Mind (ToM) refers to the capacity to ascribe mental states to other subjects (Carruthers and Smith, 1996; Premack and Woodruff, 1978). Various experimental approaches have been devised to investigate the cognitive processes involved in ToM, including tasks involving text (Happé, 1994), non-verbal pictures (Sarfati et al., 1997), false belief (Wimmer and Perner, 1983), and silent animations featuring geometric shapes. The latter approach is based on Heider and Simmel’s observation that participants attribute intentional actions, human character traits, and even mental states to moving abstract shapes (Heider and Simmel, 1944). Subsequent studies used these animations to test the ability to ascribe mental states in autistic children (Bowler and Thommen, 2000; Klin, 2000).

In the Frith-Happé animations, a large red triangle and a small blue triangle move around the screen (Abell et al., 2000; Castelli et al., 2002; Castelli et al., 2000). In the Random condition, the two triangles do not interact and move purposelessly; in the Goal-Directed (GD) condition, the triangles interact but in a purely physical manner (i.e. chasing, dancing, fighting and leading); and in the ToM condition, the two animated triangles move as if one triangle is reacting to the other’s mental state (i.e. coaxing, surprising, seducing and mocking). Functional imaging studies have demonstrated that the observation of ToM compared to Random animations activates brain regions typically associated with social cognition, including dorso-medial frontal, temporoparietal, inferior and superior temporal cortical regions (Barch et al., 2013; Bliksted et al., 2019; Castelli et al., 2000; Chen et al., 2023; Gobbini et al., 2007; Vandewouw et al., 2021; Weiss et al., 2021; Wheatley et al., 2007).

Although the spontaneous attribution of mental states to moving shapes has been well established in humans, it remains uncertain whether other primate species share this capacity. There is some evidence suggesting that monkeys can attribute goals to agents with varying levels of similarity and familiarity to conspecifics, including human agents, monkey robots, moving geometric boxes, animated shapes, and simple moving dots (Atsumi et al., 2017; Atsumi and Nagasaka, 2015; Krupenye and Hare, 2018; Kupferberg et al., 2013; Uller, 2004). However, the findings in this area are somewhat mixed, with some studies investigating the attribution of goals to inanimate moving objects yielding inconclusive results (Atsumi and Nagasaka, 2015; Kupferberg et al., 2013). Nonhuman primates' spontaneous attribution of mental states to Frith-Happé animations is even less certain. While human subjects exhibit longer eye fixations when viewing the ToM condition compared to the Random condition of the Frith-Happé animations (Klein et al., 2009), a recent eye tracking study in macaque monkeys did not observe similar differences (Schafroth et al., 2021). Similarly, a recent fMRI study conducted on macaques found no discernible differences in activations between ToM and random Frith-Happé animations (Roumazeilles et al., 2021).

In this study, we investigated the behavior and brain activations of New World common marmoset monkeys (Callithrix jacchus) while they viewed Frith-Happé animations. Living in closely-knit family groups, marmosets exhibit significant social parallels with humans, including prosocial behavior, imitation, and cooperative breeding. These characteristics establish them as a promising nonhuman primate model for investigating social cognition (Burkart et al., 2009; Burkart and Finkenwirth, 2015; Miller et al., 2016). To directly compare humans and marmosets in their response to these animations, we employed high-speed video eye-tracking to record eye movements in eleven healthy humans and eleven marmoset monkeys. Additionally, we conducted ultra-high field fMRI scans on ten healthy humans at 7T and six common marmoset monkeys at 9.4T. These combined methods allowed us to examine the visual behavior and brain activations of both species while they observed the Frith-Happé animations.

Results

To investigate whether marmoset monkeys, like humans, exhibit distinct processing patterns in response to the conditions in Frith-Happé animations (i.e. ToM, GD, and Random conditions), we compared gaze patterns and fMRI activations in both marmosets and human subjects as they watched shortened versions of the Frith-Happé animations (Figure 1).

Task Design.

Two conditions of video clips (ToM and Random animations), yielding eight animations, were used during scanning; an additional condition with four animations was added for the eye-tracking (ToM, GD and Random animations). In the ToM animations, one triangle reacted to the other triangle’s mental state, whereas in the Random animations the same two triangles did not interact with each other. In the GD animations, the two triangles interact with simple intentions. Each animation lasted 19.5 s, and animations were separated by baseline blocks of 15 s during which a dot was displayed in the center of the screen. In the fMRI task, several runs were used with a randomized order of the two conditions, whereas in the eye-tracking task a single run presented each of the twelve animations once.
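The block timing described in this caption can be laid out as a simple schedule. The following is a minimal sketch, not the authors' presentation code: the 19.5 s and 15 s durations come from the caption, while the function names, the choice to start and end the run with a baseline block, and the example condition order are assumptions made for illustration.

```python
ANIMATION_S = 19.5  # animation duration stated in the caption
BASELINE_S = 15.0   # baseline block duration stated in the caption


def block_schedule(conditions):
    """Return (label, onset_s, duration_s) tuples for one run.

    Assumption (not stated in the source): the run opens with a baseline
    block and every animation is followed by one baseline block.
    """
    events = [("baseline", 0.0, BASELINE_S)]
    t = BASELINE_S
    for cond in conditions:
        events.append((cond, t, ANIMATION_S))
        t += ANIMATION_S
        events.append(("baseline", t, BASELINE_S))
        t += BASELINE_S
    return events


def run_length(events):
    """Total run duration in seconds (end of the last block)."""
    _, onset, duration = events[-1]
    return onset + duration


# Hypothetical eye-tracking run: twelve animations, four per condition.
# The order used here is illustrative only.
eye_tracking_run = block_schedule(["ToM", "GD", "Random"] * 4)
```

Under these assumptions, a twelve-animation eye-tracking run would last 12 × 19.5 s + 13 × 15 s = 429 s. The actual run length depends on how the run was opened and closed, which the caption does not specify.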

Gaze patterns for Frith-Happé’s ToM, GD and Random animations in humans and marmosets

We first investigated in both humans and marmosets whether fixation durations differed between the three conditions (Figure 2A). By conducting mixed analyses of variance (ANOVA), with factors of species (Human vs Marmoset) and condition (ToM vs GD vs Random animation videos), we found a significant interaction between species and condition (F(2,40)=13.9, p<0.001, ηp²=0.410). Here we observed longer fixation durations for ToM animation videos (M=432.6 ms) as compared to GD videos (M=279.9 ms, p=0.008) and Random videos (M=308.2 ms, p=0.01) for humans (p=0.029) but not for marmosets (233.7 ms for ToM videos, 219.6 ms for GD videos and 235.6 ms for Random videos, ToM vs GD: p=0.90 and ToM vs Random: p=1). This finding confirms that humans fixate longer in the ToM condition (Klein et al., 2009), whereas marmosets, like macaques (Schafroth et al., 2021), do not show this effect.
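The partial eta squared for this interaction can be cross-checked from the F statistic and its degrees of freedom, using the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below is only a consistency check on the reported numbers. The function name is illustrative and this is not the authors' analysis code.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Species x condition interaction on fixation duration: F(2,40) = 13.9
effect_size = partial_eta_squared(13.9, 2, 40)
print(round(effect_size, 3))  # 0.41
```

The result, about 0.41, agrees with the effect size reported for the interaction.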

Fixation duration (A) and proportion of time looking at the triangles (B) in Frith-Happé’s ToM, GD and Random animations in humans (left) and marmosets (right).

(A) Bar plot depicting the fixation duration on the screen as a function of each condition. (B) Bar plot representing the proportion of time the radial distance between the current gaze position and each triangle was within 4 visual degrees, as a function of each condition. Green represents results obtained for ToM animation videos, orange represents results for GD animation videos and blue represents results for Random animation videos. In each graph, the left panel shows the results for 11 humans and the right panel for 11 marmosets. Each colored bar represents the group mean and the vertical bars represent the standard error of the mean. The differences between conditions were tested using ANOVA: p<0.05*, p<0.01** and p<0.001***.

To further analyze the gaze patterns of both humans and marmosets, we next measured the proportion of time subjects looked at each of the triangles in the videos (Figure 2B). We conducted mixed ANOVAs on the proportion of time the radial distance between the current gaze position and each triangle was within 4 visual degrees for each triangle separately.
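The measure described above, the proportion of time gaze falls within 4 visual degrees of a triangle, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline. It assumes that gaze and triangle positions are already expressed in visual degrees and sampled at the same time points, that lost samples (for example blinks) are marked as `None` and excluded from the denominator, and that "within" includes the boundary. The function name is hypothetical.

```python
import math


def proportion_near(gaze, triangle, radius_deg=4.0):
    """Fraction of valid gaze samples within radius_deg of the triangle.

    gaze:     list of (x, y) tuples in visual degrees, or None for lost samples
    triangle: list of (x, y) tuples in visual degrees, same length as gaze
    """
    if len(gaze) != len(triangle):
        raise ValueError("gaze and triangle must have the same number of samples")
    valid = [(g, t) for g, t in zip(gaze, triangle) if g is not None]
    if not valid:
        return float("nan")
    hits = sum(
        1 for g, t in valid
        if math.hypot(g[0] - t[0], g[1] - t[1]) <= radius_deg
    )
    return hits / len(valid)


# Toy data: the triangle sits at the origin; two of three valid samples are close.
toy_gaze = [(0.0, 0.0), (3.0, 0.0), (10.0, 0.0), None]
toy_triangle = [(0.0, 0.0)] * 4
toy_proportion = proportion_near(toy_gaze, toy_triangle)
```

Computing this value per subject, per condition and per triangle would give the kind of quantity entered into the mixed ANOVAs. Whether the denominator is valid samples or total video duration is a design choice the text does not settle.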

Importantly, we observed a significant interaction between species and condition for the proportion of time spent looking at the large red triangle (F(2,40)=9.83, p<0.001, ηp²=0.330). Specifically, both humans (Figure 2B left) and marmosets (Figure 2B right) spent a greater proportion of time looking at the red triangle in ToM compared to the GD and Random videos (for humans, ToM vs GD: Δ=0.23, p<0.001 and ToM vs Random: Δ=0.31, p<0.001; for marmosets, ToM vs GD: Δ=0.13, p<0.01 and ToM vs Random: Δ=0.13, p<0.01). However, while humans also allocated a greater proportion of time to the red triangle in GD compared to Random animations (Δ=0.08, p=0.05), marmosets did not show any difference between these two conditions (Δ=0.0003, p=1).

For the small blue triangle, we also observed a significant interaction of species and condition (F(2,40)=3.54, p=0.04, ηp²=0.151), but no significant pairwise differences were observed following Bonferroni correction. Thus, neither humans nor marmosets showed a significant difference in the proportion of time spent looking at the blue triangle across the three types of videos (for humans, ToM vs GD: Δ=-0.02, p=1, ToM vs Random: Δ=0.04, p=1 and GD vs Random: Δ=0.07, p=0.23; for marmosets, ToM vs GD: Δ=-0.05, p=0.89, ToM vs Random: Δ=0.07, p=0.66 and GD vs Random: Δ=-0.02, p=1; Figure 2B).
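Several of the pairwise p-values above are reported as exactly 1, which is what a Bonferroni adjustment produces when the raw p-value multiplied by the number of comparisons exceeds 1. The sketch below shows the standard adjustment. The raw p-values in the example are invented for illustration and are not taken from the study, and the size of the comparison family the authors used is not stated here.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the family size, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for three pairwise comparisons (not from the paper).
adjusted = bonferroni([0.004, 0.2, 0.6])
# The third comparison is capped at 1.0 because 0.6 * 3 exceeds 1.
```

A significant omnibus interaction with no surviving pairwise contrasts, as reported for the blue triangle, is a common outcome of this conservative correction.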

These results highlight the variation in gaze patterns observed in both humans and marmosets when their focus is directed towards the large red triangle during the viewing of ToM, GD, and Random videos. Notably, humans show a gradient of proportion of time spent looking at the red triangle across the three conditions, with the smallest proportion in Random videos and the greatest proportion in ToM videos. In contrast, marmosets exhibit a different pattern, spending more time looking at the red triangle in ToM videos, but allocating the same proportion of time to look at the red triangle in both Random and GD videos. This finding suggests that while humans demonstrate distinct attentional preferences for the red triangle across the three conditions, marmosets exhibit a similar attentional focus on the red triangle in the Random and GD conditions, but their pattern differs in the ToM condition. This suggests that marmosets process the Random and GD conditions in a similar manner, but their processing of the ToM condition is distinct, indicating a differential response to stimuli representing social interactions.

Functional brain activations while watching ToM and Random Frith-Happé’s animations in humans

Given that humans exhibited only minor differences, and marmosets showed no differences in eye movements between the Random and GD animations, coupled with task design constraints, we only used the Random and ToM animations for the fMRI studies in both humans and marmosets (see Materials and methods).

We first investigated ToM and Random animations processing in humans. Figure 3 shows group activation maps for ToM (A) and Random (B) conditions as well as the comparison between ToM and Random conditions (C) obtained for human participants.

Figure 3 with 2 supplements
Brain networks involved in processing of Frith-Happé’s ToM and Random animations in humans.

Group functional maps displayed on right fiducial (lateral and medial views) and left and right fiducial (dorsal and ventral views) of human cortical surfaces showing significantly greater activations for ToM condition (A), Random condition (B) and the comparison between ToM and Random conditions (C). The white line delineates the regions based on the recent multi-modal cortical parcellation atlas (Glasser et al., 2016). The maps depicted are obtained from 10 human subjects with an activation threshold corresponding to z-scores >2.57 for regions with yellow/red scale or z-scores <–2.57 for regions with purple/green scale (AFNI’s 3dttest++, cluster-forming threshold of p<0.01 uncorrected and then FWE-corrected α=0.05 at cluster-level from 10,000 Monte-Carlo simulations).
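The ±2.57 z-score cut-off in this caption appears to correspond to the uncorrected cluster-forming threshold of p<0.01, assuming a two-sided test on a standard normal distribution. The short check below uses only the Python standard library. It illustrates the z-to-p relationship and does not reproduce the AFNI cluster correction itself.

```python
from statistics import NormalDist


def z_for_two_sided_p(p):
    """Critical |z| of a standard normal for a two-sided p-value."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


# Assumption: the p<0.01 voxel-wise threshold is two-sided.
z_threshold = z_for_two_sided_p(0.01)  # about 2.576
```

The value of roughly 2.576 is consistent with the 2.57 given in the caption.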

Both ToM (Figure 3A) and Random (Figure 3B) videos activated a large bilateral network. While the same larger areas were activated in both conditions, the specific voxels showing this activation within those areas were typically distinct. In some cases, both conditions activated the same voxels, but the degree of activation differed. This suggests a degree of both spatial and intensity variation in the activations for the two conditions within the same areas. The activated areas included visual areas (V1, V2, V3, V3CD, V3B, V4, V4t, V6A, V7, MT, MST), lateral occipital areas 1, 2, and 3 (LO1, LO2, LO3), temporal areas (FST, PH, PHT, TE2, posterior inferotemporal complex PIT and fusiform face complex FFC), temporo-parietal junction areas (TPOJ2 and TPOJ3), lateral posterior parietal areas also comprising the parietal operculum (supramarginal areas PF, PFt, angular areas PGp and PGi, superior temporal visual area STV, perisylvian language area PSL, medial intraparietal area MIP, ventral and dorsal lateral intraparietal areas LIPv and LIPd, anterior intraparietal area AIP, IPS1, IPS0, 7PC and 5L), medial superior parietal areas (7am, PCV, 5mv), secondary somatosensory cortex (S2), premotor areas (6, 55b, premotor eye field PEF, frontal eye field FEF), and frontal areas (8Av, 8C, IFJp, IFJa).

The ToM condition (Figure 3A) also showed bilateral activations in posterior superior temporal sulcus (STSdp), in temporo-parietal junction area TPOJ1, in ventral visual complex (VVC), in parahippocampal area 3 (PHA3), in lateral posterior parietal areas PFop and PFcm, in lateral prefrontal areas 8C, 8Av, 44 and 45, in inferior frontal areas IFSp, IFSa, and in frontal opercular area 5 (FOP5).

To identify brain areas that are more active during the observation of ToM compared to Random videos, we directly compared the two conditions (i.e., ToM animations > Random animations contrast, Figure 3C and Figure 5A). This analysis reveals increased activations for the ToM condition compared to the Random condition in occipital, temporal, parietal and frontal areas. This includes notable differences in the bilateral visual areas V1, V2, V3, V3CD, V4, V4t, MT, MST, as well as the bilateral LO1, LO2 and LO3 regions. The increase extends into the lateral temporal lobe, as observed in the bilateral PH and FST areas, and into the more inferior part of the temporal lobe in the bilateral TE2, FFC, and PIT areas. We also found greater activations in the bilateral temporo-parietal junction areas (TPOJ1, TPOJ2, TPOJ3), and along the right STS in STSdp and STSda areas. This extended to left and right parietal areas, especially in the inferior parietal lobule, including the right supramarginal and opercular supramarginal areas (PF, PFm, PFt, PFop, and PFcm), left PFt, bilateral opercular areas PSL and STV, bilateral angular areas PGp and PGi, right IPS1, and bilateral IP0. The activation also extended into the superior parietal lobule (right AIP). Moving anteriorly, we observed greater activations during ToM animations in the secondary somatosensory cortex, premotor areas (6r and PEF), lateral prefrontal areas (8C, 44, and 45), and the inferior (IFSa, IFSp, IFJa, and IFJp) as well as the opercular (FOP5) frontal areas in the right hemisphere. In contrast, the Random condition exhibited greater activations than the ToM condition, predominantly within the left and right visual areas (V1, V2, V3, V3A, V4) and in dorsolateral (10d and 10r bilateral), lateral (9m left) and medial frontal areas (d32 and a24 bilateral, p24 and s32 right).

At the subcortical level (see Figure 5—figure supplement 1, left panel), we observed enhanced bilateral activations in the cerebellum and in certain areas of the thalamus (namely, the right ventroposterior thalamus or THA-VP, and the left and right dorsoanterior thalamus or THA-DA) under both ToM (Figure 5—figure supplement 1A, left panel) and Random conditions (Figure 5—figure supplement 1B, left panel) when compared to the baseline. Additionally, a small section of the right amygdala was engaged in the ToM condition. We noted more pronounced activations in the posterior lobe of the cerebellum, the right amygdala and thalamus (right THA-VP, right ventroanterior thalamus or THA-VA, and left and right dorsoposterior thalamus or THA-DP) for the ToM condition compared to the Random condition (Figure 5—figure supplement 1C, left panel). No regions showed greater activations for Random condition compared to the ToM condition.

As we used shorter modified versions of the Frith-Happé animations (i.e. videos of 19.5 s instead of 40 s), we also validated our stimuli and our fMRI protocol by comparing the brain responses elicited by ToM versus Random animation videos in our group of 10 human subjects with those reported for the large group of humans (n=496) in the social cognition task of the Human Connectome Project (HCP; Barch et al., 2013), which also used shortened versions of the Frith-Happé animations.

This comparison is shown in Figure 3—figure supplement 1. Overall, we observed similar distinct patterns of brain activations (Figure 3—figure supplement 1A and B), including a set of areas in occipital, temporal, parietal and frontal cortices, as described previously (Figure 3C). The main differences were stronger activations in the left hemisphere in the HCP dataset. Therefore, these results show that our stimuli and our protocol are appropriate to investigate mental state attribution to animated moving shapes.

Functional brain activations while watching ToM and Random Frith-Happé’s animations in marmosets

Having identified the brain regions activated during the processing of ToM or Random videos in human subjects and validated our protocol, we proceeded to use the same stimuli in marmosets. Figure 4 illustrates the brain network obtained for the ToM condition (A), Random condition (B), and the contrast between ToM and Random conditions (C) in six marmosets.

Brain networks involved in processing of Frith-Happé’s ToM and Random animations in marmosets.

Group functional maps showing significantly greater activations for ToM condition (A), Random condition (B) and the comparison between ToM and Random conditions (C). Group map obtained from six marmosets displayed on lateral and medial views of the right fiducial marmoset cortical surfaces as well as dorsal and ventral views of left and right fiducial marmoset cortical surfaces. The white line delineates the regions based on the Paxinos parcellation of the NIH marmoset brain atlas (Liu et al., 2018). The brain areas reported have an activation threshold corresponding to z-scores >2.57 (yellow/red scale) or z-scores <–2.57 (purple/green scale) (AFNI’s 3dttest++, cluster-forming threshold of p<0.01 uncorrected and then FWE-corrected α=0.05 at cluster-level from 10,000 Monte-Carlo simulations).

Both the ToM (Figure 4A) and Random (Figure 4B) animations activated an extensive network involving a variety of areas in the occipito-temporal, parietal and frontal regions. As in human subjects, it should be noted that while both conditions elicited strong activation in some of the same larger areas, these activations might have either occurred in distinct voxels within those areas, or the same voxels were activated to varying degrees for both conditions. This suggests distinct yet overlapping patterns of neural processing for the ToM and Random conditions.

In the occipital and temporal cortex, the activations were located in the visual areas V1, V2, V3, V3A, V4, V4t, V5, V6, MST, the medial (19M) and dorsointermediate (19DI) parts of area 19, ventral temporal area TH, entorhinal cortex, and lateral and inferior temporal areas TE3 and TEO. Activations were also observed in the posterior parietal cortex, specifically in bilateral regions surrounding the intraparietal sulcus (IPS), in areas LIP, MIP, PE, PG, PFG, PF, V6A, PEC, in the occipito-parietal transitional area (OPt) and in the medial part of the parietal cortex (area PGM). More anteriorly, bilateral activations were present in areas 1/2, 3a, 3b of the somatosensory cortex, in primary motor area 4 parts a, b and c (area 4ab and 4c), in area 6 ventral part (6Va) of the premotor cortex and in frontal areas 45 and 8Av.

The ToM condition (Figure 4A) also recruited bilateral activations in areas V5, TE2, FST, PGa-IPa, temporoparietal transitional area (TPt), around the IPS in AIP and VIP, in the internal part (S2I), parietal rostral part (S2PR) and ventral part (S2PV) of the secondary somatosensory cortex, in agranular insular cortex (AI), granular and dysgranular insular areas (GI and DI), retroinsular area (ReI) and orbital periallocortex (OPAI), as well as in premotor cortex in area 8 caudal part (8C), in area 6 dorsocaudal and dorsorostral parts (6DC, 6DR). We also observed activations in posterior cingulate areas 23a, 23b, 29d, 30, 24d, and 24b.

Next, we examined the difference between ToM and Random animations (i.e. ToM condition > Random condition contrast, Figure 4C and Figure 5B). We found enhanced bilateral activations for the ToM condition across a range of regions. These encompassed occipital areas V1, V2, V3, V3A, V4, V4t, V5, V6, 19DI, 19M, temporal areas TH, TE2, TE3, FST, MST, TPt, and parietal areas LIP, MIP, VIP, AIP, PE, PG, PFG, OPt, V6A, PEC. Moreover, these activations extended to the somatosensory cortex (areas 1/2, 3a, 3b, S2I, S2PV), the primary motor cortex (areas 4ab and 4c), lateral frontal areas 6DC, 8C, 6Va, 8Av, 8Ad (left hemisphere), and insular areas (ReI, S2I, S2PV, DI, AI). Additional activations were observed in the OPAI area, medial frontal area 32 and posterior cingulate areas (23a, 23b, 29d, 30). In contrast, we did not find any regions exhibiting stronger activations for the Random condition compared to the ToM condition. This further emphasizes the distinctive neural recruitment and processing associated with ToM animations within the marmoset brain.

Figure 5 with 1 supplement
Brain network involved during processing of ToM compared to Random Frith-Happé’s animations in both humans (A) and marmosets (B).

Group functional maps showing significantly greater activations for ToM animations compared to Random animations. (A) Group map obtained from 10 human subjects displayed on the left and right human cortical flat maps. The white line delineates the regions based on the recent multi-modal cortical parcellation atlas (Glasser et al., 2016). (B) Group map obtained from 6 marmosets displayed on the left and right marmoset cortical flat maps. The white line delineates the regions based on the Paxinos parcellation of the NIH marmoset brain atlas (Liu et al., 2018). The brain areas reported in A and B have an activation threshold corresponding to z-scores >2.57 (yellow/red scale) or z-scores <–2.57 (purple/green scale) (AFNI’s 3dttest++, cluster-forming threshold of p<0.01 uncorrected and then FWE-corrected α=0.05 at cluster-level from 10,000 Monte-Carlo simulations).

At the subcortical level (see Figure 5—figure supplement 1A, right panel), the ToM condition showed involvement of several areas including the bilateral hippocampus, bilateral pulvinar (lateral, medial and inferior parts), bilateral amygdala, and left caudate. On the other hand, the Random condition recruited only the pulvinar (Figure 5—figure supplement 1B, right panel). Upon comparison of the ToM and Random conditions, the ToM animations showed stronger activations in the right superior colliculus (SC), right lateral geniculate nucleus (LGN), left caudate, left amygdala, left hippocampus and certain portions of the right and left pulvinar (lateral and inferior pulvinar; Figure 5—figure supplement 1C, right panel).

Comparison of functional brain activations in humans and marmosets

As described earlier, both humans (Figure 5A) and marmosets (Figure 5B) displayed an extended network of activations across the occipital, temporal, parietal, and frontal cortices in response to ToM animations compared to Random animations. Overall, there were substantial similarities between the two species, with both exhibiting enhanced activations for ToM animations compared to Random animations within visual areas, inferior and superior temporal areas, the inferior parietal lobe, and AIP area encircling the IPS in the superior parietal lobe. We also found parallel activations in the somatosensory cortex, although the activation was more widespread in marmosets compared to humans, where it was confined to the secondary somatosensory cortex. Additional similarities were identified in the premotor cortex and certain regions of the lateral prefrontal cortex. Overall, left and right hemisphere activations demonstrated greater congruity in marmosets compared to humans. However, this might be attributed to our human head coil, which had a lower signal-to-noise ratio (SNR) in the left hemisphere (see Figure 3—figure supplement 2). Indeed, similar bilateral activations in humans have been observed in the human HCP dataset (Barch et al., 2013; see Figure 3—figure supplement 1B).

Nevertheless, there were also discernible differences between the two species for ToM compared to Random animations, including stronger activations in medial frontal cortex, primary motor area and posterior cingulate cortex for marmosets, which were absent in our human sample. Moreover, different parts of the insular cortex were recruited in marmosets, whereas in humans, activations were limited to the parietal operculum and did not extend into the insula. At the subcortical level, although both humans and marmosets demonstrated activations in the amygdala, humans recruited the dorsal thalamus and the cerebellum, whereas marmosets displayed activations in the hippocampus, the SC and the LGN.

These results indicate that, while there were many shared brain activation patterns in both humans and marmosets during the processing of ToM animations compared to Random animations, several notable species-specific differences were also evident.

Discussion

In the present study, we investigated whether New-World common marmoset monkeys, like humans, process videos of animated abstract shapes differently when these shapes appear to be reacting to each other (ToM condition) compared to when they interact in a purely physical manner (GD condition) or when they move purposelessly (Random condition). To facilitate a direct comparative analysis between the two primate species, we measured their gaze patterns and brain activations as they viewed the widely-used Frith-Happé’s animations (Abell et al., 2000; Castelli et al., 2000). In these animations, the ToM condition is characterized by one triangle reacting to the other’s mental state, exemplifying behaviors like coaxing, surprising, seducing, and mocking. In the GD condition, the two triangles appear to engage purely physically without any implied mental attribution, depicting behaviors such as chasing, dancing, fighting, and leading. In the Random condition, the two triangles move independently, depicting motions akin to a game of billiards, drifting movements, a star pattern, or a tennis game. In all these animations, the physical interaction of the triangles does appear to follow the laws of physics in a reasonably predictable manner. This is probably most evident in the random ‘billiard’ condition in which the two triangles bounce off the walls. However, the ToM animations also follow Newton’s third law, for example when the small triangle is trying to get inside the box and bounces against it in the ‘seducing’ condition, or when the large triangle pushes the small triangle in the ‘coaxing’ condition.

In our first experiment, we examined the gaze patterns of marmosets and humans during the viewing of these video animations. Klein et al., 2009 reported differing fixation durations for these animations, where the longest fixations were observed for ToM animations, followed by GD animations and the shortest fixations for Random animations. They further reported that the intentionality score - derived from verbal descriptions of the animations - followed a similar pattern: highest for ToM, lowest for Random, and intermediate for GD animations. This validated the degree of mental state attribution according to the categories and established that animations provoking mentalizing (ToM condition) were associated with long fixations. This, in turn, supports the use of fixation durations as a nonverbal metric for mentalizing capacity (Klein et al., 2009; Meijering et al., 2012). Our results with human subjects, which demonstrated longer fixation durations for the ToM animations compared to the GD and Random animations, paralleled those of Klein et al., 2009. However, unlike Klein et al.’s findings, we did not observe intermediate durations for GD animations in our study.

Interestingly, our marmoset data did not align with the human findings but instead resonated more with Schafroth et al., 2021’s observations in macaque monkeys, which did not show significant differences in fixation durations across the three animation types.

Our study went a step further than previous research in humans (Klein et al., 2009) and macaques (Schafroth et al., 2021) by investigating the proportion of time that subjects devoted to looking at the two central figures in the animations: the large red triangle and the small blue triangle. Our results indicate that both humans and marmosets spent significantly more time looking at the large red triangle during ToM, compared to GD and Random animations. Humans also exhibited a preference for the red triangle in the GD over the Random condition, a differentiation not evident in marmosets. This result suggests that marmosets process these conditions similarly, indicating that, unlike humans, they do not seem to discern a marked difference between purposeless and goal-directed motions. However, they did show a distinctive gaze pattern in the ToM condition, pointing to their capacity to potentially perceive or react to animated sequences with complex mental interactions. Our findings revealed no significant differences in gaze patterns towards the smaller blue triangle across the three conditions in both humans and marmosets, potentially due to the perception of the large red triangle as a more salient or socially relevant figure in the interactions.

Together, the observed gaze patterns do not support the idea that marmosets increase their cognitive processing during ToM animations in the same way as humans. However, the findings point to a certain level of sophistication in the marmosets' perception of abstract ToM animations.

Thus, in our second experiment, we investigated the brain networks involved when viewing the ToM and Random Frith-Happé’s animations in humans and marmosets. Previous fMRI studies in humans identified a specific network associated with ToM processing in tasks such as stories, humorous cartoons, false-belief tasks, and social gambling. This network typically includes areas such as the medial frontal gyrus, posterior cingulate cortex, inferior parietal cortex, and temporoparietal junction (Fletcher et al., 1995; Gallagher et al., 2000). Nevertheless, these studies used complex stimuli and yielded heterogeneous results, with varied activations in regions such as the medial and lateral prefrontal cortex, inferior parietal lobule, occipital cortex, and insula across different studies (Carrington and Bailey, 2009). This variability is likely due to the diverse experimental paradigms employed to study ToM. Studies employing Frith-Happé’s animations, which are less complex and more controlled, reported distinct patterns of brain activation when viewing ToM compared to Random animations, involving areas such as the dorso-medial prefrontal cortex, inferior parietal cortex, temporoparietal, inferior and superior temporal regions, and lateral superior occipital regions (Barch et al., 2013; Bliksted et al., 2019; Castelli et al., 2000; Chen et al., 2023; Gobbini et al., 2007; Vandewouw et al., 2021; Weiss et al., 2021; Wheatley et al., 2007).

Our slightly adapted versions of the Frith-Happé animations led to a similar distinct pattern of brain activations, with the exception of a lack of activation in the dorsal part of the medial prefrontal cortex. This discrepancy could be attributable to various factors, including differences in task design, methodological aspects such as statistical power, or variations in participant characteristics. Otherwise, our results and those of the HCP data from Barch et al., 2013, revealed stronger activations for ToM versus Random animations in premotor, prefrontal, parietal, visual, and inferior and superior temporal cortices, including the STS and temporoparietal junction. Overall, the ToM network we identified, as well as that reported by Barch et al., 2013, appears to be more extensive than those described in studies employing more complex experimental paradigms to study ToM. This aligns with the recent meta-analysis conducted by Schurz et al., 2021, which demonstrated that the network activated by simpler, non-verbal stimuli like social animations differs from the traditional network, with involvement of both cognitive and affective networks (Schurz et al., 2021).

As in humans, the comparison of responses to ToM and Random animations in marmosets revealed activations in occipito-temporal, parietal, and frontal regions. Specifically, activations in the TE areas in marmosets could be equivalent to those observed along the STS in humans (Yovel and Freiwald, 2013). In both our human subjects and marmosets, we also observed activations in the inferior parietal cortex, as previously reported in the human literature (Carrington and Bailey, 2009; Fletcher et al., 1995; Gallagher et al., 2000). We also found similar activations in the superior parietal cortex in both our human and marmoset subjects, specifically in the area surrounding the IPS, but this has not been predominantly described in previous work (Carrington and Bailey, 2009; Chen et al., 2023; Fletcher et al., 1995; Gallagher et al., 2000; Gobbini et al., 2007). However, there are also noteworthy differences between our marmoset results and our human data. Firstly, while we did not observe activations in the medial prefrontal cortex in humans, they were present in marmosets, aligning with previous human fMRI studies (Bliksted et al., 2019; Castelli et al., 2000; Weiss et al., 2021; Wheatley et al., 2007). The marmoset network also included the posterior cingulate cortex and the insula, areas known to be involved in mentalizing and affective processing, respectively, in human ToM studies that employed more complex stimuli (Fletcher et al., 1995; Gallagher et al., 2000; Wheatley et al., 2007). Finally, a prominent difference between humans and marmosets is the strong activation in the marmoset motor cortex for ToM animations, which was absent in humans, in addition to the differences observed at the subcortical level. Interestingly, we have also recently reported activations in marmoset primary motor cortex during the observation of social interactions (Cléry et al., 2021), suggesting a potential role for the marmoset motor cortex in interaction observation.

Regarding the distinct subcortical activations observed in humans and marmosets, it’s important to consider the specific social cognitive demands that might be unique to each species. The involvement of the dorsal thalamus, cerebellum, and a small portion of the amygdala in humans may reflect the complexities of information processing, social cognition, and emotional involvement required to interpret the ToM animations (e.g. Halassa and Sherman, 2019; Janak and Tye, 2015; Van Overwalle et al., 2014). Conversely, the activation of the amygdala and hippocampus in marmosets could suggest a more emotion- and memory-based processing of the social stimuli (e.g. Eichenbaum, 2017; Van Overwalle et al., 2014). However, it’s critical to consider that these interpretations are speculative and would require further study for confirmation.

Together, these findings demonstrate that marmosets, while observing interacting animated shapes as opposed to randomly moving shapes, exhibit enhanced activation in several brain regions previously associated with ToM processing in humans.

Interestingly, our results differed from those obtained by Roumazeilles et al., 2021 in their fMRI study conducted in macaques using the same animations. Roumazeilles and colleagues reported no differences in activation between ToM and Random animations, suggesting that rhesus macaques may not respond to the social cues presented by the ToM Frith-Happé animations. This disparity between our marmoset findings and those of macaques raises intriguing questions about potential differences in the evolutionary development of ToM processing within non-human primates. Marmosets, as New World monkeys, are part of an evolutionary lineage that diverged earlier than the lineage of Old-World monkeys such as macaques. This difference in lineage might lead to distinct evolutionary trajectories in cognitive processing, which could include varying sensitivity to abstract social cues in animations.

In summary, our study reveals novel insights into how New World marmosets, akin to humans, differentially process abstract animations that depict complex social interactions and animations that display purely physical or random movements. Our findings, supported by both specific gaze behaviors (i.e. the proportion of time spent on the red triangle, despite the inconclusiveness of overall fixation) and distinct neural activation patterns, shed light on the marmosets' capacity to interpret social cues embedded in these animations.

The differences observed between humans, marmosets, and macaques underscore the diverse cognitive strategies that primate species have evolved to decipher social information. This diversity may be influenced by unique evolutionary pressures that arise from varying social structures and lifestyles. Like macaque monkeys, humans often live in large, hierarchically organized social groups where status influences access to resources. However, both humans and marmosets share a common trait: a high degree of cooperative care for offspring within the group, with individuals other than the biological parents participating in child-rearing. These distinctive social dynamics of marmosets and humans may have driven the development of unique social cognitive abilities. This could explain their enhanced sensitivity to abstract social cues in the Frith-Happé animations.

Nonetheless, it is crucial to emphasize that even though marmosets respond to the social cues in the Frith-Happé animations, this does not automatically imply that they possess mental-state attributions comparable to humans. As such, future research including a range of tasks, from sensory-affective components to more abstract and decoupled representations of others' mental states (Schurz et al., 2021), will be fundamental in further unravelling the complexities of the evolution and functioning of the Theory of Mind across the primate lineage.

Materials and methods

Common marmosets

All experimental procedures were in accordance with the Canadian Council of Animal Care policy and a protocol approved by the Animal Care Committee of the University of Western Ontario Council on Animal Care #2021–111.

Eleven adult marmosets (4 females, 32–57 months, mean age: 36.6 months) were subjects in this study. All animals were implanted for head-fixed experiments with either a fixation chamber (Johnston et al., 2018) or a head post (Gilbert et al., 2023) under anesthesia and aseptic conditions. Briefly, the animals were placed in a stereotactic frame (Narishige, model SR-6C-HT) while being maintained under gas anaesthesia with a mixture of O2 and air (isoflurane 0.5–3%). After a midline incision of the skin along the skull, the skull surface was prepared by applying two coats of an adhesive resin (All-Bond Universal; Bisco, Schaumburg, IL) using a microbrush, air-dried, and cured with an ultraviolet dental curing light (King Dental). Then, the head post or fixation chamber was positioned on the skull and maintained in place using a resin composite (Core-Flo DC Lite; Bisco). Heart rate, oxygen saturation, and body temperature were continuously monitored during this procedure.

Six of these animals (four females - weight 315–442 g, age 30–34 months - and two males - weight 374–425 g, age 30 and 55 months) were implanted with an MRI-compatible machined PEEK head post (Gilbert et al., 2023). Two weeks after the surgery, these marmosets were acclimatized to the head-fixation system in a mock MRI environment.

Human participants

Eleven healthy humans (4 females, 25–42 years, mean age: 30.7 years) participated in the eye tracking experiment. Among these, five individuals, along with five additional subjects (4 females, 26–45 years), took part in the fMRI experiment. All subjects self-reported as right-handed, had normal or corrected-to-normal vision and had no history of neurological or psychiatric disorders. Importantly, all subjects confirmed they had not previously been exposed to the Frith-Happé animation videos used in our study. Subjects were informed about the experimental procedures and provided informed written consent. These studies were approved by the Ethics Committee of the University of Western Ontario.

Stimuli

Eight animations featuring simple geometric shapes with distinct movement patterns were used (Figure 1). These animations, originally developed by Abell and colleagues (Abell et al., 2000), presented two animated triangles - a large red triangle and a small blue triangle - moving within a framed white background. The original social animation task included three conditions: ToM, Goal-Directed (GD), and Random. In the ToM animations, one triangle displayed behaviors indicative of mental interactions by reacting to the mental state of the other triangle. The GD animations depicted simple interactions between the two triangles, while the Random animations showed the triangles moving and bouncing independently.

The ToM animations portrayed various scenarios, such as one triangle attempting to seduce the other (Video 1), mocking it behind its back (Video 2), surprising it by hiding behind a door (Video 3), or coaxing it out of an enclosure (Video 4). In the GD animations, the triangles could dance together (Video 5), fight together (Video 6), or one triangle could chase (Video 7) or lead the other (Video 8). The Random animations featured independent movements of the triangles, following patterns such as billiard (Video 9), drifting (Video 10), star (Video 11), or tennis (Video 12). Similar to the approach used in the HCP study (Barch et al., 2013), we modified the original video clips and shortened each animation to 19.5 s using custom video-editing software (iMovie, Apple Incorporated, CA).

Video 1
'Theory of Mind (ToM)' Category, Frith-Happé Animations – Seducing Simulation.
Video 2
'Theory of Mind (ToM)' Category, Frith-Happé Animations – Mocking Simulation.
Video 3
'Theory of Mind (ToM)' Category, Frith-Happé Animations – Surprise Simulation.
Video 4
'Theory of Mind (ToM)' Category, Frith-Happé Animations – Coaxing Simulation.
Video 5
'Goal-Directed (GD)' Category, Frith-Happé Animations – Dancing Simulation.
Video 6
'Goal-Directed (GD)' Category, Frith-Happé Animations – Fighting Simulation.
Video 7
'Goal-Directed (GD)' Category, Frith-Happé Animations – Chase Simulation.
Video 8
'Goal-Directed (GD)' Category, Frith-Happé Animations – Leading Simulation.
Video 9
'Random' Category, Frith-Happé Animations – Billiard Simulation.
Video 10
'Random' Category, Frith-Happé Animations – Drifting Simulation.
Video 11
'Random' Category, Frith-Happé Animations – Star Simulation.
Video 12
'Random' Category, Frith-Happé Animations – Tennis Simulation.

Eye tracking task and data acquisition

To investigate potential behavioral differences during the viewing of Frith-Happé animations, we presented all ToM, GD and Random video clips once each in a pseudorandomized manner to both marmoset and human subjects. The presentation of stimuli was controlled using Monkeylogic software (Hwang et al., 2019). All stimuli were presented on a CRT monitor (ViewSonic Optiquest Q115, 76 Hz non-interlaced, 1600 x 1280 resolution). Eye position was digitally recorded at 1 kHz via video tracking of the left pupil (EyeLink 1000, SR Research, Ottawa, ON, Canada).

At the beginning of each session, horizontal and vertical eye positions of the left eye were calibrated by presenting a 1 degree dot at the display centre and at 6 degrees in each of the cardinal directions for 300–600 ms. Monkeys were rewarded at the beginning and end of each session. Crucially, no rewards were provided during the calibration or while the videos were played.

fMRI task

For the fMRI experiment, it was crucial for us to ensure that the subjects remained alert and focused throughout the entire scanning session, which becomes increasingly difficult with longer runs. Therefore, we used only the ToM and Random conditions in our functional runs, as the GD condition is situated between these two extremes, depicting physical interaction among the triangles without suggesting any mental state attribution. The limitation to ToM and Random conditions is consistent with the design of previous fMRI studies in humans and macaques that employed Frith-Happé animations (Gobbini et al., 2007; Barch et al., 2013; Bliksted et al., 2019; Vandewouw et al., 2021; Weiss et al., 2021; Chen et al., 2023; Roumazeilles et al., 2021).

Humans and marmosets were presented with ToM and Random video clips in a block design. Each run consisted of eight blocks of stimuli (19.5 s each) interleaved by baseline blocks (15 s each). ToM or Random animations were presented pseudorandomly, and each condition was repeated four times (Figure 1). For each run, the order of these conditions was randomized leading to 14 different stimulus sets, counterbalanced within and between subjects. In baseline blocks, a 0.36° circular black cue was displayed at the center of the screen against a gray background. We found previously that such a stimulus reduced the vestibulo-ocular reflex evoked by the strong magnetic field.
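
The block timing described above can be sketched in Python as follows. This is a minimal illustration and not the authors' presentation code: it assumes that every run starts and ends with a baseline block (the text does not state this), it uses a plain shuffle in place of the counterbalanced stimulus sets, and the names `build_schedule`, `STIM_S`, `BASE_S` and `TR_S` are hypothetical.

```python
import random

STIM_S = 19.5   # duration of each animation block (from the text)
BASE_S = 15.0   # duration of each baseline block (from the text)
TR_S = 1.5      # repetition time of the functional scans (from the text)


def build_schedule(seed=0):
    """Return a list of (onset_s, duration_s, label) tuples for one run.

    Assumption: the run begins and ends with a baseline block.
    """
    rng = random.Random(seed)
    conditions = ["ToM"] * 4 + ["Random"] * 4  # each condition repeated four times
    rng.shuffle(conditions)  # plain shuffle standing in for the pseudorandom order
    schedule, t = [], 0.0
    for label in conditions:
        schedule.append((t, BASE_S, "Baseline"))
        t += BASE_S
        schedule.append((t, STIM_S, label))
        t += STIM_S
    schedule.append((t, BASE_S, "Baseline"))  # closing baseline (assumed)
    return schedule


schedule = build_schedule()
run_length_s = schedule[-1][0] + schedule[-1][1]
n_volumes = run_length_s / TR_S
print(run_length_s, n_volumes)
```

Under these assumptions a run would last 291 s, or 194 volumes at TR = 1.5 s, with each animation block spanning 13 volumes and each baseline block 10 volumes. The actual run length depends on how the runs were started and ended.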

fMRI experimental setup

During the scanning sessions, the marmosets sat in a sphinx position in a custom-designed plastic chair positioned within a horizontal magnet (see below). Their head was restrained using a head-fixation system that secured the surgically implanted head post to a clamping bar (Gilbert et al., 2023). After the head was immobilized, the two halves of the coil housing were positioned on either side of the head. Inside the scanner, monkeys faced a translucent screen placed 119 cm from their eyes, onto which visual stimuli were projected with an LCD projector (Model VLP-FE40, Sony Corporation, Tokyo, Japan) via a back-reflection on a first surface mirror. Visual stimuli were presented with the Keynote software (version 12.0, Apple Incorporated, CA) and were synchronized with MRI TTL pulses triggered by a Raspberry Pi (model 3B+, Raspberry Pi Foundation, Cambridge, UK) running a custom-written Python program. No reward was provided to the monkeys during the scanning sessions. Animals were monitored using an MRI-compatible camera (Model 12M-I, MRC Systems GmbH). Horizontal and vertical eye movements were monitored at 60 Hz using a video eye tracker (ISCAN, Boston, Massachusetts). While we were able to obtain relatively stable eye movement recordings from a few runs per animal (min 1, max 5 runs per animal), the quality of the recordings was not sufficient for a thorough analysis. The large marmoset pupil represents a challenge for video eye tracking when the eyes are not fully open. Data from functional runs with more stable eye signals (n=15) show good compliance in the marmosets. The percentage of time spent in each run looking at the screen in the two experimental conditions (ToM, Random) and during the Baseline periods (fixation point in the center of the screen) was higher than 85% (88.2%, 88.6% and 93.4% for the ToM, Random and Baseline conditions, respectively). There was no significant difference between the ToM and Random conditions (paired t-test, t(14)=-0.374, p=0.71), ruling out the possibility that any differences in fMRI activation between the ToM and Random conditions were simply due to a different exposure to the videos.
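
For readers who want to see how a paired comparison of this kind is computed, the following Python sketch derives the paired t statistic and its degrees of freedom from per-run looking percentages. The numbers are made up for illustration and are not the study's data, and the function name `paired_t` is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for two equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical per-run looking percentages (NOT the study's data).
tom = [88.0, 90.5, 85.0, 91.0, 87.5]
rnd = [89.0, 89.5, 86.0, 90.0, 88.5]
t_stat, df = paired_t(tom, rnd)
print(round(t_stat, 3), df)
```

With 15 runs entering the comparison, the degrees of freedom would be 14, which matches the t(14) reported above.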

Human subjects lay in a supine position and watched the stimuli presented via a rear projection system (Avotec SV-6011, Avotec Incorporated) through a surface mirror affixed to the head coil. As for marmosets, visual stimuli were presented with the Keynote software (version 12.0, Apple Incorporated, CA) and were synchronized with MRI TTL pulses triggered by a Raspberry Pi (model 3B+, Raspberry Pi Foundation, Cambridge, UK) running a custom-written Python program.

MRI data acquisition

Marmoset and human imaging were performed at the Centre for Functional and Metabolic Mapping at the University of Western Ontario.

For marmoset subjects, fMRI data were acquired on a 9.4T 31 cm horizontal bore magnet (Varian) with a Bruker BioSpec Avance III HD console running software package Paravision-360 (Bruker BioSpin Corp), a custom-built high-performance 15 cm diameter gradient coil (maximum gradient strength: 1.5 mT/m/A), and an eight-channel receive coil. Preamplifiers were located behind the animals, and the receive coil was placed inside an in-house built quadrature birdcage coil (12 cm inner diameter) used for transmission. Functional images were acquired during 6 functional runs for each animal using a gradient-echo based single-shot echo-planar imaging (EPI) sequence with the following parameters: TR = 1.5 s, TE = 15 ms, flip angle = 40°, field of view = 64 × 48 mm, matrix size = 96 × 128, resolution of 0.5 mm isotropic, number of slices = 42 [axial], bandwidth = 400 kHz, GRAPPA acceleration factor: 2 (left-right). Another set of EPIs with an opposite phase-encoding direction (right-left) was collected for the EPI-distortion correction. A T2-weighted structural image was also acquired for each animal during one of the sessions with the following parameters: TR = 7 s, TE = 52 ms, field of view = 51.2 × 51.2 mm, resolution of 0.133 × 0.133 × 0.5 mm, number of slices = 45 [axial], bandwidth = 50 kHz, GRAPPA acceleration factor: 2.

For human subjects, fMRI data were acquired on a 7T 68 cm MRI scanner (Siemens Magnetom 7T MRI Plus) with an AC-84 Mark II gradient coil, an in-house 8-channel parallel transmit, and a 32-channel receive coil (Gilbert et al., 2021). Functional images were acquired during 3 functional runs for each participant using Multi-Band EPI BOLD sequences with the following parameters: TR = 1.5 s, TE = 20 ms, flip angle = 30°, field of view = 208 × 208 mm, matrix size = 104 × 104, resolution of 2 mm isotropic, number of slices = 62, GRAPPA acceleration factor: 3 (anterior-posterior), multi-band acceleration factor: 2. Field map images were also computed from the magnitude image and the two phase images. An MP2RAGE structural image was also acquired for each subject during the sessions with the following parameters: TR = 6 s, TE = 2.13 ms, TI1/TI2 = 800/2700 ms, field of view = 240 × 240 mm, matrix size = 320 × 320, resolution of 0.75 mm isotropic, number of slices = 45, GRAPPA acceleration factor (anterior-posterior): 3.

MRI data preprocessing

Marmoset fMRI data were preprocessed using the AFNI (Cox, 1996) and FSL (Smith et al., 2004) software packages. Raw MRI images were first converted to NIfTI format using the dcm2niix function distributed with AFNI and then reoriented to the sphinx position using FSL's fslswapdim and fslorient functions. Functional images were despiked using AFNI's 3dDespike function and time shifted using AFNI's 3dTshift function. Then, the images obtained were registered to the base volume (i.e., the middle volume of each time series) with AFNI's 3dvolreg function. The output motion parameters obtained from volume registration were later used as nuisance regressors. All fMRI images were spatially smoothed with a 1.5 mm full width at half-maximum (FWHM) Gaussian kernel with AFNI's 3dmerge function, followed by temporal filtering (0.01–0.1 Hz) using AFNI's 3dBandpass function. The mean functional image was calculated for each run and linearly registered to the respective anatomical image of each animal using FMRIB's linear registration tool (FLIRT).

The transformation matrix obtained after the registration was then used to transform the 4D time series data. The brain was manually skull-stripped from individual anatomical images using the FSLeyes tool and the mask of each animal was applied to the functional images. Finally, the individual anatomical images were linearly registered to the NIH marmoset brain template (Liu et al., 2018) using Advanced Normalization Tools (ANTs).

Human fMRI data were preprocessed using SPM12 (Wellcome Department of Cognitive Neurology). After converting raw images into NIfTI format, functional images were realigned to correct for head movements and underwent slice timing correction. A field map correction was applied to the functional images from the magnitude and phase images with the specify toolbox implemented in SPM. Then, the corrected anatomical and functional volumes were coregistered with the MP2RAGE structural scan from each individual participant and normalized to the Montreal Neurological Institute (MNI) standard brain space. Anatomical images were segmented into white matter, gray matter, and CSF partitions and also normalized to the MNI space. The functional images were then spatially smoothed with a 6 mm FWHM isotropic Gaussian kernel. A high-pass filter (128 s) was also applied to the time series.
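
As a brief aside on the smoothing kernels reported for the two species, the Python sketch below converts each FWHM to the standard deviation of the corresponding Gaussian and expresses it in voxel units. It takes the reported isotropic resolutions (0.5 mm for marmosets, 2 mm for humans) as voxel edge lengths, and the function name `fwhm_to_sigma` is hypothetical.

```python
import math


def fwhm_to_sigma(fwhm):
    """Convert a Gaussian full width at half maximum to its standard deviation."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


# Kernel widths and voxel sizes as reported in the text.
kernels = {
    "marmoset": {"fwhm_mm": 1.5, "voxel_mm": 0.5},
    "human": {"fwhm_mm": 6.0, "voxel_mm": 2.0},
}

for species, k in kernels.items():
    sigma_mm = fwhm_to_sigma(k["fwhm_mm"])
    fwhm_vox = k["fwhm_mm"] / k["voxel_mm"]
    print(f"{species}: sigma = {sigma_mm:.3f} mm, FWHM = {fwhm_vox:.1f} voxels")
```

Read this way, both kernels span three voxels at FWHM, so the amount of smoothing relative to voxel size is the same in the two species even though the absolute widths differ fourfold.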

Statistical analysis

Behavioral eye tracking data

To evaluate gaze patterns during observation of ToM and Random videos, we used mixed analyses of variance (ANOVA), with factors of species (Human vs Marmoset) and condition (ToM vs Random videos) on the overall fixation duration and on the proportion of time when the radial distance between the subject’s gaze position and each triangle was less than 4 degrees. Partial eta squared (ηp2) was computed as a measure of effect size and post-hoc comparisons were Bonferroni corrected.
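
The proximity measure described above can be illustrated with a short Python sketch. It is not the authors' analysis script: it assumes that gaze and triangle positions are available sample by sample in degrees of visual angle, and that samples with lost tracking are excluded from the denominator. The function name `proportion_near` and the toy positions are hypothetical.

```python
import math


def proportion_near(gaze, target, radius_deg=4.0):
    """Fraction of valid samples whose radial distance to the target is below radius_deg.

    gaze, target: equal-length sequences of (x, y) positions in degrees.
    A gaze sample of None marks lost tracking (e.g. a blink) and is skipped,
    which is an assumption made for this sketch.
    """
    if len(gaze) != len(target):
        raise ValueError("gaze and target must have the same length")
    near = valid = 0
    for g, t in zip(gaze, target):
        if g is None:
            continue
        valid += 1
        if math.hypot(g[0] - t[0], g[1] - t[1]) < radius_deg:
            near += 1
    return near / valid if valid else float("nan")


# Toy example with made-up positions (degrees of visual angle).
gaze = [(0.0, 0.0), (1.0, 1.0), None, (10.0, 0.0), (3.0, 0.0)]
triangle = [(0.0, 3.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, -3.0)]
print(proportion_near(gaze, triangle))
```

In this toy example two of the four valid samples fall within 4 degrees of the triangle, giving a proportion of 0.5. The same function would be applied separately to the red and the blue triangle.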

fMRI data

For each run, a general linear regression model was defined: the task timing was convolved with the hemodynamic response (AFNI's 'BLOCK' convolution for the marmoset data and the SPM12 hemodynamic response function for the human data) and a regressor was generated for each condition (AFNI's 3dDeconvolve function for marmosets and SPM12 for humans). The two conditions, each corresponding to the 19.5 s presentation of the stimuli, were entered into the same model along with polynomial detrending regressors and the marmosets' motion parameters or the humans' head movement parameters estimated during realignment.
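
To make the idea of a block regressor concrete, the Python sketch below convolves a boxcar of 19.5 s blocks with a gamma-shaped response and samples the result once per TR. It is only an illustration: the kernel (t^4·exp(−t), scaled to peak at 1) is a generic choice made for this sketch and is not claimed to reproduce the AFNI or SPM12 response functions used in the study, and the onsets and the names `gamma_kernel` and `block_regressor` are hypothetical.

```python
import math

TR_S = 1.5      # repetition time (from the text)
BLOCK_S = 19.5  # stimulus block duration (from the text)


def gamma_kernel(t):
    """Illustrative gamma-shaped response, t^4 * exp(-t), scaled to peak at 1 (t = 4 s)."""
    if t <= 0.0:
        return 0.0
    return t ** 4 * math.exp(-t) / (4.0 ** 4 * math.exp(-4.0))


def block_regressor(onsets_s, n_volumes, dt=0.1, kernel_len_s=20.0):
    """Convolve a boxcar (1 during blocks, 0 elsewhere) with the kernel; sample once per TR."""
    n_fine = int(round(n_volumes * TR_S / dt))
    boxcar = [0.0] * n_fine
    for onset in onsets_s:
        start = int(round(onset / dt))
        stop = min(int(round((onset + BLOCK_S) / dt)), n_fine)
        for i in range(start, stop):
            boxcar[i] = 1.0
    kernel = [gamma_kernel(k * dt) for k in range(int(round(kernel_len_s / dt)))]
    fine = [0.0] * n_fine
    for i, b in enumerate(boxcar):
        if b == 0.0:
            continue
        for k, h in enumerate(kernel):
            if i + k >= n_fine:
                break
            fine[i + k] += h * dt  # discrete approximation of the convolution integral
    step = int(round(TR_S / dt))
    return [fine[v * step] for v in range(n_volumes)]


# Hypothetical onset (seconds) for one condition in a short 60-volume run.
tom_regressor = block_regressor(onsets_s=[15.0], n_volumes=60)
print([round(v, 2) for v in tom_regressor[8:24]])
```

One regressor of this kind would be built per condition and entered into the model together with the nuisance regressors. The fitted coefficient for each regressor then estimates the response amplitude for that condition.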

The resultant regression coefficient maps of marmosets were then registered to template space using the transformation matrices obtained from the registration of the anatomical images to the template (see the MRI data preprocessing section above).

Finally, we obtained for each run in marmosets and humans, two T-value maps registered to the NIH marmoset brain atlas (Liu et al., 2018) and to the MNI brain standard space, respectively.

These maps were then compared at the group level via paired t-tests using AFNI's 3dttest++ function, resulting in Z-value maps. To protect against false positives and to control for multiple comparisons, we applied a clustering method derived from 10,000 Monte Carlo simulations to the resultant Z-value maps using the ClustSim option (α=0.05). This method corresponds to applying a cluster-forming threshold of p<0.01 uncorrected, followed by a family-wise error (FWE) correction of p<0.05 at the cluster level.
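
The two-step logic of cluster-level correction (a voxelwise cluster-forming threshold, then a minimum cluster size) can be sketched in Python as follows. This is a toy illustration and not the AFNI implementation: in the study the minimum cluster size comes from the Monte Carlo simulations, whereas `min_size` here is a hypothetical placeholder. The sketch also assumes a two-sided conversion from p to Z, joins voxels only when they share a face, and handles positive clusters only.

```python
from collections import deque
from statistics import NormalDist

# Face-sharing neighbours (an assumption of this sketch).
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def z_threshold(p, two_sided=True):
    """Z value corresponding to an uncorrected p-value threshold."""
    tail = p / 2.0 if two_sided else p
    return NormalDist().inv_cdf(1.0 - tail)


def surviving_clusters(zmap, z_thr, min_size):
    """Return clusters (lists of (x, y, z) voxels) with Z > z_thr and at least min_size voxels.

    zmap is a nested list indexed as zmap[x][y][z].
    """
    nx, ny, nz = len(zmap), len(zmap[0]), len(zmap[0][0])
    seen, clusters = set(), []
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                if (x, y, z) in seen or zmap[x][y][z] <= z_thr:
                    continue
                queue, cluster = deque([(x, y, z)]), []
                seen.add((x, y, z))
                while queue:  # breadth-first flood fill over suprathreshold voxels
                    cx, cy, cz = queue.popleft()
                    cluster.append((cx, cy, cz))
                    for dx, dy, dz in NEIGHBOURS:
                        n = (cx + dx, cy + dy, cz + dz)
                        if (0 <= n[0] < nx and 0 <= n[1] < ny and 0 <= n[2] < nz
                                and n not in seen and zmap[n[0]][n[1]][n[2]] > z_thr):
                            seen.add(n)
                            queue.append(n)
                if len(cluster) >= min_size:
                    clusters.append(cluster)
    return clusters


# Toy 4 x 4 x 1 map: one 3-voxel cluster and one isolated voxel above threshold.
toy = [[[0.0] for _ in range(4)] for _ in range(4)]
for x, y in ((0, 0), (0, 1), (1, 1), (3, 3)):
    toy[x][y][0] = 3.0

thr = z_threshold(0.01)
kept = surviving_clusters(toy, thr, min_size=2)
print(round(thr, 3), [len(c) for c in kept])
```

In the toy map the isolated voxel exceeds the voxelwise threshold but is discarded because it does not reach the minimum cluster size, which is the behaviour that cluster-level correction relies on.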

We used the Paxinos parcellation of the NIH marmoset brain atlas (Liu et al., 2018) and the most recent multi-modal cortical parcellation atlas (Glasser et al., 2016) to define anatomical locations of cortical and subcortical regions for marmosets and humans, respectively.

First, we identified brain regions involved in the processing of ToM and Random animations by contrasting each condition with a baseline (i.e. ToM condition > baseline and Random condition > baseline contrasts). This baseline brain activation, recorded during the presentation of the circular black cue between video clips (i.e. baseline blocks of 15 s, see above), reflects 'resting state' activation. By comparing it to the brain activation during ToM and Random animations, we could specifically highlight the task-related activations and isolate brain regions engaged during each condition. We then determined the clusters that displayed significantly greater activation for the ToM animations compared to the Random animations (ToM condition > Random condition contrast), and vice versa. The resultant Z-value maps were displayed on fiducial maps obtained from the Connectome Workbench (v1.5.0 [Marcus et al., 2011]) using the NIH marmoset brain template (Liu et al., 2018) for marmosets and the MNI Glasser brain template (Glasser et al., 2016) for humans. Subcortical activations were displayed on coronal sections.

As we used shortened video clips (i.e. 19.5 s compared to the 40 s originally designed by Abell et al., 2000), we validated our fMRI protocol by confirming that our shorter videos elicited similar responses to those previously observed in the HCP (Barch et al., 2013), which also used modified versions of these animation videos. We compared our ToM vs Random Z-value map obtained in human subjects with that of the HCP (Barch et al., 2013). To this end, we downloaded the Z-value map of activations for ToM animations compared to Random animations from 496 subjects from the NeuroVault site (https://identifiers.org/neurovault.image:3179). We displayed the resultant Z-value maps on fiducial maps obtained from the Connectome Workbench (v1.5.0, [Marcus et al., 2011]) using the MNI Glasser brain template (Glasser et al., 2016).

Data availability

All fMRI and eye tracking data generated and analysed, as well as the scripts used, have been deposited on GitHub: https://github.com/audreydureux/Theory-of-mind_Human_Marmosets_Paper (copy archived at Dureux, 2023).

Decision letter

  1. Muireann Irish
    Reviewing Editor; University of Sydney, Australia
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Gaze patterns and brain activations in humans and marmosets in the Frith-Happé theory-of-mind animation task" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1. In the abstract, the claim that there is no evidence that nonhuman primates attribute mental states to moving shapes is false. You even cite some of this positive evidence (e.g., Uller, 2004; Atsumi et al., 2015; 2017). There is also evidence that they don't (Kupferberg et al., 2013; Burkart et al., 2012; Schafroth et al., 2021). The abstract would be stronger if written to represent the state of the field more accurately.

2. The overall conclusion as stated in the abstract, at the end of the introduction, and in the discussion is not warranted by the evidence. Indeed, the abstract completely fails to mention that the marmosets failed to show the human-like pattern of longer fixations on the ToM videos. Many readers will likely interpret this evidence as primarily against the idea that marmosets view the ToM videos in a human-like way, or as equivocal evidence at best. This report will be a stronger piece of science if it accurately describes the results.

3. The authors need to explicitly mention the rationale for omitting the original Goal-Directed condition from the Frith-Happé task. We cannot necessarily conclude that the marmosets are engaged in mental state attribution on the basis of these brain activation patterns – it could reflect the processing of distinct biological movements or the unfolding of an event narrative. If the authors proceed in pushing this data without the Goal-Directed videos, they must address their rationale for not testing these videos.

4. Were there any differences between the different types of videos used? For instance, was there any difference between videos in which the interaction between the two shapes is more obvious (e.g. coaxing vs seducing)?

5. Did the authors also present social videos to their animals? If so did they also observe additional recruitment of IPa and TPO areas like Clery and colleagues for social videos compared to Frith and Happe's videos (Cléry et al., 2021)?

6. Humans and marmosets recruit a distinct set of subcortical structures during the viewing of video clips; for instance dorsal thalamus and cerebellum in humans, the hippocampus and amygdala in marmosets. How do the authors interpret this difference?

7. Unlike marmosets, rhesus macaques are not sensitive to the type of the Frith and Happe social illusion (Roumazeilles et al., 2021; Schafroth et al., 2021). The authors might want to discuss the singularity of the marmosets from an evolutionary perspective.

8. The justification for looking in marmosets could be read to imply that macaque monkeys do not live in family groups or share important social similarities with humans. Both species share many social similarities (and many social differences) with humans. Marmosets are a good species to study; this section would benefit from a more accurate rationale.

9. Because it is one of the main metrics in the Klein and Schafroth papers, and thus readers will want to see it for sake of comparison, the authors should include a figure showing the overall fixation durations as a function of category and species.

10. The results about looking time to the large triangle need to follow up on the interaction between species and conditions so that readers know how to interpret it.

11. Are the bars in Figure 2 meant to add up to 1 for any given participant? If you analyzed the total time fixating on either shape, would marmosets be spending less time looking at the shapes overall than humans?

12. Readers will likely want clarification in cases where the same area showed stronger activation for ToM videos AND Random videos. I assume it was in different voxels in the same larger area, but this could be explicit.

13. The claim that these maps represent "dedicated brain networks" for ToM or Random videos (line 188) is too strong. These brain areas are used for many things.

14. For many of the sentences in the imaging results, the comparison needs to be made explicit. For example Line 193 – higher bilateral activation than what? Line 196 – greater activations than what? Line 202 – a larger network than what? Etc.

15. The description of Klein et al., (2009) on Lines 289-293 might be read to imply that they were attributing mentalizing without good reason. Klein also collected intentionality scores, which correlated with the viewing metric. This could be rephrased to be more accurate.

16. The inclusion of the authors as subjects is odd. Some readers will view it as a big red flag. The authors clearly know their own hypothesis and likely have a vested interest in a particular outcome. For the strongest report, the authors should remove their own data. At the very least, the authors need to demonstrate that the inclusion/exclusion of their unblinded data doesn't affect the interpretation of the human results.

17. The method should state whether the subjects had experienced these animations before (e.g., they're shown in some psychology and neuroscience classes).

18. The description of the monkey reward contingencies needs to be clearer about whether the monkeys were rewarded only during calibration or during videos as well, and whether any reward during videos was contingent on keeping their eyes on the screen.

19. Because this is a social task when the scans were normalized to MNI space, did the authors divide the human participants into those with and without a paracingulate sulcus?

20. The authors need to better specify what counts as a "baseline" for the fMRI comparisons. They should also briefly justify why this is an informative comparison.

Reviewer #1 (Recommendations for the authors):

I very much enjoyed reading this manuscript and believe it has the potential to make an important contribution to animal literature as well as social cognition more broadly.

As highlighted in the public review, my main query is in relation to the omission of the Goal-Directed condition of the Frith-Happé task. While the evidence presented here certainly suggests that marmosets process ToM animations in a different manner than Random animations, we are somewhat constrained in what we can interpret from these findings. As the authors note, we cannot necessarily conclude that the marmosets are engaged in mental state attribution on the basis of these brain activation patterns.

A more compelling argument would stem from the inclusion of the Goal-Directed condition in which the triangles arguably do interact but in a purely physical manner, i.e., there is no mental state attribution. I was surprised that this condition was not included as its omission somewhat limits the extent to which any conclusion regarding ToM can be drawn. Could the activations observed in the ToM condition reflect the processing of an event or narrative as it unfolds, rather than the cycling of random movements in the Random condition? I ask this question as previous studies using the Frith-Happé animations in dementia populations note that the mental state attribution judgements on ToM trials were conferred only at the end of the video (i.e., once the overall event narrative had been seen) whereas patients were adept at conferring a judgment of "no interaction" early during the viewing of Random animations (Ref: Synn et al. 2018 J. Alz Dis).

I wonder whether this interpretation might also reflect the curious finding of stronger medial PFC activation in Random trials versus ToM trials in humans, and no clear mPFC activation in the ToM trials. This seems very much at odds with the wider literature on the brain regions necessary for ToM, which often place the medial PFC at the heart of the social brain.

I very much appreciated the check using the independent HCP dataset. This was a very nice inclusion to ensure that the shortened version corresponded well with previous reports.

Reviewer #2 (Recommendations for the authors):

In their study, Dureux and colleagues are investigating the sensitivity of a highly social non-human primate species, the marmoset, to social illusion using the Frith and Happe task. Although this task is often considered a non-verbal TOM task, its relevance to investigate TOM has been disputed. For instance, the Frith and Happe task does not recruit in humans a similar network as other false-belief tasks and social gambling tasks (Schurz et al., 2020). While the authors might want to revise, or at least discuss their use of the TOM concept further, their results clearly show that marmosets distinguish the two types of videos shown to them.

Were there any differences between the different types of videos used? For instance, was there any difference between videos in which the interaction between the two shapes is more obvious (e.g. coaxing vs seducing)?

Did the authors also present social videos to their animals? If so did they also observe additional recruitment of IPa and TPO areas like Clery and colleagues for social videos compared to Frith and Happe's videos (Cléry et al., 2021)?

Humans and marmosets recruit a distinct set of subcortical structures during the viewing of video clips; for instance dorsal thalamus and cerebellum in humans, and the hippocampus and amygdala in marmosets. How do the authors interpret this difference?

Unlike marmosets, rhesus macaques are not sensitive to the type of the Frith and Happe social illusion (Roumazeilles et al., 2021; Schafroth et al., 2021). The authors might want to discuss the singularity of the marmosets from an evolutionary perspective.

Refs:

Cléry JC, Hori Y, Schaeffer DJ, Menon RS, Everling S. 2021. Neural network of social interaction observation in marmosets. eLife 10:e65012. doi:10.7554/eLife.65012

Roumazeilles L, Schurz M, Lojkiewiez M, Verhagen L, Schüffelgen U, Marche K, Mahmoodi A, Emberton A, Simpson K, Joly O, Khamassi M, Rushworth MFS, Mars RB, Sallet J. 2021. Social prediction modulates activity of macaque superior temporal cortex (preprint). Neuroscience. doi:10.1101/2021.01.22.427803

Schafroth JL, Basile BM, Martin A, Murray EA. 2021. No evidence that monkeys attribute mental states to animated shapes in the Heider-Simmel videos. Sci Rep 11:3050. doi:10.1038/s41598-021-82702-6

Schurz M, Radua J, Tholen MG, Maliske L, Margulies DS, Mars RB, Sallet J, Kanske P. 2020. Toward a hierarchical model of social cognition: A neuroimaging meta-analysis and integrative review of empathy and theory of mind. Psychological Bulletin. doi:10.1037/bul0000303

Reviewer #3 (Recommendations for the authors):

This study is strong in many ways, and the goal is a good one. The below recommendations will help strengthen it further:

In the abstract, the claim that there is no evidence that nonhuman primates attribute mental states to moving shapes is false. You even cite some of this positive evidence (e.g., Uller, 2004; Atsumi et al., 2015; 2017). There is also evidence that they don't (Kupferberg et al., 2013; Burkart et al., 2012; Schafroth et al., 2021). The abstract would be stronger if written to represent the state of the field more accurately.

The overall conclusion as stated in the abstract, at the end of the introduction, and in the discussion is not warranted by the evidence. Indeed, the abstract completely fails to mention that the marmosets failed to show the human-like pattern of longer fixations on the ToM videos. Many readers will likely interpret this evidence as primarily against the idea that marmosets view the ToM videos in a human-like way, or as equivocal evidence at best. This report will be a stronger piece of science if it accurately describes the results.

The justification for looking in marmosets could be read to imply that macaque monkeys do not live in family groups or share important social similarities with humans. Both species share many social similarities (and many social differences) with humans. Marmosets are a good species to study; this section would benefit from a more accurate rationale.

Because it is one of the main metrics in the Klein and Schafroth papers, and thus readers will want to see it for sake of comparison, the authors should include a figure showing the overall fixation durations as a function of category and species.

The results about looking time to the large triangle need to follow up on the interaction between species and conditions so that readers know how to interpret it.

The sentence on lines 97-99 might be an incomplete sentence.

Are the bars in Figure 2 meant to add up to 1 for any given participant? If you analyzed the total time fixating on either shape, would marmosets be spending less time looking at the shapes overall than humans?

Overall, the figures are quite informative and aesthetically pleasing.

HCP should be explained the first time it is used.

Readers will likely want clarification in cases where the same area showed stronger activation for ToM videos AND Random videos. I assume it was in different voxels in the same larger area, but this could be explicit.

The claim that these maps represent "dedicated brain networks" for ToM or Random videos (line 188) is too strong. These brain areas are used for many things.

For many of the sentences in the imaging results, the comparison needs to be made explicit. For example Line 193 – higher bilateral activation than what? Line 196 – greater activations than what? Line 202 – a larger network than what? Etc.

The description of Klein et al., (2009) on Lines 289-293 might be read to imply that they were attributing mentalizing without good reason. Klein also collected intentionality scores, which correlated with the viewing metric. This could be rephrased to be more accurate.

In general, the discussion could be strengthened by avoiding repeating the results in as much detail.

The inclusion of the authors as subjects is odd. Some readers will view it as a big red flag. The authors clearly know their own hypothesis and likely have a vested interest in a particular outcome. For the strongest report, the authors should remove their own data. At the very least, the authors need to demonstrate that the inclusion/exclusion of their unblinded data doesn't affect the interpretation of the human results.

The method should state whether the subjects had experienced these animations before (e.g., they're shown in some psychology and neuroscience classes).

If the authors proceed in pushing this data without the Goal-Directed videos, they need to at least address their rationale for not testing these videos.

The description of the monkey reward contingencies needs to be clearer about whether the monkeys were rewarded only during calibration or during videos as well, and whether any reward during videos was contingent on keeping their eyes on the screen.

Because this is a social task when the scans were normalized to MNI space, did the authors divide the human participants into those with and without a paracingulate sulcus?

The authors need to better specify what counts as a "baseline" for the fMRI comparisons. They should also briefly justify why this is an informative comparison.

https://doi.org/10.7554/eLife.86327.sa1

Author response

Essential revisions:

1. In the abstract, the claim that there is no evidence that nonhuman primates attribute mental states to moving shapes is false. You even cite some of this positive evidence (e.g., Uller, 2004; Atsumi et al., 2015; 2017). There is also evidence that they don't (Kupferberg et al., 2013; Burkart et al., 2012; Schafroth et al., 2021). The abstract would be stronger if written to represent the state of the field more accurately.

We greatly appreciate the reviewer's insightful comments. In response, we have revised the abstract to better align with the existing literature concerning nonhuman primates’ abilities to attribute mental states to moving shapes. The specific modifications can be found on page 2 of the revised manuscript.

Page 2: “Theory of Mind (ToM) refers to the cognitive ability to attribute mental states to other individuals. This ability extends even to the attribution of mental states to animations featuring simple geometric shapes, such as the Frith-Happé animations in which two triangles move either purposelessly (Random condition), exhibit purely physical movement (Goal-directed condition), or move as if one triangle is reacting to the other triangle’s mental states (ToM condition). While this capacity in humans has been thoroughly established, research on nonhuman primates has yielded inconsistent results.

This study explored how marmosets (Callithrix jacchus), a highly social primate species, process Frith-Happé animations by examining gaze patterns and brain activations of marmosets and humans as they observed these animations. We revealed that both marmosets and humans exhibited longer fixations on one of the triangles in ToM animations, compared to other conditions. However, we did not observe the same pattern of longer overall fixation duration on the ToM animations in marmosets as identified in humans. Furthermore, our findings reveal that both species activated extensive and comparable brain networks when viewing ToM versus Random animations, suggesting that marmosets differentiate between these scenarios similarly to humans. While marmosets did not mimic human overall fixation patterns, their gaze behavior and neural activations indicate a distinction between ToM and non-ToM scenarios. This study expands our understanding of nonhuman primate cognitive abilities, shedding light on potential similarities and differences in ToM processing between marmosets and humans.”

2. The overall conclusion as stated in the abstract, at the end of the introduction, and in the discussion is not warranted by the evidence. Indeed, the abstract completely fails to mention that the marmosets failed to show the human-like pattern of longer fixations on the ToM videos. Many readers will likely interpret this evidence as primarily against the idea that marmosets view the ToM videos in a human-like way, or as equivocal evidence at best. This report will be a stronger piece of science if it accurately describes the results.

We agree that it is crucial to precisely represent the results of our study, including the nuanced details about marmosets' reactions to the ToM videos. To this end, we have revised the abstract, introduction, and Discussion sections to provide a more balanced and precise interpretation of our findings.

The major revisions appear on page 2 in the abstract, on page 4 in the introduction, and on pages 14 to 20 in the Discussion section. We believe that these changes will ensure the results and conclusions of the study are conveyed more accurately and transparently.

The revised content reads as follows:

Page 2: “Theory of Mind (ToM) refers to the cognitive ability to attribute mental states to other individuals. This ability extends even to the attribution of mental states to animations featuring simple geometric shapes, such as the Frith-Happé animations in which two triangles move either purposelessly (Random condition), exhibit purely physical movement (Goal-directed condition), or move as if one triangle is reacting to the other triangle’s mental states (ToM condition). While this capacity in humans has been thoroughly established, research on nonhuman primates has yielded inconsistent results.

This study explored how marmosets (Callithrix jacchus), a highly social primate species, process Frith-Happé animations by examining gaze patterns and brain activations of marmosets and humans as they observed these animations. We revealed that both marmosets and humans exhibited longer fixations on one of the triangles in ToM animations, compared to other conditions. However, we did not observe the same pattern of longer overall fixation duration on the ToM animations in marmosets as identified in humans. Furthermore, our findings reveal that both species activated extensive and comparable brain networks when viewing ToM versus Random animations, suggesting that marmosets differentiate between these scenarios similarly to humans. While marmosets did not mimic human overall fixation patterns, their gaze behavior and neural activations indicate a distinction between ToM and non-ToM scenarios. This study expands our understanding of nonhuman primate cognitive abilities, shedding light on potential similarities and differences in ToM processing between marmosets and humans.”

Pages 4-5: “Although the spontaneous attribution of mental states to moving shapes has been well established in humans, it remains uncertain whether other primate species share this capacity. There is some evidence suggesting that monkeys can attribute goals to agents with varying levels of similarity and familiarity to conspecifics, including human agents, monkey robots, moving geometric boxes, animated shapes, and simple moving dots (Atsumi et al., 2017; Atsumi and Nagasaka, 2015; Krupenye and Hare, 2018; Kupferberg et al., 2013; Uller, 2004). However, the findings in this area are somewhat mixed, with some studies investigating the attribution of goals to inanimate moving objects yielding inconclusive results (Atsumi and Nagasaka, 2015; Kupferberg et al., 2013). Nonhuman primates' spontaneous attribution of mental states to Frith-Happé animations is even less certain. While human subjects exhibit longer eye fixations when viewing the ToM condition compared to the Random condition of the Frith-Happé animations (Klein et al., 2009), a recent eye tracking study in macaque monkeys did not observe similar differences (Schafroth et al., 2021). Similarly, a recent fMRI study conducted on macaques found no discernible differences in activations between ToM and random Frith-Happé animations (Roumazeilles et al., 2021).

In this study, we investigated the behaviour and brain activations of New World common marmoset monkeys (Callithrix jacchus) while they viewed Frith-Happé animations. Living in closely-knit family groups, marmosets exhibit significant social parallels with humans, including prosocial behavior, imitation, and cooperative breeding. These characteristics establish them as a promising nonhuman primate model for investigating social cognition (Burkart et al., 2009; Burkart and Finkenwirth, 2015; Miller et al., 2016). To directly compare humans and marmosets in their response to these animations, we employed high-speed video eye-tracking to record eye movements in eleven healthy humans and eleven marmoset monkeys. Additionally, we conducted ultra-high field fMRI scans on ten healthy humans at 7T and six common marmoset monkeys at 9.4T. These combined methods allowed us to examine the visual behavior and brain activations of both species while they observed the Frith-Happé animations.”

Pages 14-15: “In our first experiment, we examined the gaze patterns of marmosets and humans during the viewing of these video animations. Klein et al. (2009) reported differing fixation durations for these animations, where the longest fixations were observed for ToM animations, followed by GD animations and the shortest fixations for Random animations. They further reported that the intentionality score – derived from verbal descriptions of the animations – followed a similar pattern: highest for ToM, lowest for Random, and intermediate for GD animations. This validated the degree of mental state attribution according to the categories and established that animations provoking mentalizing (ToM condition) were associated with long fixations. This, in turn, supports the use of fixation durations as a nonverbal metric for mentalizing capacity (Klein et al., 2009; Meijering et al., 2012). Our results with human subjects, which demonstrated longer fixation durations for the ToM animations compared to the GD and Random animations, paralleled those of Klein et al. (2009). However, unlike Klein et al.'s findings, we did not observe intermediate durations for GD animations in our study.

Interestingly, our marmoset data did not align with the human findings but instead resonated more with Schafroth et al. (2021)'s observations in macaque monkeys, which did not show significant differences in fixation durations across the three animation types.”

Pages 19-20: “In summary, our study reveals novel insights into how New World marmosets, akin to humans, differentially process abstract animations that depict complex social interactions and animations that display purely physical or random movements. Our findings, supported by both specific gaze behaviors (i.e., the proportion of time spent on the red triangle, despite the inconclusiveness of overall fixation) and distinct neural activation patterns, shed light on the marmosets' capacity to interpret social cues embedded in these animations.

The differences observed between humans, marmosets, and macaques underscore the diverse cognitive strategies that primate species have evolved to decipher social information. This diversity may be influenced by unique evolutionary pressures that arise from varying social structures and lifestyles. Like macaque monkeys, humans often live in large, hierarchically organized social groups where status influences access to resources. However, both humans and marmosets share a common trait: a high degree of cooperative care for offspring within the group, with individuals other than the biological parents participating in child-rearing. These distinctive social dynamics of marmosets and humans may have driven the development of unique social cognitive abilities. This could explain their enhanced sensitivity to abstract social cues in the Frith-Happé animations.

Nonetheless, it is crucial to emphasize that even though marmosets respond to the social cues in the Frith-Happé animations, this does not automatically imply that they possess mental-state attributions comparable to humans. As such, future research including a range of tasks, from sensory-affective components to more abstract and decoupled representations of others' mental states (Schurz et al., 2020), will be fundamental in further unravelling the complexities of the evolution and functioning of the theory of mind across the primate lineage.”

3. The authors need to explicitly mention the rationale for omitting the original Goal-Directed condition from the Frith-Happé task. We cannot necessarily conclude that the marmosets are engaged in mental state attribution on the basis of these brain activation patterns – it could reflect the processing of distinct biological movements or the unfolding of an event narrative. If the authors proceed in pushing this data without the Goal-Directed videos, they must address their rationale for not testing these videos.

We appreciate the reviewer's feedback about the omission of the Goal-Directed condition from the Frith-Happé task in our study. In our revised manuscript, we have elaborated on the factors influencing our decision to focus primarily on the ToM and Random conditions. These factors were two-fold:

1. Influence from prior fMRI studies: Many previous fMRI studies using the Frith-Happé animated triangles task with human and macaque subjects have only employed the ToM and Random conditions. These conditions represent the two extremes, with ToM depicting scenarios with mental interactions and Random showing scenarios without mental interactions. The GD condition is situated between these two extremes, depicting physical interaction among the triangles without suggesting mental state attribution.

2. Practical considerations: The duration of each video clip in the Frith-Happé task (19.5 seconds) presented challenges for keeping marmoset subjects alert and focused during longer scanning sessions.

Given these constraints, we made the decision to limit the number of conditions presented in a single run.

However, understanding the value of including the Goal-Directed condition, we have performed an additional eye-tracking experiment incorporating all three conditions: ToM, Goal-Directed, and Random. The results from this experiment provided further insights into the gaze patterns during these conditions, adding depth to our understanding of marmoset behavior during the different conditions.

Moreover, while our data reveal distinct patterns of brain activation and gaze behavior during the ToM condition, we recognize and emphasize in our revised manuscript that these patterns do not conclusively prove that marmosets attribute mental states in the same way as humans.

These adjustments, along with the findings from the new eye-tracking experiment, have been integrated into the revised manuscript. The relevant sections in the Methods, Results, and Discussion have been modified accordingly. These revisions can be found on pages 5 to 7 (Results section), 14 to 20 (Discussion section), and 23-24 (Methods section) of the revised manuscript.

4. Were there any differences between the different types of videos used? For instance, was there any difference between videos in which the interaction between the two shapes is more obvious (e.g. coaxing vs seducing)?

We appreciate the reviewer's interest in the distinct types of videos used in our study. In this investigation, we did not specifically analyze the responses to videos where the interaction between the two shapes was more or less pronounced, such as in coaxing versus seducing scenarios. Our primary focus was on contrasting the overall responses elicited by ToM animations and Random animations. Due to our experimental design, we did not have enough repetitions of each distinct type of video within each condition to provide the statistical power necessary for such an analysis.

We recognize that dissecting responses to different types of ToM animations might reveal further insights into the specificity of neural responses, and this is an intriguing area for future research.

5. Did the authors also present social videos to their animals? If so did they also observe additional recruitment of IPa and TPO areas like Clery and colleagues for social videos compared to Frith and Happe's videos (Cléry et al., 2021)?

Our study specifically focused on neural responses to the abstract social scenarios represented by the Frith-Happé animations. Consequently, realistic social videos were not included in our experimental design.

The study by Cléry et al. (2021) indeed demonstrates that the comparison of social versus non-social realistic videos predominantly revealed a fronto-parietal network with additional temporal region engagement. The social condition in their study seems to recruit not only several areas that we observed to be activated in our ToM condition, but also additional temporal regions such as the IPa, TPO and TE1 areas. This suggests that these areas may play a significant role in the processing of more realistic social cues, thereby adding complexity to the social brain network.

Although our current study did not directly investigate this aspect, we agree that comparing the neural responses elicited by abstract versus realistic social scenarios in marmosets could provide valuable insights into the extent and adaptability of the social brain network. This approach would further our understanding of the neural substrates that underpin various facets of social cognition, depending on the complexity and realism of the presented stimuli. In future research, we plan to consider including such comparative investigations in our experimental design. We appreciate the reviewer's suggestion.

6. Humans and marmosets recruit a distinct set of subcortical structures during the viewing of video clips; for instance dorsal thalamus and cerebellum in humans, the hippocampus and amygdala in marmosets. How do the authors interpret this difference?

We are grateful to the reviewer for drawing attention to the distinct set of subcortical structures engaged by humans and marmosets during the viewing of video clips. The divergent patterns may reflect species-specific social cognitive strategies.

In humans, the involvement of the dorsal thalamus, which serves as a critical hub for information relay between various subcortical areas and the cortex, may indicate the necessity for complex information processing in interpreting the animations (e.g., Halassa and Sherman, 2019). The activation of the cerebellum, beyond its traditional role in motor functions, supports recent findings of its involvement in social cognition (e.g., Van Overwalle et al., 2014). The activations in a small portion of the amygdala may reflect emotional processing tied to understanding of the social scenarios in the animations (e.g., Janak and Tye, 2015).

On the other hand, the activation of the hippocampus and amygdala in marmosets might reflect a more emotion-driven interpretation of the animations. The recruitment of the hippocampus could suggest the role of memory in interpreting the animations by remembering past interactions to help interpret current social scenarios (e.g., Eichenbaum, 2017), while the more extended activation of the amygdala in marmosets might imply a higher degree of emotional processing compared to humans (e.g., Janak and Tye, 2015).

In response, we have included a detailed discussion in our revised manuscript, which takes into account the potential roles of these structures in complex information processing, social cognition, emotional processing, and memory recall in the context of interpreting the animations. We emphasize that these interpretations remain speculative and underscore the need for further research to confirm these observations.

We have now expanded our discussion on this topic in the revised manuscript, specifically on pages 18-19:

“Regarding the distinct subcortical activations observed in humans and marmosets, it's important to consider the specific social cognitive demands that might be unique to each species. The involvement of the dorsal thalamus, cerebellum, and a small portion of the amygdala in humans may reflect the complexities of information processing, social cognition, and emotional involvement required to interpret the ToM animations (e.g., Halassa and Sherman, 2019; Janak and Tye, 2015; Van Overwalle et al., 2014). Conversely, the activation of the amygdala and hippocampus in marmosets could suggest a more emotion- and memory-based processing of the social stimuli (e.g., Eichenbaum, 2017; Van Overwalle et al., 2014). However, it's critical to consider that these interpretations are speculative and would require further study for confirmation.”

7. Unlike marmosets, rhesus macaques are not sensitive to the type of the Frith and Happe social illusion (Roumazeilles et al., 2021; Schafroth et al., 2021). The authors might want to discuss the singularity of the marmosets from an evolutionary perspective.

As suggested, in our revised manuscript, we have now incorporated a comprehensive discussion regarding the uniqueness of marmosets from an evolutionary perspective, especially in light of the different results obtained in a similar study conducted on rhesus macaques. We highlight the divergent evolutionary trajectories of New World monkeys (such as marmosets) and Old World monkeys (such as macaques), which may contribute to the differential sensitivity to abstract social cues embedded in animations.

We also underscore the diverse cognitive strategies that primate species employ in deciphering social information, influenced by unique evolutionary pressures arising from varying social structures and lifestyles.

The associated changes in our manuscript can be found in the Discussion section on pages 19-20, which reads:

“Interestingly, our results differed from those obtained by Roumazeilles et al. (2021) in their fMRI study conducted in macaques using the same animations. Roumazeilles and colleagues reported no differences in activation between ToM and Random animations, suggesting that rhesus macaques may not respond to the social cues presented by the ToM Frith-Happé animations. This disparity between our marmoset findings and those of macaques raises intriguing questions about potential differences in the evolutionary development of ToM processing within non-human primates. Marmosets, as New World monkeys, are part of an evolutionary lineage that diverged earlier than the lineage of Old-World monkeys such as macaques. This difference in lineage might lead to distinct evolutionary trajectories in cognitive processing, which could include varying sensitivity to abstract social cues in animations.

In summary, our study reveals novel insights into how New World marmosets, akin to humans, differentially process abstract animations that depict complex social interactions and animations that display purely physical or random movements. Our findings, supported by both specific gaze behaviors (i.e., the proportion of time spent on the red triangle, despite the inconclusiveness of overall fixation) and distinct neural activation patterns, shed light on the marmosets' capacity to interpret social cues embedded in these animations.

The differences observed between humans, marmosets, and macaques underscore the diverse cognitive strategies that primate species have evolved to decipher social information. This diversity may be influenced by unique evolutionary pressures that arise from varying social structures and lifestyles. Like macaque monkeys, humans often live in large, hierarchically organized social groups where status influences access to resources. However, both humans and marmosets share a common trait: a high degree of cooperative care for offspring within the group, with individuals other than the biological parents participating in child-rearing. These distinctive social dynamics of marmosets and humans may have driven the development of unique social cognitive abilities. This could explain their enhanced sensitivity to abstract social cues in the Frith-Happé animations.

Nonetheless, it is crucial to emphasize that even though marmosets respond to the social cues in the Frith-Happé animations, this does not automatically imply that they possess mental-state attributions comparable to humans. As such, future research including a range of tasks, from sensory-affective components to more abstract and decoupled representations of others' mental states (Schurz et al., 2020), will be fundamental in further unravelling the complexities of the evolution and functioning of the theory of mind across the primate lineage.”

8. The justification for looking in marmosets could be read to imply that macaque monkeys do not live in family groups or share important social similarities with humans. Both species share many social similarities (and many social differences) with humans. Marmosets are a good species to study; this section would benefit from a more accurate rationale.

In response to the reviewer's comment, we have clarified our previous section. Our intention was to highlight the unique social aspects of marmosets that make them a suitable species for studying social cognition, not to imply that macaques do not have their own set of social similarities with humans. As suggested, we have now revised the sentence on page 4 to enhance its clarity and accuracy, which now reads:

Page 4: “Living in closely-knit family groups, marmosets exhibit significant social parallels with humans, including prosocial behavior, imitation, and cooperative breeding. These characteristics establish them as a promising nonhuman primate model for investigating social cognition (Burkart et al., 2009; Burkart and Finkenwirth, 2015; Miller et al., 2016).”

9. Because it is one of the main metrics in the Klein and Schafroth papers, and thus readers will want to see it for sake of comparison, the authors should include a figure showing the overall fixation durations as a function of category and species.

In accordance with the reviewer's recommendation, we have included a new figure (Figure 2) in our revised manuscript, which can be found on page 39. This figure shows the overall fixation durations as a function of both animation category and species, allowing a more direct comparison of fixation durations between humans and marmosets across the different animation conditions (Random, Goal-directed, and ToM). This presentation is consistent with the data representation used in the Klein and Schafroth papers.

10. The results about looking time to the large triangle need to follow up on the interaction between species and conditions so that readers know how to interpret it.

In response to the reviewer's comment, we have provided a more detailed analysis of the interaction between species and conditions for the proportion of time spent looking at the large red triangle. We found that both humans and marmosets spent a greater proportion of time looking at the red triangle in the ToM condition compared to the GD and Random conditions. However, while humans also allocated more time to the red triangle in GD compared to Random animations, marmosets did not show any difference between these two conditions. The updated text can be found in the Results section on pages 5 to 7, which now reads as follows:

Pages 6-7: “To further analyze the gaze patterns of both humans and marmosets, we next measured the proportion of time subjects looked at each of the triangles in the videos (Figure 2B). We conducted mixed ANOVAs on the proportion of time the radial distance between the current gaze position and each triangle was within 4 visual degrees for each triangle separately.

Importantly, we observed a significant interaction between species and condition for the proportion of time spent looking at the large red triangle (F(2,40)=9.83, p<.001, ηp2 = .330). Specifically, both humans (Figure 2B left) and marmosets (Figure 2B right) spent a greater proportion of time looking at the red triangle in ToM compared to the GD and Random videos (For humans, ToM vs GD: Δ=.23, p<.001 and ToM vs Random: Δ=.31, p<.001 ; For marmosets, ToM vs GD: Δ=.13, p<.01 and ToM vs Random: Δ=.13, p<.01). However, while humans also allocated a greater proportion of time to the red triangle in GD compared to Random animations (Δ=.08, p=.05), marmosets did not show any difference between these two conditions (Δ=.0003, p=1).

For the small blue triangle, we also observed a significant interaction of species and condition (F(2,40)=3.54, p=.04, ηp2=.151) but the comparisons were not resistant to the p value adjustment by Bonferroni correction. Therefore, humans and marmosets spent the same proportion of time looking at the blue triangle in the three different types of videos (For humans, ToM vs GD: Δ=-.02, p=1, ToM vs Random: Δ=.04, p=1 and GD vs Random: Δ=.07, p=.23 ; For marmosets, ToM vs GD: Δ=-.05, p=.89, ToM vs Random: Δ=.07, p=.66 and GD vs Random: Δ=-.02, p=1; Figure 2B).

These results highlight the variation in gaze patterns observed in both humans and marmosets when their focus is directed towards the large red triangle during the viewing of ToM, GD, and Random videos. Notably, humans show a gradient of proportion of time spent looking at the red triangle across the three conditions, with the smallest proportion in Random videos and the greatest proportion in ToM videos. In contrast, marmosets exhibit a different pattern, spending more time looking at the red triangle in ToM videos, but allocating the same proportion of time to look at the red triangle in both Random and GD videos. This finding suggests that while humans demonstrate distinct attentional preferences for the red triangle across the three conditions, marmosets exhibit a similar attentional focus on the red triangle in the Random and GD conditions, but their pattern differs in the ToM condition. This suggests that marmosets process the Random and GD conditions in a similar manner, but their processing of the ToM condition is distinct, indicating a differential response to stimuli representing social interactions.”
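The proportion-of-time measure described in the passage above can be illustrated with a minimal sketch. It assumes gaze and triangle positions are already expressed in degrees of visual angle and sampled at the same time points; the function names and the toy data are hypothetical and are not taken from our analysis code. Only the 4-degree radius comes from the text.

```python
import numpy as np

def proportion_within_radius(gaze_xy, target_xy, radius_deg=4.0):
    """Fraction of samples whose radial distance to the target is <= radius_deg."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    target_xy = np.asarray(target_xy, dtype=float)
    dist = np.linalg.norm(gaze_xy - target_xy, axis=1)
    return float(np.mean(dist <= radius_deg))

def proportion_on_either(gaze_xy, red_xy, blue_xy, radius_deg=4.0):
    """Fraction of samples within radius_deg of at least one of the two triangles."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    near_red = np.linalg.norm(gaze_xy - np.asarray(red_xy, dtype=float), axis=1) <= radius_deg
    near_blue = np.linalg.norm(gaze_xy - np.asarray(blue_xy, dtype=float), axis=1) <= radius_deg
    return float(np.mean(near_red | near_blue))

# Hypothetical toy data: five time samples, all coordinates in visual degrees.
gaze = [[0, 0], [1, 0], [10, 10], [5, 0], [20, 20]]
red = [[0, 0], [0, 0], [0, 0], [3, 0], [0, 0]]
blue = [[10, 10], [3, 0], [10, 10], [6, 0], [0, 0]]

p_red = proportion_within_radius(gaze, red)
p_blue = proportion_within_radius(gaze, blue)
p_either = proportion_on_either(gaze, red, blue)
print(p_red, p_blue, p_either)
```

In this toy example the gaze falls within 4 degrees of both triangles on two samples, so the two per-triangle proportions are computed independently and need not sum to 1, nor to the proportion of time spent on either shape.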

11. Are the bars in Figure 2 meant to add up to 1 for any given participant? If you analyzed the total time fixating on either shape, would marmosets be spending less time looking at the shapes overall than humans?

The values in the original Figure 2 do not total 1 for each participant, as there are instances where the triangles either overlap or are proximate enough that the eye position falls within the defined radius for both shapes simultaneously. Responding to the second query, after conducting an additional analysis, we found a significant species effect on the total time spent fixating on either shape (F(1,20)=14.38, p=.001, ηp2=.42). This indicates that humans spent a greater proportion of time looking at the triangles than marmosets did (Δ=.16, p=.001).
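For readers who wish to check the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity ηp2 = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies this identity to two of the values reported above; the function name is illustrative only.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Species effect on total time spent on either shape: F(1,20) = 14.38
eta_species = partial_eta_squared(14.38, 1, 20)

# Species x condition interaction for the red triangle: F(2,40) = 9.83
eta_interaction = partial_eta_squared(9.83, 2, 40)

print(round(eta_species, 2), round(eta_interaction, 3))
```

Both values agree with the reported effect sizes (.42 and .330) after rounding.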

12. Readers will likely want clarification in cases where the same area showed stronger activation for ToM videos AND Random videos. I assume it was in different voxels in the same larger area, but this could be explicit.

In response to the reviewer's request for clarification, we have made it explicit in our manuscript that while larger areas of the brain showed stronger activation for both ToM and Random videos, the specific voxels within these areas exhibiting this activation were typically distinct. Furthermore, we note that in some instances, both conditions activated the same voxels, but the degree of activation differed, suggesting spatial and intensity variation within the same areas. This elaboration can be found in the sections discussing functional brain activations in humans (pages 7-8) and marmosets (page 11). The revised text reads:

Pages 7-8 (Functional brain activations while watching ToM and Random Frith-Happé’s animations in humans): “Both ToM (Figure 3A) and Random (Figure 3B) videos activated a large bilateral network. While the same larger areas were activated in both conditions, the specific voxels showing this activation within those areas were typically distinct. In some cases, both conditions activated the same voxels, but the degree of activation differed. This suggests a degree of both spatial and intensity variation in the activations for the two conditions within the same areas.”

Page 11 (Functional brain activations while watching ToM and Random Frith-Happé’s animations in marmosets): “Both the ToM (Figure 4A) and Random (Figure 4B) animations activated an extensive network involving a variety of areas in the occipito-temporal, parietal and frontal regions. As in human subjects, it should be noted that while both conditions elicited strong activation in some of the same larger areas, these activations might have either occurred in distinct voxels within those areas, or the same voxels were activated to varying degrees for both conditions. This suggests distinct yet overlapping patterns of neural processing for the ToM and Random conditions.”

13. The claim that these maps represent "dedicated brain networks" for ToM or Random videos (line 188) is too strong. These brain areas are used for many things.

In response to the reviewer's comment, we agree that the term "dedicated brain networks" could potentially imply exclusivity, which is not our intention. We understand that these brain areas participate in a variety of cognitive functions. To address this, we have modified our phrasing on page 10 to "brain networks activated during the processing of ToM or Random videos" to more accurately represent our findings.

14. For many of the sentences in the imaging results, the comparison needs to be made explicit. For example Line 193 – higher bilateral activation than what? Line 196 – greater activations than what? Line 202 – a larger network than what? Etc.

We agree with the reviewer's observation about the need for explicit comparisons in our imaging results. To address this, we have now revised certain sentences in the Results section to provide clear and specific comparisons. The updated descriptions can be found in the Results section on pages 7 to 12 of the revised manuscript.

15. The description of Klein et al., (2009) on Lines 289-293 might be read to imply that they were attributing mentalizing without good reason. Klein also collected intentionality scores, which correlated with the viewing metric. This could be rephrased to be more accurate.

We concur with the reviewer's suggestion for a more accurate interpretation of Klein et al., 2009. Our intention was not to undermine the work by Klein et al. To address this, we have adjusted the phrasing within the Discussion section on page 15 of our manuscript, emphasizing Klein et al.'s valuable contribution through their correlation of intentionality scores with fixation durations. These revisions result in a more balanced and accurate representation of their work.

Page 15: “In our first experiment, we examined the gaze patterns of marmosets and humans during the viewing of these video animations. Klein et al. (2009) reported differing fixation durations for these animations, where the longest fixations were observed for ToM animations, followed by GD animations and the shortest fixations for Random animations. They further reported that the intentionality score – derived from verbal descriptions of the animations – followed a similar pattern: highest for ToM, lowest for Random, and intermediate for GD animations. This validated the degree of mental state attribution according to the categories and established that animations provoking mentalizing (ToM condition) were associated with long fixations. This, in turn, supports the use of fixation durations as a nonverbal metric for mentalizing capacity (Klein et al., 2009; Meijering et al., 2012).”

16. The inclusion of the authors as subjects is odd. Some readers will view it as a big red flag. The authors clearly know their own hypothesis and likely have a vested interest in a particular outcome. For the strongest report, the authors should remove their own data. At the very least, the authors need to demonstrate that the inclusion/exclusion of their unblinded data doesn't affect the interpretation of the human results.

We acknowledge the reviewer's concern regarding the inclusion of authors as subjects and the potential bias it could introduce. In response, we have excluded the data from the authors who initially participated in the study and replaced them with new data from subjects unrelated to the authorship of this work. For the eye tracking experiment, given the introduction of a new “Goal-Directed condition”, we carried out the experiment with eleven new participants, none of whom are authors of this study. Regarding the fMRI experiment, we substituted the data collected from the three author-participants with data from three additional participants who were not informed about the study's hypothesis.

The re-analyzed results, accounting for the updated participant pool, can now be found in the sections: “Gaze patterns for Frith-Happé’s ToM, GD and Random animations in humans and marmosets” (pages 5-7), “Functional brain activations while watching ToM and Random Frith-Happé’s animations in humans” (pages 7-10), and “Comparison of functional brain activations in humans and marmosets” (pages 13-14). We also updated Figures 2, 3 and 5 and figure supplements 1 and 2 on pages 39, 40, 42, 43, and 44, respectively.

The revised participant information is detailed in the methods section on page 22:

“Eleven healthy humans (4 females, 25-42 years, mean age: 30.7 years) participated in the eye tracking experiment. Among these, five individuals, along with eight additional subjects (4 females, 26-45 years), took part in the fMRI experiment.”

17. The method should state whether the subjects had experienced these animations before (e.g., they're shown in some psychology and neuroscience classes).

We agree that detailing whether subjects had previous exposure to the animations is essential for the study's integrity. As a result, we have incorporated the following statement into the Methods section on page 22:

Page 22: “Importantly, all subjects confirmed they had not previously been exposed to the Frith-Happé animation videos used in our study.”

18. The description of the monkey reward contingencies needs to be clearer about whether the monkeys were rewarded only during calibration or during videos as well, and whether any reward during videos was contingent on keeping their eyes on the screen.

We appreciate the reviewer's suggestion to clarify the reward contingencies for the monkeys in our study. The monkeys received rewards only at the beginning and end of each session, and not during the calibration or the viewing of the videos. Accordingly, we have updated the text in the Methods section, now found on page 23, as follows:

“Monkeys were rewarded at the beginning and end of each session. Crucially, no rewards were provided during the calibration or while the videos were played.”

19. Because this is a social task when the scans were normalized to MNI space, did the authors divide the human participants into those with and without a paracingulate sulcus?

We appreciate the reviewer's insightful comment. While we normalized our MRI scans to MNI space in this study, we did not differentiate among participants based on the presence or absence of a paracingulate sulcus. The reviewer’s suggestion to take this factor into account in our analyses is indeed valuable and will be considered in our future studies involving a larger pool of participants.

20. The authors need to better specify what counts as a "baseline" for the fMRI comparisons. They should also briefly justify why this is an informative comparison.

We agree that the definition and justification for our selected "baseline" in the fMRI comparisons should be more explicit. In our study, the "baseline" denotes brain activity when subjects are in a 'resting state' – a state of neutral alertness – specifically during the presentation of a circular black cue between video clips. The comparison to this baseline is valuable because it allows us to isolate and contrast brain activity associated with task-specific conditions, such as ToM or Random animations. We have added a more detailed explanation in the Methods section on page 30 of the revised manuscript. It now reads:

“First, we identified brain regions involved in the processing of ToM and Random animations by contrasting each condition with a baseline (i.e., ToM condition > baseline and Random condition > baseline contrasts). This baseline brain activation, recorded during the presentation of the circular black cue between video clips (i.e., baseline blocks of 15 sec, see above), reflects 'resting state' activation. By comparing it to the brain activation during ToM and Random animations, we could specifically highlight the task-related activations and isolate brain regions engaged during each condition.”
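The logic of a condition > baseline contrast can be sketched with a toy general linear model. This is a simplified illustration under stated assumptions, not our actual analysis pipeline: the repetition time of 1.5 s, the block order, the simulated signal, and all variable names are hypothetical, and the boxcar regressors are not convolved with a hemodynamic response function. Only the block durations (19.5 sec videos, 15 sec baseline) come from the text.

```python
import numpy as np

TR = 1.5                                   # seconds per volume (hypothetical)
video_vols = int(round(19.5 / TR))         # volumes per video block
base_vols = int(round(15.0 / TR))          # volumes per baseline block

# Hypothetical block order; baseline blocks separate the video clips.
blocks = ["ToM", "Random", "Random", "ToM"]
labels = ["baseline"] * base_vols
for cond in blocks:
    labels += [cond] * video_vols + ["baseline"] * base_vols
labels = np.array(labels)

# Design matrix: one boxcar per condition plus a constant term.
# Baseline volumes are modeled only by the constant (implicit baseline).
tom = (labels == "ToM").astype(float)
rnd = (labels == "Random").astype(float)
X = np.column_stack([tom, rnd, np.ones(len(labels))])

# Simulated voxel time course with known effects (for illustration only).
rng = np.random.default_rng(0)
true_beta = np.array([2.0, 1.0, 10.0])
y = X @ true_beta + rng.normal(0.0, 0.1, len(labels))

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Contrast vectors pick out each condition's effect relative to baseline.
contrasts = {
    "ToM > baseline": np.array([1.0, 0.0, 0.0]),
    "Random > baseline": np.array([0.0, 1.0, 0.0]),
}
effects = {name: float(c @ beta) for name, c in contrasts.items()}
print(effects)
```

Because the baseline periods are captured by the constant term alone, each condition's coefficient estimates the signal change relative to the cue periods, which is what the condition > baseline contrasts express.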

Reviewer #1 (Recommendations for the authors):

I very much enjoyed reading this manuscript and believe it has the potential to make an important contribution to animal literature as well as social cognition more broadly.

As highlighted in the public review, my main query is in relation to the omission of the Goal-Directed condition of the Frith-Happé task. While the evidence presented here certainly suggests that marmosets process ToM animations in a different manner than Random animations, we are somewhat constrained in what we can interpret from these findings. As the authors note, we cannot necessarily conclude that the marmosets are engaged in mental state attribution on the basis of these brain activation patterns.

We are grateful for the reviewer's encouraging words, thoughtful evaluation, and constructive comments on our manuscript. We agree with the reviewer's remarks regarding the omission of the Goal-Directed condition in the Frith-Happé task and recognize the interpretative constraints this places on our findings. We have attempted to address all points in both our responses below and in the revised manuscript.

A more compelling argument would stem from the inclusion of the Goal-Directed condition in which the triangles arguably do interact but in a purely physical manner, i.e., there is no mental state attribution. I was surprised that this condition was not included as its omission somewhat limits the extent to which any conclusion regarding ToM can be drawn. Could the activations observed in the ToM condition reflect the processing of an event or narrative as it unfolds, rather than the cycling of random movements in the Random condition? I ask this question as previous studies using the Frith-Happé animations in dementia populations note that the mental state attribution judgements on ToM trials were conferred only at the end of the video (i.e., once the overall event narrative had been seen) whereas patients were adept at conferring a judgment of "no interaction" early during the viewing of Random animations (Ref: Synn et al. 2018 J. Alz Dis).

We appreciate the reviewer's insightful comment concerning the absence of the Goal-Directed (GD) condition in our study. We understand that integrating this condition could have offered a valuable contrast and enriched our understanding of the associated processing mechanisms.

Our initial strategy was to concentrate on the two extreme conditions: ToM and Random animations. These represent scenarios with mental interactions and scenarios without mental interactions, respectively. This choice was influenced by previous fMRI studies using the Frith-Happé animated triangles task in humans and macaques, which primarily focused on these two conditions (Gobbini et al., 2007; Barch et al., 2013; Bliksted et al., 2019; Vandewouw et al., 2021; Weiss et al., 2021; Chen et al., 2023; Roumazeilles et al., 2021). Additionally, the duration of each video clip (19.5 sec) posed practical challenges in incorporating all the conditions with a sufficient number of repetitions in the fMRI task design for marmoset subjects. It was crucial for us to ensure that the subjects remained alert and focused throughout the entire scanning session, which becomes increasingly difficult with longer runs. Consequently, we chose to center our attention on the ToM and Random conditions, as the GD condition is situated between these two extremes, depicting physical interaction among the triangles without suggesting mental state attribution.

Nevertheless, we recognize the potential limitations of not incorporating the GD condition and the possible insights it might offer. In response to the reviewer's feedback, we conducted an additional eye-tracking experiment that included all three conditions: ToM, GD, and Random. This experiment involved 11 human subjects and 11 marmosets, with all ToM, GD and Random video clips presented once in a single run. The results from this experiment provided additional insights into the gaze patterns during the different conditions, complementing our initial findings.

We have updated the manuscript to clarify the choice of two conditions for the fMRI experiment and to incorporate the findings from the new eye-tracking experiment. The relevant sections in the methods, results, and discussion have been modified accordingly. These revisions can be found on pages 23-24, 5 to 7, and 14 to 20, respectively.

The reviewer's insightful question regarding whether the observed activations in the ToM condition might simply reflect the processing of an event or narrative as it unfolds, rather than mental state attribution, is an important consideration. We understand from the current literature, including the Synn et al. (2018) study mentioned by the reviewer, that distinguishing between these two processes can be challenging, particularly given the dynamic nature of the stimuli used. The ToM condition intrinsically involves the progression of an event or narrative, which is necessary for subjects to infer the mental states of the characters.

However, we believe our results provide evidence of the specific involvement of certain brain regions in mental state attribution. The enhanced activation of certain brain regions (e.g., TPJ and STS) during the ToM condition compared to the Random condition aligns with several prior fMRI studies (Barch et al., 2013; Castelli et al., 2000; Chen et al., 2023; Gobbini et al., 2007; Vandewouw et al., 2021; Weiss et al., 2021; Wheatley et al., 2007). This suggests that our observed activations may extend beyond mere event or narrative processing. For marmosets, it is more difficult to draw definitive conclusions, as no previous fMRI studies have used the same animations. However, the new eye-tracking results show that marmosets spent a greater proportion of time looking at the red triangle in ToM videos, while allocating a similar, smaller proportion of time to the red triangle in both Random and GD videos. This suggests that marmosets process the Random and GD conditions similarly, but process ToM animations, which represent mental interactions, differently. Nevertheless, conclusive interpretation remains challenging, and this is indeed a matter that warrants further exploration. We appreciate the reviewer's critical perspective on this aspect of our study.

I wonder whether this interpretation might also reflect the curious finding of stronger medial PFC activation in Random trials versus ToM trials in humans, and no clear mPFC activation in the ToM trials. This seems very much at odds with the wider literature on the brain regions necessary for ToM, which often place the medial PFC at the heart of the social brain.

We appreciate the reviewer's observation concerning the surprising patterns of activation in the medial prefrontal cortex (mPFC) in our study. While many studies have indeed associated mPFC with ToM tasks, our findings of stronger mPFC activation during Random animations compared to ToM animations in humans, and the lack of clear mPFC activation in ToM trials, appear to diverge from the wider literature.

We would like to highlight that the role of mPFC in ToM may be more nuanced. For instance, a recent meta-analysis conducted by Schurz et al. (2021) demonstrated that social animation tasks, such as the Frith-Happé animated triangles task we used, tend to engage an intermediate cluster between cognitive and affective clusters. This cluster involves a variety of brain regions, including temporo-parietal areas, anterior temporal areas, dorso-posterior medial prefrontal cortex, and inferior frontal areas. This suggests that such tasks may not uniformly engage the entirety of the mPFC and might preferentially involve its dorsal part.

Our study revealed stronger activations in the ventral part of the mPFC for Random versus ToM animations. These activations could reflect other processes, such as attentional control. Notably, our results align closely with those of the Human Connectome Project by Barch et al. (2013), who also observed stronger activations for Random versus ToM animations in similar ventral parts of the mPFC.

These observations underscore the complexity of the neural substrates of ToM and the potential influence of task designs on the patterns of brain activation. We concur that further research is needed to fully understand these complex issues, and we greatly appreciate the reviewer's contribution to this ongoing discussion.

While previous fMRI studies using the Frith-Happé task have found dorsal mPFC activation for the ToM animations, we did not observe a clear activation pattern in ToM trials in our study. This discrepancy could be attributable to a variety of factors. Even minor differences in task design or the specific versions of the animations used could lead to different cognitive processes being engaged during the task. The methodological aspects, such as statistical power, could also have contributed to the differences in our findings. Furthermore, the number of participants can affect the resulting brain activation patterns.

We added this explanation of the possible issues to the Discussion section on page 17, which now reads:

“Our slightly adapted versions of the Frith-Happé animations led to a similar distinct pattern of brain activations, with an exception for the lack of activations in the dorsal part of the medial prefrontal cortex. This discrepancy could be attributable to various factors, including differences in task design, methodological aspects such as statistical power, or variations in participants characteristics.”

I very much appreciated the check using the independent HCP dataset. This was a very nice inclusion to ensure that the shortened version corresponded well with previous reports.

We are pleased to hear that the reviewer appreciates our use of the independent HCP dataset to validate our results. We thank the reviewer for the positive feedback on this aspect of our study.

Reviewer #2 (Recommendations for the authors):

In their study, Dureux and colleagues are investigating the sensitivity of a highly social non-human primate species, the marmoset, to social illusion using the Frith and Happe task. Although this task is often considered a non-verbal TOM task, its relevance to investigate TOM has been disputed. For instance, the Frith and Happe task does not recruit in humans a similar network as other false-belief tasks and social gambling tasks (Schurz et al., 2020). While the authors might want to revise, or at least discuss their use of the TOM concept further, their results clearly show that marmosets distinguish the two types of videos shown to them.

We are grateful for the reviewer's thoughtful comments and critique, and for the reference to the study by Schurz et al. (2020). Indeed, as the reviewer has rightly highlighted, the relevance of the Frith-Happé task for investigating ToM has been debated, given that the neural networks it recruits diverge from those recruited by other ToM tasks, such as false-belief tasks and social gambling tasks, as demonstrated by Schurz et al. (2020).

In light of these insightful comments, we have revised our manuscript to acknowledge these differences and argue that the Frith-Happé task still holds value in the study of social cognition, albeit with a potential focus on different facets of this complex construct compared to more traditional ToM tasks. This understanding is reflected in our findings, showing that both humans and marmosets can distinguish between the types of videos in the Frith-Happé task.

The revised sections in our manuscript addressing these issues now read:

Page 17: “Overall, the ToM network we identified, as well as that reported by Barch et al. (2013), appear to be more extensive than those described in studies employing more complex experimental paradigms to study ToM. This aligns with the recent meta-analysis conducted by Schurz and colleagues (2020), which demonstrated that the network activated by simpler, non-verbal stimuli like social animations differs from the traditional network, with involvement of both cognitive and affective networks (Schurz et al., 2020).”

Page 20: “As such, future research including a range of tasks, from sensory-affective components to more abstract and decoupled representations of others' mental states (Schurz et al., 2020), will be fundamental in further unravelling the complexities of the evolution and functioning of the theory of mind across the primate lineage.”

We believe these revisions provide a more nuanced understanding of our study within the larger context of ToM research, and we thank the reviewer for prompting this important discussion. We address all the other points raised, below and in the manuscript.

Were there any differences between the different types of videos used? For instance, was there any difference between videos in which the interaction between the two shapes is more obvious (e.g. coaxing vs seducing).

We appreciate the reviewer's interest in the distinct types of videos used in our study. In this investigation, we did not specifically analyze the responses to videos where the interaction between the two shapes was more or less pronounced, such as in coaxing versus seducing scenarios. Our primary focus was on contrasting the overall responses elicited by Theory of Mind (ToM) animations and Random animations. Due to the design of our experiment, we did not include a sufficient number of repetitions for each distinct type of video within each condition to afford the statistical power necessary for such a comparison. We acknowledge that assessing responses to different types of ToM animations may provide additional insights into the specificity of neural responses and consider this an interesting avenue for future research.

Did the authors also present social videos to their animals? If so did they also observe additional recruitment of IPa and TPO areas like Clery and colleagues for social videos compared to Frith and Happe's videos (Cléry et al., 2021)?

In this study, our main focus was to examine the neural responses to the Frith-Happé animations, which represent abstract social scenarios. As such, we did not present realistic social videos to our animals in the current experimental design.

The study by Cléry et al. (2021) indeed demonstrates that the comparison of social versus non-social realistic videos predominantly revealed a fronto-parietal network with additional temporal region engagement. The social condition in their study seems to recruit not only several areas that we observed to be activated in our ToM condition, but also additional temporal regions such as the IPa, TPO and TE1 areas. This suggests that these areas may play a significant role in the processing of more realistic social cues, thereby adding complexity to the social brain network.

Although our current study did not directly investigate this aspect, we agree that comparing the neural responses elicited by abstract versus realistic social scenarios in marmosets could provide valuable insights into the extent and adaptability of the social brain network. This approach would further our understanding of the neural substrates that underpin various facets of social cognition, depending on the complexity and realism of the presented stimuli. In future research, we plan to consider including such comparative investigations in our experimental design. We appreciate the reviewer's suggestion.

Humans and marmosets recruit a distinct set of subcortical structures during the viewing of video clips; for instance dorsal thalamus and cerebellum in humans, and the hippocampus and amygdala in marmosets. How do the authors interpret this difference?

We thank the reviewer for pointing out this difference in subcortical recruitment between humans and marmosets during the viewing of ToM versus Random video clips. The divergent patterns may reflect species-specific aspects of social cognition.

In humans, the involvement of the dorsal thalamus, which serves as a critical hub for information relay between various subcortical areas and the cortex, may indicate the necessity for complex information processing in interpreting the animations (e.g., Halassa and Sherman, 2019). The activation of the cerebellum, beyond its traditional role in motor functions, supports recent findings of its involvement in social cognition (e.g., Van Overwalle et al., 2014). The activations in a small portion of the amygdala may reflect emotional processing tied to understanding of the social scenarios in the animations (e.g., Janak and Tye, 2015).

On the other hand, the activation of the hippocampus and amygdala in marmosets might reflect a more emotion-driven interpretation of the animations. The recruitment of the hippocampus could suggest the role of memory in interpreting the animations by remembering past interactions to help interpret current social scenarios (e.g., Eichenbaum, 2017), while the more extended activation of the amygdala in marmosets might imply a higher degree of emotional processing compared to humans (e.g., Janak and Tye, 2015).

We have now expanded our discussion on this topic in the revised manuscript, specifically on pages 18-19:

“Regarding the distinct subcortical activations observed in humans and marmosets, it's important to consider the specific social cognitive demands that might be unique to each species. The involvement of the dorsal thalamus, cerebellum, and a small portion of the amygdala in humans may reflect the complexities of information processing, social cognition, and emotional involvement required to interpret the ToM animations (e.g., Halassa and Sherman, 2019; Janak and Tye, 2015; Van Overwalle et al., 2014). Conversely, the activation of the amygdala and hippocampus in marmosets could suggest a more emotion- and memory-based processing of the social stimuli (e.g., Eichenbaum, 2017; Van Overwalle et al., 2014). However, it's critical to consider that these interpretations are speculative and would require further study for confirmation.”

Unlike marmosets, rhesus macaques are not sensitive to the type of the Frith and Happe social illusion (Roumazeilles et al., 2021; Schafroth et al., 2021). The authors might want to discuss the singularity of the marmosets from an evolutionary perspective.

As suggested, we have now discussed the singularity of the marmosets from an evolutionary perspective in the Discussion section on pages 19-20, which reads:

“Interestingly, our results differed from those obtained by Roumazeilles et al. (2021) in their fMRI study conducted in macaques using the same animations. Roumazeilles and colleagues reported no differences in activation between ToM and Random animations, suggesting that rhesus macaques may not respond to the social cues presented by the ToM Frith-Happé animations. This disparity between our marmoset findings and those of macaques raises intriguing questions about potential differences in the evolutionary development of ToM processing within non-human primates. Marmosets, as New World monkeys, are part of an evolutionary lineage that diverged earlier than the lineage of Old-World monkeys such as macaques. This difference in lineage might lead to distinct evolutionary trajectories in cognitive processing, which could include varying sensitivity to abstract social cues in animations.

In summary, our study reveals novel insights into how New World marmosets, akin to humans, differentially process abstract animations that depict complex social interactions and animations that display purely physical or random movements. Our findings, supported by both specific gaze behaviors (i.e., the proportion of time spent on the red triangle, despite the inconclusiveness of overall fixation) and distinct neural activation patterns, shed light on the marmosets' capacity to interpret social cues embedded in these animations.

The differences observed between humans, marmosets, and macaques underscore the diverse cognitive strategies that primate species have evolved to decipher social information. This diversity may be influenced by unique evolutionary pressures that arise from varying social structures and lifestyles. Like macaque monkeys, humans often live in large, hierarchically organized social groups where status influences access to resources. However, both humans and marmosets share a common trait: a high degree of cooperative care for offspring within the group, with individuals other than the biological parents participating in child-rearing. These distinctive social dynamics of marmosets and humans may have driven the development of unique social cognitive abilities. This could explain their enhanced sensitivity to abstract social cues in the Frith-Happé animations.

Nonetheless, it is crucial to emphasize that even though marmosets respond to the social cues in the Frith-Happé animations, this does not automatically imply that they possess mental-state attributions comparable to humans. As such, future research including a range of tasks, from sensory-affective components to more abstract and decoupled representations of others' mental states (Schurz et al., 2020), will be fundamental in further unravelling the complexities of the evolution and functioning of the theory of mind across the primate lineage.”

Reviewer #3 (Recommendations for the authors):

This study is strong in many ways, and the goal is a good one. The below recommendations will help strengthen it further:

We are truly appreciative of the reviewer's thorough assessment and constructive feedback on our manuscript. We have addressed all raised points both in our responses below and in the revised manuscript.

In the abstract, the claim that there is no evidence that nonhuman primates attribute mental states to moving shapes is false. You even cite some of this positive evidence (e.g., Uller, 2004; Atsumi et al., 2015; 2017). There is also evidence that they don't (Kupferberg et al., 2013; Burkart et al., 2012; Schafroth et al., 2021). The abstract would be stronger if written to represent the state of the field more accurately.

We thank the reviewer for drawing our attention to this. We've updated the abstract to reflect the current state of the field and our findings more accurately. Please refer to page 2 for the revised version.

Page 2: “Theory of Mind (ToM) refers to the cognitive ability to attribute mental states to other individuals. This ability extends even to the attribution of mental states to animations featuring simple geometric shapes, such as the Frith-Happé animations in which two triangles move either purposelessly (Random condition), exhibit purely physical movement (Goal-directed condition), or move as if one triangle is reacting to the other triangle’s mental states (ToM condition). While this capacity in humans has been thoroughly established, research on nonhuman primates has yielded inconsistent results.

This study explored how marmosets (Callithrix jacchus), a highly social primate species, process Frith-Happé animations by examining gaze patterns and brain activations of marmosets and humans as they observed these animations. We revealed that both marmosets and humans exhibited longer fixations on one of the triangles in ToM animations, compared to other conditions. However, we did not observe the same pattern of longer overall fixation duration on the ToM animations in marmosets as identified in humans. Furthermore, our findings reveal that both species activated extensive and comparable brain networks when viewing ToM versus Random animations, suggesting that marmosets differentiate between these scenarios similarly to humans. While marmosets did not mimic human overall fixation patterns, their gaze behavior and neural activations indicate a distinction between ToM and non-ToM scenarios. This study expands our understanding of nonhuman primate cognitive abilities, shedding light on potential similarities and differences in ToM processing between marmosets and humans.”

The overall conclusion as stated in the abstract, at the end of the introduction, and in the discussion is not warranted by the evidence. Indeed, the abstract completely fails to mention that the marmosets failed to show the human-like pattern of longer fixations on the ToM videos. Many readers will likely interpret this evidence as primarily against the idea that marmosets view the ToM videos in a human-like way, or as equivocal evidence at best. This report will be a stronger piece of science if it accurately describes the results.

We agree with the reviewer's comment regarding the absence of certain result descriptions in the abstract, introduction, and discussion. In response to this feedback, we have conducted major revisions in these three sections, adding the missing information and elaborating on the overall conclusion. The primary changes can be found on page 2 (abstract), pages 4-5 (introduction), and pages 14 to 20 (Discussion). The revised content reads as follows:

Page 2: “Theory of Mind (ToM) refers to the cognitive ability to attribute mental states to other individuals. This ability extends even to the attribution of mental states to animations featuring simple geometric shapes, such as the Frith-Happé animations in which two triangles move either purposelessly (Random condition), exhibit purely physical movement (Goal-directed condition), or move as if one triangle is reacting to the other triangle’s mental states (ToM condition). While this capacity in humans has been thoroughly established, research on nonhuman primates has yielded inconsistent results.

This study explored how marmosets (Callithrix jacchus), a highly social primate species, process Frith-Happé animations by examining gaze patterns and brain activations of marmosets and humans as they observed these animations. We revealed that both marmosets and humans exhibited longer fixations on one of the triangles in ToM animations, compared to other conditions. However, we did not observe the same pattern of longer overall fixation duration on the ToM animations in marmosets as identified in humans. Furthermore, our findings reveal that both species activated extensive and comparable brain networks when viewing ToM versus Random animations, suggesting that marmosets differentiate between these scenarios similarly to humans. While marmosets did not mimic human overall fixation patterns, their gaze behavior and neural activations indicate a distinction between ToM and non-ToM scenarios. This study expands our understanding of nonhuman primate cognitive abilities, shedding light on potential similarities and differences in ToM processing between marmosets and humans.”

Pages 4-5: “Although the spontaneous attribution of mental states to moving shapes has been well established in humans, it remains uncertain whether other primate species share this capacity. There is some evidence suggesting that monkeys can attribute goals to agents with varying levels of similarity and familiarity to conspecifics, including human agents, monkey robots, moving geometric boxes, animated shapes, and simple moving dots (Atsumi et al., 2017; Atsumi and Nagasaka, 2015; Krupenye and Hare, 2018; Kupferberg et al., 2013; Uller, 2004). However, the findings in this area are somewhat mixed, with some studies investigating the attribution of goals to inanimate moving objects yielding inconclusive results (Atsumi and Nagasaka, 2015; Kupferberg et al., 2013). Nonhuman primates' spontaneous attribution of mental states to Frith-Happé animations is even less certain. While human subjects exhibit longer eye fixations when viewing the ToM condition compared to the Random condition of the Frith-Happé animations (Klein et al., 2009), a recent eye tracking study in macaque monkeys did not observe similar differences (Schafroth et al., 2021). Similarly, a recent fMRI study conducted on macaques found no discernible differences in activations between ToM and random Frith-Happé animations (Roumazeilles et al., 2021).

In this study, we investigated the behaviour and brain activations of New World common marmoset monkeys (Callithrix jacchus) while they viewed Frith-Happé animations. Living in closely-knit family groups, marmosets exhibit significant social parallels with humans, including prosocial behavior, imitation, and cooperative breeding. These characteristics establish them as a promising nonhuman primate model for investigating social cognition (Burkart et al., 2009; Burkart and Finkenwirth, 2015; Miller et al., 2016). To directly compare humans and marmosets in their response to these animations, we employed high-speed video eye-tracking to record eye movements in eleven healthy humans and eleven marmoset monkeys. Additionally, we conducted ultra-high field fMRI scans on ten healthy humans at 7T and six common marmoset monkeys at 9.4T. These combined methods allowed us to examine the visual behavior and brain activations of both species while they observed the Frith-Happé animations.”

Pages 14-15: “In our first experiment, we examined the gaze patterns of marmosets and humans during the viewing of these video animations. Klein et al. (2009) reported differing fixation durations for these animations, where the longest fixations were observed for ToM animations, followed by GD animations and the shortest fixations for Random animations. They further reported that the intentionality score – derived from verbal descriptions of the animations – followed a similar pattern: highest for ToM, lowest for Random, and intermediate for GD animations. This validated the degree of mental state attribution according to the categories and established that animations provoking mentalizing (ToM condition) were associated with long fixations. This, in turn, supports the use of fixation durations as a nonverbal metric for mentalizing capacity (Klein et al., 2009; Meijering et al., 2012). Our results with human subjects, which demonstrated longer fixation durations for the ToM animations compared to the GD and Random animations, paralleled those of Klein et al. (2009). However, unlike Klein et al.'s findings, we did not observe intermediate durations for GD animations in our study.

Interestingly, our marmoset data did not align with the human findings but instead resonated more with Schafroth et al. (2021)'s observations in macaque monkeys, which did not show significant differences in fixation durations across the three animation types.”

Pages 19-20: “In summary, our study reveals novel insights into how New World marmosets, akin to humans, differentially process abstract animations that depict complex social interactions and animations that display purely physical or random movements. Our findings, supported by both specific gaze behaviors (i.e., the proportion of time spent on the red triangle, despite the inconclusiveness of overall fixation) and distinct neural activation patterns, shed light on the marmosets' capacity to interpret social cues embedded in these animations.

The differences observed between humans, marmosets, and macaques underscore the diverse cognitive strategies that primate species have evolved to decipher social information. This diversity may be influenced by unique evolutionary pressures that arise from varying social structures and lifestyles. Like macaque monkeys, humans often live in large, hierarchically organized social groups where status influences access to resources. However, both humans and marmosets share a common trait: a high degree of cooperative care for offspring within the group, with individuals other than the biological parents participating in child-rearing. These distinctive social dynamics of marmosets and humans may have driven the development of unique social cognitive abilities. This could explain their enhanced sensitivity to abstract social cues in the Frith-Happé animations.

Nonetheless, it is crucial to emphasize that even though marmosets respond to the social cues in the Frith-Happé animations, this does not automatically imply that they possess mental-state attributions comparable to humans. As such, future research including a range of tasks, from sensory-affective components to more abstract and decoupled representations of others' mental states (Schurz et al., 2020), will be fundamental in further unravelling the complexities of the evolution and functioning of the theory of mind across the primate lineage.”

The justification for looking in marmosets could be read to imply that macaque monkeys do not live in family groups or share important social similarities with humans. Both species share many social similarities (and many social differences) with humans. Marmosets are a good species to study; this section would benefit from a more accurate rationale.

We apologize for any confusion our previous wording may have caused. We have now revised the sentence on page 4 to enhance its clarity and accuracy, which now reads:

Page 4: “Living in closely-knit family groups, marmosets exhibit significant social parallels with humans, including prosocial behavior, imitation, and cooperative breeding. These characteristics establish them as a promising nonhuman primate model for investigating social cognition (Burkart et al., 2009; Burkart and Finkenwirth, 2015; Miller et al., 2016).”

Because it is one of the main metrics in the Klein and Schafroth papers, and thus readers will want to see it for sake of comparison, the authors should include a figure showing the overall fixation durations as a function of category and species.

In response to the reviewer’s suggestion, we have added a new figure (Figure 2) on page 39, which presents the overall fixation durations as a function of both animation category and species. This figure provides a direct comparison of fixation durations between humans and marmosets across the different animation conditions (Random, Goal-directed, and ToM). We believe this additional visualization will help readers understand the overall fixation durations across species and conditions and will enable direct comparison with the findings of Klein et al. (2009) and Schafroth et al. (2021).

The results about looking time to the large triangle need to follow up on the interaction between species and conditions so that readers know how to interpret it.

As we have now included the Goal-directed condition in our eye-tracking experiment, we have substantially revised the section “Gaze patterns for Frith-Happé’s ToM, GD and Random animations in humans and marmosets” in the results (pages 5 to 7). In response to the reviewer's comments, we have provided a more detailed analysis of the interaction between species and conditions regarding the looking time spent on the large triangle, which now reads as follows:

Pages 6-7: “To further analyze the gaze patterns of both humans and marmosets, we next measured the proportion of time subjects looked at each of the triangles in the videos (Figure 2B). We conducted mixed ANOVAs on the proportion of time the radial distance between the current gaze position and each triangle was within 4 visual degrees for each triangle separately.

Importantly, we observed a significant interaction between species and condition for the proportion of time spent looking at the large red triangle (F(2,40)=9.83, p<.001, ηp2 = .330). Specifically, both humans (Figure 2B left) and marmosets (Figure 2B right) spent a greater proportion of time looking at the red triangle in ToM compared to the GD and Random videos (For humans, ToM vs GD: Δ=.23, p<.001 and ToM vs Random: Δ=.31, p<.001 ; For marmosets, ToM vs GD: Δ=.13, p<.01 and ToM vs Random: Δ=.13, p<.01). However, while humans also allocated a greater proportion of time to the red triangle in GD compared to Random animations (Δ=.08, p=.05), marmosets did not show any difference between these two conditions (Δ=.0003, p=1).

For the small blue triangle, we also observed a significant interaction of species and condition (F(2,40)=3.54, p=.04, ηp2=.151) but the comparisons were not resistant to the p value adjustment by Bonferroni correction. Therefore, humans and marmosets spent the same proportion of time looking at the blue triangle in the three different types of videos (For humans, ToM vs GD: Δ=-.02, p=1, ToM vs Random: Δ=.04, p=1 and GD vs Random: Δ=.07, p=.23 ; For marmosets, ToM vs GD: Δ=-.05, p=.89, ToM vs Random: Δ=.07, p=.66 and GD vs Random: Δ=-.02, p=1; Figure 2B).

These results highlight the variation in gaze patterns observed in both humans and marmosets when their focus is directed towards the large red triangle during the viewing of ToM, GD, and Random videos. Notably, humans show a gradient of proportion of time spent looking at the red triangle across the three conditions, with the smallest proportion in Random videos and the greatest proportion in ToM videos. In contrast, marmosets exhibit a different pattern, spending more time looking at the red triangle in ToM videos, but allocating the same proportion of time to look at the red triangle in both Random and GD videos. This finding suggests that while humans demonstrate distinct attentional preferences for the red triangle across the three conditions, marmosets exhibit a similar attentional focus on the red triangle in the Random and GD conditions, but their pattern differs in the ToM condition. This suggests that marmosets process the Random and GD conditions in a similar manner, but their processing of the ToM condition is distinct, indicating a differential response to stimuli representing social interactions.”
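The proportion-of-time measure described in the quoted passage can be illustrated with a minimal sketch. It assumes that gaze and triangle positions are already expressed in degrees of visual angle and sampled at the same time points. The function name `proportion_near` and the toy coordinates are hypothetical and are not taken from our analysis code; only the 4-degree radius comes from the text.

```python
import math


def proportion_near(gaze, target, radius_deg=4.0):
    """Fraction of samples in which gaze lies within radius_deg of the target.

    gaze, target: equal-length sequences of (x, y) positions in visual degrees.
    """
    if len(gaze) != len(target):
        raise ValueError("gaze and target must have the same number of samples")
    if not gaze:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for (gx, gy), (tx, ty) in zip(gaze, target)
        # Radial (Euclidean) distance between gaze and triangle position.
        if math.hypot(gx - tx, gy - ty) <= radius_deg
    )
    return hits / len(gaze)


# Toy data (hypothetical): four gaze samples and the red triangle's position.
gaze = [(0.0, 0.0), (1.0, 1.0), (10.0, 0.0), (3.0, 0.0)]
red = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 5.0)]
print(proportion_near(gaze, red))
```

In this toy example the gaze is within the radius for the first two samples only. One such proportion per subject, triangle, and condition would then be the dependent variable entered into the mixed ANOVA.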

The sentence on lines 97-99 might be an incomplete sentence.

We appreciate the reviewer's attention to detail and acknowledge the oversight in the sentence structure on lines 97-99. We have revised this sentence and the entire paragraph on pages 5 to 7, under the heading "Gaze patterns for Frith-Happé’s ToM, GD and Random animations in humans and marmosets". This revision takes into account the new results obtained after adding the Goal-Directed condition to the experiment.

Are the bars in Figure 2 meant to add up to 1 for any given participant? If you analyzed the total time fixating on either shape, would marmosets be spending less time looking at the shapes overall than humans?

We thank the reviewer for the question. The values in the previous Figure 2 are not intended to add up to 1. This is because there are instances where the triangles overlap or are close enough to each other that the eye position falls within the defined radius of both simultaneously. In response to the second question, we have conducted an analysis of the total time spent fixating on either shape. Our findings revealed a significant effect of species (F(1,20)=14.38, p=.001, ηp2=.42), indicating that humans spend a greater proportion of time looking at the triangles than marmosets (Δ=.16, p=.001).
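To make concrete why the per-triangle values need not sum to 1, the following sketch counts a gaze sample for both triangles whenever it falls within the radius of each, and computes an "either shape" proportion as the union of the two. This is an illustration under stated assumptions, not our analysis code: the helper names, the toy coordinates, and the choice to define "either shape" as a logical OR per sample are all hypothetical.

```python
import math


def near(gaze_xy, target_xy, radius_deg=4.0):
    """True if the gaze sample is within radius_deg of the target position."""
    dx = gaze_xy[0] - target_xy[0]
    dy = gaze_xy[1] - target_xy[1]
    return math.hypot(dx, dy) <= radius_deg


def gaze_proportions(gaze, red, blue, radius_deg=4.0):
    """Proportion of samples near the red triangle, the blue triangle, or either."""
    n = len(gaze)
    if n == 0 or len(red) != n or len(blue) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    on_red = [near(g, r, radius_deg) for g, r in zip(gaze, red)]
    on_blue = [near(g, b, radius_deg) for g, b in zip(gaze, blue)]
    # A sample near both triangles is counted once in the union.
    on_either = [r or b for r, b in zip(on_red, on_blue)]
    return {
        "red": sum(on_red) / n,
        "blue": sum(on_blue) / n,
        "either": sum(on_either) / n,
    }


# Toy data (hypothetical): the triangles are close together in the first two samples.
gaze = [(0, 0), (0, 0), (0, 0), (20, 20)]
red = [(1, 0), (1, 0), (1, 0), (1, 0)]
blue = [(-1, 0), (-1, 0), (10, 0), (10, 0)]
result = gaze_proportions(gaze, red, blue)
print(result)
```

Here the red and blue proportions are 0.75 and 0.5, which sum to 1.25, while the proportion of samples near either triangle is 0.75.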

Overall, the figures are quite informative and aesthetically pleasing.

We sincerely thank the reviewer for their positive remarks about the figures in our study.

HCP should be explained the first time it is used.

We agree with the reviewer's point that all abbreviations should be clearly explained when first introduced. We have now amended the text to clarify this at the first instance where HCP appears on page 10. We thank the reviewer for bringing this oversight to our attention.

Readers will likely want clarification in cases where the same area showed stronger activation for ToM videos AND Random videos. I assume it was in different voxels in the same larger area, but this could be explicit.

We appreciate the reviewer's suggestion and agree that clarity on this issue is essential.

We have now added a clarification on this point to the two relevant fMRI results sections, for humans and marmosets, on pages 7-8 and 11, so that this is explicit for the reader. The added text reads:

Pages 7-8 (Functional brain activations while watching ToM and Random Frith-Happé’s animations in humans): “Both ToM (Figure 3A) and Random (Figure 3B) videos activated a large bilateral network. While the same larger areas were activated in both conditions, the specific voxels showing this activation within those areas were typically distinct. In some cases, both conditions activated the same voxels, but the degree of activation differed. This suggests a degree of both spatial and intensity variation in the activations for the two conditions within the same areas. (…)”

Page 11 (Functional brain activations while watching ToM and Random Frith-Happé’s animations in marmosets): “Both the ToM (Figure 4A) and Random (Figure 4B) animations activated an extensive network involving a variety of areas in the occipito-temporal, parietal and frontal regions. As in human subjects, it should be noted that while both conditions elicited strong activation in some of the same larger areas, these activations might have either occurred in distinct voxels within those areas, or the same voxels were activated to varying degrees for both conditions. This suggests distinct yet overlapping patterns of neural processing for the ToM and Random conditions.”

The claim that these maps represent "dedicated brain networks" for ToM or Random videos (line 188) is too strong. These brain areas are used for many things.

We agree with the reviewer's concern regarding the term "dedicated brain networks". We understand that the use of this term could be misinterpreted as implying exclusivity, which is not the case. These areas are indeed involved in various cognitive functions. We have modified the statement on page 10 to indicate that these are "brain networks activated during the processing of ToM or Random videos" instead of "dedicated brain networks". We appreciate this valuable input.

For many of the sentences in the imaging results, the comparison needs to be made explicit. For example Line 193 – higher bilateral activation than what? Line 196 – greater activations than what? Line 202 – a larger network than what? Etc.

We appreciate the reviewer's attention to detail in pointing out the need for clear and explicit comparisons in our imaging results. We recognize that some sentences may lack specificity, leading to potential confusion. We have now revised these sentences in the Results section to clearly specify the comparisons being made in each case. The updated descriptions can be found in the Results section, on pages 7 to 12.

The description of Klein et al., (2009) on Lines 289-293 might be read to imply that they were attributing mentalizing without good reason. Klein also collected intentionality scores, which correlated with the viewing metric. This could be rephrased to be more accurate.

Thank you for pointing out the potential misinterpretation of our description of Klein et al., 2009. Our intention was not to undermine the work by Klein et al. We have revised the phrasing in our manuscript, on page 15, to better reflect this aspect of their study:

Page 15: “In our first experiment, we examined the gaze patterns of marmosets and humans during the viewing of these video animations. Klein et al. (2009) reported differing fixation durations for these animations, where the longest fixations were observed for ToM animations, followed by GD animations and the shortest fixations for Random animations. They further reported that the intentionality score – derived from verbal descriptions of the animations – followed a similar pattern: highest for ToM, lowest for Random, and intermediate for GD animations. This validated the degree of mental state attribution according to the categories and established that animations provoking mentalizing (ToM condition) were associated with long fixations. This, in turn, supports the use of fixation durations as a nonverbal metric for mentalizing capacity (Klein et al., 2009; Meijering et al., 2012).”

In general, the discussion could be strengthened by avoiding repeating the results in as much detail.

We appreciate the reviewer's feedback and agree that a more concise discussion could make the manuscript more effective. We have revised the Discussion section to provide a more focused analysis and interpretation of the results, limiting repetition from the Results section.

The updated Discussion section can be found from page 14 to 20 of the revised manuscript.

The inclusion of the authors as subjects is odd. Some readers will view it as a big red flag. The authors clearly know their own hypothesis and likely have a vested interest in a particular outcome. For the strongest report, the authors should remove their own data. At the very least, the authors need to demonstrate that the inclusion/exclusion of their unblinded data doesn't affect the interpretation of the human results.

We appreciate the reviewer's concern about the potential bias introduced by the inclusion of authors as subjects in our study. Taking this into consideration, we have removed the data from the three authors who initially participated in the study. For the eye tracking experiment, we have now introduced a new “Goal-Directed condition” and conducted the experiment with eleven new subjects, none of whom are authors of this study. For the fMRI experiment, we replaced the data from the three author-subjects with data from three additional subjects who were not privy to the study's hypothesis.

Consequently, we replaced the previous results with these new findings and made the necessary modifications to several sections of the manuscript, as well as to Figures 2, 3, and 5 and figure supplements 1 and 2. These updates can be found in the sections: “Gaze patterns for Frith-Happé’s ToM, GD and Random animations in humans and marmosets” on pages 5 to 7, “Functional brain activations while watching ToM and Random Frith-Happé’s animations in humans” on pages 7 to 10, and “Comparison of functional brain activations in humans and marmosets” on pages 13-14. The updated figures can be found on pages 39, 40, 42, 43, and 44, respectively.

We have also updated the participant information in the Methods section, on page 22, to now read:

“Eleven healthy humans (4 females, 25-42 years, mean age: 30.7 years) participated in the eye tracking experiment. Among these, five individuals, along with eight additional subjects (4 females, 26-45 years), took part in the fMRI experiment.”

The method should state whether the subjects had experienced these animations before (e.g., they're shown in some psychology and neuroscience classes).

Thank you for pointing out the necessity to include this information. We understand that the participants' previous exposure to these animations could potentially affect the results. We have added the following sentence to the Methods section:

Page 22: “Importantly, all subjects confirmed they had not previously been exposed to the Frith-Happé animation videos used in our study.”

If the authors proceed in pushing this data without the Goal-Directed videos, they need to at least address their rationale for not testing these videos.

We appreciate this important point, raised also by the first reviewer. As previously mentioned, our initial strategy was to focus on the two extreme conditions: ToM and Random, representing scenarios with and without mental interactions, respectively. This choice was influenced by some previous fMRI studies using the Frith-Happé animated triangles task in humans and macaques, which predominantly examined these two conditions (Gobbini et al., 2007; Barch et al., 2013; Bliksted et al., 2019; Vandewouw et al., 2021; Weiss et al., 2021; Chen et al., 2023; Roumazeilles et al., 2021). We also faced practical challenges in incorporating all conditions with a sufficient number of repetitions into the fMRI task design for marmoset subjects, given the substantial duration of each video clip (19.5 sec). As such, we chose to focus our investigation on the ToM and Random conditions, as the Goal-Directed (GD) condition falls between these two extremes, depicting physical interaction among the triangles without suggesting mental state attribution. Recognizing the potential limitations of not including the GD condition, we conducted an additional eye-tracking experiment that encompassed all three conditions.

We have updated the manuscript to clarify the choice of conditions for the fMRI experiment and to incorporate the findings from the new eye-tracking experiment. Relevant modifications have been made in the Methods, Results, and Discussion sections. These revisions can be found on pages 23-24, 5 to 7, and 14 to 20, respectively.

The description of the monkey reward contingencies needs to be clearer about whether the monkeys were rewarded only during calibration or during videos as well, and whether any reward during videos was contingent on keeping their eyes on the screen.

We apologize for any previous ambiguity in the text. It is important to clarify that the monkeys were rewarded solely during the initial and final stages of the sessions, and no rewards were administered during the calibration or the experiment. Accordingly, we have updated the description in the Methods section, now stated on page 23 as:

“Monkeys were rewarded at the beginning and end of each session. Crucially, no rewards were provided during the calibration or while the videos were played.”

Because this is a social task, when the scans were normalized to MNI space, did the authors divide the human participants into those with and without a paracingulate sulcus?

We thank the reviewer for the insightful question. Indeed, in this study, we normalized the MRI scans of human participants to MNI space, providing a standardized representation of the brain. However, we did not separate participants based on the presence or absence of a paracingulate sulcus in our analysis. Your suggestion to incorporate this anatomical variability is intriguing, especially given its potential implications in social cognition research. We appreciate this thoughtful suggestion and will certainly consider it in our future studies.

The authors need to better specify what counts as a "baseline" for the fMRI comparisons. They should also briefly justify why this is an informative comparison.

Apologies for any confusion regarding our baseline condition in the fMRI comparisons. In this study, our baseline refers to the brain's activity when the subjects are not engaged in the tasks (i.e., viewing ToM or Random animations). More specifically, we have defined the baseline as the brain activity during the presentation of a circular black cue between video clips. Selecting this as the baseline is crucial as it presents a 'resting state' scenario – a state where the brain is not actively engaged in processing task-specific stimuli but is instead in a neutral, alert state. This choice of baseline allows us to identify and compare increased activity in different brain regions during the ToM and Random conditions relative to this resting state. This, in turn, aids our understanding of the specific functional brain regions associated with the processing of these specific conditions. We have now clarified this point in the Methods section of the manuscript on page 30. It now reads:

“First, we identified brain regions involved in the processing of ToM and Random animations by contrasting each condition with a baseline (i.e., ToM condition > baseline and Random condition > baseline contrasts). This baseline brain activation, recorded during the presentation of the circular black cue between video clips (i.e., baseline blocks of 15 sec, see above), reflects 'resting state' activation. By comparing it to the brain activation during ToM and Random animations, we could specifically highlight the task-related activations and isolate brain regions engaged during each condition.”
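
As a rough illustration of what a condition > baseline contrast means, the following is a minimal sketch for a single simulated voxel. It is not the pipeline used in the study. The block durations (15 sec baseline, 19.5 sec clips) come from the text above, but the repetition time `TR`, the block order, the simulated signal, and the helpers `build_design` and `fit_contrasts` are all assumptions. For brevity the sketch uses plain boxcar regressors without hemodynamic response convolution.

```python
import numpy as np

TR = 1.5                          # seconds per volume (assumed value)
BASE_VOLS = round(15.0 / TR)      # volumes per 15 sec baseline block
CLIP_VOLS = round(19.5 / TR)      # volumes per 19.5 sec video clip

def build_design(block_order):
    """Boxcar design matrix with columns [ToM, Random, intercept].

    Each clip is preceded by a baseline block, and the run ends with one.
    Baseline volumes are zero in both task columns, so the intercept
    estimates the baseline level.
    """
    tom, rnd = [], []
    for cond in block_order:
        tom += [0.0] * BASE_VOLS + [1.0 if cond == "ToM" else 0.0] * CLIP_VOLS
        rnd += [0.0] * BASE_VOLS + [1.0 if cond == "Random" else 0.0] * CLIP_VOLS
    tom += [0.0] * BASE_VOLS
    rnd += [0.0] * BASE_VOLS
    return np.column_stack([tom, rnd, np.ones(len(tom))])

def fit_contrasts(y, X):
    """Ordinary least-squares fit; returns contrast estimates for one voxel."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return {
        "ToM>baseline": float(beta[0]),
        "Random>baseline": float(beta[1]),
        "ToM>Random": float(beta[0] - beta[1]),
    }

# Hypothetical run: alternating clips, one voxel with a known response.
X = build_design(["ToM", "Random", "ToM", "Random"])
rng = np.random.default_rng(0)
y = 100.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, X.shape[0])
contrasts = fit_contrasts(y, X)
print(contrasts)
```

Because the baseline volumes carry zeros in both task columns, each task coefficient is directly the signal change relative to the baseline blocks, which is the quantity the two contrasts in the quoted passage test.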

https://doi.org/10.7554/eLife.86327.sa2

Article and author information

Author details

  1. Audrey Dureux

    Centre for Functional and Metabolic Mapping, Robarts Research Institute, University of Western Ontario, London, Canada
    Contribution
    Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    audrey.dureux@gmail.com
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-1687-8347
  2. Alessandro Zanini

    Centre for Functional and Metabolic Mapping, Robarts Research Institute, University of Western Ontario, London, Canada
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Janahan Selvanayagam

    Centre for Functional and Metabolic Mapping, Robarts Research Institute, University of Western Ontario, London, Canada
    Contribution
    Formal analysis, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-3708-8742
  4. Ravi S Menon

    Centre for Functional and Metabolic Mapping, Robarts Research Institute, University of Western Ontario, London, Canada
    Contribution
    Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
  5. Stefan Everling

    1. Centre for Functional and Metabolic Mapping, Robarts Research Institute, University of Western Ontario, London, Canada
    2. Department of Physiology and Pharmacology, University of Western Ontario, London, Canada
    Contribution
    Conceptualization, Supervision, Funding acquisition, Investigation, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-9714-9757

Funding

Canadian Institutes of Health Research (FRN 148365)

  • Stefan Everling

Canada First Research Excellence Fund

  • Stefan Everling

Natural Sciences and Engineering Research Council of Canada (Discovery grant)

  • Stefan Everling

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

Support was provided by the Canadian Institutes of Health Research (FRN 148365), the Canada First Research Excellence Fund to BrainsCAN, and a Discovery grant by the Natural Sciences and Engineering Research Council of Canada. We are grateful to Drs. Sarah White and Uta Frith for access to the Frith-Happé animation videos. We also wish to thank Cheryl Vander Tuin, Whitney Froese, Hannah Pettypiece, and Miranda Bellyou for animal preparation and care, Dr. Alex Li and Trevor Szekeres for scanning assistance, and Dr. Kyle Gilbert and Peter Zeman for coil designs.

Ethics

Human subjects: This study was approved by the Ethics Committee of the University of Western Ontario and subjects were informed about the experimental procedures and provided informed written consent.

All experimental methods described were performed in accordance with the guidelines of the Canadian Council of Animal Care policy and a protocol approved by the Animal Care Committee of the University of Western Ontario Council on Animal Care (#2021-111). Animals were monitored during the acquisition sessions by a veterinary technician.

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Muireann Irish, University of Sydney, Australia

Version history

  1. Preprint posted: January 18, 2023
  2. Received: January 20, 2023
  3. Accepted: July 13, 2023
  4. Accepted Manuscript published: July 14, 2023 (version 1)
  5. Version of Record published: August 17, 2023 (version 2)
  6. Version of Record updated: October 4, 2023 (version 3)

Copyright

© 2023, Dureux et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Audrey Dureux
  2. Alessandro Zanini
  3. Janahan Selvanayagam
  4. Ravi S Menon
  5. Stefan Everling
(2023)
Gaze patterns and brain activations in humans and marmosets in the Frith-Happé theory-of-mind animation task
eLife 12:e86327.
https://doi.org/10.7554/eLife.86327
