Unraveling the developmental dynamic of visual exploration of social interactions in autism

  1. Nada Kojovic (corresponding author)
  2. Sezen Cekic
  3. Santiago Herce Castañón
  4. Martina Franchini
  5. Holger Franz Sperdin
  6. Corrado Sandini
  7. Reem Kais Jan
  8. Daniela Zöller
  9. Lylia Ben Hadid
  10. Daphné Bavelier
  11. Marie Schaer (corresponding author)
  1. Psychiatry Department, Faculty of Medicine, University of Geneva, Switzerland
  2. Faculte de Psychologie et Science de l’Education, University of Geneva, Switzerland
  3. Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Ciudad Universitaria, Mexico
  4. Fondation Pôle Autisme, Switzerland
  5. College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, United Arab Emirates
  6. Bosch Sensortec GmbH, Germany

Abstract

Atypical deployment of social gaze is present early on in toddlers with autism spectrum disorders (ASDs). Yet, studies characterizing the developmental dynamic behind it are scarce. Here, we used a data-driven method to delineate the developmental change in visual exploration of social interaction over childhood years in autism. Longitudinal eye-tracking data were acquired as children with ASD and their typically developing (TD) peers freely explored a short cartoon movie. We found divergent moment-to-moment gaze patterns in children with ASD compared to their TD peers. This divergence was particularly evident in sequences that displayed social interactions between characters and even more so in children with lower developmental and functional levels. The basic visual properties of the animated scene did not account for the enhanced divergence. Over childhood years, these differences dramatically increased to become more idiosyncratic. These findings suggest that social attention should be targeted early in clinical treatments.

Editor's evaluation

This is an important study investigating a rare longitudinal dataset of eye-tracking to a cartoon video, measured in a group of children with autism and a control group that is typically developing. The core finding is a divergence in exploratory gaze onto the video stimulus in the children with ASD compared to typically developing children; this finding is supported by convincing evidence. In addition, the effect appeared to be parametric: those autistic children with the least divergence also had the best adaptive functioning and communication skills. Additional strengths of the study are a relatively large sample size for this type of work and analyses that aim at generalizability. This study will be interesting for autism specialists, but also for a wider community interested in social cognition, affective neuroscience, and developmental disorders.

https://doi.org/10.7554/eLife.85623.sa0

Introduction

Newborns orient to social cues from the first hours of life. They show privileged attention to faces (Simion et al., 2001), face-like stimuli (Goren et al., 1975; Johnson et al., 1991; Valenza et al., 1996), and orient preferentially to biological motion (Simion et al., 2008). This automatic and preferential orientation to social cues early in life is highly adaptive as it provides grounds for developing experience-dependent competencies critical for an individual’s adequate functioning. Social visual engagement is one of the first means of exploration and interaction with the world, preceding and determining more advanced levels of social interaction and autonomy (Klin et al., 2015). Impairments in this elemental skill are one of the core characteristics of ASD, a highly heterogeneous lifelong neurodevelopmental condition (American Psychiatric Association, 2013). Broad impairments in social communication and interaction, along with repetitive behaviors and circumscribed interests, have been suggested to lead to a spectrum of functional disabilities in ASD (Klin et al., 2007). In this regard, atypical social attention strategies may at least partially contribute to the emergence of the ASD phenotype. Many studies using eye-tracking have explored the atypicalities in attentional processes and their contribution to core symptoms in ASD (Chawarska and Shic, 2009; Klin et al., 2003; Falck-Ytter et al., 2013a). Recent meta-analyses concluded that, besides generally reduced social attention (Chita-Tegmark, 2016b), autism is also characterized by atypical attention deployment during the exploration of social stimuli (Chita-Tegmark, 2016a). Indeed, aside from a generally diminished interest in social stimuli, when individuals with ASD do attend to social information, they spend less time exploring key features, such as eyes while showing an increased interest in less relevant cues, such as bodies (Chita-Tegmark, 2016b). 
These atypicalities are observed as early as two months of age (Jones and Klin, 2013) and thus can exert a tremendous impact on downstream developmental processes that critically depend on experience. The exact biological mechanisms that govern the emergence of these aberrant social attention patterns and their course of evolution are currently unknown.

In typical development, following the initial social preference, social attention deployment shows dynamic changes during infancy and early childhood. During their first year of life, infants progressively increase the time spent looking at faces compared to other elements of their environment (Frank et al., 2009). The increasing ability to attend to faces in complex environments has been related to developmental changes in visual attention (Frank et al., 2014). Indeed, during the first year of life, we observe the development of more endogenous, cortically controlled attention (Colombo, 2001), which allows more flexible and controlled displacement of gaze (Hunnius and Geuze, 2004; Hendry et al., 2018; Frank et al., 2014; Helo et al., 2016). Developmental improvement in attentional abilities thus promotes engagement with social targets. Furthermore, the increase in capacity to attend to highly relevant social elements is followed by increased similarity in fixation targets between TD children (Frank et al., 2014). With increasing age, the TD children show more coherence in their visual behavior, as they increasingly focus on similar elements of the scene (Franchak et al., 2016; Frank et al., 2009; Shic et al., 2008). A trend toward progressively more coherent gaze patterns continues into adulthood (Kirkorian et al., 2012; Rider et al., 2018). In other words, despite the impressive complexity of our social environment and the diversity of each individual’s experiences, social visual engagement takes a convergent path across TD individuals, who are increasingly looking at similar elements of the social environment. However, the current understanding of the dynamic of this progressive tuning of gaze patterns is limited by the scarcity of studies using longitudinal designs. Indeed, most studies used cross-sectional designs when inferring developmental patterns, which can be biased by interindividual differences.

With regard to autism, understanding the typical development of social visual exploration is of utmost importance, as the social difficulties associated with ASD result from the cascading effect of a reduced social interest during the child’s development (Dawson et al., 1998; Dawson et al., 2005; Chevallier et al., 2012). Studies focusing on the developmental changes in visual exploration in autism are still rather scant but point to altered maturational changes in orienting to social cues. Attention deployment begins to differ from the age of 2 months in babies who later develop autism, suggesting that divergent trajectories of social visual exploration may start in the first months of life (Jones and Klin, 2013). A study by Shic et al. (2008) highlighted the absence of typical maturational change in face scanning strategies in children with ASD between 2 and 4 years of age. Longitudinal studies focusing on typical and atypical development are thus crucially needed to highlight the underlying developmental mechanisms of atypical attention deployment in ASD. A longitudinal follow-up design would allow the identification of periods of critical changes in visual behavior that can be targeted by early interventions. In addition to the parsing of the developmental patterns, a comprehensive characterization of factors that influence visual behavior in the social context is necessary to understand the mechanisms of atypical attention deployment in autism.

Gaze deployment is mediated by numerous factors acting simultaneously, including bottom-up and top-down processes. Bottom-up mechanisms direct attention to visually prominent elements as a function of their basic properties (such as orientation, intensity, color, and motion) (Itti and Koch, 2000; Itti et al., 2001; Koch and Ullman, 1985), while top-down factors (Itti et al., 2001) are more endogenous in nature and depend on previous experience, motivation, specific task demands, etc. (Yarbus, 1967). The complex interplay between these two processes orchestrates our attention deployment during everyday tasks. We can hypothesize that an imbalance, such as an enhanced focus on bottom-up properties of visual content, may be at the origin of atypical social attention in autism, driving it away from conventional social targets. Indeed, it has been shown that in the context of naturalistic static scenes, children and adults with ASD tend to focus more on basic, pixel-level properties than on semantic categories, compared to their TD peers (Amso et al., 2014; Wang et al., 2015). However, less is known of the contribution of these basic properties to real-time visual exploration of dynamic content, as static stimuli only allow limited inference to the real-world dynamic deployment of attention. Studies using dynamic social content are rare and point to somewhat contrasting results compared to the ones using static stimuli. For example, it has been shown that in the context of dynamic social content, preschoolers with ASD tend to focus less on the motion properties of the scene and more on luminance intensity compared to age-matched TD children (Shic et al., 2007). However, there is currently no consensus in the literature on the relative predominance of bottom-up and top-down properties in generating aberrant visual exploration. These two processes were mostly analyzed separately, and studies using ecological dynamic stimuli are scarce. 
Hence, another important element is the content type, as it dramatically influences the attentional processes summoned. For instance, non-social content is prone to elicit more heterogeneous patterns of exploration (Wang et al., 2018). On the other hand, the social content of higher complexity induces more divergence in gaze deployment in TD (Wang et al., 2018) while giving rise to atypicalities in visual attention deployment in ASD (Chawarska et al., 2012; Chita-Tegmark, 2016b).

Measures of gaze deployment (e.g. time spent on the face or eyes) provided valuable insight into the specificity of social attention patterns in autism (Klin et al., 2002). These measures reflect the ‘macrostructure’ (Guillon et al., 2014) of the gaze deployment by quantifying the overall time spent exploring a predefined scene region. However, complementary to the ‘what’ of gaze, the ‘when’ of it is of equal importance as the demands in the real world come online and require a timely response. We attend to only a limited amount of elements from a breadth of possibilities, and what finds the way to our perception will dramatically influence the meaning we attribute to the social situation. Recent studies have provided important advances in our understanding of the mechanisms that control what we select to gaze upon on a moment-to-moment basis (Constantino et al., 2017; Kennedy et al., 2017). Quite strikingly, while viewing social scenes, toddler and school-age twins showed a high concordance not solely in the direction but also in the timing of their gaze movements (Constantino et al., 2017; Kennedy et al., 2017). Thus, subtle variations in the visual exploration of social scenes are strongly influenced by genetic factors that favor the selection of crucial social information (Constantino et al., 2017). The continuous active selection of pertinent elements from the abundance of possibilities is critical for the interactive specialization of our brain (Johnson, 2001) and significantly affects how our internal world is shaped. Only a few studies tackled the question of the moment-to-moment gaze deployment in ASD compared to TD. Indeed, while on this microstructural level, TD children and adults show coherence in fixation targets, the fine-grained gaze dynamic in their peers with ASD is highly idiosyncratic and heterogeneous (Nakano et al., 2010; Falck-Ytter and von Hofsten, 2011; Wang et al., 2018; Avni et al., 2020). 
Atypicalities in the fine-grained extraction of social information may have important consequences on learning opportunities and social functioning (Schultz, 2005). Overall, these findings urge for a better characterization of the underlying mechanisms and factors that contribute to coherence in visual patterns in typical development at different timescales, over months and years but also at the microstructural level (moment-to-moment) as a gateway for understanding the emergence of atypical gaze patterns in autism.

In the current study, we opted for a comprehensive approach to characterize atypical visual exploration in a large sample of 166 children with ASD (1.7–6.9 years old) compared to their age-matched TD peers (1.7–6.8 years old) by considering both bottom-up and top-down processes. We first measured the divergence from referent gaze patterns (obtained from the TD children) in autism on a microstructural level (moment-to-moment) and over larger temporal scales, measuring the developmental change during early childhood. We quantified the divergence between gaze patterns among the two groups of children while watching a cartoon depicting social interaction using a custom data-driven approach used in our previous studies (Sperdin et al., 2018; Jan et al., 2019; Kojovic et al., 2019). We estimated the relative contribution of basic visual properties of the scene to the visual exploration of this dynamic social scene in both groups. Finally, we measured the contribution of the different features of the video content (visual and social complexity, directedness of speech) to the divergence from the referent gazing patterns in the ASD group. We further measured the developmental change in visual exploration in young children with ASD and their TD peers using the yearly follow-up eye-tracking recordings.

Results

Divergence from the typical gazing patterns, its relation to clinical phenotype and movie properties

Moment-by-moment divergence from the referent gazing patterns

Gaze data from 166 males with ASD (3.37 ±1.16 years) were recorded while children watched a 3 min episode of the French cartoon Trotro (Lezoray, 2013). The cartoon depicts social interaction between the three donkey characters at a relatively slow pace. We were interested in capturing the difference in moment-to-moment gaze deployment in ASD children compared to the TD group while watching this animated social scene. For this, we compared the gaze allocation of each child with ASD to the referent gaze patterns obtained from 51 age-matched TD males (3.48 ±1.29 years) who watched the same social scene. Referent gaze patterns (‘reference’) were obtained by applying the probability density estimation function (Botev et al., 2010) to gaze data from the TD group on each frame. Hence, for each child with ASD, we obtained a measure indicating the closeness to the reference, which we denote the Proximity Index (PI) (see Figure 1 and Methods section for a detailed explanation). Lower PI values indicate a higher divergence from the reference for the given frame. As the obtained measure dynamically determines the proximity to the referent gaze distribution, there is no need to define areas of interest based on theoretical priors. Moreover, as will be detailed further, this method allowed us to flexibly redefine the referent gaze distribution by constraining the reference sample to a specific age range or group.
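The per-frame logic described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the PI is the TD gaze density at the child's gaze location divided by the peak density on a regular grid, and it uses SciPy's Gaussian kernel density estimate as a stand-in for the estimator of Botev et al. (2010). The function name `proximity_index`, the grid resolution, and the unit-square screen coordinates are all hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def proximity_index(td_gaze, child_gaze, grid_size=64, width=1.0, height=1.0):
    """Proximity Index for a single frame (hypothetical helper).

    td_gaze    : array (n_children, 2) of TD gaze coordinates (x, y).
    child_gaze : (x, y) gaze coordinate of the child being scored.
    Returns a value in [0, 1]: the reference density at the child's gaze
    location divided by the peak reference density on a regular grid.
    """
    td_gaze = np.asarray(td_gaze, dtype=float)
    # Assumption: Gaussian KDE replaces the diffusion-based estimator
    # cited in the text.
    kde = gaussian_kde(td_gaze.T)

    # Evaluate the reference density on a regular grid to find its peak.
    xs = np.linspace(0.0, width, grid_size)
    ys = np.linspace(0.0, height, grid_size)
    gx, gy = np.meshgrid(xs, ys)
    peak = kde(np.vstack([gx.ravel(), gy.ravel()])).max()

    # Density at the scored child's gaze location, scaled by the peak.
    value = kde(np.asarray(child_gaze, dtype=float).reshape(2, 1))[0]
    return float(min(value / peak, 1.0))
```

Under these assumptions, a gaze point in the densest part of the reference scores close to 1 and a point far from any TD gaze scores close to 0, which mirrors the two example frames described for Figure 1.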

Proximity Index method illustration.

Referent gaze data distribution (‘reference’) was created using gaze coordinates from 51 typically developing (TD) males (aged 3.48±1.29 years old). Upper row: two example frames with gaze coordinates of TD children (blue dots) used to define the ‘reference’ (delimited by contours) and gaze data from a three-year-old male with autism spectrum disorder (ASD) (whose gaze coordinates are depicted as a red circle). Hotter contour color indicates the area of higher density of distribution of gaze in the TD group, meaning that a particular area was more appealing for a higher number of TD preschoolers for the given frame; the Proximity Index value for the 3-year-old male with ASD for the frame on the left had a value of 0.39 and for the frame on the right a value of 0. Lower row: Proximity Index values for the visual exploration of the 3-year-old boy with ASD over the entire video with the mean Proximity Index value indicated by the dashed red lines.

As the reference TD group was a convenience sample, we ran a bootstrap analysis to ensure that the obtained referent distribution was not affected by sample size (see Appendix 1 for more details). According to our stability analyses, the sample size of 51 TD children allows us to define the reference with enough stability, considering it is more than two times bigger than the estimated smallest stable sample size of 18.

As the gaze data of the TD group were used as a reference, we wanted to understand how their individual gazing patterns would behave compared to a fixed average. To this end, we employed the leave-one-out method to obtain the PI value for each of the 51 TD children. In this manner, the gazing pattern of each TD child was compared to the reference created by the gaze data of the 50 other TD children. The difference in average PI values between the two groups was significant, t(215)=5.51, p<0.001 (Figure 2).
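A leave-one-out loop of this kind might look as follows. This is a sketch under stated assumptions rather than the study's code: `frame_score` is a hypothetical scoring function (Gaussian kernel density scaled by the highest density at any reference gaze point), and missing gaze samples are not handled.

```python
import numpy as np
from scipy.stats import gaussian_kde


def frame_score(reference_gaze, point):
    """Reference density at `point`, scaled by the highest density observed
    at any reference gaze location (hypothetical scoring rule)."""
    kde = gaussian_kde(reference_gaze.T)
    peak = kde(reference_gaze.T).max()
    return float(min(kde(point.reshape(2, 1))[0] / peak, 1.0))


def leave_one_out_scores(td_gaze):
    """td_gaze: array (n_children, n_frames, 2).

    Returns an array (n_children,) of mean scores, where each child is
    scored against a reference built from all the other children.
    """
    n_children, n_frames, _ = td_gaze.shape
    scores = np.empty(n_children)
    for i in range(n_children):
        others = np.delete(td_gaze, i, axis=0)  # reference without child i
        per_frame = [frame_score(others[:, f, :], td_gaze[i, f, :])
                     for f in range(n_frames)]
        scores[i] = np.mean(per_frame)
    return scores
```

The resulting per-child means could then be compared with the ASD group's means using a two-sample t-test (for example `scipy.stats.ttest_ind`); with 51 and 166 children this gives the 215 degrees of freedom reported above.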

Mean proximity index (PI) comparison between groups.

Violin plots illustrate the distribution of Proximity Index (PI) values for two groups: typically developing (TD) in blue (n = 51) and autism spectrum disorder (ASD) in red (n = 166). The error bars on each plot represent the 95% confidence intervals around the means. Statistical significance of the differences between means was assessed using a two-sample t-test. The PI values for the TD group were derived using a leave-one-out approach, whereas the PI for each ASD child was calculated based on the referent gaze data from the 51 TD children in the original sample.

Less divergence in visual exploration is associated with better overall functioning in children with ASD

To explore how the gaze patterns, specifically divergence in the way children with ASD attended to the social content, related to the child’s functioning, we conducted a multivariate analysis. We opted for this approach to obtain a holistic vision of the relationship between visual exploration, as measured by PI, and different features of the complex behavioral phenotype in ASD. Behavioral phenotype included the measure of autistic symptoms and the developmental and functional status of the children with ASD. Individuals with ASD often present lower levels of adaptive functioning (Bal et al., 2015; Franchini et al., 2018) and this despite cognitive potential (Klin et al., 2007). Understanding factors that contribute to better adaptive functioning in very young children is of utmost importance (Franchini et al., 2018) given the important predictive value of adaptive functioning on later quality of life. The association between behavioral phenotype and PI was examined using the PLS-C analysis (Krishnan et al., 2011; McIntosh and Lobaugh, 2004). This method extracts commonalities between two data sets by deriving latent variables representing the optimal linear combinations of the variables of the compared data sets. We built the cross-correlation matrix using the PI on the left (A) and 12 behavioral phenotype variables on the right (B) side (see Methods section for more details on the analysis).
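As a rough illustration of the PLS-C idea, the sketch below builds the cross-correlation matrix between two z-scored data sets, decomposes it with a singular value decomposition, and tests the first singular value by permuting the rows of one set. It is a generic textbook-style sketch, not the pipeline used in the study: the function names are hypothetical, and age regression, imputation, and bootstrap estimation of loading stability are omitted.

```python
import numpy as np


def zscore(a):
    """Column-wise z-scoring."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)


def plsc(x, y):
    """PLS correlation between x (n, p) and y (n, q).

    Returns the saliences of x, the singular values, the saliences of y,
    and the latent scores of both sets. Signs of the saliences are
    arbitrary, as in any SVD.
    """
    zx, zy = zscore(x), zscore(y)
    r = zx.T @ zy / (len(x) - 1)              # (p, q) cross-correlation
    u, s, vt = np.linalg.svd(r, full_matrices=False)
    return u, s, vt.T, zx @ u, zy @ vt.T


def permutation_p(x, y, n_perm=1000, seed=0):
    """P-value of the first singular value under row permutations of x."""
    rng = np.random.default_rng(seed)
    observed = plsc(x, y)[1][0]
    null = np.array([plsc(x[rng.permutation(len(x))], y)[1][0]
                     for _ in range(n_perm)])
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```

In the setting described above, `x` would be a single column holding the PI and `y` the 12 behavioral variables, so the cross-correlation matrix has one row and a single latent component is obtained.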

In our cohort, child autistic symptoms were assessed using the ADOS (Lord et al., 2000; Lord et al., 2012), child developmental functioning using the PEP-3 scale (Schopler, 2005), and child adaptive behavior using the Vineland Adaptive Behavior Scales, Second Edition (VABS-II; Sparrow et al., 2005). Thus, the final behavior matrix included two domains of autistic symptoms from the ADOS: social affect (SA) and repetitive and restricted behaviors (RRB); six subscales of the PEP-3: verbal and preverbal cognition (VPC), expressive language (EL), receptive language (RL), fine motor skills (FM), gross motor skills (GM), and oculomotor imitation (OMI); and four domains from the VABS-II: communication (COM), daily living skills (DAI), socialization (SOC), and motor skills (MOT). Age was regressed from both sets of the imputed data.

The PLS-C yielded one significant latent component (r=0.331, p=0.001), best explaining the cross-correlation pattern between the PI and the behavioral phenotype in the ASD group. The significance of the latent component was tested using 1000 permutations, and the stability of the obtained loadings was tested using 1000 bootstrap resamples. Behavioral characteristics that showed stable contributions to the pattern reflected in the latent component are shown in red in Figure 3. Higher values of the PI were found in children with better developmental functioning across all six assessed domains and better adaptive functioning across all four assessed domains. Autistic symptoms did not produce a stable enough contribution to the pattern (loadings shown as gray bars in Figure 3). Still, numerically, a more TD-like gazing pattern (high PI) was seen in the presence of fewer ASD symptoms (negative loading of both SA and RRB scales of the ADOS-2). Despite the lack of stability of this pattern, the loading directionality of ASD symptoms is in line with the previous literature (Wen et al., 2022; Avni et al., 2020), showing a negative relationship between visual behavior and social impairment. Among the developmental scales, the biggest loading was found on verbal and preverbal cognition, followed by fine motor skills. While the involvement of verbal and preverbal cognition in the PI, an index of visual exploration of these complex social scenes, is no surprise, the role of fine motor skills might be harder to grasp. Interestingly, in addition to measuring the control of hand and wrist small muscle groups, the fine motor scale also reflects the capacity of the child to stay focused on the activity while performing controlled actions. 
Thus, besides the measure of movement control, which is relevant because scene viewing implies control of eye movements, the attentional component measured by this scale might explain the high involvement of the fine motor scale in the latent construct pattern we obtained.

Proximity Index and its relation to behavioral phenotype in children with autism spectrum disorder (ASD).

Loadings on the latent component were derived using partial least squares correlation analysis in the sample of 166 children with ASD. The cross-correlation matrix consisted of the Proximity Index on the imaging (A) side and 12 variables on the behavior (B) side. The behavioral matrix encompassed two domains of autistic symptoms assessed by ADOS-2: Social Affect (SA) and Repetitive and Restricted Behaviors (RRB); six subscales of the PEP-3: Verbal and Preverbal Cognition (VPC), Expressive Language (EL), Receptive Language (RL), Fine Motor Skills (FM), Gross Motor Skills (GM), and Oculomotor Imitation (OMI); and four domains from VABS-II: Communication (COM), Daily Living Skills (DAI), Socialization (SOC), and Motor Skills (MOT). Age was controlled for by regressing it out from both sides (A and B) of the cross-correlation matrix. There was a positive correlation between the Proximity Index and all measures of developmental (PEP-3) and adaptive functioning (VABS-II). Error bars represent the bootstrapping 5th to 95th percentiles. Results that were not robust are indicated by a gray boxplot color.

More ambient and less focal fixations in children with ASD compared to the TD group

Next, we wanted to complement our analysis using standard measures of visual behavior. In our cross-sectional sample of 166 males with ASD (3.37 ±1.16 years) and 51 TD males (3.48 ±1.29 years), we did not find any significant difference between groups with regard to the overall number of fixations, saccades, median saccade duration, or saccade amplitude for the duration of the cartoon (p>0.05). However, median fixation duration tended to be slightly higher in TD children compared to the ASD group (t(215) = 1.85, p=0.06), suggesting a more focused attentional style in the TD group. To characterize the predominant attention exploration mode while watching the cartoon, we defined two types of fixations based on their duration and the length of the preceding saccade. Thus, using the thresholds of Unema et al. (2005), a fixation was considered ‘focal’ if it was longer than 180 ms and preceded by a saccade with an amplitude smaller than 5° of visual angle. Shorter fixations (<180 ms) preceded by a longer saccade (>5°) were classified as ‘ambient.’ We then obtained the proportion of these two fixation types normalized for the overall fixation number. In the ASD group, we observed significantly more ambient fixations (Mann-Whitney test: U=2530, p<0.001) compared to the TD group. The TD group showed more focal fixations (U=2345, p<0.001) in comparison to the ASD group. In both groups, focal fixations were more frequent than ambient ones (p<0.001) (see Figure 4A1). A higher proportion of focal fixations was positively correlated with higher values of the Proximity Index in both groups (rTD=0.459, rASD = 0.434, p<0.001), while the opposite relationship was evidenced between the Proximity Index and the proportion of ambient fixations (rTD=–0.400, rASD = –0.31, p=0.002) (see Figure 4, panels A2 and A3). Compared to the ASD group, the TD group stays less in the ‘shallow’ exploration mode reflected by the ambient fixations. 
This exploration mode is deployed first to quickly extract the gist of a scene before a more in-depth scene analysis is carried out through focal fixations. Thus, our findings suggest that, while the gist of the scene is rapidly extracted in the TD group, children in the ASD group spend significantly more time in the exploration mode, searching for where to direct deeper attention. Consequently, they spend less time in the focused mode of attention than the TD group.
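The focal/ambient classification rule described above can be sketched in a few lines. The thresholds (180 ms and 5°) come from the text; the function name and the input format (one duration and one preceding-saccade amplitude per fixation) are assumptions made for illustration.

```python
import numpy as np


def fixation_mode_proportions(durations_ms, preceding_saccade_deg,
                              duration_thr=180.0, amplitude_thr=5.0):
    """Return (focal, ambient) proportions over all fixations.

    durations_ms          : fixation durations in milliseconds.
    preceding_saccade_deg : amplitude of the saccade preceding each
                            fixation, in degrees of visual angle.
    """
    durations_ms = np.asarray(durations_ms, dtype=float)
    preceding_saccade_deg = np.asarray(preceding_saccade_deg, dtype=float)

    # Focal: long fixation preceded by a short saccade.
    focal = (durations_ms > duration_thr) & (preceding_saccade_deg < amplitude_thr)
    # Ambient: short fixation preceded by a long saccade.
    ambient = (durations_ms < duration_thr) & (preceding_saccade_deg > amplitude_thr)

    n = durations_ms.size  # normalize by the overall number of fixations
    return focal.sum() / n, ambient.sum() / n
```

Fixations that meet neither rule (for example, a long fixation after a long saccade) remain unclassified in this sketch, so the two proportions need not sum to 1.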

Focal and ambient fixation modes, between-group comparison, and their relation to the Proximity Index (PI) across ASD and TD groups.

(A1) Relative proportion of focal and ambient fixations in a sample of 51 TD children and 166 ASD children. Box-and-whisker plots illustrate the distribution of fixation proportions. The interquartile range (IQR) is represented by each box, with the median shown as a horizontal line. Whiskers extend to the most extreme data points within 1.5 IQR from the box, as per Tukey's method. Differences between groups were statistically assessed using the Mann-Whitney U test, with asterisks (****) indicating p-values less than 0.0001. (A2 & A3): Scatter plots show the correlation between the proportion of focal (A2) or ambient (A3) fixations and PI. Red points represent ASD individuals and blue points represent TD individuals. Spearman's correlation was used for analysis. Each group's data is fitted with its own linear regression line and includes 95% confidence bands.

The relative contribution of the basic visual properties of the animated scene to gaze allocation in ASD and TD children

We next measured the group difference in the relative contribution of basic visual properties of the scene to visual exploration. Previous studies in adults with ASD have shown that these basic properties play an important role in directing gaze in ASD individuals while viewing naturalistic images (Amso et al., 2014; Wang et al., 2015). Less is known about the contribution of the basic scene properties to gaze allocation while viewing dynamic content. Moreover, besides using static stimuli, most studies focused on the adult population, while the early developmental dynamics of these mechanisms remain elusive. Therefore, we extracted the values of five salience features (intensity, orientation, color, flicker, and motion) for each frame of the video using a variant of the biologically inspired salience model, namely graph-based visual saliency (GBVS) (Harel et al., 2006), as explained in detail in the Methods section. We calculated salience measures for our cross-sectional sample of 166 males with ASD and 51 age-matched TD males individually for each frame. For each channel (intensity, orientation, color, flicker, and motion) as well as the full model (linear combination of all five channels), we calculated the area under a receiver operating characteristic (ROC) curve (Green and Swets, 1966). The mean ROC value was then used to compare the two groups.
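One common way to score a salience map against fixations is sketched below. It assumes, for illustration only, that the fixated pixels form the positive class and all remaining pixels the negative class, and it computes the area under the ROC curve through its rank (Mann-Whitney) formulation. The exact procedure used in the study is described in its Methods section; `salience_auc` is a hypothetical name.

```python
import numpy as np
from scipy.stats import rankdata


def salience_auc(salience_map, fixation_xy):
    """Area under the ROC curve for one frame.

    salience_map : array (H, W) of salience values.
    fixation_xy  : array (n, 2) of integer pixel coordinates (x=col, y=row).
    Returns the probability that a fixated pixel has higher salience than
    a non-fixated pixel (ties count as one half).
    """
    fixation_xy = np.asarray(fixation_xy, dtype=int)
    fixated = np.zeros(salience_map.shape, dtype=bool)
    fixated[fixation_xy[:, 1], fixation_xy[:, 0]] = True

    pos = salience_map[fixated]    # salience at fixated pixels
    neg = salience_map[~fixated]   # salience everywhere else

    # Rank formulation of the AUC; average ranks handle ties.
    ranks = rankdata(np.concatenate([pos, neg]))
    u = ranks[:pos.size].sum() - pos.size * (pos.size + 1) / 2
    return u / (pos.size * neg.size)
```

A value of 0.5 corresponds to chance, and values approaching 1 mean that gaze lands on the most salient regions of the frame.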

Contrary to our hypothesis, for all channels taken individually as well as for the full model, the salience model better predicted gaze allocation in the TD group compared to the ASD group (Wilcoxon paired test, p<0.001; Figure 5). The effect sizes (r=Z/√N; Rosenthal, 1991) of this difference were most pronounced for the flicker channel (r=0.182), followed by the orientation channel (r=0.149), the full model (r=0.132), intensity (r=0.099), color (r=0.083), and lastly motion (r=0.066) (Appendix 2). The finding that the salience model better predicted gaze location in the TD group than in the ASD group was not expected based on the previous literature. Still, most previous studies used static stimuli, and the processes implicated in the processing of dynamic content are very different. The salience model itself was validated on the adult vision system. It might be that the gaze in TD better approximates the adult, mature gaze behavior than the gaze behavior in the ASD group.
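For reference, the effect size attributed to Rosenthal (1991) is conventionally computed as r = Z/√N, where Z is the standardized test statistic. The sketch below derives Z for a paired Wilcoxon signed-rank comparison using the normal approximation without tie correction, and takes N to be the number of non-zero paired differences; both choices are assumptions (some conventions use the total number of observations), and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def wilcoxon_effect_size(a, b):
    """Return (z, r) for paired samples a and b, with r = z / sqrt(n)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = d[d != 0]                      # zero differences are discarded
    n = d.size

    ranks = rankdata(np.abs(d))        # ranks of absolute differences
    w_plus = ranks[d > 0].sum()        # sum of ranks of positive differences

    # Normal approximation of the signed-rank statistic (no tie correction).
    mean_w = n * (n + 1) / 4
    sd_w = np.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    return z, z / np.sqrt(n)
```

Applied framewise, `a` and `b` would hold the mean ROC values of the two groups for each frame, so a positive `r` indicates higher values in the first group.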

Visual salience group differences.

(A) Illustration of the graph-based visual saliency (GBVS) salience model (Full model combining five channels: I-Intensity, O-Orientation, C-Color, F-Flicker, M-Motion). From top to bottom: Saliency map extracted for a given frame, Saliency map overlay on the original image, Original image with 15% most salient parts shown. (B) Box plot depicting mean receiver operating characteristic (ROC) values, derived framewise from full salience maps and fixation coordinates (x,y), for a sample of 51 TD (Typically Developing) and 166 ASD (Autism Spectrum Disorder) children. Boxes indicate the interquartile range (IQR) and medians are shown as horizontal lines within the boxes. Whiskers extend to the farthest data points not exceeding 1.5 times the IQR from the box edges, in line with Tukey's method. Framewise statistical between-group differences were evaluated using the Wilcoxon paired test, with asterisks (***) indicating p-values less than 0.001. Effect size is calculated using the formula r=Z/√N (Rosenthal, 1991).

The association of movie content with divergence in visual exploration in ASD group

Taking into account previous findings of enhanced difficulties in processing more complex social information (Frank et al., 2012; Chita-Tegmark, 2016b; Parish-Morris et al., 2019) in individuals with ASD, we tested how the intensity of social content influenced visual exploration of the given social scene. As detailed in the Methods section, social complexity was defined as the total number of characters for a given frame and ranged from 1 to 3. Frames with no characters represented a negligible fraction (0.02% of total video duration) and were excluded from the analysis. We also analyzed the influence of the overall visual complexity of the scene on this divergent visual exploration in the ASD group. The total length of edges defining details on the images was employed as a proxy for visual complexity (see Methods section for more details). Additionally, we identified the moments of vocalization (monologues versus directed speech) and more global characteristics of the scene (frame cuts and sliding background) to better understand how these elements might have influenced gaze allocation. Finally, as an additional measure, we considered how well the gaze of ASD children was predicted by the GBVS salience model, that is, the average ROC scores we derived in the previous section (Figure 6, panel A).

Proximity Index and its relation to movie content.

(A) From top to bottom: In red, the average proximity index (PI) from 166 children with autism spectrum disorder (ASD) over time frames. Red-shaded regions denote a 95% confidence interval of the mean, gray-shaded regions mark the moments of the significant drop in mean values of the PI (below 2.5 SD compared to the theoretical mean of 1); Dark blue: Visual complexity over time frames; Green: Social complexity over time frames; the last panel denotes moments of the movie with the monologue, directed speech, frame switching, or moments involving moving background. (B) PLS-C illustration with PI on the A side and, on the B side: visual complexity, social complexity, monologue, directed speech, frame switch, moving background, and the graph-based visual saliency (GBVS) model-derived receiver operating characteristic (ROC) scores for children with ASD (average ROC framewise). The PI correlated positively with monologue, frame switch, moving background, and visual salience, and negatively with social complexity, visual complexity, and directed speech. Error bars represent the bootstrapping 5th to 95th percentiles.

To explore the relationship between the PI and different measures of the movie content, we used, as previously, a PLS-C analysis, which is more suitable than the GLM in the case of strong collinearity among the regressors. This is particularly the case for visual and social complexity (r=0.763, p<0.001), as well as social complexity and vocalization (r=0.223, p<0.001), as can be appreciated in Figure 6, panel B. The PLS-C produced one significant latent component (r=0.331, p<0.001). The latent component pattern was such that lower PI was related to higher social complexity, followed by higher visual complexity and the presence of directed speech. In addition, moments including characters engaged in monologue, moments of frame change, and background sliding increased the PI in the group of ASD children. The monologue scenes also coincide with the moments of lowest social complexity, which produce higher PI values. For the frame switch and the sliding background, the TD reference appears more dispersed, as children may recalibrate their attention onto the new or changing scene; this makes the referent gaze distribution more variable in these moments and, because the reference space is larger, gives the gaze of children with ASD more chance to fall within it. Finally, visual salience also positively contributed to the PI loading, which is in line with our previous finding of the salience model being more successful in predicting TD gaze than ASD gaze.
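To make the structure of such an analysis concrete: when a single variable (here the PI) sits on side A, PLS-C reduces to normalizing the vector of correlations between that variable and each side-B measure. The sketch below illustrates this special case only. It is not the authors' implementation; all function names and the toy data are hypothetical, and the general multi-variable case would require a singular value decomposition.

```python
import math


def zscore(values):
    """Standardize a sequence to zero mean and unit (sample) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation computed from z-scores."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def plsc_single_x(x, y_columns):
    """PLS-C with one variable on side A and several on side B.

    Returns (singular_value, saliences, latent_correlation).
    """
    # Cross-correlation "matrix" is a single row: corr(x, y_j) for each j.
    r = [pearson(x, col) for col in y_columns]
    # Its only singular value is the Euclidean norm of that row.
    singular_value = math.sqrt(sum(c * c for c in r))
    # Side-B saliences (loadings) are the normalized correlations.
    saliences = [c / singular_value for c in r]
    # Latent scores: side A is z(x); side B is the salience-weighted sum.
    zx = zscore(x)
    zy = [zscore(col) for col in y_columns]
    ly = [sum(w * col[i] for w, col in zip(saliences, zy))
          for i in range(len(x))]
    return singular_value, saliences, pearson(zx, ly)


# Toy data (assumed): one positively and one negatively related measure.
pi = [1, 2, 3, 4, 5, 6]
measures = [[2, 1, 4, 3, 6, 5], [6, 5, 5, 3, 2, 1]]
sv, sal, latent_r = plsc_single_x(pi, measures)
print(sal, latent_r)
```

In this reduced form, the sign of each salience directly mirrors the sign of the corresponding correlation with the PI, which is how the positive and negative loadings described above can be read.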

Developmental patterns of visual exploration

More divergence in visual exploration is associated with unfolding autistic symptomatology a year later

To capture the developmental change in the PI and its relation to clinical phenotype, we conducted a multivariate analysis considering only the subjects who had valid eye-tracking recordings at two time points one year apart. Out of 94 eligible children (having two valid eye-tracking recordings a year apart), 81 had a complete set of phenotype measures. All 94 children had an ADOS, but ten children were missing PEP-3 (nine were assessed using Mullen Scales of Early Learning [Mullen, 1995], one child was not testable at the initial visit), and three children were missing VABS-II as the parents were not available for the interview at a given visit. The proximity index in this smaller paired longitudinal sample was defined using the age-matched reference composed of 29 TD children spanning the age range of 1.66–5.56 years who also had a valid eye-tracking recording a year later. As the current subsample was smaller than the initial one, we limited our analyses to more global measures, such as domain scales (not the test subscales as in our bigger cross-sectional sample). Thus, for the measure of autistic symptoms, we used the total severity score of ADOS. Cognition was measured using the Verbal and Preverbal Cognition scale of PEP-3 (as the PEP-3 does not provide a more global measure of development; Schopler, 2005) and adaptive functioning using the Adaptive Behavior Composite score of Vineland (Sparrow et al., 2005). To test how the PI relates to the phenotype within and across time points, we built three cross-covariance matrices (T1-PI to T1-symptoms; T1-PI to T2-symptoms; T2-PI to T2-symptoms) with the PI on one side (A) and the measures of autistic symptoms, cognition, and adaptation on the other side (B). As previously, the significance of the patterns was tested using 1000 permutations, and the stability of the significant latent components using 1000 bootstrap samples.
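The permutation and bootstrap steps can be sketched as follows for the simplified case of a single side-A variable. This is an illustrative approximation under stated assumptions, not the authors' pipeline: the function names, the seed, and the toy data are hypothetical, and a full PLS-C bootstrap over several latent components would involve additional steps not shown here.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def singular_value(x, y_columns):
    # With one side-A variable, the singular value is the norm of the
    # vector of correlations with the side-B variables.
    return math.sqrt(sum(pearson(x, col) ** 2 for col in y_columns))


def permutation_p_value(x, y_columns, n_perm=1000, seed=0):
    """Share of permutations whose singular value reaches the observed one."""
    rng = random.Random(seed)
    observed = singular_value(x, y_columns)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between sides A and B
        if singular_value(shuffled, y_columns) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def bootstrap_interval(x, y, n_boot=1000, seed=0):
    """5th and 95th bootstrap percentiles of corr(x, y), resampling subjects."""
    rng = random.Random(seed)
    n = len(x)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    values.sort()
    return values[int(0.05 * (n_boot - 1))], values[int(0.95 * (n_boot - 1))]


# Toy data (assumed): PI values and two phenotype measures for 12 children.
pi = list(range(12))
symptoms = [0, 2, 1, 3, 5, 4, 6, 8, 7, 9, 11, 10]
adaptation = [-v for v in symptoms]
print(permutation_p_value(pi, [symptoms, adaptation]))
print(bootstrap_interval(pi, symptoms))
```

A loading whose bootstrap interval excludes zero would be treated as stable in this simplified reading.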

The PLS-C conducted on simultaneous PI and phenotype measures at the first time point (T1-PI to T1-symptoms) essentially replicated the pattern we observed in the bigger cross-sectional sample. One significant LC (r=0.306, p=0.011) showed higher PI co-occurring with higher cognitive and adaptive measures (see Appendix 4). The cross-covariance matrix relating the PI at T1 to the phenotype at T2 also yielded one significant latent component (r=0.287, p=0.033). Interestingly, the pattern reflected by this LC showed higher loading on the PI co-occurring with lower loading on autistic symptoms. Children who presented lower PI values at T1 were the ones with higher symptom severity at T2. The gaze pattern at T1 was related to neither cognition nor adaptation at T2 (see Figure 7, panel A). Finally, the simultaneous PLS-C done at T2 yielded one significant LC where higher loading of the PI coexisted with negative loading on autistic symptoms and higher positive loading on the adaptation score (r=0.322, p=0.014; Figure 7, panel B). The level of typicality of gaze was related to the symptoms of autism at T2 (mean age of 4.05±0.929) but not at a younger age (mean age of 3.01±0.885). This finding warrants further investigation. On the one hand, the way children with TD comprehend the world changes tremendously during the preschool years, and this directly influences how the typicality of gaze is estimated. On the other hand, the symptoms of autism naturally change over the preschool years, and all these elements can be responsible for the effect we observe.

Proximity Index and its relation to behavioral phenotype in children with autism spectrum disorder (ASD) seen two times a year apart.

The sample comprised 81 children with ASD who had a valid eye-tracking recording and a complete set of behavioral phenotype measures a year after the baseline (T2). The PI for this paired longitudinal cohort was established using an age-matched reference group of 29 typically developing (TD) children. Loadings on the latent component were derived using PLS correlation analysis. The cross-correlation matrix included the proximity index (PI) on the imaging side A and three behavioral variables on the B side. The behavioral matrix accounted for two domains of autistic symptoms as assessed by ADOS-2, Verbal and Preverbal Cognition (VPC) from the PEP-3, and the Adaptive Behavior Composite Score from the VABS-II. Error bars represent the bootstrapping 5th to 95th percentiles. Results that were not robust are indicated by a gray boxplot color. (A) PI obtained at T1 and phenotype measures obtained a year later (T2). PI at T1 positively correlated with reduced symptoms at T2. (B) Simultaneous PLS-C: both PI and phenotype measures were obtained at T2. PI at T2 correlated negatively with symptoms at T2 and positively with adaptive behavior.

Divergent developmental trajectories of visual exploration in children with ASD

After exploring the PI association with various aspects of the behavioral phenotype in ASD children, we were also interested in the developmental pathway of visual exploration in this complex social scene for both groups of children. Previous studies using cross-sectional designs have demonstrated important changes in how children attend to social stimuli depending on their age (Frank et al., 2012; Helo et al., 2014). As our initial sample spanned a relatively large age range (1.7–6.9 years), we wanted to obtain a more fine-grained insight into the developmental dynamic of visual exploration during the given period. To that end, when study-specific inclusion criteria were satisfied, we included longitudinal data from our participants who had a one-year and/or a two-year follow-up visit (see Methods section). With the available 306 recordings for the ASD group and 105 for the TD group, we applied a sliding window approach (Sandini et al., 2018) (see Methods section). Our goal was to discern critical periods of change in the visual exploration of complex social scenes in the ASD group compared to the TD group. We opted for a sliding window approach considering its flexibility to derive a continuous trajectory of visual exploration and thereby capture such non-linear periods. The sliding window approach yielded a total of 59 age-matched, partially overlapping windows for both groups, covering the age range of 1.88–4.28 years (mean age of the window) (Figure 8, panel A illustrates the sliding window method).
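The core idea of a sliding window over age-sorted recordings can be sketched as below. This is a simplified illustration rather than the procedure of Sandini et al., 2018: the window size of 20 follows the figure caption, whereas the step size, the function name, and the toy ages are assumptions, and the age matching between groups is not reproduced here.

```python
from typing import List, Tuple


def sliding_windows(ages: List[float], size: int = 20,
                    step: int = 1) -> List[Tuple[float, List[int]]]:
    """Return (mean_age, recording_indices) for each overlapping window.

    Recordings are ordered by age; each window holds `size` consecutive
    recordings and successive windows start `step` recordings apart.
    """
    order = sorted(range(len(ages)), key=lambda i: ages[i])
    windows = []
    for start in range(0, len(order) - size + 1, step):
        members = order[start:start + size]
        mean_age = sum(ages[i] for i in members) / size
        windows.append((mean_age, members))
    return windows


# Toy example (assumed ages in years): 30 recordings give 11 windows of 20.
toy_ages = [1.5 + 0.1 * i for i in range(30)]
for mean_age, members in sliding_windows(toy_ages)[:3]:
    print(round(mean_age, 2), len(members))
```

Because consecutive windows share most of their recordings, the resulting trajectory is smooth, at the cost of non-independent neighboring estimates.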

Characterization of the evolution of visual exploration patterns in young children with autism spectrum disorder (ASD) and the typically developing (TD) group using a sliding window approach.

Panel A: The sliding window approach applied to the available recordings in our ASD group (red) and our TD group (blue); Panel B: gaze dispersion in two groups for the sliding windows n°7 and n°42 (mean age of windows 2.18 and 3.64 years, respectively); each circle represents a window encompassing 20 recordings; Panel C: Comparison of the gaze dispersion between two groups using Mean pairwise distance of gaze coordinates on each frame. The dispersion was calculated across 59 sliding windows spanning 1.88–4.28 years of age on average (here again, every circle represents a window encompassing 20 recordings). The windows with filled circles are those where a statistically significant difference between the two groups was shown using permutation testing. Error bars indicate a 95% confidence interval of the mean. As can be seen on panel C, dispersion values diminished in the TD group with advancing age, while the opposite pattern is observed in the ASD group showing a progressively more dispersed gaze behavior in the ASD group during childhood years.

We then estimated gaze dispersion on a group level across all 59 windows. Dispersion on a single frame was conceptualized as the mean pairwise distance between all gaze coordinates present on a given frame (Figure 8, panel B). Gaze dispersion was computed separately for ASD and TD. The measure of dispersion indicated an increasingly discordant pattern of visual exploration between groups during early childhood years. The significance of the difference in gaze dispersion between the two groups across age windows was tested using permutation testing (see Methods section). A statistically significant difference (at the level of 0.05) in a window is indicated using color-filled circles and, as can be appreciated from Figure 8, panel C, was observed in 46 consecutive windows out of 59, spanning mean window ages from 2.5 to 4.3 years. While the TD children showed more convergent visual exploration patterns as they got older, as revealed by progressively smaller values of dispersion (narrowing of focus), the opposite pattern characterized gaze deployment in children with ASD. From the age of 2 years up to the age of 4.3 years, this group showed a progressively discordant pattern of visual exploration (see Figure 8, panel C).
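As an illustration of the dispersion measure, the mean pairwise distance on a single frame and a label-shuffling permutation test of the group difference can be sketched as follows. The single-frame simplification, the function names, the normalized coordinates, and the toy data are all assumptions; the exact test used in the study is described in its Methods section and may differ.

```python
import math
import random
from itertools import combinations


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of (x, y) gaze points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def dispersion_permutation_test(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation test of the difference in dispersion.

    Group labels are shuffled across the pooled gaze points and the
    difference in mean pairwise distance is recomputed each time.
    """
    rng = random.Random(seed)
    observed = (mean_pairwise_distance(group_a)
                - mean_pairwise_distance(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = (mean_pairwise_distance(pooled[:n_a])
                - mean_pairwise_distance(pooled[n_a:]))
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Toy frame (assumed, normalized screen coordinates): one dispersed group
# and one tightly clustered group.
dispersed = [(0.1 * i, 1 - 0.1 * i) for i in range(10)]
clustered = [(0.5 + 0.001 * i, 0.5) for i in range(10)]
print(dispersion_permutation_test(dispersed, clustered))
```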

To ensure the robustness and validity of our findings, we addressed several potential confounding factors. These included differences in sample size (the TD sample included 51 and the ASD sample 166 children), the heterogeneity of ASD behavioral phenotypes, and the use of developmental age rather than chronological age in our sliding window approach. We adopted a sequential approach, first examining the impact of unequal sample sizes and then considering both sample size and phenotypic heterogeneity together. Additionally, we implemented a sliding window methodology using developmental age as the primary matching parameter (for a detailed description, see Appendix 5). Our results consistently reaffirmed our initial findings obtained when using chronologically age-matched samples. Specifically, when matched for both sample size and developmental age, children with ASD consistently demonstrated a greater degree of interindividual disparity across childhood years compared to TD children (Appendix 5, Panels D1-D2).

Discussion

In the present study, we used a data-driven method to quantify differences in spatio-temporal gaze patterns between children with ASD and their TD peers while watching an animated movie. Children with ASD who showed less moment-to-moment divergence in the exploration of a 3 min cartoon compared to referent gaze distribution of age-matched TD children had better adaptive functioning and better communication and motor skills. Visual exploration in the group of children with ASD was not better predicted by the low-level salience of the visual scene compared to their TD peers. Among various features of the video that children saw, the intensity of social content had the most important impact on divergence from the TD gaze patterns; children with ASD showed a more divergent deployment of attention on scene sequences with more than one character suggesting difficulties in processing social cues in the context of social interaction. On a larger temporal scale, across childhood years, the TD children showed a progressive tuning in the focus of their attention, reflected by a narrowing of the group focus while the ASD group showed no such narrowing. Instead, their gaze patterns showed increasing dispersion over the same period. Of note, the children with ASD showing lower levels of divergence in gaze deployment compared to the age-matched TD group tended to have fewer symptoms of autism a year later.

Our results corroborate and extend the findings of a body of studies that have explored microstructural gaze dynamics in autism (Avni et al., 2020; Nakano et al., 2010; Falck-Ytter et al., 2013b; Wang et al., 2018) and have demonstrated divergent moment-to-moment gaze deployment in children with ASD compared to their age-matched TD peers. These processes are very important, as any slight but systematic divergence in gaze deployment can have a tremendous influence on experience-dependent brain specialization (Johnson, 2001; Klin et al., 2009). These subtle but relevant patterns might not be detected by methods that focus on macrostructural gaze structure, measuring overall attention allocation on distinct visual features (e.g. faces, eyes, etc.) based on predefined areas of interest (AOI). Here, we extend the existing findings by first using a different data-driven methodology and, second, by including a developmental aspect to the spatiotemporal gaze deployment in autism and typical development. In our study, to define the referent gaze behavior, we present a novel index, the proximity index, that accounts for the entire scene, whether multiple socially relevant targets are present or just a few objects, and in doing so provides a more subtle estimation of ASD gaze deployment in comparison to TD (see Figure 1). Furthermore, in this study, we used a cartoon, and thus a dynamic stream that is also more ecological in its representation of social interactions and has the advantage of being very appealing to young children. Previous research (Riby and Hancock, 2009) has shown that children with ASD attend more to dynamic cartoon stimuli representing social interaction than to natural movies of people interacting. Despite animated movies being a simplified version of social interaction with reduced social complexity, the movie we analyzed provided us with ample insight into the atypicality of gaze behavior in children with ASD.

We showed that the level of divergence in gaze exploration of this 3 min video was correlated with the developmental level of children with ASD and their overall level of autonomy in various domains of everyday life. This finding stresses the importance of studying the subtlety of gaze deployment with respect to its downstream contribution to more divergent global behavioral patterns later in development (Schultz, 2005; Young et al., 2009; Klin et al., 2015; Jones and Klin, 2013). Gaze movements in a rich environment, such as the cartoon used here, inform not only immediate perception but also future behavior, as experience-dependent perception now is likely to alter the ongoing developmental trajectory. In accordance with this view, the level of typicality of visual exploration in ASD children at T1 was related to the level of autistic symptoms at T2 but not at T1. One possible explanation for the lack of a stable association at T1 is the lower stability of symptoms early on. Indeed, while diagnoses of ASD show stability with age, a certain percentage of children might still show fluctuation. The study by Lord and collaborators (Lord et al., 2006), following 172 2-year-olds up to the age of 9 years, showed that diagnosis fluctuations are more likely in children with lesser symptoms compared to children with more severe symptoms. As our study included all ASD severities, it is subject to such fluctuations. Another possible interpretation comes from the maturation of the gaze patterns in the TD group, against which we define the typicality of gaze in the ASD group. As can be seen in our results, children with TD show a tremendous synchronization of their gaze during the age range considered, resulting in a tighter gaze distribution at T2 and thus a more sensitive evaluation of ASD gaze at that time point.
The possibility that TD children show more similar gaze allocation with age, while the gaze of children with ASD becomes increasingly idiosyncratic with age, highlights the value of addressing the mechanisms underlying the developmental trajectories of gaze allocation in future studies.

With regards to the exploration style, while watching the cartoon, compared to their TD peers, children with ASD presented more ambient, exploratory fixations, indicative of rapid acquisition of low-frequency information (Eisenberg and Zacks, 2016). On the other hand, they showed significantly fewer focal fixations that are known to operate with more fine-grained high-frequency information. This suggests that children with ASD spent more time than the TD group in an ambient mode trying to grasp the global scene configuration (Ito et al., 2017) and less in a detail-sensitive focused mode. These two modes of exploration are supported by distinct and yet functionally related systems of dorsal attention (ambient mode-related processing of spatial relations) and ventral attention (dealing with behaviorally salient object representation through the involvement of focused mode) (Helo et al., 2014). Our finding of differential recruitment of these two modes during the viewing of social stimuli might suggest differential recruitment of these two attentional networks during the processing of these complex social scenes. In our previous work on a smaller sample for which we also acquired EEG recording during the time that children watched the Trotro cartoon, we found that the divergence in gaze deployment was related to the vast abnormalities in neural activation, including reduced activation of frontal and cingulate regions and increased activation of inferior parietal, temporal, and cerebellar regions (Jan et al., 2019). In a similar EEG-eye-tracking study using videos involving biological motion (children doing yoga in nature) (Sperdin et al., 2018), we found increased contribution from regions such as the median cingulate cortex and the paracentral lobule in the toddlers and preschoolers with ASD who had a more similar visual exploration pattern to their TD peers (higher PI). 
Thus, the children who showed less divergence from referent gaze patterns (TD-like viewing patterns) more actively engaged the median cingulate cortex and the paracentral regions suggesting potential compensatory strategies to account for the divergent brain development over time. Longitudinal studies combining eye-tracking and neuroimaging techniques are necessary to confirm the hypothesis of such compensatory hyperactivation.

In an effort to parse the complexity in gaze deployment evidenced in our ASD group across childhood years, we measured the contribution of basic visual properties of the scene to the gaze deployment in this group as compared to the TD group. We found that the basic visual properties played a less important role in directing gaze in our group of young children with ASD as compared to their TD peers. This was observed across all separate channels, namely, intensity, orientation, color, motion, and flicker, as well as the full salience model with all channels combined. Previous research has shown that bottom-up features are responsible for directing attention in very young infants, but from 9 months of age, top-down processes take predominance in directing gaze (Frank et al., 2009). Less is known about the relative contribution of these processes while watching complex dynamic stimuli over the developmental span. Using a cross-sectional sample of TD children and adults, Rider et al., 2018 showed that gaze deployment in both children and adults was better predicted by the presence of a face in the scene (summoning top-down processing mechanisms) than by low-level visual properties of the scene. However, the two salience models they used (I&K and GBVS, the latter being the same as the one used in our study) were better at predicting gaze data in adults than in children, suggesting that these dynamic salience models might be more adapted to the mature visual system. Indeed, our sample is relatively young, and it is possible that the lesser success of the salience models in predicting gaze allocation in ASD children might be influenced by the visual and motor abnormalities characterizing this age range (Rider et al., 2018; Farber and Beteleva, 2005).

Contrary to the bottom-up visual properties of the scenes, social intensity was an important element in governing the gaze divergence in children with ASD. The finding of a more divergent pattern in frames comprising the interaction between characters corroborates previous findings of atypical face (Hanley et al., 2013) and dynamic social stimuli (Speer et al., 2007) processing, particularly in the context of interaction (Parish-Morris et al., 2019). Social interaction processing depends strongly on the top-down inputs, as the choice of what is to be attended relies on prior expectations, attributed meaning, and global language and scene understanding. Here, our data show that ASD children most at risk on these skills also show lower, less TD-like PI.

The sliding window approach yielded a fine-grained measure of change in gaze deployment in both groups of children during early childhood. With advancing age, TD children showed increasingly coherent gaze patterns, corroborating previous findings of increased consistency in TD gaze behavior over time (Frank et al., 2009; Shic et al., 2007; Franchak et al., 2016; Rider et al., 2018; Kirkorian et al., 2012). On the other hand, children with ASD showed increasingly heterogeneous patterns during the same period. A similar contrasting pattern, with gaze in TD individuals getting more stereotyped from childhood to adulthood and gaze in ASD groups showing more variability, was brought forward in a study by Nakano et al., 2010. While this study used a cross-sectional design to study the developmental change in a group of children and adults, to our knowledge, our study is the first to extend the findings on both TD children and those with ASD using a longitudinal design and focusing on moment-to-moment gaze deployment. This higher consistency in gaze in the TD group with increasing age has been related to a more systematic involvement of top-down processes (Kirkorian et al., 2012; Franchak et al., 2016; Helo et al., 2017). During typical development, through the phylogenetically favored mechanism of preferential orientation to social stimuli (Rosa Salva et al., 2011), children show increasing experience with and subsequently increasing understanding of social cues, setting them on the typical path of social development (Klin et al., 2009; Jones and Klin, 2013). On the other hand, strikingly divergent patterns in children with ASD might be seen as a product of the accumulation of atypical experiences triggered by social attention deployment diverging early on in their development (Jones and Klin, 2013).
Behaviorally, in children with ASD during the preschool years, we observe the emergence of circumscribed interests alongside a tendency toward more rigid patterns of behavior (insistence on sameness) (Richler et al., 2010). These emerging patterns of interests might contribute to the divergence in gaze, as attention is attracted instead to elements related to the circumscribed interests (Sasson et al., 2008; Sasson et al., 2011), thus amplifying the derailment from the referent social engagement path (Klin et al., 2015). Ultimately, interests that are, indeed, idiosyncratic in nature might limit group-level coherence; however, a discernible amount of within-subject stability in gaze patterns over shorter time scales may be expected. While the present study does not address the latter, our results highlight the loss of group cohesion in gaze as ASD children age, in line with emerging findings of marked gaze inconsistency across individuals with ASD (Nakano et al., 2010; Wang et al., 2018). Whether, as shown by Avni et al., 2020, within-individual consistency also decreases when the same video is seen twice is an important topic for future studies to address. Overall, our results are consistent with the presence of growing idiosyncrasy in the selection and processing of information, particularly in the context of social interaction in ASD. An increased idiosyncrasy on the neural level while watching dynamic social scenes has been put forward by a number of studies (Hasson et al., 2009; Byrge et al., 2015; Bolton et al., 2018; Bolton et al., 2020) and was related to lower scene understanding (Byrge et al., 2015) and higher presence of autistic symptoms (Bolton et al., 2020). The mechanisms of efficient selection of relevant social elements are genetically controlled (Constantino et al., 2017; Kennedy et al., 2017), and the disturbance we observe in ASD is most likely a downstream product of the gene-environment correlation (Klin et al., 2015).
According to this view, the initial vulnerability (Jones and Klin, 2013; Constantino et al., 2017) characterizing autism would lead to a lifetime of atypical experiences with the social world, which in turn could result in atypical brain specialization and more idiosyncratic behavioral patterns.

The finding of progressive divergence in gaze patterns in children with ASD during the childhood years calls for early detection and early intensive intervention to prevent further derailment from the typical social engagement path (Dawson et al., 2010). The present study is one of the first to tackle microstructural atypicalities in gaze deployment in young children with ASD while taking developmental change into account. Our longitudinal finding that initial gaze divergence is informative of later autistic symptomatology reflects the potential of the present method as a promising tool for understanding the mechanisms of developmental change in ASD. This work stresses the need to better characterize the link between behavioral phenotypes and the underlying neurobiological substrates to adapt early intervention strategies to the neurodevelopmental mechanisms involved.

The current study comes with a number of limitations. The lack of a control group of comparable size to the ASD group was a severely limiting factor. The study protocol within which the present work was carried out is rather dense, and longitudinal visits are spaced 6 months from each other, which demands a considerable investment from families who would otherwise not need this highly precise assessment of the developmental functioning of their child. From the developmental perspective, a bigger TD sample would allow more precision in measurements of the developmental change with age. It would allow defining referent groups that are tightly matched with regard to age and allow pure longitudinal measures. We tried our best to account for this by using a sliding window approach with partially overlapping windows in order to infer developmental dynamics in both groups over childhood years, but an ideal design would be purely longitudinal. A bigger TD sample would also allow more sophisticated analysis, such as unsupervised clustering, to test the potential of the Proximity Index method for data-driven classification. Moreover, an important question to address is the development of gaze dynamics in girls with ASD. In the current study, we focused only on males, as the number of eligible females with ASD was much smaller. Finally, another important element that was out of the scope of the present study, but that would warrant an in-depth investigation in this early post-diagnosis period, is the role of the behavioral treatment children received after the diagnosis was established. Early intensive behavioral intervention greatly improves the symptoms and the functioning profile of individuals on the spectrum. It would be important to learn how gaze behavior is influenced by such intervention, and how the behavioral profile changes following the change in visual behavior.

The method presented in the current study can easily be applied to any eye-tracking paradigm and any research question measuring the degree of similarity between any number of populations. It has the potential for application in population-wide studies for charting the developmental paths of visual exploration across the lifespan and is a promising tool for automated screening of children at risk of ASD.

Materials and methods

Experimental model and subject details

Cross-sectional sample

One hundred sixty-six males with autism (3.37±1.16 years) and 51 age-matched typically developing males (3.48±1.29 years) participated in the study. Table 1 summarizes the clinical characteristics of our cross-sectional sample. Our study included only males because fewer females with ASD were available. The clinical diagnosis of autism, based on DSM criteria, was confirmed using the standardized observational assessment of the child and interviews with caregiver(s) retracing the child’s medical and developmental history. All children with ASD reached the cut-off for ASD on the Autism Diagnostic Observation Schedule-Generic (ADOS-G; Lord et al., 2000) or the Autism Diagnostic Observation Schedule-2nd edition (ADOS-2; Lord et al., 2012). For children who underwent the ADOS-G assessment, the scores were recoded according to the revised ADOS algorithm (Gotham et al., 2007) to ensure comparability with ADOS-2.

Table 1
Description of the cross-sectional sample.

| Measures | ASD (n=166), Mean±SD | TD (n=51), Mean±SD | p-value |
| --- | --- | --- | --- |
| Age | 3.37±1.16 | 3.48±1.29 | 0.621^a |
| Total Symptom Severity Score (ADOS-2 CSS) | 7.19±1.78 | 1.10±0.300 | <0.001^a |
| Social Affect (ADOS-2 SA-CSS) | 6.08±2.06 | 1.18±0.478 | <0.001^a |
| Repetitive Behaviors & Restricted Interests (ADOS-2 RRB CSS) | 8.63±1.85 | 2.16±1.92 | <0.001^a |
| Social Interaction (ADI-R: A) | 14.8±5.70 | 1.04±1.39 | <0.001^a |
| Communication (ADI-R: B) | 9.97±3.44 | 1.12±1.35 | <0.001^a |
| Repetitive Behaviors & Restricted Interests (ADI-R: C) | 4.79±2.22 | 0.314±0.678 | <0.001^a |
| Age of onset (ADI-R: D) | 3.60±0.997 | 0.078±0.337 | <0.001^a |
| Best Estimate IQ | 83.6±24.0 | 119±16.5 | <0.001^a |
| VABS-II Adaptive Behavior | 80.2±10.2 | 103±8.21 | <0.001 |
| VABS-II Communication | 80.2±13.7 | 105±8.94 | <0.001 |
| VABS-II Daily Living Skills | 83.7±11.6 | 101±8.25 | <0.001 |
| VABS-II Socialization | 79.2±9.82 | 101±8.49 | <0.001 |
| VABS-II Motor Skills | 88.4±11.5 | 102±11.2 | <0.001^a |

Note. p-values marked ^a are obtained using nonparametric Mann-Whitney tests of differences between the two groups.

Before inclusion in the study, TD children were screened using a questionnaire focusing on medical history and history of pregnancy. Children were not included in our TD group if they were born prematurely or had a positive screen for the presence of any known neurological or psychiatric disorder in the child itself or known case of ASD in any first-degree relative of the child. Moreover, all TD children were also assessed using the ADOS-G or ADOS-2 evaluations to exclude the presence of ASD symptoms. The majority of TD participants had a minimal severity score of 1, except four children who had a score of 2.

The data for the current study were acquired as a part of a larger longitudinal study of early development in autism based in Geneva. Detailed information about cohort recruitment has been given elsewhere (Franchini et al., 2017; Franchini et al., 2018; Kojovic et al., 2019). The study protocol was approved by the Ethics Committee of the Faculty of Medicine of Geneva University, Switzerland (Swissethics, protocol 12–163/Psy 12–014, referral number PB_2016–01880). All families gave written informed consent to participate.

Unstructured longitudinal sample

As participants in our study are followed longitudinally, their repeated visits were included when satisfying the inclusion criteria (later detailed in the Method details section). This yielded a total of 308 recordings for the ASD group and 105 for the TD group (all recordings were collected a year apart; 101 children with ASD contributed two recordings each, and 41 children with ASD contributed three recordings each, while 33 and 21 TD children contributed respectively 2 and 3 recordings each) (see Figure 8 for illustration of the available recordings). This sample was employed to derive trajectories of visual exploration over the childhood years using mixed models analysis and considering both within-subject and between-subject effects (Mutlu et al., 2013; Mancini et al., 2020) and sliding windows approach (Sandini et al., 2018) (further detailed in the Method details subsection).

One-year follow-up longitudinal sample

To obtain a longitudinal measure of change in visual exploration, we used a smaller subsample that included children who had recordings obtained a year apart. Of the 101 children with ASD who had two recordings, seven were removed because their recordings were obtained two years apart. The same criterion was applied to the TD group, where four children were removed. Thus, the final paired longitudinal sample included 94 males with ASD (1.66–5.43 years old) and 29 age-matched TD males (1.31–5.56 years old) who were evaluated a year later.

Behavioral phenotype measures

As detailed above, a direct assessment of autistic symptoms was obtained using the Autism Diagnostic Observation Schedule-Generic (ADOS-G; Lord et al., 2000) or the Autism Diagnostic Observation Schedule-2nd edition (ADOS-2; Lord et al., 2012). Since its latest version (ADOS-2), the ADOS yields a measure of severity of autistic symptoms ranging from 1 to 10, conceived to be relatively independent of the participant’s age or verbal functioning (Gotham et al., 2009; Estes et al., 2015). For subjects who were administered the older version of the ADOS (ADOS-G), the severity scores were obtained according to the revised ADOS algorithm (Gotham et al., 2007). For a more precise measure of symptoms according to their type, we included the domain severity scores, namely, social affect (SA) and restricted and repetitive behaviors (RRB) (Hus et al., 2014).

A detailed developmental history of symptom emergence and presentation was obtained using the Autism Diagnostic Interview-Revised (Lord et al., 1994). ADI-R is a standardized, semi-structured interview administered by trained clinicians to parents/caregivers. The ADI-R assesses the early developmental milestones and the present (last three months) and past behavior in the domains of reciprocal social interactions (A), communication (B), and restricted, repetitive, and stereotyped patterns of behavior (C). Being developed in the DSM-IV framework (Association AP, 1994) specific attention is given to the age of onset of symptoms (domain D, Demographics table).

In our large longitudinal autism cohort, the cognitive functioning of children is assessed using several assessments depending on the age of the children and their capacity to attend to the demands of cognitive tasks. Since the cohort conception in 2012, we used the Psycho-Educational Profile, third edition, PEP-3, (Schopler, 2005) validated for 24–83 months. In 2015 we added the Mullen Early Learning scales (Mullen, 1995) validated for 0–68 months. For the current study, in all analyses of the Results section, we used the scores obtained from the PEP-3 for the behavioral correlations with the PI. However, when we compared the group of ASD children with the TD children in the description of the sample at the beginning of this section, we faced a lot of missing data on the TD side, as a complete PEP-3 was frequently missing in children with TD (lack of time to complete several cognitive assessments). To be able to present a descriptive comparison between the two groups, in the Demographics table, and only there, we used the Best Estimate Intellectual Quotient, a composite measure obtained by combining available assessments as previously described in the literature (Howlin et al., 2014; Kojovic et al., 2019; Howlin et al., 2013; Bishop et al., 2015; Liu et al., 2008). In the ASD group, the majority of children had the Psycho-Educational Profile, the third edition, Verbal/Preverbal Cognition scale (PEP-3; VPC DQ, Schopler, 2005) (n=154). The VPC Developmental Quotient (DQ) was obtained by dividing the age equivalent scores by the child’s chronological age. For a smaller subset of children with ASD (below two years of age), as the PEP-3 could not be administered, we used Mullen Early Learning scales (Mullen, 1995), (n=10). Developmental quotients were obtained using the mean age equivalent scores from four cognitive scales of the MSEL (Visual Reception, Fine Motor, Receptive Language, and Expressive Language) and divided by chronological age. 
One child with ASD was administered only the Full-Scale IQ (FSIQ) of the Wechsler Preschool and Primary Scale of Intelligence, fourth edition (WPPSI-IV; David, 2014), and one child was not testable at the initial visit (severe sensory stimulation). In the TD group, the majority of children were assessed using the MSEL (n=24), followed by the PEP-3 (n=23) and the WPPSI-IV (n=4). The composite score comparison (BEIQ) is presented in the Demographics table.
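The developmental quotient described above is a simple ratio, which the following minimal Python sketch illustrates. The function names are hypothetical, ages are assumed to be expressed in months, and the multiplication by 100 is an assumption based on conventional quotient scaling (the text only states that age-equivalent scores are divided by chronological age).

```python
def developmental_quotient(age_equivalent_months, chronological_age_months, scale=100.0):
    """Age-equivalent score divided by chronological age.

    scale=100 is an assumption (conventional quotient scaling),
    not a value stated in the text.
    """
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return scale * age_equivalent_months / chronological_age_months


def msel_developmental_quotient(age_equivalents_months, chronological_age_months):
    """DQ from the mean age equivalent of several scales (e.g. the four
    cognitive MSEL scales), divided by chronological age."""
    mean_age_equivalent = sum(age_equivalents_months) / len(age_equivalents_months)
    return developmental_quotient(mean_age_equivalent, chronological_age_months)
```

For instance, an age equivalent of 24 months at a chronological age of 30 months gives a quotient of 80 under this scaling.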

Adaptive functioning was assessed using the Vineland Adaptive Behavior Scales, second edition (VABS-II; Sparrow et al., 2005). VABS-II is a standardized parent interview measuring adaptive functioning from childhood to adulthood in communication, daily-living skills, socialization, and motor domain. The adaptive behavior composite score (ABCS), a global measure of an individual’s adaptive functioning, is obtained by combining the four domain standardized scores.

Method details

Stimuli and apparatus

The current experiment consisted of free-viewing of one episode of the French cartoon ‘Trotro’ lasting 2’53” (Lezoray, 2013). This cartoon was the first stimulus in an experiment involving the simultaneous acquisition of high-density EEG recorded with a 129-channel Hydrocel Geodesic Sensor Net (Electrical Geodesics Inc, Eugene, OR, USA). The findings concerning the EEG data are published separately (Jan et al., 2019). The cartoon depicts human-like interactions between three donkey characters at a relatively slow pace. The original soundtrack was preserved during recording. Gaze data were collected using a Tobii TX300 eye tracker (https://www.tobiipro.com), sampled at 300 Hz, except for five recordings acquired at a lower sampling frequency (60 Hz) using a Tobii TXL60. The screen size was identical for both eye-tracking devices (height: 1200 pixels, 29°38’; width: 1920 pixels, 45°53’), with a refresh rate of 60 Hz. Participants were seated approximately 60 cm from the recording screen. The cartoon frames subtended a visual angle of 26°47’ × 45°53’ (height × width). A five-point calibration procedure consisting of child-friendly animations was performed using an inbuilt program in the Tobii system. Upon verification, the calibration procedure was repeated if the eye-tracking device failed to detect the participant’s gaze position accurately. The testing room had no windows, and lighting conditions were constant for all acquisitions.

Quantification and statistical analysis

Eye-tracking analysis

We excluded data from participants who showed poor screen attendance, defined as binocular gaze detection on less than 65% of video frames. Screen attendance was higher in the TD sample (93.8±6.37%) than in the ASD group (87.8±9.33%), U=2568, p<0.001. To extract fixations, we used the Tobii I-VT fixation filter (Olsen, 2012) with the following settings: velocity threshold, 30°/s; velocity window length, 20 ms. Adjacent fixations were merged when the maximum time between fixations was 75 ms and the maximum angle between fixations was 0.5°. To account for differences in screen attendance, we omitted instances of non-fixation data (saccades, blinks, off-screen moments) in all calculations.
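To make the fixation-extraction settings above concrete, here is a minimal Python sketch of a velocity-threshold classifier followed by merging of adjacent fixations. It is not the Tobii implementation: it assumes gaze positions are already expressed in degrees of visual angle, that timestamps are strictly increasing, and it uses sample-to-sample velocity instead of a 20 ms velocity window. The function and field names are hypothetical.

```python
import math


def _summarise(idx, t_ms, x_deg, y_deg):
    """Describe a fixation by its sample indices, time span and centroid."""
    return {
        "idx": idx,
        "start_ms": t_ms[idx[0]],
        "end_ms": t_ms[idx[-1]],
        "x": sum(x_deg[i] for i in idx) / len(idx),
        "y": sum(y_deg[i] for i in idx) / len(idx),
    }


def ivt_fixations(t_ms, x_deg, y_deg, vel_thresh=30.0,
                  max_gap_ms=75.0, max_angle_deg=0.5):
    """Velocity-threshold fixation detection (simplified sketch)."""
    n = len(t_ms)
    # Sample-to-sample angular velocity in deg/s (simplification of the
    # windowed velocity estimate; timestamps must be strictly increasing).
    vel = [0.0] * n
    for i in range(1, n):
        dt_s = (t_ms[i] - t_ms[i - 1]) / 1000.0
        vel[i] = math.hypot(x_deg[i] - x_deg[i - 1], y_deg[i] - y_deg[i - 1]) / dt_s
    if n > 1:
        vel[0] = vel[1]
    # Group consecutive samples whose velocity is below the threshold.
    groups, current = [], []
    for i in range(n):
        if vel[i] < vel_thresh:
            current.append(i)
        else:
            if current:
                groups.append(current)
            current = []
    if current:
        groups.append(current)
    # Merge adjacent fixations that are close in both time and space.
    merged = []
    for idx in groups:
        fix = _summarise(idx, t_ms, x_deg, y_deg)
        if merged:
            prev = merged[-1]
            gap = fix["start_ms"] - prev["end_ms"]
            angle = math.hypot(fix["x"] - prev["x"], fix["y"] - prev["y"])
            if gap <= max_gap_ms and angle <= max_angle_deg:
                merged[-1] = _summarise(prev["idx"] + idx, t_ms, x_deg, y_deg)
                continue
        merged.append(fix)
    return merged
```

Samples that are not assigned to any fixation (saccades and noise) are simply left out of the returned list, which mirrors the omission of non-fixation data described above.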

Determining the ‘reference’ of visual exploration

To define the referent gaze distribution (‘reference’), against which we will compare the gaze data from the ASD group, we employed the kernel density distribution estimation function on gaze data from TD individuals on each frame of the video. The reference sample comprised 51 typically developing children (3.48 ±1.29 years). To create referent gaze distribution, we opted for a non-fixed bandwidth of a kernel as the gaze distribution characteristics vary significantly from frame to frame. Precisely, fixed bandwidth would result in over-smoothing the data in the mode and under-smoothing extreme distribution cases of gaze data at tails. We used the state-of-the-art adaptive kernel density estimation that considers the data’s local characteristics by employing adaptive kernel bandwidth (Botev et al., 2010). Thus a Gaussian kernel of an adaptive bandwidth was applied at each pair of gaze coordinates, and the results were summed up to obtain an estimation of the density of gaze data (see Figure 1). Obtained density estimation reflects a probability of gaze allocation at the given location of the visual scene for a given group. This probability is higher at the distribution’s taller peaks (tightly packed kernels) and diminishes toward the edges. We used the Matlab inbuilt function contour to delimit isolines of the gaze density matrix.
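The idea of an adaptive bandwidth can be illustrated with a short Python sketch. Note that this is not the diffusion-based estimator of Botev et al., 2010 used in the study: as a simpler stand-in, it uses a sample-point adaptive estimator in which each gaze point receives a Gaussian kernel whose bandwidth shrinks in dense regions and widens in sparse ones (square-root law on a fixed-bandwidth pilot estimate). The function names and the pilot bandwidth `h0` are assumptions made for illustration.

```python
import math


def gauss2d(dx, dy, h):
    """Isotropic 2D Gaussian kernel with bandwidth h."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * h * h)) / (2.0 * math.pi * h * h)


def adaptive_kde(points, grid_x, grid_y, h0):
    """Sample-point adaptive KDE (stand-in for the diffusion estimator).

    points: list of (x, y) gaze coordinates on one frame.
    Returns (density, bandwidths), with density[row][col] evaluated at
    (grid_x[col], grid_y[row]).
    """
    n = len(points)
    # Step 1: pilot density at each gaze point with a fixed bandwidth h0.
    pilot = [sum(gauss2d(px - qx, py - qy, h0) for qx, qy in points) / n
             for px, py in points]
    # Step 2: local bandwidths, narrower where the pilot density is high.
    geo_mean = math.exp(sum(math.log(p) for p in pilot) / n)
    bandwidths = [h0 * math.sqrt(geo_mean / p) for p in pilot]
    # Step 3: sum the kernels, each with its own bandwidth.
    density = [[sum(gauss2d(gx - px, gy - py, h)
                    for (px, py), h in zip(points, bandwidths)) / n
                for gx in grid_x]
               for gy in grid_y]
    return density, bandwidths
```

With a tight cluster of gaze points and one isolated point, the isolated point receives a wider kernel than the clustered ones, so the tails are smoothed more than the mode.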

Quantifying the divergence in visual exploration

Once the ‘reference’ was defined, we calculated the distance of gaze data from this referent distribution on each frame for each child with ASD (n=166; 3.37±1.16 years). Comparison to this referent pattern yielded a measure called the Proximity Index (PI) (see Figure 1). The calculation of the Proximity Index values was done for each frame separately. Proximity Index values were scaled from 0 to 1 at each frame for comparison and interpretation. We used the Matlab inbuilt function contour to delimit isolines of the gaze density matrix. To have a fine-grained measure, we defined 100 isolines per density matrix (i.e. each frame). Then we calculated the Proximity Index for each child with ASD framewise. Gaze coordinates that landed outside the polygon defined by the contour(s) of the lowest level (1) obtained a PI value of 0. Gaze coordinates inside the area defined by the gaze density matrix isolines obtained a PI value between 0.01 and 1. The exact non-zero PI value depended on the level number of the highest isoline/contour that contained the x and y coordinates of the gaze. As we defined 100 isolines per density matrix, the levels ranged from 1 to 100. Accordingly, a gaze coordinate that landed inside the highest contour (level 100) obtained a PI value of 1, and one that landed inside isoline 50 obtained a PI value of 0.50. A high PI value (closer to the mode of the density distribution) indicates that the visual exploration of the individual for a given frame is less divergent from the reference (more TD-like). A summary measure of divergence in visual exploration from the TD group was obtained by averaging the PI values over the total duration of the video.
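The mapping from a gaze position to a Proximity Index value can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the published Matlab code: the density is assumed to be given on a grid, the gaze sample is assumed to be already mapped to a grid cell, and the 100 isoline levels are assumed to be evenly spaced between the minimum and maximum of the frame's density (the exact levels chosen by Matlab's contour may differ). Function names are hypothetical.

```python
def proximity_index(density, row, col, n_levels=100):
    """PI of a gaze sample falling in grid cell (row, col).

    Returns level / n_levels, where level is the highest isoline that
    encloses the sample, or 0.0 if it lies outside the lowest isoline.
    Levels are assumed evenly spaced between the density minimum and
    maximum of the frame.
    """
    flat = [v for line in density for v in line]
    lo, hi = min(flat), max(flat)
    if hi <= lo:
        return 0.0  # flat density: no isolines can be defined
    step = (hi - lo) / (n_levels + 1)
    level = int((density[row][col] - lo) / step)
    level = max(0, min(n_levels, level))
    return level / n_levels


def mean_proximity_index(frame_values):
    """Average PI over the video, skipping frames without a fixation (None)."""
    valid = [v for v in frame_values if v is not None]
    return sum(valid) / len(valid) if valid else float("nan")
```

Because the index depends only on the density value at the gaze location, frames with two or more gaze foci are handled in the same way as frames with a single focus.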

While the smoothing kernel deployed in our density estimation function is Gaussian, the final distribution of the gaze data is not assumed to be Gaussian. As shown in Figure 1, right upper panel, the final distribution was sensitive to the complexity of the gaze distribution (e.g. having two or more distant gaze foci in the TD group), which allowed a flexible and ecological definition of referent gaze behavior. The coexistence of multiple foci makes it possible to weight the relative importance of the different scene elements from the point of view of the TD group. It further distinguishes our method from hypothesis-driven methods that measure aggregated fixation data in predefined regions of the scene. For the frames where the gaze of the TD group showed several distinct focal points, like the one in Figure 1, right upper panel, we calculated the PI in the same manner as for frames that had a single focus. For a given gaze coordinate from a child with ASD, we identified the level of the highest contour, ranging from 0.01 to 1, of any of the attention foci/clusters containing that coordinate. In a hypothetical situation where the gaze data of the TD group fall into two clusters identically (i.e. the density peaks have the same level/height), any two gaze coordinates that fall in the highest level of either peak would obtain a PI value of 1.

Multivariate association between gaze patterns and behavioral data

The relation between the behavioral phenotype and the Proximity Index was tested using a multivariate approach, Partial Least Squares Correlation (PLS-C) (McIntosh and Lobaugh, 2004; Krishnan et al., 2011). The Matlab source code is publicly available at https://github.com/MIPLabCH/myPLS (Zöller et al., 2019). This analysis focuses on the relationship between two matrices, $A$ ($p \times b$) and $B$ ($p \times k$), formally expressed as $R = B^{\top}A$. Before computing the cross-correlation matrix $R$ between $A$ and $B$, both input matrices are z-scored. As the correlation is not directional, the roles of $A$ and $B$ are symmetric, and the analyses focus on the shared information between the two. The cross-correlation matrix $R$ was then decomposed using a singular value decomposition (SVD) according to the formula $R = U \Delta V^{\top}$. The two singular vectors $U$ and $V$ are denoted as saliences, where $U$ represents the behavioral pattern that best characterizes $R$ and $V$ corresponds to the Proximity Index pattern that best characterizes $R$. Finally, the original matrices $A$ and $B$ are projected on their own saliences, yielding two latent variables $L_A = AV$ and $L_B = BU$. PLS-C implements permutation testing to assess the generalizability of the latent components (LCs). Once a vector of saliences is deemed generalizable, its stability is tested using bootstrapping with replacement. In all the analyses in this paper, we used 1000 permutations and 1000 bootstrap samples to test the significance of the LCs and the stability of the vectors of saliences, respectively.
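The decomposition described above can be sketched with NumPy as follows. This is a minimal illustration and not the myPLS toolbox: the function names are hypothetical, the permutation scheme (shuffling the rows of B and comparing singular values) is one common choice that is assumed here, and the bootstrap stability step is omitted.

```python
import numpy as np


def zscore(M):
    """Column-wise z-scoring."""
    return (M - M.mean(axis=0)) / M.std(axis=0, ddof=1)


def pls_correlation(A, B):
    """PLS-C: SVD of the cross-correlation matrix R = B^T A.

    A: (p x b) Proximity Index matrix, B: (p x k) behavioral matrix.
    Returns saliences U (behavior) and V (PI), singular values s,
    and latent variables L_A = A V and L_B = B U.
    """
    A, B = zscore(A), zscore(B)
    R = B.T @ A                                   # (k x b)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    V = Vt.T
    return U, s, V, A @ V, B @ U


def permutation_pvalues(A, B, n_perm=1000, seed=0):
    """p-value per latent component, from singular values under row shuffling.

    Assumption: significance is assessed by permuting the rows of B.
    """
    rng = np.random.default_rng(seed)
    s_obs = pls_correlation(A, B)[1]
    exceed = np.zeros_like(s_obs)
    for _ in range(n_perm):
        perm = rng.permutation(A.shape[0])
        s_null = pls_correlation(A, B[perm])[1]
        exceed += s_null >= s_obs
    return (exceed + 1) / (n_perm + 1)
```

With a single Proximity Index column in A, R has one column and the decomposition yields one latent component, whose behavioral salience U gives the loading of each behavioral variable.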

Proximity Index with regards to the visual properties of the animated scene

Pixel level salience

Previous research has put forward the enhanced sensitivity to the low-level (pixel-level) saliency properties in adults with ASD while watching static stimuli (Wang et al., 2015) compared to healthy controls. We were interested in whether any low-level visual properties would more significantly contribute to the gaze allocation in one of the groups.

To extract values of basic visual qualities of the scene, we used a salience model that has been extensively characterized in the literature (Koch and Ullman, 1985; Itti et al., 1998; Itti and Koch, 2000; Itti et al., 2001). We used the GBVS version of this model (Harel et al., 2006), (for source code see http://www.animaclock.com/harel/share/gbvs.php; Pinoshino, 2022; Harel et al., 2006; Harel, 2022). This model extracts features based on simulated neurons in the visual cortex: color contrast (red/green and blue/yellow), intensity contrast (light/dark), four orientation directions (0°, 45°, 90°, 135°), flicker (light offset and onset) and four motion energies (up, down, left, right) (Itti et al., 1998; Itti and Koch, 2001). The final saliency map results from the linear combination of these separate ‘channels’ (Itti et al., 2001) into a unique scalar saliency map that guides attention (see Figure 5A for the illustration of salience features obtained using GBVS model on a given frame). To disentangle the relative importance of the channels besides using the global conspicuity map, we also considered the channels taken separately (see Appendix 2).

Considering the heavy computational cost of these analyses, all computations were performed at the University of Geneva on the Baobab and Yggdrasil clusters.

Movie characteristics

Social complexity

Furthermore, given previous findings of reduced attention to social content in individuals with ASD (Chita-Tegmark, 2016b; Frank et al., 2012), we aimed to test the hypothesis that the Proximity Index values would be lower for moments in the video with enhanced social complexity, involving two or three characters, than for moments involving only one character (Appendix 3A). Note that, with an increasing number of characters, the scene is inevitably richer in details, an issue we address by measuring visual and vocalization complexity.

Visual complexity

To measure visual complexity, we calculated the length of edges delimiting image elements (see Figure 1). Edge extraction was done on every image of the video using the Canny method (Canny, 1986) implemented in Matlab (version 2017a; Mathworks, Natick, MA). This method finds edges by looking for the local maxima of the intensity gradient. The gradient is obtained using the derivative of a Gaussian filter, and the method uses two thresholds to detect strong and weak edges. Weak edges are retained only if connected to strong edges, which makes this method relatively immune to noise (see Appendix 3B).

Vocal video aspects: Monologue and directed speech

Speech properties of the scenes were also analyzed, using the BORIS software (https://www.boris.unito.it/). We manually identified the moments where characters were vocalizing or speaking. Then we annotated the moments as a function of the social directness of the speech. In particular, we distinguished between monologue (characters thinking out loud or singing) and moments of socially directed speech (invitation to play and responses to invitations).

Coarse movie characteristics: Frame switching and moving background

Finally, to test how the global characteristics of video media influence gaze deployment, we focused on two movie features. The first feature, denoted as the ‘Frame switch,’ encompasses all instances in which the cartoon employs an abrupt frame transition using the hard-cut montage technique. To represent this feature numerically, a feature vector was created. In this vector, the first frame following the switch is assigned a code of 1, while all other frames are coded as 0. This coding scheme effectively highlights the occurrence of these abrupt shot changes within the movie. Throughout the duration of the movie, this event type occurs 25 times (as indicated in Figure 6).

The feature labeled as the ‘Moving background’ pertains to moments when the cartoon’s background moves in tandem with the characters, following their directional motion. We aimed to distinguish these segments from scenes featuring a static background, as the overall motion dynamics in these frames varied. The occurrence of a moving background is observable in 5 distinct sequences within the movie (as illustrated in Figure 6). Frames with a moving background were coded 1 yielding a binary feature vector.
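The two coding schemes above can be written down in a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, frames are indexed from 0, and the frame numbers in the usage note are made up rather than taken from the cartoon.

```python
def frame_switch_vector(n_frames, first_frames_after_cut):
    """1 on the first frame following each hard cut, 0 elsewhere."""
    vector = [0] * n_frames
    for frame in first_frames_after_cut:
        vector[frame] = 1
    return vector


def moving_background_vector(n_frames, sequences):
    """1 on every frame inside a moving-background sequence, 0 elsewhere.

    sequences: list of (first_frame, last_frame) pairs, both inclusive.
    """
    vector = [0] * n_frames
    for first, last in sequences:
        for frame in range(first, last + 1):
            vector[frame] = 1
    return vector
```

For example, `frame_switch_vector(10, [3, 7])` marks two cuts in a hypothetical 10-frame clip, and `moving_background_vector(10, [(2, 4)])` marks frames 2 to 4 as having a moving background.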

Maturational changes in visual exploration of complex social scene

Sliding window approach

Besides understanding the behavioral correlates of atypical visual exploration in ASD, we wanted to characterize further the developmental pathway of visual exploration of the complex social scenes in both groups of children. We opted for a sliding window approach adapted from Sandini et al., 2018 to delineate fine-grained changes in visual exploration on a group level. Available recordings from our unstructured longitudinal sample were first ordered according to the age in both groups separately. Then, for each group, a window encompassing 20 recordings was progressively moved, starting from the first 20 recordings in the youngest subjects until reaching the end of the recording span for both groups. The choice of window width was constrained by the sample size of our TD group. The longitudinal visits in our cohort are spaced a year from each other, and the choice of a bigger window would result in significant data loss in our group of TD children as the windows were skipped if they contained more than one recording from the same subject. The chosen window width yielded 59 sliding windows in both groups that were age-matched and spanned the period from 1.88 to 4.28 years old on average.
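The window construction described above can be sketched as follows in Python. The sketch assumes that the window advances by one recording at a time, which the text does not state explicitly, and represents each recording as a hypothetical `(subject_id, age)` pair.

```python
def sliding_windows(recordings, width=20):
    """Age-ordered sliding windows of `width` recordings.

    recordings: list of (subject_id, age_in_years) pairs.
    Windows containing more than one recording from the same subject
    are skipped. A step of one recording is an assumption.
    """
    ordered = sorted(recordings, key=lambda rec: rec[1])
    windows = []
    for start in range(len(ordered) - width + 1):
        window = ordered[start:start + width]
        subjects = {subject for subject, _ in window}
        if len(subjects) < width:
            continue  # the same subject appears twice in this window
        windows.append(window)
    return windows


def window_mean_age(window):
    """Mean age of the recordings in one window."""
    return sum(age for _, age in window) / len(window)
```

Applying the same function to each group separately and pairing windows by their mean age would give age-matched windows of the kind described above.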

Upon the creation of sliding windows and to characterize each group’s visual behavior and its change with age, gaze data from the TD group were pooled together to define the TD distribution in each of the 59 age windows. To characterize the group visual behavior in the ASD group, we performed the same by pooling the gaze data together from the ASD group in each of the 59 age windows (see Figure 8A and B). We calculated the mean pairwise distance between all gaze coordinates on every frame for the measure of gaze dispersion in each of the two groups. Then we compared the relative gaze dispersion between groups on the estimated gaze density of each group in each age window separately.
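The dispersion measure used here, the mean pairwise distance between gaze coordinates on a frame, can be written compactly in Python. The function names are hypothetical, and averaging the per-frame values over the video is shown as one plausible way to summarize a window.

```python
import math
from itertools import combinations


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of gaze coordinates on a frame."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return float("nan")  # fewer than two gaze points on this frame
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def window_dispersion(frames):
    """Average the per-frame dispersion over frames with at least two points.

    frames: list of per-frame lists of (x, y) gaze coordinates pooled
    across the recordings of one age window.
    """
    values = [mean_pairwise_distance(pts) for pts in frames if len(pts) >= 2]
    return sum(values) / len(values) if values else float("nan")
```

A lower value indicates that the children in the window looked at more similar locations on that frame.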

To quantify the heteroscedasticity between groups across different ages, we computed the difference in dispersion (mean pairwise distance to members of own group), denoted as disp_t(TD) - disp_t(ASD), for each time window t. A permutation method was then used to obtain the distribution under the null hypothesis in each window t (H0: disp_t(TD) - disp_t(ASD) = 0). Thus, for each of the 59 windows, 100 permutations (i) were performed (i.e. the group membership of individuals was randomly shuffled), and we computed our statistic disp_ti(TD) - disp_ti(ASD) for each permuted sample i and each time window t. The hundred statistics per window thus formed a null distribution (the expected behavior of our statistic under the null hypothesis) against which we could compare the ‘real’ statistic estimated in the original sample. The p-value is the probability of obtaining a statistic at least as extreme as the one observed in our sample if H0 is true. The windows where the dispersion values showed statistically significant differences between the two groups are graphically presented with color-filled circles (Figure 8C).
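The permutation procedure for one age window can be sketched in Python as below. This is a simplified illustration: each individual is reduced to a single gaze coordinate on a single frame, the two-sided comparison of absolute values is an assumption about what "at least as extreme" means, and the function names are hypothetical.

```python
import math
import random
from itertools import combinations


def dispersion(points):
    """Mean pairwise Euclidean distance between gaze coordinates."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def dispersion_permutation_test(td_points, asd_points, n_perm=100, seed=0):
    """Permutation test of H0: disp(TD) - disp(ASD) = 0 in one window.

    Returns the observed statistic and a two-sided p-value (the two-sided
    form is an assumption made for this sketch).
    """
    rng = random.Random(seed)
    observed = dispersion(td_points) - dispersion(asd_points)
    pooled = list(td_points) + list(asd_points)
    n_td = len(td_points)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign individuals to the two groups
        null.append(dispersion(pooled[:n_td]) - dispersion(pooled[n_td:]))
    n_extreme = sum(abs(stat) >= abs(observed) for stat in null)
    p_value = (n_extreme + 1) / (n_perm + 1)
    return observed, p_value
```

With 100 permutations, the smallest attainable p-value under this formula is 1/101, which bounds the resolution of the test in each window.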

Appendix 1

Stability of the normative gaze distribution using simulated samples of varying size

The sample of 51 TD children whose gaze data were used to obtain a referent gaze distribution was a convenience sample. In the present study, we only included males due to the smaller number of females with ASD. Having this unique sample of TD children, we tested the stability of the referent distribution depending on the sample size by performing bootstrap analyses. Thus, from the available sample of 51 TD children, we performed 500 bootstraps for each sample size, starting with a sample size of 10 and increasing up to a sample size of 50. To measure the change in gaze distribution on one frame, we calculated the average pairwise distance between all gaze coordinates available on the frame. Then, for each frame, we calculated the variance of the average pairwise distance over the 500 resamples. Finally, the variance obtained was averaged over the 5150 frames to yield a unique value of the variance in gaze patterns per sample size (10-50). We then calculated the ‘cutoff,’ defined as the point beyond which a sample size increase no longer yields significant variation in the average variance. This was done using the kneed package implemented in Python, which estimates the point of maximal curvature (elbow in curves with positive concavity) in discrete data sets based on the mathematical definition of curvature for continuous functions (Satopaa et al., 2011) (see the figure accompanying this appendix). The elbow of the fitted curve on our bootstrapping data was found at 18, meaning that the distribution was estimated to be stable from a sample size of 18.
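The resampling part of this stability analysis can be sketched in Python for a single frame. The sketch draws subsamples without replacement, computes the mean pairwise distance for each, and reports its variance per sample size. The elbow detection step performed with the kneed package is not reproduced here, and the function names and the single-frame simplification are assumptions made for illustration.

```python
import math
import random
import statistics
from itertools import combinations


def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of gaze coordinates."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def variance_by_sample_size(points, sizes, n_resamples=500, seed=0):
    """Variance of the dispersion measure across resamples, per sample size.

    points: one gaze coordinate per child on a single frame (simplification;
    the study averages the variance over all frames of the video).
    """
    rng = random.Random(seed)
    result = {}
    for size in sizes:
        values = [mean_pairwise_distance(rng.sample(points, size))
                  for _ in range(n_resamples)]
        result[size] = statistics.pvariance(values)
    return result
```

Plotting the returned variances against sample size gives the kind of decreasing curve on which an elbow point can then be located.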

Appendix 1—figure 1
Stability of the normative distribution regarding the normative sample size.

The continuous function was estimated using the kneed Python package, with the average variance (over 5150 frames) of the mean pairwise distance of gaze coordinates on the frame across 500 bootstrapped samples (drawn without replacement) (y-axis) for sample sizes ranging from 10 to 50 (x-axis) as the input: elbow point = 18.

Appendix 2

Basic visual properties of a scene: Prediction of the gaze allocation across individual salience channels

Appendix 2—figure 1
Visual salience group differences across channels.

(A) From left to right: full saliency model with all five channels combined and channels taken separately: I-intensity, O-orientation, D-color, F-flicker, and M-motion channel. From top to bottom: Saliency map extracted for a given frame, Saliency map overlay on the original image, Original image with 15% most salient parts shown. B.

Appendix 3

Social and visual scene complexity

Appendix 3—figure 1
Illustration of the measures of social intensity and visual complexity.

(A) Three frames (denoted as a, b, c) illustrate three levels of social intensity; (B) Visual complexity depicted using the edges of the images detected using the Canny method (Canny, 1986) for the frames a, b, and c.

Appendix 4

Relation between the PI and behavioral phenotype in a paired longitudinal subsample at the first time point (T1)

Appendix 4—figure 1
Proximity Index and its relation to behavioral phenotype in children with autism spectrum disorder (ASD) who were seen twice, one year apart (the current figure depicts the initial (T1) visit).

The sample comprised 81 children with ASD who had a valid eye-tracking recording and a complete set of behavioral phenotype measures a year after the baseline (T2). The PI for this paired longitudinal cohort was established using an age-matched reference group of 29 Typically Developing (TD) children. PI was obtained at T1 and its correlation with the behavioral phenotype measures was assessed at the same time (T1). Loadings on the latent component were derived using PLS correlation analysis. The cross-correlation matrix included the Proximity Index (PI) on the imaging side A and three behavioral variables B. The behavioral matrix accounted for two domains of autistic symptoms as assessed by ADOS-2, Verbal and Preverbal Cognition (VPC) from the PEP-3, and the Adaptive Behavior Composite Score from the VABS-II. Error bars represent the bootstrapping 5th to 95th percentiles. Results that were not robust are indicated by a gray boxplot color. PI at baseline was positively correlated with developmental and adaptive functioning at baseline.

Appendix 5

Exploring confounding factors in the sliding window analysis of maturation in visual exploration within complex social scenes

Size

In response to the disparity in sample size between our two groups (51 TD and 166 ASD children), we implemented a methodology to mitigate the influence of this factor. We generated 100 bootstrapped ASD samples (without replacement), each with a size identical to that of the TD (51 subjects). These ASD samples were matched to the TD sample in terms of chronological age. Subsequently, for each of the bootstrapped samples, we aggregated all longitudinal data and computed the dispersion measure over time, akin to the process described in Figure 8, panel C. As illustrated in Panel A below, the results reveal that the bootstrapped ASD samples, characterized by both size and chronological age alignment with the TD group, exhibit higher levels of dispersion across the span of childhood years. This is in contrast to TD children, who exhibit a discernible pattern of progressive refinement in their visual exploration behavior.

It’s worth noting that, while permutation testing could have been an ideal method for assessing the statistical significance of the findings in this section, we opted not to implement it due to the substantial computational cost associated with our analyses. The computational demands of our study necessitated an alternative approach to address the sample size and age-matching issue effectively. Consequently, we relied on the bootstrapping technique to provide valuable insights into the dispersion differences between the TD and ASD groups, while acknowledging the limitations imposed by the computational constraints.

Phenotypic heterogeneity. To address the considerable developmental heterogeneity inherent in the ASD group, we decided to repeat the analyses in the subsamples of a more restricted range of developmental functioning. Thus we derived 100 simulated samples of the same size as the TD group (51) firstly within the normal developmental range (DQ above 80) and then, we performed the same for the lower-functioning individuals with ASD (DQ below 80). As shown in Panels B-C below, both groups show sustained dispersion over the childhood years, in contrast to the convergence seen in the TD group. This trend is particularly pronounced in the subset of individuals with lower developmental functioning (Panel C), wherein a discernible divergence becomes increasingly evident during the preschool years.

Developmental age. As a final step, to comprehensively address the question of the difference in developmental age between our TD and ASD samples, we implemented a sliding window approach using our cross-sectional sample (51 TD and 166 ASD children). However, in this approach, we used developmental age for creating age-matched windows instead of chronological age as previously used. We started with the first 20 recordings from subjects with the lowest developmental age and progressively shifted a window encompassing 20 recordings. This continued until the entire range of recordings for both groups was covered. Similar to the method applied in the main part of the manuscript, we excluded windows containing duplicate recordings from the same subject. This method yielded a total of 60 windows, each matched based on age, with developmental age in the ASD group and chronological age in the TD group (Panel D1). To test the stability of our findings and assess the potential influence of sample size, we replicated the sliding window procedure using 100 bootstrapped ASD samples, each comprising 51 subjects whose developmental age was matched to the chronological age of the TD subjects. For the purpose of interpretation, we plotted a linear regression line (in red) for each bootstrapped sample (Panel D2). Our results reinforce our initial findings obtained using chronological age-matched samples (Figure 8). Children with ASD consistently exhibit a greater degree of interindividual disparity across childhood years, in contrast to TD children. This outcome underscores the robustness of our findings and strengthens the validity of our observations.

Appendix 5—figure 1
Evolution of visual exploration patterns in young children with autism spectrum disorder (ASD) and the typically developing (TD) group using a sliding window and bootstrapping approach.

The dispersion in 100 bootstrapped samples of ASD recordings is given in red and the original group dispersion in the TD group is shown in blue. Panel A: ASD bootstrapped samples are matched to the TD group with regard to size (n=51) and chronological age; Panel B: ASD bootstrapped samples are matched to the TD group with regard to size (n=51) and chronological age, and have a DQ within the normal range (above 80); Panel C: ASD bootstrapped samples are matched to the TD group with regard to size (n=51) and chronological age, and have a DQ below the normal range (below 80); Panel D1: Evolution of visual exploration patterns in young children with ASD whose developmental age was matched to the chronological age of the TD group using a sliding window approach. Gaze dispersion was compared between the two groups using the mean pairwise distance of gaze coordinates on each frame. The dispersion was calculated across 60 sliding windows spanning 2.9–4.3 years of mental age on average (every circle represents a window encompassing 20 recordings); Panel D2: The sliding window approach was applied to the ASD bootstrapped samples that are matched to the TD group with regard to size (n=51) while mental age was aligned with the chronological age of the TD group.

Data availability

The Proximity Index method code and example data are publicly available at https://github.com/nadakojovic/ProximityIndexMethod (https://doi.org/10.5281/zenodo.10409645) and the data and codes used to produce figures of the current paper can be accessed at https://github.com/nadakojovic/ProximityIndexPaper (https://doi.org/10.5281/zenodo.10409651).

The following data sets were generated
    1. Kojovic N
    (2023) Zenodo
    nadakojovic/ProximityIndexMethod: ProximityIndexMethod:data&code.
    https://doi.org/10.5281/zenodo.10409645
    1. Kojovic N
    (2023) Zenodo
    nadakojovic/ProximityIndexPaper: ProximityIndexPaper:data&code.
    https://doi.org/10.5281/zenodo.10409651

References

  1. Book
    1. Association AP
    (1994)
    DSM-IV: Diagnostic and Statistical Manual of Mental Disorders
    American Psychiatric Association.
    1. Canny J
    (1986) A computational approach to edge detection
    IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8:679–698.
    https://doi.org/10.1109/TPAMI.1986.4767851
  2. Book
    1. David W
    (2014)
    WPPSI-IV, Échelle d’intelligence de Wechsler Pour Enfants / David Wechsler ; [Adaptation Française Par Les Éditions ECPA]
    ECPA Pearson.
    1. Goren CC
    2. Sarty M
    3. Wu PY
    (1975)
    Visual following and pattern discrimination of face-like stimuli by newborn infants
    Pediatrics 56:544–549.
  3. Book
    1. Green DM
    2. Swets JA
    (1966)
    Signal Detection Theory and Psychophysics
    Hoboken: John Wiley.
  4. Conference
    1. Harel J
    2. Koch C
    3. Perona P
    (2006) Graph-Based Visual Saliency
    Proceedings of Neural Information Processing Systems (NIPS).
    https://doi.org/10.7551/mitpress/7503.001.0001
    1. Itti L
    2. Koch C
    3. Niebur E
    (1998) A model of saliency-based visual attention for rapid scene analysis
    IEEE Transactions on Pattern Analysis and Machine Intelligence 20:1254–1259.
    https://doi.org/10.1109/34.730558
    1. Klin A
    2. Jones W
    3. Schultz R
    4. Volkmar F
    (2003) The enactive mind, or from actions to cognition: lessons from autism
    Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 358:345–360.
    https://doi.org/10.1098/rstb.2002.1202
    1. Koch C
    2. Ullman S
    (1985)
    Shifts in selective visual attention: towards the underlying neural circuitry
    Human Neurobiology 4:219–227.
  5. Book
    1. Lezoray S
    (2013)
    Trotro Est Amoureux
    Cartoon.
    1. Lord C
    2. Risi S
    3. Lambrecht L
    4. Cook EH
    5. Leventhal BL
    6. DiLavore PC
    7. Pickles A
    8. Rutter M
    (2000)
    The autism diagnostic observation schedule-generic: a standard measure of social and communication deficits associated with the spectrum of autism
    Journal of Autism and Developmental Disorders 30:205–223.
  6. Conference
    1. Lord C
    2. DiLavore PC
    3. Gotham K
    4. Guthrie W
    5. Luyster RJ
    6. Risi S
    7. Rutter M
    (2012)
    Autism diagnostic observation schedule: ADOS-2
    Calif: Western Psychological Services.
  7. Book
    1. Mullen EM
    (1995)
    Mullen Scales of Early Learning Manual
    American Guidance Service.
  8. Book
    1. Olsen A
    (2012)
    The Tobii I-VT Fixation Filter
    Tobii Technology.
  9. Conference
    1. Satopaa V
    2. Albrecht J
    3. Irwin D
    4. Raghavan B
    (2011) Finding a "Kneedle" in a Haystack: Detecting Knee Points in System Behavior
    2011 31st International Conference on Distributed Computing Systems Workshops. pp. 166–171.
    https://doi.org/10.1109/ICDCSW.2011.20
  10. Book
    1. Schopler E
    (2005)
    PEP-3: Psychoeducational Profile
    PRO-ED.
  11. Conference
    1. Shic F
    2. Chawarska K
    3. Bradshaw J
    4. Scassellati B
    (2008) Autism, eye-tracking, entropy
    In 2008 7th IEEE International Conference on Development and Learning. pp. 73–78.
    https://doi.org/10.1109/DEVLRN.2008.4640808
  12. Book
    1. Sparrow SS
    2. Balla D
    3. Cicchetti DV
    (2005)
    Vineland II: Vineland Adaptive Behavior Scales: Survey Forms Manual: A Revision of the Vineland Social Maturity Scale by Edgar A. Doll
    Pearson.
    1. Valenza E
    2. Simion F
    3. Cassia VM
    4. Umiltà C
    (1996) Face preference at birth
    Journal of Experimental Psychology. Human Perception and Performance 22:892–903.
    https://doi.org/10.1037//0096-1523.22.4.892

Decision letter

  1. Christian Büchel
    Senior and Reviewing Editor; University Medical Center Hamburg-Eppendorf, Germany
  2. Ralph Adolphs
    Reviewer; California Institute of Technology, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "Unraveling the Developmental Dynamic of Visual Exploration of Social Interactions in Autism" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Ralph Adolphs (Reviewer #2); Julia Yurkovic (Reviewer #3).

Comments to the Authors:

We are sorry to say that, after consultation with the reviewers, we have decided that your work will not be considered further for publication by eLife.

The reviewers agreed that while the study is conceptually well motivated, there is some lack of clarity regarding the adequacy of the TD control group. In addition, while the longitudinal component was noted as novel, individual differences were not sufficiently considered, and mechanistic insights obtained were deemed to be moderate.

We have also prepared an Evaluation Summary and Public Reviews of your work below, which are designed to transform your manuscript into a preprint with peer reviews.

Reviewer #1 (Recommendations for the Authors)

In the introduction, it would be helpful if the authors mention the age range and functioning level of participants for every prior study they cite that used eye-tracking to assess social attention in autism. In addition, statements such as "these atypicalities are present early" should include what is meant by early (what age range).

35 age-matched TD children were used to obtain normative gaze patterns, but it is not clear how this number was determined. Are there either power analyses or previous data suggesting that this sample size is sufficient to determine what is normative? A related issue is that of age norms. How were age-specific norms considered, given that the sample included children between 2-7 years of age and was not very large (e.g., it is not clear how many children of each age were included in the sample)?

In addition to child adaptive behavior, were any other assessments conducted and if so, were they related to visual exploration patterns in children? Were any of the visual exploration metrics related to severity of autism symptoms?

There is not much information provided regarding the TD sample. What screening tools and procedures were used to detect neurological or psychiatric diseases?

Reviewer #2 (Recommendations for the Authors)

The main novelty of the study lies in a longitudinal component. Some of the subjects were tested again, about a year apart. Using a sliding window approach, these data were used to derive continuous measures of gaze atypicality over age (this analysis was complemented by a more rigid analysis based only on those subjects with two datapoints one year apart). Median contour surface and convergence index were the two metrics used here. Both showed developmental differences between groups. Visual features in the video were examined using a saliency model, and social complexity was quantified as the number of characters on the screen while controlling for overall visual complexity. Overall, it was found that while spatial divergence in gaze decreased in the TD group (convergence index increased) over the developmental time window examined, this was not the case for the ASD group (Figure 3).

These are labor-intensive and valuable data that will certainly add to the literature on attention and visual processing in ASD. The longitudinal dataset is quite special, and the results are in broad strokes fairly compelling: there are different developmental trajectories in visual attention in ASD compared to TD controls. The methods are reasonably sophisticated (but see my comments below). My main hesitation with this study is that it stops short of providing a stronger mechanistic advance. In particular, the developmental eye tracking data are really the centerpiece of the work. However, they are not linked to any other developmental measures. Actually, it was unclear to me at which timepoints the other behavioral measures (Vineland, etc.) were measured. It would have been ideal to get the Vineland and other measures of functioning at each of the specific timepoints, one year apart, at which the eye tracking data were collected to examine associations. But that seems not to have been done (or at least is not reported).

A second shortcoming of the study is the lack of detail with respect to individual subjects (were there any clusters/subgroups/outliers of interest?) and with respect to stimulus features that could be driving the observed effects. There is the saliency analysis, and an analysis of how many characters are on the screen, but that is it. Surely one could quantify additional semantic features, even if ratings are obtained from adults. As it stands there are unconnected analyses in the study. For instance, I have the bottom of Figure 1, showing that proximity index in a subject with ASD varies tremendously over the frames of the movie. Does it vary like that in other subjects with ASD? Does this variation line up with specific features in the video? Most importantly, I would want to see the framewise plot like this (or better: slightly temporally smoothed) of proximity index differences between TD and ASD, as a function of the longitudinal data. What features in the video correlate with the increased convergence seen in TD that are missing in ASD? The analyses provided do not back out these specifics, and without them the story is more descriptive than mechanistic.

1. Since the TD group was always used as a reference, and single ASD participants were compared to this "norm", we do not have a good estimate of how TD subjects would look by comparison. A stronger approach would calculate the TD norm in a leave-one-out fashion and generate distributions for how each individual TD subject also compared – those are the data then to compare (single TD and single ASD) rather than just ASD individuals to one fixed TD average. The authors refer to their analyses as "data-driven", and it is perhaps that with respect to not using ROIs on the video, but they are testing specific differences between pre-defined groups (ASD and TD). It would substantially strengthen the paper if indeed a data-driven analysis were provided (clustering the subject groups on the basis of the data, rather than as predefined). This could also help reveal possible outliers (in both groups) as well as possible subgroups.

2. It is intriguing that the fixation durations between groups differed. I wonder if the authors would consider inspecting smooth pursuit, a type of oculomotor feature not mentioned but that could be relevant (and has been reported to be atypical in ASD).

3. Subjects: More information on the subjects is needed. On what criteria were ASD and TD matched, was level of functioning or intellect taken into consideration? Did ASD subjects have a DSM diagnosis? Was the ADI done? What exclusionary criteria were applied (epilepsy, comorbidity, medication, etc?). More info please.

4. Analysis: proximity index. Why was the proximity index not temporally smoothed? A frame-by-frame metric will (a) have relatively sparse normative distributions, and (b) show fairly discontinuous gaze proximity (as evident in the plot shown in Figure 1, bottom).

5. All of the analyses are within-sample and use standard parametric statistics; it would be preferable to use cross-validation together with permutation testing for a more robust approach. For correlations between gaze data and behavioral data, it seems that about 10 correlation analyses were done. It would be important to correct for the multiple tests. This is also a particularly problematic issue in the saliency analyses, where there are several that are barely at the magic "P<0.05" threshold.

6. Eyetracking exclusions. This is insufficiently described in the paper. We are only told that subjects were excluded if >45% of frames were dropped. First of all, this is an extremely lenient threshold. But we need to know what the distribution of dropped frames was between ASD and TD groups. We also need to know what other exclusions were applied to any portion of the data. Right now, only complete subject-wise exclusions are mentioned, but surely that was not the only criterion.

Reviewer #3 (Recommendations for the Authors)

These results are interesting and valuable to our understanding of the development of social visual attention. However, several weaknesses should be addressed that would strengthen the results. First, the authors assume Gaussian distribution of TD eye gaze, but some of their example figures show that this is not always the case. This may lead to lower proximity index scores and may inflate the significant results rather than reflecting the true proximity of gaze to the normative distribution. Second, it would be helpful to discuss more of the individual differences in normative viewing to help anchor some of the main points of the paper. Finally, causality is assumed in the lower-level visual saliency analyses when instead the social and visual saliency may be highly correlated (and inseparable).

It would be helpful if the authors could provide more details about the π scores, specifically the normalization of them. The authors should explain how the normalization of the π scores was conducted and if this normalization allows for consistency across frames, where the possible furthest distance from the mode of the Gaussian distribution may change depending on the x- and y-coordinates of the mode of the Gaussian on the screen. Additionally, a description of how the normalization of π scores may change based on the convergence of TD children (i.e., how peaked the distribution is) would be helpful. If and how these measures may be limited should be discussed. Additionally, Figure 1 shows a child's π score on a frame-by-frame level across the video. Frames where the child was looking offscreen were coded as -0.15. The authors should explain why this value was chosen, and why a value was chosen at all instead of excluding these frames from analysis. Additionally, it would be important to know that these frames are moments that the child is looking off-screen and not moments where the child is blinking.

It is unclear how the authors handled instances where there were two distinct clusters of gaze distribution and the distribution was therefore not Gaussian. This will directly impact the π score and may make children with ASD look more atypical in their viewing patterns than they truly are. For example, Figure 3b shows both groups having two distinct clusters of visual attention, but more children with ASD are attending to the second focal point than TD children. Additional information on these instances should be added, and limitations should be discussed.

It would be helpful if the authors could validate their normative gaze distribution with leave-one-out procedures or some other method to ensure that the normative distribution is not shifted by one participant. This same procedure would be necessary for the maturational sliding window to show that the gaze pattern is actually reflecting developmental change, not just change due to individual differences in two participants' gaze data (the participant newly included and the participant newly excluded in the sliding window).

It would be worthwhile to include a figure of the correlation of π and autism symptom severity in Figure 2.

In Figure 3c-d, what do the shaded regions represent?

Mean and standard errors of the Proximity Index were not reported for every comparison in Figure 4. This could also be collapsed to a difference score for each participant to allow easier comparisons across figures. This particular analysis is complex and would benefit from some greater explanation of the comparisons and how to interpret them in both the results and the Discussion section. This feels like it would be the primary analysis of the paper and it was under-developed in both the results and discussion.

The authors could consider strengthening their results by including some additional analyses exploring individual differences. Children with ASD on the whole become less convergent with each other over time, but are there children who become more convergent with the TD group and those who do not?

A decent amount of space in the paper is dedicated to analyses that are only included in the supplemental analyses. The authors may consider restructuring some of the main and supplemental texts such that the relevant figures will appear near the analyses.

The authors suggest a direction of causality wherein TD children are relying more on lower-level salient features of the scene than are children with ASD. However, salient features of a scene as predicted by the Itti & Koch model are often highly correlated with the social aspects of a scene. Especially in a cartoon, backgrounds remain consistent and the only movement is vibrantly-colored characters moving across the scene. The authors should edit the methods to exclude the suggestion of causality and the above point should be discussed.

The authors should discuss why gaze behavior correlated with adaptive behavior scales but not with overall autism symptom severity.

In general, additional context could be provided to the Results section to clarify what questions the authors were trying to answer with each analysis.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Unraveling the Developmental Dynamic of Visual Exploration of Social Interactions in Autism" for further consideration by eLife. Your revised article has been evaluated by Christian Büchel (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1) Clarify the "coarse movie characteristic". The description provided does not allow us to replicate this metric.

2) Details for the permutation testing on the longitudinal changes in gaze divergence using a sliding window method are missing. What exactly was done to compute the significance value?

3) Clarify the role of the "reference distribution". This distribution seems not to have been used in any of the analyses.

4) More comparisons of the TD and ASD groups at the individual subject level should be performed to provide more insightful information. This could also address the major concern of reviewer #3 that the groups are quite different with respect to homogeneity and size.

5) A control analysis taking into account mental age would be helpful as chronological and mental age seem to differ more in ASD and this analysis could rule out potential confounds that are indicative of general developmental delays rather than ASD-specific characteristics.

Reviewer #2 (Recommendations for the authors):

Points for clarification:

– I am a bit unclear on what the "coarse movie characteristic" is, exactly. The description provided would not allow me to replicate this metric – can a more quantitative description be provided?

– In response to Reviewer 3, RC4, the authors now provided permutation testing on the longitudinal changes in gaze divergence using a sliding window method. However, I am unclear on the details of the permutation testing. To quote from the paper: "To test for the statistical significance of the difference between the two groups, we employed random permutation testing across 59 age windows. Accordingly, in each of the 59 windows, gaze data from the TD (20) and ASD groups (20 recordings per window) were pooled together. We performed 100 randomly permuted resamples of equal size to the original distribution (20) from this pooled sample to compute the significance value. The windows where the MCS values showed statistically significant differences between the two groups are graphically presented with color-filled circles (Figure 8C)." What exactly was done to compute the significance value? As far as I understand, they calculated the group divergence from 20 resampled data to generate a 'null divergence' for each group, then calculate that difference 100 times as a null distribution of the group difference, and finally compare the actual group difference with this null distribution. However, from their response letter, the distribution seems to be the group-level dispersion, but not the group difference. In their rebuttal letter, the authors write, "We applied 100 permutations inside each of the 59 windows (containing 20TD + 20ASD gaze recordings) to derive a null distribution of the measure of dispersion -the average pairwise distance between gaze coordinates inside the group." This seems inconsistent.
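To make the reviewer's reading concrete, the sketch below builds a null distribution of the group difference in dispersion within one window by shuffling group labels over the pooled recordings. It is an illustration of that interpretation only, not the authors' procedure: the function names are hypothetical, the test is computed on a single frame of (x, y) gaze points rather than averaged over frames, and the two-sided comparison with a +1 correction is an assumption.

```python
import math
import random
from itertools import combinations

def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs of (x, y) gaze points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def permutation_p_value(td, asd, n_perm=100, seed=0):
    """Permutation test on the ASD minus TD difference in dispersion."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(asd) - mean_pairwise_distance(td)
    pooled = list(td) + list(asd)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        perm_td, perm_asd = pooled[:len(td)], pooled[len(td):]
        diff = mean_pairwise_distance(perm_asd) - mean_pairwise_distance(perm_td)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Assumption: +1 correction so the p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)
```

Under this reading, the null distribution describes the difference between groups; a null distribution of within-group dispersion alone, as the rebuttal letter suggests, would be a different test.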

– In the Method of the "Maturational changes in visual exploration of complex social scene – Sliding window approach", in the second paragraph, the authors mention a reference distribution. However, this reference distribution seems not to have been used in any of the analyses (it's not mentioned in the longitudinal results at all). To quote from the paper: "Upon the creation of sliding windows and to characterize the group's visual behavior and its change with age, gaze data from the TD group were pooled together to define the referent distribution in each of the 59 age windows. To characterize the group visual behavior in the ASD group, we performed the same by pooling the gaze data together from ASD in each of the 59 age windows (see Figure 8 A&B)." Indeed, Figure 8A is about how to decide the sliding window, and Figure 8B is about pairwise gaze dispersion, not about referent distribution.

Reviewer #3 (Recommendations for the authors):

The present research uses the Typical Development (TD) group as a normative reference, comparing individual participants with Autism Spectrum Disorder (ASD) against this reference. However, as pointed out by Reviewer #2, this approach doesn't allow for a comprehensive understanding of the variation within the TD group itself. The leave-one-out calculation suggests that the π is higher for TD, but the TD group is also more homogenous, so I am not sure this is truly informative. A comparison between TD and ASD groups at the individual subject level could provide more insightful information.

Additional Concerns:

i) The study matches the two groups based on chronological age rather than mental age. This introduces a potential confound as the differences reported may be indicative of general developmental delays rather than ASD-specific characteristics.

ii) The TD group is not only smaller in size but also less heterogeneous, which may be a potential explanation for the findings illustrated in Figure 8C.

https://doi.org/10.7554/eLife.85623.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Comments to the Authors:

Reviewer #1 (Recommendations for the Authors)

In the introduction, it would be helpful if the authors mention the age range and functioning level of participants for every prior study they cite that used eye-tracking to assess social attention in autism. In addition, statements such as "these atypicalities are present early" should include what is meant by early (what age range).

We thank the reviewer for drawing our attention to this point. Following the reviewer’s comment, we carefully revised the manuscript to provide age information when possible.

“[…] These atypicalities are observed as early as 2 months of age (Jones and Klin, 2013) and thus can exert tremendous impact on downstream developmental processes that critically depend on experience. […] Indeed, it has been shown that in the context of naturalistic static scenes, both children and adults with ASD tend to focus more on basic, pixel-level properties than on semantic categories, compared to their TD peers (Amso et al., 2014; Wang et al., 2015). […] For example, it has been shown that in the context of dynamic social content preschoolers with ASD tend to focus less on motion properties of the scene and more on intensity in comparison to age matched TD children (Shic et al., 2007). […] Quite strikingly, while viewing social scenes, toddler and school-age twins showed a high concordance not solely in the direction but also in the timing of their gaze movements (Constantino et al., 2017; Kennedy et al., 2017). […] Only few studies tackled the question of the moment-to-moment gaze deployment in ASD compared to TD. Indeed, while on this microstructural level TD children and adults show coherence in fixation targets, fine-grained gaze dynamic in their peers with ASD is highly idiosyncratic and heterogeneous (Avni et al., 2019; Falck-Ytter and Hofsten, 2011; Nakano et al., 2010; Wang et al., 2018).”

35 age-matched TD children were used to obtain normative gaze patterns, but it is not clear how this number was determined. Are there either power analyses or previous data suggesting that this sample size is sufficient to determine what is normative?

Indeed, a better definition of the sample size used as normative is warranted, considering its central role in our manuscript. To our knowledge, there is no consensus on the optimal size of a sample considered normative in the studies using analyses similar to the one we developed in this manuscript. Previous studies measuring the temporospatial "typicality of gaze" in ASD regarding the TD reference group also used convenience samples. The normative sample characteristics, such as size, age range, and gender distribution, varied notably. A study using multidimensional scaling had a normative group of 25 TD children (mean age: 3.1±1.11 years, 44% females) (Nakano et al., 2010); another study comparing the gaze of TD and ASD children to the mean gaze pattern of both groups included 40 TD children (mean age: 4.5±2.1, 37.5% females) (Avni et al., 2019) and finally, a study on gaze cohesion included 163 TD children (mean age: 21.89±3.39 months, 40.5% females) (Wang et al., 2018). Similarly, our earlier work focusing on eye-tracking derived Proximity Index (PI) coupled to EEG data from preschoolers with ASD used normative gaze data of variable size. In the first paper, we defined the normative gaze distribution using the gaze recordings from 18 TD children (mean age 3.1±0.9 years, 30% females) (Sperdin et al., 2018) while for the second, we were able to increase the sample size to 26 TD males (mean age 3.4±1.2 years) (Jan et al., 2019). While preparing these studies, we noticed that the normative gaze distribution would become stable after the sample size of 15. Thus, our initial manuscript version’s sample of 35 TD children seemed adequate. Thanks to continued data collection during the revision process, we were further able to increase the sample of TD children slightly, reaching a final normative sample size of 51 in the revised analysis presented in the newly submitted manuscript.

Still, we never formally addressed the stability of the normative gaze distribution in any of our previous works, nor did we conduct power analysis before preparing the sample. To answer the reviewer’s comment, we addressed the question of interindividual gaze stability. We conducted analyses to understand how much the distribution of the TD gaze would change if done on a smaller sample. Inspired by others (Schaer et al., 2015), we performed bootstrapping to simulate smaller samples from the total available sample of 51 TD. We simulated sample sizes ranging from 10 to 50 TD children in these analyses. For each sample size level, we obtained 500 bootstrapped samples over which we measured the stability of the distribution. These analyses demonstrated that the stability of the distribution in the TD sample is, on average, reached at a sample size of 18, Appendix 1—figure 1. In the revised version of the paper, we added these results as a subsection in Appendix 1, as follows:

“The sample of 51 TD children whose gaze data was used to obtain a normative gaze distribution was a convenience sample. In the present study, we only included males due to the fewer number of females with ASD. Having this unique sample of TD children, we tested the stability of the normative distribution depending on the sample size by performing bootstrap analyses. Thus, from the available sample of 51 TD children, we performed 500 bootstraps, starting with a sample size of 10 until reaching the sample size of 50. To measure the change in gaze distribution on one frame, we calculated the average pairwise distance between all gaze coordinates available on the frame. Then for each frame, we calculated the variance of the average pairwise distance over 500 resamples. Finally, the variance obtained was averaged over the 5150 frames to yield a unique value of the variance in gaze patterns per sample size (10-50). Then we calculated the "cutoff," as defined by a sample size increase no longer yielding significant variation in the average variance. This was done using the kneed package implemented in Python that estimates the point of maximal curvature ("elbow" in curves with positive concavity) in discrete data sets based on the mathematical definition of curvature for continuous functions (Satopaa et al., 2011) (see Figure 2). The elbow of the fitted curve on our bootstrapping data was found at 18, meaning that the distribution was estimated to be stable from a sample size of 18.”
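The stability analysis quoted above can be outlined in a few lines of Python. The sketch is a simplified illustration under stated assumptions, not the analysis code: the `gaze[frame][subject]` layout and all function names are hypothetical, subjects are resampled without replacement, and the elbow is located with a plain "farthest below the chord" rule on rescaled axes as a stand-in for the kneed package cited in the text.

```python
import math
import random
from itertools import combinations
from statistics import mean, pvariance

def mean_pairwise_distance(points):
    """Average Euclidean distance over all pairs of (x, y) gaze points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def dispersion_variance(gaze, sample_size, n_boot=500, seed=0):
    """Variance of per-frame dispersion across resamples, averaged over frames.

    gaze[f][s] is the (x, y) gaze point of subject s on frame f (toy layout).
    """
    rng = random.Random(seed)
    n_subjects = len(gaze[0])
    per_frame = [[] for _ in gaze]
    for _ in range(n_boot):
        # Assumption: subjects are drawn without replacement in each resample.
        chosen = rng.sample(range(n_subjects), sample_size)
        for f, frame in enumerate(gaze):
            per_frame[f].append(mean_pairwise_distance([frame[s] for s in chosen]))
    return mean(pvariance(values) for values in per_frame)

def elbow(xs, ys):
    """Return the x whose rescaled y lies farthest below the end-to-end chord."""
    nx = [(x - xs[0]) / (xs[-1] - xs[0]) for x in xs]
    lo, hi = min(ys), max(ys)
    ny = [(y - lo) / (hi - lo) for y in ys]
    gaps = [ny[0] + (ny[-1] - ny[0]) * x - y for x, y in zip(nx, ny)]
    return xs[gaps.index(max(gaps))]
```

Calling `dispersion_variance` for each sample size from 10 to 50 and passing the resulting curve to `elbow` gives an estimate of the sample size beyond which adding children changes the dispersion estimate very little.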

A related issue is that of age norms. How were age-specific norms considered, given that the sample included children between 2-7 years of age and was not very large (e.g., it is not clear how many children of each age were included in the sample)?

We agree with the reviewer that the age span of the normative sample in the previous version of the manuscript is relatively large. As explained in our previous answer, the normative sample increased from 35 to 51 during the revision process. We were also able to increase the size of the ASD sample; the exact age distribution of the revised cross-sectional sample (51 TD children and 166 children with ASD) is depicted in Author response image 1. At this initial level of analysis, we aimed for a broad norm covering a more extensive age range in childhood, so that a bigger sample of children with ASD (whose ages fell within the norm range) could be compared to this unique reference point. This first step of the analysis was important for delineating group differences in visual exploration during childhood. Following this first level of analysis, we adopted a more fine-grained approach to defining the trajectories of change in gaze patterns in both groups. Thus, using the available longitudinal data from the children included in the initial level of analyses and the sliding-window approach, we were able to define smaller age-matched groups covering more restricted age periods (Figure 8).

Author response image 1
Age distribution in our cross-sectional sample including 51 TD children and 166 children with ASD.

In this second level of analyses, where we tackled the developmental change of visual exploration, more closely matched age groups allowed us to get a better grasp of age-specific processes. As detailed in the Method section’s subsection, "Unstructured longitudinal sample," and following Sandini et al. (2018), we used a sliding-window approach to delineate developmental changes. Participants' repeated visits were included when they satisfied the inclusion criteria. This yielded a total of 308 recordings for the ASD group and 105 for the TD group (all recordings were collected a year apart; 101 children with ASD contributed two recordings each, and 41 children with ASD contributed three recordings each, while 33 and 21 TD children contributed two and three recordings each, respectively). Available recordings from our unstructured longitudinal sample were first ordered by age in each group separately. Then, a window encompassing 20 recordings was progressively moved across the two groups, starting from the first 20 recordings in the youngest subjects until reaching the end of the recording span for both groups. The choice of window width was constrained by the sample size of our TD group. The longitudinal visits in our cohort are spaced a year apart. A wider window would have resulted in substantial data loss in our group of TD children, as windows were skipped if they contained more than one recording from the same subject. The chosen window width of 20 yielded 59 age-matched sliding windows in both groups, spanning the period from 1.9 to 4.2 years of age on average (see Figure 6, Figure 8). As discussed in response to comment nº2, according to our stability analysis, a window width of 20 seemed a good compromise given that stability is reached at a sample size of 18. Moreover, it allows inferences on the gaze behavior of the two groups with sufficient temporal resolution between 1.9 and 4.2 years of age.
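The windowing logic described above can be sketched as follows. This is a minimal sketch under stated assumptions: the step of one recording, the function name `sliding_windows`, and the toy recordings are hypothetical, and the matching of windows between the two groups is not reproduced here.

```python
from statistics import mean


def sliding_windows(recordings, width=20):
    """Build age-ordered windows of `width` recordings for one group.

    recordings: list of (subject_id, age_in_years) tuples.
    Windows holding more than one recording from the same subject are skipped.
    Assumption: the window advances by one recording at a time.
    """
    ordered = sorted(recordings, key=lambda rec: rec[1])
    windows = []
    for start in range(len(ordered) - width + 1):
        window = ordered[start:start + width]
        subjects = [subject for subject, _ in window]
        if len(set(subjects)) < len(subjects):
            continue  # the same child appears twice in this window
        windows.append(window)
    return windows


# Toy example (hypothetical data) with a window width of 3.
toy = [("a", 2.0), ("b", 2.1), ("a", 2.2), ("c", 2.3), ("d", 2.4)]
for window in sliding_windows(toy, width=3):
    ages = [age for _, age in window]
    print([subject for subject, _ in window], round(mean(ages), 2))
```

In the toy run the first window is dropped because subject "a" appears in it twice, which mirrors the data loss that a wider window would cause when visits are spaced a year apart.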

Author response table 1
Comparison of mean age across 59 age-matched sliding windows for the TD and the ASD group.

p values were obtained using t-tests of the differences between the ASD and TD groups across the 59 windows.

Window   TD mean   TD SD   ASD mean   ASD SD   p value
SW 1     1.88      0.242   1.94       0.118    0.813
SW 2     1.93      0.224   1.97       0.111    0.931
SW 3     1.96      0.238   2.01       0.108    0.872
SW 4     2.00      0.251   2.05       0.111    0.880
SW 5     2.04      0.268   2.09       0.113    0.986
SW 6     2.09      0.286   2.14       0.114    0.912
SW 7     2.14      0.300   2.18       0.112    0.936
SW 8     2.18      0.307   2.22       0.100    0.933
SW 9     2.23      0.309   2.28       0.089    0.951
SW 10    2.28      0.308   2.33       0.066    0.965
SW 11    2.33      0.308   2.43       0.080    0.948
SW 12    2.38      0.301   2.48       0.073    0.928
SW 13    2.43      0.296   2.52       0.070    0.947
SW 14    2.48      0.285   2.57       0.065    0.992
SW 15    2.53      0.278   2.62       0.067    0.963
SW 16    2.57      0.271   2.66       0.071    0.936
SW 17    2.61      0.260   2.70       0.072    0.998
SW 18    2.66      0.242   2.74       0.077    0.985
SW 19    2.70      0.238   2.78       0.076    0.950
SW 20    2.74      0.233   2.83       0.069    0.937
SW 21    2.78      0.226   2.86       0.065    0.975
SW 22    2.82      0.215   2.90       0.060    0.911
SW 23    2.86      0.202   2.93       0.055    0.973
SW 24    2.90      0.193   2.96       0.059    1.000
SW 25    2.93      0.187   3.00       0.062    0.916
SW 26    2.96      0.189   3.04       0.062    0.980
SW 27    3.00      0.197   3.07       0.060    0.935
SW 28    3.04      0.201   3.11       0.057    0.958
SW 29    3.07      0.201   3.14       0.057    0.926
SW 30    3.11      0.204   3.18       0.063    0.952
SW 31    3.14      0.209   3.21       0.062    0.957
SW 32    3.18      0.210   3.25       0.059    0.971
SW 33    3.21      0.213   3.29       0.054    0.982
SW 34    3.25      0.217   3.33       0.054    0.962
SW 35    3.29      0.230   3.37       0.068    0.959
SW 36    3.33      0.240   3.42       0.070    0.977
SW 37    3.38      0.243   3.45       0.067    0.929
SW 38    3.42      0.242   3.49       0.062    0.966
SW 39    3.45      0.244   3.53       0.057    0.992
SW 40    3.49      0.242   3.56       0.066
SW 41    3.53      0.239   3.61       0.082    0.931
SW 42    3.57      0.236   3.64       0.088    0.919
SW 43    3.61      0.233   3.68       0.092    0.998
SW 44    3.65      0.230   3.73       0.091    0.951
SW 45    3.68      0.220   3.75       0.085    0.952
SW 46    3.72      0.212   3.80       0.072    0.981
SW 47    3.76      0.211   3.83       0.061    0.996
SW 48    3.79      0.209   3.86       0.054    0.950
SW 49    3.83      0.203   3.90       0.054    0.972
SW 50    3.87      0.199   3.94       0.062    0.920
SW 51    3.90      0.197   3.97       0.061    0.961
SW 52    3.94      0.190   4.01       0.055    0.977
SW 53    3.97      0.184   4.04       0.049    0.939
SW 54    4.01      0.179   4.07       0.047    0.998
SW 55    4.04      0.188   4.10       0.054    0.933
SW 56    4.07      0.199   4.14       0.068    0.998
SW 57    4.11      0.205   4.18       0.069    0.976
SW 58    4.14      0.213   4.23       0.063    0.941
SW 59    4.19      0.229   4.28       0.050    0.978

In addition to child adaptive behavior, were any other assessments conducted and if so, were they related to visual exploration patterns in children? Were any of the visual exploration metrics related to severity of autism symptoms?

Besides adaptive behavior, in all children included in our sample, we measured autistic symptoms and developmental functioning levels. As stated in the initial version of the manuscript, we assessed the symptoms of autism using the Autism Diagnostic Observation Schedule (ADOS), the developmental profile was assessed using the Psychoeducational Profile – 3rd version (PEP-3), and adaptive levels were measured using the Vineland Adaptive Behavior Scale – 2nd version (VABS-II). To understand how the Proximity Index (PI) was related to these three broad phenotype domains, we conducted several correlation analyses in the previous version of the manuscript. Visual exploration in children with ASD (expressed as PI) was positively related to developmental and adaptive functioning and negatively to symptom severity as measured by ADOS (although this latter association was not statistically significant). In the initial manuscript version, we reported the association between the PI and only the global scores across the three assessed domains, to limit the number of comparisons in the main manuscript. The relation between the PI and more specific skill domains (i.e., sub-scales of the tests used, such as the Communication or Socialization domain of the VABS-II) was presented only in the Supplementary material of the previous version of the manuscript.

In the current version of the paper, we changed the analysis strategy to obtain a more holistic appreciation of the relationship between the PI and the phenotype measures. Here we conducted a multivariate analysis allowing, in our opinion, a better grasp of the relation between the visual exploration of the given social scene and the more nuanced clinical behavioral characteristics of the children with ASD. The advantage of the current analysis is that we can appreciate the relation between the PI and several behavioral characteristics simultaneously and in relation to one another. This analysis confirmed that the children with better developmental and adaptive functioning levels across all assessed domains also had higher values of the PI. We found no significant relationship between the PI and symptom severity as measured by the ADOS at the initial time point (cross-sectional sample of 166 children) (see Figure 3). Additionally, in our longitudinal sample, we found that the PI, while still not related to the simultaneous measure of symptoms (ADOS scores at baseline, T1), was related to the symptoms a year later (T2). This finding is discussed in the current version of the paper. The description of the relations between visual exploration (as measured by PI) and the phenotype characteristics in the current version of the manuscript is as follows:

Less divergence in visual exploration is associated with better overall functioning in children with ASD

To explore how the gaze patterns, specifically divergence in the way children with ASD attended to the social content, related to the child’s functioning, we conducted a multivariate analysis. We opted for this approach to obtain a holistic vision of the relationship between visual exploration, as measured by PI, and different features of the complex behavioral phenotype in ASD. Behavioral phenotype included the measure of autistic symptoms and the developmental and functional status of the children with ASD. Individuals with ASD often present lower levels of adaptive functioning (Franchini et al., 2018; Hus Bal et al., 2015), and this despite cognitive potential (Klin et al., 2007). Understanding factors that contribute to better adaptive functioning in very young children is of utmost importance (Franchini et al., 2018) given the important predictive value of adaptive functioning on later quality of life. The association between behavioral phenotype and PI was examined using the PLS-C analysis (Krishnan et al., 2011; McIntosh and Lobaugh, 2004). This method extracts commonalities between two data sets by deriving latent variables representing the optimal linear combinations of the variables of the compared data sets. We built the cross-correlation matrix using the PI on the left (A) and 12 behavioral phenotype variables on the right (B) side (see Methods section for more details on the analysis).

[…]

The PLS-C conducted on simultaneous PI and phenotype measures at the first time point (T1-PI – T1 symptoms) essentially replicated the pattern we observed on a bigger cross-sectional sample. One significant LC (r=0.306 and p=0.011) showed higher PI co-occurring with higher cognitive and adaptive measures (see Figure 12). The cross-covariance matrix relating the PI at T1 to the phenotype at T2 also yielded one significant latent component (r=0.287 and p=0.033). Interestingly, the pattern reflected by this LC showed higher loading on the PI co-occurring with lower loading on autistic symptoms. Children who presented lower PI values at T1 were the ones with higher symptom severity at T2. The gaze pattern at T1 was related to neither cognition nor adaptation at T2 (see Figure 13, panel A). Finally, the simultaneous PLS-C done at T2 yielded one significant LC where higher loading of the PI coexisted with negative loading on autistic symptoms and higher positive loading on the adaptation score (r=0.322 and p=0.014; Figure 13, panel B). The level of typicality of gaze was related to the symptoms of autism at T2 (mean age of 4.05±0.929) but not at a younger age (mean age of 3.01±0.885). This finding warrants further investigation. Indeed, on the one hand, the way TD children comprehend the world changes tremendously during the preschool years, and this directly influences how the typicality of gaze is estimated. On the other hand, the symptoms of autism naturally change over the preschool years, and all these elements can be responsible for the effect we observe.”

There is not much information provided regarding the TD sample. What screening tools and procedures were used to detect neurological or psychiatric diseases?

We agree with the reviewer that the description of the TD group was insufficient in the initial manuscript version. Before inclusion in our cohort, we conduct screening interviews with the child’s parent(s). Children are not eligible for our TD group if they have been treated for any psychological or neurological problem, if the parents have concerns about their development, or if they have a first-degree relative with a known ASD. Following this initial screening, all eligible TD children in our sample underwent the same assessment as the children with ASD. The presence of autistic symptoms is excluded using direct observation (ADOS) and a parent-reported measure of symptoms (Autism Diagnostic Interview-Revised, ADI-R). In addition, we obtain detailed developmental and cognitive profiles and the profile of adaptive functioning in all TD children. We collect information on children’s medical history, including pregnancy history. Children are excluded from the TD group if they have a developmental delay (-1SD) in any assessed developmental area of functioning.

Following the reviewer’s comment, we have modified the Method section of the manuscript. We now include a table containing the detailed clinical characteristics of the two groups with regard to autistic symptoms (ADOS-2 & ADI-R), developmental functioning (PEP-3), and adaptive functioning (VABS-II) (Table 1). The table was added to the Methods section of the revised manuscript version, and the manuscript was modified as follows:

Experimental Model and Subject Details

One hundred sixty-six males with autism (3.37 ± 1.16 years) and 51 age-matched typically developing males (3.48 ± 1.29 years) participated in the study. Table 1 summarizes the clinical characteristics of our cross-sectional sample. Our study included only males due to the smaller number of females with ASD. The clinical diagnosis of autism was confirmed using the standardized observational assessment of the child and interviews with caregiver(s) retracing the child’s medical and developmental history. All children with ASD reached the cut-off for ASD on the Autism Diagnostic Observation Schedule-Generic (ADOS-G) (Lord et al., 2000) or the Autism Diagnostic Observation Schedule-2nd edition (ADOS-2) (Lord et al., 2012). For children who underwent the ADOS-G assessment, the scores were recoded according to the revised ADOS algorithm (Gotham et al., 2007) to ensure comparability with ADOS-2.

Before inclusion in the study, typically developing (TD) children were screened using a questionnaire focusing on medical history and history of pregnancy. Children were not included in our TD group if they were born prematurely or had a positive screen for the presence of any known neurological or psychiatric disorder in the child itself or the known case of ASD in any first-degree relative of the child. Moreover, all TD children were also assessed using the ADOS-G or ADOS-2 evaluations to exclude the presence of ASD symptoms. The majority of TD participants had a minimal severity score of 1, except four children who had a score of 2.

The data for the current study were acquired as a part of a larger longitudinal study of early development in autism based in Geneva. Detailed information about cohort recruitment has been given elsewhere (Franchini et al., 2018; Franchini et al., 2017; Kojovic, 2019). This study protocol was approved by the Ethics Committee of the Faculty of Medicine of Geneva University, Switzerland. All families gave written informed consent to participate.”

Reviewer #2 (Recommendations for the Authors)

The main novelty of the study lies in a longitudinal component. Some of the subjects were tested again, about a year apart. Using a sliding window approach, these data were used to derive continuous measures of gaze atypicality over age (this analysis was complemented by a more rigid analysis based only on those subjects with two datapoints one year apart). Median contour surface and convergence index were the two metrics used here. Both showed developmental differences between groups. Visual features in the video were examined using a saliency model, and social complexity was quantified as the number of characters on the screen while controlling for overall visual complexity. Overall, it was found that while spatial divergence in gaze decreased in the TD group (convergence index increased) over the developmental time window examined, this was not the case for the ASD group (Figure 3).

These are labor-intensive and valuable data that will certainly add to the literature on attention and visual processing in ASD. The longitudinal dataset is quite special, and the results are in broad strokes fairly compelling: there are different developmental trajectories in visual attention in ASD compared to TD controls. The methods are reasonably sophisticated (but see my comments below). My main hesitation with this study is that it stops short of providing a stronger mechanistic advance. In particular, the developmental eye tracking data are really the centerpiece of the work. However, they are not linked to any other developmental measures. Actually, it was unclear to me at which timepoints the other behavioral measures (Vineland, etc.) were measured. It would have been ideal to get the Vineland and other measures of functioning at each of the specific timepoints, one year apart, at which the eye tracking data were collected to examine associations. But that seems not to have been done (or at least is not reported).

We thank the reviewer for his time invested in reviewing our work and for the very insightful and inspiring comments. We are glad that the reviewer recognizes the contribution of our study to understanding the early development of visual exploration in autism.

We are thankful that the reviewer pointed out the lack of clarity in the previous version of the manuscript regarding the behavioral measures available at different measurement points. In all children included in our sample, we measured autistic symptoms and developmental and adaptive functioning levels. Thus, each time we obtained the eye-tracking measures, we also obtained the measures in these three phenotype domains. As stated in the initial version of the manuscript, we assessed the symptoms of autism using the ADOS (Lord et al., 2000; Lord et al., 2012), the developmental profile was assessed using the PEP-3 (Schopler, 2005), and adaptive levels were measured using the VABS-II (Sparrow, Balla, and Cicchetti, 2005).

In the present manuscript version, we significantly increased our sample size: the cross-sectional sample now includes 166 children with ASD compared to 59 in our initial manuscript version, and the longitudinal sample consists of 81 children with ASD (previously 34). Another important change concerns the analysis strategy. In the present manuscript version, we used the multivariate approach to test the relation between the PI and behavioral phenotype (autistic symptoms, developmental and adaptive levels). As the sample size increased, we could deploy the multivariate analyses not solely on the cross-sectional sample but also on the longitudinal sample. By exploring the link between the PI and clinical phenotype measures in the longitudinal sample, we added an element that was critically lacking in the previous version of the manuscript. Here we tested both the simultaneous relation (at baseline, T1, and at one-year follow-up, T2) and how the PI at baseline was related to the behavioral phenotype at one-year follow-up.

A second shortcoming of the study is the lack of detail with respect to individual subjects (were there any clusters/subgroups/outliers of interest?) and with respect to stimulus features that could be driving the observed effects. There is the saliency analysis, and an analysis of how many characters are on the screen, but that is it. Surely one could quantify additional semantic features, even if ratings are obtained from adults. As it stands there are unconnected analyses in the study. For instance, I have the bottom of Figure 1, showing that proximity index in a subject with ASD varies tremendously over the frames of the movie. Does it vary like that in other subjects with ASD? Does this variation line up with specific features in the video? Most importantly, I would want to see the framewise plot like this (or better: slightly temporally smoothed) of proximity index differences between TD and ASD, as a function of the longitudinal data. What features in the video correlate with the increased convergence seen in TD that are missing in ASD? The analyses provided do not back out these specifics, and without them the story is more descriptive than mechanistic.

We thank the reviewer for the detailed suggestions. We did not identify outliers driving the observed effects (no PI value was more than three scaled median absolute deviations (MAD) away from the group median). With the current sample of 166 children with ASD, we confirm the finding from the initial version, based on 59 children, that more divergence in gaze patterns in the ASD group at baseline was related to poorer cognitive and adaptive functioning. The fact that the findings are confirmed on a sample almost three times bigger argues against outliers driving the initial effects.
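For readers unfamiliar with the criterion, the outlier rule mentioned above (three scaled MAD from the median) can be sketched as follows. The function name and the toy values are hypothetical; the scaling constant 1.4826 is the usual factor that makes the MAD comparable to a standard deviation under normality, and we assume that this is the scaling meant here.

```python
from statistics import median

# Usual consistency constant for normally distributed data (assumed here).
MAD_SCALE = 1.4826


def mad_outliers(values, n_mads=3.0):
    """Return the values lying more than n_mads scaled MADs from the median."""
    center = median(values)
    scaled_mad = MAD_SCALE * median([abs(v - center) for v in values])
    return [v for v in values if abs(v - center) > n_mads * scaled_mad]


# Toy PI-like values (hypothetical): only 0.95 is flagged.
print(mad_outliers([0.50, 0.52, 0.48, 0.51, 0.49, 0.95]))
```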

Considering the movie content, in the initial manuscript version, we focused only on social complexity, visual complexity, and saliency, which were analyzed separately. Following the reviewer’s suggestions, we have now extracted additional movie features in the current version of the work. Thus, besides the social and visual complexity, we now include the vocal characteristics of the movie, namely whether vocalizations/verbalizations were socially addressed (directed speech) or not (monologue). Finally, we also marked the coarse characteristics of the movie sequence that might influence gaze deployment, namely rapid changes in the movie (frame switch) and slower changes where the whole background of the scene moved to follow the motion of the characters. To appreciate the relation of all these movie features with the PI, especially considering the lack of orthogonality among them, we conducted a multivariate analysis. In addition to the six regressors obtained from the annotation of the movie content, we added the salience information to obtain a more global view of which elements contributed most to gaze deployment in the ASD group. As described in the revised text shown below, the PI was most influenced by social complexity, followed by visual complexity, vocal aspects of the movie, and finally, coarse characteristics of the movie sequence (rapid switch and slide) and salience.

Methods:

“Movie characteristics

Social complexity

Furthermore, given the findings of the important impact of the intensity of social content on social attention in ASD (Chita-Tegmark, 2016; Frank, Vul, and Saxe, 2012), we aimed to test the hypothesis that the Proximity Index values would be lower for the moments in the videos with enhanced social complexity, involving two or three characters, compared to moments involving only one character (Figure 8A). However, with an increasing number of characters, the scene is inevitably richer in detail, an issue we address through measuring visual and vocalization complexity.

Visual complexity

To measure visual complexity, we calculated the length of edges delimiting image elements (see Figure 8B). Edge extraction was done on every image of the video using the Canny method (Canny, 1986) implemented in Matlab (version 2017a; Mathworks, Natick, MA). This method finds edges by looking for local maxima of the intensity gradient. The gradient is obtained using the derivative of a Gaussian filter and uses two thresholds to detect strong and weak edges. Weak edges are retained only if connected to strong edges, which makes this method relatively immune to noise.

Vocal video aspects: Monologue and Directed speech

Speech properties of the scenes were also analyzed, using the BORIS software (https://www.boris.unito.it/). We manually identified the moments when characters were vocalizing or speaking. Then we annotated the moments as a function of the social directness of the speech. In particular, we distinguished between monologue (characters thinking out loud or singing) and moments of socially directed speech (invitation to play and responses to invitations).

Coarse movie characteristics: Frame switching and moving background

Finally, to test how global characteristics of the video media, scene changes, or type of the scenes would influence gaze deployment, we extracted moments of the frame switch and moments where the background would move (slide) to follow the movement of the characters along movie scenery.”

Results:

“The association of movie content with divergence in visual exploration in ASD group

Taking into account previous findings of enhanced difficulties in processing more complex social information (Chita-Tegmark, 2016; Frank, Vul, and Saxe, 2012; Parish-Morris et al., 2019) in individuals with ASD, we tested how the intensity of social content influenced visual exploration of the given social scene. As detailed in the Methods section, social complexity was defined as the total number of characters for a given frame and ranged from 1 to 3. Frames with no characters represented a very small minority (0.02% of total video duration) and were excluded from the analysis. We also analyzed the influence of the overall visual complexity of the scene on this divergent visual exploration in the ASD group. The total length of edges defining details on the images was employed as a proxy for visual complexity (see Methods section for more details). Additionally, we identified the moments of vocalization (monologues versus directed speech) and more global characteristics of the scene (frame cuts and sliding background) to understand better how these elements might have influenced gaze allocation. Finally, as an additional measure, we considered how well the gaze of ASD children was predicted by the GBVS salience model, that is, the average ROC scores we derived in the previous section (Figure 9, panel A).

To explore the relationship between the PI and different measures of the movie content, we again used a PLS-C analysis, which is more suitable than the GLM in the case of strong collinearity of the regressors. This is particularly the case for visual and social complexity (r = 0.763, p < 0.001), as well as social complexity and vocalization (r = 0.223, p < 0.001), as can be appreciated in Figure 9, panel B. The PLS-C produced one significant latent component (r = 0.331, p < 0.001). The latent component pattern was such that lower PI was related to higher social complexity, followed by higher visual complexity and the presence of directed speech. In addition, moments including characters engaged in monologue, moments of frame change, and background sliding increased the PI in the group of ASD children. The monologue scenes also coincide with the moments of lowest social complexity, which produce higher PI values. For the frame switch and the sliding background, the TD reference appears more dispersed, as children may recalibrate their attention onto the new or changing scene. This makes the referent gaze distribution more variable in these moments and thus gives the gaze of children with ASD more chance to fall into the reference space, as it is larger. Finally, visual salience also positively contributed to the PI loading, which is in line with our previous finding of the salience model being more successful in predicting TD gaze than ASD gaze.”
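To illustrate the logic of the PLS-C used above, here is a minimal sketch for the special case of a single variable on one side (the PI) and several variables on the other. In that case the cross-correlation matrix reduces to a vector, so its singular value is the vector norm and the saliences are the normalized correlations. The function names, the toy data, and the permutation test are assumptions for illustration; this is not the implementation used for the reported results.

```python
import math
import random


def zscore(values):
    """Standardize a list of numbers (sample SD, n - 1 denominator)."""
    m = sum(values) / len(values)
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))
    return [(v - m) / sd for v in values]


def plsc_single(x, columns):
    """PLS-C between one variable x and several variables (list of columns).

    Returns the singular value and the saliences of the columns.
    """
    n = len(x)
    zx = zscore(x)
    # Cross-correlation vector between x and each column.
    r = [sum(a * b for a, b in zip(zx, zscore(col))) / (n - 1)
         for col in columns]
    singular_value = math.sqrt(sum(v * v for v in r))
    saliences = [v / singular_value for v in r]
    return singular_value, saliences


def permutation_p(x, columns, n_perm=1000, seed=0):
    """Assumed significance test: shuffle x and compare singular values."""
    rng = random.Random(seed)
    observed, _ = plsc_single(x, columns)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if plsc_single(shuffled, columns)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): one feature rises with x, the other falls.
x = [float(i) for i in range(30)]
features = [[2 * v + 1 for v in x], [-v for v in x]]
value, saliences = plsc_single(x, features)
print(round(value, 3), [round(s, 3) for s in saliences])
```

With more than one variable on each side, the same idea requires a full singular value decomposition of the cross-correlation matrix, with one latent component per singular value.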

1. Since the TD group was always used as a reference, and single ASD participants were compared to this "norm", we do not have a good estimate of how TD subjects would look by comparison. A stronger approach would calculate the TD norm in a leave-one-out fashion and generate distributions for how each individual TD subject also compared – those are the data then to compare (single TD and single ASD) rather than just ASD individuals to one fixed TD average. The authors refer to their analyses as "data-driven", and it is perhaps that with respect to not using ROIs on the video, but they are testing specific differences between pre-defined groups (ASD and TD). It would substantially strengthen the paper if indeed a data-driven analysis were provided (clustering the subject groups on the basis of the data, rather than as predefined). This could also help reveal possible outliers (in both groups) as well as possible subgroups.

We are extremely grateful to the reviewer for these precious comments. We agree that the absence of an additional control group was a true limiting factor of our manuscript. Following the suggestion of the reviewers and using the leave-one-out method, we obtained the values of the PI for our TD group (Figure 4). The following paragraph has been added to the Results section of the manuscript:

“As the gaze data of the TD group were used as a reference, we wanted to understand how their individual gazing patterns would behave compared to a fixed average. Due to the absence of an additional control group, we employed the leave-one-out method to obtain the values of the PI for the 51 TD children. In this manner, the gazing pattern of each TD child was compared to the norm comprising the gaze data of 50 other TD children. The difference in average PI values between the two groups was significant, t(215) = 5.51, p < 0.001, with considerable heterogeneity, especially in the ASD group (Figure 2).”
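The leave-one-out scheme can be illustrated with a short sketch. The scoring function below is a deliberately simple stand-in and is not the Proximity Index; it only shows how each child is scored against a reference built from all the other children in the group. All names and the toy coordinates are hypothetical.

```python
import math


def toy_typicality(point, reference_points):
    """Stand-in score (NOT the Proximity Index): higher when the gaze point
    lies closer, on average, to the reference gaze points."""
    mean_dist = (sum(math.dist(point, q) for q in reference_points)
                 / len(reference_points))
    return 1.0 / (1.0 + mean_dist)


def leave_one_out_scores(group_points, score_fn=toy_typicality):
    """Score each child against a reference made of all other children."""
    scores = []
    for i, point in enumerate(group_points):
        reference = group_points[:i] + group_points[i + 1:]
        scores.append(score_fn(point, reference))
    return scores


# Toy gaze positions on one frame (hypothetical): the last child looks far
# away from the others and therefore receives the lowest score.
gaze = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
print([round(s, 3) for s in leave_one_out_scores(gaze)])
```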

Regarding the reviewer’s comment on clustering the subjects using a data-driven approach, we completely agree with the reviewer that such a fully data-driven method would be more interesting. However, data-driven clustering is beyond the scope of the current study and would substantially expand the breadth and number of levels of an already long study. At this stage, our aim was to present a novel method for measuring deviation in gaze patterns that can be used to measure the developmental dynamics of gaze deployment in children with ASD and to establish its relation with the clinical phenotype, both at baseline and at one-year follow-up. We felt that this already represented a comprehensive endeavor. That being said, we fully agree with the reviewer about the potential of a data-driven approach. Our next goal will be to explore the potential of the new measure to enable the classification of the two groups without an a priori definition of the class. For this, we would need to consider the dynamic properties of the PI over the duration of the entire video to predict the class (ASD or TD) of our participants, as the aggregated values (PI averaged over the video) show considerable overlap between the two groups, as shown in Figure 2. An unsupervised class attribution using the eye-tracking measure may be a promising avenue. We hope, however, that the reviewer agrees with us that this would be better suited to a separate paper discussing this new eye-tracking measure as a potentially useful screening tool for the detection of ASD, which is outside the scope of the current paper.

2. It is intriguing that the fixation durations between groups differed. I wonder if the authors would consider inspecting smooth pursuit, a type of oculomotor feature not mentioned but that could be relevant (and has been reported to be atypical in ASD).

Indeed, this is a precious suggestion. A more fine-grained gaze behavior analysis is always appealing. Up to this point, we have explored mostly coarse characteristics of the visual exploration style in the current manuscript, using the properties of fixations and saccades. However, considering that the manuscript is already rather rich, encompassing two separate types of analysis (cross-sectional and longitudinal), we were reluctant to broaden the scope by adding a new set of analyses. We are thankful to the reviewer for this valuable suggestion and will explore the potential of this visual behavior in our future work.

3. Subjects: More information on the subjects is needed. On what criteria were ASD and TD matched, was level of functioning or intellect taken into consideration? Did ASD subjects have a DSM diagnosis? Was the ADI done? What exclusionary criteria were applied (epilepsy, comorbidity, medication, etc?). More info please.

We apologize for the lack of precise characterization of the two groups in our initial sample. Our two groups were age-matched in both the previous and current versions of the manuscript. In both versions our sample only included males, as the number of females in the ASD group was smaller than the number of males. Having a sex-mixed sample would mean that we would need to substantially lower the sample size, which would not be optimal considering our analysis strategy. In addition, we were very cautious about including females, as previous research (Harrop et al., 2018, 2019) has highlighted sex-related differences in visual exploration, and this difference would warrant more detailed characterization using our method. We, of course, recognize the importance of a more in-depth characterization of sex differences in visual exploration in young children using our method.

We did not match our two groups based on intellectual or functional levels. In our opinion, matching the children on these two criteria would drive us away from our intention to better understand visual exploration in a representative group of children with ASD, and not only those on the higher-functioning end of the autistic spectrum. All children with ASD had a DSM-5-informed ASD diagnosis. We obtained a detailed medical and developmental history of all children in our sample. In addition, children were assessed on three broad domains, namely autistic symptoms (using ADOS-G/ADOS-2 and ADI-R), developmental functioning (PEP-3), and adaptive functioning (VABS-II). Regarding exclusion criteria, potential participants were excluded if they presented a known genetic condition with autism-like traits (such as Fragile X or Tuberous Sclerosis). To our knowledge, no children in the sample that satisfied the specific eye-tracking inclusion criteria had such genetic conditions. One child had a known diagnosis of epilepsy but was not receiving medication at the time of our study. In the revised manuscript, we added a table that compares the two groups with regard to their age, the severity of autism symptoms, and developmental and adaptive functioning across domains (see Table 1). Also, we modified the participant description paragraph in the Method section as follows:

4. Analysis: proximity index. Why was the proximity index not temporally smoothed? A frame-by-frame metric will (a) have relatively sparse normative distributions, and (b) show fairly discontinuous gaze proximity (as evident in the plot shown in Figure 1, bottom).

We are aware that a frame-by-frame metric such as the Proximity Index is prone to sparseness, and we understand the reviewer’s concern that temporal smoothing could prove useful for analyses of the time course. Nevertheless, we did not perform smoothing or any other intervention on the frame-by-frame derived π values, because we only use average π values (averaged over all frames) in all analyses presented in the manuscript. As such, using the raw signal or the smoothed signal would not change any of the presented results.

Following the reviewer’s highly pertinent comments on the relation between the content of the video and the Proximity Index, we applied smoothing for visual illustration in the section dealing with the movie content and its effects on gaze deployment (Figure 6, first panel, in red).

5. All of the analyses are within-sample and use standard parametric statistics; it would be preferable to use cross-validation together with permutation testing for a more robust approach. For correlations between gaze data and behavioral data, it seems that about 10 correlation analyses were done. It would be important to correct for the multiple tests. This is also a particularly problematic issue in the saliency analyses, where there are several that are barely at the magic "P<0.05" threshold.

We completely agree with the reviewer that the statistical design used in our initial manuscript version, with several correlations presented simultaneously, was not optimal. Following the reviewers’ comments, the revised manuscript uses a multivariate approach better suited for dealing with highly collinear variables. The advantage of the current analysis is that we can appreciate the relation between the π and several behavioral characteristics simultaneously and in relation to one another. The results of the multivariate approach confirm that higher values of the Proximity Index are found in children with better developmental and functional levels across all assessed domains. We found no stable association between the π and symptom severity as measured by the ADOS at baseline. However, the π at baseline was associated with the phenotype measures one year later in our longitudinal sample. The description of the relation between visual exploration (PI) and the phenotype characteristics (cross-sectional and longitudinal analyses) is now included in the current version of the manuscript as follows:

“Less divergence in visual exploration is associated with better overall functioning in children with ASD

To explore how the gaze patterns, specifically divergence in the way children with ASD attended to the social content, related to the child’s functioning, we conducted a multivariate analysis. We opted for this approach to obtain a holistic vision of the relationship between visual exploration, as measured by PI, and different features of the complex behavioral phenotype in ASD. Behavioral phenotype included the measure of autistic symptoms and the developmental and functional status of the children with ASD. Individuals with ASD often present lower levels of adaptive functioning (Franchini et al., 2018; Hus Bal et al., 2015) and this despite cognitive potential (Klin et al., 2007). Understanding factors that contribute to better adaptive functioning in very young children is of utmost importance (Franchini et al., 2018) given the important predictive value of adaptive functioning on later quality of life. The association between behavioral phenotype and π was examined using the PLS-C analysis (Krishnan et al., 2011; McIntosh and Lobaugh, 2004). This method extracts commonalities between two data sets by deriving latent variables representing the optimal linear combinations of the variables of the compared data sets. We built the cross-correlation matrix using the π on the left (A) and 12 behavioral phenotype variables on the right (B) side (see Methods section for more details on the analysis).

In our cohort, child autistic symptoms were assessed using the ADOS (Lord et al., 2000; Lord et al., 2012), child developmental functioning using the PEP-3 scale (Schopler, 2005) and child adaptive behavior using the Vineland Adaptive Behavior Scales, Second Edition, (Sparrow, Balla, and Cicchetti, 2005). Thus the final behavior matrix included two domains of autistic symptoms from the ADOS: social affect (SA) and repetitive and restricted behaviors (RRB); six subscales of the PEP-3: verbal and preverbal cognition (VPC), expressive language (EL), receptive language (RL), fine motor skills (FM), gross motor skills (GM), oculomotor imitation (OMI) and four domains from VABS-II: communication (COM), daily living skills (DAI), socialization (SOC) and motor skills (MOT). Age was regressed from both sets of the imputed data.

The PLS-C yielded one significant latent component (r = 0.331, p = 0.001), best explaining the cross-correlation pattern between the π and the behavioral phenotype in the ASD group. The significance of the latent component was tested using 1000 permutations, and the stability of the obtained loadings was tested using 1000 bootstrap resamples. Behavioral characteristics that showed stable contributions to the pattern reflected in the latent component are shown in red in Figure 7. Higher values of the π were found in children with better developmental functioning across all six assessed domains and better adaptive functioning across all four assessed domains. Autistic symptoms did not produce a stable enough contribution to the pattern (loadings shown as gray bars in Figure 7). Still, numerically, a more TD-like gazing pattern (high PI) was seen in the presence of fewer ASD symptoms (negative loading of both SA and RRB scales of the ADOS-2). Despite the lack of stability of this pattern, the loading directionality of ASD symptoms is in line with the previous literature (Avni et al., 2019; Wen et al., 2022), showing a negative relationship between visual behavior and social impairment. Among the developmental scales, the biggest loading was found on verbal and preverbal cognition, followed by fine motor skills. While the involvement of verbal and preverbal cognition in the PI, an index of visual exploration of these complex social scenes, is no surprise, the role of fine motor skills might be harder to grasp. Interestingly, in addition to measuring the control of the small muscle groups of the hand and wrist, the fine motor scale also reflects the capacity of the child to stay focused on the activity while performing controlled actions. Thus, besides the measure of movement control, which is relevant because scene viewing implies control of eye movement, the attentional component measured by this scale might explain the high involvement of the fine motor scale in the latent construct pattern we obtain.”
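
The PLS-C steps described in the excerpt above can be illustrated with a minimal sketch. It assumes z-scored inputs, a singular value decomposition of the cross-correlation matrix, and a permutation test on the first singular value. It is a generic illustration of the technique, not the authors' code: the function names are hypothetical, and age regression, imputation, and the bootstrap test of loading stability are omitted.

```python
import numpy as np

def zscore(a):
    """Column-wise z-scoring (sample standard deviation)."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

def plsc_first_component(X, Y, n_perm=1000, seed=0):
    """Hypothetical helper: PLS-C between X (n x p) and Y (n x q).

    Returns the first singular value of the cross-correlation matrix
    and a permutation p-value obtained by shuffling the rows of X.
    """
    rng = np.random.default_rng(seed)
    Xz, Yz = zscore(np.asarray(X, float)), zscore(np.asarray(Y, float))
    n = Xz.shape[0]

    def first_singular_value(Xp):
        R = Yz.T @ Xp / (n - 1)          # cross-correlation matrix (q x p)
        return np.linalg.svd(R, compute_uv=False)[0]

    observed = first_singular_value(Xz)
    null = np.array([first_singular_value(Xz[rng.permutation(n)])
                     for _ in range(n_perm)])
    p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_value
```

With a single gaze measure on one side, `X` would have shape `(n_children, 1)` and `Y` shape `(n_children, 12)`.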

“More divergence in visual exploration is associated with unfolding autistic symptomatology a year later

To capture the developmental change in the π and its relation to clinical phenotype, we conducted the multivariate analysis considering only the subjects that had valid eye-tracking recordings at two time points one year apart. Out of 94 eligible children (having two valid eye-tracking recordings a year apart), 81 had a complete set of phenotype measures. All 94 children had an ADOS, but ten children were missing PEP-3 (nine were assessed using the Mullen Scales of Early Learning (Mullen, 1995), and one child was not testable at the initial visit), and three children were missing VABS-II as the parents were not available for the interview at a given visit. The proximity index in this smaller paired longitudinal sample was defined using the age-matched reference composed of 29 TD children spanning the age range of 1.66-5.56 years who also had a valid eye-tracking recording a year later. As the current subsample was smaller than the initial one, we limited our analyses to more global measures, such as domain scales (not the test subscales as in our bigger cross-sectional sample). Thus, for the measure of autistic symptoms, we used the total severity score of ADOS. Cognition was measured using the Verbal and preverbal cognition scale of PEP-3 (as the PEP-3 does not provide a more global measure of development (Schopler, 2005)) and adaptive functioning using the Adaptive Behavior Composite score of Vineland (Sparrow, Balla, and Cicchetti, 2005). To test how the π relates to the phenotype within and across time points, we built three cross-covariance matrices (T1-PI to T1-symptoms; T1-PI to T2-symptoms; T2-PI to T2-symptoms) with the π on one side (A) and the measure of autistic symptoms, cognition, and adaptation on the other side (B). As previously, the significance of the patterns was tested using 1000 permutations, and the stability of the significant latent components using 1000 bootstrap samples.

The PLS-C conducted on simultaneous π and phenotype measures at the first time point (T1-PI – T1 symptoms) essentially replicated the pattern we observed on a bigger cross-sectional sample. One significant LC (r=0.306 and p=0.011) showed higher π co-occurring with higher cognitive and adaptive measures (see Figure 12). The cross-covariance matrix relating the π at T1 to the phenotype at T2 also yielded one significant latent component (r=0.287 and p=0.033). Interestingly, the pattern reflected by this LC showed higher loading on the π co-occurring with lower loading on autistic symptoms. Children who presented lower π values at T1 were the ones with higher symptom severity at T2. The gaze pattern at T1 was related to neither cognition nor adaptation at T2 (see Figure 13, panel A). Finally, the simultaneous PLS-C done at T2 yielded one significant LC where higher loading of the π coexisted with negative loading on autistic symptoms and higher positive loading on the adaptation score (r=0.322 and p=0.014; Figure 13, panel B). The level of typicality of gaze was related to the symptoms of autism at T2 (mean age of 4.05±0.929) but not at a younger age (mean age of 3.01±0.885). This finding warrants further investigation. Indeed, on the one hand, the way children with TD comprehend the world changes tremendously during the preschool years, and this directly influences how the typicality of gaze is estimated. On the other hand, the symptoms of autism naturally change over the preschool years, and all these elements can be responsible for the effect we observe.”

For the developmental patterns of gaze, we have now conducted a permutation testing analysis over the 59 sliding windows to establish the statistical significance of our finding that the gaze patterns in ASD become progressively dispersed with age while the gaze patterns become more coherent over time in TD. In the current analyses, inside each of the 59 windows, gaze data of ASD and TD children were permuted 100 times to derive the null distribution of the measure of dispersion (the average pairwise distance). The statistically significant difference in gaze dispersion between groups is evident between the ages of 2.5 and 3 years (Appendix 2-figure 3, panel C). The corresponding manuscript parts are modified as follows:

“Divergent developmental trajectories of visual exploration in children with ASD

After exploring the association of the π with various aspects of the behavioral phenotype in ASD children, we were also interested in the developmental pathway of visual exploration in this complex social scene for both groups of children. Previous studies using cross-sectional designs have demonstrated important changes in how children attend to social stimuli depending on their age (Frank, Vul, and Saxe, 2012; Helo et al., 2014). As our initial sample spanned a relatively large age range (1.7 – 6.9 years), we wanted to obtain a more fine-grained insight into the developmental dynamic of visual exploration during the given period. To that end, when study-specific inclusion criteria were satisfied, we included longitudinal data from our participants who had a one-year and/or a two-year follow-up visit (see Methods section). With the available 306 recordings for the ASD group and 105 for the TD group, we applied a sliding window approach (Sandini et al., 2018) (see Methods section). Our goal was to discern critical periods of change in the visual exploration of complex social scenes in ASD compared to the TD group. We opted for a sliding window approach considering its flexibility to derive a continuous trajectory of visual exploration and thereby capture such non-linear periods. The sliding window approach yielded a total of 59 age-matched, partially overlapping windows for both groups covering the age range between 1.18 – 4.28 years (mean age of the window) (Figure 8, panel A illustrates the sliding window method).

We then estimated gaze dispersion on a group level across all 59 windows. Dispersion on a single frame was conceptualized as the mean pairwise distance between all gaze coordinates present on a given frame (Figure 8, panel B). Gaze dispersion was computed separately for ASD and TD. The measure of dispersion indicated an increasingly discordant pattern of visual exploration between groups during early childhood years. The significance of the difference in gaze dispersion between the two groups across age windows was tested using permutation testing (see Methods section). A statistically significant difference (at the level of 0.05) in a window is indicated using color-filled circles and, as can be appreciated from Figure 8, panel C, was observed in 46 consecutive windows out of 59, from the age of 2.5 to 4.3 years (average age of the window). While the TD children showed more convergent visual exploration patterns as they got older, as revealed by progressively smaller values of dispersion (narrowing of focus), the opposite pattern characterized gaze deployment in children with ASD. From the age of 2 years up to the age of 4.3 years, this group showed a progressively discordant pattern of visual exploration (see Figure 8, panel C).”
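
As a rough illustration of the dispersion measure and its permutation test, the sketch below computes the mean pairwise distance for each group on a single frame and builds a null distribution by shuffling group labels. It is only a sketch under stated assumptions: the function names are hypothetical, the two-sided p-value is one possible choice, and the aggregation over frames and age windows used in the study is not reproduced.

```python
import numpy as np
from scipy.spatial.distance import pdist

def mean_pairwise_distance(xy):
    """Dispersion: mean Euclidean distance between all pairs of gaze points."""
    return pdist(xy).mean()

def dispersion_permutation_test(asd_xy, td_xy, n_perm=100, seed=0):
    """Hypothetical helper: compare gaze dispersion between two groups.

    asd_xy, td_xy: arrays of shape (n_children, 2) with gaze coordinates
    on one frame. Group labels are shuffled n_perm times to build the
    null distribution of the difference in dispersion.
    """
    rng = np.random.default_rng(seed)
    observed = mean_pairwise_distance(asd_xy) - mean_pairwise_distance(td_xy)
    pooled = np.vstack([asd_xy, td_xy])
    n_asd = len(asd_xy)
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = pooled[rng.permutation(len(pooled))]
        null[i] = (mean_pairwise_distance(shuffled[:n_asd])
                   - mean_pairwise_distance(shuffled[n_asd:]))
    # Two-sided p-value with the usual +1 correction
    p_value = (np.sum(np.abs(null) >= abs(observed)) + 1) / (n_perm + 1)
    return observed, p_value
```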

Finally, for the saliency analyses, the Wilcoxon test of group differences was significant for the full model (Figure 5) and for all 5 channels taken individually (Appendix 2-figure 1). We report the effect sizes according to the formula r = Z/√N (Rosenthal, 1991).

“The relative contribution of the basic visual properties of the animated scene to gaze allocation in ASD and TD children

… for all channels taken individually as well as for the full model, the salience model better predicted gaze allocation in the TD group compared to the ASD group (Wilcoxon test, p < 0.001, Figure 10). The effect sizes (r = Z/√N; Rosenthal, 1991) of this difference were most pronounced for the flicker channel (r = 0.182), followed by the orientation channel (r = 0.149), full model (r = 0.132), intensity (r = 0.099), color (r = 0.083), and lastly motion (r = 0.066) (Appendix 2). The finding that the salience model predicted gaze location better in the TD group than in the ASD group was not expected based on the previous literature. Still, most studies used static stimuli, and gaze control when viewing dynamic content is very different. As the salience model used in this work has been validated on adults, our findings suggest that the gaze behavior in TD approximates that of TD adults better than the ASD gaze behavior.”

6. Eyetracking exclusions. This is insufficiently described in the paper. We are only told that subjects were excluded if >45% of frames were dropped. First of all, this is an extremely lenient threshold. But we need to know what the distribution of dropped frames was between ASD and TD groups. We also need to know what other exclusions were applied to any portion of the data.

As in our first version of the manuscript, subjects were excluded if they showed poor screen attendance, defined as binocular gaze detection on less than 65% of video duration. Our exclusion criterion is, thus, more than 35% of dropped frames. No other exclusion criteria were applied to any portion of the data. While we understand that the threshold of 35% dropped frames might appear lenient, we chose it with reference to the thresholds used in other studies of preschool children. Pierce and collaborators, in the much shorter Geometric preference task (1 minute) administered to toddlers (TD, ASD, DD) aged 12-43 months, defined a threshold of 50% of stimulus duration (Pierce et al., 2011, 2016; Wen et al., 2022). The authors also reported that the viewing time was significantly different between the groups in all studies. Then, in their 2013 prospective study with children aged two months at the study onset, Jones and collaborators note "Trials in which a child failed to fixate on the presentation screen for a minimum of 20% total trial duration were excluded from analyses.". The apparent lack of a consensus regarding the screen attendance threshold might be due to differences in populations, stimuli, and goals of the studies. Our eye-tracking stimulus was coupled with simultaneous high-density EEG recording, and data acquisition is particularly challenging in this context. More stringent inclusion criteria would undoubtedly result in a much smaller but more biased group, as the children with ASD who struggle to tolerate the EEG cap on their head while watching a cartoon are usually the ones who have more pronounced symptoms of ASD. To control for the missing data, we omitted the instances of non-fixation data (saccades, blinks, off-screen moments) in all the proximity index calculations. Thus, the average value of the proximity index is based only on the moments where eyes were detected, under the condition that children looked at the screen for more than 65% of the time. Following the reviewer’s comment, we added the following information to the Methods section of the current version of the manuscript:

“Eye-tracking analysis

We excluded data from participants who showed poor screen attendance, defined as binocular gaze detection on less than 65% of video frames. The screen attendance was somewhat higher in the TD sample (93.8 ± 6.37 seconds) compared to the ASD group (87.8 ± 9.33 seconds), U=2568, p < 0.001. To extract fixations, we used the Tobii I-VT Fixation filter (Olsen, 2012) (i.e., velocity threshold: 30°/s; velocity window length: 20 ms; adjacent fixations were merged; maximum time between fixations: 75 ms; maximum angle between fixations: 0.5°). To account for differences in screen attendance, we omitted instances of non-fixation data (saccades, blinks, off-screen moments) in all calculations.”
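
The exclusion rule and the handling of non-fixation frames could be expressed roughly as follows. This is a hedged sketch: the function name, the boolean `gaze_detected` array, and the use of NaN to mark non-fixation frames are illustrative assumptions rather than details of the authors' pipeline.

```python
import numpy as np

def summary_proximity(pi_per_frame, gaze_detected, min_attendance=0.65):
    """Hypothetical helper: average frame-wise values for one recording.

    pi_per_frame : frame-wise values, NaN on non-fixation frames
                   (saccades, blinks, off-screen moments).
    gaze_detected: boolean per frame, True when binocular gaze was detected.
    Returns None when screen attendance is below the threshold,
    otherwise the mean over the non-NaN frames only.
    """
    pi = np.asarray(pi_per_frame, dtype=float)
    attendance = np.mean(np.asarray(gaze_detected, dtype=bool))
    if attendance < min_attendance:
        return None                      # recording excluded
    return float(np.nanmean(pi))         # NaN frames do not enter the mean
```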

Reviewer #3 (Recommendations for the Authors)

These results are interesting and valuable to our understanding of the development of social visual attention. However, several weaknesses should be addressed that would strengthen the results. First, the authors assume Gaussian distribution of TD eye gaze, but some of their example figures show that this is not always the case. This may lead to lower proximity index scores and may inflate the significant results rather than reflecting the true proximity of gaze to the normative distribution. Second, it would be helpful to discuss more of the individual differences in normative viewing to help anchor some of the main points of the paper. Finally, causality is assumed in the lower-level visual saliency analyses when instead the social and visual saliency may be highly correlated (and inseparable).

We are very grateful to the reviewer for a very thorough analysis of our paper and for highly inspiring comments.

It would be helpful if the authors could provide more details about the π scores, specifically the normalization of them. The authors should explain how the normalization of the π scores was conducted and if this normalization allows for consistency across frames, where the possible furthest distance from the mode of the Gaussian distribution may change depending on the x- and y-coordinates of the mode of the Gaussian on the screen. Additionally, a description of how the normalization of π scores may change based on the convergence of TD children (i.e., how peaked the distribution is) would be helpful. If and how these measures may be limited should be discussed.

We thank the reviewer for this pertinent question. Following the reviewer’s comment, we realized that the term "normalization " was inadequate to reflect the analyses we did. In our approach, we did not post hoc normalize values of the PI. The correct definition would be that while calculating the PI, we defined a range for the values of the π (from 0 to 1) to allow comparability between frames. As we detail more in the revised manuscript, the values of the π are obtained as a function of the isolines projected onto the density matrix. In the current manuscript version, we project 100 isolines on every density matrix to facilitate the interpretation of their relation with the PI. Thus the gaze coordinates captured only by the isoline of the lowest level (nº1) will have a π value of 0.01. Accordingly, the gaze coordinates captured by the isoline at level 50 will obtain the π value of 0.50, and level 100 will yield the π value of 1.

Moreover, the calculation of the π depends directly on the density of the distribution of the TD gaze coordinates, so when the distribution is peaked, this is represented by tightly packed kernels, and thus the physical area covered by the distribution is smaller than in cases of a wider attention distribution. In the case of an extremely peaked distribution, attaining the maximum level of the π (PI=1) is more challenging than in moments of a more widespread distribution. This feature of the normative distribution (that we now refer to as "referent") could further be used to weight the gaze differences, where the π can be weighted by the relative difficulty of the frame (peaked referent distribution = frame more difficult). For the current paper, we explored the global properties of the π (averaged over all frames), but in our future work, we intend to explore in more detail the fine-tuning of the π with regard to the frame content and temporal dynamics. The revised version of the manuscript comprises the following paragraph in the Methods section:

“Upon the "reference" definition, we calculated the distance of gaze data from this referent distribution on each frame for each child with ASD (n = 166; 3.37 ± 1.16 years). Comparison to this referent pattern yielded a measure of Proximity Index-PI (see Figure 1). The calculation of the Proximity Index values was done for each frame separately. Proximity Index values were scaled from 0 to 1 at each frame for comparison and interpretation. We used the Matlab inbuilt function contour to delimit isolines of the gaze density matrix. To have a fine-grained measure, we defined 100 isolines per density matrix (i.e., each frame). Then we calculated the proximity index for each child with ASD framewise. Gaze coordinates that landed outside the polygon defined by contour(s) of the lowest level (1) obtained a π value of 0. The gaze coordinates inside the area defined by gaze density matrix isolines obtained the π value between 0.01 and 1. The exact value of these non-zero π values was obtained depending on the level number of the highest isoline/contour that contained the x and y coordinates of the gaze. As we defined 100 isolines per density matrix, the levels ranged from 1 to 100. Accordingly, a gaze coordinate that landed inside the highest contour (level 100) obtained a π value of 1, and the one that landed inside the isoline 50 obtained a π value of 0.50. A high π value (closer to the mode of the density distribution) indicates that the visual exploration of the individual for a given frame is less divergent from the reference (more TD-like). A summary measure of divergence in visual exploration from the TD group was obtained by averaging the π values for the total duration of the video.”
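
The isoline-based scoring described above can be sketched in a few lines. The sketch makes several assumptions that differ from the text: it uses SciPy's fixed-bandwidth `gaussian_kde` in place of the adaptive-bandwidth kde2d function, it approximates the 100 contour levels by evenly spaced density values between the minimum and maximum of the gridded density, and the screen size and grid step are arbitrary. All names are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def proximity_index(td_xy, child_xy, screen=(800, 600),
                    n_levels=100, grid_step=10):
    """Hypothetical helper: frame-wise proximity score for one gaze point.

    td_xy   : (n_td, 2) gaze coordinates of the reference group on one frame.
    child_xy: (x, y) gaze coordinate of the child being scored.
    """
    kde = gaussian_kde(np.asarray(td_xy, float).T)   # assumption: not kde2d
    xs = np.arange(0, screen[0] + 1, grid_step)
    ys = np.arange(0, screen[1] + 1, grid_step)
    gx, gy = np.meshgrid(xs, ys)
    density = kde(np.vstack([gx.ravel(), gy.ravel()]))
    # n_levels evenly spaced isoline heights strictly between min and max
    levels = np.linspace(density.min(), density.max(), n_levels + 2)[1:-1]
    d = kde(np.asarray(child_xy, float).reshape(2, 1))[0]
    # Index of the highest isoline whose height does not exceed the density
    # at the child's gaze point; 0 when the point lies outside level 1.
    return np.searchsorted(levels, d, side="right") / n_levels
```

A gaze point outside the lowest isoline scores 0, and a point inside the highest isoline scores 1, matching the range described in the text.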

Additionally, Figure 1 shows a child's π score on a frame-by-frame level across the video. Frames where the child was looking offscreen were coded as -0.15. The authors should explain why this value was chosen, and why a value was chosen at all instead of excluding these frames from analysis. Additionally, it would be important to know that these frames are moments that the child is looking off-screen and not moments where the child is blinking.

We agree that the details in Figure 1 of the initial manuscript were somewhat misleading. Indeed, the value "-0.15" was used to visually illustrate the off-screen moments for the recording used in the example in Figure 1. We used it to indicate the periods where the child was not looking at the screen, to distinguish them from moments where the child obtained a π value of 0. We have now modified Figure 1 in the current version of the manuscript so that it no longer contains misleading information (in the present document, the figure is also denoted as Figure 1).

For all the analyses, the missing gaze data were considered "NaNs" and were not included in the π calculation. A part of the data loss is also due to blinking, and we did not interpolate these moments after applying the Tobii I-VT fixation filter.

It is unclear how the authors handled instances where there were two distinct clusters of gaze distribution and the distribution was therefore not Gaussian. This will directly impact the π score and may make children with ASD look more atypical in their viewing patterns than they truly are. For example, Figure 3b shows both groups having two distinct clusters of visual attention, but more children with ASD are attending to the second focal point than TD children. Additional information on these instances should be added, and limitations should be discussed.

We thank the reviewer for this highly pertinent question. Indeed, one of the study’s main goals was to develop a method sensitive to the complexity of gaze distribution in a richly detailed scene. Preserving the complexity of attention distribution (e.g., having two or more distant foci of attention) was essential to us while creating our method, as this would correspond to a more flexible and ecological definition of gaze behavior. The coexistence of multiple foci allows for weighting the relative importance of the different scene elements from the point of view of the TD group. It further distinguishes our method from hypothesis-driven methods that measure aggregated fixation data in the scene’s predefined regions.

As detailed in our initial version of the manuscript (please see below), our approach uses a density estimation function that flexibly adapts to the data without a predefined smoothing parameter. For this reason, we used the kde2d MATLAB function (Botev, Grotowski, and Kroese, 2010), as it achieves this flexibility by applying a Gaussian kernel of an adaptive bandwidth on the data. Figure 3b in the initial version of the manuscript was, in our view, a good illustration that the complexity of gaze deployment is preserved in this manner (we observe two distinct clusters and not a unique, potentially Gaussian, cluster). Moreover, while the smoothing kernel deployed in our density estimation function is Gaussian, we would like to state that the final distribution of the gaze data is not assumed Gaussian. We carefully revised all elements of the previous version of the manuscript that might have been misleading regarding this point. For the frames where the attention of the TD group showed many distinct focal points, like the one in Figure 3b in the initial manuscript version, we calculated the π in the same manner as for frames with a unique focus distribution. For a given gaze coordinate from a child with ASD, we identify the level (ranging from 0.01 to 1) of the highest contour (of any of the attention foci/clusters) containing that coordinate. Suppose the gaze data of the TD group fall along two clusters identically (i.e., we obtain density peaks of the same level/height). In that case, the π will obtain a value of 1 if the gaze coordinates are captured by either of the two contours of the highest level. The following text was added in the Methods section:

“While the smoothing kernel deployed in our density estimation function is Gaussian, the final distribution of the gaze data is not assumed Gaussian. As shown in Figure 1, right upper panel, the final distribution was sensitive to the complexity of gaze distribution (e.g., having two or more distant gaze foci in the TD group), which allowed a flexible and ecological definition of referent gaze behavior. The coexistence of multiple foci allows for weighting the relative importance of the different scene elements from the point of view of the TD group. It further distinguishes our method from hypothesis-driven methods that measure aggregated fixation data in the scene’s predefined regions. For the frames where the gaze of the TD group showed many distinct focal points, like the one in Figure 1, right upper panel, we calculated the π in the same manner as for frames that had a unique focus distribution. For a given gaze coordinate from a child with ASD, we identify the level of the highest contour, ranging from 0.01 to 1, of any of the attention foci/clusters containing that coordinate. If we assume a hypothetical situation where the gaze data of the TD group are falling along two clusters identically (i.e., we obtain the density peaks of the same level/height), in this case, any two gaze coordinates that fall in the highest level of any of the peaks would obtain a π value of 1.”

It would be helpful if the authors could validate their normative gaze distribution with leave-one-out procedures or some other method to ensure that the normative distribution is not shifted by one participant. This same procedure would be necessary for the maturational sliding window to show that the gaze pattern is actually reflecting developmental change, not just change due to individual differences in two participants' gaze data (the participant newly included and the participant newly excluded in the sliding window).

We are very thankful to the reviewer for these valuable suggestions. We indeed already touched upon these topics in our responses to other reviewers, but we repeat the argumentation here as well.

For the stability of the gaze distribution, we conducted a bootstrapping analysis. Inspired by others (Schaer et al., 2015), we performed a bootstrapping procedure to simulate smaller samples drawn from the available total sample of 51 TD children. For each sample size level (ranging from 10 to 50 TD children), we obtained 500 bootstrapped samples over which we measured the stability of the distribution. We found that the distribution in the TD sample is, on average, stable from a sample size of 18 (Figure 2). In the present version of the paper, we added a subsection in the supplementary materials that details these analyses as follows:

“The sample of 51 TD children whose gaze data was used to obtain a normative gaze distribution was a convenience sample. In the present study, we only included males due to the smaller number of females with ASD. Having this unique sample of TD children, we tested the stability of the normative distribution depending on the sample size by performing bootstrap analyses. Thus, from the available sample of 51 TD children, we performed 500 bootstraps for each sample size, starting with a sample size of 10 until reaching the sample size of 50. To measure the change in gaze distribution on one frame, we calculated the average pairwise distance between all gaze coordinates available on the frame. Then for each frame, we calculated the variance of the average pairwise distance over 500 resamples. Finally, the variance obtained was averaged over the 5150 frames to yield a unique value of the variance in gaze patterns per sample size (10-50). Then we calculated the "cutoff," defined as the point beyond which an increase in sample size no longer yields a significant change in the average variance. This was done using the kneed package implemented in Python, which estimates the point of maximal curvature (the "elbow" in curves with positive concavity) in discrete data sets based on the mathematical definition of curvature for continuous functions (Satopaa et al., 2011) (see Figure 2). The elbow of the fitted curve on our bootstrapping data was found at 18, meaning that the distribution was estimated to be stable from a sample size of 18.”
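The bootstrap procedure above can be illustrated with a minimal sketch for a single frame. Several things here are assumptions rather than the authors' implementation: resampling is done with replacement, the gaze data are synthetic, only 50 resamples and five sample sizes are used to keep it fast, and `elbow_index` is a simplified chord-distance heuristic standing in for the kneed package (it assumes a decreasing curve). All function names are hypothetical.

```python
import math
import random
import statistics

def mean_pairwise_distance(coords):
    """Average Euclidean distance over all pairs of gaze coordinates on one frame."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(coords[i], coords[j])
    return total / (n * (n - 1) / 2)

def bootstrap_variance(frame_gaze, sample_size, n_boot, rng):
    """Variance of the mean pairwise distance across bootstrap resamples of one frame."""
    stats = []
    for _ in range(n_boot):
        # Assumption: children are resampled with replacement.
        sample = [rng.choice(frame_gaze) for _ in range(sample_size)]
        stats.append(mean_pairwise_distance(sample))
    return statistics.variance(stats)

def elbow_index(values):
    """Index of the point lying farthest below the chord joining the curve's ends.

    Simplified stand-in for kneed; both axes are rescaled to [0, 1] first.
    """
    n = len(values)
    lo, hi = min(values), max(values)
    best_i, best_gap = 0, float("-inf")
    for i, v in enumerate(values):
        x = i / (n - 1)
        y = (v - lo) / (hi - lo)
        gap = 1.0 - x - y  # proportional to the distance from the chord (0,1)-(1,0)
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

# Synthetic single-frame gaze data for 51 hypothetical TD children.
rng = random.Random(0)
frame_gaze = [(rng.gauss(320, 40), rng.gauss(240, 40)) for _ in range(51)]
sizes = [10, 20, 30, 40, 50]
variances = [bootstrap_variance(frame_gaze, s, 50, rng) for s in sizes]
stable_from = sizes[elbow_index(variances)]
```

In the full analysis the per-frame variance would additionally be averaged over all frames before locating the elbow.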

For the maturational sliding window, we used a bigger sample than in the initial version of the manuscript, with a window size of 20. Based on the findings above on the stability of the distribution at smaller sample sizes, we can conclude that a sample size of 20 allows capturing the developmental effect with sufficient strength. To test the statistical significance of the differences, we conducted permutation testing. We applied 100 permutations inside each of the 59 windows (containing 20TD + 20ASD gaze recordings) to derive a null distribution of the measure of dispersion -the average pairwise distance between gaze coordinates inside the group. The statistically significant difference in gaze dispersion between groups emerges after the age of 2.5 years (Figure 8, panel C). The corresponding manuscript parts are modified as follows:

Methods:

“Sliding window approach

[…] We opted for a sliding window approach adapted from Sandini et al. (2018) to delineate fine-grained changes in visual exploration on a group level. Available recordings from our unstructured longitudinal sample were first ordered according to the age in both groups separately. Then, for each group, a window encompassing 20 recordings was progressively moved, starting from the first 20 recordings in the youngest subjects until reaching the end of the recording span for both groups. The choice of window width was constrained by the sample size of our TD group. The longitudinal visits in our cohort are spaced a year from each other, and the choice of a bigger window would result in significant data loss in our group of TD children as the windows were skipped if they contained more than one recording from the same subject. The chosen window width yielded 59 sliding windows in both groups that were age-matched and spanned the period from 1.88 – 4.28 years old on average.

Upon the creation of sliding windows and to characterize the group’s visual behavior and its change with age, gaze data from the TD group were pooled together to define the referent distribution in each of the 59 age windows. To characterize the group visual behavior in the ASD group, we performed the same by pooling the gaze data together from ASD in each of the 59 age windows (see Figure 5 A&B). We calculated the mean pairwise distance between all gaze coordinates on every frame for the measure of gaze dispersion in each of the two groups. Then we compared the relative gaze dispersion between groups on the estimated gaze density of each group in each age window separately.

To test for the statistical significance of the difference between the two groups, we employed random permutation testing across 59 age windows. Accordingly, in each of the 59 windows, gaze data from the TD (20) and ASD groups (20 recordings per window) were pooled together. We performed 100 randomly permuted resamples of equal size to the original distribution (20) from this pooled sample to compute the significance value. The windows where the MCS values showed statistically significant differences between the two groups are graphically presented with color-filled circles (Figure 8C).”
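The window construction described in this section can be sketched as follows. This is a minimal illustration, assuming each recording is represented as a `(subject_id, age)` pair; the function names and the toy data are hypothetical. Recordings are ordered by age, a fixed-width window is moved one recording at a time, and any window containing more than one recording from the same subject is skipped.

```python
def sliding_windows(recordings, width=20):
    """Return age-ordered windows of `width` recordings with no repeated subject."""
    ordered = sorted(recordings, key=lambda rec: rec[1])
    windows = []
    for start in range(len(ordered) - width + 1):
        window = ordered[start:start + width]
        subjects = [subject for subject, _ in window]
        # Skip windows that contain two recordings from the same child.
        if len(set(subjects)) == len(subjects):
            windows.append(window)
    return windows

def mean_age(window):
    """Average age of the recordings in one window."""
    return sum(age for _, age in window) / len(window)

# Toy longitudinal sample: child "a" was recorded twice, a year apart.
toy = [("a", 1.0), ("b", 1.5), ("a", 2.0), ("c", 2.5), ("d", 3.0)]
toy_windows = sliding_windows(toy, width=3)
toy_ages = [mean_age(w) for w in toy_windows]
```

With a width of 3, the first candidate window holds both recordings of child "a" and is dropped, which is the data loss the text refers to when motivating the choice of window width.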

Results:

“Divergent developmental trajectories of visual exploration in children with ASD

[…] We then estimated gaze dispersion on a group level across all 59 windows. Dispersion on a single frame was conceptualized as the mean pairwise distance between all gaze coordinates present on a given frame (Figure 8, panel B). Gaze dispersion was computed separately for ASD and TD. The measure of dispersion indicated an increasingly discordant pattern of visual exploration between groups during early childhood years. The significance of the difference in gaze dispersion between the two groups across age windows was tested using permutation testing (see Methods section). Statistically significant differences (at the level of 0.05) are indicated using color-filled circles and, as can be appreciated from Figure 8, panel C, were observed in 46 consecutive windows out of 59, from the age of 2.5 to 4.3 years (average age of the window). While the TD children showed more convergent visual exploration patterns as they got older, as revealed by progressively smaller values of dispersion (narrowing of focus), the opposite pattern characterized gaze deployment in children with ASD. From the age of 2 years up to the age of 4.3 years, this group showed a progressively discordant pattern of visual exploration (see Figure 8, panel C).”

It would be worthwhile to include a figure of the correlation of π and autism symptom severity in Figure 2.

We agree with the reviewer that a figure of the correlation between the π and autism symptoms would have been useful in our initial manuscript version. Our initial decision to limit the figures to statistically significant findings potentially limited the transparency of our work. In the current version of the manuscript, we changed the analysis strategy. Instead of univariate measures of association between the π and the phenotype (several correlation analyses), we now conducted a multivariate PLS-C analysis. PLS-C allowed a more comprehensive view of the relationship between visual exploration and the clinical characteristics of the children.

In Figure 3c-d, what do the shaded regions represent?

In our previous analyses we used a GLM to establish the significance of the difference in the intercept and slope between the two groups over 40 windows. The shaded regions represented the 95% confidence interval of the fitted model (quadratic). In the review process, the complete set of these analyses has been changed along with the corresponding figure. We now used permutation testing to establish the significance of the group difference in each of the sliding windows. Thus the shaded regions are no longer present (see Figure 8).

Mean and standard errors of the Proximity Index were not reported for every comparison in Figure 4. This could also be collapsed to a difference score for each participant to allow easier comparisons across figures. This particular analysis is complex and would benefit from some greater explanation of the comparisons and how to interpret them in both the results and the Discussion section. This feels like it would be the primary analysis of the paper and it was under-developed in both the results and discussion.

Following the major changes in the paper, and now using the bigger sample, we also completely changed our strategy for the developmental analyses. Instead of simply comparing the Proximity Index values at the first time point with those at the next one, we now adopt an approach that allows us to examine the Proximity Index in the context of autistic symptomatology and developmental and adaptive functioning. This was done using the same multivariate approach (PLS-C) as in our current strategy for the cross-sectional analyses. The Results section now contains the following description of the developmental analyses:

“More divergence in visual exploration is associated with unfolding autistic symptomatology a year later

To capture the developmental change in the π and its relation to clinical phenotype we conducted the multivariate analysis considering only the subjects that had valid eye-tracking recordings at two time points one year apart. Out of 94 eligible children (having two valid eye-tracking recordings a year apart), 81 had a complete set of phenotype measures. All 94 children had an ADOS, but ten children were missing PEP-3 (9 were assessed using Mullen Scales of Early Learning (Mullen, 1995), one child was not testable at the initial visit), and three children were missing VABS-II as the parents were not available for the interview at a given visit. The proximity index in this smaller paired longitudinal sample was defined using the age-matched reference composed of 29 TD children spanning the age (1.66-5.56) who also had a valid eye-tracking recording a year later. As the current subsample was smaller than the initial one, we limited our analyses to more global measures, such as domain scales (not the test subscales as in our bigger cross-sectional sample). Thus, for the measure of autistic symptoms, we used the total severity score of ADOS. Cognition was measured using the Verbal and preverbal cognition scale of PEP-3 (as the PEP-3 does not provide a more global measure of development (Schopler, 2005)) and adaptive functioning using the Adaptive behavior Composite score of Vineland (Sparrow, Balla, and Cicchetti, 2005). To test how the π relates within and across time points, we built three cross-covariance matrices (T1-PI to T1-symptoms; T1-PI to T2-symptoms; T2-PI to T2-symptoms) with the π on one side (A) and the measure of autistic symptoms, cognition, and adaptation on the other side (B). As previously, the significance of the patterns was tested using 1000 permutations, and the stability of the significant latent components using 1000 bootstrap samples.

The PLS-C conducted on simultaneous π and phenotype measures at the first time point (T1-PI – T1 symptoms) essentially replicated the pattern we observed on a bigger cross-sectional sample. One significant LC (r=0.306 and p=0.011) showed higher π co-occurring with higher cognitive and adaptive measures (see Appendix ??). The cross-covariance matrix relating the π at T1 to the phenotype at T2 also yielded one significant latent component (r=0.287 and p=0.033). Interestingly, the pattern reflected by this LC showed higher loading on the π co-occurring with lower loading on autistic symptoms. Children who presented lower π values at T1 were the ones with higher symptom severity at T2. The gaze pattern at T1 was not related to either cognition or adaptation at T2 (see Figure 13, panel A). Finally, the simultaneous PLS-C done at T2 yielded one significant LC where higher loading of the π coexisted with negative loading on autistic symptoms and higher positive loading on the adaptation score (r=0.322 and p=0.014; Figure 13, panel B). The level of typicality of gaze was related to the symptoms of autism at T2 (mean age of 4.05±0.929) but not at a younger age (mean age of 3.01±0.885). This finding warrants further investigation. Indeed, on the one hand, the way children with TD comprehend the world changes tremendously during the preschool years, and this directly influences how the typicality of gaze is estimated. On the other hand, the symptoms of autism naturally change over the preschool years, and all these elements can be responsible for the effect we observe.”

The authors could consider strengthening their results by including some additional analyses exploring individual differences. Children with ASD on the whole become less convergent with each other over time, but are there children who become more convergent with the TD group and those who do not?

This is a fascinating question. Indeed, we showed previously in the analyses using the sliding windows that children with ASD become more divergent from their group over the childhood years, unlike TD children, who show more group cohesion with age. While the sliding window design allows us to obtain a fine-grained inference on the developmental processes over childhood, a better approach would include more densely sampled, purely longitudinal data, which we did not have at the time of writing this revised manuscript. We can conclude from the present data that the more pronounced effect of change with age seems to happen on the side of the TD, not ASD, meaning that in one year, the TD changed more impressively than the ASD. The trajectories of development in ASD are also more heterogeneous, so the average effect is inevitably blunted. Still, to properly address the heterogeneity of the visual exploration pathway in ASD, which is an important topic, we would need to include a precise measure of the interventions the children receive following the diagnosis. After diagnosis, almost all children start some intervention: speech therapy, occupational therapy, or more structured intensive naturalistic behavioral interventions. Previous research from our group showed that the type of intervention children receive impacts the way children attend to social information (Latrèche et al., 2021). Given the complexity of the question, we decided this would be a better fit for a separate study. We would also like to have a bigger sample of children with ASD at the follow-up visit two years after the initial one in order to gain a more thorough insight into the trajectories in this very dynamic developmental period.

A decent amount of space in the paper is dedicated to analyses that are only included in the supplemental analyses. The authors may consider restructuring some of the main and supplemental texts such that the relevant figures will appear near the analyses.

We thank the reviewer for this suggestion. In the review process, we changed many of the analyses, so some of the figures from the supplementary materials were omitted. The current supplementary material contains only complementary figures to the analyses we present in the Results section.

The authors suggest a direction of causality wherein TD children are relying more on lower-level salient features of the scene than are children with ASD. However, salient features of a scene as predicted by the Itti & Koch model are often highly correlated with the social aspects of a scene. Especially in a cartoon, backgrounds remain consistent and the only movement is vibrantly-colored characters moving across the scene. The authors should edit the methods to exclude the suggestion of causality and the above point should be discussed.

We thank the reviewer for attracting our attention to the misleading phrasing. Indeed, our intention was not to convey any causality in interpreting the salience findings. We were interested in group differences regarding how gaze may be influenced by the low-level salience features of the scene. All between-group comparisons are made frame by frame. While we fully agree that we cannot disentangle these basic aspects of the scene from the purely social ones, we aimed to highlight the potential group differences in how the salience model predicted gaze allocation. We modified the corresponding parts of the manuscript as follows:

Methods:

“Previous research has put forward the enhanced sensitivity to the low-level (pixel-level) saliency properties in adults with ASD while watching static stimuli (Wang et al., 2015) compared to healthy controls. We were interested in whether any low-level visual properties would more significantly contribute to the gaze allocation in one of the groups.”

Results:

“Contrary to our hypothesis, for all channels taken individually as well as for the full model, the salience model better predicted gaze allocation in the TD group compared to the ASD group (Wilcoxon t-test returned with the value of p < 0.001, Figure 10). The effect sizes (r = Z/√N; Rosenthal, 1991) of this difference were most pronounced for the flicker channel r = 0.182, followed by the orientation channel r = 0.149, full model r = 0.132, intensity r = 0.099, color r = 0.083 and lastly motion r = 0.066, Figure 11. The finding that the salience model better predicted gaze location in the TD group compared to the ASD group was not expected based on the previous literature.

Still, most studies used static stimuli, and the processes implicated in the processing of dynamic content are very different. The salience model itself was validated on the adult vision system. It might be that the gaze in TD better approximates the adult, mature gaze behavior than the gaze behavior in the ASD group.”
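As a small worked illustration of the effect size reported for these comparisons, the sketch below assumes the usual Rosenthal formulation, in which the standardized test statistic Z is divided by the square root of the number of observations N. The function name and the input values are hypothetical and are not taken from the study.

```python
import math

def rosenthal_r(z, n):
    """Effect size r = |Z| / sqrt(N) for a rank-based test statistic."""
    return abs(z) / math.sqrt(n)

# Hypothetical values: Z = 2.0 obtained from N = 100 observations.
example_r = rosenthal_r(2.0, 100)
```

Because N sits under a square root, a given Z corresponds to a smaller r as the number of observations grows, which is why small r values can still accompany very small p-values in large samples.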

The authors should discuss why gaze behavior correlated with adaptive behavior scales but not with overall autism symptom severity.

We agree with the reviewer that, in our previous manuscript, we did not devote enough attention to the relation between the π and autistic symptomatology. Even in the absence of a significant association between π and autism severity, these negative findings warrant discussion. In the current version of the manuscript, as previously, we do not find a significant relationship between the severity of autistic symptomatology (as measured by ADOS) and the π in the cross-sectional sample. One possible reason is the lack of granularity of the ADOS scale (ranging from 3 to 10 in our sample of ASD children) compared to the Vineland scale. Interestingly, in our current analyses, while we do not find a relationship between the ADOS and the π at the initial time point, we find that the π at T1 is related to the ADOS symptom severity a year later, at T2. Additionally, ADOS scores at T2 are also related to π at T2. This might be due to the better stability of autistic symptomatology at the visit a year later. In the new version of the manuscript we added the following paragraph to the Discussion section:

“We showed that the level of divergence in gaze exploration of this 3-minute video was correlated with the developmental level of children with ASD and their overall level of autonomy in various domains of everyday life. This finding stresses the importance of studying the subtlety of gaze deployment with respect to its downstream contribution to more divergent global behavioral patterns later in development (Jones and Klin, 2013; Klin, Shultz, and Jones, 2015; Schultz, 2005; Young et al., 2009). Gaze movements in a rich environment, as the cartoon used here, inform not only immediate perception but also future behavior as experience-dependent perception now is likely to alter the ongoing developmental trajectory. In accordance with this view, the level of typicality of visual exploration in ASD children at T1 was related to the level of autistic symptoms at T2 but not at T1. One possible interpretation is that the lack of a stable association at T1 is due to the lower stability of symptoms early on. Indeed, while diagnoses of ASD show stability with age, still a certain percentage of children might show fluctuation. The study by Lord and collaborators (Lord et al., 2006) following 172 2-year-olds up to the age of 9 years old showed that diagnosis fluctuations are more likely in children with lesser symptoms compared to children with more severe symptoms. Still, as our study included all ASD severities, it is subject to such fluctuations. Another possible interpretation comes from the maturation of the gaze patterns in the TD group, against which we define the typicality of gaze in the ASD group. As can be seen in our results, children with TD show a tremendous synchronization of their gaze during the age range considered, resulting in a tighter gaze distribution at T2 and thus, a more sensitive evaluation of ASD gaze at that time point. The possibility that TD show more similar gaze allocation with age, while ASD’s gaze becomes increasingly idiosyncratic with age, highlights the value of addressing the mechanisms underlying the developmental trajectories of gaze allocation in future studies.”

In general, additional context could be provided to the Results section to clarify what questions the authors were trying to answer with each analysis.

We thank the reviewer for this valuable suggestion. In the revised manuscript we added more descriptions at the beginning of subsections in the Results section. Also, analyses in the Results section are mirrored by the corresponding description in the Methods section that provides complementary details with regard to the goals and theoretical motivation for the analyses.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1) Clarify the "coarse movie characteristic". The description provided does not allow us to replicate this metric.

We thank both the Editor and Reviewer 2 for attracting our attention to the lack of clarity in this aspect of our manuscript. As explained in detail in our response to Reviewer 2, the term "coarse movie characteristics" refers to two global, high-level aspects of the movie: abrupt changes in the movie frames from one scene to another, and a distinction between all frames where the background is static and all frames where the background moves along with the moving characters. Our aim was to probe how these global properties of the movie influence how the gaze is directed. We think that our clarification eases the concern about the replicability of this aspect of our study.

2) Details for the permutation testing on the longitudinal changes in gaze divergence using a sliding window method are missing. What exactly was done to compute the significance value?

We observed in the longitudinal data that TD children become increasingly similar to each other in their gaze deployment compared to ASD children. For each longitudinal window, we have an estimate of each group’s dispersion, and we computed the difference in dispersion between the two groups (i.e., TD vs. ASD). The permutation analysis allows us to make a statistical inference about the observed difference between groups by first generating a null distribution of differences in gaze dispersion between random group assignments, which preserves the general statistics of the data but corrupts the true grouping assignments, and then comparing the original estimate against the estimates of the null. Thus, for each longitudinal window (each of the 59 longitudinal windows) and for each permutation iteration (i.e., each of the one hundred random iterations), we randomly assigned each child to one of two groups (e.g., g1 and g2), preserving the original number of group members (i.e., 20 children per group). We then computed the dispersion quantity within each newly formed group (by computing the average distance in gazes across the entire movie) and computed the difference, across the two groups, in the average dispersion values. The one hundred iterations thus formed a null distribution against which we could compare the original estimate in a two-tailed fashion with an α value of 5%. The p-value is the proportion of the random samples that resulted in estimated differences between the groups that were higher than or equal to the original difference estimate.
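The permutation procedure described above, for one age window, can be sketched as follows. This is a minimal illustration with hypothetical names and toy data. Two points are assumptions: each child is represented as a list of one gaze coordinate per frame, and the two-tailed comparison is implemented by comparing absolute differences in dispersion.

```python
import math
import random

def mean_pairwise_distance(coords):
    """Average Euclidean distance over all pairs of gaze coordinates on one frame."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(coords[i], coords[j])
    return total / (n * (n - 1) / 2)

def group_dispersion(children):
    """Frame-wise mean pairwise distance, averaged across the whole movie."""
    n_frames = len(children[0])
    per_frame = [mean_pairwise_distance([child[f] for child in children])
                 for f in range(n_frames)]
    return sum(per_frame) / n_frames

def permutation_p_value(td, asd, n_perm=100, seed=0):
    """Proportion of random regroupings whose dispersion difference reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(group_dispersion(asd) - group_dispersion(td))
    pooled = td + asd
    count = 0
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # random group assignment, group sizes preserved
        g1, g2 = shuffled[:len(td)], shuffled[len(td):]
        null = abs(group_dispersion(g1) - group_dispersion(g2))
        if null >= observed:
            count += 1
    return count / n_perm

# Toy window with a single frame: a tightly focused group and a dispersed group.
tight = [[(0.0, 0.0)] for _ in range(4)]
spread = [[(0.0, 0.0)], [(100.0, 0.0)], [(0.0, 100.0)], [(100.0, 100.0)]]
p_same = permutation_p_value(tight, tight)
p_diff = permutation_p_value(tight, spread)
```

When both groups are identical, every permuted difference equals the observed one and the p-value is 1; when one group is clearly more dispersed, few random regroupings reproduce the observed difference and the p-value drops.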

3) Clarify the role of the "reference distribution". This distribution seems not to have been used in any of the analyses.

We acknowledge and agree that the use of the term "reference distribution" in the Method section of "Maturational changes in visual exploration of complex social scene – Sliding window approach" is inaccurate. In the section where we measure developmental changes, we do not compare each individual child with ASD to a "reference distribution." Instead, we are focused on evaluating how the group behavior, as defined by group gaze dispersion, evolves across defined sliding windows of age.

As detailed in our response to Reviewer 2, we have made revisions to the manuscript to rectify this terminology and avoid any further misunderstanding.

4) More comparisons of the TD and ASD groups at the individual subject level should be performed to provide more insightful information. This could also address the major concern of reviewer #3 that the groups are quite different with respect to homogeneity and size.

We sincerely appreciate the suggestion to delve deeper into individual-level processes. In the first part of our paper, we indeed focus on individual-level information and examine how each individual with ASD behaves in comparison to normative visual exploration patterns. We demonstrated that individual deviations from normative behavior exhibit distinctive phenotypic characteristics and are associated with less optimal developmental functioning, as well as an increase in ASD symptoms in subsequent years. Our next objective in this paper was to document the developmental dynamics in the visual exploration of social scenes during early childhood. Here, we aimed to outline this developmental dynamic at the population level rather than the individual level; we accordingly employed a sliding window approach to highlight a notable trend of progressive convergence among typically developing (TD) children over the preschool years, wherein their manner of viewing social scenes became increasingly consistent across participants. Conversely, we observed a contrasting process among children with ASD, where, at the group level, we noted an aberrant maturation process and a growing degree of heterogeneity.

In response to the valuable feedback from the Editor and Reviewers, we have incorporated additional individual-level approaches into our developmental analyses. By utilizing a mixed model approach, we were able to assess developmental changes in children with ASD compared to their typically developing counterparts throughout the preschool years. Our findings highlight that in TD children, within-subject effects align closely with the group effect of increasing convergence with their own group, whereas in children with ASD, we observe more heterogeneous patterns of maturation with development. We hope that these additional analyses offer a more nuanced perspective on the complex interplay of factors involved. In the revised manuscript, we have included solely group-level analysis, as we intend to perform a more in-depth study of the question of individual-level analysis in the context of a separate paper, as suggested by Reviewer 2. However, if the Editor and the Reviewers deem it necessary, we are willing to include this level of analysis in the manuscript.

5) A control analysis taking into account mental age would be helpful as chronological and mental age seem to differ more in ASD and this analysis could rule out potential confounds that are indicative of general developmental delays rather than ASD-specific characteristics.

We appreciate the concerns raised by both the Editor and Reviewer 3 regarding the potential impact of developmental delay. In our manuscript, we initially did not match our two groups based on intellectual or functional levels. Our rationale for this decision was to maintain our focus on gaining a comprehensive understanding of visual exploration in a diverse group of children with ASD, rather than exclusively studying those at the (relatively) higher functioning end of the autistic spectrum. However, in direct response to the feedback provided by Reviewer 3, we detail the steps undertaken in a more fine-grained evaluation of developmental trajectories in visual exploration. This evaluation involves considering the developmental age of children with ASD and aligning it with their typically developing peers. We have reanalyzed our data using these developmental age-matched samples. Remarkably, our findings remain consistent with our initial results. Specifically, we observe atypical developmental patterns in visual exploration among children with ASD when compared to their typically developing counterparts, despite these being matched for developmental age. We have provided a more detailed exposition of these novel findings in response to Reviewer 3’s second comment. We thank the reviewer for suggesting this additional analysis which underscores the robustness and reliability of our findings and allows us to explore the developmental nuances within our study population more comprehensively.

Reviewer #2 (Recommendations for the authors):

Points for clarification:

– I am a bit unclear on what the "coarse movie characteristic" is, exactly. The description provided would not allow me to replicate this metric – can a more quantitative description be provided?

We appreciate the reviewer bringing to our attention the lack of precision in our reporting of this particular aspect of our analysis. What we have labeled "coarse movie characteristics" represents a foundational attribute of this cartoon design. Our intention was to explore how these properties of the cartoon design influenced the deployment of gaze. We defined two coarse movie characteristics; these were then extracted manually.

The first characteristic, denoted "frame switch," encompasses all instances in which the cartoon employs an abrupt frame transition using the hard-cut montage technique (like the example illustrated in Figure 1). Throughout the duration of the movie, this event type occurs 25 times (as indicated in Figure 2). During these moments, the viewer’s gaze necessitates recalibration to synchronize with the new scene. Accordingly, the ability to disengage from the previous scene and adapt to the novel social context at a pace similar to the normative group might have an impact on the Proximity index. The second characteristic labeled as the "Moving background" pertains to moments when the cartoon’s background moves in tandem with the characters, following their directional motion. We aimed to distinguish these segments from scenes featuring a static background, as the overall motion dynamics in these frames varied. The occurrence of a moving background is observable in 5 distinct sequences within the movie (as illustrated in Appendix 1-figure 1). We have accordingly revised the method section to provide a more accurate description.

“Coarse movie characteristics: Frame switching and moving background

Finally, to test how the global characteristics of video media influence gaze deployment, we focused on two movie features.

The first feature, denoted as the "Frame switch," encompasses all instances in which the cartoon employs an abrupt frame transition using the hard-cut montage technique. To represent this feature numerically, a feature vector was created. In this vector, the first frame following the switch is assigned a code of 1, while all other frames are coded as 0. This coding scheme effectively highlights the occurrence of these abrupt shot changes within the movie. Throughout the duration of the movie, this event type occurs 25 times (as indicated in Figure 6).

The second feature, labeled "Moving background," pertains to moments when the cartoon’s background moves in tandem with the characters, following their directional motion. We aimed to distinguish these segments from scenes featuring a static background, as the overall motion dynamics in these frames differ. A moving background occurs in 5 distinct sequences within the movie (as illustrated in Figure 6). Frames with a moving background were coded as 1, yielding a binary feature vector.”
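
The coding scheme described above can be illustrated with a minimal sketch. The function names, the 12-frame toy movie, and the frame indices are hypothetical and serve only to show how the two binary feature vectors could be built; the actual annotation was done manually on the real movie.

```python
def frame_switch_vector(n_frames, first_frames_after_cut):
    """Binary vector: 1 on the first frame following each hard cut, 0 elsewhere."""
    vec = [0] * n_frames
    for f in first_frames_after_cut:
        vec[f] = 1
    return vec


def moving_background_vector(n_frames, sequences):
    """Binary vector: 1 for frames inside any (start, end) sequence, inclusive."""
    vec = [0] * n_frames
    for start, end in sequences:
        for f in range(start, end + 1):
            vec[f] = 1
    return vec


# Hypothetical toy movie of 12 frames with two hard cuts and one
# moving-background sequence (frame indices are invented for illustration).
cuts = frame_switch_vector(12, [3, 8])
background = moving_background_vector(12, [(5, 7)])
```

Summing `cuts` gives the number of hard cuts, which would be 25 for the movie used in the study.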

ARI

– In response to Reviewer 3, RC4, the authors now provided permutation testing on the longitudinal changes in gaze divergence using a sliding window method. However, I am unclear on the details of the permutation testing. To quote from the paper: "To test for the statistical significance of the difference between the two groups, we employed random permutation testing across 59 age windows. Accordingly, in each of the 59 windows, gaze data from the TD (20) and ASD groups (20 recordings per window) were pooled together. We performed 100 randomly permuted resamples of equal size to the original distribution (20) from this pooled sample to compute the significance value. The windows where the MCS values showed statistically significant differences between the two groups are graphically presented with color-filled circles (Figure 8C)." What exactly was done to compute the significance value? As far as I understand, they calculated the group divergence from 20 resampled data to generate a 'null divergence' for each group, then calculate that difference 100 times as a null distribution of the group difference, and finally compare the actual group difference with this null distribution. However, from their response letter, the distribution seems to be the group-level dispersion, but not the group difference. In their rebuttal letter, the authors write, "We applied 100 permutations inside each of the 59 windows (containing 20TD + 20ASD gaze recordings) to derive a null distribution of the measure of dispersion -the average pairwise distance between gaze coordinates inside the group." This seems inconsistent.

Indeed, for each age window (comprising 20 recordings from ASD children and 20 recordings from TD children, totaling 59 windows), we calculated the average dispersion value across all frames in parallel for both groups. Available recordings from our unstructured longitudinal sample were first ordered according to the age in both groups separately. Then, for each group, a window encompassing 20 recordings was progressively moved, starting from the first 20 recordings in the youngest subjects until reaching the end of the recording span for both groups. Dispersion on an individual frame is defined as the average pairwise distance across all pairs of members of the same group.
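
As an illustration of the windowing and the dispersion measure described in this paragraph, here is a minimal sketch. The data layout (a dict with an `age` and a per-frame list of `gaze` coordinates) and all function names are assumptions made for the example, and the exclusion of windows containing repeated recordings from the same subject is omitted for brevity.

```python
from itertools import combinations
from math import dist


def sliding_windows(recordings, size=20):
    """Order recordings by age and return every run of `size` consecutive recordings."""
    ordered = sorted(recordings, key=lambda r: r["age"])
    return [ordered[i:i + size] for i in range(len(ordered) - size + 1)]


def frame_dispersion(points):
    """Mean pairwise Euclidean distance between gaze points on a single frame."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def window_dispersion(window):
    """Frame-wise dispersion averaged over all frames of the movie."""
    n_frames = len(window[0]["gaze"])
    per_frame = [
        frame_dispersion([r["gaze"][f] for r in window]) for f in range(n_frames)
    ]
    return sum(per_frame) / n_frames


# Toy data: four recordings, two frames each, window size 3 instead of 20.
toy = [
    {"age": 3.0, "gaze": [(0.0, 0.0), (0.0, 0.0)]},
    {"age": 2.0, "gaze": [(0.0, 0.0), (0.0, 0.0)]},
    {"age": 4.0, "gaze": [(6.0, 8.0), (0.0, 0.0)]},
    {"age": 2.5, "gaze": [(3.0, 4.0), (0.0, 0.0)]},
]
windows = sliding_windows(toy, size=3)
dispersions = [window_dispersion(w) for w in windows]
```

In the study this computation would be run separately for the TD and ASD recordings, giving one dispersion value per group and per age window.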

To assess whether group differences in dispersion were statistically significant within windows, we conducted 100 permutations inside each window. In each permutation, we created two new samples (g1 and g2), by randomly selecting 20 recordings from the combined dataset of that specific window (40 recordings in total). Subsequently, we computed the dispersion values for each group within the newly generated samples (g1 and g2), averaging them across all frames. This iterative process enabled us to construct the null distribution of the dispersion measure for each window. Following this, we examined the proportion of sampled permutations in which the disparity in dispersion between g1 and g2 deviated from the original estimate with the true groupings. A significance level (α) of 0.05 was set for the analysis. If the observed disparity in dispersion fell within the lower or upper 2.5% of the null distribution, we rejected the null hypothesis and concluded that the observed difference held statistical significance. This was consistently the outcome for all windows beginning from the average age of 2.5 years. We modified the specific paragraph in the methods section of the manuscript as follows:

“To quantify the heteroscedasticity between groups across different ages, we computed the difference in dispersion (mean pairwise distance to members of own group), denoted as (disp_t(TD) – disp_t(ASD)), for each time window (t). Then, the permutation method was used in order to get the distribution under the null hypothesis in each window (t) (H0: disp_t(TD) – disp_t(ASD)=0). Thus, for each of the 59 windows, 100 permutations (i) were performed (i.e. individuals were mixed up randomly in each group) and then we computed our statistic (disp_ti(TD) – disp_ti(ASD)) for each permuted sample (i) and each time window (t). The hundred statistics per window thus formed a null distribution (the expected behavior of our statistic under the null hypothesis) against which we could compare the "real" statistic estimated in the original sample. The p-value is the probability of getting a statistic at least as extreme as the one we observed in our sample if we consider H0 to be true. The windows where the dispersion values showed statistically significant differences between the two groups are graphically presented with color-filled circles (Figure 8C).”
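
The permutation procedure for a single window can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: each recording is assumed to be a list of (x, y) gaze points, one per frame; the function names are hypothetical; and the two-sided p-value based on absolute values is one common convention, chosen here for simplicity.

```python
import random
from itertools import combinations
from math import dist


def dispersion(group):
    """Mean pairwise gaze distance within a group, averaged over frames."""
    n_frames = len(group[0])
    total = 0.0
    for f in range(n_frames):
        pairs = list(combinations([rec[f] for rec in group], 2))
        total += sum(dist(p, q) for p, q in pairs) / len(pairs)
    return total / n_frames


def permutation_test(td, asd, n_perm=100, seed=0):
    """Return the observed disp(TD) - disp(ASD) and a two-sided permutation p-value."""
    rng = random.Random(seed)
    observed = dispersion(td) - dispersion(asd)
    pooled = td + asd
    null = []
    for _ in range(n_perm):
        shuffled = pooled[:]
        rng.shuffle(shuffled)  # mix group labels at random
        g1, g2 = shuffled[:len(td)], shuffled[len(td):]
        null.append(dispersion(g1) - dispersion(g2))
    # Share of permuted statistics at least as extreme as the observed one.
    p_value = sum(abs(s) >= abs(observed) for s in null) / n_perm
    return observed, p_value


# Toy window: tightly clustered "TD" gaze and widely spread "ASD" gaze, one frame.
td = [[(i * 0.1, 0.0)] for i in range(10)]
asd = [[(i * 5.0, 0.0)] for i in range(10)]
observed, p_value = permutation_test(td, asd)
```

With only 100 permutations the smallest attainable non-zero p-value is 0.01, which limits the resolution of the test in each window.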

– In the Method of the "Maturational changes in visual exploration of complex social scene – Sliding window approach", in the second paragraph, the authors mention a reference distribution. However, this reference distribution seems not to have been used in any of the analyses (it's not mentioned in the longitudinal results at all). To quote from the paper: "Upon the creation of sliding windows and to characterize the group's visual behavior and its change with age, gaze data from the TD group were pooled together to define the referent distribution in each of the 59 age windows. To characterize the group visual behavior in the ASD group, we performed the same by pooling the gaze data together from ASD in each of the 59 age windows (see Figure 8 A&B)." Indeed, Figure 8A is about how to decide the sliding window, and Figure 8B is about pairwise gaze dispersion, not about referent distribution.

We agree with the reviewer that the use of the term “reference distribution” here might appear misleading. Indeed, in this set of analyses we were interested in group gaze behavior, so the idea of a unique reference is not relevant here. What we meant by the reference distribution is that we extracted a distribution of the TD gaze in each of the 59 windows. As we did not calculate the π index in each window, the proper terminology should have been “TD distribution”. We changed the Methods section of the manuscript as follows:

“Upon the creation of sliding windows and to characterize each group’s visual behavior and its change with age, gaze data from the TD group were pooled together to define the TD distribution in each of the 59 age windows. To characterize the group visual behavior in the ASD group, we performed the same by pooling the gaze data together from the ASD group in each of the 59 age windows (see Figure 3 A&B).”

Reviewer #3 (Recommendations for the authors):

The present research uses the Typical Development (TD) group as a normative reference, comparing individual participants with Autism Spectrum Disorder (ASD) against this reference. However, as pointed out by Reviewer #2, this approach doesn't allow for a comprehensive understanding of the variation within the TD group itself. The leave-one-out calculation suggests that the π is higher for TD, but the TD group is also more homogenous, so I am not sure this is truly informative. A comparison between TD and ASD groups at the individual subject level could provide more insightful information.

In the initial version of the manuscript, we compared the children with ASD to the referent group but did not show how the TD group would behave in relation to another control group. Following the suggestion made by Reviewer 2 during the first round of review, we have addressed this important limitation of our study. Specifically, we adopted the leave-one-out method, which allowed us to compensate for the absence of an additional control group, and we are pleased to report that this development has met Reviewer 2’s approval. This method yielded the Proximity Index (PI) values for our TD group, demonstrating that, on average, TD children exhibit higher π values (see Figure 4). Importantly, these TD children also display interindividual variations similar to those observed in the ASD group. Furthermore, statistical analysis revealed that the difference in variance between the two groups was not statistically significant. In the first part of our paper, we then focused on individual-level information and examined how each individual with ASD behaves in comparison to normative visual exploration patterns. We have demonstrated that individual deviations from normative behavior exhibit distinctive phenotypic characteristics and are associated with less optimal developmental functioning, as well as an increase in ASD symptoms in subsequent years.

While the behavioral characterization of the TD group implies a higher degree of homogeneity, attributed to their normal IQ range and absence of ASD symptoms, our investigation into the visual exploration patterns among TD individuals also captured an inherent interindividual diversity (as shown in Figure 4). This finding inspired our next steps where we wanted to address more specifically the topic of intragroup variation, notably its temporal evolution, in the latter portion of our manuscript, specifically within the "Developmental Patterns of Visual Exploration" section of our results. We provide a broad-strokes delineation of the developmental dynamics in the visual exploration of social scenes during early childhood. To achieve this, we employed a sliding window approach to highlight a notable trend of progressive convergence among typically developing (TD) children over the preschool years, wherein their manner of viewing social scenes became increasingly consistent. Conversely, we observed a contrasting process among children with ASD, where, at the group level, we noted an aberrant maturation process and a growing degree of heterogeneity. Regrettably, delving into individual-level developmental information was initially beyond the scope of our manuscript. In response to your feedback, we have incorporated additional individual-level approaches into our developmental analyses, which we will elaborate on in more detail as part of our response to your following comment.

Additional Concerns:

i) The study matches the two groups based on chronological age rather than mental age. This introduces a potential confound as the differences reported may be indicative of general developmental delays rather than ASD-specific characteristics.

ii) The TD group is not only smaller in size but also less heterogeneous, which may be a potential explanation for the findings illustrated in Figure 8C.

We agree with the reviewer that addressing the matters of age matching, size, and homogeneity holds significant relevance. To effectively tackle these concerns, we employed a sequential approach in which we first addressed the reviewer’s second concern, the question of inequality of size, followed by the combined consideration of size and homogeneity. Subsequently, our analysis incorporated a sliding window methodology that utilized developmental age, as opposed to chronological age, as the focal parameter. This sequential approach was undertaken to ensure a comprehensive and nuanced exploration of the concerns raised by the reviewer.

Size. In response to the disparity in sample size between our two groups (51 TD and 167 ASD children), we implemented a methodology to mitigate the influence of this factor. We generated 100 bootstrapped ASD samples (without replacement), each with a size identical to that of the TD (51 subjects). These ASD samples were matched to the TD sample in terms of chronological age. Subsequently, for each of the bootstrapped samples, we aggregated all longitudinal data and computed the dispersion measure over time, akin to the process described in Figure 8C of the manuscript. As illustrated in Figure 6, Panel A, the results reveal that the bootstrapped ASD samples, characterized by both size and chronological age alignment with the TD group, exhibit higher levels of dispersion across the span of childhood years. This is in contrast to TD children, who exhibit a discernible pattern of progressive refinement in their visual exploration behavior.
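
A minimal sketch of the size-matching step is given below. It only illustrates drawing 100 ASD subsamples of the TD group's size without replacement; the subject identifiers and function name are hypothetical, and the chronological age matching applied in the actual analysis is not reproduced here.

```python
import random


def size_matched_subsamples(asd_ids, n_td, n_samples=100, seed=0):
    """Draw `n_samples` ASD subsamples of size `n_td`, each without replacement."""
    rng = random.Random(seed)
    return [rng.sample(asd_ids, n_td) for _ in range(n_samples)]


# Hypothetical subject identifiers: 167 children with ASD, 51 TD children.
asd_ids = list(range(167))
subsamples = size_matched_subsamples(asd_ids, n_td=51)
```

Each subsample would then go through the same sliding-window dispersion computation as the full ASD group.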

It’s worth noting that, while permutation testing could have been an ideal method for assessing the statistical significance of the findings in this section, we opted not to implement it due to the substantial computational cost associated with our analyses. The computational demands of our study necessitated an alternative approach to address the sample size and age-matching issue effectively. Consequently, we relied on the bootstrapping technique to provide valuable insights into the dispersion differences between the TD and ASD groups, while acknowledging the limitations imposed by the computational constraints.

Homogeneity. To address the question of intragroup homogeneity, taking into account the considerable developmental heterogeneity inherent in the ASD group, we restricted the range of developmental functioning. Thus, we derived 100 simulated samples of the same size as the TD group (51), first from individuals with ASD within the normal developmental range (DQ above 80) and then from lower-functioning individuals with ASD (DQ below 80). As shown in Figure 6, Panels B-C, both groups show sustained dispersion over the childhood years, in contrast to the convergence seen in the TD group. This trend is particularly pronounced in the subset of individuals with lower developmental functioning (Panel C), in which a divergence becomes increasingly evident during the preschool years.

Developmental age. To comprehensively address the question of developmental age-matched samples, we implemented a sliding window approach using the same dataset as in our manuscript (51 TD and 167 ASD children). However, in this approach, we utilized developmental age for creating age-matched windows instead of chronological age as previously used. We initiated the process with the first 20 recordings from subjects with the lowest developmental age and progressively shifted a window encompassing 20 recordings. This continued until the entire range of recordings for both groups was covered. Similar to the method applied in the manuscript, we excluded windows containing duplicate recordings from the same subject. This method yielded a total of 60 windows, each matched based on age, with developmental age in the ASD group and chronological age in the TD group (developmental age and chronological age are highly aligned, r = 0.93, p = 6.82E-23). To test the stability of our findings and assess the potential influence of sample size, we replicated the sliding window procedure using 100 bootstrapped ASD samples, each comprising 51 subjects whose developmental age was matched to the chronological age of the TD subjects. For the purpose of interpretation, we plotted a linear regression line (in red) for each bootstrapped sample. Remarkably, our results reinforce our initial findings when using chronological age-matched samples. Children with ASD consistently exhibit a greater degree of interindividual disparity across childhood years, in contrast to TD children. This outcome underscores the robustness of our findings and strengthens the validity of our observations.

Individual level developmental change. In response to the valuable feedback from the Reviewer, we have integrated additional individual-level approaches into our developmental analyses, recognizing their crucial role in understanding the dynamics of visual exploration. In the previous step, we employed a method that utilized the mental age of children with ASD to match them with typically developing (TD) subjects of the same chronological age. However, instead of aggregating the information regarding an individual’s deviation from their respective group at the level of a sliding window, we retained individual-level data. This approach enabled us to delve deeper into the intricacies of individual developmental trajectories using the mixed model approach. The mixed model approach is well-suited for designs like the present case, where we have multiple time points for each subject, with varying numbers of observations. For our analysis, we utilized a publicly available toolbox (Mancini et al., 2019; Mutlu et al., 2013) in MATLAB R2021a (MathWorks). We modelled age and diagnosis as fixed effects, while within-subject factors were treated as random effects, utilizing the nlmefit function. To estimate developmental trajectories, we employed random-slope models, including constant, linear, quadratic, and cubic terms, each representing a different relationship between age and the averaged pairwise distance to members of the same group. These models accounted for both within-subject and between-subject effects. We determined the most appropriate model order using the Bayesian information criterion.
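
The model-order selection step can be illustrated with a simplified sketch. Note the substitution: instead of the mixed-effects models fitted with nlmefit in MATLAB, this example uses ordinary least-squares polynomial fits in Python with NumPy, so it ignores within-subject random effects and only shows how the Bayesian information criterion can be compared across constant, linear, quadratic, and cubic models. The simulated data and the function name are assumptions for illustration.

```python
import numpy as np


def bic_by_order(age, y, max_order=3):
    """BIC of least-squares polynomial fits of order 0..max_order (Gaussian errors)."""
    age = np.asarray(age, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    scores = {}
    for order in range(max_order + 1):
        coeffs = np.polyfit(age, y, order)
        rss = float(np.sum((y - np.polyval(coeffs, age)) ** 2))
        k = order + 2  # polynomial coefficients plus the residual variance
        scores[order] = n * np.log(rss / n) + k * np.log(n)
    return scores


# Simulated example: dispersion decreasing linearly with age, plus noise.
rng = np.random.default_rng(0)
age = np.linspace(2.0, 6.0, 60)
y = 10.0 - 1.5 * age + rng.normal(0.0, 0.3, size=age.size)
scores = bic_by_order(age, y)
best_order = min(scores, key=scores.get)  # lowest BIC is preferred
```

A lower BIC indicates a better trade-off between fit and complexity; a full analysis would apply the same comparison to the mixed-effects likelihoods.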

Strikingly, in typically developing children, we observed a strong alignment between within-subject effects and the group effect, signifying a trend of increasing convergence within the TD group. Conversely, in children with ASD, our observations unveiled again more heterogeneity in patterns of maturational processes. These nuanced insights are graphically represented in Figure 7, providing a visual representation of the developmental trajectories that underlie our findings.

The following paragraph has been added to the Results section:

“To ensure the robustness and validity of our findings, we addressed several potential confounding factors. These included differences in sample size (TD sample included 51 and ASD sample 166 children), the heterogeneity of ASD behavioral phenotypes, and the use of developmental age rather than chronological age in our sliding window approach. We adopted a sequential approach, first examining the impact of unequal sample sizes and then considering both sample size and phenotypic heterogeneity together. Additionally, we implemented a sliding window methodology using developmental age as the primary matching parameter (for a detailed description, see Appendix 5). Our results consistently reaffirmed our initial findings obtained when using chronologically age-matched samples. Specifically, when matched for both sample size and developmental age, children with ASD consistently demonstrated a greater degree of interindividual disparity across childhood years compared to TD children (Appendix 5, Panels D1-D2).”

Appendix 5 has been added as supplementary material.

https://doi.org/10.7554/eLife.85623.sa2

Article and author information

Author details

  1. Nada Kojovic

    Psychiatry Department, Faculty of Medicine, University of Geneva, Geneva, Switzerland
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft, Writing – review and editing
    For correspondence
    nada.kojovic@unige.ch
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-0116-2485
  2. Sezen Cekic

    Faculte de Psychologie et Science de l’Education, University of Geneva, Geneva, Switzerland
    Contribution
    Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Santiago Herce Castañón

    Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México, Ciudad Universitaria, Mexico City, Mexico
    Contribution
    Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Martina Franchini

    Fondation Pôle Autisme, Geneva, Switzerland
    Contribution
    Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
  5. Holger Franz Sperdin

    Psychiatry Department, Faculty of Medicine, University of Geneva, Geneva, Switzerland
    Contribution
    Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-3438-1572
  6. Corrado Sandini

    Psychiatry Department, Faculty of Medicine, University of Geneva, Geneva, Switzerland
    Contribution
    Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-2933-1607
  7. Reem Kais Jan

    College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
    Contribution
    Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-1685-5594
  8. Daniela Zöller

    Bosch Sensortec GmbH, Reutlingen, Germany
    Contribution
    Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-7049-0696
  9. Lylia Ben Hadid

    Psychiatry Department, Faculty of Medicine, University of Geneva, Geneva, Switzerland
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  10. Daphné Bavelier

    Faculte de Psychologie et Science de l’Education, University of Geneva, Geneva, Switzerland
    Contribution
    Conceptualization, Supervision, Writing – review and editing
    Competing interests
    No competing interests declared
  11. Marie Schaer

    Psychiatry Department, Faculty of Medicine, University of Geneva, Geneva, Switzerland
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Project administration, Writing – review and editing
    For correspondence
    marie.schaer@unige.ch
    Competing interests
    No competing interests declared

Funding

National Centre of Competence in Research (NCCR) SYNAPSY (51NF40-185897)

  • Marie Schaer

Swiss National Science Foundation (163859)

  • Marie Schaer

Swiss National Science Foundation (190084)

  • Marie Schaer

Swiss National Science Foundation (202235)

  • Marie Schaer

Swiss National Science Foundation (212653)

  • Marie Schaer

ERC Synergy fund BrainPlay - The Self-teaching Brain grant (810580)

  • Daphné Bavelier

Fondation Privée des Hôpitaux Universitaires de Genève (https://www.fondationhug.org)

  • Marie Schaer

Fondation Pôle Autisme

  • Marie Schaer

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We express our utmost gratitude to all families participating in this study. We thank the clinical team for their immense investment in the data collection. Funding for this study was provided by the National Centre of Competence in Research (NCCR) Synapsy, financed by the Swiss National Science Foundation-SNF (Grant No. 51NF40–185897), by SNF grants to MS (#163859, #190084, #202235 & #212653), ERC Synergy fund BrainPlay - The Self-teaching Brain grant to DB #810580, the Fondation Privée des Hôpitaux Universitaires de Genève (https://www.fondationhug.org), and by the Fondation Pôle Autisme (https://www.pole-autisme.ch).

Ethics

The study protocol was approved by the Ethics Committee of the Faculty of Medicine of Geneva University, Switzerland (Swissethics, protocol 12-163/Psy 12-014, referral number PB_2016-01880). All families gave written informed consent to participate.

Senior and Reviewing Editor

  1. Christian Büchel, University Medical Center Hamburg-Eppendorf, Germany

Reviewer

  1. Ralph Adolphs, California Institute of Technology, United States

Version history

  1. Preprint posted: September 17, 2020 (view preprint)
  2. Received: December 15, 2022
  3. Accepted: December 1, 2023
  4. Accepted Manuscript published: January 9, 2024 (version 1)
  5. Version of Record published: February 19, 2024 (version 2)

Copyright

© 2024, Kojovic et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Nada Kojovic
  2. Sezen Cekic
  3. Santiago Herce Castañón
  4. Martina Franchini
  5. Holger Franz Sperdin
  6. Corrado Sandini
  7. Reem Kais Jan
  8. Daniela Zöller
  9. Lylia Ben Hadid
  10. Daphné Bavelier
  11. Marie Schaer
(2024)
Unraveling the developmental dynamic of visual exploration of social interactions in autism
eLife 13:e85623.
https://doi.org/10.7554/eLife.85623
