Abstract
Despite strong evidence that children learn more effectively from face-to-face interactions than from screens, we still understand relatively little about the dynamic, adaptive processes through which inter-personal contingency enhances attention and learning during live interactions. In this study, we investigate how social signals during early interactions operate across multiple hierarchical levels, ranging from low-level salience cues to higher-order features. Specifically, we examine how mothers dynamically and reciprocally adjust their behaviours across these levels in response to their infants’ attention during play. To achieve this, we developed a suite of novel, information-theory-based methods to quantify naturalistic audio-visual-semantic behaviours. Using time-series analyses, we assessed moment-by-moment associations between infant attention and both lower-order features (e.g., spectral flux of ambient noise and maternal vocalizations, maternal face and hand movement) and higher-order features (e.g., speech information rate, facial expression novelty, semantic surprisal, and toy naming) in tabletop interactions involving 67 mother-infant dyads (5- and 15-month-olds). Our findings suggest that, from early infancy, the information infants perceive is continuously and dynamically modulated across multiple hierarchical levels, contingent on their behaviour and attention. When infants focus on objects, mothers reduce low-level sensory input, minimising distractions. Conversely, increases in object naming and high-level information content associate with increases in sustained attention. These results indicate that maternal behaviours are both driven by and predictive of infant attention, and that, even from early development, attention involves interactive processes which unfold across multiple levels, from salience to semantics.
1. Introduction
Infants construct and refine predictive models of their world, seeking out stimuli that maximize their learning potential (Gerken et al., 2011; Kidd et al., 2014; Poli et al., 2020). These predictions allow them to navigate their environments more effectively (Köster et al., 2020), orienting their attention toward information that optimizes their ability to learn (Berger & Posner, 2022). Maximizing learning from the environment requires infants to build hierarchical predictive models that span a broad range of representations, from low-level features such as fast-varying fluctuations in amplitude, pitch and luminance, through to higher-order features which operate over slower time-scales, and which arise from the hierarchical integration of low-level features into cognitively meaningful information (Heilbron et al., 2022; Kothinti & Elhilali, 2023).
When viewing TV and movie clips, for example, neuroimaging evidence suggests that adult brains differentially track low- and higher-level features (Chang et al., 2022). Temporally coarse-grained predictions originating in the default mode network inform temporally fine-grained predictions in primary auditory and motor areas (Baldassano et al., 2017; Chang et al., 2022; Hasson et al., 2015). In infants, these processes are thought to be more rudimentary. Infant brains represent only longer events, even in early visual regions, with no time-scale hierarchy (Yates et al., 2022). Whereas younger infants’ gaze is better predicted by layers of a neural network model corresponding to lower-level areas of the ventral visual stream, older infants’ gaze was better predicted by higher-level layers (Kiat et al., 2022). Over time, information increasingly becomes integrated over more coarse-grained spatial and temporal scales (Pempek et al., 2010). As multiple physical aspects of an event reach an infant, they must be related to each other for the infant to model a complete representation of that event (Bahrick & Lickliter, 2000; Dionne-Dostie et al., 2015).
1.1 The Role of Caregivers in Scaffolding Infant Attention
However, these previous studies do not account for the fact that most early real-world experiences are embedded in social interaction (Carretero & Español, 2016; Español & Shifres, 2015). Traditional approaches use experimental tasks that conceptualise attention as a passive response to incoming sensory information (L. Smith & Gasser, 2005; Wass & Jones, 2023; Wass, 2014; Wass & Goupil, 2022). But in reality, infants actively modify their environments through their interactions, shaping the information they receive based on their own behaviour and experiences (Anderson et al., 2022; Mendez et al., 2024; L. Smith & Gasser, 2005; Wass & Goupil, 2022). This is true of children’s interactions with physical objects and features in their environment (Anderson et al., 2022, 2024) but it is especially true of their early social interactions.
Extensive evidence already shows that caregivers shape infants’ attention and learning by adapting their behaviour to align with the developmental needs of the infant (Brennan et al., 2010; Schick et al., 2022; Tippenhauer et al., 2020). One well-studied example is the use of infant-directed (ID) speech, which is characterized by exaggerated acoustic features such as higher pitch, greater pitch variability, and slower amplitude modulation (M. Cooke et al., 2014; Hilton et al., 2022). These features, although not completely universal, effectively capture and maintain infant attention while facilitating language learning (Nencheva & Lew-Williams, 2022; Räsänen et al., 2018). Additionally, caregivers simplify semantic complexity in their speech, adjusting it to match the infant’s vocabulary development (Schwab et al., 2018; Schwab & Lew-Williams, 2016) and further supporting word learning. Caregivers also modify their facial expressions and hand movements to maintain infant engagement and visual attention (Chong et al., 2003; Kliesch et al., 2022; Stern, 1974; van Schaik et al., 2020). These adaptations are thought to be constrained by diverging needs (Chater & Vitányi, 2002; Woźniak & Knoblich, 2022): predictable elements help infants integrate upcoming stimuli, while novel stimuli capture attention by introducing uncertainty (Kidd et al., 2012, 2014; Meyer et al., 2023; Labendzki et al., pre-print).
1.2 “Alive” joint attention
Most current quantitative research on infant attention has focused on global differences, such as average comparisons between infant-directed versus adult-directed communication. These approaches miss the dynamic, reciprocal, “alive” (Fogel & Garvey, 2007) nature of social interactions. Although countless studies have used observer ratings to measure maternal sensitivity and dyadic mutuality (J. E. Cooke et al., 2022; Feldman, 2007; Murray et al., 2016), relatively few studies have quantitatively observed how infants and caregivers continuously respond and adapt to each other’s behaviours in a bidirectional exchange (Beebe et al., 2016; Feldman, 2007; Jaffe et al., 2001).
Previous research has shown that mothers dynamically adjust both the pitch of their speech and their modulation patterns (rate of change of pitch) based on infants’ attentional states (Phillips et al., 2023; Reisner et al., 2024; N. A. Smith & Trainor, 2008). We also know that caregivers modulate their gaze contingent on infant behaviours (Perapoch Amadó et al., 2025), and that mothers’ object labelling and handling are often contingent on infants’ gaze (Goupil et al., 2024; Sun & Yoshida, 2022). From previous research we also know that object naming enhances infant attention during joint play (Mendez et al., 2023; Sun & Yoshida, 2024). This creates a shared history of interaction, enabling both partners to predict and adapt to each other’s actions, ultimately fostering a dynamic communication system (Bruner, 1974; Fogel et al., 1992; Gratier & Magnier, 2012; Hasson & Frith, 2016; Malloch & Trevarthen, 2009; Murray, 2014; Ravreby et al., 2022; Vygotsky et al., 1978).
As yet, though, most previous research that has examined dynamic processes during real-time caregiver-infant interactions has concentrated on examining individual features in isolation (Beebe et al., 2016; Ham & Tronick, 2009; Lavelli & Fogel, 2013; Phillips et al., 2023). No previous research has examined how these reciprocal interactive influences operate across multiple hierarchical layers, ranging from low-level salience cues (such as physical movement, and vocal pitch and amplitude fluctuations) through to higher-level features that represent meaning and context. Understanding this is crucial from both a practical perspective, for improving interventions that target interaction dynamics to improve long-term infant outcomes (e.g. Murray et al., 2016), and from a theoretical perspective, for enriching our understanding of how multimodal and hierarchical features interactively influence social attention, and how these influences change and develop with age.
1.3 Current Study
The present study examines how infant-caregiver interactions change between 5 and 15 months, which is the period when the capacity for infant-led joint attention is thought to emerge (Mundy et al., 2009). We manually coded the gaze of mothers and their infants as they played together with toys (see 2.2 for more details) and calculated a range of lower-level and higher-level features of the interaction. Lower-level features included ‘spectral flux’, i.e. the instantaneous change in audio spectral content, which was calculated separately for ambient noise and for sections where the mother was vocalising; and differentials (i.e. rates of change) of maternal face and hand movement, which drive low-level visual and auditory salience (Itti & Baldi, 2009). Higher-level features operate over longer timescales and involve the integration of low-level features into information that is cognitively meaningful. These included information rate (i.e., the rate of meaningful data transmitted per time unit), semantic surprisal (i.e., the negative log probability of the upcoming word given the preceding words), toy naming, and facial novelty (i.e., the novelty of the mother’s facial expression given the preceding expression) (see 2.2 for equations and formal definitions). The distinction presented here between lower- and higher-level features is a practical simplification; in reality, lower- and higher-level features exist along a continuum (Gwilliams et al., 2024; Heilbron et al., 2022).
We had three main predictions. Prediction 1: Maternal behaviours and infant attention. Across the age groups, mothers’ behaviour will be tightly coupled to changes in infant attention to objects across modalities (i.e., audio and visual) and levels (lower- and higher-order). Prediction 2: Age-dependent effectiveness of features. Lower-level features will be more effective in both capturing and maintaining younger infants’ attention to objects, while higher-level features will be more effective with older infants. Prediction 3: Changing temporal dynamics. The temporal relationship between fluctuations in maternal features and infants’ attention to objects will change with age. At 5 months, lower-level variables will more frequently precede infant attention. However, as infants increasingly take the lead in play, we predict that changes in higher-level features of the mother’s behaviour will more often follow the infant’s attention to objects.
To explore these three predictions, we first calculated the mean and median for all the features of the interaction we studied. Following this, we built a cross-correlation matrix to understand how our behavioural variables (e.g. the lower-level and higher-level features of the interaction) inter-relate. Finally, we performed cross-correlations to examine the temporal relationship between each lower- and higher-order feature and the infant’s attention to objects.
For our primary analyses, infants’ attention to objects was conceptualized as the ‘on-task’ behaviour reflecting active engagement with the central goal. In addition, however, because our tabletop play setting allowed for three gaze locations (object, partner, inattentive) we also wished to examine the possibility that decreases in infant’s attention to objects might be more readily explained as increases in infants’ attention to the mother’s face. To test this, we also include analyses in the Supplementary Materials in which we performed the same cross-correlations but with infants’ looks to the mother’s face rather than to objects as the dependent variable.
2. Materials and methods
2.1 Participants
Participants were typically developing infants and their mothers. Only mothers were included because of practical difficulties in recruiting sufficient fathers to provide a gender-matched sample. The catchment area for this study was East London, including boroughs such as Tower Hamlets, Hackney and Newham. Further demographic details on the sample are given in Table S1.
Participants were recruited postnatally through advertisements at local baby groups, local preschools/nurseries, community centres and targeted social media campaigns aimed at all parents in the area, from databases of prior projects and via word-of-mouth. Informed consent and authorisation for publication were obtained from the caregivers featured in Fig 1 and Fig S1. Ethical approval was obtained from the University of East London ethics committee (application ID: ETH2021-0076).

Illustration of analyses with the central figure showing all low- and high-level features synced.
A) Spectral flux is computed as the difference between consecutive spectra (green area between the current and previous spectrogram). B) Hand movement is computed as the average distance travelled by the right and left hands. C) Face movement is computed as the sum of the derivatives of the eye-to-brow distance (blue dot on the eye to red dot on the eyebrow) and the mouth opening distance (blue dot on the upper lip to red dot on the lower lip). D) Object naming was obtained using the automatic transcription and a query for the specific toys present during the interaction. E) Semantic surprisal. For each word, a probability distribution was obtained using GPT-2 prompted with the previous words; the semantic surprisal is the negative log probability of the observed word. F) Information rate. For each word, a cumulative complexity (upper) is computed using lossless compression algorithms; taking the derivative gives the information rate (lower). G) Facial novelty. For every frame, the facial expression is estimated, and an information distance is computed using the Kullback-Leibler divergence between consecutive frames.
Initial exclusion criteria included complex medical conditions, known developmental delays, prematurity, uncorrected vision difficulties and parents below 18 years of age. Further exclusion criteria, as well as the final numbers of data included in each of the analyses for both samples, are summarised in Table S2. The final samples included 33 5-month-old infants (15 females) and 34 15-month-old infants (17 females) and their mothers. Data were analysed in a cross-sectional manner. Average infant age was 5.23 months (std = 0.5) and 15.75 months (std = 1.15) respectively. Average maternal age was 35.31 years (std = 3.9, N = 29) at 5 months and 37.41 years (std = 3.82, N = 28) at 15 months. Two previous studies on joint attention and imitation have already reported analyses of the same participant cohort (Perapoch Amadó et al., 2025; Viswanathan et al., pre-print), although this is the first time that the remaining data (i.e. the lower- and higher-order variables) are analysed and reported.
2.2 Experimental design
Mothers and infants were seated facing each other on opposite sides of a table. Infants were seated either in a highchair or on a researcher’s lap, within easy reach of the toys (see Fig S1A). At the beginning of the joint play session, a researcher placed the toys on the table and asked the mothers to “play with their infants just as they would at home”. During the play session, researchers stayed behind a divider, out of view of both the mother and the infant. The same three toys were used for each age group (see Fig S1D). The average duration of the joint play interactions was 5.06 minutes (std = 1.37) at 5 months and 6.17 minutes (std = 1.64) at 15 months (Fig S1C). Average duration differed significantly between 5 and 15 months (t(65) = −3.014, p = 0.003). Given that the analyses conducted here are computed relative to the duration of each interaction, or are time-locked to specific events (e.g. infant looks at objects), variations in interaction duration should not bias the results.
The interactions were filmed using three Canon LEGRIA HF R806 camcorders recording at 50 frames-per-second (fps). Two cameras were placed in front of the infant, one on each side of the mother, and another one was placed in front of the mother, just behind the right side of the infant. All cameras were positioned so that the infant’s and the mother’s gaze, as well as the three toys placed on the table, were always visible (see Fig S1A). Microphone data were also collected using two wireless omni-directional Lavalier microphones recording at 44.1kHz in wav format, one attached to the mothers’ clothing and the other to the infant’s highchair.
Of note, brain activity was also recorded from both the infants and their mothers, at both ages, using a 64-channel BioSemi gel-based ActiveTwo EEG system. However, this data is not included in the current manuscript.
2.3 Data processing
2.3.1 Synchronisation of the different datasets
The cameras pointing at the participants were synchronised via radio frequency (RF) receiver LED boxes attached to each camera. The RF boxes received trigger signals from a single source (computer running Matlab) at the beginning and end of the play session and concurrently triggered light pulses to LED lights visible to each camera, along with an audible beep. The synchronisation of the video coding was conducted offline by aligning the times of the LED lights of the three cameras and checking that the durations matched. The audio data from the two microphones was synchronised from the start using the Zoom H4N PRO Handy Recorder, which enables simultaneous recording. Finally, Adobe Premiere Pro was used to synchronise the video with the audio data, allowing us to align the two datasets in time and estimate the lag between them.
2.3.2 Gaze behaviour coding and processing
The looking behaviour of the infants and their mothers was manually coded offline on a frame-by-frame basis, at 50fps. The start of a look was the first frame in which the gaze was static after moving to a new location. The following categories of gaze were coded: looks to objects (focusing on one of the three objects), looks to partner (looking at their partner), inattentive (not looking to any of the objects nor the partner) and uncodable (see Fig S1B). Uncodable moments included periods where: 1) their gaze was blocked or obscured by an object and/or their own hands, 2) their eyes were outside the camera frame, and/or 3) a researcher was within the camera frame.
To assess inter-rater reliability, ~22% of the data (15 datasets) were double coded by a second coder and both Cohen’s kappa and observed agreement were calculated. There was substantial agreement (κ= 0.628, std= 0.134; Kappa error= 0.005, std= 0.001; observed agreement = 0.751, std= 0.092) (Landis & Koch, 1977). Looking behaviour data was then processed such that any look preceding and following an “uncodable” period was excluded from further analyses. Similarly, both the first and the last look of every interaction were also excluded from further analyses.
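As a concrete illustration, Cohen’s kappa corrects the observed frame-by-frame agreement between two coders for the agreement expected by chance. A minimal sketch (the gaze labels and function name are illustrative; in practice the calculation runs over the full frame-coded sequences):

```python
def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two sequences of category labels."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    labels = set(coder_a) | set(coder_b)
    # observed proportion of frames on which the two coders agree
    p_obs = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # agreement expected by chance, from each coder's marginal label rates
    p_exp = sum((coder_a.count(l) / n) * (coder_b.count(l) / n) for l in labels)
    return (p_obs - p_exp) / (1 - p_exp)
```

For example, two coders who agree on three of four frames, with the marginal label frequencies above, obtain a kappa of 0.5 rather than the raw 0.75 observed agreement.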
2.3.3 Correlation analyses and calculation of significance
To calculate the associations between different features of the interaction, we first performed a standard correlation analysis between all variables (Fig 3). Next, we conducted cross-correlation analyses to examine the dynamic associations between interaction features and infant attention (3.3) (i.e., whether one variable leads and the other follows) (SM Fig 5). First, we converted the infants’ looking behaviour data into binary arrays of ones (looking at an object) and zeros (not looking at an object) (Wass et al., 2019; Phillips et al., 2023). Following this, we linearly detrended all the time series and calculated the cross-correlations between all the different features of the interaction (3.2), and between these and infant attention (3.3), separately.

Descriptive analyses on lower- and higher-level features.
Violin plots showing the average at a group level of lower-level: Spectral flux - ambient (A); spectral flux - vocalisations (B); facial movement (C); hand movement (D), and higher-level features: object naming (E); information rate (F); semantic surprisal (G); facial novelty (H). Individual dots represent the data for each participant, in orange is data at 5 months, and in purple is data at 15 months. Red dots are showing the mean and red lines are showing the median. Asterisks indicate significance (* = p-adj <0.05, ** = p-adj <0.01, *** = p-adj < 0.001).

Correlation matrix between all features of the interaction at 5 and 15 months.
Each square shows the correlation between two maternal behavioural variables with a zero lag. The bottom-left triangle shows correlation values for the 5 months visit, and the top-right triangle shows correlation values for the 15 months visit.
Because cross-correlations are not time locked to specific moments (i.e. onset of an infant look to object) but instead are conducted on two time series (e.g. infant looks to objects and information rate) as a whole (see Fig 1), the strength of the overall correlation is weakened by the fact that periods of expected stronger correlation are balanced by weaker correlations where we would not expect any correlation at all (Xu et al., 2020). This can lead to very small correlation coefficients (e.g. around r = ±0.05), albeit sometimes significant when compared against the permuted correlations (see below).
To assess whether the results from the cross-correlations were significantly different from chance, we generated permuted data and compared it against the observed data using Cluster-Based Permutation (CBP) tests. To generate permuted data, we computed the cross-correlation between 500 random combinations of time series data from different dyads. For example, for each feature, the time series data of one participant (e.g. information rate from dyad 1) was randomly paired with the time series data of another participant (e.g. infant looking behaviour from dyad 13). We repeated this 500 times and then performed a CBP test to examine significant differences between the results from the observed (i.e. real) and the permuted data.
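The permutation scheme can be sketched as follows. `xcorr` is a hypothetical helper implementing a normalised cross-correlation over a window of lags, and `permuted_xcorrs` pairs each feature series with the gaze series of a different, randomly chosen dyad, as described above (a sketch of the shuffling logic, not the exact analysis pipeline):

```python
import numpy as np

def xcorr(a, b, max_lag):
    """Normalised cross-correlation of a against b at lags -max_lag..max_lag."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    n = len(a)
    vals = []
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            vals.append(np.dot(a[lag:], b[:n - lag]) / (n - lag))
        else:
            vals.append(np.dot(a[:n + lag], b[-lag:]) / (n + lag))
    return np.array(vals)

def permuted_xcorrs(features, gazes, n_perm=500, max_lag=50, seed=0):
    """Cross-correlate each feature series with the gaze series of a
    randomly chosen *different* dyad, repeated n_perm times."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_perm):
        i, j = rng.choice(len(features), size=2, replace=False)
        n = min(len(features[i]), len(gazes[j]))
        out.append(xcorr(features[i][:n], gazes[j][:n], max_lag))
    return np.array(out)
```

The resulting (n_perm × n_lags) array of surrogate cross-correlations then serves as the chance distribution for the cluster-based test.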
The CBP test statistic was calculated using a function from FieldTrip (Maris & Oostenveld, 2007) called “ft_timelockstatistics”. This nonparametric framework allowed us to both control for the multiple comparison problem that arises from the fact that the effect of interest is evaluated many times (e.g. changes in mothers’ spectral flux around infant looks to objects), and to reduce the potential for false negative effects (Meyer et al., 2021).
Please refer to Fig S4 for a visual guide on interpreting findings from a cross-correlation analysis, illustrating the interpretation of positive and negative cross-correlation values across forward and backward time-lags. Of note, we refer to ‘forward lags’ as positive lags and ‘backward lags’ as negative lags to avoid redundancy with the terms ‘positive’ and ‘negative’ used to describe correlation values.
2.3.4 Calculation of features of the interaction
In this section, we explain how both lower and higher-level features were calculated. Of note, these variables were all resampled to match the video sampling rate of 50Hz.
2.3.4.1. Calculation of lower-level features
2.3.4.1.1 Spectral Flux
Spectral flux (SF) is a measure of acoustic change over time. In this study, we opted to separate spectral flux of the ambient background (i.e., other environmental noises such as claps, toys banging on the table, etc) from spectral flux of the mothers’ vocalisations to better analyse the distinct acoustic properties and influences of each on infant attention (see Fig 1A and 1B). We computed spectral flux as the sum of absolute differences over frequency of successive short-time Fourier transform (Müller, 2015).
$$\mathrm{SF}(t) = \sum_{f} \left| X(t;f) - X(t-1;f) \right|$$
where X(t;f) is the spectral amplitude at time t and frequency f computed over a 0.05sec window with 0.02sec hop-size.
Spectral flux was chosen as a metric of low-level auditory salience because it captures both amplitude and frequency changes over time and reflects acoustic differences between consecutive time points, both of which are important salient auditory features (Huang & Elhilali, 2017; Kothinti et al., 2021). These features are modulated in Infant Directed Speech (IDS) (Cooke et al., 2014; Leong et al., 2017) and signal short-term novelty in the audio stream (Müller & Chiu, 2024). Furthermore, spectral flux has been found to be a better predictor of neural entrainment to music than the amplitude envelope alone, as it also measures changes in the frequency domain (Weineck et al., 2022). This metric therefore provides a holistic measure of all possible spectral changes in naturalistic speech.
We then split spectral flux into two complementary streams using OpenAI’s Whisper (Radford et al., 2023), an open-source automatic speech recognition system. The general architecture of the classifier, as well as the process followed to train the model, are presented in more detail in Radford et al. (2023). WhisperX (Bain et al., 2023) was also used to obtain word-level onsets and offsets with millisecond precision. From there, we computed spectral flux separately for vocalisations and for ambient environmental sounds (i.e., all periods that occurred outside of the periods where WhisperX detected speech).
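Under this definition, the flux computation can be sketched in a few lines (a pure-NumPy sketch using the reported 0.05 s window and 0.02 s hop; the Hann taper is our assumption, not stated in the text):

```python
import numpy as np

def spectral_flux(audio, sr=44100, win_s=0.05, hop_s=0.02):
    """Sum over frequency of absolute spectral-amplitude differences
    between successive short-time Fourier transform frames."""
    win = int(win_s * sr)
    hop = int(hop_s * sr)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop:i * hop + win] for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
    return np.sum(np.abs(np.diff(mag, axis=0)), axis=1)
```

Masking the resulting flux series with the WhisperX word timestamps then yields the separate vocalisation and ambient streams.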
2.3.4.1.2 Face movement
Face movement measures low-level facial movements, particularly the coordination of eyebrow and mouth movements. We used the “whole body” model from the MMPose toolbox (MMPose Contributors, 2020) to extract two-dimensional coordinates of mothers’ body features in every frame. The raw 2D coordinates were cleaned using a series of low-pass filters designed to reject movements that were impossibly fast. We then interpolated the resulting short missing segments and low-pass filtered the time series to reduce jitter from the feature estimation. To calculate the “Face movement” variable, we first averaged the distances between the left and right eye-to-brow measurements and then averaged this result with the mouth opening measurement. This approach captures low-level facial movements, which are considered salient during mother-infant interactions (Biringen, 1987), in particular mouth movements (Lewkowicz & Hansen-Tift, 2012; Zhang et al., 2021) and eyebrow movements (de Klerk et al., 2018; Isomura & Nakano, 2016).
Distances between the eye-to-brow as well as mouth opening were calculated using the Euclidian distance between the 2D coordinates of the centre of the eye, centre of the brow, and upper and lower lips (see Fig 1C). These distances were then normalised by the nose length at each frame to account for dynamic distance between the mothers’ head and the camera.
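The geometry described above can be sketched as follows (keypoint arrays are assumed to come from a pose estimator such as MMPose; the argument names and final differencing step are our illustration):

```python
import numpy as np

def face_movement(l_eye, l_brow, r_eye, r_brow, lip_up, lip_low,
                  nose_top, nose_tip):
    """Frame-to-frame change in a nose-normalised facial aperture signal.
    Each argument is an (n_frames, 2) array of 2-D keypoint coordinates."""
    dist = lambda p, q: np.linalg.norm(p - q, axis=1)
    nose = dist(nose_top, nose_tip)                      # per-frame normaliser
    eye_brow = (dist(l_eye, l_brow) + dist(r_eye, r_brow)) / 2 / nose
    mouth = dist(lip_up, lip_low) / nose
    signal = (eye_brow + mouth) / 2                      # averaged aperture
    return np.abs(np.diff(signal))                       # movement over time
```

Normalising by nose length at every frame makes the signal robust to the mother leaning towards or away from the camera.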
2.3.4.1.3 Hand movement
Our hand movement measure reflects the average distance travelled by the left and right hands. We used the same pipeline as for the face movement analysis (the “whole body” model from MMPose with in-house cleaning) to extract the distance travelled over time by the left and right hands, and then computed their average. We took the distance travelled by the hands as an index of low-level movement activity within the computed time window. The distance travelled was computed over 50 consecutive samples (1 second) and is proportional to speed.
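A sketch of the distance-travelled computation (the rolling one-second window matches the 50-sample window described; the function names are ours):

```python
import numpy as np

def distance_travelled(xy, win=50):
    """Sum of frame-to-frame displacements of one hand over a rolling
    window of `win` consecutive samples (1 s at 50 fps)."""
    step = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    return np.convolve(step, np.ones(win), mode='valid')

def hand_movement(left_xy, right_xy, win=50):
    """Average of the left- and right-hand distance-travelled series."""
    return (distance_travelled(left_xy, win) +
            distance_travelled(right_xy, win)) / 2
```

Because the window length is fixed, the windowed distance travelled is directly proportional to average speed within that second.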
2.3.4.2 Calculation of higher-level features
2.3.4.2.1 Object naming
Here we created a binary array where ones represented utterances of the toy names (i.e. panda, book and rattle; Fig S1D) and zeroes indicated the absence of these words. To obtain the onsets and offsets of each toy name we used OpenAI’s Whisper (Radford et al., 2023) and WhisperX (Bain et al., 2023) (explained above in 2.3.4.1.1).
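Given WhisperX-style word timestamps, the binary naming array can be built as follows (the toy-name set and 50 fps alignment are as described; the punctuation stripping is our assumption):

```python
import numpy as np

TOY_NAMES = {'panda', 'book', 'rattle'}

def naming_array(words, n_frames, fps=50, toys=TOY_NAMES):
    """Binary array at video frame rate: 1 while a toy name is spoken.
    `words` is a list of dicts with 'word', 'start' and 'end' (seconds),
    in the style of WhisperX word-level alignment output."""
    arr = np.zeros(n_frames)
    for w in words:
        if w['word'].lower().strip('.,!? ') in toys:
            arr[round(w['start'] * fps):round(w['end'] * fps)] = 1
    return arr
```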
2.3.4.2.2 Information Rate
Information rate is a measure of the amount of new (uncompressible) information per unit of time. It was computed as the derivative of the cumulative compression size (Schmidhuber, 2009). In our case it was computed at a word level and can be expressed as:
$$\mathrm{IR}(w_i) = \mathrm{compression}(w_1, \ldots, w_i) - \mathrm{compression}(w_1, \ldots, w_{i-1})$$
with w_i being the ith word and compression(w_1, …, w_i) being the size of the losslessly compressed representation of the transcript up to that word. The lossless compression technique can track redundancy in text sequences in hierarchically increasing n-sequences. Compressing text in a cumulative window with PyLZMA, a Lempel-Ziv-Markov algorithm with dynamical dictionaries (Bauch et al., 2015), resembles how adults make predictions about upcoming events (Hasson et al., 2015; Schmidhuber, 2009). Upcoming words are predicted from the integrative posterior, i.e. the words (or sequences of words) that have already occurred. As more words are introduced, the size of the compressed description increases. New words (i.e. words that have not occurred before) increase the description’s size by a larger degree than words that have already occurred. By taking the derivative of that increasing size, we effectively compute the rate at which new information is introduced. This integrative compression process can be compared to predictive processing (Schmidhuber, 2009) and to cognitive constructivism (Chater & Vitányi, 2002; Wolff, 2014, 2019), where a pattern-seeking agent compresses incoming data into a modelled representation using redundancies in the data.
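A minimal sketch of this computation using Python’s standard-library lzma (the study used PyLZMA; both implement Lempel-Ziv-Markov compression, so the exact byte counts here are illustrative):

```python
import lzma

def information_rate(words):
    """New (uncompressible) information per word: the discrete derivative
    of the cumulative compressed size of the transcript."""
    sizes = [len(lzma.compress(' '.join(words[:i + 1]).encode()))
             for i in range(len(words))]
    return [sizes[0]] + [b - a for a, b in zip(sizes, sizes[1:])]
```

Repeating a word adds little to the compressed description, so its information rate is low; a previously unseen word adds more.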
2.3.4.2.3 Semantic surprisal
Semantic surprisal quantifies how surprising each upcoming word is given its context. This approach was chosen as it correlates with reading time and the N400 (S. L. Frank et al., 2015; Shain et al., 2024), and was computed for each word as the negative log probability of that word given all previous words (Willems et al., 2016). Specifically, this was conducted using the GPT-2 large language model, a generative transformer model that predicts the most probable next words given a text, along with their respective probabilities. We then retrieved the probability of the observed upcoming word and computed the negative log2 of that probability as the semantic surprisal (Shannon, 1948). The word was then added to the context window, and the prediction process was repeated for the next upcoming word (see Fig 1E). Importantly, we prompted every interaction with the same sentence describing the experiment, to prevent the model from being surprised when it encountered normally unusual or unexpected words such as ‘panda’ (“The following text is about a mother and her infant playing together, as they would at home, with a toy panda, a book and a rattle, while wearing EEG hats and facial electrodes”).
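The word-by-word loop can be sketched generically; here `next_word_prob` is a stand-in for the language model (GPT-2 in the study), injected as an argument so that the sketch stays self-contained:

```python
import math

def surprisal_series(words, next_word_prob):
    """Surprisal in bits for each word given its preceding context.
    `next_word_prob(context, word)` returns P(word | context); in the
    study this probability comes from GPT-2."""
    return [-math.log2(next_word_prob(words[:i], w))
            for i, w in enumerate(words)]
```

With GPT-2, `next_word_prob` would be implemented by running the model on the context (including the fixed prompt sentence) and reading off the softmax probability of the observed word.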

2.3.4.2.4 Facial novelty
Our measure of facial novelty reflects an estimate of changes in the facial expressions of the mother over time. To calculate these, we used an open-source Python library, the ‘Facial Expression Recognition using Residual Masking Network’ (Pham et al., 2020). This classifier gives probabilities for seven canonical facial expressions (happy, sad, surprise, angry, disgust, fear, neutral) using a residual neural network, a machine learning model that generalises well due to “skip connections” that allow the model to ignore irrelevant features and focus on features that are informative for the task. On adult faces the model performs with 76.82% accuracy compared to manually labelled expressions (Pham et al., 2020). To measure changes over time, we computed the Kullback-Leibler divergence between consecutive frames.

Where Ex,t is the probability of expression x at time t. The Kullback-Leibler divergence can be interpreted as the information gained from updating Ex,t-1 to Ex,t: if the divergence is high, the most recent expression distribution required a lot of information to be updated, i.e. the facial expression was novel compared to the previous frame (see Fig 1). We classified this as a higher-level feature because it is the frame-wise difference between the last, and most abstract, layer of the neural network (Giordano et al., 2023; Kiat et al., 2022; Kothinti & Elhilali, 2023; Lecun et al., 2015; Zeiler & Fergus, 2013), and because we chose an information-theoretic measure that has been used to model infant attention (M. C. Frank et al., 2009; Poli et al., 2020) to quantify the model updating between consecutive frames.
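As a concrete sketch, the frame-to-frame novelty computation might look like this (the probability vectors below are made-up examples, not classifier output):

```python
import math

EXPRESSIONS = ["happy", "sad", "surprise", "angry", "disgust", "fear", "neutral"]

def kl_divergence_bits(p_curr, p_prev, eps=1e-12):
    """D_KL(current || previous) in bits: the information gained by updating
    the previous frame's expression distribution to the current one."""
    return sum(p * math.log2((p + eps) / (q + eps))
               for p, q in zip(p_curr, p_prev))

# Two consecutive frames: a mostly neutral face followed by a surprised one.
frame_prev = [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.70]
frame_curr = [0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05]

novelty = kl_divergence_bits(frame_curr, frame_prev)    # large -> novel expression
no_change = kl_divergence_bits(frame_prev, frame_prev)  # 0 -> identical frames
```

The small `eps` guards against zero probabilities, which a softmax output should not produce but which can arise from rounding.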
3. Results
3.1 General descriptives
First, we explored between- and within-age group differences in how many times per minute infants and mothers engaged in looks to objects (see SM, Fig S2A) and for how long these episodes lasted on average (see SM, Fig S2B). We found no differences between age groups in the average look duration to objects or counts for either infants or mothers. However, mothers looked more frequently to objects (t5M(61)= −3.01, p5M=0.004; t15M(63)= −6.34, p15M<0.001) but for shorter durations (t5M(61)= 5.43, p5M<0.001; t15M(63)= 8.47, p15M<0.001) compared to infants at both ages. The same analyses on looks to partner are presented in the SM (Fig S2C and D). Second, we quantified the amount of caregiver speech during the joint play interactions at 5 and 15 months (see SM, Fig S3). The amount of speech was greater at 15 months compared to 5 months (t(52)= −2.43, p= 0.018).
Finally, we calculated the overall mean and median values for all the lower- and higher-level features. This allowed us not only to see how these variables are distributed across participants but also to conduct age-group comparisons. To assess significant differences across age groups, we first tested for normality using MATLAB’s “vartestn” function. For normally distributed data, we applied two-sided t-tests, and for non-normally distributed variables, we used two-sided Wilcoxon rank-sum tests. We controlled the false discovery rate (FDR) across comparisons using the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) with an alpha set at 0.05. In Table S3, we report the adjusted p values (p-adj), corrected for multiple comparisons, after each uncorrected p value. We observed that the averages for both “Information Rate” and “Object naming” increased significantly between 5 months and 15 months (see Table S3). The rest of the variables did not change significantly from 5 to 15 months.
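For reference, the Benjamini-Hochberg step-up adjustment can be implemented in a few lines (a generic sketch; the analyses here were run in MATLAB):

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg (FDR) adjusted p-values.

    Each raw p-value is scaled by m / rank (its 1-based rank among the
    sorted p-values); a running minimum taken from the largest p-value
    downwards enforces monotonicity of the adjusted values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(reversed(order)):
        rank = m - k
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = min(running_min, 1.0)
    return adjusted

print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
# -> [0.04, 0.0533..., 0.0533..., 0.2]
```

A comparison is then called significant if its adjusted p-value falls below the chosen alpha (0.05 here).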
3.2 Correlations between all features of the interaction
We performed a standard correlation analysis across all eight lower- and higher-level features to examine the associations between the different interaction features. Figure 3 shows the correlation matrices at 5 and 15 months. We also conducted a cross-correlation analysis to capture more dynamic associations (see SM Fig 5).
Overall, both analyses revealed that most variables were interrelated, as evidenced by significant correlations observed across multiple feature pairs (SM Fig 5). While the strength and timing of these correlations vary, most of them exhibit significant positive associations, particularly around lag 0, suggesting a high degree of temporal coordination (see Fig 3 for correlation values at lag 0, and SM Fig 5 for cross-correlation results). For example, when ‘Spectral Flux of vocalisations’ is high, ‘semantic surprisal’, ‘Information rate’ and ‘object naming’ are also high.
Conversely, the only pairs exhibiting negative correlations were those involving ‘spectral flux of ambient noise’ and the speech-related variables (‘spectral flux of vocalisations’, ‘object naming’, ‘information rate’ and ‘semantic surprisal’). This relationship is likely due to the way these variables are calculated, as they are inherently interdependent. The spectral flux of ambient noise and the spectral flux of vocalisations (or information rate) are measured in mutually exclusive contexts: when mothers are vocalising, the spectral flux of ambient noise is set to zero, and when the spectral flux of ambient noise is being measured, mothers are, by definition, not vocalising. As a result, the observed pattern is a byproduct of our methodology rather than an intrinsic relationship between the variables.
The variables ‘facial movement’ and ‘facial novelty’, although both measuring the dynamics of facial features, showed correlation values below .02, validating the theoretical distinction described in the methods section. Interestingly, there were few differences observed between 5 and 15 months.
3.3 Cross-correlations between features of the interaction and infant attention to objects
For our primary analyses we conducted cross-correlations to examine whether certain features of the interaction, most of them related to maternal behaviours, forward-predicted changes in infant attention to objects, or vice versa. Our predictions were that mothers would respond to decreases in infant attention by increasing their own lower-level salience (lower-level features forward-predict infant attention); but then, when the infants’ attention is re-engaged, they would downregulate salience and upregulate higher-order semantic features (infant attention forward-predicts higher-level features).
Our findings are presented below, organized by each variable. The p-values resulting from statistical tests were adjusted separately for the lower-level and higher-level groups using the Benjamini-Hochberg procedure (Benjamini & Hochberg, 1995) with an alpha set at 0.05. The adjusted p-values are reported as ‘p-adj’ in the main text alongside the corresponding uncorrected p-values. Unless otherwise specified, all reported findings reflect bidirectional relationships (i.e., if an increase in X leads to a decrease in Y, the opposite - where a decrease in X leads to an increase in Y - also holds). Please refer to Fig S4 for a visual guide on interpreting findings from a cross-correlation analysis.
Since one important question in interpreting our findings is whether, instead of paying attention to objects, infants are instead paying increased attention to their parents, we have also presented in the SM (Figs S6 and S7) the exact same set of analyses but examining instead the associations between features of the interaction and infant attention to mother.
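At their core, these analyses compute a lagged Pearson correlation between two time series. A minimal sketch follows (the function name and toy signals are illustrative assumptions; the actual analyses additionally used permuted control data and cluster-based permutation tests to assess significance):

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Pearson correlation between x and y at integer lags.
    Positive lags: x leads (x forward-predicts y);
    negative lags: y leads."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lags = list(range(-max_lag, max_lag + 1))
    vals = []
    for lag in lags:
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:lag]
        vals.append(np.corrcoef(a, b)[0, 1])
    return lags, vals

# Toy check: y is x delayed by 3 samples, so the peak should sit at lag +3.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.concatenate([np.zeros(3), x[:-3]])
lags, vals = cross_correlation(x, y, max_lag=5)
peak_lag = lags[int(np.argmax(vals))]  # 3
```

A peak at a positive lag indicates that the first signal forward-predicts the second; a peak at a negative lag indicates the reverse.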
3.3.1 Lower-order features
3.3.1.1 Spectral flux - ambient
As expected, we observed significant negative correlation values at t=0, suggesting that less ambient spectral flux was associated with more infant attention to objects. When we introduced a time lag between the variables, we found that this negative association was significant from t=-1.1/-0.86sec (infant precedes ambient SF) to t=3.06/3.54sec (ambient SF precedes infant) at 5 months (p=0.008, p-adj= 0.013) and at 15 months (p=0.009, p-adj= 0.013) respectively (Fig 3A, 3B). The significance of both findings is stronger for positive time-lags indicating that, overall, reductions in the ambient spectral flux tended to forward-predict increases in infants’ subsequent attention to objects (or that increases in ambient spectral flux forward-predicted decreases in infants’ object attention) at both ages (Fig 3A, 3B).
3.3.1.2 Spectral flux - vocalisations
We found negative correlation values at t=0, suggesting that maternal speech containing less spectral flux associates with more infant attention to objects. At 5 months, this relationship was significant during two temporal windows: from t=-2.1sec (infant precedes mother) to t=1.24sec (mother precedes infant) (p=0.015, p-adj= 0.017), and from t=6.54sec to t=9.52sec (p=0.022, p-adj= 0.022) (Fig 3C). This suggests that, when 5-month-old infants focus on objects, mothers reduce the acoustic variability of their vocalisations, or conversely, when infants’ attention to objects decreases, mothers increase the acoustic variability of their vocalisations (backward time-lag) (see Figure S4). Additionally, we also found that an increase in mothers’ acoustic variability was followed by a reduction in infants’ attention to objects, or vice versa, when mothers decrease their acoustic variability, infants’ attention to objects increased (forward time-lag). At 15 months, instead, these negative associations were present at forward time-lags but not statistically significant (Fig 3D).
We considered the possibility that an increase in the spectral flux of maternal vocalisations might be associated not with an increased likelihood of looking at the object, but rather with an increased likelihood of looking at the mother. However, we found no support for this (see SM, Fig S6C and D).
3.3.1.3 Facial movement
We found negative correlations at time t=0, suggesting that increased maternal facial movement associates with less infant attention to objects, at both 5 and 15 months. This was significant from t=-3.92/-3.46sec to t=2.48/3.06sec at 5 (p= 0.006, p-adj= 0.013) and 15 months (p<0.001, p-adj= 0.007) respectively (Fig 3E, F). This suggests that when infants focus on the objects, mothers make fewer facial movements, potentially removing stimuli from an infant’s peripheral vision. Conversely, when mothers increase facial movements, infants tend to look less at the objects (Fig 3E, F). As expected, when we looked at the relationship between maternal facial movement and infants’ looks to their mother, we found the opposite pattern (see SM, Fig S6E, F). When infants look at their mother, there is an increase in maternal facial movement; and when mothers increase facial movements, infants look at their mothers more. Again, this pattern was observed at both ages. This relationship is independent of the relationship with vocal spectral flux documented in 3.3.1.2 (see Fig 3, Fig S5): the direct association between those two variables is negative, yet each is negatively associated with infant attention.
3.3.1.4 Hand movement
We found negative correlations at time t=0, suggesting that increased maternal hand movement associates with less infant attention to objects, at both 5 and 15 months. This was significant from t=-3.6/-8.02sec to t=1.52/1.02sec at 5 (p=0.01, p-adj= 0.013) and 15 months (p=0.003, p-adj= 0.012) respectively (Fig 3G, H). The significance of both findings is stronger at backward time-lags indicating that, overall, when an infant focuses their attention on an object, it leads to a decrease in the mother’s hand movements (or that a decrease in the infant’s attention to an object leads to an increase in the mother’s hand movements) (see Fig 3G, H). Facial movement and hand movement were themselves positively correlated (see Fig 3, Fig S5), so their similar associations with attention likely reflect common variance across multiple variables.
3.3.2 Higher-level features
3.3.2.1 Object naming
No association between object naming and infant attention was observed at 5 months (Fig 4A). At 15 months, instead, we observed a significant positive correlation from t=0.24 to t=2.6sec (p=0.032, p-adj= 0.032), indicating that, when mothers name objects, it associates with subsequent increases in infants’ attention towards the named object (Fig 4B). These associations may be due to the shared variance between object naming and low-level features documented previously (see Fig 3, Fig S5). However, the fact that the associations with low-level features were consistent at both ages, whereas the association between object naming and attention was present only at 15 months, partly argues against this possibility.

Cross correlations between lower-level features and infant attention to objects.
Spectral flux of ambient background and vocalisations, facial movement, and hand movement in relation to infant attention to objects at 5 months (A, C, E, G) and 15 months (B, D, F, H), respectively. Thick orange (5 months) and purple (15 months) lines represent the observed cross-correlation results, with shaded coloured areas showing their SEM. Grey lines represent control (permuted) data, with the shaded grey area indicating its SEM. Red thick lines indicate significance from the CBP test (significance for the CBP tests was set to p<0.025, two-sided, and was then FDR adjusted).
3.3.2.2 Information rate
We observed a significant positive correlation between information rate and infant attention, whereby increased maternal information rate associated with increased infant attention to objects. This was significant from t=2.46 to 5.56sec at 5 months (p=0.023, p-adj= 0.032; Fig 4C), but non-significant at 15 months (Fig 4D). These findings are stronger at forward time-lags indicating that, overall, increases in maternal information rate more often precede infant attention to objects than follow it. Interestingly and conversely, we observed the opposite pattern for infant attention to the partner, where maternal information rate tended to precede a decrease in infant attention to the partner. This pattern was present at both ages, though it was only statistically significant at 5 months (see SM, Fig S7C, D). This association is independent of the relationship documented between information rate and vocal spectral flux (see Fig 3, Fig S5): those two variables are positively associated with each other, whereas information rate is positively associated with infant attention and vocal spectral flux negatively.
3.3.2.3 Semantic surprisal
No significant associations were observed between semantic surprisal and infant attention to objects (Fig 4E, 4F) or attention to mothers (Figure S7E, F).
3.3.2.4 Facial novelty
We found negative correlations at t=0, indicating that increased infant attention to objects is associated with decreased maternal facial variability. These associations were significant at 15 months from t=-5.68 (infant precedes mother) to t=+1.96sec (mother precedes infant) (p<0.001, p-adj= 0.003; Fig 4H). At 5 months, instead, these associations were non-significant (Fig 4G). Overall, these findings are stronger at backward time-lags indicating that increases in infant attention to objects precede decreases in the variability of maternal facial expressions (or that decreases in infant attention to objects precede increases in that variability) (Fig 4G, H). As with the facial movement data (Fig 3E, F), we expected that increases in the variability of maternal facial expressions would associate with infant attention to mothers. However, this was not the case. Although a pattern emerged, whereby infants appeared to pay more attention to their mothers when mothers increased the variability of their facial expressions, it was not significant (SM Fig S7G, H). The relationship between facial novelty and attention may be related to the relationships already observed between facial movement and attention (Fig 3E, F), as the two variables are themselves weakly positively associated (r=.007/.02 at 5/15 months).

Cross correlations between higher-level features and infant attention to objects.
Maternal object naming, information rate, semantic surprisal, and facial novelty in relation to infant attention to objects at 5 months (A, C, E, G) and 15 months (B, D, F, H), respectively. Thick orange (5 months) and purple (15 months) lines represent the observed cross-correlation results, with shaded coloured areas showing their SEM. Grey lines represent control (permuted) data, with the shaded grey area indicating its SEM. Red thick lines indicate significance from the CBP test (significance from CBP was set to p<0.025, two-sided; of note, significance for object naming was set at p<0.05, one-sided, based on the expectation that object naming would positively correlate with attention. All results were FDR adjusted).
4. Discussion
In this study we examined mother-infant interactions during dyadic object play at 5 and 15 months. We characterised eight features of the interaction, primarily derived from maternal behaviours, that were further categorized into lower-level features - namely spectral flux of ambient noise and of maternal vocalisations, and maternal face and hand movement - and higher-level features, including information rate, semantic surprisal, toy naming and facial novelty. Using time-series analyses we investigated how these features of the interaction, most of them coming from the behaviours of the mothers, dynamically modulate and are modulated by fine-grained fluctuations in the infant’s attention state.
Our preliminary analyses indicated no significant differences in infants’ overall attentiveness to the objects between 5 and 15 months (Fig S2A and B). Examining interaction features from 5 to 15 months revealed increases in maternal object naming, information rate and total amount of speech (Fig 2E, 2F, S3), while all other features remained unchanged (Fig 2). This is in line with previous research showing that caregiver speech becomes more complex over infant development (Schwab et al., 2018), accommodating infants’ need for redundancy in word learning (Schwab & Lew-Williams, 2016; Tal et al., 2021).
Next, we examined the relationships among all eight lower- and higher-level interaction features (Fig 3, S5). The majority of the features of the interaction showed positive associations, particularly around lag 0, indicating small but significant levels of temporal coordination. This suggests that mothers engage in multiple behaviours across various modalities simultaneously when interacting with their infants, highlighting the interconnected nature of our variables. Interestingly, there were few differences observed between 5 and 15 months, implying that the features of the interaction remain relatively consistent across these time points. Importantly, while higher- and lower-level features were interrelated (Fig. 3, S5), they showed distinct patterns in relation to infant attention to objects and to the mothers. This suggests that, despite their interconnectedness, they contribute differently to dyadic interaction dynamics and the shaping of infant attention.
When we examined the dynamic associations between infants’ attention and lower-level features of the mothers’ behaviour, we found a range of fine-grained associations, consistent with Prediction 1, which suggests that mothers use multiple behavioural cues during play that are tightly coupled to changes in infant attention (Perapoch Amadó et al., 2025; Phillips et al., 2023; Suarez-Rivera et al., 2019). We found that ambient spectral flux, spectral flux of vocalisations, face and hand movements all showed negative correlations at time t=0, suggesting either that caregivers increased their saliency when infants were not paying attention to the toy objects, or that they decreased their saliency when infants were paying attention, presumably to help them focus on the toys (Fig 3). Contrary to Prediction 2, we found that the direction of these associations did not change with age: relationships were generally consistent between 5 months and 15 months. When we introduced a time lag between the variables we found, consistent with Prediction 3, that decreases in spectral flux (a measure of acoustic change) in both maternal vocalisations and the ambient noise was followed by subsequent increases in infants’ attention to objects (Fig 3 A-D). Additionally, increases in infants’ attention to objects were followed by decreases in mothers’ hand and facial movements (Fig 3 E-H). As expected, maternal facial movements were associated with infant attention to mother’s faces (Fig S6E and S6F).
Overall, these findings suggest that mothers adapt their low-level behaviour across different modalities to support their infant’s attentional focus. Specifically, when infants direct their attention toward an object, mothers may intentionally pause their ongoing actions or interactions to facilitate the infant’s engagement with the object of interest. Interestingly, these lower-level features were all positively correlated with infant attention to mothers’ face at t=0 (the opposite pattern observed with infant attention to objects), suggesting that mothers may reduce these behaviours not only to support infants’ attention to objects but also to avoid distracting them from the objects. This aligns with previous research demonstrating that mothers actively scaffold their infants’ attention (e.g. Bakeman & Adamson, 1984; Bigelow et al., 2004; Suarez-Rivera et al., 2019; Sun & Yoshida, 2022). Our findings are interesting because they highlight not only the positive associations between low-level maternal behaviours and infant attention - where increased maternal behaviours correspond to increased infant attention - but also the role of maternal behaviour reduction. More specifically, the decrease in certain maternal behaviours, potentially those that could have been acting as distractors, was also linked to greater infant attention. This suggests an adaptive parental strategy that not only involves increasing certain behaviours to support infant engagement, but also strategically reducing others, highlighting the importance of modulating low-level behaviour contingent on infant attention.
We also observed dynamic associations between infants’ attention and higher-level features of the interaction. Consistent with Prediction 1, we found that decreases in infant attention to objects preceded increases in the variability of maternal facial expressions; or, alternatively, that increases in infant attention preceded decreases in the variability of maternal facial expressions (Fig 4G, H). We also found that increases in object naming and information rate were positively associated with subsequent increases in infant attention to objects (Fig 4 A-D). Object naming predicted attention to objects at 15 months but not at 5 months (Fig 4A, B), while information rate showed this relationship at both ages but reached significance only at 5 months (Fig 4C, D). These findings are comparable to Suarez and colleagues who found that parental behaviours such as talking and touching were not only highly likely to occur when both the parent and infant visually attended to the same object but were also associated with longer periods of infant attention (Suarez-Rivera et al., 2019).
Similarly (and perhaps surprisingly, given that information rate and semantic surprisal were the most strongly associated of the behavioural variables (Fig S5)), we found associations between information rate and infant attention (Fig 4C, D) but no associations between semantic surprisal and infant attention (Fig 4E, F). This may indicate that 5- and 15-month-old infants lack sensitivity to higher-order linguistic structures (Hasson et al., 2015; Heilbron et al., 2022). Instead, infants at 5 and 15 months might base their predictions on a simpler statistical learning process that tracks transitional probabilities between parts of words (Chater & Vitányi, 2002; Schmidhuber, 2009; Wolff, 2014, 2019). An alternative, and complementary, explanation could be that the “semantic surprisal” variable was calculated using models trained on adult-directed text. Consequently, it is possible that the aspects deemed surprising by the model were not equally surprising to infants. However, the calculations from the model are based on what has been said within the interaction itself, which makes it unlikely that the model’s interpretation of surprisal diverges significantly from what anyone, including infants, could experience in that context. Instead, these findings might suggest that, even at 15 months, infants are still developing the linguistic and cognitive frameworks necessary to actively infer meaning from language at a very coarse hierarchical scale, and are not yet making predictions based on the semantic meaning of words. However, they orient their attention to speech that updates their past representations, potentially revealing that their predictions are based on a lossless compression process that needs updating when facing unseen sequences. In that regard, early attention follows the principle of least effort, whereby attention is allocated to speech when it offers relevant information for model updating (Gerken et al., 2011).
Overall, Prediction 2, which proposed that higher-level features would be more influential for older infants, was only partially supported. While associations of object naming and facial novelty with infant attention were stronger at 15 months, associations between information rate and infant attention were stronger at 5 months. Similarly, Prediction 3, which posited that as infants take the lead in play, higher-level maternal behaviours would more often follow infant attention to objects, was also not supported. Even at 15 months, both increased object naming and increased information rate continued to predict subsequent increases in infant attention, rather than following it. This finding is surprising given prior research suggesting that mothers often name the object their infant is already attending to (i.e. object naming/talk follows infant attention more than the other way around) (Schroer & Yu, 2022), and that this practice facilitates infant learning more effectively than naming objects not currently in the infant’s focus (Goupil et al., 2024; Yu & Smith, 2012). However, this effect may not necessarily hold if infants already know the words.
Recent research in psychology has challenged traditional theoretical approaches which characterise the physical environment purely in terms of exogenous features, quantified through measures such as salience (Itti & Baldi, 2009), and attention as the product of a push-pull between exogenous and endogenous factors (Luna et al., 2008). Rather, we dynamically recalibrate the salience in our environments through how we interact with them - for example, by picking up an object we are interested in and pulling it closer (Anderson et al., 2022; Franchak & Yu, 2022; Méndez et al., 2021; Schroer & Yu, 2022): we generate experiences through behaviours (Dewey, 1896; Gibson, 1988). In many ways, the findings from this study extend this idea into social interaction, by showing that our behaviours influence not just the low-level properties of how our partners move and talk, but also what information they present to us, and when. Already during infancy, the information that infants perceive is constantly and dynamically changing over multiple hierarchical levels, contingent on their behaviour and on their attention. Even from early development, attention involves interactive processes which unfold across multiple levels.
In summary, our findings describe how, even during early development, maternal behaviours both depend on, and influence, infants’ attention. This suggests that, even from early life, joint attention involves both lower- and higher-level processes. Interactive processes operate across multiple levels, from salience to semantics.
4.1 Limitations
There are a number of limitations to our study. First, the experimental setup, in which a mother and her infant play together while seated across a table in the lab, cannot fully compare to a real-life interaction (Tamis-LeMonda, 2023); it also limits the variability of interactional patterns that the dyad can engage in, and might have put pressure on mothers to engage in the interaction more than they would have otherwise (Abney et al., 2020). However, despite these limitations, we believe the present set-up still preserves important characteristics of early interactions, such as responsivity (Fogel & Garvey, 2007) and multimodality (Español & Shifres, 2015). Second, the measures selected for this study provide only a limited representation of the diverse and complex contexts of real-world interactions. For instance, we were unable to capture narrative event structures, which are considered crucial for interpreting everyday settings (Zacks, 2020). Similarly, higher-order factors, such as play sequences, were beyond the scope of this study but are likely to be closely linked to infant attention, too. Third, the behaviours we quantified interact dynamically, potentially producing effects greater than the sum of their individual contributions (Zhang et al., 2021); however, we analysed them independently. Although we considered creating composite variables or using Principal Component Analysis (PCA) to work with components, doing so would sacrifice the interpretability of the individual variables. Fourth, our sample is homogeneous in terms of ethnicity, culture and socioeconomic status, and consists only of mothers (Table S1). It will be useful for future studies to investigate more heterogeneous groups and communities (Feldman, 2007; Mundy et al., 2007; Taverna et al., 2024), as well as to include fathers (Aureli et al., 2022).
Data sharing statement:
Partial restrictions to the data and/or materials apply. Due to the personally identifiable nature of these data (video and audio recordings of infants and their mothers), the raw data will not be made publicly accessible. Researchers who wish to access the raw data should email the lead author; permission will be granted provided the applicant can guarantee that certain privacy guidelines will be followed.
Supplementary Materials

Demographic data at 5 and 15 months.

Table summarising the numbers of datasets included in the analyses for both samples as well as reason for exclusion.

Statistics for overall average levels

Experimental paradigm
Experimental paradigm. A) Top figure shows the experimental set-up for the joint play condition. Two cameras pointed at the infant (views in photos 1 and 2) and one camera pointed at the mother (view in photo 3). Looking behaviour was coded manually at 50fps for object and partner looks from both the mother and the infant. B) shows the different types of looks (i.e. looks to objects 1-3, looks to partner and ‘others’; note that the latter category – ‘others’ – included inattention and uncodable moments). C) Plot of the average duration of the joint play interactions at 5 months (in orange) and 15 months (in purple). Asterisks indicate significance from the two-sample t-test (* = p<0.05, ** = p<0.01, *** = p<0.001). D) Photos of the toys employed at both time points: panda (A), book (B) and rattle (C).

Descriptive analysis on looks to the object and to the partner.
Descriptive analyses on looking behaviour. Figure showing average number of looks per minute (A, C) and average look duration (in seconds) (B, D) for looks to object (A, B) and looks to partner (C, D) respectively. Asterisks indicate significance (* = p<0.05, ** = p<0.01, *** = p< 0.001).

Average amount of speech per participant.
Descriptive analyses on amount of speech. Figure showing the average number of words. To quantify the amount of speech in each interaction, we transformed the output from OpenAI’s Whisper (Radford et al., 2023) into a binary array, where ones represent words and zeroes indicate the absence of words. We then summed the number of words per interaction and divided it by the length of that interaction. We conducted a two-sided t-test to assess significant differences across age groups (* indicates p<0.05).

Schematic illustrating how to interpret cross-correlation results
Visual guide on how to interpret findings from a cross-correlation analysis. Each plot illustrates the interpretation of positive (A and B) and negative (C and D) cross-correlation values across backward (A and C) and forward (B and D) time-lags. Below each cross-correlation scheme is an illustration of the alternative possible explanations for the cross-correlated variables. For instance, A) shows a positive correlation at backward lags between infant gaze and maternal behaviour; this can be interpreted in two complementary ways: an increase in infant gaze is followed by an increase in maternal behaviour, or a decrease in infant gaze is followed by a decrease in maternal behaviour.

Correlations between behavioural measures.
Matrix of pairwise cross-correlations between behavioural measures. Each subplot shows the correlation values between pairs of variables at different lags (in seconds), where backward lags indicate that the first variable (top) precedes the second (side), and forward lags indicate that the first variable follows the second. Please refer to Fig S4 for a visual guide to interpreting findings from cross-correlation analyses.

Cross correlations between lower-level features and infant attention to faces.
Cross correlations between lower-level features and infant attention to faces. Spectral flux of non-vocalisations and vocalisations, facial movement, and hand movement in relation to infant attention to faces at 5 months (A, C, E, G) and 15 months (B, D, F, H), respectively. Thick orange (5 months) and purple (15 months) lines represent the observed cross-correlation results, with shaded coloured areas showing their SEM. Grey lines represent control (permuted) data, with the shaded grey area indicating its SEM. Red thick lines indicate significance from the CBP test (p<0.025, two-sided). All results were FDR adjusted.

Cross correlations between higher-level features and infant attention to faces.
Cross correlations between higher-level features and infant attention to faces. Maternal object naming, information rate, semantic surprisal, and facial novelty in relation to infant attention to faces at 5 months (A, C, E, G) and 15 months (B, D, F, H), respectively. Thick orange (5 months) and purple (15 months) lines represent the observed cross-correlation results, with shaded coloured areas showing their SEM. Grey lines represent control (permuted) data, with the shaded grey area indicating its SEM. Red thick lines indicate significance from the CBP test (p<0.025, two-sided). All results were FDR adjusted.
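The captions above note that all results were FDR adjusted. The standard step-up procedure for this (Benjamini & Hochberg, cited in the reference list) can be sketched as follows; this is an illustrative version, not necessarily the exact implementation used in the analyses.

```python
import numpy as np

def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.clip(adj, 0, 1)
    out = np.empty(n)
    out[order] = adj          # restore the original ordering
    return out
```

For instance, `fdr_bh([0.01, 0.04, 0.03, 0.005])` returns adjusted values of 0.02, 0.04, 0.04 and 0.02.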
Acknowledgements
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. [853251 - ONACSA]) and from Medical Research Council grant MR/X021998/1. We extend special thanks to María Penaherrra, Mukrime Gok, Stefanie Pow, Desirèe Cardile and Georgina Harris for their patience and dedication in coding our gaze data. We also thank all infants and caregivers who took part in our study.
References
- What are the building blocks of parent-infant coordinated attention in free-flowing interaction? Infancy. https://doi.org/10.1111/infa.12365
- An edge-simplicity bias in the visual input to young infants. Science Advances 10.
- Scene saliencies in egocentric vision and their creation by parents and infants. Cognition 229. https://doi.org/10.1016/j.cognition.2022.105256
- Mother-infant co-regulation during infancy: Developmental changes and influencing factors. Infant Behavior and Development 69. https://doi.org/10.1016/j.infbeh.2022.101768
- Intersensory redundancy guides attentional selectivity and perceptual learning in infancy. Developmental Psychology 36:190–201. https://doi.org/10.1037/0012-1649.36.2.190
- WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 4489–4493. https://doi.org/10.21437/Interspeech.2023-78
- Coordinating Attention to People and Objects in Mother-Infant and Peer-Infant Interaction. Child Development 55.
- Discovering Event Structure in Continuous Narrative Perception and Memory. Neuron 95:709–721. https://doi.org/10.1016/j.neuron.2017.06.041
- PyLZMA
- A systems view of mother-infant face-to-face communication. Developmental Psychology 52:556–571.
- Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society 57:289–300.
- Beyond Infant’s Looking: The Neural Basis for Infant Prediction Errors. Perspectives on Psychological Science 18:664–674. https://doi.org/10.1177/17456916221112918
- The role of joint attention in the development of infants’ play with objects. Developmental Science 7.
- Infant Attention to Facial Expressions and Facial Motion. Journal of Genetic Psychology 148:127–133. https://doi.org/10.1080/00221325.1987.9914543
- Two Minds, One Dialog: Coordinating Speaking and Understanding. Psychology of Learning and Motivation - Advances in Research and Theory 53:301–344. https://doi.org/10.1016/S0079-7421(10)53008-1
- From communication to language - A psychological perspective. Cognition 3:255–287.
- Multimodal Study of Adult-Infant Interaction: A Review of Its Origins and Its Current Status. Paidéia (Ribeirão Preto) 26:377–385. https://doi.org/10.1590/1982-43272665201613
- Information flow across the cortical timescale hierarchy during narrative construction. Proceedings of the National Academy of Sciences 119:e2209307119. https://doi.org/10.1073/pnas.2209307119
- Simplicity: A unifying principle in cognitive science? Trends in Cognitive Sciences 7.
- Three facial expressions mothers direct to their infants. Infant and Child Development 12:211–232. https://doi.org/10.1002/icd.286
- Parental sensitivity and child behavioral problems: A meta-analytic review. Child Development 93:1231–1248. https://doi.org/10.1111/cdev.13764
- The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language 28:543–571.
- Eye contact modulates facial mimicry in 4-month-old infants: An EMG and fNIRS study. Cortex 106:93–103. https://doi.org/10.1016/j.cortex.2018.05.002
- The reflex arc concept in psychology. Psychological Review 3:357.
- Multisensory Integration and Child Neurodevelopment. Brain Sciences 5:32–57. https://doi.org/10.3390/brainsci5010032
- The Artistic Infant Directed Performance: A Mycroanalysis of the Adult’s Movements and Sounds. Integrative Psychological and Behavioral Science 49:371–397.
- Parent-infant synchrony and the construction of shared timing; physiological precursors, developmental outcomes, and risk conditions. Journal of Child Psychology and Psychiatry and Allied Disciplines 48:329–354. https://doi.org/10.1111/j.1469-7610.2006.01701.x
- Alive communication. Infant Behavior and Development 30:251–257. https://doi.org/10.1016/j.infbeh.2007.02.007
- Social process theory of emotion: A dynamic systems approach. Social Development 1:122–142. https://doi.org/10.1111/j.1467-9507.1992.tb00116.x
- Beyond screen time: Using head-mounted eye tracking to study natural behavior. Advances in Child Development and Behavior 62:61–91.
- Development of infants’ attention to faces during the first year. Cognition 110:160–170. https://doi.org/10.1016/j.cognition.2008.11.010
- The ERP response to the amount of information conveyed by words in sentences. Brain and Language 140:1–11. https://doi.org/10.1016/j.bandl.2014.10.006
- Infants avoid “labouring in vain” by attending more to learnable than unlearnable linguistic patterns. Developmental Science 14:972–979. https://doi.org/10.1111/j.1467-7687.2011.01046.x
- Exploratory behavior in the development of perceiving, acting, and the acquiring of knowledge. Annual Review of Psychology 39:1–41.
- Intermediate acoustic-to-semantic representations link behavioral and neural responses to natural sounds. Nature Neuroscience 26:664–672. https://doi.org/10.1038/s41593-023-01285-9
- Leader-follower dynamics during early social interactions matter for infant word learning. Proceedings of the National Academy of Sciences 121:e2321008121. https://doi.org/10.1073/pnas.2321008121
- Sense and Synchrony: Infant Communication and Musical Improvisation. Intermédialités: Histoire et Théorie Des Arts, Des Lettres et Des Techniques 19:45. https://doi.org/10.7202/1012655ar
- Hierarchical dynamic coding coordinates speech comprehension in the brain. https://doi.org/10.1101/2024.04.19.590280
- Relational psychophysiology: Lessons from mother-infant physiology research on dyadically expanded states of consciousness. Psychotherapy Research 19:619–632. https://doi.org/10.1080/10503300802609672
- Hierarchical process memory: Memory as an integral component of information processing. Trends in Cognitive Sciences 19:304–313. https://doi.org/10.1016/j.tics.2015.04.006
- Mirroring and beyond: coupled dynamics as a generalized framework for modelling social interactions. Philosophical Transactions of the Royal Society B: Biological Sciences 371:20150366. https://doi.org/10.1098/rstb.2015.0366
- A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences 119:e2201968119. https://doi.org/10.1073/pnas.2201968119
- Acoustic regularities in infant-directed speech and song across cultures. Nature Human Behaviour 6:1545–1556. https://doi.org/10.1038/s41562-022-01410-x
- Auditory salience using natural soundscapes. The Journal of the Acoustical Society of America 141:2163–2176. https://doi.org/10.1121/1.4979055
- Automatic facial mimicry in response to dynamic emotional stimuli in five-month-old infants. Proceedings of the Royal Society B: Biological Sciences 283:20161948. https://doi.org/10.1098/rspb.2016.1948
- Bayesian surprise attracts human attention. Vision Research 49:1295–1306. https://doi.org/10.1016/j.visres.2008.09.007
- Rhythms of Dialogue in Infancy: Coordinated Timing in Development. Monographs of the Society for Research in Child Development 66.
- Linking patterns of infant eye movements to a neural network model of the ventral stream using representational similarity analysis. Developmental Science 25. https://doi.org/10.1111/desc.13155
- The Goldilocks Effect: Human Infants Allocate Attention to Visual Sequences That Are Neither Too Simple Nor Too Complex. PLoS One 7:e36399. https://doi.org/10.1371/journal.pone.0036399
- The Goldilocks Effect in Infant Auditory Attention. Child Development 85. https://doi.org/10.1111/cdev.12263
- The role of social signals in segmenting observed actions in 18-month-old children. Developmental Science 25:e13198. https://doi.org/10.1111/desc.13198
- Making Sense of the World: Infant Learning From a Predictive Processing Perspective. Perspectives on Psychological Science 15. https://doi.org/10.1177/1745691619895071
- Are acoustics enough? Semantic effects on auditory salience in natural scenes. Frontiers in Psychology 14. https://doi.org/10.3389/fpsyg.2023.1276237
- Auditory salience using natural scenes: An online study. The Journal of the Acoustical Society of America 150:2952–2966. https://doi.org/10.1121/10.0006750
- Temporal patterns in the complexity of child-directed song lyrics reflect their functions
- An Application of Hierarchical Kappa-type Statistics in the Assessment of Majority Agreement among Multiple Observers. Biometrics 33:363–374. https://doi.org/10.2307/2529786
- Interdyad Differences in Early Mother-Infant Face-to-Face Communication: Real-Time Dynamics and Developmental Pathways. Developmental Psychology 49:2257–2271. https://doi.org/10.1037/a0032268.supp
- Deep learning. Nature 521:436–444. https://doi.org/10.1038/nature14539
- The Temporal Modulation Structure of Infant-Directed Speech. https://doi.org/10.17863/CAM.9089
- Infants deploy selective attention to the mouth of a talking face when learning speech. Proceedings of the National Academy of Sciences 109:1431–1436. https://doi.org/10.1073/pnas.1114783109
- Development of eye-movement control. Brain and Cognition 68:293–308. https://doi.org/10.1016/j.bandc.2008.08.019
- Communicative Musicality: Exploring the Basis of Human Companionship. British Journal of Psychotherapy 26:100–105. https://doi.org/10.1111/j.1752-0118.2009.01158_1.x
- Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods 164:177–190. https://doi.org/10.1016/j.jneumeth.2007.03.024
- One-year old infants control bottom-up saliencies to purposely sustain attention
- Controlling the input: How one-year-old infants sustain visual attention. Developmental Science 27:1–13. https://doi.org/10.1111/desc.13445
- Enhancing reproducibility in developmental EEG research: BIDS, cluster-based permutation tests, and effect sizes. Developmental Cognitive Neuroscience 52. https://doi.org/10.1016/j.dcn.2021.101036
- How infant-directed actions enhance infants’ attention, learning, and exploration: Evidence from EEG and computational modeling. Developmental Science 26. https://doi.org/10.1111/desc.13259
- Fundamentals of music processing: Audio, analysis, algorithms, applications. Springer.
- A Basic Tutorial on Novelty and Activation Functions for Music Signal Processing. Transactions of the International Society for Music Information Retrieval. https://doi.org/10.5334/tismir.202
- Individual differences and the development of joint attention in infancy. Child Development 78:938–954. https://doi.org/10.1111/j.1467-8624.2007.01042.x
- A parallel and distributed-processing model of joint attention, social cognition and autism. Autism Research 2:2–21. https://doi.org/10.1002/aur.61
- The psychology of babies: How relationships support development from birth to two.
- The functional architecture of mother-infant communication, and the development of infant social expressiveness in the first two months. Scientific Reports 6. https://doi.org/10.1038/srep39019
- Understanding why infant-directed speech supports learning: A dynamic attention perspective. Developmental Review 66:101047. https://doi.org/10.1016/j.dr.2022.101047
- Video comprehensibility and attention in very young children. Developmental Psychology 46:1283–1293. https://doi.org/10.1037/a0020614
- Who Leads and Who Follows? The Pathways to Joint Attention During Free-Flowing Interactions Change Over Developmental Time. Child Development. https://doi.org/10.1111/cdev.14229
- Facial expression recognition using residual masking network. In: Proceedings - International Conference on Pattern Recognition, pp. 4513–4519. https://doi.org/10.1109/ICPR48806.2021.9411919
- Proactive or reactive? Neural oscillatory insight into the leader-follower dynamics of early infant-caregiver interaction. Proceedings of the National Academy of Sciences 120.
- Infants tailor their attention to maximize learning. Science Advances 6:5053–5076.
- Robust speech recognition via large-scale weak supervision. In: Proceedings of the 40th International Conference on Machine Learning.
- Is infant-directed speech interesting because it is surprising? - Linking properties of IDS to statistical learning and attention at the prosodic level. Cognition 178:193–206. https://doi.org/10.1016/j.cognition.2018.05.015
- Liking as a balance between synchronization, complexity and novelty. Scientific Reports 12:3181. https://doi.org/10.1038/s41598-022-06610-z
- The reciprocal relationship between maternal infant-directed singing and infant behavior. https://pure.pmu.ac.at/en/activities/the-reciprocal-relationship-between-maternal-infant-directed-sing
- The function and evolution of child-directed communication. PLoS Biology 20. https://doi.org/10.1371/journal.pbio.3001630
- Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes. In: Anticipatory Behavior in Adaptive Learning Systems, Lecture Notes in Computer Science, Volume 5499. https://doi.org/10.1007/978-3-642-02565-5_4
- The real-time effects of parent speech on infants’ multimodal attention and dyadic coordination. Infancy 27:1154–1178. https://doi.org/10.1111/infa.12500
- Repetition across successive sentences facilitates young children’s word learning. Developmental Psychology 52:879–886. https://doi.org/10.1037/dev0000125
- Fathers’ repetition of words is coupled with children’s vocabularies. Journal of Experimental Child Psychology 166:437–450. https://doi.org/10.1016/j.jecp.2017.09.012
- Large-scale evidence for logarithmic effects of word predictability on reading time. Proceedings of the National Academy of Sciences 121:e2307876121. https://doi.org/10.1073/pnas
- A mathematical theory of communication. The Bell System Technical Journal 27:623–656. https://doi.org/10.1002/j.1538-7305.1948.tb00917.x
- The Development of Embodied Cognition: Six Lessons from Babies. Artificial Life 11:13–29.
- Infant-Directed Speech Is Modulated by Infant Feedback. Infancy 13:410–420. https://doi.org/10.1080/15250000802188719
- The Goal and Structure of Mother-Infant Play. Journal of the American Academy of Child Psychiatry 13:402–421. https://doi.org/10.1016/S0002-7138(09)61348-0
- Joint engagement in the home environment is frequent, multimodal, timely, and structured. Infancy 27:232–254. https://doi.org/10.1111/infa.12446
- Multimodal parent behaviors within joint attention support sustained attention in infants. Developmental Psychology 55:96–109. https://doi.org/10.1037/dev0000628
- Why the parent’s gaze is so powerful in organizing the infant’s gaze: The relationship between parental referential cues and infant object looking. Infancy 27:780–808. https://doi.org/10.1111/infa.12475
- Effects of Viewed Object Size and Scene Saliency on Sustained Attention in Parent-Infant Object Play. In: 2024 IEEE International Conference on Development and Learning, ICDL 2024. https://doi.org/10.1109/ICDL61372.2024.10644837
- Infant-directed speech becomes less redundant as infants grow: implications for language learning. PsyArXiv. https://doi.org/10.31234/osf.io/bgtzd
- The mountain stream of infant development. Infancy 28:468–491. https://doi.org/10.1111/infa.12538
- How pervasive is joint attention? Mother-child dyads from a Wichi community reveal a different form of “togetherness”. Developmental Science 27.
- The scope of audience design in child-directed speech: Parents’ tailoring of word lengths for adult versus child listeners. Journal of Experimental Psychology: Learning, Memory, and Cognition 46:2163–2178. https://doi.org/10.1037/xlm0000939
- Motion tracking of parents’ infant- versus adult-directed actions reveals general and action-specific modulations. Developmental Science 23:e12869. https://doi.org/10.1111/desc.12869
- Do infants spontaneously imitate their caregivers’ voice during dyadic play?
- Mind in society: Development of higher psychological processes. Harvard University Press.
- Editorial perspective: Leaving the baby in the bathwater in neurodevelopmental research. Journal of Child Psychology and Psychiatry 64:1256–1259. https://doi.org/10.1111/jcpp.13750
- Comparing methods for measuring peak look duration: Are individual differences observed on screen-based tasks also found in more ecologically valid contexts? Infant Behavior and Development 37:315–325. https://doi.org/10.1016/j.infbeh.2014.04.007
- Studying the Developing Brain in Real-World Contexts: Moving From Castles in the Air to Castles on the Ground. Frontiers in Integrative Neuroscience 16. https://doi.org/10.3389/fnint.2022.896919
- Neural synchronization is strongest to the spectral flux of slow music and depends on familiarity and beat salience. eLife 11. https://doi.org/10.7554/eLife.75515
- Prediction During Natural Language Comprehension. Cerebral Cortex 26:2506–2516. https://doi.org/10.1093/cercor/bhv075
- The SP theory of intelligence: Benefits and applications. Information (Switzerland) 5. https://doi.org/10.3390/info5010001
- Information Compression as a Unifying Principle in Human Learning, Perception, and Cognition. Complexity 2019. https://doi.org/10.1155/2019/1879746
- Communication and action predictability: two complementary strategies for successful cooperation. Royal Society Open Science 9. https://doi.org/10.1098/rsos.220577
- Finding Structure in Time: Visualizing and Analyzing Behavioral Time Series. Frontiers in Psychology 11. https://doi.org/10.3389/fpsyg.2020.01457
- Neural event segmentation of continuous experience in human infants. Proceedings of the National Academy of Sciences 119:e2200257119. https://doi.org/10.1073/pnas.2200257119
- Embodied attention and word learning by toddlers. Cognition 125:244–262. https://doi.org/10.1016/j.cognition.2012.06.016
- Event Perception and Memory. Annual Review of Psychology 71:165–191. https://doi.org/10.1146/annurev-psych-010419-051101
- Visualizing and Understanding Convolutional Networks. arXiv. https://doi.org/10.48550/arXiv.1311.2901
- More than words: Word predictability, prosody, gesture and mouth movements in natural language comprehension. Proceedings of the Royal Society B: Biological Sciences 288. https://doi.org/10.1098/rspb.2021.0500
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.109024. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Labendzki et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.