Introduction

Children acquiring a language are learning a body of knowledge - a set of words and the ways they are combined - but they are also learning to deploy this knowledge in the myriad complex, noisy, and fast-moving environments in which language is used. As children enter their second year, language explodes onto the scene; both vocabulary and grammatical abilities grow rapidly and in tandem (1, 2). This growth in knowledge is also accompanied by changes in language processing efficiency: children become quicker and more accurate in recognizing words and matching them with their referents (3–5).

Yet unlike language production, which is manifest via overt behavior, evidence for word recognition is often more subtle. Very young children with incomplete knowledge may not be able to point to the correct referent of a word, but they may still have some representation of word meaning (6). Eye tracking has thus emerged as a key method that allows the measurement of language comprehension with high temporal resolution: both adults and children reliably fixate the referent of a word soon after it is used (3, 7–10). The relative timecourse of fixation can then provide an index of an individual comprehender’s ability or be used to measure the difference between two stimulus conditions.

The version of this method that is used with children goes by many names, including the “intermodal preferential looking” paradigm and the “looking while listening” paradigm (LWL, the name we adopt here) (9, 11, 12). In LWL experiments, children are typically shown two images displayed side by side and asked to find one of them. For example, a ball and a book might be shown, and the child might hear “Look at the ball! Can you find it?”. Accuracy is then computed as the proportion of time their eyes fixate the correct image within a fixed window after the onset of the noun (“ball” in this case). Reaction time is computed only on trials in which the child is fixating the distractor image (the book) at word onset; in these cases, the average time it takes for the child to shift fixation from the distractor to the target image is used as an index of processing speed. Early work using this method showed that both children’s speed and accuracy increase rapidly across the second year (3, 12). Related methods have provided a window into how children process phonological (13), morphological (14), lexical (15), syntactic (16), and semantic (17, 18) information.

Word recognition ability, as measured by LWL, is hypothesized to play a key role in language learning. Measurements of children’s language input at home are consistently associated with their vocabulary size (19, 20). The mechanism posited to drive this association is that each word that a child experiences is an opportunity to learn. But each word must first be recognized during the short window of time when it is present in the child’s memory. Consider a child hearing the utterance “Can you put the ball in the crate?” The faster and more accurately the child can recognize the word “ball”, the better they can use this evidence to help infer the speaker’s intended meaning, allowing possible inferences about the meaning of the less familiar word, “crate” (21). Consistent with this idea, one important study found that children’s word recognition speed mediated the longitudinal relationship between home language input and vocabulary growth (22).

Word recognition speed has also been used as an index of individual differences in early childhood (4, 23–26) and beyond (27–29). Over and above measures of vocabulary size, word recognition speed at 18 months predicts children’s standardized test scores years later (24). Further, faster processing at 18 months is predictive of whether “late talkers” catch up to their peers or could benefit from further intervention (25). Critically, these assessments use words that children at the target age are reported to understand and produce - they are not indices of vocabulary size but rather of how quickly and accurately the child can recognize a familiar spoken word and use it to guide their visual attention to a referent.

Yet given the logistical hurdles involved in sampling from this population, individual experiments measuring processing speed with young children typically recruit relatively small samples in a restricted range of ages. These samples provide neither the breadth of ages nor the number of participants needed to estimate how word recognition changes developmentally and how it connects with other aspects of early language development (see (27, 29) for examples of these analyses in school-aged children). To overcome these limitations, we created Peekbank, an open database of LWL data from young children, stored in a harmonized format (30). This dataset unifies and carefully curates a large amount of eye-tracking data from studies with infants and toddlers, representing cumulatively over 12 million individual samples of children’s eye movements during real-time language processing. The Peekbank dataset allows us to gain an unprecedented view of the development of word recognition across a large sample of children.

We investigate two specific hypotheses here. First, one influential theory posits that language learning is a process of skill learning, in which the child is learning the skill of fluent conversation with other language users (31, 32). In this theory, the major information processing challenge of language learning is that incoming language is ephemeral and must be processed quickly before it is lost (the “now-or-never bottleneck”). On this kind of account, we should expect to see the signatures of expertise and skill learning in word recognition, which is one of the primary skills involved in processing incoming language in real time. Accuracy should change linearly with the logarithm of age, reflecting gradual asymptotic convergence to mature levels of accuracy. In addition, we might observe what is known as the “power law of practice,” the regularity found in many cases of skill learning that the logarithm of reaction time decreases with the logarithm of experience across participants (33–35, cf. 36, 37). Indeed, this pattern is predicted by an influential associative process model of early word learning (38). In our case, we expect that chronological age is a proxy for experience and so the logarithm of reaction time should decrease linearly with the logarithm of age. Finally, trial-to-trial variability in both speed and accuracy should decrease with increasing expertise, as is found in studies of motor expertise (39).

Second, previous findings have provided limited and sometimes conflicting evidence on the concurrent and predictive relations between word recognition and language learning. Initial reports showed strong predictive relationships between both speed and accuracy and later vocabulary growth (23), with replications in infants born preterm (40) and late talkers (25). Subsequent studies have primarily focused on speed of processing and found more mixed results, with reaction time measures found to be only inconsistently predictive of later vocabulary outcomes (4, 26, 41). A larger dataset should allow us to make a more definitive test of the presence of these relationships. Further, by examining the relationship between speed, accuracy, and vocabulary, it should be possible to assess the extent to which processing speed specifically plays a role in vocabulary growth.

Results

We retrieved data from Peekbank, focusing on data from English-speaking children ages 1–6 years and on simple word recognition trials in which children were shown two pictures of concrete objects and heard a label for an object (typically embedded in a simple carrier phrase such as “Look at the … “). While other experimental manipulations and languages are included in the database, we narrowed our sample to English-speaking children because they are well-represented across our age range and excluded manipulations which aimed to capture phenomena other than simple concrete noun reference (e.g., adjective comprehension or novel word learning). These criteria yielded 24 datasets, including 1963 children and 3553 administrations of the LWL procedure (some datasets were longitudinal or involved multiple closely-spaced testing sessions).

Table 1 shows the characteristics of individual datasets (see also S1 Dataset Description in the Supplementary Information). The size of the combined dataset, the unified data processing pipeline, and the fact that individual studies used very similar implementations of the LWL experimental paradigm all allowed us to make a more detailed study of the development of word recognition than has previously been possible. While our analyses are exploratory in nature, they are guided by the two hypotheses outlined above: the presence of 1) signatures of skill learning in word recognition, and 2) linkages between word recognition and vocabulary.

Characteristics of included datasets from Peekbank.

“Admins” denotes separate experimental sessions. “CDIs” refers to whether the dataset contains parent report vocabulary data from the MacArthur-Bates Communicative Development Inventory.

Speed and accuracy of word recognition increase

We began by examining developmental changes in children’s word recognition. Figure 1 depicts the average timecourse at different ages across all datasets (not controlling for any variation in items and procedures across age groups). Intuitively, these timecourses show gradual increases in accuracy (higher overall proportion target looking) and speed (faster looking to the target after hearing a label) as age increases. To characterize age gradients in speed and accuracy across children, we computed both RTs (reaction times) and accuracies (proportion looking at the target image) following standard practices in the literature (9). Reaction times were computed only on trials for which the child was fixating the distractor at the point of disambiguation (label onset), and were defined as the time from label onset to the first fixation on the target image (see S2 Reaction Times, including further details on how reaction times were computed in S2.1 and discussion of issues surrounding distinguishing “correct” vs. “incorrect” trials when computing looking-based reaction times in S2.2).

Timecourse of word recognition at different ages.

The x-axis shows time (in ms) from the onset of the target label (vertical solid line). Colored lines show the average increase in proportion target looking post label onset at each age bin (in months). Age bins are larger for older children due to decreased data density. The dashed horizontal line represents chance looking. Error bands represent standard errors of the mean. Grey backgrounds highlight the short and long time windows used in subsequent analyses.

Because there is no consensus about the length of time windows for the computation of accuracy, we considered both a shorter window (from 200 – 2000 ms after noun onset) and a longer window (from 200 – 4000 ms). For each window, we averaged all fixations within the window to compute a continuous proportion of target looking between 0 (no fixation on the target during the window) and 1 (total fixation on the target during the window) on every trial. In this initial analysis, we treat observations of RT and target looking as direct measures of the constructs speed and accuracy (see S4 Test-Retest Reliability); in subsequent analyses we estimate latent variables representing these constructs.
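To make the accuracy computation concrete, the window-based proportion of target looking can be sketched as follows. This is an illustrative Python sketch rather than the paper’s actual R pipeline, and the choice to exclude samples on neither image from the denominator is an assumption:

```python
import numpy as np

def window_accuracy(t_ms, gaze, start=200, end=2000):
    """Proportion of target looking within [start, end) ms after label onset.

    t_ms: per-sample timestamps relative to label onset (ms).
    gaze: per-sample labels: "target", "distractor", or "other".
    Samples on neither image are excluded from the denominator here;
    the paper's exact treatment of such samples may differ.
    """
    mask = (t_ms >= start) & (t_ms < end)
    target = int(np.sum(gaze[mask] == "target"))
    distractor = int(np.sum(gaze[mask] == "distractor"))
    looks = target + distractor
    return target / looks if looks else float("nan")

# 40 Hz samples (25 ms apart), matching the Peekbank resampling rate
t = np.arange(0, 4000, 25)
gaze = np.where(t < 1000, "distractor", "target")  # shift at 1000 ms
short = window_accuracy(t, gaze, 200, 2000)   # 40/72
long_ = window_accuracy(t, gaze, 200, 4000)   # 120/152
```

On this hypothetical trial the longer window yields higher accuracy simply because more of the post-shift fixation falls inside it.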

Our first question was about the functional form of the relationships between age, speed, and accuracy (see S5 Pairwise Correlations of Main Measures for raw pairwise correlations between variables). We began by fitting linear mixed-effects models predicting speed and accuracy on each trial across the full dataset with random slopes of child age nested within study (modeling item and procedural variation across studies) and random intercepts by participant (see S8 Mixed-effects model specifications for further details on these specifications). We compared models that included both long and short accuracy windows, as well as logarithmic and linear effects of age, and logarithmic and linear transformations of RT (see S3 Checks on Data Distributional Assumptions for further analyses and discussion of these modeling choices). The best fitting model of accuracy predicted long window accuracy as a function of the logarithm of age; the best fitting model of speed predicted log RT as a function of log age as well (see S6 Functional Form Model Comparison and S7 Power Law Fits). Because long window accuracies were more correlated with other variables and showed clearer age gradients, we focus on these in our analyses.

Figure 2 shows these age gradients. Log RT decreased significantly with age, reflecting increasing speed (β = −0.11, 95% CI [−0.14, −0.08], t(12.86) = −8.38, p < .001), and accuracy also increased significantly with age (β = 0.07, 95% CI [0.05, 0.08], t(17.59) = 12.34, p < .001). In sum, we see continuing improvements in word recognition across the full age range in our dataset that appear roughly linear in the logarithm of age. These logarithmic relationships follow theoretical expectations that both speed and accuracy should gradually asymptote to mature levels of performance, as seen in skill learning more generally (33, 35).
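The power-law claim can be illustrated with a small simulation: if RT = a · age^b, then log RT is linear in log age, and a straight-line fit in log-log coordinates recovers the exponent. This Python sketch uses made-up parameter values and simple least squares, not the paper’s mixed-effects estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical power law RT = a * age^b with multiplicative noise;
# a and b here are illustrative values, not estimates from the paper.
a, b = 2000.0, -0.30
age = rng.uniform(12, 72, 1000)                   # age in months
rt = a * age ** b * rng.lognormal(0.0, 0.05, 1000)

# log RT = log(a) + b * log(age): a line whose slope is the exponent.
slope, intercept = np.polyfit(np.log(age), np.log(rt), 1)
```

Because the noise is multiplicative, a least-squares line in log-log space is the natural estimator here; the recovered slope approximates the exponent b.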

Participant-level target looking and reaction time (log), plotted by age (log).

Longitudinal datapoints are connected by lines. The solid blue line shows a linear fit and associated confidence interval. Thin colored lines show linear fits for those datasets spanning six or more months of age. The dashed line for accuracy shows chance-level looking (.5).

Variability of word recognition decreases

One further hallmark of increasing skill is a decrease in task-relevant variability (39). Both within and across datasets, within-individual variation in speed and accuracy decreased across the developmental range we examined (Figure 3). We fit mixed-effects models predicting the standard deviation of both speed and accuracy for each testing session for each participant, including random slopes of log age nested within dataset and random intercepts for each participant. For both speed and accuracy, within-individual variability decreased with age (speed: β = −0.03, 95% CI [−0.05, −0.02], t(12.48) = −5.30, p < .001; accuracy: β = −0.03, 95% CI [−0.04, −0.02], t(8.92) = −8.28, p < .001). Thus, as well as being faster and more accurate, older children were more consistent in their real-time word recognition than younger children.
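As a minimal illustration, the within-individual variability measure is simply the standard deviation of a trial-level measure within each testing session (the values below are made up; the paper’s models then predict these SDs with mixed-effects regressions):

```python
from statistics import stdev

def session_sds(sessions):
    """Per-session SD of a trial-level measure (e.g., accuracy or log RT).

    sessions: dict mapping a session id to its list of trial values.
    Sessions with fewer than two trials get None, since an SD
    requires repeated measurements.
    """
    return {sid: stdev(vals) if len(vals) > 1 else None
            for sid, vals in sessions.items()}

# Hypothetical trial-level accuracies for two sessions
trials = {
    "child_18mo": [0.20, 0.90, 0.40, 0.80],   # more variable looking
    "child_48mo": [0.70, 0.80, 0.75, 0.72],   # more consistent looking
    "single_trial": [0.50],
}
sds = session_sds(trials)
```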

Participant-level variability in target looking and reaction time (log RT), plotted by age (log).

Plotting conventions are as in Figure 2.

Speed and accuracy relate to vocabulary size

We were next interested in whether the various aspects of word recognition - including speed, accuracy, and the variability of each of these - were related to other aspects of early language ability. Of the studies in our database, 15 gathered parent reports about children’s early vocabulary using the MacArthur-Bates Communicative Development Inventory (CDI), a popular survey instrument that provides a reliable and valid estimate of children’s early vocabulary (2, 42). Different forms of the CDI can be used to measure either receptive and expressive vocabulary (for children up to 18 months) or expressive vocabulary only (for children 16–30 months).

We fit a series of factor analytic models to explore the dimensionality of the parent report and child LWL data. Our goal in these analyses was to understand the underlying relatedness of the various measures of word recognition and vocabulary, and in particular to assess the evidence for 1) whether the speed, accuracy, and variability measures described above all index the same underlying language processing construct and 2) the nature of the relation between this construct (or set of constructs) and early vocabulary. We begin developing models using all data, treating each observation as independent even if it comes from a longitudinal study; this assumption is equivalent to asserting an invariant factor structure across development (for a test of this assumption, see S10 Factor Analysis on First Administrations). In subsequent models, we relax this assumption and explore longitudinal growth.

Initial exploratory factor analysis using parallel analysis to select the number of factors suggested that three factors explained substantial variance in the data (see S9 Factor Analysis). To better accommodate missing data under the assumption of data missing at random (e.g., missingness due to the age sampling schemes of the various datasets), we used confirmatory factor analysis with full information maximum likelihood to find the best set of loadings. The best fitting model was a three-factor model with factors for speed (RT and RT variability), accuracy (proportion looking to target on each trial and associated variability of this measure), and vocabulary (comprehension and production from the CDI). Fit statistics for this model were generally good (comparative fit index: 0.97, RMSEA: 0.06; see S11 Alternative Factor Structures).

Figure 4 shows a regression model fit to this confirmatory factor analysis, with log age predicting each latent variable. This regression model allows interpretation of the covariances between latent factors as partial correlations (controlling for age). The non-age-related variance of all three latent factors was significantly related to that of the other factors, with speed and accuracy showing strong negative covariance (β = −0.82, SE = 0.03, p < .0001) and weaker but significant covariation between RT and vocabulary (β = −0.29, SE = 0.04, p < .0001) and accuracy and vocabulary (β = 0.39, SE = 0.03, p < .0001). This model supports the idea that variation in speed and accuracy of word recognition is related to individual differences in parent-reported vocabulary beyond the effects of age. Further, the broader set of analyses support a factor structure in which speed and accuracy (and their associated variabilities) are related but distinct aspects of word recognition, rather than being measures of one single construct. These analyses treat all data as between person, however, rather than modeling change in these factors within individuals.

Structural equation model showing the three-factor factor analysis with a regression of each latent variable on the logarithm of age.

Observed variables are notated as squares and latent variables are notated as circles. Factor loadings and regression coefficients are shown with straight, solid lines; covariances are shown with dashed lines; residual variances are shown as solid circular connections. Stars show conventional levels of statistical significance, e.g. * indicates p < .05, ** indicates p < .01, and *** indicates p < .001. Covariances reflect age-residualized correlations between variables.

Speed of processing relates to vocabulary growth

To investigate within-person relationships between LWL and vocabulary, we began by fitting longitudinal growth models to the portion of the data containing multiple LWL sessions for individual children. We first reproduced the analysis reported in (25), in which between-person differences in longitudinal growth in productive vocabulary were predicted based on between-person differences in speed during the initial session of the study. We fit a mixed-effects model predicting growth in vocabulary as a quadratic function of age, RT at study initiation (t0), and their interaction (as well as random effects of age nested within participant and also age nested within dataset). This model revealed a significant effect of t0 RT (β = −0.13, 95% CI [−0.20, −0.05], t(317.06) = −3.41, p < .001) and an interaction between t0 RT and the quadratic age predictor (β = 1.38, 95% CI [0.70, 2.07], t(1073.46) = 3.98, p < .001). This analysis suggests that children with faster initial RTs show both larger vocabularies and faster vocabulary growth over time.

We confirmed this analysis using a non-linear growth model with a logistic shape, which provides a better fit to vocabulary size within a fixed-length form than the quadratic model (see S12 Non-Linear Growth Model) (2). Figure 5 shows predictions from this model, confirming the differentiation of growth curves for children with higher and lower initial reaction time.
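To make the shape of such a model concrete, here is a purely illustrative logistic growth function in Python. The asymptote of 680 matches the number of words on the CDI Words & Sentences form, but the rate, midpoint, and RT-coupling parameters are invented for illustration; the actual brms specification is given in S12:

```python
from math import exp

def logistic_vocab(age_mo, rt_z=0.0, K=680.0, r_base=0.18,
                   mid=24.0, coupling=-0.03):
    """Illustrative logistic vocabulary growth curve.

    K: asymptote (680 = CDI Words & Sentences word count).
    rt_z: standardized initial RT; with coupling < 0, faster children
    (negative rt_z) get a steeper growth rate. All rate parameters
    are invented for illustration, not fitted estimates.
    """
    r = r_base + coupling * rt_z
    return K / (1 + exp(-r * (age_mo - mid)))

# Children one SD faster vs. one SD slower at the initial session
faster = logistic_vocab(30, rt_z=-1.0)
slower = logistic_vocab(30, rt_z=+1.0)
```

Unlike a quadratic, the logistic curve saturates at the form’s ceiling, which is why it fits vocabulary measured with a fixed-length checklist better.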

Growth curves from a logistic growth model showing predicted vocabulary growth for children with initial reaction times one SD faster than the mean (blue), at the mean (red), and one SD slower than the mean (green).

Individual longitudinal trajectories are shown in light gray. Solid lines show global model estimates and colored regions indicate 95% credible intervals.

On the other hand, it is possible that differences in predicted growth trajectories are due to coupling between vocabulary size and language processing across the entire developmental period, rather than a predictive relationship specifically between t0 RT and vocabulary growth. To test this relationship, we used longitudinal structural equation models. We separated the longitudinal speed, accuracy, and vocabulary data into two-month bins spanning up to 10 months from the initial measurement (i.e., t0, ..., t4) and fit individual growth across each of these variables. We used full-information maximum likelihood to handle the substantial missing data caused by the different longitudinal sampling schemes of studies in our dataset (see S13 SEM Longitudinal Missingness). The fitted longitudinal model is shown in Figure 6. Overall fit statistics were generally acceptable (comparative fit index: 0.89, RMSEA: 0.03, RMSEA p-value: 1).

Structural equation model showing longitudinal couplings between growth parameters.

Our key question of interest concerned coupling among the (latent) intercepts and slopes of these growth models. Consistent with our earlier analysis showing that faster processing is related to vocabulary growth, we saw significant coupling between processing speed intercepts and vocabulary growth slopes (β = −0.14, SE = 0.05, p = 0.01) as well as a variety of other couplings. On the other hand, there was not significant coupling between growth in speed and growth in vocabulary (β = −0.01, SE = 0.01, p = 0.65). This null effect could be interpreted as being consistent with these abilities growing independently, but there are other possibilities. First, the longitudinal data we had might not have allowed sufficiently precise estimates of growth slopes, or second, since vocabulary growth is non-linear, the linear model we used here might not have captured coupling among nonlinear aspects of developmental change.

In sum, these findings provide evidence consistent with the claim that differences in processing speed are related to differences in the rate of age-related change in vocabulary (22, 23). Children with greater skill in word recognition learn words faster.

Discussion

How does word recognition change across early childhood and how does it relate to language learning? We investigated these questions using a new, large-scale dataset of developmental eye-tracking measurements compiled across many prior studies. The age gradients for speed and accuracy indicated that both improve asymptotically. Gradients for recognition speed were consistent with the log-log relationship associated with the “power law of practice,” that is, with a gradual convergence to mature levels of processing efficiency. Further, the age gradient suggested that trial-to-trial variability decreases with age, consistent with both the literature on skill learning (39) and other work on developmental changes in variability (43–45). Speed and accuracy were both related to vocabulary size concurrently and processing speed was also related longitudinally to later vocabulary growth.

Together, our findings are consistent with theories that posit that language learning is a process of skill acquisition, in which children become adept at quickly converting ephemeral signals into meaning (31). This skill develops gradually over the course of early childhood and supports word learning. Further, our results point to consistency between skill development in early childhood and the continued refinement of language processing and language knowledge during middle childhood (27, 29).

By aggregating data from many pre-existing studies, we were able to overcome the limitations of prior investigations, which typically had sample sizes at least an order of magnitude smaller than ours. Our approach was to build on the time-consuming and meticulous data collection from previous infant and toddler eye-tracking studies - representing cumulatively many thousands of hours of in-lab data collection and hand-annotation of the resulting videos of child looking behavior - by harmonizing these data into a single, large-scale database. This approach illustrates how building harmonized databases can be especially powerful when composed of high-effort and high-quality datasets that are smaller in scope, maximizing the impact of previous data collection efforts and allowing us to ask broader questions about developmental change (2). In contrast to individual studies, which typically have at best the statistical power to test one or two specific contrasts, our “big data” approach provided the sample sizes necessary to explore the relationships between different variables. Because early language is so variable, these kinds of samples - with thousands, rather than dozens, of children - are likely to be required to gain further insight into the psychometrics of early language learning (2, 46, 47).

Our approach is both observational and exploratory. Thus, we cannot untangle the range of different causal models that explain the variation we observed. First, early word recognition skill could lead to faster word learning, but faster children could also be faster due to their larger vocabulary and stronger lexical representations. These two causal directions could also interact reciprocally, leading to a “rich get richer” process in which children with larger vocabularies process faster, and their faster processing helps them increase their vocabulary size more rapidly. Finally, a third shared factor - perhaps general cognitive ability - could underpin both processes. Our cross-sectional data cannot distinguish these hypotheses even in principle (48), and our longitudinal data are likely too sparse to distinguish such complex causal models. Future work must also explore how the functional forms we observed here between individuals reflect processes of within-person change. Although the Peekbank dataset includes a variety of longitudinal data, most reflect a small number of measurements; denser longitudinal data collection is required to better estimate within-person growth models.

The consistency of the trends we observed across datasets suggests that our qualitative conclusions are robust to some significant cross-laboratory and cross-sociodemographic variation. Nevertheless, these findings are still limited in their generalizability by the convenience samples that were used in most of the studies aggregated in Peekbank. These studies typically (but not always) represent children from well-educated parents living in university-adjacent communities. We would not expect that specific numerical parameters estimated in our aggregate convenience sample would generalize to other samples.

More broadly, our results here suggest the continued importance of the looking-while-listening paradigm as an index of children’s language processing abilities. If language learning is, at least in part, a process of skill learning, then measurement of this skill in larger samples provides a critical window into understanding the remarkable process of language learning.

Materials and methods

Data

We included information from 1963 unique participants across 24 datasets. Dataset information is given in Table 1. Although experiments in Peekbank include a variety of different experimental manipulations, we analyzed only data from standard, simple word recognition trials; these trials were sometimes the main focus of the original studies and sometimes constituted control conditions for experiments with more complex manipulations. Requirements for being considered a standard word recognition trial included that (a) the target word was familiar (also no part-words); (b) the target word was the first point of disambiguation; (c) the target word was embedded in a well-formed, grammatical carrier phrase; (d) there was no informative language presented prenominally (e.g. semantically informative verbs, adjectives); (e) there were no nonsense words presented anywhere during the trial (including the carrier phrase); (f) there was no language-, speaker-, or accent-switching within trial; (g) the distractor image was unrelated to both the target label and the target image; and (h) there was no phonological overlap between the distractor label and target label. We focus here on English purely for practical reasons - the Peekbank dataset at present contains limited data from other languages.

We excluded trials entirely if they were missing data on more than 50% of timepoints, and excluded RTs if they were based on fewer than 50% of timepoints in the short analytic window (200 – 2000 ms). We also removed RTs shorter than 367 ms, as these were unlikely to be generated based on the specific linguistic stimulus. We then excluded participants from the analysis if they contributed fewer than four accuracy measurements or fewer than two reaction time measurements. At the participant level, these steps together led to 17% missingness for RTs and 6.30% missingness for long window accuracies.
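The participant-level exclusion cascade can be sketched as follows (a hypothetical Python illustration; the trial-level missingness checks that produce the None entries are assumed to run upstream, and the real pipeline operates on the full Peekbank tables in R):

```python
def apply_exclusions(participants, min_rt_ms=367,
                     min_acc_trials=4, min_rt_trials=2):
    """Keep participants meeting the minimum-trial criteria.

    participants: dict mapping participant id to a dict with
    "accuracies" (per-trial accuracy, or None when a trial had >50%
    missing timepoints) and "rts" (per-trial RTs in ms, or None when
    an RT was based on <50% of the 200-2000 ms window).
    """
    kept = {}
    for pid, data in participants.items():
        accs = [a for a in data["accuracies"] if a is not None]
        rts = [r for r in data["rts"]
               if r is not None and r >= min_rt_ms]  # drop implausibly fast RTs
        if len(accs) >= min_acc_trials and len(rts) >= min_rt_trials:
            kept[pid] = {"accuracies": accs, "rts": rts}
    return kept

sample = {
    "p1": {"accuracies": [0.6, 0.7, 0.8, 0.5], "rts": [500, 300, 900]},
    "p2": {"accuracies": [0.6, None, 0.7], "rts": [450, 600]},
}
kept = apply_exclusions(sample)  # p1 kept; p2 has too few accuracies
```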

Analytic methods

We used lme4 to fit linear mixed-effects models, brms to fit non-linear growth models, and lavaan to fit structural equation models. Random effects structures for each model are given in text; full model specifications are available in the Supplemental Information (S8, S9, and S12) and in the reproducible code for this paper, available in the linked repository. To aid interpretability, all variables were standardized (z-scored) prior to inclusion in structural equation models.

Data availability

We retrieved all data from Peekbank release 2025.1 using the peekbankr R package. All code and data necessary to reproduce this manuscript are available at https://github.com/peekbank/peekbank-development

Supplemental Information

S1. Dataset Description

Figure S1 gives the age distribution of unique participants for each separate dataset at different ages. Note that for some datasets, there are multiple administrations (i.e., experimental test sessions) for each participant.

Age distribution of unique participants for each dataset, using three-month bins.

Figure S2 shows the distribution of measurement intervals for longitudinal studies within the dataset.

Distribution of retest administrations across datasets with repeated measurements, colored by dataset.

Each count indicates a retest administration (initial administrations are excluded). Administrations listed with a retest interval of 0 indicate retests within a month of the initial administration.

S2. Reaction Times

S2.1. Reaction Time Computation

Eye-tracking data are stored in Peekbank as a time series of fixations to specific areas of interest (in particular, the target and distractor on each trial). Other fixations can be to areas not in the target or distractor as well as to off-screen areas. This time series has a uniform sample rate of 25 ms/sample, based on resampling of the data in Peekbank to 40 Hz during preprocessing (Zettersten et al. 2023). Reaction times are computed by filtering trials to only those on which the child is fixating the distractor at the point of disambiguation (t = 0) and then finding those trials on which the first non-missing fixation is to the target (hence excluding trials without a shift and trials on which a shift is to an off-screen location). The reaction time is then the total time from t = 0 to the first timestep during which the child fixates the target. Consistent with standard practice in the literature following Fernald et al. (2008), RTs that are shorter than 367 ms are excluded as they are too short to be considered a response to the stimulus.
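The procedure above can be sketched in Python as follows (a simplified illustration of S2.1; the per-sample gaze labels and the treatment of missing samples during a shift are assumptions):

```python
import numpy as np

def reaction_time(t_ms, gaze, min_rt=367):
    """Distractor-to-target RT for one trial, or None if it doesn't qualify.

    t_ms: per-sample timestamps (ms), with t = 0 at label onset.
    gaze: per-sample labels: "target", "distractor", "other", "missing".
    """
    t_ms = np.asarray(t_ms)
    gaze = np.asarray(gaze)
    onset = int(np.argmax(t_ms >= 0))
    if gaze[onset] != "distractor":      # only distractor-initial trials qualify
        return None
    for ti, g in zip(t_ms[onset:], gaze[onset:]):
        if g == "distractor" or g == "missing":
            continue                      # still on the distractor, or no data
        if g != "target":
            return None                   # first shift went off-screen/elsewhere
        return float(ti) if ti >= min_rt else None  # RTs < 367 ms excluded
    return None                           # no shift during the trial

t = np.arange(0, 2000, 25)               # 40 Hz samples
gaze = np.where(t < 600, "distractor", "target")
rt = reaction_time(t, gaze)              # 600.0 ms
```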

S2.2. Comparison of Reaction Times for Correct and Incorrect Trials: Re-analysis of Creel, 2024

The Peekbank dataset only includes measurements of infants’ looking behavior, with no measure of a final target selection. This contrasts with work in the visual-world paradigm with older children and adults, in which participants make a final explicit choice about which image matches the target label (e.g. Colby & McMurray, 2023). Having this additional response allows a clearer separation of accuracy and reaction times, because researchers can compute reaction times specifically on those trials in which participants responded correctly. This strategy helps avoid a possible mixing of reaction times for incorrect and correct responses, which might be generated by different underlying cognitive processes. A possible concern with the Peekbank datasets — and reaction times in infant looking-while-listening studies more generally — is that it is difficult to separate reaction times for correct vs. incorrect responses in the absence of an independent final choice response.

To address this concern, we investigated data from a recent large-scale word recognition study with toddlers in which eyetracking measures were collected together with a final pointing response (Creel, 2024). This dataset included 914 responses from children (2.5–6.5 years) completing a looking-while-listening procedure in which they were also instructed to point to the target image. Using this dataset, we investigated the correlation between reaction times computed over all trials and reaction times computed only over trials on which children selected the correct referent (following the same procedure as in our main analyses, i.e., focusing specifically on distractor-to-target shifts). The results are shown in Figure S3. Reaction times (i.e., distractor-to-target shifts) for correct trials only were highly correlated with reaction times across all trials (r = .85, 95% CI [.82, .87], t(479) = 34.84, p < .001). This result suggests that filtering out incorrect trials has a minimal impact on reaction time computation, even in young children. There is some uncertainty about how these results generalize to infants in our younger age range (i.e., below 2.5 years of age), who struggle to provide reliable pointing responses. Nevertheless, it seems reasonable to assume that our reaction time results would remain largely unchanged if it were possible to use an eyetracking-independent final choice response to exclude trials on which infants map the target label to the wrong image.

Correlation between reaction times on all trials and reaction times on trials where the child pointed to the correct target.

Data from Creel (2024).

S3. Checks on Data Distributional Assumptions

Here, we check whether the distributional forms assumed for RT and accuracy are a reasonable empirical fit to the data, comparing them against other commonly used distributional forms.

We confirm that, across the age range, the choice to use a log-normal distribution for RT and a normal distribution for accuracy is justified.

S3.1. Reaction Time

The literature on RT modeling focuses on the exponential-Gaussian (ex-Gaussian) distribution as well as the Wald, Weibull, gamma, and log-normal distributions (see, for example, Luce, 1986; Ratcliff, 1993; Van Zandt, 2002). All of these are 2- or 3-parameter distributions, meaning that there is no necessary relationship between their mean and variance.

The problem of fitting RT distributions is complex, and a substantial literature exists (e.g., Ratcliff, 1979; Luce, 1986; Van Zandt, 2000; Baayen & Milin, 2010). A major challenge in our dataset, as elsewhere, is that distributions are conditional on factors such as participant and task, making it difficult to draw inferences about the underlying distribution from aggregated data.

That said, we find that overall the data are best fit by either an ex-Gaussian or a log-normal distribution, consistent with the prior literature. Across the full dataset, the BIC values for the ex-Gaussian (42553) and log-normal (42624) fits are quite close to one another, and both are better than the Wald (49038) and normal (43459) fits. When binned by age (Figure S4), younger children seem better fit by a log-normal distribution and older children by an ex-Gaussian (the model with the lowest BIC in each bin is shown in red, since meaningful differences can be obscured by the large scale of the axis). Figure S5 shows the RT distributions overlaid with the corresponding log-normal fits.
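A minimal sketch of this kind of comparison (not the actual analysis code) fits each candidate distribution by maximum likelihood and compares BIC values; here we use scipy's exponnorm (ex-Gaussian), lognorm, invgauss (Wald), and norm on simulated RTs. Fixing loc = 0 for the positive-support distributions is an assumption made here for numerical stability.

```python
# Fit candidate RT distributions by maximum likelihood and compare
# BIC = k*ln(n) - 2*logL. Data are simulated, not Peekbank RTs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rts = rng.lognormal(mean=6.6, sigma=0.4, size=2000)   # fake RTs in ms

def bic(dist, data, **fixed):
    params = dist.fit(data, **fixed)
    k = len(params) - len(fixed)      # fixed parameters are not free
    loglik = dist.logpdf(data, *params).sum()
    return k * np.log(len(data)) - 2 * loglik

fits = {
    'ex-Gaussian': (stats.exponnorm, {}),            # 3 free parameters
    'log-normal':  (stats.lognorm,  {'floc': 0}),    # 2 free parameters
    'Wald':        (stats.invgauss, {'floc': 0}),    # 2 free parameters
    'normal':      (stats.norm,     {}),             # 2 free parameters
}
bics = {name: bic(dist, rts, **fixed) for name, (dist, fixed) in fits.items()}
best = min(bics, key=bics.get)
```

Because these simulated RTs are drawn from a log-normal, the log-normal fit wins here; on the real data the ex-Gaussian and log-normal were nearly tied.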

Overall, we think this result generally supports our decision to use log-transformed RTs as our primary dependent measure.

S3.2. Accuracy

Individual trial-level accuracies are not binomial because they are an average probability of fixation over a viewing window. They are bounded at 0 and 1, but in general they tend towards the range .5–.8 in most studies of this population. Figure S6 shows the data binned by age group with fitted Gaussian distributions.

These distributions seem well fit by standard Gaussians, but because they are in principle bounded, we asked whether this made a difference by also fitting a Beta distribution (a two-parameter continuous distribution bounded at 0 and 1). Across all data, the BIC values for the two distributions were very similar (−4321 for the normal versus −4314 for the Beta), with the normal distribution slightly favored. Across age groups, there was heterogeneity, with some groups better fit by a Gaussian and others by a Beta (Figure S7).
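The bounded-versus-unbounded comparison can be sketched as below (an assumed approach, using simulated accuracies rather than the Peekbank data; here the Beta wins by construction, whereas on the real data the two fits were nearly indistinguishable).

```python
# Compare normal vs. Beta fits to accuracies in (0, 1) via BIC.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
acc = rng.beta(a=8, b=4, size=1500)   # simulated accuracies, mean ~ .67

n = len(acc)
# Normal: 2 free parameters (mean, sd)
mu, sd = stats.norm.fit(acc)
bic_norm = 2 * np.log(n) - 2 * stats.norm.logpdf(acc, mu, sd).sum()
# Beta: fix loc/scale at (0, 1) so only the 2 shape parameters are free
a, b, _, _ = stats.beta.fit(acc, floc=0, fscale=1)
bic_beta = 2 * np.log(n) - 2 * stats.beta.logpdf(acc, a, b).sum()
better = 'Beta' if bic_beta < bic_norm else 'normal'
```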

Again, we feel that this result generally supports our approach of modeling accuracies via standard linear mixed-effects models: their empirical distribution is quite close to normal.

Goodness of fit for different distributional models for RT, split by age.

Distribution of RT overlaid with a log normal distribution, split by age.

Goodness of fit for different distributional models of accuracy, split by age.

Distribution of accuracies overlaid with normal distribution, split by age.

S4. Test-Retest Reliability

We examined test-retest reliability for our primary variables of interest by calculating Pearson correlations between pairs of administrations no more than three months apart. Test-retest correlations were significant but relatively modest: r = .462 for long-window accuracy, r = .496 for short-window accuracy, and r = .407 for RT. These reliabilities are biased downwards by three factors, however. First, longitudinal assessments sometimes use different items across testing sessions, introducing item-related variance into measurement. Second, even three months can bring substantial change in some children's language abilities, so correlations are attenuated by true change as well as by measurement error. Third, longitudinal data in the dataset come primarily from the youngest children and hence are likely to show higher overall measurement error due to variability in children's behavior and a lower overall number of trials.

S5. Pairwise Correlations of Main Measures

Table S1 shows pairwise correlations between the primary variables of interest in the dataset.

Pairwise correlations between primary variables of interest.

S6. Functional Form Model Comparison

Table S2 shows model comparison measures for different models of the functional form of the relationship between accuracy and age, and Table S3 shows the same for reaction time. Age gradients are estimated substantially better with long-window accuracies. Note that there are more observations for short-window accuracies due to less missing data. We speculate that, on average, more participants looked away from the screen towards the end of trials, leading to more exclusions of long-window trials under the 50% criterion. The total percentage of excluded trials is nonetheless small for both measures: 4.4% for long-window accuracy and 1.5% for short-window accuracy.

Model comparison metrics for different functional forms of the relationship between accuracy and age.

Model comparison metrics for different functional forms of the relationship between RT and age.

S7. Power Law Fits

In the literature on the “law of practice”, the log-log relationship we observed is commonly present in the aggregate across individuals, but the situation is substantially more complex when relationships are measured within individuals: the best-fitting curves for individuals are often exponentials or delayed exponentials (Evans et al., 2018; Heathcote, Brown, & Mewhort, 2000).

With our current dataset, we unfortunately cannot specifically determine whether within-individual patterns of change conform to linear, power law, or exponential developmental patterns, because we have insufficient data about individuals’ improvement across time. Thus, our current results apply to the form of the age gradient as opposed to the form of any individual’s pattern of developmental change.

We believe that, unlike the skills being studied in the prior adult literature (e.g., Anderson, 1982; Heathcote, Brown, & Mewhort, 2000; Logan, 1988), language processing is being learned over the course of a child’s lifetime. Thus, we do not expect to see within-paradigm changes in learning in what is a narrow period of time compared to the duration over which language processing skills are refined.

Nevertheless, here we test other forms of the aggregate relationship between age and reaction time. In particular, we consider (1) a log-log relationship between RT and age (presented in the main text), (2) using both log age and linear age to predict log RT, (3) a quadratic relationship between age and RT, and (4) a cubic relationship between age and RT. As shown in Table S4, the model with a linear age term in addition to a log age term has the best fit, although the linear age coefficient is only marginally significant (coefficients in Table S5).

These models reveal a small but significant additional linear age term over and above log age, but because individual participant-level fits are not possible, this term cannot adjudicate the debate about the precise functional form of the learning pattern.
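The functional-form comparison can be sketched with ordinary least squares (a simplification; the reported models are mixed-effects models). Because the simulated data below contain only a log-age effect, the simpler model wins here, whereas on the real data the extra linear term improved fit.

```python
# Compare log RT ~ log age with and without an extra linear age term by BIC.
import numpy as np

rng = np.random.default_rng(3)
age = rng.uniform(12, 60, size=1000)                 # age in months
log_rt = 8.0 - 0.5 * np.log(age) + rng.normal(0, 0.2, size=1000)

def ols_bic(predictors, y):
    """BIC of a Gaussian OLS model with an intercept and the given predictors."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = len(y), X.shape[1] + 1                    # +1 for the error variance
    sigma2 = (resid ** 2).mean()
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return k * np.log(n) - 2 * loglik

bic_log_only = ols_bic([np.log(age)], log_rt)           # log RT ~ log age
bic_log_plus_lin = ols_bic([np.log(age), age], log_rt)  # ... + linear age
```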

Goodness of fit comparison between different models of the relationship between age and RT.

Fixed effects coefficients for a model predicting log RT from both log age and linear age.

S8. Mixed-Effects Model Specifications

Here we provide specifications for the mixed-effects models (fit with the lmer function of the lme4 package) used in the main text. These models estimate the relationship between age and the primary variables of interest, controlling for dataset- and subject-level variability.

For accuracy, 4 models were run, crossing long and short windows as the dependent variable with age or log age as the predictor.

long_window_accuracy ~ age_s + (age_s | dataset_name) + (1 | subject_id)

long_window_accuracy ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)

short_window_accuracy ~ age_s + (age_s | dataset_name) + (1 | subject_id)

short_window_accuracy ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)

For reaction time, 4 models were run, crossing rt and log rt as the dependent variable with age or log age as the predictor.

log_rt ~ age_s + (age_s | dataset_name) + (1 | subject_id)

log_rt ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)

rt ~ age_s + (age_s | dataset_name) + (1 | subject_id)

rt ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)

To look at the relationship between variance in the accuracy and reaction time measures and children’s age, we ran two models.

long_window_acc_var ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)

log_rt_var ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)

In the growth curve analysis, we fit a mixed-effects model predicting growth in vocabulary as a quadratic function of age, RT at study initiation (t0), and their interaction, using the formula below

prod ~ poly(age_15, 2) * rt_t0 + (age | subject_id) + (1 | dataset_name)

S9. Factor Analysis

Figure S8 shows the result of a parallel analysis supporting the presence of three factors in the exploratory factor analysis. Table S6 shows the factor loadings for the exploratory three-factor solution using varimax rotation. The first factor is primarily driven by vocabulary measures, the second by reaction time, and the third by accuracy measures.

Factor loadings for the exploratory three factor solution using varimax rotation.
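The retention rule behind parallel analysis can be sketched numerically: retain factors whose observed eigenvalues exceed those of random data of the same shape. This is a minimal illustration on simulated data with a built-in two-factor structure; the reported analysis was presumably run with standard R factor-analysis tooling.

```python
# Sketch of Horn's parallel analysis on simulated two-factor data.
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 6
# two latent factors, three indicators each (loadings are invented)
factors = rng.normal(size=(n, 2))
loadings = np.array([[.8, 0], [.7, 0], [.6, 0],
                     [0, .8], [0, .7], [0, .6]])
X = factors @ loadings.T + rng.normal(scale=0.5, size=(n, p))

# observed eigenvalues of the correlation matrix, largest first
obs_eig = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
# mean eigenvalues of random data with the same shape
sim_eig = np.mean([np.sort(np.linalg.eigvalsh(
    np.corrcoef(rng.normal(size=(n, p)), rowvar=False)))[::-1]
    for _ in range(100)], axis=0)
# retain factors whose observed eigenvalue exceeds the random benchmark
n_factors = int(np.sum(obs_eig > sim_eig))
```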

The confirmatory factor analysis of this three-factor solution was fit using the following specification

vocab =~ prod + comp

accuracy =~ long_window_accuracy + long_window_acc_var

speed =~ log_rt + log_rt_var

Parallel analysis scree plot showing the eigenvalues for each factor, for actual, simulated, and resampled data.

The confirmatory factor analysis of the three-factor solution with a relation to age was fit using the following specification

vocab =~ prod + comp

accuracy =~ acc + acc_sd

speed =~ log_rt + log_rt_sd

vocab ~ log_age

accuracy ~ log_age

speed ~ log_age

The SEM with a linear growth curve used the following specification

accuracy_intercept =~ 1*acc_t0 + 1*acc_t1 + 1*acc_t2 + 1*acc_t3 + 1*acc_t4

accuracy_slope =~ 1*acc_t0 + 2*acc_t1 + 3*acc_t2 + 4*acc_t3 + 5*acc_t4

speed_intercept =~ 1*log_rt_t0 + 1*log_rt_t1 + 1*log_rt_t2 + 1*log_rt_t3 + 1*log_rt_t4

speed_slope =~ 1*log_rt_t0 + 2*log_rt_t1 + 3*log_rt_t2 + 4*log_rt_t3 + 5*log_rt_t4

vocab_intercept =~ 1*prod_t0 + 1*prod_t1 + 1*prod_t2 + 1*prod_t3 + 1*prod_t4

vocab_slope =~ 1*prod_t0 + 2*prod_t1 + 3*prod_t2 + 4*prod_t3 + 5*prod_t4

accuracy_intercept ~~ NA*accuracy_intercept

accuracy_slope ~~ NA*accuracy_slope

speed_intercept ~~ NA*speed_intercept

speed_slope ~~ NA*speed_slope

vocab_intercept ~~ NA*vocab_intercept

vocab_slope ~~ NA*vocab_slope

S10. Factor Analysis on First Administrations

As a robustness check, we tested our best factor analytic models using only cross-sectional data (filtering to the first test session in longitudinal datasets; N = 1963 instead of N = 3553). A comparison of all four models is shown in Table S7. For the three-factor CFA, the first-administration model shows increased CFI (.992 instead of .972) and decreased RMSEA (.030 instead of .065). The same is true for the age-regressed three-factor CFA, which shows very good fit statistics on both first administrations and the full dataset (CFI = .999 and .991, respectively; RMSEA = .009 and .037, respectively).

Comparison of confirmatory factor analysis models on longitudinal data or first administrations only.

S11. Alternative Factor Structures

In this section, we provide comparisons between the three-factor model we report in the main text and several alternative models, including:

  • a one-factor model;

  • a two-factor model with vocabulary separated from speed and accuracy;

  • a two-factor model with speed separated from accuracy and vocabulary; and

  • a two-factor model with variability terms separated from speed, accuracy, and vocabulary.

Table S8 shows the result of these comparisons. The three-factor model shows the lowest AIC and BIC, as well as being significantly better fitting than the next-best model.

Model comparison for alternative factor structures.

p-values show differences between adjacent models; no p-values are shown for comparisons between non-nested models.

S12. Non-linear Growth Models

To test for the differentiation of vocabulary growth based on initial reaction time, we used the brms package to fit a Bayesian logistic growth model to the production data. This model has two parameters for the logistic curve: a midpoint (xmid) and a scale. Both were allowed to depend on initial reaction time. We also included random effects for both parameters by participant, nested within dataset.

Fixed effects estimates from logistic growth model.

Fixed effects estimates from logistic growth model using RT residualized on age as the predictor.

This model showed a significant effect of initial reaction time on the intercept of the logistic growth curve, but not on its scale (see Table S9).

Age and initial reaction time were both mean-centered.

Growth curves from a logistic growth model showing predicted vocabulary growth for children based on their age-residualized initial reaction times.

Predictions are shown for children with initial reaction times one SD faster than the mean for their age (blue), at the mean for their age (red), and one SD slower than the mean for their age (green). Individual longitudinal trajectories are shown in light gray. Solid lines show global model estimates and colored regions indicate 95% credible intervals.

The formula specification was

nlform <- brms::bf(

prod ~ 1 / (1 + exp((xmid - age_c) / exp(logscale))),

xmid ~ 1 + log_rt_0_c + (1 | dataset_name/subject_id),

logscale ~ 1 + log_rt_0_c + (1 | dataset_name/subject_id),

# scale ~ 1 + log_rt_0_c,

nl = TRUE

)
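The nonlinear mean function in the brms formula above can be sketched numerically: vocabulary proportion is a logistic function of (centered) age, and lowering xmid shifts the curve earlier, raising predicted vocabulary at a given age. The parameter values below are illustrative, not fitted estimates.

```python
# Numerical sketch of the logistic growth curve used in the brms model.
import math

def prod_pred(age_c, xmid, logscale):
    """Predicted proportion of CDI words produced at centered age age_c."""
    return 1 / (1 + math.exp((xmid - age_c) / math.exp(logscale)))

# At age_c = xmid, the curve is at its midpoint (0.5):
midpoint = prod_pred(0.0, xmid=0.0, logscale=0.5)
# Shifting xmid moves the whole trajectory earlier or later in age:
earlier = prod_pred(0.0, xmid=-2.0, logscale=0.5)
later = prod_pred(0.0, xmid=+2.0, logscale=0.5)
```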

And the priors were

priors <- c(

prior(normal(0, 5), nlpar = "xmid", coef = "Intercept"),

prior(normal(1, 1), nlpar = "logscale", coef = "Intercept"),

prior(normal(0, 1), nlpar = "logscale", coef = "log_rt_0_c"),

prior(exponential(1), class = "sigma"),

prior(normal(0, 2), nlpar = "xmid", coef = "log_rt_0_c"),

# Random effects for xmid

prior(exponential(1), class = "sd", nlpar = "xmid", group = "dataset_name"),

prior(exponential(1), class = "sd", nlpar = "xmid", group = "dataset_name:subject_id"),

# Random effects for scale

prior(exponential(1), class = "sd", nlpar = "logscale", group = "dataset_name"),

prior(exponential(1), class = "sd", nlpar = "logscale", group = "dataset_name:subject_id")

)

Age and reaction time are correlated, so to check that the effects of initial reaction time were not due to age effects, we reran the model using residualized reaction time to remove effects of age. As seen in Table S10 and Figure S9, the pattern of effects is similar for residualized reaction time as for reaction time.

Interpretation of growth in both this model and the linear growth model in the main text is complicated by the fact that the CDI form puts a ceiling on the total number of words that can be recorded; both the quadratic growth functions and the logistic functions come together at the form ceiling. Thus, a shift in quadratic growth in the linear model and a shift in intercept in the logistic model both point to the same overall effect, which is faster growth at the point of maximal sensitivity of the CDI. Neither model can estimate whether the overall growth trajectory is different beyond the range of the CDI. Thus, although these models might initially seem to be in conflict, we believe that they actually point to the same phenomenon, which is perhaps better described by the longitudinal SEM model reported in the main text. Children with greater skill in word recognition show an overall positive shift in the growth trajectory of vocabulary development.

S13. SEM Longitudinal Missingness

The SEM model was fit to the entire dataset, including the large mass of cross-sectional data (which anchors the estimates of the t0 coefficients) and the sparse longitudinal data at each later time point. Given the sparsity of longitudinal sampling (only 6 of the 24 datasets are longitudinal), we have only 3–12% of the t0 sample size at any given later time point (see Table S11).

Our data are MAR (missing at random) rather than MCAR (missing completely at random), because missingness is determined by dataset membership: participants in a cross-sectional dataset are by definition missing all longitudinal observations. For our analyses to be appropriate given this structure, we must assume that the general developmental patterns we are studying replicate across datasets. We believe that they do, and we show this statistically using our mixed-effects and non-linear mixed-effects models, which control for dataset-related variation. We also show dataset-level effects in a number of our visualizations for the same reason. However, the degree of random-effect specification possible in the mixed-effects models is not possible in the SEM, purely for technical reasons. Again, this point highlights the importance of convergence across analyses.

Fraction of data present for each measure at each time point for the longitudinal SEM.

Additional information

Funding

Jacobs Foundation

  • Virginia A Marchman