Abstract
Being a fluent language user involves recognizing words as they unfold in time. How does this skill develop over the course of early childhood? And how does facility in word recognition relate to the growth of vocabulary knowledge? We address these questions using data from Peekbank, an open database of experiments measuring children’s eye movements during early word recognition. Combining 24 datasets from almost 2,000 children ages 1–6 years, we show that word recognition becomes faster, more accurate, and less variable across development, consistent with a process of skill learning. Factor analysis reveals covariation of word recognition speed and accuracy with children’s vocabulary size in cross-sectional analysis. Further, across a range of longitudinal models, speed, accuracy, and vocabulary show coupled growth such that children with faster word recognition tend to show faster vocabulary growth. Together, these findings support the view that word recognition is a skill that develops gradually across early childhood and that this skill plays a role in supporting early language learning.
Introduction
Children acquiring a language are learning a body of knowledge - a set of words and the ways they are combined - but they are also learning to deploy this knowledge in the myriad complex, noisy, and fast-moving environments in which language is used. As children enter their second year, language explodes onto the scene; both vocabulary and grammatical abilities grow rapidly and in tandem (1, 2). This growth in knowledge is also accompanied by changes in language processing efficiency: children become quicker and more accurate in recognizing words and matching them with their referents (3–5).
Yet unlike language production, which is manifest via overt behavior, evidence for word recognition is often more subtle. Very young children with incomplete knowledge may not be able to point to the correct referent of a word, but they may still have some representation of word meaning (6). Eye tracking has thus emerged as a key method that allows the measurement of language comprehension with high temporal resolution: both adults and children reliably fixate the referent of a word soon after it is used (3, 7–10). The relative timecourse of fixation then can provide an index of an individual comprehender’s ability or be used to measure the difference between two stimulus conditions.
The version of this method that is used with children goes by many names, including the “intermodal preferential looking” paradigm and the “looking while listening” paradigm (LWL, the name we adopt here) (9, 11, 12). In LWL experiments, children are typically shown two images displayed side by side and asked to find one of them. For example, a ball and a book might be shown, and the child might hear “Look at the ball! Can you find it?”. Accuracy is then computed as the proportion of time their eyes fixate the correct image within a fixed window after the onset of the noun (“ball” in this case). Reaction time is computed only on trials in which the child is fixating the distractor image (the book) at word onset; in these cases, the average time it takes for the child to shift fixation from the distractor to the target image is used as an index of processing speed. Early work using this method showed that both children’s speed and accuracy increase rapidly across the second year (3, 12). Related methods have provided a window into how children process phonological (13), morphological (14), lexical (15), syntactic (16), and semantic (17, 18) information.
Word recognition ability, as measured by LWL, is hypothesized to play a key role in language learning. Measurements of children’s language input at home are consistently associated with their vocabulary size (19, 20). The mechanism posited to drive this association is that each word that a child experiences is an opportunity to learn. But each word must first be recognized during the short window of time when it is present in the child’s memory. Consider a child hearing the utterance “Can you put the ball in the crate?” The faster and more accurately the child can recognize the word “ball”, the better they can use this evidence to help infer the speaker’s intended meaning, allowing possible inferences about the meaning of the less familiar word, “crate” (21). Consistent with this idea, one important study found that children’s word recognition speed mediated the longitudinal relationship between home language input and vocabulary growth (22).
Word recognition speed has also been used as an index of individual differences in early childhood (4, 23–26) and beyond (27–29). Over and above measures of vocabulary size, word recognition speed at 18 months predicts children’s standardized test scores years later (24). Further, faster processing at 18 months is predictive of whether “late talkers” catch up to their peers or could benefit from further intervention (25). Critically, these assessments use words that children at the target age are reported to understand and produce - they are not indices of vocabulary size but rather of how quickly and accurately the child can recognize a familiar spoken word and use it to guide their visual attention to a referent.
Yet given the logistical hurdles involved in sampling from this population, individual experiments measuring processing speed with young children typically recruit relatively small samples in a restricted range of ages. These samples provide neither the breadth of ages nor the number of participants needed to estimate how word recognition changes developmentally and how it connects with other aspects of early language development (see (27, 29) for examples of these analyses in school-aged children). To overcome these limitations, we created Peekbank, an open database of LWL data from young children, stored in a harmonized format (30). This dataset unifies and carefully curates a large amount of eye-tracking data from studies with infants and toddlers, representing cumulatively over 12 million individual samples of children’s eye movements during real-time language processing. The Peekbank dataset allows us to gain an unprecedented view of the development of word recognition across a large sample of children.
We investigate two specific hypotheses here. First, one influential theory posits that language learning is a process of skill learning, in which the child is learning the skill of fluent conversation with other language users (31, 32). In this theory, the major information processing challenge of language learning is that incoming language is ephemeral and must be processed quickly before it is lost (the “now-or-never bottleneck”). On this kind of account, we should expect to see the signatures of expertise and skill learning in word recognition, which is one of the primary skills involved in processing incoming language in real time. Accuracy should change linearly with the logarithm of age, reflecting gradual asymptotic convergence to mature levels of accuracy. In addition, we might observe what is known as the “power law of practice,” the regularity found in many cases of skill learning that the logarithm of reaction time decreases with the logarithm of experience across participants (33–35, cf. 36, 37). Indeed, this pattern is predicted by an influential associative process model of early word learning (38). In our case, we expect that chronological age is a proxy for experience and so the logarithm of reaction time should decrease linearly with the logarithm of age. Finally, trial-to-trial variability in both speed and accuracy should decrease with increasing expertise, as is found in studies of motor expertise (39).
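Written out, the predicted aggregate relationships for reaction time and accuracy take the following form (with $a$, $b$, $c$, and $d$ as placeholder coefficients rather than estimates):
$\log(\mathrm{RT}) = \log a - b\,\log(\mathrm{age})$, equivalently $\mathrm{RT} = a \cdot \mathrm{age}^{-b}$
$\mathrm{accuracy} = c + d\,\log(\mathrm{age})$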
Second, previous findings have provided limited and sometimes conflicting evidence on the concurrent and predictive relations between word recognition and language learning. Initial reports showed strong predictive relationships between both speed and accuracy and later vocabulary growth (23), with replications in infants born preterm (40) and late talkers (25). Subsequent studies have primarily focused on speed of processing and found more mixed results, with reaction time measures found to be only inconsistently predictive of later vocabulary outcomes (4, 26, 41). A larger dataset should allow us to make a more definitive test of the presence of these relationships. Further, by examining the relationship between speed, accuracy, and vocabulary, it should be possible to assess the extent to which processing speed specifically plays a role in vocabulary growth.
Results
We retrieved data from Peekbank, focusing on data from English-speaking children ages 1–6 years and on simple word recognition trials in which children were shown two pictures of concrete objects and heard a label for an object (typically embedded in a simple carrier phrase such as “Look at the … “). While other experimental manipulations and languages are included in the database, we narrowed our sample to English-speaking children because they are well-represented across our age range and excluded manipulations which aimed to capture phenomena other than simple concrete noun reference (e.g., adjective comprehension or novel word learning). These criteria yielded 24 datasets, including 1963 children and 3553 administrations of the LWL procedure (some datasets were longitudinal or involved multiple closely-spaced testing sessions).
Table 1 shows the characteristics of individual datasets (see also S1 Dataset Description in the Supplementary Information). The size of the combined dataset, the unified data processing pipeline, and the fact that individual studies used very similar implementations of the LWL experimental paradigm all allowed us to make a more detailed study of the development of word recognition than has previously been possible. While our analyses are exploratory in nature, they are guided by the two hypotheses outlined above: the presence of 1) signatures of skill learning in word recognition, and 2) linkages between word recognition and vocabulary.

Characteristics of included datasets from Peekbank.
“Admins” denotes separate experimental sessions. “CDIs” refers to whether the dataset contains parent report vocabulary data from the MacArthur-Bates Communicative Development Inventory.
Speed and accuracy of word recognition increase
We began by examining developmental changes in children’s word recognition. Figure 1 depicts the average timecourse at different ages across all datasets (not controlling for any variation in items and procedures across age groups). Intuitively, these timecourses show gradual increases in accuracy (higher overall proportion target looking) and speed (faster looking to the target after hearing a label) as age increases. To characterize age gradients in speed and accuracy across children, we computed both RTs (reaction times) and accuracies (proportion looking at the target image) following standard practices in the literature (9). Reaction times were computed only on trials for which the child was fixating the distractor at the point of disambiguation (label onset), and were defined as the time from label onset to the first fixation on the target image (see S2 Reaction Times, including further details on how reaction times were computed in S2.1 and discussion of issues surrounding distinguishing “correct” vs. “incorrect” trials when computing looking-based reaction times in S2.2).

Timecourse of word recognition at different ages.
The x-axis shows time (in ms) from the onset of the target label (vertical solid line). Colored lines show the average increase in proportion target looking post label onset at each age bin (in months). Age bins are larger for older children due to decreased data density. The dashed horizontal line represents chance looking. Error bands represent standard errors of the mean. Grey backgrounds highlight the short and long time windows used in subsequent analyses.
Because there is no consensus about the length of time windows for the computation of accuracy, we considered both a shorter window (from 200 – 2000 ms after noun onset) and a longer window (from 200 – 4000 ms). For each window, we averaged all fixations within the window to compute a continuous proportion of target looking between 0 (no fixation on the target during the window) and 1 (total fixation on the target during the window) on every trial. In this initial analysis, we treat observations of RT and target looking as direct measures of the constructs speed and accuracy (see S4 Test-Retest Reliability); in subsequent analyses we estimate latent variables representing these constructs.
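To make the accuracy computation concrete, a minimal sketch in R is shown below; it assumes a hypothetical data frame fixations with one row per 25 ms sample and columns trial_id, t (time in ms from label onset), and aoi (coded "target", "distractor", or other values for off-target looking) - these names are illustrative rather than the actual Peekbank field names.
library(dplyr)

# Proportion target looking within a window, computed here over target and
# distractor samples only; the treatment of other/off-screen samples is an
# assumption of this sketch.
window_accuracy <- function(fixations, t_min = 200, t_max = 2000) {
  fixations %>%
    filter(t >= t_min, t <= t_max, aoi %in% c("target", "distractor")) %>%
    group_by(trial_id) %>%
    summarise(accuracy = mean(aoi == "target"), .groups = "drop")
}

acc_short <- window_accuracy(fixations, 200, 2000)  # short window
acc_long  <- window_accuracy(fixations, 200, 4000)  # long window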
Our first question was about the functional form of the relationships between age, speed, and accuracy (see S5 Pairwise Correlations of Main Measures for raw pairwise correlations between variables). We began by fitting linear mixed-effects models predicting speed and accuracy on each trial across the full dataset with random slopes of child age nested within study (modeling item and procedural variation across studies) and random intercepts by participant (see S8 Mixed-effects model specifications for further details on these specifications). We compared models that included both long and short accuracy windows, as well as logarithmic and linear effects of age, and logarithmic and linear transformations of RT (see S3 Checks on Data Distributional Assumptions for further analyses and discussion of these modeling choices). The best fitting model of accuracy predicted long window accuracy as a function of the logarithm of age; the best fitting model of speed predicted log RT as a function of log age as well (see S6 Functional Form Model Comparison and S7 Power Law Fits). Because long window accuracies were more correlated with other variables and showed clearer age gradients, we focus on these in our analyses.
Figure 2 shows these age gradients. Log RT decreased significantly with age, reflecting increasing speed, and long window accuracy increased significantly with log age (see S6 Functional Form Model Comparison for the corresponding model comparisons).


Participant-level target looking and reaction time (log), plotted by age (log).
Longitudinal datapoints are connected by lines. The solid blue line shows a linear fit and associated confidence interval. Thin colored lines show linear fits for those datasets spanning six or more months of age. The dashed line for accuracy shows chance-level looking (.5).
Variability of word recognition decreases
One further hallmark of increasing skill is a decrease in task-relevant variability (39). Both within and across datasets, within-individual variation in speed and accuracy decreased across the developmental range we examined (Figure 3). We fit mixed-effects models predicting the standard deviation of both speed and accuracy for each testing session for each participant, including random slopes of log age nested within dataset and random intercepts for each participant. For both speed and accuracy, within-individual variability decreased with age.
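A minimal sketch of this variability analysis in R, assuming a trial-level data frame trials with illustrative column names (administration_id, subject_id, dataset_name, log_age_s, long_window_accuracy, log_rt):
library(dplyr)
library(lme4)

# Within-administration variability of accuracy and log RT.
variability <- trials %>%
  group_by(administration_id, subject_id, dataset_name, log_age_s) %>%
  summarise(long_window_acc_var = sd(long_window_accuracy, na.rm = TRUE),
            log_rt_var = sd(log_rt, na.rm = TRUE),
            .groups = "drop")

# Mirrors the specifications in S8: random slopes of log age by dataset,
# random intercepts by participant.
acc_var_mod <- lmer(long_window_acc_var ~ log_age_s +
                      (log_age_s | dataset_name) + (1 | subject_id),
                    data = variability)
rt_var_mod <- lmer(log_rt_var ~ log_age_s +
                     (log_age_s | dataset_name) + (1 | subject_id),
                   data = variability)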


Participant-level variability in target looking and reaction time (log RT), plotted by age (log).
Plotting conventions are as in Figure 2.
Speed and accuracy relate to vocabulary size
We were next interested in whether the various aspects of word recognition - including speed, accuracy, and the variability of each of these - were related to other aspects of early language ability. Of the studies in our database, 15 gathered parent reports about children’s early vocabulary using the MacArthur-Bates Communicative Development Inventory (CDI), a popular survey instrument that provides a reliable and valid estimate of children’s early vocabulary (2, 42). Different forms of the CDI can be used to measure both receptive and expressive vocabulary (for children up to 18 months) or expressive vocabulary only (for children 16–30 months).
We fit a series of factor analytic models to explore the dimensionality of the parent report and child LWL data. Our goal in these analyses was to understand the underlying relatedness of the various measures of word recognition and vocabulary, and in particular to assess the evidence for 1) whether the speed, accuracy, and variability measures described above all index the same underlying language processing construct and 2) the nature of the relation between this construct (or set of constructs) and early vocabulary. We begin developing models using all data, treating each observation as independent even if it comes from a longitudinal study; this assumption is equivalent to asserting an invariant factor structure across development (for a test of this assumption, see S10 Factor Analysis on First Administrations). In subsequent models, we relax this assumption and explore longitudinal growth.
Initial exploratory factor analysis using parallel analysis to select the number of factors suggested that three factors explained substantial variance in the data (see S9 Factor Analysis). To better accommodate missing data under the assumption of data missing at random (e.g., missingness due to the age sampling schemes of the various datasets), we used confirmatory factor analysis with full information maximum likelihood to find the best set of loadings. The best fitting model was a three-factor model with factors for speed (RT and RT variability), accuracy (proportion looking to target on each trial and associated variability of this measure), and vocabulary (comprehension and production from the CDI). Fit statistics for this model were generally good (comparative fit index [CFI]: 0.97; RMSEA: 0.06; see S11 Alternative Factor Structures).
Figure 4 shows a regression model fit to this confirmatory factor analysis, with log age predicting each latent variable. This regression model allows interpretation of the covariances between latent factors as partial correlations (controlling for age). The non-age related variance of all three latent factors was significantly related to that of the other factors, with speed and accuracy showing strong negative covariance (β = −0.82, SE = 0.03, p < .0001) and weaker but significant covariation between RT and vocabulary (β = −0.29, SE = 0.04, p < .0001) and accuracy and vocabulary (β = 0.39, SE = 0.03, p < .0001). This model supports the idea that variation in speed and accuracy of word recognition is related to individual differences in parent-reported vocabulary beyond the effects of age. Further, the broader set of analyses support a factor structure in which speed and accuracy (and their associated variabilities) are related but distinct aspects of word recognition, rather than being measures of one single construct. These analyses treat all data as between person, however, rather than modeling change in these factors within individuals.

Structural equation model showing the three-factor factor analysis with a regression of each latent variable on the logarithm of age.
Observed variables are notated as squares and latent variables are notated as circles. Factor loadings and regression coefficients are shown with straight, solid lines; covariances are shown with dashed lines; residual variances are shown as solid circular connections. Stars show conventional levels of statistical significance, e.g. * indicates p < .05, ** indicates p < .01, and *** indicates p < .001. Covariances reflect age-residualized correlations between variables.
Speed of processing relates to vocabulary growth
To investigate within-person relationships between LWL and vocabulary, we began by fitting longitudinal growth models to the portion of the data containing multiple LWL sessions for individual children. We first reproduced the analysis reported in (25), in which between-person differences in longitudinal growth in productive vocabulary were predicted based on between-person differences in speed during the initial session of the study. We fit a mixed-effects model predicting growth in vocabulary as a quadratic function of age, RT at study initiation (t0), and their interaction (as well as random effects of age nested within participant and also age nested within dataset). This model revealed a significant effect of t0 RT, such that children with faster initial reaction times showed greater subsequent vocabulary growth.

We confirmed this analysis using a non-linear growth model with a logistic shape, which provides a better fit to vocabulary size within a fixed-length form than the quadratic model (see S12 Non-Linear Growth Model) (2). Figure 5 shows predictions from this model, confirming the differentiation of growth curves for children with higher and lower initial reaction time.

Growth curves from a logistic growth model showing predicted vocabulary growth for children with initial reaction times one SD faster than the mean (blue), at the mean (red), and one SD slower than the mean (green).
Individual longitudinal trajectories are shown in light gray. Solid lines show global model estimates and colored regions indicate 95% credible intervals.
On the other hand, it is possible that differences in predicted growth trajectories are due to coupling between vocabulary size and language processing across the entire developmental period, rather than a predictive relationship specifically between t0 RT and vocabulary growth. To test this relationship, we used longitudinal structural equation models. We separated the longitudinal speed, accuracy, and vocabulary data into two-month bins spanning up to 10 months from the initial measurement (i.e., t0, ..., t4) and fit individual growth across each of these variables. We used full-information maximum likelihood to handle the substantial missing data caused by the different longitudinal sampling schemes of studies in our dataset (see S13 SEM Longitudinal Missingness). The fitted longitudinal model is shown in Figure 6. Overall fit statistics were generally acceptable (comparative fit index [CFI]: 0.89; RMSEA: 0.03; RMSEA p-value: 1).

Structural equation model showing longitudinal couplings between growth parameters.
Our key question of interest concerned coupling among the (latent) intercepts and slopes of these growth models. Consistent with our earlier analysis showing that faster processing is related to vocabulary growth, we saw significant coupling between processing speed intercepts and vocabulary growth slopes (β = −0.14, SE = 0.05, p = 0.01) as well as a variety of other couplings. On the other hand, there was not significant coupling between growth in speed and growth in vocabulary (β = −0.01, SE = 0.01, p = 0.65). This null effect could be interpreted as being consistent with these abilities growing independently, but there are other possibilities. First, the longitudinal data we had might not have allowed sufficiently precise estimates of growth slopes, or second, since vocabulary growth is non-linear, the linear model we used here might not have captured coupling among nonlinear aspects of developmental change.
In sum, these findings provide evidence consistent with the claim that differences in processing speed are related to differences in the rate of age-related change in vocabulary (22, 23). Children with greater skill in word recognition learn words faster.
Discussion
How does word recognition change across early childhood and how does it relate to language learning? We investigated these questions using a new, large-scale dataset of developmental eye-tracking measurements compiled across many prior studies. The age gradients for speed and accuracy indicated that both improve asymptotically. Gradients for recognition speed were consistent with the log-log relationship associated with the “power law of practice,” that is, with a gradual convergence to mature levels of processing efficiency. Further, the age gradient suggested that trial-to-trial variability decreases with age, consistent with both the literature on skill learning (39) and other work on developmental changes in variability (43–45). Speed and accuracy were both related to vocabulary size concurrently and processing speed was also related longitudinally to later vocabulary growth.
Together, our findings are consistent with theories that posit that language learning is a process of skill acquisition, in which children become adept at quickly converting ephemeral signals into meaning (31). This skill develops gradually over the course of early childhood and supports word learning. Further, our results point to consistency between skill development in early childhood and the continued refinement of language processing and language knowledge during middle childhood (27, 29).
By aggregating data from many pre-existing studies, we were able to overcome the limitations of prior investigations, which typically had sample sizes at least an order of magnitude smaller than ours. Our approach was to build on the time-consuming and meticulous data collection from previous infant and toddler eye-tracking studies - representing cumulatively many thousands of hours of in-lab data collection and hand-annotation of the resulting videos of child looking behavior - by harmonizing these data into a single, large-scale database. This approach illustrates how building harmonized databases can be especially powerful when composed of high-effort and high-quality datasets that are smaller in scope, maximizing the impact of previous data collection efforts and allowing us to ask broader questions about developmental change (2). In contrast to individual studies, which typically have at best the statistical power to test one or two specific contrasts, our “big data” approach provided the sample sizes necessary to explore the relationships between different variables. Because early language is so variable, these kinds of samples – with thousands, rather than dozens of children - are likely to be required to gain further insight into the psychometrics of early language learning (2, 46, 47).
Our approach is both observational and exploratory. Thus, we cannot untangle the range of different causal models that explain the variation we observed. First, early word recognition skill could lead to faster word learning, but faster children could also be faster due to their larger vocabulary and stronger lexical representations. These two causal directions could also interact reciprocally, leading to a “rich get richer” process in which children with larger vocabularies process faster, and their faster processing helps them increase their vocabulary size more rapidly. Finally, a third shared factor - perhaps general cognitive ability - could underpin both processes. Our cross-sectional data cannot distinguish these hypotheses even in principle (48), and our longitudinal data are likely too sparse to distinguish such complex causal models. Future work must also explore how the functional forms we observed here between individuals reflect processes of within-person change. Although the Peekbank dataset includes a variety of longitudinal data, most reflect a small number of measurements; denser longitudinal data collection is required to better estimate within-person growth models.
The consistency of the trends we observed across datasets suggests that our qualitative conclusions are robust to some significant cross-laboratory and cross-sociodemographic variation. Nevertheless, these findings are still limited in their generalizability by the convenience samples that were used in most of the studies aggregated in Peekbank. These studies typically (but not always) represent children from well-educated parents living in university-adjacent communities. We would not expect that specific numerical parameters estimated in our aggregate convenience sample would generalize to other samples.
More broadly, our results here suggest the continued importance of the looking-while-listening paradigm as an index of children’s language processing abilities. If language learning is, at least in part, a process of skill learning, then measurement of this skill in larger samples provides a critical window into understanding the remarkable process of language learning.
Materials and methods
Data
We included information from 1963 unique participants across 24 datasets. Dataset information is given in Table 1. Although experiments in Peekbank include a variety of different experimental manipulations, we analyzed only data from standard, simple word recognition trials; these trials were sometimes the main focus of the original studies and sometimes constituted control conditions for experiments with more complex manipulations. Requirements for being considered a standard word recognition trial included that (a) the target word was familiar (also no part-words); (b) the target word was the first point of disambiguation; (c) the target word was embedded in a well-formed, grammatical carrier phrase; (d) there was no informative language presented prenominally (e.g. semantically informative verbs, adjectives); (e) there were no nonsense words presented anywhere during the trial (including the carrier phrase); (f) there was no language-, speaker-, or accent-switching within trial; (g) the distractor image was unrelated to both the target label and the target image; and (h) there was no phonological overlap between the distractor label and target label. We focus here on English purely for practical reasons − the Peekbank dataset at present contains limited data from other languages.
We excluded trials entirely if they were missing data on more than 50% of timepoints, and excluded RTs if they were based on fewer than 50% of timepoints in the short analytic window (200–2000 ms). We also removed RTs shorter than 367 ms, as these were unlikely to be generated in response to the specific linguistic stimulus. We then excluded participants from the analysis if they contributed fewer than four accuracy measurements or fewer than two reaction time measurements. At the participant level, these steps together led to 17% missingness for RTs and 6.3% missingness for long window accuracies.
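A sketch of these exclusion steps in R is shown below; the data frame trials and columns such as prop_missing, prop_window_present, and administration_id are illustrative stand-ins rather than the actual Peekbank field names.
library(dplyr)

trials_clean <- trials %>%
  filter(prop_missing <= 0.5) %>%                           # drop trials missing >50% of timepoints
  mutate(rt = ifelse(rt < 367 | prop_window_present < 0.5,
                     NA_real_, rt))                         # drop fast RTs and RTs with sparse windows

included <- trials_clean %>%
  group_by(administration_id) %>%
  filter(sum(!is.na(long_window_accuracy)) >= 4,            # at least four accuracy trials
         sum(!is.na(rt)) >= 2) %>%                          # at least two reaction times
  ungroup()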
Analytic methods
We used lme4 to fit linear mixed-effects models, brms to fit non-linear growth models, and lavaan to fit structural equation models. Random effects structures for each model are given in text; full model specifications are available in the Supplemental Information (S8, S9, and S12) and in the reproducible code for this paper, available in the linked repository. To aid interpretability, all variables were standardized (z-scored) prior to inclusion in structural equation models.
Data availability
We retrieved all data from Peekbank release 2025.1 using the peekbankr R package. All code and data necessary to reproduce this manuscript are available at https://github.com/peekbank/peekbank-development
Supplemental Information
S1. Dataset Description
Figure S1 gives the age distribution of unique participants for each dataset. Note that for some datasets, there are multiple administrations (i.e., experimental test sessions) for each participant.

Age distribution of unique participants for each dataset, using three-month bins.
Figure S2 shows the distribution of measurement intervals for longitudinal studies within the dataset.

Distribution of retest administrations across datasets with repeated measurements, colored by dataset.
Each count indicates a retest administration (initial administrations are excluded). Administrations listed with a retest interval of 0 indicate retests within a month of the initial administration.
S2. Reaction Times
S2.1. Reaction Time Computation
Eye-tracking data are stored in Peekbank as a time series of fixations to specific areas of interest (in particular, the target and distractor on each trial). Other fixations can fall on areas other than the target and distractor, as well as on off-screen locations. This time series has a uniform sample rate of 25 ms/sample, based on resampling of the data in Peekbank to 40 Hz during preprocessing (Zettersten et al. 2023). Reaction times are computed by filtering trials to only those on which the child is fixating the distractor at the point of disambiguation (t = 0) and then finding those trials on which the first non-missing fixation is to the target (hence excluding trials without a shift and trials on which a shift is to an off-screen location). The reaction time is then the total time from t = 0 to the first timestep during which the child fixates the target. Consistent with standard practice in the literature following Fernald et al. (2008), RTs that are shorter than 367 ms are excluded as they are too short to be considered a response to the stimulus.
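A simplified R sketch of this procedure is given below; it assumes a data frame fixations with columns trial_id, t (ms from the point of disambiguation, sampled every 25 ms), and aoi, and it omits the additional check that the first post-onset shift is not to an off-screen location.
library(dplyr)

rts <- fixations %>%
  group_by(trial_id) %>%
  filter(any(t == 0 & !is.na(aoi) & aoi == "distractor")) %>%   # keep distractor-initial trials only
  summarise(rt = suppressWarnings(
              min(t[t > 0 & !is.na(aoi) & aoi == "target"])),
            .groups = "drop") %>%                               # time of the first target fixation
  filter(is.finite(rt),                                         # drop trials without a shift to the target
         rt >= 367)                                             # drop implausibly fast shifts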
S2.2. Comparison of Reaction Times for Correct and Incorrect Trials: Re-analysis of Creel, 2024
The Peekbank dataset only includes measurements of infants’ looking behavior, with no measure of a final target selection. This contrasts with work in the visual-world paradigm with older children and adults, in which participants make a final explicit choice about which image matches the target label (e.g. Colby & McMurray, 2023). Having this additional response allows a clearer separation of accuracy and reaction times, because researchers can compute reaction times specifically on those trials in which participants responded correctly. This strategy helps avoid a possible mixing of reaction times for incorrect and correct responses, which might be generated by different underlying cognitive processes. A possible concern with the Peekbank datasets — and reaction times in infant looking-while-listening studies more generally — is that it is difficult to separate reaction times for correct vs. incorrect responses in the absence of an independent final choice response.
To address this concern, we investigated data from a recent large-scale word recognition study with toddlers in which eyetracking measures were collected together with a final pointing response (Creel, 2024). This dataset included 914 responses from children (2.5–6.5 years) completing a looking-while listening procedure in which they also were instructed to point to the target image. Using this dataset, we investigated the correlation between reaction times (following the same procedure as in our main analyses, i.e. focusing specifically on distractor-to-target shifts) computed over all trials and reaction times computed only for those trials in which children selected the correct referent. The results are shown in Figure S3. Reaction times (i.e., distractor to target shifts) for correct trials only were highly correlated with reaction times across all trials (r = .85, 95% CI [.82, .87], t(479) = 34.84, p < .001). This result suggests that having the ability to filter out incorrect trials has a minimal impact on reaction time computation, even in young children. While there is some uncertainty about how these results may generalize to infants in our younger age ranges (i.e., below 2.5 years of age), who struggle to provide reliable pointing responses, it seems reasonable to assume that our reaction time results would stay largely the same if it were possible to filter out trials on which infants make an incorrect mapping between the target label and the target image using an eyetracking-independent final choice response.

Correlation between reaction times on all trials and reaction times on trials where the child pointed to the correct target.
Data from Creel (2024).
S3. Checks on Data Distributional Assumptions
Here, we check whether the distributional forms that are assumed for the distributions of RT and accuracy are a reasonable empirical fit to the data, and compare against other commonly used distributional forms.
We confirm that, across the age range, the choice to use a log-normal distribution for RT and a normal distribution for accuracy is justified.
S3.1. Reaction Time
The literature focuses on the use of the Exponential-Gaussian (ex-Gaussian) distribution, as well as the Wald, Weibull, gamma, and log-normal distributions (see, for example, Luce, 1986; Ratcliff, 1993; Van Zandt, 2002). All of these are two- or three-parameter distributions, meaning that there is no necessary relationship between mean and variance.
The problem of fitting RT distributions is complex and a substantial literature exists (e.g., Ratcliff, 1979; Luce, 1986; Van Zandt, 2000; Baayen & Milin, 2010). One of the big challenges in our dataset as well as elsewhere is that distributions are conditional on factors such as participant and task, so it is challenging to draw inferences about the underlying distribution when looking at average data.
That said, we find that overall the data are best fit by either an ex-Gaussian or a log-normal distribution, again consistent with prior literature, giving us confidence in this conclusion. Across the full dataset, the BIC values for ex-Gaussian (42553) and log-normal (42624) are quite close to one another, and better than the Wald (49038) and normal (43459) fits. When binned by age (Figure S4), younger children seem better fit by a log-normal distribution and older children seem better fit by an ex-Gaussian distribution (models with lowest BICs are shown in red since significant differences can be obscured by the large scale). Figure S5 shows the RT data distribution overlaid with the corresponding log-normal distributions.
Overall, we think this result generally vindicates our decision to use log-transformed RTs as our primary dependent measure.
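For reference, the sketch below illustrates this style of comparison using the fitdistrplus package, for a numeric vector rt of reaction times; the ex-Gaussian and Wald fits, which require additional density functions, are omitted here.
library(fitdistrplus)

fits <- list(lognormal = fitdist(rt, "lnorm"),
             normal    = fitdist(rt, "norm"),
             gamma     = fitdist(rt, "gamma"))

# BIC computed from the fitted log-likelihoods; lower values indicate better fit.
sapply(fits, function(f) -2 * f$loglik + log(length(rt)) * length(f$estimate))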
S3.2. Accuracy
Individual trial-level accuracies are not binomial because they are an average probability of fixation over a viewing window. They are bounded at 0 and 1, but in general they tend towards the range .5 – .8 in most studies of this population. Figure S6 shows the data binned by age group with fitted gaussian distributions.
These distributions seem well fit by standard gaussians, but they are in principle bounded and so we asked whether this made a difference, using a Beta distribution (a two-parameter continuous distribution bounded at 0 and 1) for fitting. Surprisingly, across all data, the BIC values for the two distributions were very similar (−4321 for normal versus −4314 for Beta), though the normal distribution was slightly favored. Across age groups, there was heterogeneity with some groups better fit by a gaussian and others better fit by a Beta (Figure S7).
Again, we feel that this result generally vindicates our approach of modeling accuracies via standard linear mixed-effects models: their distributional form is quite close to normal.

Goodness of fit for different distributional models for RT, split by age.

Distribution of RT overlaid with a log normal distribution, split by age.

Goodness of fit for different distributional models of accuracy, split by age.

Distribution of accuracies overlaid with normal distribution, split by age.
S4. Test-Retest Reliability
We examined test-retest reliability for our primary variables of interest by calculating Pearson correlations between pairs of administrations given no more than three months apart. Test-retest correlations were significant but relatively modest: ρ = 0.462 for long window accuracy, ρ = 0.496 for short window accuracy, and ρ = 0.407 for RT. These reliabilities were biased downwards by three factors, however. First, longitudinal assessments sometimes use variable items between testing sessions, leading to item-related variance in measurement. Second, even three months can lead to substantial change in some children’s language abilities, thus correlations are attenuated by true change as well as measurement error. Third, longitudinal data in the dataset come primarily from the youngest children and hence are likely to show overall higher measurement error due to variability in children’s behavior and an overall lower number of trials.
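A sketch of this computation, assuming an administration-level data frame admins with illustrative columns subject_id, age (in months), and long_window_accuracy:
library(dplyr)

retest_pairs <- admins %>%
  inner_join(admins, by = "subject_id", suffix = c("_t1", "_t2")) %>%
  filter(age_t2 > age_t1, age_t2 - age_t1 <= 3)   # retest within three months

cor(retest_pairs$long_window_accuracy_t1,
    retest_pairs$long_window_accuracy_t2,
    use = "complete.obs")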
S5. Pairwise Correlations of Main Measures
Table S1 shows pairwise correlations between the primary variables of interest in the dataset.

Pairwise correlations between primary variables of interest.
S6. Functional Form Model Comparison
Table S2 shows model comparison measures for different models of the functional form of the relationship between accuracy and age, and Table S3 shows the same for reaction time. Age gradients are estimated substantially better with long window accuracies. Note that there are more observations for short window accuracies due to less missing data. We speculate that, on average, more participants looked away from the screen towards the end of trials, leading to a greater number of exclusions of long window trials based on the 50% criterion. Note that the total percentage of trials excluded is still small for both measures: 4.4% for long window accuracy and 1.5% for short window accuracy.

Model comparison metrics for different functional forms of the relationship between accuracy and age.

Model comparison metrics for different functional forms of the relationship between RT and age.
S7. Power Law Fits
In the literature on the “law of practice”, although the log-log relationship we observed is commonly present in the aggregate across individuals, the situation is substantially more complex when relationships are measured within individuals. The best fitting curves for individuals are often exponentials or delayed exponentials (Evans et al., 2018; Heathcote, Brown, & Mewhort, 2000).
With our current dataset, we unfortunately cannot specifically determine whether within-individual patterns of change conform to linear, power law, or exponential developmental patterns, because we have insufficient data about individuals’ improvement across time. Thus, our current results apply to the form of the age gradient as opposed to the form of any individual’s pattern of developmental change.
We believe that, unlike the skills being studied in the prior adult literature (e.g., Anderson, 1982; Heathcote, Brown, & Mewhort, 2000; Logan, 1988), language processing is being learned over the course of a child’s lifetime. Thus, we do not expect to see within-paradigm changes in learning in what is a narrow period of time compared to the duration over which language processing skills are refined.
Nevertheless, here we test for other forms of the aggregate relationship between age and reaction time. In particular, we consider 1) a log-log relationship between RT and age (presented in the main text), 2) using both a log age and a linear age to predict log RT, 3) a quadratic relationship between age and RT, and 4) a cubic relationship between age and RT. As shown in Table S4, the model with a linear age term in addition to a log age term has the best fit, although the linear age term coefficient is only marginally significant (coefficients in Table S5).
These models reveal a small but significant additional linear age term over and above log age, but - because individual participant-level fits are not possible - this term can’t really be used to weigh in on the debate about the precise nature of the learning pattern.

Goodness of fit comparison between different models of the relationship between age and RT.

Fixed effects coefficients for a model predicting log RT from both log age and linear age.
S8. Mixed-effects model specifications
Here we provide specifications for the lmer mixed-effects models used in the main text. These models are used to estimate the relationship between age and the primary variables of interest, controlling for dataset and subject-level variability.
For accuracy, 4 models were run, crossing long and short windows as the dependent variable with age or log age as the predictor.
long_window_accuracy ~ age_s + (age_s | dataset_name) + (1 | subject_id)
long_window_accuracy ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)
short_window_accuracy ~ age_s + (age_s | dataset_name) + (1 | subject_id)
short_window_accuracy ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)
For reaction time, 4 models were run, crossing rt and log rt as the dependent variable with age or log age as the predictor.
log_rt ~ age_s + (age_s | dataset_name) + (1 | subject_id)
log_rt ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)
rt ~ age_s + (age_s | dataset_name) + (1 | subject_id)
rt ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)
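For example, the accuracy specifications can be fit and compared along the following lines (a sketch; d_trials is a hypothetical trial-level data frame containing the variables above):
library(lme4)

# Models with different fixed effects are fit with maximum likelihood
# (REML = FALSE) so that information criteria are comparable.
m_acc_linear <- lmer(long_window_accuracy ~ age_s +
                       (age_s | dataset_name) + (1 | subject_id),
                     data = d_trials, REML = FALSE)
m_acc_log <- lmer(long_window_accuracy ~ log_age_s +
                    (log_age_s | dataset_name) + (1 | subject_id),
                  data = d_trials, REML = FALSE)

AIC(m_acc_linear, m_acc_log)
BIC(m_acc_linear, m_acc_log)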
To look at the relationship between variance in the accuracy and reaction time measures and children’s age, we ran two models.
long_window_acc_var ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)
log_rt_var ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)
In the growth curve analysis, we fit a mixed-effects model predicting growth in vocabulary as a quadratic function of age, RT at study initiation (t0), and their interaction, using the formula below:
prod ~ poly(age_15, 2) * rt_t0 + (age | subject_id) + (1 | dataset_name)
S9. Factor Analysis
Figure S8 shows the result of a parallel analysis supporting the presence of three factors in the exploratory factor analysis. Table S6 shows the factor loadings for the exploratory three-factor solution using varimax rotation. The first factor is primarily driven by vocabulary measures, the second by reaction time, and the third by accuracy measures.

Factor loadings for the exploratory three factor solution using varimax rotation.
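A sketch of this exploratory step using the psych package, assuming d_fa is a data frame with one row per administration containing the six indicators (production, comprehension, accuracy, accuracy variability, log RT, and log RT variability):
library(psych)

fa.parallel(d_fa, fa = "fa")                        # parallel analysis: suggested number of factors
efa3 <- fa(d_fa, nfactors = 3, rotate = "varimax")  # exploratory three-factor solution
print(efa3$loadings, cutoff = 0.3)                  # loadings as summarized in Table S6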
The confirmatory factor analysis of this three-factor solution was fit using the following specification
vocab =~ prod + comp
accuracy =~ long_window_accuracy + long_window_acc_var
speed =~ log_rt + log_rt_var

Parallel analysis scree plot showing the eigenvalues for each factor, for actual, simulated, and resampled data.
The confirmatory factor analysis of the three-factor solution with a relation to age was fit using the following specification
vocab =~ prod + comp
accuracy =~ acc + acc_sd
speed =~ log_rt + log_rt_sd
vocab ~ log_age
accuracy ~ log_age
speed ~ log_age
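A sketch of how this specification might be fit in lavaan, with model_age holding the specification string above and d_fa the administration-level data; the use of std.lv = TRUE for identification is an assumption of this sketch rather than a detail reported here.
library(lavaan)

fit_age_cfa <- sem(model_age, data = d_fa,
                   missing = "fiml",   # full-information maximum likelihood for missing data
                   std.lv = TRUE)      # identify factors by fixing latent variances to 1
summary(fit_age_cfa, fit.measures = TRUE, standardized = TRUE)
fitMeasures(fit_age_cfa, c("cfi", "rmsea"))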
The SEM with a linear growth curve used the following specification
accuracy_intercept =~ 1*acc_t0 + 1*acc_t1 + 1*acc_t2 + 1*acc_t3 + 1*acc_t4
accuracy_slope =~ 1*acc_t0 + 2*acc_t1 + 3*acc_t2 + 4*acc_t3 + 5*acc_t4
speed_intercept =~ 1*log_rt_t0 + 1*log_rt_t1 + 1*log_rt_t2 + 1*log_rt_t3 + 1*log_rt_t4
speed_slope =~ 1*log_rt_t0 + 2*log_rt_t1 + 3*log_rt_t2 + 4*log_rt_t3 + 5*log_rt_t4
vocab_intercept =~ 1*prod_t0 + 1*prod_t1 + 1*prod_t2 + 1*prod_t3 + 1*prod_t4
vocab_slope =~ 1*prod_t0 + 2*prod_t1 + 3*prod_t2 + 4*prod_t3 + 5*prod_t4
accuracy_intercept ~~ NA*accuracy_intercept
accuracy_slope ~~ NA*accuracy_slope
speed_intercept ~~ NA*speed_intercept
speed_slope ~~ NA*speed_slope
vocab_intercept ~~ NA*vocab_intercept
vocab_slope ~~ NA*vocab_slope
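A sketch of fitting this growth specification, assuming growth_model holds the string above and d_long contains one row per child with the binned t0-t4 measures; the use of lavaan::growth(), which adds the latent mean structure automatically, is an assumption of this sketch.
library(lavaan)

fit_growth <- growth(growth_model, data = d_long,
                     missing = "fiml")    # FIML for the sparse longitudinal cells
standardizedSolution(fit_growth)          # inspect latent intercept/slope couplings
fitMeasures(fit_growth, c("cfi", "rmsea"))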
S10. Factor Analysis on First Administrations
As a robustness check, we tested our best factor analytic models using only cross-sectional data (filtering to the first test session in longitudinal datasets; N=1963 instead of N=3553). A comparison of all four models is shown in Table S7. For the three-factor CFA, the first-administration model shows increased CFI (.992 instead of .972) and decreased RMSEA (.030 instead of .065). The same is true for the age-regressed three-factor CFA, which shows very good statistics on both first administrations and longitudinal data (CFI = .999 and .991, respectively, and RMSEA = .009 and .037, respectively).

Comparison of confirmatory factor analysis models on longitudinal data or first administrations only.
S11. Alternative Factor Structures
In this section, we provide comparisons between the three-factor model we report in the main text and several alternative models, including:
a one-factor model;
a two-factor model with vocabulary separated from speed and accuracy;
a two-factor model with speed separated from accuracy and vocabulary; and
a two-factor model with variability terms separated from speed, accuracy, and vocabulary.
Table S8 shows the result of these comparisons. The three-factor model shows the lowest AIC and BIC, as well as being significantly better fitting than the next-best model.

Model comparison for alternative factor structures.
p-values show differences between adjacent models; no p-values are shown for comparisons between non-nested models.
S12. Non-linear Growth Models
To test for the differentiation of vocabulary growth based on initial reaction time, we used the package brms to fit a (Bayesian) logistic growth model to the production data. This model has two parameters for the logistic curve, a scale and an intercept. Both were allowed to interact with initial reaction time. We also included random effects of logistic intercept and scale by participant and a grouping term across datasets.

Fixed effects estimates from logistic growth model.

Fixed effects estimates from logistic growth model using RT residualized on age as the predictor.
This model showed a significant effect of initial reaction time on the intercept of the logistic growth curve, but not on its scale (see Table S9).
Age and initial reaction time were both mean-centered.

Growth curves from a logistic growth model showing predicted vocabulary growth for children based on their age-residualized initial reaction times.
Predictions are shown for children with initial reaction times one SD faster than the mean for their age (blue), at the mean for their age (red), and one SD slower than the mean for their age (green). Individual longitudinal trajectories are shown in light gray. Solid lines show global model estimates and colored regions indicate 95% credible intervals.
The formula specification was
nlform <- brms::bf(
prod ~ 1 / (1 + exp((xmid - age_c) / exp(logscale))),
xmid ~ 1 + log_rt_0_c + (1 | dataset_name/subject_id),
logscale ~ 1 + log_rt_0_c + (1 | dataset_name/subject_id),
# scale ~ 1 + log_rt_0_c,
nl = TRUE
)
And the priors were
priors <- c(
  prior(normal(0, 5), nlpar = "xmid", coef = "Intercept"),
  prior(normal(1, 1), nlpar = "logscale", coef = "Intercept"),
  prior(normal(0, 1), nlpar = "logscale", coef = "log_rt_0_c"),
  prior(exponential(1), class = "sigma"),
  prior(normal(0, 2), nlpar = "xmid", coef = "log_rt_0_c"),
  # Random effects for xmid
  prior(exponential(1), class = "sd", nlpar = "xmid", group = "dataset_name"),
  prior(exponential(1), class = "sd", nlpar = "xmid", group = "dataset_name:subject_id"),
  # Random effects for scale
  prior(exponential(1), class = "sd", nlpar = "logscale", group = "dataset_name"),
  prior(exponential(1), class = "sd", nlpar = "logscale", group = "dataset_name:subject_id")
)
Age and reaction time are correlated, so to check that the effects of initial reaction time were not due to age effects, we reran the model using residualized reaction time to remove effects of age. As seen in Table S10 and Figure S9, the pattern of effects for residualized reaction time is similar to that for raw reaction time.
Interpretation of growth in both this model and the linear growth model in the main text is complicated by the fact that the CDI form puts a ceiling on the total number of words that can be recorded; both the quadratic growth functions and the logistic functions come together at the form ceiling. Thus, a shift in quadratic growth in the linear model and a shift in intercept in the logistic model both point to the same overall effect, which is faster growth at the point of maximal sensitivity of the CDI. Neither model can estimate whether the overall growth trajectory is different beyond the range of the CDI. Thus, although these models might initially seem to be in conflict, we believe that they actually point to the same phenomenon, which is perhaps better described by the longitudinal SEM model reported in the main text. Children with greater skill in word recognition show an overall positive shift in the growth trajectory of vocabulary development.
S13. SEM Longitudinal Missingness
The SEM model was fit to the entire dataset, including the large mass of cross-sectional data (to anchor the estimates of t0 coefficients) and the sparse longitudinal data for each time point. We have 3%-12% of the total t0 datapoints for any given time point (see Table S11), given the sparsity of longitudinal sampling (only 6/24 of the datasets are longitudinal).
Our data are MAR (missing at random) rather than MCAR (missing completely at random). This is because their missingness is due to which dataset they are part of - if they are from a cross-sectional dataset, they are by definition missing all longitudinal observations. For our analyses to be appropriate given this structure, we have to assume that the general developmental patterns we are studying are replicated across datasets. We believe that they are, and we show this statistically using our mixed-effects and non-linear mixed-effects models, which control for dataset-related variation. We also show dataset-level effects in a number of our visualizations for this same reason. The same degree of random effect specification that we can do in the mixed-effects models is not possible in the SEM model, however, purely for technical reasons. Again, this point highlights the importance of convergence across analyses.

Fraction of data present for each measure at each time point for the longitudinal SEM.
Additional information
Funding
Jacobs Foundation
Virginia A Marchman
References
- Developmental and stylistic variation in the composition of early vocabularyJournal of child language 21:85–123Google Scholar
- Variability and Consistency in Early Language Learning: The Wordbank Project. Cambridge, MA: MIT Press. Google Scholar
- Rapid gains in speed of verbal processing by infants in the 2nd yearPsychological Science 9:228–231Google Scholar
- Does speed of processing or vocabulary size predict later language growth in toddlers?Cognitive Psychology 115:101238Google Scholar
- The comprehension boost in early word learning: Older infants are better learnersChild development perspectives 14:142–149Google Scholar
- At 6–9 months, human infants know the meanings of many common nounsProceedings of the National Academy of Sciences 109:3253–3258Google Scholar
- Integration of visual and linguistic information in spoken language comprehensionScience 268:1632–1634Google Scholar
- Incremental interpretation at verbs: Restricting the domain of subsequent referenceCognition 73:247–264Google Scholar
- Looking while listening: Using eye movements to monitor spoken language comprehension by infants and young children. In: Sekerina IA, Fernandez EM, Clahsen H, editors.
- Real-time lexical comprehension in young children learning american sign languageDevelopmental science 21:e12672Google Scholar
- The intermodal preferential looking paradigm: A window onto emerging language comprehension
- Visual preference as a test of infant word comprehensionApplied Psycholinguistics 11:145–166Google Scholar
- Phonological priming and cohort effects in toddlersCognition 121:196–206Google Scholar
- Children’s expressive and receptive knowledge of the english regular pluralDev Psychol 26:10.1037/dev0001986
- Lexical neighborhoods and the word-form representations of 14-month-oldsPsychological Science 13:480–484Google Scholar
- The kindergarten-path effect: Studying on-line sentence processing in young childrenCognition 73:89–134Google Scholar
- In the infant’s mind’s ear: Evidence for implicit naming in 18-month-oldsPsychological science 21:908–913Google Scholar
- Nature and origins of the lexicon in 6-mo-oldsProceedings of the National Academy of Sciences 114:12916–12921Google Scholar
- Meaningful differences in the everyday experience of young american children
- Linking quality and quantity of parental linguistic input to child language skills: A meta-analysisChild Development 92:484–501Google Scholar
- Using speakers’ referential intentions to model early cross-situational word learningPsychological science 20:578–585Google Scholar
- Talking to children matters: Early language experience strengthens processing and builds vocabularyPsychological Science 24:2143–2152Google Scholar
- Picking up speed in understanding: Speech processing efficiency and vocabulary growth across the 2nd yearDevelopmental psychology 42:98Google Scholar
- Speed of word recognition and vocabulary knowledge in infancy predict cognitive and language outcomes in later childhoodDevelopmental science 11:F9–16Google Scholar
- Individual differences in lexical processing at 18 months predict vocabulary growth in typically developing and late-talking toddlersChild development 83:203–22Google Scholar
- Interrelationships between working memory, processing speed, and language development in the age range 2–4 yearsJournal of Speech, Language, and Hearing Research 59:1146–1158Google Scholar
- Efficiency of spoken word recognition slows across the adult lifespanCognition 240:105588Google Scholar
- The development of lexical processing: Real-time phonological competition and semantic activation in school age childrenQuarterly Journal of Experimental Psychology :17470218241244799
- The slow development of real-time processing: Spoken-word recognition as a crucible for new thinking about language acquisition and language disordersCurrent Directions in Psychological Science 31:305–315Google Scholar
- Peekbank: An open, large-scale repository for developmental eye-tracking data of children’s word recognitionBehavior Research Methods 55:2485–2500Google Scholar
- The now-or-never bottleneck: A fundamental constraint on languageBehavioral and brain sciences 39:e62Google Scholar
- Language acquisition as skill learningCurrent opinion in behavioral sciences 21:205–208Google Scholar
- Learning and stability: A psychophysiological analysis of a case of motor learning with clinical applicationsJournal of Applied Psychology 10:1Google Scholar
- Processing time declines exponentially during childhood and adolescenceDevelopmental psychology 27:259Google Scholar
- Acquisition of cognitive skillPsychological review 89:369Google Scholar
- The power law repealed: The case for an exponential law of practicePsychonomic bulletin & review 7:185–207Google Scholar
- Refining the law of practicePsychological review 125:592Google Scholar
- Word learning emerges from the interaction of online referent selection and slow associative learningPsychological review 119:831Google Scholar
- Optimal feedback control as a theory of motor coordinationNature neuroscience 5:1226–1235Google Scholar
- Early language processing efficiency predicts later receptive vocabulary outcomes in children born pretermChild Neuropsychology 22:649–665Google Scholar
- Lexical-processing efficiency leverages novel word learning in infants and toddlersDevelopmental science 21:e12569Google Scholar
- MacArthur-bates communicative development inventories users guide and technical manual, third edition (Brookes)
- Working memory capacity, variability, and response to intervention at age 6 and its association to inattention and mathematics age 9Cognitive Development 58:101013Google Scholar
- Variability in the precision of children’s spatial working memory. Journal of Intelligence 6:8. Google Scholar
- Introducing the intra-individual variability hypothesis in explaining individual differences in language development
- Everyday language input and production in 1,001 children from six continentsProceedings of the National Academy of Sciences of the United States of America 120:e2300671120Google Scholar
- Quantifying sources of variability in infancy research using the infant-directed speech preferenceAdvances in Methods and Practices in Psychological Science 3:24–52Google Scholar
- On the dimensional indeterminacy of one-wave factor analysis under causal effectsJournal of Causal Inference 11:20220074Google Scholar
- Effect of the relationship between target and masker sex on infants’ recognition of speechThe Journal of the Acoustical Society of America 141:EL164–EL169Google Scholar
- Bilingual toddlers’ comprehension of mixed sentences is asymmetrical across their two languagesDevelopmental Science 22:e12794Google Scholar
- Infants use known verbs to learn novel nouns: Evidence from 15-and 19-month-oldsCognition 131:139–146Google Scholar
- Role of speaker gender in toddler lexical processingInfancy 27:291–300Google Scholar
- Exploring the linguistic, cognitive, and social skills underlying lexical processing efficiency as measured by the looking-while-listening paradigmJournal of Child Language 49:302–325Google Scholar
- Is a pink cow still a cow? Individual differences in toddlers’ vocabulary knowledge and lexical representationsCognitive science 41:1090–1105Google Scholar
- Roses are red, socks are blue: Switching dimensions disrupts young children’s language comprehensionPloS one 11:e0158459Google Scholar
- Familiar object salience affects novel word learningChild development 90:e246-e262Google Scholar
- Understanding the role of non-contrastive variability in word learning and visual attention in infancyDavis: University of California Google Scholar
- Using tablets to collect data from young childrenJournal of Cognition and Development 17:1–17Google Scholar
- The more they hear the more they learn? Using data from bilinguals to test models of early lexical developmentCognition 238:105525Google Scholar
- Becoming word meaning experts: Infants’ processing of familiar words in the context of typical and atypical exemplarsChild Development 95:e352–e372Google Scholar
- Developmental changes in the speed of social attention in early word learning
- Anticipatory coarticulation facilitates word recognition in toddlersCognition 142:345–350Google Scholar
- Caregiver talk and medical risk as predictors of language outcomes in full term and preterm toddlersChild Development 89:1674–1690Google Scholar
- Familiarity plays a small role in noun comprehension at 12–18 monthsInfancy 25:458–477Google Scholar
- Behold the canine!: How does toddlers’ knowledge of typical frames and familiar words interact to influence their sentence processing?unpublishedGoogle Scholar
- Analyzing reaction timesInternational Journal of Psychological Research 3:12–28Google Scholar
- Connecting the tots: Strong looking-pointing correlations in preschoolers’ word learning and implications for continuity in language developmentChild Development 96:87–103Google Scholar
- Toward an instance theory of automatizationPsychological Review 95:492–527Google Scholar
- A computational analysis of uniqueness points in auditory word recognitionPerception & Psychophysics 39:155–158Google Scholar
- Group reaction time distributions and an analysis of distribution statistics. Psychological Bulletin 86:446–461. Google Scholar
- Methods for dealing with reaction time outliers. Psychological Bulletin 114:510–532. Google Scholar
- How to fit a response time distribution. Psychonomic Bulletin & Review 7:424–465. Google Scholar
- Analysis of response time distributionsStevens’ Handbook of Experimental Psychology 4:461–516Google Scholar
Article and author information
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.109636. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Frank et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.