Continuous developmental changes in word recognition support language learning across early childhood

Michael C Frank; Virginia A Marchman; Claire Augusta Bergey; Veronica Boyce; Mika Braginsky; George Kachergis; Jess Mankewitz; Stephan Meylan; Ben Prystawski; Nilam Ram; Robert Z Sparks; Adrian Steffan; Alvin Wei Ming Tan; Martin Zettersten

doi:10.7554/eLife.109636.2

Introduction

Children acquiring a language are learning a body of knowledge – a set of words and the ways they are combined – but they are also learning to deploy this knowledge in the myriad complex, noisy, and fast-moving environments in which language is used. As children enter their second year, language explodes onto the scene; both vocabulary and grammatical abilities grow rapidly and in tandem (1, 2). This growth in knowledge is also accompanied by changes in language processing efficiency: children become quicker and more accurate in recognizing words and matching them with their referents (3–5).

Yet unlike language production, which is manifest via overt behavior, evidence for word recognition – the linking of a word form to its meaning during language comprehension – is often more subtle. Very young children may not be able to point to the correct referent of a word, but they may still have some representation of word meaning (6). Eye tracking has thus emerged as a key method that allows the measurement of language comprehension with high temporal resolution: both adults and children reliably fixate the referent of a word soon after it is used (3, 7–10). This procedure measures the general construct of word recognition by operationalizing knowledge of a meaning as visual attention to a specific named referent. The relative timecourse of fixation then can provide an index of an individual comprehender’s ability or be used to measure the difference between two stimulus conditions.

The version of this method that is used with children goes by many names, including the “intermodal preferential looking” paradigm and the “looking while listening” paradigm (LWL, the name we adopt here) (9, 11, 12). In LWL experiments, children are typically shown two images displayed side by side and asked to find one of them. For example, a ball and a shoe might be shown, and the child might hear “Look at the ball! Can you find it?”. Accuracy is then computed as the proportion of time their eyes fixate the correct image within a fixed window after the onset of the noun (“ball” in this case). Reaction time is (typically) computed only on trials in which the child is fixating the distractor image (the shoe) at word onset; in these cases, the average time it takes for the child to shift fixation from the distractor to the target image is used as an index of processing speed. Early work using this method showed that both children’s speed and accuracy increase rapidly across the second year (3, 12). Related methods have provided a window into how children process phonological (13), morphological (14), lexical (15), syntactic (16), and semantic (17, 18) information.

Familiar word recognition – as measured by LWL – is hypothesized to play a key role in language learning (19). The idea, in a nutshell, is that the faster and more accurately a child can process incoming words, the more opportunities they have for learning. Consider a child hearing the utterance “Can you put the ball in the crate?” The better the child can recognize the word “ball”, the better they can use this evidence to help infer the speaker’s intended meaning, allowing possible inferences about the meaning of the less familiar word, “crate” (20).

Real time language processing, including word recognition, relies heavily on predictive processing, in which comprehenders integrate expectations from prior linguistic context with noisy and ephemeral incoming signals (21, 22). The more input a child receives, the better their predictions are likely to be, and hence the more they can learn (19, 23). Indeed, measurements of children’s language input at home are consistently associated with their vocabulary size (24, 25). And, in line with this predictive processing framework, one important study found that children’s word recognition speed mediated the longitudinal relationship between home language input and vocabulary growth (26). Thus, word recognition is thought to play a key supporting role in ongoing word learning.

Familiar word recognition speed has also been used as an index of individual differences in early childhood (4, 19, 27–29) and beyond (30–32). Over and above measures of vocabulary size, word recognition speed at 18 months predicts children’s language and cognitive abilities as measured by standardized tests administered at age 8 (27). Further, faster processing at 18 months is prospectively related to whether “late talkers” catch up to their peers or could benefit from further intervention (28). Critically, most word recognition paradigms use words that children at the target age are reported to understand and produce. They are thus not indices of vocabulary size but rather measures of how quickly and accurately the child can recognize a familiar spoken word and use it to guide their visual attention to a referent. However, the extent to which specific responses reflect an individual child’s general speed of language processing versus their familiarity with specific words is unknown.

Given the logistical hurdles involved in conducting eye-tracking experiments with young children, individual experiments typically recruit relatively small samples in a restricted range of ages. These samples provide neither the breadth of ages nor the number of participants needed to estimate how word recognition changes developmentally and how it connects with other aspects of early language development (see (30, 32) for examples of these analyses in school-aged children). To overcome these limitations, we created Peekbank, an open database of LWL data from young children, stored in a harmonized format (33). This dataset unifies and carefully curates a large amount of eye-tracking data from studies with infants and toddlers, representing cumulatively over 30 million individual measurements of children’s eye movements across trials and time-points (dataset version: 2026.1). The Peekbank dataset allows us to gain an unprecedented view of the development of word recognition across a large sample of children.

We investigate two specific issues here. First, one influential theory posits that language learning is a process of skill learning, in which the child is learning the skill of fluent conversation with other language users (34, 35). In this theory, the major information processing challenge of language learning is that incoming language is ephemeral and must be processed quickly before it is lost (the “now-or-never bottleneck”). On this kind of account, we should expect to see the signatures of expertise and skill learning in word recognition, which is one of the primary skills involved in processing incoming language in real time. Accuracy should change linearly with the logarithm of age, reflecting gradual asymptotic convergence to mature levels of accuracy. In addition, we might observe what is known as the “power law of practice,” the regularity found in many cases of skill learning that the logarithm of reaction time decreases with the logarithm of experience across participants (36–38, cf. 39, 40). Indeed, this pattern is predicted by an influential associative process model of early word learning (41). In our case, we expect that chronological age is a proxy for experience and so the logarithm of reaction time should decrease linearly with the logarithm of age. Finally, trial-to-trial variability in both speed and accuracy should decrease with increasing expertise, as is found in studies of motor expertise (42).
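
The log-log prediction can be written out explicitly. The following is a sketch under the assumption that reaction time falls as a power function of experience; the symbols a, b, and E (experience, here proxied by age) are notation introduced for illustration and are not taken from the cited models.

```latex
\mathrm{RT} = a \, E^{-b}
\quad\Longrightarrow\quad
\log \mathrm{RT} = \log a - b \log E,
\qquad a > 0,\; b > 0
```

On this form, the exponent b appears as the (negative) slope of a straight line when log RT is plotted against log experience.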

Second, previous findings have provided limited and sometimes conflicting evidence on the concurrent and predictive relations between word recognition and language learning. Initial reports showed strong prospective relationships between both speed and accuracy and later vocabulary growth (19), with replications in infants born preterm (43) and late talkers (28). Subsequent studies have primarily focused on speed of processing and found more mixed results, with reaction time measures found to be only inconsistently related to later vocabulary outcomes (4, 29, 44). A larger dataset should allow us to make a more definitive test of the presence of these relationships. Further, by examining the relationship between speed, accuracy, and vocabulary, it should be possible to assess the extent to which processing speed specifically plays a role in vocabulary growth.

Across both of these issues, the contribution of our work here lies in the detailed quantitative description of development. Nearly every theory of language learning assumes some role for continuous developmental change in word recognition, but these assumptions have not previously been anchored to specific measurements. Hence neither the functional form of the assumed changes nor their concurrent and predictive relationships to vocabulary have been quantified. We leverage the Peekbank dataset to accomplish these goals.

Results

We retrieved data from Peekbank, focusing on data from monolingual English-speaking children ages 6 months – 6 years and on simple word recognition trials in which children were shown two pictures of concrete objects and heard a label for an object (typically embedded in a simple carrier phrase such as “Look at the …”). While other experimental manipulations and languages are included in the database, we narrowed our sample to English-speaking children because they are well-represented across our age range and excluded manipulations which aimed to capture phenomena other than simple concrete noun reference (e.g., adjective comprehension or novel word learning). These criteria yielded 26 datasets, including 2555 children and 4124 administrations of the LWL procedure (some datasets were longitudinal or involved multiple closely-spaced testing sessions).

Table 1 shows the characteristics of individual datasets (see also S1 Dataset Description in the Supplementary Information). The size of the combined dataset, the unified data processing pipeline, and the fact that individual studies used very similar implementations of the LWL experimental paradigm all allowed us to make a more detailed study of the development of word recognition than has previously been possible. While our analyses are exploratory in nature, they are guided by the two hypotheses outlined above: the presence of 1) signatures of skill learning in word recognition, and 2) linkages between word recognition and vocabulary.

Characteristics of included datasets from Peekbank.
“Admins” denotes separate experimental sessions. “CDIs” refers to whether the dataset contains parent report vocabulary data from the MacArthur-Bates Communicative Development Inventory.

Speed and accuracy of word recognition increase

We began by examining developmental changes in children’s word recognition. Figure 1 depicts the average timecourse of target looking at different ages across all datasets (not controlling for any variation in items and procedures across age groups). Intuitively, these timecourses show gradual increases in accuracy (more target looking; computed as the ratio of target to target plus distractor looking) and speed (faster looking to the target after hearing a label) as age increases. To characterize age gradients in speed and accuracy across children, we computed both RTs (reaction times) and accuracies (proportion looking at the target image) following standard practices in the literature (9). Reaction times were computed only on trials for which the child was fixating the distractor at the point of disambiguation (label onset), and were defined as the time from label onset to the first fixation on the target image (see S2 Reaction Times, including further details on how reaction times were computed in S2.1 and discussion of issues surrounding distinguishing “correct” vs. “incorrect” trials when computing looking-based reaction times in S2.2).

Timecourse of word recognition at different ages.
The x-axis shows time (in ms) from the onset of the target label (vertical solid line). Colored lines show the average increase in proportion target looking post label onset at each age bin (in months). Age bins are larger for older children due to decreased data density. The dashed horizontal line represents chance looking. Error bands represent standard errors of the mean. Grey backgrounds highlight the short and long time windows used in subsequent analyses. The data in the figure are filtered such that (a) participants are required to contribute at least 5 observations and (b) there must be at least 50 participants contributing to each time bin within an age group.

Because there is no consensus about the length of time windows for the computation of accuracy, we considered both a shorter window (from 200 – 2000 ms after noun onset) and a longer window (from 200 – 4000 ms). For each window, we averaged all fixations within the window to compute a continuous proportion of target looking between 0 (no fixation on the target during the window) and 1 (total fixation on the target during the window) on every trial. In this initial analysis, we treat observations of RT and target looking as direct measures of the constructs speed and accuracy (see S4 Test-Retest Reliability); in subsequent analyses we estimate latent variables representing these constructs.
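
To make the two trial-level measures concrete, here is a minimal Python sketch. It is not the Peekbank pipeline: the data layout (a list of `(time_ms, aoi)` samples relative to label onset), the function names, and the 100 ms sampling step are all assumptions made for illustration. Accuracy is the share of target looking out of target plus distractor looking within a window; RT is defined only for trials that start on the distractor.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical layout: (time in ms relative to label onset, area of interest).
Sample = Tuple[int, str]


def window_accuracy(samples: Sequence[Sample],
                    start_ms: int = 200,
                    end_ms: int = 4000) -> Optional[float]:
    """Proportion of target looking out of target + distractor looking."""
    in_window = [aoi for t, aoi in samples if start_ms <= t <= end_ms]
    target = in_window.count("target")
    distractor = in_window.count("distractor")
    looked = target + distractor
    return target / looked if looked else None


def reaction_time(samples: Sequence[Sample]) -> Optional[int]:
    """Time of the first target sample, only for distractor-initial trials."""
    ordered = sorted(samples)
    at_onset = [aoi for t, aoi in ordered if t == 0]
    if not at_onset or at_onset[0] != "distractor":
        return None  # RT is undefined unless gaze starts on the distractor
    for t, aoi in ordered:
        if t > 0 and aoi == "target":
            return t
    return None  # the child never shifted to the target


# Toy trial: on the distractor until 500 ms, then on the target until 4000 ms.
trial = [(t, "distractor") for t in range(0, 600, 100)] + \
        [(t, "target") for t in range(600, 4100, 100)]

short_acc = window_accuracy(trial, 200, 2000)  # 15 of 19 samples on target
long_acc = window_accuracy(trial, 200, 4000)   # 35 of 39 samples on target
rt = reaction_time(trial)                      # first target sample at 600 ms
print(short_acc, long_acc, rt)
```

The same toy trial yields a higher accuracy in the long window than in the short one, because the early distractor looking makes up a smaller share of the longer window.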

Our first question was about the functional form of the relationships between age, speed, and accuracy (see S5 Pairwise Correlations of Main Measures for raw pairwise correlations between variables). We began by fitting linear mixed-effects models predicting speed and accuracy on each trial across the full dataset with random slopes of child age nested within study (modeling item and procedural variation across studies) and random intercepts by participant (see S8 Mixed-effects model specifications for further details on these specifications). We compared models that included both long and short accuracy windows, as well as logarithmic and linear effects of age, and logarithmic and linear transformations of RT (see S3 Checks on Data Distributional Assumptions for further analyses and discussion of these modeling choices). The best fitting model of accuracy predicted long window accuracy as a function of the logarithm of age; the best fitting model of speed predicted log RT as a function of log age as well (see S6 Functional Form Model Comparison and S7 Power Law Fits). Because long window accuracies were more correlated with other variables and showed clearer age gradients, we focus on these in our analyses.

Figure 2 shows these age gradients. Log RT decreased significantly with age, reflecting increasing speed (β = −0.13, 95% CI [−0.16, −0.11], t(18.93) = −12.23, p < .001), and accuracy increased significantly with age (β = 0.07, 95% CI [0.06, 0.08], t(20.17) = 13.05, p < .001). In sum, we see continuing improvements in word recognition across the full age range in our dataset that appear roughly linear in the logarithm of age. These logarithmic relationships follow theoretical expectations that both speed and accuracy should gradually asymptote to mature levels of performance, as seen in skill learning more generally (36, 38).
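
One way to read a log-log slope is as a multiplicative change. The sketch below assumes that the reported coefficient of −0.13 is an unstandardized slope of log RT on log age, with both logarithms in the same base; if the predictors were rescaled before fitting, the numbers would differ. The function name `rt_ratio` is hypothetical.

```python
def rt_ratio(age_ratio: float, slope: float = -0.13) -> float:
    """Multiplicative change in RT implied by a log-log slope.

    If log(RT) = intercept + slope * log(age), then multiplying age by
    `age_ratio` multiplies RT by `age_ratio ** slope`.
    """
    return age_ratio ** slope


# Under the stated assumptions, doubling age scales RT by about 0.91,
# i.e. a reduction of roughly 9%.
doubling = rt_ratio(2.0)
print(round(doubling, 3))
```

Because the relationship is multiplicative, the same proportional speed-up is implied for every doubling of age (12 to 24 months, 24 to 48 months), which is what gradual convergence looks like on a linear age axis.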

Participant-level target looking and reaction time (log), plotted by age (log).
Longitudinal datapoints are connected by lines. The solid blue line shows a linear fit and associated confidence interval. Thin colored lines show linear fits for those datasets spanning six or more months of age. The dashed line for accuracy shows chance-level looking (.5).

Variability of word recognition decreases

One further hallmark of increasing skill is a decrease in task-relevant variability (42). Both within and across datasets, within-individual variation in speed and accuracy decreased across the developmental range we examined (Figure 3). We fit mixed-effects models predicting the standard deviation of both speed and accuracy for each testing session for each participant, including random slopes of log age nested within dataset and random intercepts for each participant. For both speed and accuracy, within-individual variability decreased with age (speed: β = −0.05, 95% CI [−0.06, −0.03], t(16.33) = −7.19, p < .001; accuracy: β = −0.04, 95% CI [−0.04, −0.03], t(12.29) = −10.45, p < .001). Thus, as well as being faster and more accurate, older children were more consistent in their real-time word recognition than younger children.
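
The dependent variable in these models is a per-session standard deviation across trials. A minimal sketch of that summary step follows; taking the SD of log-transformed RTs is an assumption suggested by the figure caption rather than something stated in the text, and the function name and toy values are illustrative only.

```python
from math import log
from statistics import stdev
from typing import Optional, Sequence


def session_sd(values: Sequence[float]) -> Optional[float]:
    """Sample SD across trials in one session; undefined below two trials."""
    return stdev(values) if len(values) >= 2 else None


# Toy session: per-trial accuracies and reaction times (ms) for one child.
accuracies = [0.4, 0.6, 0.8]
rts_ms = [600, 900, 1500]

accuracy_sd = session_sd(accuracies)
log_rt_sd = session_sd([log(rt) for rt in rts_ms])  # assumed log scale
print(accuracy_sd, log_rt_sd)
```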

Participant-level variability in target looking and reaction time (log RT), plotted by age (log).
Plotting conventions are as in Figure 2.

Speed and accuracy relate to vocabulary size

We were next interested in whether the various aspects of word recognition – including speed, accuracy, and the variability of each of these – were related to other aspects of early language ability. In our prior analyses, chronological age acts as a proxy for greater language experience and larger vocabulary as well as a host of other correlated developmental changes in cognition. Now we explicitly explore relations to vocabulary growth and the triadic relationship between age, word recognition, and vocabulary.

Of the studies in our database, 14 gathered parent reports about children’s early vocabulary using the MacArthur-Bates Communicative Development Inventory (CDI), a popular survey instrument that provides a reliable and valid estimate of children’s early vocabulary (2, 45). Different forms of the CDI can be used to measure either receptive and expressive vocabulary (for children up to 18 months) or expressive vocabulary only (for children 16 – 30 months).

We fit a series of factor analytic models to explore the dimensionality of the parent report and child LWL data. Our goal in these analyses was to understand the underlying relatedness of the various measures of word recognition and vocabulary, and in particular to assess the evidence for 1) whether the speed, accuracy, and variability measures described above all index the same underlying language processing construct and 2) the nature of the relation between this construct (or set of constructs) and early vocabulary. We initially add age as an additional variable to our models to explore whether this factor structure relates to age; later we treat age as a predictor of latent factors. We begin developing models using all data, treating each observation as independent even if it comes from a longitudinal study; this assumption is equivalent to asserting an invariant factor structure across development (for a test of this assumption, see S10 Factor Analysis on First Administrations). In subsequent models, we relax this assumption and explore longitudinal growth.

Initial exploratory factor analysis using parallel analysis to select the number of factors suggested that three factors explained substantial variance in the data (see S9 Factor Analysis). To better accommodate missing data under the assumption of data missing at random (e.g., missingness due to the age sampling schemes of the various datasets), we used confirmatory factor analysis with full information maximum likelihood to find the best set of loadings. The best fitting model was a three-factor model with factors for speed (RT and RT variability), accuracy (proportion looking to target on each trial and associated variability of this measure), and vocabulary (comprehension and production from the CDI). Fit statistics for this model were generally good (comparative fit index: 0.98, RMSEA: 0.06; see S11 Alternative Factor Structures).

Figure 4 shows a regression model fit to this confirmatory factor analysis, with log age predicting each latent variable. This regression model allows interpretation of the covariances between latent factors as partial correlations (controlling for age). The non-age-related variance of all three latent factors was significantly related to that of the other factors. Speed and accuracy showed strong negative covariance (β = −0.89, SE = 0.03, p < .001), as expected since they are derived from the same data. Importantly, there was also weaker but significant covariation between RT and vocabulary (β = −0.35, SE = 0.04, p < .001) and between accuracy and vocabulary (β = 0.45, SE = 0.03, p < .001). This model supports the idea that variation in speed and accuracy of word recognition is related to individual differences in parent-reported vocabulary beyond the effects of age. Further, the broader set of analyses supports a factor structure in which speed and accuracy (and their associated variabilities) are related but distinct aspects of word recognition, rather than being measures of one single construct. These analyses treat all data as between-person, however, rather than modeling change in these factors within individuals.

Structural equation model showing the three-factor factor analysis with a regression of each latent variable on the logarithm of age.
Observed variables are notated as squares and latent variables are notated as circles. Factor loadings and regression coefficients are shown with straight, solid lines; covariances are shown with dashed lines; residual variances are shown as solid circular connections. Stars show conventional levels of statistical significance, e.g., * indicates p < .05, ** indicates p < .01, and *** indicates p < .001. Covariances reflect age-residualized correlations between variables.

Speed of processing relates to vocabulary growth

We next investigated within-person relationships between LWL and vocabulary. In particular, we investigated two different (but not mutually exclusive) hypotheses about how word recognition skill could support word learning. First, early word recognition skill could lay a foundation for later vocabulary growth. We test this hypothesis using a series of longitudinal growth models that assess whether individual variability in processing speed predicts later increases in productive vocabulary. A second, stronger version of this hypothesis is what we call a “virtuous cycle” model of the relationship between processing speed and vocabulary growth, in which not only baseline word recognition skill but also children’s improvements in this skill are related to faster growth in vocabulary; we test this hypothesis using longitudinal structural equation models.

To investigate the first hypothesis, we began by fitting longitudinal growth models to the full dataset (though note that the same conclusions hold when restricting the data to only those children with multiple LWL sessions). We first reproduced the analysis reported in (28), in which between-person differences in longitudinal growth in productive vocabulary were predicted based on between-person differences in speed during the initial session of the study. We fit a mixed-effects model predicting growth in vocabulary as a quadratic function of age, RT at study initiation (t₀), and their interaction (as well as random effects of age nested within participant and also age nested within dataset). This model revealed a significant effect of t₀ RT (β = −0.14, 95% CI [−0.19, −0.08], t(530.16) = −4.85, p < .001) and an interaction between t₀ RT and the quadratic age predictor (β = 2.00, 95% CI [1.04, 2.96], t(545.67) = 4.07, p < .001). This analysis suggests that children with faster initial RTs show both larger vocabularies and faster vocabulary growth over time.

We confirmed this analysis using a non-linear growth model with a logistic shape, which provides a better fit to vocabulary size within a fixed-length form than the quadratic model (see S12 Non-Linear Growth Model) (2). Figure 5 shows predictions from this model, confirming the differentiation of growth curves for children with higher and lower initial reaction time.

Growth curves from a logistic growth model showing predicted productive vocabulary growth for children with initial reaction times one SD faster than the mean (blue), at the mean (red), and one SD slower than the mean (green).
Individual longitudinal trajectories are shown in light gray. Solid lines show global model estimates and colored regions indicate 95% credible intervals.

On the other hand, it is possible that differences in predicted growth trajectories are due to coupling between vocabulary size and language processing across the entire developmental period (i.e., the “virtuous cycle” model), rather than a predictive relationship specifically between t₀ RT and vocabulary growth. To test this relationship, we used longitudinal structural equation models. We separated the longitudinal speed, accuracy, and vocabulary data into two-month bins spanning up to 10 months from the initial measurement (i.e., t₀, ..., t₄) and fit individual growth across each of these variables. We used full-information maximum likelihood to handle the substantial missing data caused by the different longitudinal sampling schemes of studies in our dataset (see S13 SEM Longitudinal Missingness). The fitted longitudinal model is shown in Figure 6. Overall fit statistics were generally acceptable (comparative fit index: 0.89, RMSEA: 0.03, RMSEA p-value: > .999).
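
The binning step can be illustrated with a short Python sketch. The exact bin edges are not given in the text, so this assumes left-closed two-month bins counted from each child's first session, with a measurement at exactly 10 months folded into the last bin; `time_bin` is a hypothetical helper, not a function from the analysis code.

```python
from typing import Optional


def time_bin(months_since_first: float,
             bin_width: float = 2.0,
             n_bins: int = 5) -> Optional[int]:
    """Map months since the first session to a bin index 0..n_bins-1 (t0..t4).

    Returns None for measurements outside the 0-10 month span.
    """
    if months_since_first < 0 or months_since_first > bin_width * n_bins:
        return None
    # Floor-divide into bins; cap so the upper edge falls into the last bin.
    return min(int(months_since_first // bin_width), n_bins - 1)


# A child measured at 0, 3.5 and 9.9 months after the first session
# contributes to t0, t1 and t4; the other bins are missing for that child.
bins = [time_bin(m) for m in (0, 3.5, 9.9)]
print(bins)
```

Sparse schedules like this one are why most children have observations in only a few bins, which is the missingness that full-information maximum likelihood is used to handle.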

Structural equation model showing longitudinal couplings between growth parameters.

Our key question of interest concerned coupling among the (latent) intercepts and slopes of these growth models. Consistent with our earlier analysis showing that faster processing is related to vocabulary growth, we saw significant between-person coupling between processing speed intercepts and vocabulary growth slopes (β = −0.18, SE = 0.06, p = .001) as well as a variety of other between-person couplings. On the other hand, there was no significant coupling between growth in speed and growth in vocabulary (β = 0.00, SE = 0.02, p = .872). This null effect could be interpreted as being consistent with these abilities growing independently, but there are other possibilities. First, the longitudinal data we had might not have allowed sufficiently precise estimates of growth slopes. Second, since vocabulary growth is non-linear, the linear model we used here might not have captured coupling among nonlinear aspects of developmental change.

In sum, these findings provide evidence consistent with the claim that differences in processing speed are related to differences in the rate of age-related change in vocabulary (19, 26). Children with greater skill in word recognition learn words faster. However, we did not find evidence for the stronger version of this claim: in neither the non-linear growth model nor the linear SEM did we find evidence that increases in speed were related to increases in vocabulary size. Thus, our findings do not support a “virtuous cycle” model in which increases in recognition specifically lead to increases in vocabulary size.

Discussion

How does word recognition change across early childhood and how does it relate to language learning? We investigated these questions using a new, large-scale dataset of developmental eye-tracking measurements compiled across many prior studies. The age gradients for speed and accuracy indicated that both improve asymptotically. Gradients for recognition speed were consistent with the log-log relationship associated with the “power law of practice,” that is, with a gradual convergence to mature levels of processing efficiency. Further, the age gradient suggested that trial-to-trial variability decreases with age, consistent with both the literature on skill learning (42) and other work on developmental changes in variability (46–48). Speed and accuracy were both related to vocabulary size concurrently and processing speed was also related longitudinally to later vocabulary growth.

Together, our findings are consistent with theories that posit that language learning is a process of skill acquisition, in which children become adept at quickly converting ephemeral signals into meaning (34). This skill develops gradually over the course of early childhood and supports word learning. Further, our results point to consistency between skill development in early childhood and the continued refinement of language processing and language knowledge during middle childhood (30, 32).

By aggregating data from many pre-existing studies, we were able to overcome the limitations of prior investigations, which typically had sample sizes at least an order of magnitude smaller than ours. Our approach was to build on the time-consuming and meticulous data collection from previous infant and toddler eye-tracking studies – representing cumulatively many thousands of hours of in-lab data collection and hand-annotation of the resulting videos of child looking behavior – by harmonizing these data into a single, large-scale database. This approach illustrates how building harmonized databases can be especially powerful when composed of high-effort and high-quality datasets that are smaller in scope, maximizing the impact of previous data collection efforts and allowing us to ask broader questions about developmental change (2). In contrast to individual studies, which typically have at best the statistical power to test one or two specific contrasts, our “big data” approach provided the sample sizes necessary to explore the relationships between different variables. Because early language is so variable, these kinds of samples – with thousands, rather than dozens of children – are likely to be required to gain further insight into the psychometrics of early language learning (2, 49, 50).

Our approach is both observational and exploratory. Thus, we cannot untangle the range of different causal models that explain the variation we observed. First, early word recognition skill could lead to faster word learning, but faster children could also be faster due to their larger vocabulary and stronger lexical representations. These two causal directions could also interact reciprocally, leading to a “rich get richer” process in which children with larger vocabularies process faster, and their faster processing helps them increase their vocabulary size more rapidly. Finally, a third shared factor – perhaps general cognitive ability – could underpin both processes. Our cross-sectional data cannot distinguish these hypotheses even in principle (51), and our longitudinal data are likely too sparse to distinguish such complex causal models. Future work must also explore how the functional forms we observed here between individuals reflect processes of within-person change. Although the Peekbank dataset includes a variety of longitudinal data, most reflect a small number of measurements; denser longitudinal data collection is required to better estimate within-person growth models.

The relationships we report are derived from models that account for variation across datasets, suggesting that our qualitative conclusions are robust to cross-laboratory variation. Nevertheless, these findings are still limited in their generalizability by the convenience samples that were used in most of the studies aggregated in Peekbank. These studies typically (but not always) represent children from well-educated parents living in university-adjacent communities. We would not expect that specific numerical parameters estimated in our aggregate convenience sample would generalize to other samples.

More broadly, our results here suggest the continued importance of the looking-while-listening paradigm as an index of children’s language processing abilities. If language learning is, at least in part, a process of skill learning, then measurement of this skill in larger samples provides a critical window into understanding the remarkable process of language learning.

Materials and Methods

Data

We included information from 2555 unique participants across 26 datasets. Dataset information is given in Table 1. Although experiments in Peekbank include a variety of different experimental manipulations, we analyzed only data from standard, simple word recognition trials (“vanilla” trials); these trials were sometimes the main focus of the original studies and sometimes constituted control conditions for experiments with more complex manipulations. Requirements for being considered a standard word recognition trial included that (a) the target word was familiar (also no part-words); (b) the target word was the first point of disambiguation and appeared only once; (c) the target word was embedded in a well-formed, grammatical carrier phrase; (d) there was no informative language presented prenominally (e.g. semantically informative verbs, adjectives); (e) there were no nonsense words presented anywhere during the trial (including the carrier phrase); (f) there was no language-, speaker-, or accent-switching within trial; (g) the auditory stimulus included no intentional background noise or audio filtering; (h) both target and distractor items were familiar objects; (i) no novel visual stimuli (i.e., experimenter-created artificial items or items selected to be entirely unfamiliar) were visible; and (j) the target referent was the focal object in the target image and there were no additional focal objects competing for attention within the target image (e.g., if the target word was “orange” and the image depicted an orange on a plate, this was considered a standard trial; if however the image depicted both an apple and an orange on a plate, this was not considered standard). We focus here on English purely for practical reasons – the Peekbank dataset at present contains limited data from other languages.

We excluded trials entirely if they were missing data on more than 50% of timepoints, and excluded RTs if they were based on fewer than 50% of timepoints in the short analytic window (200 – 2000 ms). We also removed RTs shorter than 367 ms, as these were unlikely to be generated based on the specific linguistic stimulus. We then excluded participants from the analysis if they contributed fewer than four accuracy measurements or fewer than two reaction time measurements. At the participant level, these steps together led to 21.40% missingness for RTs and 8.80% missingness for long window accuracies.
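The exclusion rules above can be sketched in a few lines. This is an illustrative Python sketch, not the paper's R pipeline: the function names and the coding of missing samples as `None` are hypothetical, and applying the participant thresholds separately per measure is an assumption based on the separate missingness figures reported for RT and accuracy.

```python
from typing import Dict, Optional, Sequence


def trial_has_enough_data(samples: Sequence[Optional[str]],
                          max_missing: float = 0.5) -> bool:
    """Keep a trial unless more than `max_missing` of its timepoints are missing.

    `samples` holds one coded fixation per timepoint; None marks a missing
    sample (hypothetical coding, for illustration only).
    """
    if not samples:
        return False
    missing = sum(sample is None for sample in samples)
    return missing / len(samples) <= max_missing


def participant_measures_kept(n_accuracy: int, n_rt: int,
                              min_accuracy: int = 4,
                              min_rt: int = 2) -> Dict[str, bool]:
    """Flag which measures a participant contributes.

    Assumption: the thresholds (four accuracy measurements, two RT
    measurements) are applied per measure rather than jointly.
    """
    return {"accuracy": n_accuracy >= min_accuracy, "rt": n_rt >= min_rt}
```

For example, a participant with five usable accuracy trials but only one valid RT would contribute an accuracy estimate but no RT estimate under this reading.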

Analytic methods

We used lme4 to fit linear mixed-effects models, brms to fit non-linear growth models, and lavaan to fit structural equation models. Random effects structures for each model are given in text; full model specifications are available in the Supplemental Information (S8, S9, and S12) and in the reproducible code for this paper, available in the linked repository. To aid interpretability, all variables were standardized (z-scored) prior to inclusion in structural equation models.

Supplemental Information

S1. Dataset Description

Figure S1 gives the age distribution of unique participants for each separate dataset at different ages. Note that for some datasets, there are multiple administrations (i.e., experimental test sessions) for each participant. Figure S2 shows the distribution of measurement intervals for longitudinal studies within the dataset. Table S1 has additional information on how many trials each dataset contributed and what percent of the dataset’s trials were included.

Age distribution of unique participants for each dataset, using three-month bins.

Distribution of retest administrations across datasets with repeated measurements, colored by dataset.
Each count indicates a retest administration (initial administrations are excluded). Administrations listed with a retest interval of 0 indicate retests within a month of the initial administration.

Characteristics of included datasets from Peekbank, sorted by the percentage of the data they represent.
Percent trials refers to what percent of the trials used came from that dataset; total is the number of trials used from that dataset; and included is what percent of all trials had data that was included (based on criteria about missingness, distractor to target transition, minimum RT). LW = long-window accuracy, SW = short-window accuracy, and RT = reaction time.

S2. Reaction Times

S2.1. Reaction Time Computation

Eye-tracking data are stored in Peekbank as a time series of fixations to specific areas of interest (in particular, the target and distractor on each trial). Other fixations can be to areas not in the target or distractor as well as to off-screen areas. This time series has a uniform sample rate of 25 ms/sample, based on resampling of the data in Peekbank to 40 Hz during preprocessing (Zettersten et al. 2023). Reaction times are computed by filtering trials to only those on which the child is fixating the distractor at the point of disambiguation (t = 0) and then finding those trials on which the first non-missing fixation after leaving the distractor is to the target (hence excluding trials without a shift and trials on which a shift is to an off-screen location). The reaction time is then the total time from t = 0 to the first timestep during which the child fixates the target. Consistent with standard practice in the literature following Fernald et al. (2008), RTs that are shorter than 367 ms are excluded as they are too short to be considered a response to the stimulus.
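As a rough illustration of this procedure, the following Python sketch computes a distractor-to-target RT from one trial's resampled time series. It is not the Peekbank implementation: the string codes for areas of interest, the use of `None` for missing samples, and the choice to skip missing samples while searching for the first shift are assumptions made for the example.

```python
from typing import Optional, Sequence

SAMPLE_MS = 25   # one sample every 25 ms (40 Hz), as described above
MIN_RT_MS = 367  # shorter RTs are treated as too fast to reflect the stimulus


def reaction_time(looks: Sequence[Optional[str]]) -> Optional[int]:
    """Return the distractor-to-target RT in ms, or None if the trial yields no RT.

    `looks[i]` is the coded fixation at t = i * SAMPLE_MS after the point of
    disambiguation: "target", "distractor", "other", or None for a missing
    sample. These codes are hypothetical.
    """
    # Only trials that start on the distractor contribute an RT.
    if not looks or looks[0] != "distractor":
        return None
    for i, look in enumerate(looks):
        if look == "distractor" or look is None:
            continue  # still on the distractor, or sample missing
        if look != "target":
            return None  # first shift went somewhere other than the target
        rt = i * SAMPLE_MS
        return rt if rt >= MIN_RT_MS else None
    return None  # the child never left the distractor
```

With 25 ms samples, the shortest RT that survives the 367 ms cutoff in this sketch is 375 ms.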

S2.2. Comparison of Reaction Times for Correct and Incorrect Trials: Re-analysis of Creel (2024)

The Peekbank dataset only includes measurements of infants’ looking behavior, with no measure of a final target selection. This contrasts with work in the visual-world paradigm with older children and adults, in which participants make a final explicit choice about which image matches the target label (e.g. Colby & McMurray, 2023). Having this additional response allows a clearer separation of accuracy and reaction times, because researchers can compute reaction times specifically on those trials in which participants responded correctly. This strategy helps avoid a possible mixing of reaction times for incorrect and correct responses, which might be generated by different underlying cognitive processes. A possible concern with the Peekbank datasets — and reaction times in infant looking-while-listening studies more generally — is that it is difficult to separate reaction times for correct vs. incorrect responses in the absence of an independent final choice response.

To address this concern, we investigated data from a recent large-scale word recognition study with toddlers in which eyetracking measures were collected together with a final pointing response (Creel, 2024). This dataset included 914 responses from children (2.5-6.5 years) completing a looking-while-listening procedure in which they were also instructed to point to the target image. Using this dataset, we investigated the correlation between reaction times (following the same procedure as in our main analyses, i.e. focusing specifically on distractor-to-target shifts) computed over all trials and reaction times computed only for those trials in which children selected the correct referent. The results are shown in Figure S3. Reaction times (i.e., distractor-to-target shifts) for correct trials only were highly correlated with reaction times across all trials (r = .85, 95% CI [.82, .87], t(479) = 34.84, p < .001). This result suggests that having the ability to filter out incorrect trials has a minimal impact on reaction time computation, even in young children. While there is some uncertainty about how these results may generalize to infants in our younger age ranges (i.e., below 2.5 years of age), who struggle to provide reliable pointing responses, it seems reasonable to assume that our reaction time results would stay largely the same if it were possible to filter out trials on which infants make an incorrect mapping between the target label and the target image using an eyetracking-independent final choice response.

Correlation between reaction times on all trials and reaction times on trials where the child pointed to the correct target.
Data from Creel (2024).

S3. Checks on Data Distributional Assumptions

Here, we check whether the distributional forms that are assumed for the distributions of RT and accuracy are a reasonable empirical fit to the data, and compare against other commonly used distributional forms. We confirm that, across the age range, the choice to use a log-normal distribution for RT and a normal distribution for accuracy is justified.

S3.1. Reaction Time

The literature focuses on the use of the Exponential-Gaussian (ex-Gaussian) distribution as well as Wald, Weibull, gamma, and log normal distributions (see for example Luce, 1986; Ratcliff, 1993; Van Zandt, 2002). All of these are 2- or 3-parameter distributions, meaning that there is no necessary relationship between mean and variance.

The problem of fitting RT distributions is complex and a substantial literature exists (e.g., Ratcliff, 1979; Luce, 1986; Van Zandt, 2000; Baayen & Milin, 2010). One of the big challenges in our dataset as well as elsewhere is that distributions are conditional on factors such as participant and task, so it is challenging to draw inferences about the underlying distribution when looking at average data.

That said, we find that overall the data are best fit by either an ex-Gaussian or a log normal distribution, again consistent with prior literature, giving us confidence in this conclusion. Across the full dataset, the BIC values for ex-Gaussian (46906) and log normal (46953) are quite close to one another, and better than the Wald (53949) and normal (47811) fits. When binned by age (Fig S4), younger children seem better fit by a log normal distribution and older children seem better fit by an ex-Gaussian (models with lowest BICs are shown in red since significant differences can be obscured by the large scale). Figure S5 shows the RT data distribution overlaid with the corresponding log-normal distributions. Overall, we think this result generally supports our decision to use log-transformed RTs as our primary dependent measure.
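To make the BIC comparison concrete, here is a minimal Python sketch for the two distributions whose maximum-likelihood fits have a closed form, the normal and the log normal. It is not the fitting code used for the values above; the ex-Gaussian and Wald fits need numerical optimization and are left out, and the function names and toy input values are hypothetical.

```python
import math
from typing import Sequence


def normal_loglik(values: Sequence[float]) -> float:
    """Maximized log-likelihood of a normal distribution (closed-form MLE)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # MLE divides by n, not n - 1
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def lognormal_loglik(values: Sequence[float]) -> float:
    """Maximized log-likelihood of a log normal distribution.

    The density of x is the normal density of log(x) divided by x, so the
    log-likelihood is the normal log-likelihood of the logs minus sum(log x).
    """
    logs = [math.log(v) for v in values]
    return normal_loglik(logs) - sum(logs)


def bic(loglik: float, n_obs: int, n_params: int = 2) -> float:
    """Bayesian information criterion; lower values indicate better fit."""
    return n_params * math.log(n_obs) - 2 * loglik


# Toy right-skewed values (not real RTs), used only to exercise the functions.
toy = [math.exp(0), math.exp(1), math.exp(2)]
bic_normal = bic(normal_loglik(toy), len(toy))
bic_lognormal = bic(lognormal_loglik(toy), len(toy))
```

Because both likelihoods are evaluated on the raw scale, the two BIC values can be compared directly; for these toy values the log normal has the lower BIC.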

Goodness of fit for different distributional models for RT, split by age.

Distribution of RT overlaid with a log normal distribution, split by age.

S3.2. Accuracy

Individual trial-level accuracies are not binomial because they are an average probability of fixation over a viewing window. They are bounded at 0 and 1, but in general they tend towards the range .5 - .8 in most studies of this population. Figure S6 shows the data binned by age group with fitted Gaussian distributions.

Goodness of fit for different distributional models of accuracy, split by age.

These distributions seem well fit by standard Gaussians, but they are in principle bounded and so we asked whether this made a difference, using a Beta distribution (a two-parameter continuous distribution bounded at 0 and 1) for fitting. Surprisingly, across all data, the BIC values for the two distributions were very similar (-4710 for normal versus -4678 for Beta), though the normal distribution was slightly favored. Across age groups, there was heterogeneity, with some groups better fit by a Gaussian and others better fit by a Beta (Figure S7). Again, we feel that this result generally supports our approach of modeling accuracies via standard linear mixed-effects models: their distributional form is quite close to normal.

Distribution of accuracies overlaid with normal distribution, split by age.

S4. Test-Retest Reliability

We examined test-retest reliability for our primary variables of interest by calculating Pearson correlations between pairs of administrations given no more than three months apart. Test-retest correlations were significant but relatively modest: ρ_{longwindowacc} = 0.462, ρ_{shortwindowacc} = 0.496, ρ_{rt} = 0.407. These reliabilities were biased downwards by three factors, however. First, longitudinal assessments sometimes use variable items between testing sessions, leading to item-related variance in measurement. Second, even three months can lead to substantial change in some children’s language abilities, so correlations are attenuated by true change as well as measurement error. Third, longitudinal data in the dataset come primarily from the youngest children and hence are likely to show overall higher measurement error due to variability in children’s behavior and an overall lower number of trials.
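The pairing logic behind this calculation can be sketched as follows. This Python sketch is illustrative only: the tuple layout of an administration, the function names, and the choice to use every within-child pair of sessions that falls within the gap (rather than, say, only consecutive sessions) are assumptions.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple

# One administration: (subject_id, age_in_months, score). Hypothetical layout.
Administration = Tuple[str, float, float]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def retest_pairs(administrations: Sequence[Administration],
                 max_gap_months: float = 3.0) -> Tuple[List[float], List[float]]:
    """Collect (earlier, later) score pairs from the same child within the gap."""
    by_subject: Dict[str, List[Tuple[float, float]]] = {}
    for subject_id, age, score in administrations:
        by_subject.setdefault(subject_id, []).append((age, score))
    earlier, later = [], []
    for sessions in by_subject.values():
        sessions.sort()  # order each child's sessions by age
        for (age_a, score_a), (age_b, score_b) in combinations(sessions, 2):
            if age_b - age_a <= max_gap_months:
                earlier.append(score_a)
                later.append(score_b)
    return earlier, later
```

A reliability estimate is then `pearson(*retest_pairs(administrations))`, computed separately for each measure.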

S5. Pairwise Correlations of Main Measures

Table S2 shows pairwise correlations between the primary variables of interest in the dataset.

Pairwise correlations between primary variables of interest.

S6. Functional Form Model Comparison

Table S3 shows model comparison measures for different models of the functional form of the relationship between accuracy and age and Table S4 shows the same for reaction time. Age gradients are estimated substantially better with long window accuracies. Note that there are a greater number of observations for short window accuracies due to less missing data. We speculate that, on average, more participants looked away from the screen towards the end of trials, leading to a greater number of exclusions of long window trials based on the 50% criterion. Note that the total percentage of trials excluded is still small for both measures: 4.8% for long window accuracy and 1.8% for short window accuracy.

Model comparison metrics for different functional forms of the relationship between accuracy and age.

Model comparison metrics for different functional forms of the relationship between RT and age.

S7. Power Law Fits

In the literature on the “law of practice”, although the log-log relationship we observed is commonly present in the aggregate across individuals, the situation is substantially more complex when relationships are measured within individuals. The best fitting curves for individuals are often exponentials or delayed exponentials (Evans et al., 2018; Heathcote, Brown, & Mewhort, 2000).

With our current dataset, we unfortunately cannot specifically determine whether within-individual patterns of change conform to linear, power law, or exponential developmental patterns, because we have insufficient data about individuals’ improvement across time. Thus, our current results apply to the form of the age gradient as opposed to the form of any individual’s pattern of developmental change.

We believe that, unlike the skills being studied in the prior adult literature (e.g., Anderson, 1982; Heathcote, Brown, & Mewhort, 2000; Logan, 1988), language processing is being learned over the course of a child’s lifetime. Thus, we do not expect to see within-paradigm changes in learning in what is a narrow period of time compared to the duration over which language processing skills are refined.

Nevertheless, here we test for other forms of the aggregate relationship between age and reaction time. In particular, we consider 1) a log-log relationship between RT and age (presented in the main text), 2) using both a log age and a linear age to predict log RT, 3) a quadratic relationship between age and RT, and 4) a cubic relationship between age and RT. As shown in Table S5, the model with a linear age term in addition to a log age term has the best fit, although the linear age term coefficient is only marginally significant (coefficients in Table S6).

Goodness of fit comparison between different models of the relationship between age and RT.

Fixed effects coefficients for a model predicting log RT from both log age and linear age.

These models reveal a small but significant additional linear age term over and above log age, but – because individual participant-level fits are not possible – this term cannot be used to weigh in on the debate about the precise nature of the learning pattern.

S8. Mixed-effects model specifications

Here we provide specifications for the lmer mixed-effects models used in the main text. These models are used to estimate the relationship between age and the primary variables of interest, controlling for dataset and subject-level variability.

For accuracy, 4 models were run, crossing long and short windows as the dependent variable with age or log age as the predictor.

For reaction time, 4 models were run, crossing rt and log rt as the dependent variable with age or log age as the predictor.

To look at the relationship between variance in the accuracy and reaction time measures and children’s age, we ran two models.

long_window_acc_var ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)
log_rt_var ~ log_age_s + (log_age_s | dataset_name) + (1 | subject_id)

In the growth curve analysis, we fit a mixed-effects model predicting growth in vocabulary as a quadratic function of age, RT at study initiation (t0), and their interaction, using the formula below:

prod ~ poly(age_15, 2) * rt_t0 + (age | subject_id) + (1 | dataset_name)

S9. Factor Analysis

Figure S8 shows the result of a parallel analysis supporting the presence of three factors in the exploratory factor analysis. Table S7 shows the factor loadings for the exploratory three-factor solution using varimax rotation. The first factor is primarily driven by vocabulary measures, the second by reaction time, and the third by accuracy measures.

Factor loadings for the exploratory three factor solution using varimax rotation.

The confirmatory factor analysis of this three-factor solution was fit using the following specification:

vocab =~ prod + comp
accuracy =~ long_window_accuracy + long_window_acc_var
speed =~ log_rt + log_rt_var

Parallel analysis scree plot showing the eigenvalues for each factor, for actual, simulated, and resampled data.

The confirmatory factor analysis of the three-factor solution with a relation to age was fit using the following specification:

vocab =~ prod + comp
accuracy =~ acc + acc_sd
speed =~ log_rt + log_rt_sd

vocab ~ log_age
accuracy ~ log_age
speed ~ log_age

The SEM with a linear growth curve used the following specification:

accuracy_intercept =~ 1*acc_t0 + 1*acc_t1 + 1*acc_t2 + 1*acc_t3 + 1*acc_t4
accuracy_slope =~ 1*acc_t0 + 2*acc_t1 + 3*acc_t2 + 4*acc_t3 + 5*acc_t4
speed_intercept =~ 1*log_rt_t0 + 1*log_rt_t1 + 1*log_rt_t2 + 1*log_rt_t3 + 1*log_rt_t4
speed_slope =~ 1*log_rt_t0 + 2*log_rt_t1 + 3*log_rt_t2 + 4*log_rt_t3 + 5*log_rt_t4
vocab_intercept =~ 1*prod_t0 + 1*prod_t1 + 1*prod_t2 + 1*prod_t3 + 1*prod_t4
vocab_slope =~ 1*prod_t0 + 2*prod_t1 + 3*prod_t2 + 4*prod_t3 + 5*prod_t4

accuracy_intercept ~~ NA*accuracy_intercept
accuracy_slope ~~ NA*accuracy_slope
speed_intercept ~~ NA*speed_intercept
speed_slope ~~ NA*speed_slope
vocab_intercept ~~ NA*vocab_intercept
vocab_slope ~~ NA*vocab_slope

S10. Factor Analysis on First Administrations

As a robustness check, we tested our best factor analytic models using only cross-sectional data (filtering to the first test session in longitudinal datasets; N=1963 instead of N=3553). A comparison of all 4 models is shown in Table S8. For the three-factor CFA, the first administration model shows increased CFI (.992 instead of .972) and decreased RMSEA (.030 instead of .065). The same is true for the age-regressed three-factor CFA, which shows very good statistics on both first administrations and longitudinal data (CFI = .999 and .991, respectively and RMSEA = .009 and .037 respectively).

Comparison of confirmatory factor analysis models on longitudinal data or first administrations only.

S11. Alternative Factor Structures

In this section, we provide comparisons between the three-factor model we report in the main text and several alternative models, including:

a one-factor model;
a two-factor model with vocabulary separated from speed and accuracy;
a two-factor model with speed separated from accuracy and vocabulary; and
a two-factor model with variability terms separated from speed, accuracy, and vocabulary.

Table S9 shows the result of these comparisons. The three-factor model shows the lowest AIC and BIC, as well as being significantly better fitting than the next-best model.

Model comparison for alternative factor structures.
p-values show differences between adjacent models; no p-values are shown for comparisons between non-nested models.

S12. Non-linear Growth Models

To test for the differentiation of vocabulary growth based on initial reaction time, we used the package brms to fit a (Bayesian) logistic growth model to the production data. This model has two parameters for the logistic curve, a scale and an intercept. Both were allowed to interact with initial reaction time. We also included random effects of logistic intercept and scale by participant and a grouping term across datasets.

Age and initial reaction time were both mean-centered. This model showed a significant effect of initial reaction time on the intercept of the logistic growth curve, but not on its scale (see Table S10).

Fixed effects estimates from logistic growth model.

The formula specification was:

nlform <- brms::bf(
  prod ~ 1 / (1 + exp((xmid - age_c) / exp(logscale))),
  xmid ~ 1 + log_rt_0_c + (1 | dataset_name/subject_id),
  logscale ~ 1 + log_rt_0_c + (1 | dataset_name/subject_id),
  # scale ~ 1 + log_rt_0_c,
  nl = TRUE
)

And the priors were:

priors <- c(
  prior(normal(0, 5), nlpar = "xmid", coef = "Intercept"),
  prior(normal(1, 1), nlpar = "logscale", coef = "Intercept"),
  prior(normal(0, 1), nlpar = "logscale", coef = "log_rt_0_c"),
  prior(exponential(1), class = "sigma"),
  prior(normal(0, 2), nlpar = "xmid", coef = "log_rt_0_c"),

  # Random effects for xmid
  prior(exponential(1), class = "sd", nlpar = "xmid", group = "dataset_name"),
  prior(exponential(1), class = "sd", nlpar = "xmid", group = "dataset_name:subject_id"),

  # Random effects for scale
  prior(exponential(1), class = "sd", nlpar = "logscale", group = "dataset_name"),
  prior(exponential(1), class = "sd", nlpar = "logscale", group = "dataset_name:subject_id")
)
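To show what the non-linear formula computes, the following Python sketch evaluates the same logistic curve for a single child, using fixed effects only. It is an illustration rather than the fitted model: the coefficient names and the numbers in the usage note are hypothetical, and the outcome is assumed to lie between 0 and 1 because the curve itself is bounded there.

```python
import math


def logistic_growth(age_c: float, xmid: float, logscale: float) -> float:
    """Evaluate 1 / (1 + exp((xmid - age_c) / exp(logscale))).

    `xmid` is the centered age at which the curve reaches 0.5, and
    exp(logscale) sets how gradually the curve rises around that midpoint.
    """
    return 1.0 / (1.0 + math.exp((xmid - age_c) / math.exp(logscale)))


def predicted_prod(age_c: float, log_rt_0_c: float,
                   xmid_intercept: float, xmid_rt: float,
                   logscale_intercept: float, logscale_rt: float) -> float:
    """Fixed-effects prediction with both curve parameters linear in initial RT.

    The four coefficient arguments are hypothetical stand-ins for the
    fixed effects of xmid and logscale; random effects are ignored.
    """
    xmid = xmid_intercept + xmid_rt * log_rt_0_c
    logscale = logscale_intercept + logscale_rt * log_rt_0_c
    return logistic_growth(age_c, xmid, logscale)
```

For instance, with a positive hypothetical `xmid_rt`, a child with a slower initial RT (larger `log_rt_0_c`) has a later midpoint and therefore a lower predicted value at any given age, which is the kind of shift in the curve's midpoint that an effect on `xmid` describes.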

Age and reaction time are correlated, so to check that the effects of initial reaction time were not due to age effects, we reran the model using residualized reaction time to remove effects of age. As seen in Table S11 and Figure S9, the pattern of effects is similar for residualized reaction time as for reaction time.

Fixed effects estimates from logistic growth model using RT residualized on age as the predictor.

Growth curves from a logistic growth model showing predicted productive vocabulary growth for children based on their age-residualized initial reaction times.
Predictions are shown for children with initial reaction times one SD faster than the mean for their age (blue), at the mean for their age (red), and one SD slower than the mean for their age (green). Individual longitudinal trajectories are shown in light gray. Solid lines show global model estimates and colored regions indicate 95% credible intervals.

Interpretation of growth in both this model and the linear growth model in the main text is complicated by the fact that the CDI form puts a ceiling on the total number of words that can be recorded; both the quadratic growth functions and the logistic functions come together at the form ceiling. Thus, a shift in quadratic growth in the linear model and a shift in intercept in the logistic model both point to the same overall effect, which is faster growth at the point of maximal sensitivity of the CDI. Neither model can estimate whether the overall growth trajectory is different beyond the range of the CDI. Thus, although these models might initially seem to be in conflict, we believe that they actually point to the same phenomenon, which is perhaps better described by the longitudinal SEM model reported in the main text. Children with greater skill in word recognition show an overall positive shift in the growth trajectory of vocabulary development.

S13. SEM Longitudinal Missingness

The SEM model was fit to the entire dataset, including the large mass of cross-sectional data (to anchor the estimates of t0 coefficients) and the sparse longitudinal data for each time point. We have 3%-12% of the total t0 datapoints for any given time point (see Table S12), given the sparsity of longitudinal sampling (only 6/24 of the datasets are longitudinal).

Fraction of data present for each measure at each time point for the longitudinal SEM.

Our data are MAR (missing at random) rather than MCAR (missing completely at random). This is because their missingness is due to which dataset they are part of – if they are from a cross-sectional dataset, they are by definition missing all longitudinal observations. For our analyses to be appropriate given this structure, we have to assume that the general developmental patterns we are studying are replicated across datasets. We believe that they are, and we show this statistically using our mixed-effects and non-linear mixed-effects models, which control for dataset-related variation. We also show dataset-level effects in a number of our visualizations for this same reason. The same degree of random effect specification that we can do in the mixed-effects models is not possible in the SEM model, however, purely for technical reasons. Again, this point highlights the importance of convergence across analyses.

Data availability

We retrieved all data from Peekbank release 2026.1 using the peekbankr R package. All code and data necessary to reproduce this manuscript are available at https://github.com/peekbank/peekbank-development.

Additional information

Funding

Jacobs Foundation (Foundation_JF)

Michael C Frank
