1 Introduction

What computational operations is the brain performing when it recognizes some splotches of ink on a piece of paper as a meaningful word? This question has been the focus of a large number of neuroimaging studies that examine brain activity during reading. Noninvasive measurement techniques such as electroencephalography (eeg) (Grainger and Holcomb, 2009), magnetoencephalography (meg) (Salmelin, 2007) and functional magnetic resonance imaging (fmri) (Price, 2012) have provided a wealth of information about when and where changes in activity might be expected during various tasks involving orthographic processing (Carreiras et al., 2014).

In the case of visual word recognition, many of these results come in the form of changes in the amplitude of specific components of the neural response evoked by stimuli that are designed to create interesting experimental contrasts (Luck, 2014). Such evoked components reflect macro-level computations — that is, the net result of thousands of individual biological neurons working together. Taken together, the results indicate the presence of a processing pipeline, starting with the extraction of low-level visual features (e.g., edges, line segments), which are subsequently refined into more complex features (e.g., letter shapes) and further into lexical features (e.g., bigrams, words) (Dehaene et al., 2005; Grainger and Holcomb, 2009). While neuroimaging studies provide us with information about what processing steps are performed where and when, the observed data alone yield little information as to what kind of computations are performed during these steps (Poeppel, 2012). To develop such an understanding, we need to make these computations explicit, model them, and test and refine the model against the data provided by imaging studies (Barber and Kutas, 2007; Doerig et al., 2023; Price, 2018).

In this study, we aimed to computationally reproduce the results of Vartiainen et al. (2011), which is a representative meg study that employed experimental contrasts designed to study key processing steps throughout the entire visual word recognition pipeline. The authors catalogued the effects of the experimental contrasts on the amplitudes of all major evoked meg responses found in the data, and concluded that the significant effects could be attributed to three components that dominate the early meg time course during visual word recognition (Jobard et al., 2003; Salmelin, 2007), namely:

type-i This component peaks occipitally around 100 ms after stimulus onset, is modulated by the visual complexity of the stimulus and hence thought to reflect the processing of low-level visual features (Tarkiainen et al., 1999). Vartiainen et al. (2011) used a contrast between stimuli with and without added visual noise to highlight this processing stage.

type-ii This component peaks occipito-temporally around 150 ms after stimulus onset, and is modulated by whether the stimulus contains letters of the participant’s native alphabet (Parviainen et al., 2006; Tarkiainen et al., 1999). Hence, this component is colloquially referred to as the “letter string” response and thought to reflect letter shape detection. Vartiainen et al. (2011) used a contrast between letter strings and symbol strings (visually similar to letters but not part of the Finnish alphabet) to highlight this processing stage.

n400m This component peaks at around 400 ms after stimulus onset, and is often studied in the context of priming (Kutas and Federmeier, 2011), but experimental effects can also be observed when using isolated stimuli. For example, this component is modulated by the lexicality of the stimulus, hence associated with more lexical processing (Halgren et al., 2002; Helenius et al., 1998; Salmelin, 2007; Service et al., 2007). Vartiainen et al. (2011) used a contrast between consonant strings, valid Finnish words and pseudowords to highlight this processing stage.

These three components are also observed in eeg studies (referred to as the p1, n1/n170 and n400 components respectively (Brem et al., 2009)). Together, they are indicative of three processing stages during visual word recognition: basic visual analysis, orthographic analysis and lexical analysis. To explore possible cognitive computations during these stages, we sought to create a computational model that performs the macro-level computations needed to achieve visual word recognition in such a manner as to reproduce the behavior of the three evoked components when presented with the same stimuli as a human participant.

In past decades, several computational models have been created that successfully capture some aspects of visual word recognition. For the purposes of this study, we draw a distinction between the older “traditional-style” models (Norris, 2013) and the large artificial neural networks trained through deep learning that have been gaining popularity as models of brain function (LeCun et al., 2015; Richards et al., 2019). We will show that a model will need to combine elements from both approaches if it is to be able to reproduce the stimulus-dependent modulation of all evoked components mentioned above.

Amongst the first models of visual word recognition, the interactive activation and competition (iac) model of letter perception by McClelland and Rumelhart (McClelland and Rumelhart, 1981; Rumelhart and McClelland, 1982), showed how the interplay of bottom-up and top-down connections results in a system capable of “filling in the blanks” when faced with a partially obscured word. This model was later extended to model semantics as well, showing how the activation of some semantic features (“is alive”, “is a bird”) leads to the subsequent activation of more semantic features (“can fly”, “lays eggs”), in what became known as the parallel distributed processing (pdp) framework (McClelland and Rogers, 2003). Note that while models such as these consist of many interconnected units, those units and their connections do not aim to mimic a biological connectome, but are an abstract representation of macro-level computations performed by one. Coltheart et al. (2001) pointed out the benefits of explicitly defined connections in their dual-route cascaded (drc) model of visual word recognition and reading out loud, as they grant the researcher exact control over the macro-level computations. However, as the scope of the models increases, “hand-wiring” the connections in the model becomes increasingly difficult. Therefore, most current models employ back-propagation to learn the connection weights between units based on a large training dataset. Together, iac, pdp and drc models have been shown to account for many behavioral findings, such as reaction times and recognition errors, in both healthy volunteers and patients (McClelland and Rogers, 2003; McLeod et al., 2000; Perry et al., 2007). Furthermore, Laszlo and Plaut (2012) demonstrated how a pdp-style model can produce an N400-like signal by summing the activity of the computational units in specific layers of the model.

However, none of the aforementioned models can produce detailed quantitative data that can be directly compared with neuroimaging data, leaving researchers to focus on finding indirect evidence that the same macro-level computations occur in both the model and the brain, with varying levels of success (Barber and Kutas, 2007; Jobard et al., 2003; Protopapas et al., 2016). Due to the computational constraints at the time, the simulated environments in these models are extremely simplified (e.g. the Laszlo and Plaut (2012) model operated exclusively on 3-letter words), whereas the brain data will, by nature, reflect the reading process in more varied and realistic visual settings. An even bigger limitation is the manner in which the initial visual input is presented to the model. These models make use of “letter banks”, where each letter of a written word is encoded as a separate group of inputs. The letters themselves are encoded as either a collection of 16 predefined line segments (Figure 1) (Coltheart et al., 2001; McClelland and Rumelhart, 1981) or a binary code indicating the letter (Laszlo and Armstrong, 2014; Laszlo and Plaut, 2012). This rather high level of visual representation sidesteps having to deal with issues such as visual noise, letters with different scales, rotations and fonts, segmentation of the individual letters, and so on. More importantly, it makes it impossible to create the visual noise and symbol string conditions used in the meg study we were trying to reproduce (Vartiainen et al., 2011) to modulate the type-i and type-ii components. In order to model the process of visual word recognition to the extent where one may reproduce neuroimaging studies such as Vartiainen et al. (2011), we need to start with a model of vision that is able to directly operate on the pixels of a stimulus.

The 16 predefined line segments of McClelland and Rumelhart (1981) and five example letters

Models of vision seem to have converged on a sequence of convolution-and-pooling operations to simulate the macro-level behavior of the visual cortex (Lindsay, 2020; Serre et al., 2007). Such models have become much more powerful as advances in deep learning and its software ecosystem are rapidly changing our notion of what is computationally tractable to model (LeCun et al., 2015; Richards et al., 2019). Convolutional neural networks (cnns) have emerged that perform scale- and rotation-invariant visual object recognition at very high levels of accuracy (Dai et al., 2021; Krizhevsky et al., 2017; Simonyan and Zisserman, 2015). Furthermore, they model some functions of the visual cortex well enough that a direct comparison between network state and neuroimaging data has become possible (Devereux et al., 2018; Schrimpf et al., 2020; Yamins and DiCarlo, 2016).

Visual word recognition can be seen as a specialized form of object recognition (Grainger, 2018). Developmental studies suggest that orthographic-specific neuroimaging findings such as the visual word form area (vwfa) and the type-ii evoked components emerge after part of our established vision system is reconfigured as we learn to read (Cohen and Dehaene, 2004; Dehaene-Lambertz et al., 2018a; Parviainen et al., 2006). Studies on the visual cortex of non-human primates have shown that visual systems even without extensive exposure to written language have neural representations that carry enough information to distinguish letter and word shapes (Rajalingham et al., 2020), even accounting for distortions (Katti and Arun, 2022), and this finding was reproduced in cnns. Therefore, cnns may very well be suitable tools for increasing the scale of traditional connectionist models of reading, thus allowing for a much more realistic modeling of the early orthographic processing steps than what has been possible so far. Indeed, earlier work has established that training a cnn to classify written words can coerce it to form a vwfa-like region on top of an existing visual system (Hannagan et al., 2021), and cause it to form high-level orthographic representations (Katti and Arun, 2022; Testolin et al., 2017) which can be used by a linear regression algorithm to predict human meg activity in the visual cortex (Caucheteux and King, 2022).

Whereas traditional-style models are typically designed to replicate a specific experimental effect in a biologically feasible manner, deep learning models are designed to perform in a more naturalistic setting, regardless of biological plausibility. Consequently, traditional-style models have been evaluated by comparing the timing and amount of the simulated activity to the amplitude of an evoked component (Laszlo and Armstrong, 2014; Laszlo and Plaut, 2012) or human reaction times (Agrawal et al., 2020; Norris and Kinoshita, 2012) in a tightly controlled experimental setting, and deep learning models have been evaluated using data from participants who are operating in a more challenging setting, such as watching movies (Huth et al., 2012), reading sentences (Caucheteux and King, 2022) or listening to stories (Huth et al., 2016). For the latter type of comparisons, information-based metrics are used. For example, representational similarity analysis (rsa) (Kriegeskorte et al., 2008) quantifies the extent to which the inner state of a model discriminates between the same types of stimuli as a neural response, and BrainScore (Schrimpf et al., 2020) quantifies the extent to which a (usually linear) regressor or classifier can use the inner state of a model to predict brain activity. However, information-based metrics by themselves provide at best only a rough indication of how closely a model mimics the computations performed by the brain. It would be desirable to establish a more direct link between a model and the experimental results obtained in neuroscience studies (Bowers et al., 2022). Therefore, in the current study, we evaluated the performance of a cnn in the spirit of the traditional models, namely by its ability to accurately reproduce the behavior of the type-i, type-ii and n400m responses in the tightly controlled experimental setting of Vartiainen et al. (2011). The resulting model combines the ability of deep learning models to operate directly on the pixels of a stimulus and to have a large vocabulary, with the ability of traditional models to replicate specific experimental results.

We started with a basic cnn architecture, namely vgg-11 (Simonyan and Zisserman, 2015), that had been pre-trained to classify images and evolved it into a model of visual word recognition by re-training it to classify written words. Different versions of the model were created by introducing slight modifications to the base architecture and using different training regimens. The performance of the trained models was evaluated by presenting them with the same set of stimuli that had been used by Vartiainen et al. (2011) and comparing the amount of activity within its layers with the amplitude of the type-i, type-ii, and n400m components measured from the human participants.

2 Results

2.1 The three MEG responses have unique response profiles

The computational models in this study were evaluated by their success in replicating the functional effects observed in the evoked meg components as recorded by Vartiainen et al. (2011) during the presentation of written words, pseudowords, consonant strings, symbol strings and noise-embedded words (Figure 2A) to 15 participants. Performing minimum norm estimate, dynamic statistical parametric mapping (mne-dspm) (Dale et al., 2000) source analysis yielded a comprehensive map of where activity can be found (Figure 2B), which is used in this study mostly to provide context for a deeper investigation into three peaks of activity that can be observed along the ventral stream. Through guided equivalent current dipole (ecd) modeling (Salmelin, 2010), the high-dimensional data were summarized as a sparse set of ecds, each one capturing an isolated spatiotemporal component, allowing us to study selected components in more detail. The original study (Vartiainen et al., 2011) found three ecd groups along the ventral stream that correspond to the type-i, type-ii and n400m responses, based on their location (Figure 2B), timing (Figure 2C) and responses to the different stimulus types (Figure 2D). Similar ecd groups were identified in the right hemisphere (Supplementary Figure 1), but the effects were clearer in the left. Therefore, in this study, we re-used the three ecd groups in the left hemisphere to serve as ground truth for our modeling efforts.

Summary of the meg results obtained by Vartiainen et al. (2011).

A: Examples of stimuli used in the meg experiment. Each stimulus contained 7–8 letters or symbols.

B: Source estimate of the evoked meg activity, using mne-dspm. The grand-average activity to word stimuli, averaged for three time intervals, is shown in orange hues. For each time interval, white circles indicate the location of the most representative left-hemisphere ecd for each participant, as determined by Vartiainen et al. (2011).

C: Grand-average time course of signal strength for each group of ecds in response to the different stimulus types. The traces are color-coded to indicate the stimulus type as shown in A. Shaded regions indicate time periods over which statistical analysis was performed.

D: For each group of ecds shown in B, and separately for each stimulus type (different colors, see A), the distribution (and mean) of the grand-average response amplitudes to the different stimulus types, obtained by integrating the ecd signal strength over the time intervals highlighted in C. Whenever there is a significant difference (linear mixed effects (lme) model, p < 0.05, false discovery rate (fdr) corrected) between two adjacent distributions, the corresponding difference in means is shown.

To compare the evoked meg responses and model layer activity, we collected summary statistics of both, referred to as “response profiles”. For the ecd time courses, the response profiles were formed by computing for each stimulus the average signal amplitude in the time intervals indicated in Figure 2C, z-transformed across the stimuli, and averaged across participants to obtain grand-average response profiles (Figure 2D). The response profile for the type-i response demonstrates how this occipital component is driven by the visual complexity of the stimulus and is characterized by a large response to noise-embedded words relative to all other stimulus types. The type-ii response, located further along the fusiform gyrus, exhibits sensitivity to whether the stimulus contains letters that are part of the participant’s native alphabet and, in contrast to the type-i, has a reduced response to the noise-embedded words. The n400m response is located in the temporal cortex and is modulated by the lexical content of the stimulus, manifested in this study as a larger response to the word-like (i.e., words and pseudowords) versus the non-word-like stimuli.
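As a rough illustration of this procedure, a minimal sketch (assuming the ECD time courses are available as a NumPy array) could look as follows; the array names and shapes are illustrative, not the exact analysis code used here.

```python
import numpy as np

# Minimal sketch: ecd_signal has shape (n_participants, n_stimuli, n_times) and holds the
# signal strength of one ECD group; times is the corresponding time axis in seconds.
def grand_average_response_profile(ecd_signal, times, t_start, t_stop):
    window = (times >= t_start) & (times <= t_stop)
    amp = ecd_signal[:, :, window].mean(axis=-1)              # mean amplitude per stimulus
    z = (amp - amp.mean(axis=1, keepdims=True)) / amp.std(axis=1, keepdims=True)
    return z.mean(axis=0)                                     # grand-average across participants
```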

2.2 CNNs produce responses that are qualitatively similar to the MEG evoked components

As a model of the computations underlying the brain activity observed during the meg experiment, we used a vgg-11 (Simonyan and Zisserman, 2015) network architecture, pretrained on ImageNet (Russakovsky et al., 2015). This architecture consists of five convolution layers (three of which perform convolution twice), followed by two densely connected layers, terminating in an output layer (Figure 3A). Variations of the model were trained to perform visual word recognition using training sets consisting of images of written Finnish words, rendered in varying fonts, sizes and rotations (Figure 3B). In each case, training lasted for 20 epochs, which was enough for each model to obtain a near perfect accuracy (> 99%) on an independent validation set, as well as being able to correctly identify the original 118 word stimuli used in the meg experiment. During training, the models were never exposed to any of the other stimulus types used in the meg experiment (noisy words, symbols, consonant strings, and pseudowords) nor to the exact word stimuli used in that experiment. After training was complete, response profiles for all layers were created by computing the 𝓁2 norm of the rectified linear unit (relu) activations of the units in each layer in response to each stimulus, followed by a z-transformation (Figure 4). Finally, the response profiles of the cnn models were compared to those of the three meg evoked components, both qualitatively by comparing the effects of the experimental contrasts and quantitatively through correlation. This allowed for an iterative design approach, where the network architecture and training diet of the model were manipulated to address shortcomings in the response profiles, until a model was reached whose layer response profiles matched those of the type-i, type-ii and n400m evoked meg components.
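As an illustration of how such layer response profiles can be extracted, the sketch below uses PyTorch forward hooks; the model, the probed layers and all names are assumptions for illustration rather than the exact implementation used in this study.

```python
import torch

# Sketch: `model` is a trained network, `stimuli` a tensor of shape (n_stimuli, 3, H, W)
# and `layers` a dict mapping illustrative layer names to the modules to be probed.
def layer_response_profiles(model, stimuli, layers):
    norms = {name: None for name in layers}
    hooks = []
    for name, module in layers.items():
        def hook(mod, inp, out, name=name):
            # l2 norm of the ReLU activations, one value per stimulus
            norms[name] = out.relu().flatten(start_dim=1).norm(dim=1)
        hooks.append(module.register_forward_hook(hook))
    with torch.no_grad():
        model(stimuli)
    for h in hooks:
        h.remove()
    # z-transform across stimuli to obtain one response profile per layer
    return {name: (v - v.mean()) / v.std() for name, v in norms.items()}
```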

Overview of the proposed computational model of visual word recognition in the brain.

A: The vgg-11 model architecture, consisting of five convolution layers, two fully connected layers and one output layer.

B: Examples of the images used to train the model.

Response profiles obtained from the models.

For each layer, the response profile, i.e. the z-scored magnitude of relu activations in response to the same stimuli as used in the meg experiment, is shown. Whenever there is a significant difference (t-test, p < 0.05, fdr corrected) between two adjacent distributions, the corresponding difference in means is shown.

The design process started with an evaluation of the standard vgg-11 model, trained only on ImageNet. The layer response profiles of this model showed a distinctive pattern in response to the different types of stimuli used in Vartiainen et al. (2011) (Figure 4, top row), where the stimuli containing a large amount of noise evoked a very high level of activation in the first convolution layer, which carried over into all other layers. Thus, this model simulated the type-i component well, but it failed to simulate the type-ii and n400m components. This is not surprising given that this model is illiterate.

Next, we trained the model on a training set containing a million examples of 250 possible written words (Figure 4, second row). After the model had learned to recognize written words (regardless of size, font family and some rotation), the convolution layers still showed a response profile similar to the type-i. However, the later linear layers now had a less extreme response to the noisy stimuli and also showed a dissociation between symbols and letters, making their response profiles more similar to that of the type-ii and n400m components of the evoked response. Nonetheless, in this version of the model, the response to noisy stimuli is still too large to be a good fit with the type-ii and n400m response profiles.

In vision research, having unit activations be noisy has been shown to make a system more robust against image perturbations, both in the brain and in cnns (Carandini et al., 1997; Dapello et al., 2020). In our case, the addition of a small amount of Gaussian noise to all unit activations in the model markedly reduced the response of the later layers to the noisy stimuli, resulting in less activity to the noisy stimuli than the other stimulus types (Figure 4, third row), while the recognition accuracy of the model was not degraded. We found that in order for the model to display this behavior, both noisy unit activations and batch normalization were necessary (Supplementary Figure 2). For this model, the response profiles of the first four convolution layers match that of the type-i component, and that of the first fully connected layer is similar to the type-ii component in that both the noisy word stimuli and symbol stimuli evoked smaller responses than stimuli containing letters. However, this model does not contain any layers with response profiles matching that of the n400m.

The n400m component of the evoked meg response is modulated by the lexicality of the stimulus, as demonstrated through the contrast between consonant strings, proper words and pseudowords. Whereas consonant strings evoked a smaller n400m response than proper words, pseudowords produced a response that was equal to (or even larger than) that to proper words (Figure 2D). In this version of the model, however, the response to pseudowords was smaller than that to proper words. This can be attributed to the relatively small vocabulary (250 words) of this model. The pseudowords used in the meg experiment were selected to be orthographically similar to words in the extended vocabulary of a native Finnish speaker, but not necessarily similar to any of the 250 words in the vocabulary of the model. Increasing the vocabulary size to 1000 words was enough to solve the problem regarding pseudowords and yielded a model where the response profiles of the early convolution layers matched that of the type-i, those of the two fully connected layers matched that of the type-ii, and that of the output layer matched that of the n400m (Figure 4, fourth row).

Further increasing the vocabulary size to 10 000 words initially had a detrimental effect on the model’s ability to simulate the n400m component, as the response to consonant strings was nearly as strong as that to words and pseudowords (Figure 4, fifth row). In order to reproduce the response profile of the n400m during all experimental conditions of the meg experiment, a model needs to hit a sweet spot regarding specificity of activation to in-vocabulary versus out-of-vocabulary letter strings. We found that this specificity can be controlled by manipulating the frequency with which words are presented to the model during training. By scaling the number of examples of a word in the training set in accordance with the frequency of occurrence of the word in a large text corpus (Kanerva et al., 2014), a model was obtained with a large vocabulary of 10 000 words, in which the response profiles of the convolution layers matched that of the type-i, those of the fully connected layers matched that of the type-ii, and that of the output layer matched that of the n400m (Figure 4, bottom row). These response profiles were stable across different random initializations when training the model (Supplementary Figure 3).

Several variations of the model architecture and training procedure were evaluated (Supplementary Figures 4–6). We found that reducing the number of layers, by removing either convolution layers or fully connected layers, resulted in a model that did not exhibit the desired change in response to noisy stimuli from an enlarged response in the early layers to a diminished response in the later layers (Supplementary Figure 4). Increasing the number of layers resulted in a model that is very similar to the final model. As the vocabulary of the final model (10 000 words) exceeds the number of units in its fully connected layers, a bottleneck is created in which a sub-lexical representation is formed. The number of units in the fully connected layers, i.e. the width of the bottleneck, did not greatly affect the response profiles of the model (Supplementary Figure 5).

Models with either randomly initialized weights or weights trained only on ImageNet had response profiles in which each layer resembled that of the type-i response (Supplementary Figure 6, top rows). Training on word stimuli was required to simulate the type-ii and n400m components. While starting from a model pre-trained on ImageNet reduced the number of epochs necessary for the model to reach >99 % accuracy, pre-training was not found to be strictly necessary in order to arrive at a good model that can simulate all three components (Supplementary Figure 6, bottom rows).

2.3 CNNs produce responses that correlate well with the MEG evoked components

Since both model layer activation and meg evoked response amplitudes are described by a single number per stimulus in the response profiles, comparison between response profiles can be performed through straightforward correlation analysis (Figure 5, Figure 6). The models were designed to simulate the response amplitudes of an idealized average human participant, consistently producing the same response to a given stimulus, without attempting to model inter-participant variability and variability across repetitions. These sources of variability are present in the meg data and place a boundary on the maximal obtainable correlation score, which we will refer to as the noise ceiling, following Kriegeskorte et al. (2008). Evoked meg response amplitudes can vary substantially between repetitions of the same stimulus (either when presenting it multiple times to the same participant or presenting it to multiple participants) and are hence typically studied by averaging across repetitions (Luck, 2014). The need for averaging becomes apparent when looking at the estimated noise ceilings for single-participant data (Figure 5). The pessimistic estimate for the single-participant noise ceiling is very low for all three meg components (computed as the average correlation between the response amplitudes of a single participant versus those averaged across all remaining participants (Kriegeskorte et al., 2008)), indicating a large amount of inter-participant variability. When averaging across participants, the estimated noise ceilings are much higher (estimated through a procedure proposed by Lage-Castellanos et al. (2019)), which is why we will focus on the comparison between the models and the average meg response amplitudes.
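For reference, the pessimistic single-participant estimate can be sketched as a leave-one-out correlation; the variable names below are illustrative.

```python
import numpy as np

# Sketch: `profiles` has shape (n_participants, n_stimuli), holding per-participant
# response profiles for one MEG component.
def pessimistic_noise_ceiling(profiles):
    corrs = []
    for i in range(profiles.shape[0]):
        others = np.delete(profiles, i, axis=0).mean(axis=0)  # average of remaining participants
        corrs.append(np.corrcoef(profiles[i], others)[0, 1])
    return np.mean(corrs)
```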

Correlations between all models and MEG responses.

For each model, the maximum correlation between the response profiles obtained from all layers of the model and the response profiles of the type-i, type-ii and n400m evoked meg components. This was evaluated both at the grand-average level and for each individual subject. Estimated noise ceilings are indicated with vertical lines. The grand-average analysis has only a single estimate; for the single-subject analysis, the shaded area indicates the range between the pessimistic and optimistic estimates, between which the true noise ceiling should lie. Significant differences between neighboring distributions are indicated with stars (paired t-test, *p < 0.05, **p < 0.01, ***p < 0.001, fdr corrected).

A closer look at the relationship between the final model and MEG responses.

A: Relationship between the response profiles obtained from evoked meg activity, quantified using three ecd groups, and the response profiles obtained from three layers of the model. Kernel density distributions are shown at the borders.

B: Correlation between the mne-dspm source estimate and the model. Grand-average source estimates were obtained in response to each stimulus. The correlation map was obtained by correlating the activity at each source point with that for three layers of the model. The correlation map is shown at the time of peak correlation (within the time windows indicated in Figure 2C). Only positive correlations are shown.

The amount of activity in the first convolution layer of each model we evaluated correlates with the amplitude of the type-i component up to the estimated noise ceiling, which is in line with our qualitative finding that all models produce similar response profiles in the first convolution layers. For the type-ii and n400m components, the illiterate model (trained only on ImageNet, Figure 5 top row) shows a strong negative correlation, which can be attributed to its invalid response to the noisy word condition. Training the model on orthographic stimuli greatly improves its correlation with these components (Figure 5, second row), but also here we see that the model without noisy unit activations, which failed to produce the proper response to the noisy word condition, has a weak correlation with the type-ii and n400m components. Among the models with noisy unit activations (Figure 5, bottom four rows), which produced the proper response to the noisy word condition, the differences in correlation with the type-ii and n400m components are much smaller. Notably, the models that produced qualitatively incorrect n400m response profiles (Figure 5, rows three and five) have slightly higher correlations with the n400m component than the model that produced qualitatively correct responses (Figure 5, fourth row). This shows how a poor simulation of a brain response during one experimental condition can be compensated for by a better simulation of the brain response during another. However, the final model that produces correct responses to all experimental conditions and has the largest vocabulary (Figure 5, bottom row) also has the strongest correlation with the components of the evoked response. Still, there remains a gap between the model-brain correlations and the estimated noise ceiling, indicating there might remain some reproducible variance in the type-ii and n400m components that the model does not capture.

Upon closer examination of the relationship between the response profiles of the final model and the meg evoked components (Figure 6A), we see that the correlation is mostly driven by differences in response strength between the experimental conditions. Within each stimulus type, meg evoked component amplitude and model layer activation are not significantly correlated (p > 0.05 for all components), which is not surprising as the stimuli were designed to minimize within-condition variation. To verify that the three ecd groups capture the most important correlations between the brain activity and the model, we compared the model to cortex-wide mne-dspm source estimates (Figure 6B). The areas with maximum correlation appear around the locations of the ecds and at a time that is close to the peak activation of the ecds.

To test whether a multivariate analysis would yield additional insights, we trained a linear ridge regression model to predict brain activity given the activity within the layers of a model, what Schrimpf et al. (2020) refer to as BrainScore. BrainScores were computed for both the illiterate model, trained only on ImageNet (cf. Figure 4, top row), and the final model (cf. Figure 4, bottom row). We found that for both models, every layer could be used to predict the brain activity at key locations along the ventral stream (Supplementary Figure 7). Furthermore, there was very little difference in BrainScore between the illiterate and the final model. Hence, this information-based metric was in this case not helpful for evaluating model-brain fit.
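A rough sketch of this type of analysis, using cross-validated ridge regression from scikit-learn, is given below; it approximates the BrainScore procedure rather than reproducing the exact pipeline of Schrimpf et al. (2020), and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

# Sketch: `layer_activity` has shape (n_stimuli, n_units) and `source_activity` (n_stimuli,)
# holds the evoked activity at one source point, averaged over a time window.
def brain_score(layer_activity, source_activity, cv=5):
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))
    return cross_val_score(model, layer_activity, source_activity, cv=cv).mean()
```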

2.4 CNNs can reproduce experimental results beyond the original MEG study

The experimental contrasts employed in the meg experiment were designed to distinguish the type-i, type-ii and n400m components from one another. However, many more experimental contrasts have been used to study these components in the past. We explored the effect of some of these on our final model (Figure 7) in order to ascertain the extent to which the model can simulate the behavior of the three components under conditions extending beyond the original meg data that guided its design.

Post-hoc exploration of various experimental contrasts.

For each contrast, four sample stimuli are shown to demonstrate the effect of the manipulated stimulus property, and below them are the correlations between the manipulation and the amount of activity in each layer of the final model. For the experimental contrasts indicated with a number, one or more confounding factors were corrected for (partial correlation). Different colors indicate convolution layers (blue), fully connected layers (orange) and the output layer (green).
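The partial-correlation correction for confounds can be sketched as follows, assuming one manipulated property, one layer-activity vector and a matrix of confounding properties; all names are illustrative.

```python
import numpy as np

# Sketch: correlate a stimulus property with layer activity after regressing the
# confounding properties out of both variables.
def partial_correlation(prop, activity, confounds):
    design = np.column_stack([np.ones(len(prop)), confounds])
    def residualize(v):
        beta, *_ = np.linalg.lstsq(design, v, rcond=None)
        return v - design @ beta
    return np.corrcoef(residualize(prop), residualize(activity))[0, 1]
```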

In Tarkiainen et al. (1999), the difference between the type-i and type-ii component was highlighted by contrasting stimuli that contained only noise with stimuli that contained both noise and letters. Our model successfully replicates the finding that the amplitudes of both the type-i and type-ii components increase with the amount of visual noise, except when letters are present, in which case the amplitude of the type-ii component decreases with the amount of noise while that of the type-i component is unaffected. This striking change in behavior occurs in the model when the layer type switches from convolution to fully connected linear. Furthermore, our model reproduces the finding by Tarkiainen et al. (1999) that the amplitudes of both the type-i and type-ii components increase when a stimulus contains more letters or symbols; however, given two strings of equal length, the amplitude of the type-ii component is larger for a string containing letters than for a string containing symbols, while the amplitude of the type-i component is similar for both strings.

Basic visual features such as font family and font size have a large effect on the activity in the initial convolution layers of the model, and less so on the activity in the later layers. This roughly corresponds to findings in the masked repetition priming eeg literature, where early components such as the n/p150 are affected by such properties, whereas later components are not (Chauncey et al., 2008; Grainger and Holcomb, 2009; Hauk and Pulvermüller, 2004). However, our results may not be directly comparable to that literature, because our model cannot perform repetition priming. Even so, it is surprising that the fully connected linear layers in the model show a decrease in activity with increasing font size, rather than being unaffected.

Using invasive recordings, Woolnough et al. (2021) found effects of letter and bigram frequency in the occipitotemporal cortex, where we find the type-ii component in meg, but not in occipital cortex, where we find the type-i component. This result has been shown in eeg studies as well (Grainger, 2018; Laszlo and Federmeier, 2014). Our model shows similar sensitivity to letter and bigram frequency in the later convolution and fully connected layers. However, we found that our model did not show any sensitivity to word frequency or orthographic neighbourhood size, even though these are known to affect later components in eeg studies (Dambacher et al., 2006; Holcomb et al., 2002).

3 Discussion

In this study, we demonstrated how training a cnn for visual word classification results in a model that can explain key functional effects of the evoked meg response during visual word recognition, provided some conditions to enhance biological realism are met. Necessary attributes of the cnn and its training diet were found to be that unit activations should be noisy, the vocabulary should be large enough, and for very large vocabularies, the frequency of occurrence of words in the training set should follow their frequency in the language environment.

In contrast to traditional models of reading in the brain that operate on collections of predefined lines or letter banks (for example: Coltheart et al., 2001; Laszlo and Armstrong, 2014; McClelland and Rumelhart, 1981), a cnn simulates the entire processing pipeline starting from the raw pixel values of a bitmap image. This allowed us to evaluate model-brain correspondence using the same experimental contrasts as have been used in neuroimaging studies of visual word recognition. Following a neuroconnectionist approach (Doerig et al., 2023), we started with an illiterate cnn, trained to perform image classification, and iteratively made changes to its architecture and training diet, until it could replicate the effects of manipulations ranging from low-level visual properties such as noise levels, fonts and sizes, to mid-level orthographic properties such as symbols versus letters, to high-level lexical properties such as consonant strings versus pseudowords versus real words. Some of these contrasts probe what we consider to be “edge cases” of visual word processing: what happens when the brain tries to recognize a word and fails. Symbol strings, consonant strings and pseudowords do not occur often in our natural environment and we do not explicitly train ourselves to distinguish them from real words when we learn to read. Following this logic, we included only real words in the training set for the model, so that responses to the nonword stimuli were simulations of a computational process failing to properly recognize a word.

The resulting models bridge an important gap in modeling visual word recognition, by establishing a direct link between computational processes and key findings in experimental neuroimaging studies.

3.1 A computational process exhibiting type-i, type-ii and n400m-like effects

In this study, the computational process of visual word recognition was taken to be a transformation from a bitmap image containing a written word (with some variation in letter shapes, size and rotation) to the identity of the word. Many network architectures can be found in the deep-learning literature that could be used to implement such a process. For this study, we chose the vgg architecture (Figure 3), where high accuracy is achieved by performing the same few operations many times, rather than many different operations or clever tricks. The operations performed by the network are: convolution, pooling, batch normalization, and linear transforms. Our rationale was to start with these operations, which already have links to the neuroscience of vision (Carandini et al., 1997; Lindsay, 2020; Serre et al., 2007; Yamins and DiCarlo, 2016) and only when they prove insufficient to explain a desired experimental effect, introduce others. This turned out to be unnecessary for explaining the results of Vartiainen et al. (2011), but one can imagine the necessity of introducing recurrence (Kubilius et al., 2019) or attention mechanisms (Vaswani et al., 2017) in future extensions of the model into, for example, semantic processing.

The visual word recognition task itself did not prove particularly difficult for a cnn, as all variations of the network architecture and training diet that were tried resulted in a model that achieved near perfect accuracy. However, not all variations produced response profiles matching those of the evoked components (Figure 4, Supplementary Figure 4). This validates a criticism sometimes heard concerning whether deep learning models are brain-like, namely that the ability of a model to perform a task that once could only be done by a brain is not enough evidence that a model computes like a brain (Bowers et al., 2022). Indeed, we found that task performance alone, i.e. the identification of the correct stimulus word, was a poor indicator for judging a model’s suitability as a computational hypothesis for the brain. More compelling evidence was obtained by comparing the model’s internal state to measurements of brain activity. Two variations of the model, with respective vocabulary sizes of 1k and 10k words, implement the computational process in such a way that they produce amounts of activity within their layers that resemble the amplitudes of the type-i, type-ii and n400m components of the meg evoked response. The model with the larger vocabulary was chosen to be the final iteration of the model and subjected to a more extensive range of experimental contrasts (Figure 7).

The computational process implemented by the final model starts with a series of convolution-and-pooling operations. Individual convolution units have a restricted receptive field, which is gradually widened through pooling operations. The units in the first layer see a patch of 3 × 3 pixels and those in the fifth and final convolution layer combine information from patches of 18 × 18 pixels. This setup enforces a hierarchical representation forming in those layers, from edges to line segments and corners to (parts of) letter shapes (Dehaene et al., 2005; Dehaene-Lambertz et al., 2018b). Hence, increasing the visual complexity, i.e. adding more edges to an image by for example introducing visual noise or adding more symbols, will increase the activity of the convolution units. This behavior corresponds closely to that of the type-i component, which is in line with the growing body of evidence of convolution-and-pooling being performed in visual cortex (Lindsay, 2020; Serre et al., 2007).

The kind of information selected for in the later layers of the model changes through the introduction of non-relevant input signals during training, here done through noisy unit activations. Without noisy activations, the task of the computational units is limited to finding features to distinguish one letter from another. With noisy activations, their task also entails isolating letter-shape information from non-relevant information. The property of not merely having a preference for, but actively seeking to isolate certain types of information is key for producing the response profiles of the type-ii and n400m components. When receptive field sizes were 8 × 8 pixels and larger (fourth and fifth layers), the convolution units showed a preference for letters over symbols and a preference for more frequently used letters. This orthographic tuning is also the hallmark of the type-ii component. However, while the amplitude of this component increases with increasing visual complexity through the addition of more symbols or letters to a stimulus (Tarkiainen et al., 1999), an important aspect of the type-ii is that manipulations that reduce readability, such as the addition of visual noise, decrease its amplitude, even though the overall visual complexity of the stimulus is increased. It is in this aspect that the convolution units in our models are not necessarily a good fit for the type-ii response, since their tight receptive field strongly ties their activation to that of their counterparts in the previous layer. Increased activation in the early convolution layers tends to carry over, with the later convolution layers having only limited capabilities for filtering out noise. With small vocabularies, we did observe noise-filtering behavior in the fifth convolution layer, but less so when vocabulary size was increased from 250 to 1k and the effect was gone with a vocabulary of 10k.

In the model, convolution units are followed by pooling units, which serve the purpose of stratifying the response across changes in position, size and rotation within the receptive field of the pooling unit. Hence, the effect of small differences in letter shape, such as the usage of different fonts, was only present in the early convolution layers, in line with findings in the eeg literature (Chauncey et al., 2008; Grainger and Holcomb, 2009; Hauk and Pulvermüller, 2004). However, the ability of pooling units to stratify such differences depends on the size of their receptive field, which might explain why, in later layers, we observed a decreased response to stimuli where text was rendered with a font size exceeding the receptive field of the pooling units (Figure 7).

In the final model, the convolution-and-pooling phase is followed by three fully connected layers, which pool information across the entire input image. Their large receptive fields make them more effective in isolating orthographic information from noise than the convolution layers (Figure 4) and their response profile matches that of the type-ii component well, also when the model has a large vocabulary. Moreover, units in the fully connected layers can see multiple letters, making them sensitive not only to letter frequency, but also bigram frequency, allowing them to replicate these effects on the type-ii component (Woolnough et al., 2021).

The final fully connected layer, the output layer, represents the model’s notion of a mental lexicon: a collection of known word-forms, not yet linked to their semantic meaning (which we did not model at all). Since orthographic word-form and semantic meaning are generally only very loosely coupled (Monaghan et al., 2014), the lexicon often appears as a separate unit in computational models (Woollams, 2015). At face value, since we employed cross-entropy as the loss function during training, the output layer simulates the lexicon following a localist structural approach, which harks back to the earliest models of visual word recognition (McClelland and Rumelhart, 1981; Morton, 1969). In the localist approach, each word is represented by a dedicated unit that should only be active when the sensory version of that word (in this study, a bitmap image containing the word as rendered text) is provided as input to the model, to the exclusion of all other words in the vocabulary. This is known in the machine learning literature as “one-hot” encoding. The localist approach is often contrasted with the distributed approach, wherein each word is represented by a unique combination of multiple output units that respond to subword parts (e.g., n-grams, pairings of letters in the first and final position, etc.), which is a more efficient encoding scheme (Seidenberg and McClelland, 1989). Both the localist and distributed approaches have their strengths and weaknesses when it comes to explaining experimental results, and the debate on how exactly the human lexicon is encoded in the brain is far from settled (Woollams, 2015). Interestingly, because our model covers the entire process from pixel to word-form, its simulation of the lexicon is not strictly localist, but combines both localist and distributed approaches.
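The localist training objective can be illustrated with a short sketch: each vocabulary word corresponds to one output unit, and cross-entropy pushes that unit high for images of the word. The layer sizes match the final model described above, but the snippet is illustrative rather than the actual training code.

```python
import torch
import torch.nn as nn

vocab_size = 10_000
output_layer = nn.Linear(4096, vocab_size)     # final fully connected (output) layer
loss_fn = nn.CrossEntropyLoss()

hidden = torch.randn(32, 4096)                 # activity of the preceding layer (batch of 32 images)
targets = torch.randint(0, vocab_size, (32,))  # index of the correct word for each image
loss = loss_fn(output_layer(hidden), targets)  # "one-hot" objective: one correct unit per word
```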

When the input image contains a word that is present in the model’s vocabulary, it is true that the corresponding word-specific unit in the output layer will have the highest amount of activation of all units in the output layer. However, given the connectionist nature of our model, all units tuned to orthographically similar words will co-activate to some degree. This co-activation is important when it comes to simulating the amplitude of the n400m component in response to pseudowords, which is as large as (and sometimes even larger than) the response to real words. In the model, a pseudoword does not have a dedicated output unit, but given a large enough vocabulary, there will be enough units that respond to orthographically similar words to produce a large amount of activity nonetheless. Crucially, letter strings that are not word-like, such as consonant strings, will not activate many units in the output layer, which is an important difference between the response profiles of the n400m and the type-ii. Furthermore, in our final model, the first two fully connected layers have fewer units (4096) than the number of words in the vocabulary (10 000), which produces a distributed representation, where one unit is involved in the encoding of many words. The response profiles of both the second fully connected layer (distributed representation) and the output layer (localist representation) resemble that of the n400m response. Hence, our model does not provide direct cause for preferring a localist or distributed approach when modeling the lexicon, but shows how both approaches interact when creating a model of the entire process from sensory input to lexicon: the localist approach can function as training objective and a distributed representation emerges in the hidden layers.

3.2 Limitations of the current model

The vgg-11 architecture was originally designed to achieve high image classification accuracy on the ImageNet challenge (Simonyan and Zisserman, 2015). Although we have introduced some modifications that make the model more biologically plausible, the final model still falls short in many ways of being a complete model of brain function during reading.

One important limitation of the current model is the lack of an explicit mapping from the units inside its layers to specific locations in the brain at specific times. Multiple layers have a maximum correlation with the source estimate at the exact same location and time (Figure 6B). Furthermore, there is no clear relationship between the number of layers the signal needs to traverse in the model and the processing time in the brain. The signal needs to propagate through five convolution layers to reach the point where it matches the type-ii response at 140 ms to 200 ms, but only through one more additional layer to reach the point where it starts to match the n400m response at 300 ms to 500 ms. This also raises the question of what the brain is doing during this comparatively long interval. Still, cutting down on the number of times convolution is performed in the model seems to make it unable to achieve the desired suppression of noise (Supplementary Figure 4). A promising way forward may be to use a network architecture like cornet (Kubilius et al., 2019), which performs convolution multiple times in a recurrent fashion, yet simultaneously propagates activity forward after each pass. The introduction of recurrence into the model would furthermore align it better with traditional-style models, since it can cause a model to exhibit attractor behavior (McLeod et al., 2000), which will be especially important when extending the model into the semantic domain.

Despite its limitations, our model is an important milestone for computational models of reading: it leverages deep learning techniques to extend the scope of traditional-style models to encompass the entire computational process, starting from raw pixel values and ending with representations of word-forms in the mental lexicon. This allowed us to directly compare activity within the model to meg neuroimaging results and show how the stimulus-dependent amplitudes of the type-i, type-ii and n400m components of the meg evoked response can be computationally derived from a series of convolution-and-pooling steps, followed by several fully connected linear operations. Overall, we found that a qualitative evaluation of the response profiles (Figure 4) gave the most direct path to finding ways to improve the model. It was more informative than the correlation scores (Figure 5, Supplementary Figure 7), which were mostly driven by the large differences between the noise-embedded words and symbol strings versus the other stimulus types. This underlines the importance of finding ways to directly link the activity simulated within a model to measurements obtained from the brain and to reason about them in both a quantitative and qualitative fashion.

4 Materials and Methods

4.1 MEG study design and data analysis

The meg data that had been collected by Vartiainen et al. (2011) were re-analyzed for the current study using MNE-Python (Gramfort et al., 2013) and FreeSurfer (Dale et al., 1999). We refer to the original publication for additional details on the study protocol and data collection process.

Simultaneous meg and eeg were recorded from 15 participants (who gave their informed consent, in agreement with the prior approval of the Ethics Committee of the Helsinki and Uusimaa Hospital District). The stimuli consisted of visually presented Finnish words, pseudowords (pronounceable but meaningless), consonant strings (random consonants), symbol strings (randomly formed from 10 possible shapes), and words embedded in high-frequency visual noise (Figure 2A). Each stimulus category contained 112 different stimuli and each stimulus contained 7–8 letters/symbols. The stimuli were presented sequentially in a block design. Each block started with a random period of rest (0 ms to 600 ms), followed by 7 stimuli of the same category. Each stimulus was shown for 300 ms with an inter-stimulus interval of 1500 ms of gray background. In addition to the five stimulus conditions (16 blocks each), there were 16 blocks of rest (12 s). An additional 5 target blocks were added, in which one stimulus appeared twice in a row; participants were asked to detect these repetitions and respond with a button press. The target blocks were not included in the analysis. The same paradigm was applied in an fmri-eeg recording of the same participants as well (data not used here).

The data were recorded with a Vector View device (Elekta Neuromag, now MEGIN Oy, Finland) in a magnetically shielded room. In addition, vertical and horizontal electrooculography (eog) were recorded. The signals were bandpass filtered at 0.03 Hz to 200 Hz and digitized at 600 Hz. For analysis, the data were further processed using the maxfilter software (MEGIN Oy, Finland) and bandpass filtered at 0.1 Hz to 40 Hz. Eye-movement and heart-beat artifacts were reduced using independent component analysis (ica): the signal was high-passed at 1 Hz, and components with correlations larger than 5 standard deviations with any of the eog channels or an estimated electro-cardiogram (based on the magnetometers without the maxfilter applied) were removed (3–7 out of 86–108 components). Epochs were cut −200 ms to 1000 ms relative to the onset of the stimulus and baseline-corrected using the pre-stimulus interval. Participant-specific noise covariance matrices of the meg sensors were estimated using the pre-stimulus interval.
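A condensed sketch of this preprocessing pipeline in MNE-Python is shown below; file names, the ICA settings and the event definitions are placeholders, and the ECG-related steps are omitted for brevity.

```python
import mne

raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)   # placeholder file name
raw.filter(0.1, 40.0)                                           # band-pass for analysis

# ICA on 1 Hz high-passed data; remove components correlating with the EOG channels
ica = mne.preprocessing.ICA(n_components=0.99, random_state=0)
ica.fit(raw.copy().filter(1.0, None))
eog_inds, _ = ica.find_bads_eog(raw)
ica.exclude = eog_inds
ica.apply(raw)

# Epoching relative to stimulus onset and noise covariance from the pre-stimulus interval
events = mne.find_events(raw)
epochs = mne.Epochs(raw, events, tmin=-0.2, tmax=1.0, baseline=(None, 0), preload=True)
noise_cov = mne.compute_covariance(epochs, tmax=0)
```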

Source estimation of the sensor-level signals was performed using two methods. The first, mne-dspm (Dale et al., 2000), yields a comprehensive map of where activity can be found, which is used in this study mostly to provide context for a deeper investigation into three peaks of activity that can be observed along the ventral stream. The second, guided ecd modeling (Salmelin, 2010), summarizes the high-dimensional data as a sparse set of ecds, each one capturing an isolated spatiotemporal component, allowing us to study selected components in more detail. The advantage of this sparse method over distributed source estimation techniques such as the minimum norm estimate (mne) or beamformers, is that in cases like Vartiainen et al. (2011), a single ecd may conveniently capture an isolated component of the evoked response. Furthermore, this sparsity makes it straightforward to take individual differences regarding the cortical origin of the evoked component into account, since the location and orientation of the ecds were fitted to optimally explain the data of each individual. For each individual, the matching between ecds and the type-i, type-ii and n400m responses was performed based on their approximate location, timing of peak activity, and behavior across the different stimulus types. The groups do not necessarily contain an ecd for each of the 15 participants, as some participants did not have an ecd to contribute that matched well enough in position and/or timing. The number of participants contributing an ecd to each group were: type-i: 14, type-ii: 14, and n400m: 15.

For both ecd and mne-dspm source estimation, 3-layer boundary element method (bem) forward models were used. These models were based on T1-weighted anatomical images acquired with a 3 T Signa excite magnetic resonance imaging (mri) scanner. FreeSurfer (Dale et al., 1999) was used to obtain the surface reconstructions required to compute the bem models. Vartiainen et al. (2011) defined three groups of ecds in the left hemisphere, associated with the type-i, type-ii and n400m responses. The locations and orientations of the ecds in these groups were re-used in the current study. For each individual epoch, the signal at each of the three ecds and a cortex-wide mne-dspm source estimate were computed using the noise covariance matrices and bem forward models. Finally, the locations of the ecds and mne-dspm source points were morphed to FreeSurfer’s template brain.
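Likewise for illustration only, the forward and inverse modeling steps could be expressed with MNE-Python along the following lines (the paths, source-space spacing and coregistration file are assumptions; epochs and noise_cov refer to the preprocessing sketch above):

```python
import mne

subjects_dir = "/path/to/freesurfer/subjects"   # hypothetical FreeSurfer directory
subject = "sub01"

# 3-layer BEM forward model from the FreeSurfer surface reconstruction.
bem_surfaces = mne.make_bem_model(subject, subjects_dir=subjects_dir)
bem = mne.make_bem_solution(bem_surfaces)
src = mne.setup_source_space(subject, spacing="oct6", subjects_dir=subjects_dir)
fwd = mne.make_forward_solution(epochs.info, trans=f"{subject}-trans.fif", src=src, bem=bem)

# MNE-dSPM source estimates for each individual epoch.
inv = mne.minimum_norm.make_inverse_operator(epochs.info, fwd, noise_cov)
stcs = mne.minimum_norm.apply_inverse_epochs(epochs, inv, lambda2=1 / 9, method="dSPM")

# Morph the source estimates to FreeSurfer's template brain.
morph = mne.compute_source_morph(stcs[0], subject_from=subject,
                                 subject_to="fsaverage", subjects_dir=subjects_dir)
stcs_fsaverage = [morph.apply(stc) for stc in stcs]
```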

4.2 Computational models

The models of the computational process underlying the brain activity observed during the meg experiment used a vgg-11 (Simonyan and Zisserman, 2015) network architecture, pretrained on ImageNet (Russakovsky et al., 2015), as defined in the TorchVision (Marcel and Rodriguez, 2010) module of the PyTorch (Paszke et al., 2019) package. This architecture consists of five convolution layers (the final three of which perform convolution twice), followed by two fully connected linear layers, terminating in a fully connected linear output layer (Figure 3A). Each individual convolution step is followed by batch normalization (Ioffe and Szegedy, 2015) and max-pooling. Every unit, including the output units, used the following non-linear activation function:

where N denotes Gaussian-distributed random noise with zero mean and standard deviation σ_noise. For versions of the model with noisy activations, σ_noise = 0.1; for versions without, σ_noise = 0.
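As a sketch, such a noisy rectified-linear activation could be implemented as a PyTorch module like the one below; note that the exact placement of the noise (here added to the input of the rectification) is an assumption, since only the noise distribution is described above:

```python
import torch
from torch import nn


class NoisyReLU(nn.Module):
    """ReLU with additive zero-mean Gaussian noise (sigma_noise = 0 disables the noise).

    The placement of the noise relative to the rectification is an assumption of
    this sketch; only the noise distribution is specified in the text.
    """

    def __init__(self, sigma_noise=0.1):
        super().__init__()
        self.sigma_noise = sigma_noise

    def forward(self, x):
        if self.sigma_noise > 0:
            x = x + torch.randn_like(x) * self.sigma_noise
        return torch.relu(x)
```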

The model was trained to perform visual word recognition using a training set of approximately 1 000 000 images, each depicting a word rendered in varying fonts, sizes and rotations (Figure 3B). An independent test set containing 100 000 images was used to track the performance of the model during training. None of the images in the training or test sets were exact duplicates of the images used as stimuli in the meg experiment; therefore, the stimulus set can be treated as an independent validation set. The task for the model was to identify the correct word (regardless of the font, size and rotation used to render the text) by setting the corresponding unit in the output layer to a high value.

Different versions of the model were created using vocabulary sizes of 250, 1 000 and 10 000 Finnish words. The vocabulary was compiled from the 112 words used in the meg experiment, extended with the most frequently used Finnish words as determined by Kanerva et al. (2014), restricted to words of 3–9 letters, excluding proper names, and using their lemma forms. No symbol strings, consonant strings, or pseudowords were present in the training set.

For the training sets that did not take word frequency into account, each word was rendered 100 times in various fonts, sizes and rotations. When word frequency was taken into account, the number of times c that each word occurred in the training set was scaled by its frequency f in a large Finnish text corpus (Kanerva et al., 2014) in the following manner:

r = f / f₀,    c = ⌊ (r / ∑r) · N ⌋

where f₀ is the frequency of the least commonly occurring word in the training vocabulary, r is the frequency ratio, ∑r is the total frequency ratio of all words in the training vocabulary, and N is the desired number of items in the training set, in our case 1 000 000. The actual number of items in the training set deviates slightly from N, as the floor operation causes ∑c ≤ N. The number of times a word occurred in the test set was always 10.
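A minimal sketch of this frequency-dependent scaling, following the formula above (the function and variable names are ours):

```python
import math


def training_counts(corpus_freqs, n_total=1_000_000):
    """Number of times each word is rendered in the frequency-scaled training set.

    corpus_freqs: dict mapping each vocabulary word to its corpus frequency f.
    """
    f0 = min(corpus_freqs.values())                       # frequency of the rarest word
    ratios = {w: f / f0 for w, f in corpus_freqs.items()}  # frequency ratios r
    total_ratio = sum(ratios.values())                     # sum of ratios over the vocabulary
    return {w: math.floor(r / total_ratio * n_total) for w, r in ratios.items()}
```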

Images were assembled by overlaying the rendered text onto an image of visual noise. Visual noise was rendered by randomly selecting a gray-scale value (0 % to 100 %) for each pixel. Next, if the training image contained text, the word was rendered in uppercase letters, using a randomly chosen font (15 possible fonts), font size (14 pt to 32 pt), and rotation (−20° to 20°). The list of fonts consisted of: Ubuntu Mono, Courier, Luxi Mono Regular, Lucida Console, Lekton, Dejavu Sans Mono, Times New Roman, Arial, Arial Black, Verdana, Comic Sans MS, Georgia, Liberation Serif, Impact, and Roboto Condensed.
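For illustration, a single stimulus image could be assembled along these lines using Pillow (the font file, image size and the way the text is composited onto the noise are assumptions):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

rng = np.random.default_rng(seed=0)


def render_stimulus(word, font_file, font_size, rotation, size=256):
    """Render an uppercase word on top of per-pixel grayscale noise (illustrative sketch)."""
    # Visual noise: a random gray value (0 %-100 %) for every pixel.
    noise = rng.integers(0, 256, size=(size, size), dtype=np.uint8)
    background = Image.fromarray(noise, mode="L")

    # Draw the word on a separate layer, then rotate that layer.
    text_layer = Image.new("L", (size, size), 0)
    draw = ImageDraw.Draw(text_layer)
    font = ImageFont.truetype(font_file, font_size)
    draw.text((size // 2, size // 2), word.upper(), fill=255, font=font, anchor="mm")
    text_layer = text_layer.rotate(rotation)

    # Composite dark text onto the noisy background, using the text layer as a mask.
    return Image.composite(Image.new("L", (size, size), 0), background, text_layer)


# Example usage (font path and parameters are hypothetical):
# img = render_stimulus("talo", "DejaVuSansMono.ttf", font_size=23, rotation=-10)
```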

For every variation of the model, the training procedure was the same. The training objective was to minimize the cross-entropy loss between the model output and a one-hot encoded vector of a length equal to the vocabulary size, indicating the target word. Optimization of the model’s parameters was performed using stochastic gradient descent with an initial learning rate of 0.01, which was reduced to 0.001 after 10 epochs (an “epoch” in this context refers to one pass through all 1 000 000 training images), and a momentum of 0.9. During training, the noisy-relu activation function was temporarily disabled for the output units, and re-enabled during evaluation and normal operation of the model. In total, training was performed for 20 epochs, by which point the performance on the test set had plateaued at around 99.6 % for all variations of the model.
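A minimal sketch of this training configuration in PyTorch is given below; the data loader is a placeholder, the replacement of the output layer follows the standard TorchVision vgg layout, and the noisy activations described above are not included:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import vgg11_bn

vocab_size = 10_000

# Pretrained VGG-11 with batch normalization; output layer resized to the vocabulary.
model = vgg11_bn(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(4096, vocab_size)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Drop the learning rate from 0.01 to 0.001 after 10 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

# Placeholder data; in the real setup these would be the rendered word images.
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, vocab_size, (8,))),
    batch_size=4,
)

for epoch in range(20):
    model.train()
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)  # cross-entropy against the target word
        loss.backward()
        optimizer.step()
    scheduler.step()
```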

4.3 Comparison between model and MEG data

To compare the model with the meg responses, statistical summaries of the high-dimensional data, which we refer to as “response profiles”, were constructed. These could be compared either qualitatively, by examining the pattern of activation across stimulus types, or quantitatively, by computing the Pearson correlation between them.

To construct the response profiles for a model, it was applied to the bitmap images that were used during the meg experiment (padded to be 256 × 256 pixels with 50 % grayscale pixels). The activity at each layer of the model was quantified by recording the relu activation right before the maxpool operation, and computing the 𝓁2 norm across the units. This yielded, for each stimulus image, a single number for each layer in the model representing the amount of activity within that layer. For each layer, the response profile was z-scored across the stimulus images.
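For illustration, these layer-wise response profiles could be extracted with forward hooks roughly as follows (this assumes a TorchVision-style vgg in which each max-pool is directly preceded by a relu):

```python
import torch
from scipy.stats import zscore


def layer_response_profiles(model, stimuli):
    """l2 norm of the ReLU activations entering each max-pool, per stimulus image.

    stimuli: tensor of shape (n_stimuli, 3, 256, 256).
    Returns a dict mapping layer name -> z-scored response profile (length n_stimuli).
    """
    activations = {}

    def make_hook(name):
        def hook(module, inputs, output):
            # inputs[0] is the ReLU output that feeds into this max-pool.
            flat = inputs[0].flatten(start_dim=1)
            activations[name] = flat.norm(p=2, dim=1).cpu().numpy()
        return hook

    handles = [
        layer.register_forward_hook(make_hook(f"conv_block_{i}"))
        for i, layer in enumerate(model.features)
        if isinstance(layer, torch.nn.MaxPool2d)
    ]
    model.eval()
    with torch.no_grad():
        model(stimuli)
    for handle in handles:
        handle.remove()

    # z-score each layer's profile across the stimulus images.
    return {name: zscore(values) for name, values in activations.items()}
```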

To construct the response profiles for the type-i, type-ii and n400m components, the neural activity represented by the corresponding ecd was quantified by taking the mean of its timecourse during an appropriate time window (indicated in Figure 2C; 64 ms to 115 ms for the type-i, 140 ms to 200 ms for the type-ii, and 300 ms to 400 ms for the n400m, after stimulus onset). This yielded, for each stimulus image, a single number indicating the amount of signal at the ecd. Collected across all stimuli, these numbers form the response profile of each ecd. The response profiles were first z-scored across the stimulus images independently for each participant and later averaged across participants for the grand-average analysis.
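A sketch of this computation for a single participant (the variable names and array shapes are assumptions):

```python
from scipy.stats import zscore


def ecd_response_profile(timecourses, times, window):
    """Response profile of a single ECD for one participant.

    timecourses: array of shape (n_stimuli, n_times) with the single-epoch ECD signal.
    times: array of sample times in seconds.
    window: (tmin, tmax) in seconds, e.g. (0.300, 0.400) for the N400m.
    """
    mask = (times >= window[0]) & (times <= window[1])
    profile = timecourses[:, mask].mean(axis=1)  # mean signal within the time window
    return zscore(profile)                       # z-score across the stimulus images


# Grand-average analysis: z-score per participant, then average across participants.
# profiles: array (n_participants, n_stimuli)
# grand_average_profile = profiles.mean(axis=0)
```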

The mne-dspm source estimates used 4000 source points, distributed evenly across the cortex. The response profile at each source point was computed by taking the maximum signal value of the source point’s timecourse within the same time windows as in the ecd analysis. This yielded, for each stimulus image, a single number per source point and participant, indicating the amount of signal at the source point. The response profiles were first z-scored across the stimulus images independently for each participant and later averaged across participants for the grand-average analysis.

4.4 Statistics

The comparisons between response profiles obtained from the models and the meg data were performed through Pearson correlation (Figure 5, Figure 6, Figure 7). In cases where distributions were tested for a significant difference in mean (Figure 4, Figure 5), this was done through two-tailed paired t-tests, as implemented in SciPy (Virtanen et al., 2020). In Figure 2D, multiple participants contribute to each data point in the distributions, hence a linear mixed effects (lme) model was used to compare distributions, with participant modeled as a random effect (both slope and intercept), as implemented in the Julia MixedModels package (Bates et al., 2023).
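For reference, comparable tests can be expressed in Python as sketched below; the statsmodels call is an equivalent formulation of the lme analysis, not the Julia code actually used, and the data are placeholders:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import pearsonr, ttest_rel

rng = np.random.default_rng(seed=0)

# Pearson correlation between a model response profile and an evoked-component profile.
model_profile = rng.standard_normal(112)      # placeholder profiles, one value per stimulus
component_profile = rng.standard_normal(112)
r, p_corr = pearsonr(model_profile, component_profile)

# Two-tailed paired t-test between per-participant values under two conditions.
t, p_ttest = ttest_rel(rng.standard_normal(15), rng.standard_normal(15))

# Linear mixed-effects model with participant as random effect (intercept and slope),
# analogous to the Julia MixedModels analysis used for Figure 2D.
df = pd.DataFrame({
    "value": rng.standard_normal(60),
    "condition": np.tile([0, 1], 30),
    "participant": np.repeat(np.arange(15), 4),
})
lme = smf.mixedlm("value ~ condition", df, groups=df["participant"],
                  re_formula="~condition").fit()
print(lme.summary())
```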

The noise ceilings, indicating the maximum obtainable correlation between the response profile of a model and those of the evoked components (Figure 5), were estimated using two different methods. A noise ceiling depends on the amount of variation in repeated measurements of the same response profile. When estimating the noise ceiling for a response profile obtained from a single participant, repeated measurements were available, since the same response profile could be obtained from multiple participants. Hence, the estimation procedure outlined in Kriegeskorte et al. (2008) could be used, which yields optimistic and pessimistic estimates that act as upper and lower bounds for the true noise ceiling. For each evoked component, the noise-ceiling estimate was computed as the mean correlation between the response profile obtained from each individual participant and the grand-average response profile. For the optimistic estimate, the individual under consideration was included in the grand average (hence contributing data twice); for the pessimistic estimate, the individual was not. When estimating the noise ceiling for the grand-average response profile, no repeated measurements were directly available. In this case, Lage-Castellanos et al. (2019) propose estimating the noise ceiling n based on the between-participant variation:

where σμ is the standard deviation (across the stimuli) of the grand-average response profile, the remaining term is the mean (across the stimuli) variance of the participant-specific response profiles, and N is the number of participants.
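The participant-level estimation procedure described above can be sketched as follows (profiles is assumed to hold one z-scored response profile per participant):

```python
import numpy as np


def noise_ceiling_bounds(profiles):
    """Optimistic (upper) and pessimistic (lower) noise-ceiling estimates.

    profiles: array of shape (n_participants, n_stimuli), each row a participant's
    z-scored response profile for one evoked component.
    """
    upper, lower = [], []
    for i in range(len(profiles)):
        grand_avg = profiles.mean(axis=0)                            # includes participant i
        leave_one_out = np.delete(profiles, i, axis=0).mean(axis=0)  # excludes participant i
        upper.append(np.corrcoef(profiles[i], grand_avg)[0, 1])
        lower.append(np.corrcoef(profiles[i], leave_one_out)[0, 1])
    return float(np.mean(upper)), float(np.mean(lower))
```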

4.5 Post-hoc experimental contrasts

The final model was not only tested with the stimuli of Vartiainen et al. (2011), but also with sets of stimuli designed to implement additional experimental contrasts (Figure 7). For each contrast, a different set of stimuli was generated. Unless mentioned otherwise, stimuli were text strings rendered on a gray background in the Arial font at a font size of 23 pixels.

Pure noise 1 000 images were generated containing a gray background, which was blended with Gaussian noise at varying levels between 0 % and 100 %, with no rendered text.

Noisy word All words of length 8 were selected from the original 10k vocabulary and rendered. They were blended with Gaussian noise at random levels between 0 % and 50 %.

Num. letters All words in the 10k vocabulary were rendered.

Num. symbols For each word in the 10k vocabulary, a symbol string of matching length was generated using the unicode characters “square”, “circle”, “triangle up”, “triangle down”, “diamond” and rendered.

Word vs. symbols The stimulus lists used for the “Num. letters” and “Num. symbols” contrasts were combined. For this contrast, the model response profile was correlated with a vector containing zeros indicating symbols and ones indicating words.

Font family All words in the 10k vocabulary were rendered in both the Impact and Comic Sans fonts. For this contrast, the model response profile was correlated with a vector containing zeros indicating stimuli rendered in the Impact font, and ones indicating stimuli rendered in the Comic Sans font.

Font size All words of length 8 were selected from the original 10k vocabulary. Each selected word was rendered at 8 font sizes ranging from 14 to 32 pixels.

Letter freq. 10 000 random letter strings of length 8 were generated and rendered. For this contrast, the model response profile was correlated with the total letter frequency of each random letter string, with letter frequencies computed on the 10k vocabulary.

Bigram freq. 10 000 strings consisting of 4 bigrams were generated and rendered. Since there is generally a strong correlation between letter and bigram frequency, the random bigrams were selected in a fashion designed to minimize this confounding factor. Bigrams were randomly selected from a set of 50 common bigrams (freq >= 30 000) with the lowest letter frequency, and a set of 50 uncommon bigrams (freq < 10 000) with the highest letter frequency. Bigram and letter frequencies were computed on the 10k vocabulary.

Word freq. All words in the 10k vocabulary were rendered. The word frequency was computed as the square root of the number of times the word occurred in a large corpus of Finnish text (Kanerva et al., 2014).

Orth. neighb. All words in the 10k vocabulary were rendered. The orthographic neighbourhood size was computed as the number of words in the 10k vocabulary with a Levenshtein distance of 1 to the target word.
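As an example of how one of these stimulus properties can be derived, the orthographic neighbourhood sizes could be computed with a brute-force sketch such as:

```python
def orthographic_neighbourhood_sizes(vocabulary):
    """Number of words within Levenshtein distance 1 of each word (brute-force sketch)."""

    def levenshtein(a, b):
        # Standard dynamic-programming edit distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    return {w: sum(levenshtein(w, other) == 1 for other in vocabulary if other != w)
            for w in vocabulary}
```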

4.6 Data and materials availability

The trained model and the meg derivatives necessary for performing the comparison are available at https://osf.io/nu2ep. As the consent obtained from the participants prohibits the sharing of personal data, only meg data that have been aggregated across participants are available. The code for training the model and reproducing the results of this paper is available at https://github.com/wmvanvliet/viswordrec-baseline.

Acknowledgements

We acknowledge the computational resources provided by the Aalto Science-IT project. This research was funded by the Academy of Finland (grants #310988 and #343385 to M.v.V, #315553 and #355407 to R.S.) and the Sigrid Jusélius Foundation (to R.S.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.