Introduction

Decision-making under time pressure is deeply embedded within the activities of daily life. To study the cognitive and neural processes underlying decision-making, psychologists fit computational models to response time (RT) data gathered from relatively simple cognitive tasks. Evidence accumulation models (EAMs), such as the diffusion-decision model [58], [59] and linear ballistic accumulator model (LBA) [8], are the most successful and widely-used computational models of decision-making. In the EAM framework, sensory evidence is accumulated (possibly with noise) until reaching a threshold, at which point an overt response is generated. As such, decision-making is described in terms of a small number of interpretable parameters that capture basic cognitive processes. Empirically, EAMs have been shown to capture RT distributions from a variety of cognitive tasks at the individual subject level [8], [58], [60], [80].

Despite this empirical success and theoretical merit, the simple characterization of decision-making encapsulated by EAMs results in somewhat restrictive applications [17]. For example, EAMs do not typically provide a detailed specification of how stimuli are transformed into evidence or of the motor processes that physically generate a response. On the sensory side, the modeler instead converts the raw high-dimensional stimulus into a small number of abstract perceptual features that drive the rate of evidence accumulation. In the context of visual tasks, as we study here, this implies that EAMs are not amenable to studying the visual processing steps underlying decision-making, despite decades of experimental and theoretical vision research. It is also difficult to model tasks with complex (e.g. naturalistic) visual stimuli with EAMs, since the relevant visual features must be specified by hand.

Here, we integrate convolutional neural network (CNN) models for visual feature extraction with traditional accumulator models of speeded decision-making in a framework we call the visual accumulator model (VAM). We use CNNs since they are image-computable: they accept arbitrary images as input, such as visual stimuli from cognitive tasks. CNNs also capture key features of biological vision [37], [41], [85]. Each layer of the CNN applies a convolution, nonlinearity, and pooling operation to the output of the previous layer with a bank of trainable filters. In early network layers, this results in spatially-organized units with local receptive fields, analogous to the retinotopic organization of the early mammalian visual pathway. The concatenation of multiple layers forms a hierarchy in which progressively more abstract features are extracted in deeper layers, in a way that is globally similar to the primate visual cortical hierarchy.

Two features of the VAM distinguish our framework from other models in which neural networks were used to generate RTs or RT proxies [22], [39], [55], [71], [77]. First, the VAM generates RTs via a traditional EAM, a component that prior models lack. The VAM therefore inherits the advantages in interpretability afforded by EAMs and allows us to connect to the rich decision-making literature from psychology. Second, the CNN and EAM components of the VAM are jointly fitted to the RT, choice, and raw (pixel-space) visual stimuli from a single participant in a unified Bayesian framework. Thus, in contrast to previous models, the representations learned by the CNN and EAM parameters are directly constrained by behavioral data. We leverage these advantages to explore how abstract, task-relevant information is extracted from raw sensory inputs, and investigate how the behavioral phenomenon of congruency effects arises as a consequence of the representation geometry learned by the CNN. In doing so, our framework also addresses one of the main criticisms of deep neural network models of vision that are optimized to perform particular tasks (e.g. object identification): these models account for few results from psychology [2], [5], [18], [42], [44]. Indeed, as we will show, fitting the CNN with human data—rather than optimizing the model to perform a task—has significant consequences for the representations learned by the model.

Modeling framework

We first describe the task—Lost in Migration (LIM), available as part of the Lumosity cognitive training platform—which will serve to ground the discussion of the model. LIM is a stylized version of the well-known arrows flanker task used in the study of cognitive control and visual attention (Fig. 1A; [73]). In LIM, participants are shown stimuli composed of several arrow-shaped birds. The task is to indicate the direction of the central bird (the target) using the arrow keys on the keyboard, ignoring the direction of the surrounding birds (the flankers). Participants engage with the task at home and are rewarded via a composite score that takes into account both speed and accuracy.

Task and model.

A) Top, Lost in Migration task. Bottom, the seven stimulus layouts (random target/flanker directions). B) VAM schematic. The numbers after the CNN layer names correspond to the number of channels used in that layer. See Methods for additional details.

The stimuli used in LIM vary along several dimensions that are not present in the standard flanker task, implying potentially complicated stimulus-behavior dependencies that are well-suited to the VAM. First, both targets and flankers can be independently oriented left, right, up, or down, such that there are four possible response directions (the standard arrow flanker task allows for only left and right responses). Second, the layout of the targets and the flankers can appear in one of seven configurations: left/right/up/down ‘V’, horizontal/vertical line, and cross (Fig. 1A). Last, the target can be centered anywhere within the 640x480 pixel game window, subject to edge constraints.

As with other flanker tasks, a given trial is said to be congruent if targets and flankers are oriented in the same direction, and incongruent otherwise. A consistent observation from studies employing the flanker task and related conflict tasks are congruency effects: participants are slower and less accurate on incongruent trials [16], [69], [74]. Congruency effects are considered to index a specific aspect of cognitive control: they indicate the extent to which an individual can selectively attend to task-relevant information and ignore irrelevant information.

The visual accumulator is an image-computable EAM, composed of a CNN and EAM chained together (Fig. 1B). Minimally processed (pixel-space) visual stimuli are provided as inputs to the CNN. The outputs of the CNN correspond to the mean rates at which evidence is accumulated (the drift rates), one for each possible response (four in the case of LIM). The EAM then generates choices and RTs through a noisy evidence accumulation process. For each participant in the dataset, we fit one such model (CNN + EAM) jointly using that participant’s visual stimuli, RTs, and choices. Building on prior work [11], [38], we developed an automatic differentiation variational inference (ADVI) algorithm that simultaneously optimizes the parameters of the CNN and learns the posterior distribution over the LBA parameters (Methods).

The particular EAM we adopt is the linear ballistic accumulator model (LBA [8]; Fig. 1B) since it can be applied to tasks with more than two possible responses, though the general VAM framework is compatible with other EAMs that have a closed-form or easily approximated likelihood. The LBA parameters fitted in the VAM are the decision threshold b, the non-decision time t0 , and a parameter A that controls the dispersion of the initial accumulator values (the drift rate means are fitted implicitly via the CNN). We used a seven layer CNN architecture (6 convolutional layers, 1 fully-connected layer) in the VAM (Fig. 1B). To speed up the training process, the first two convolutional layers were initialized with the parameters from a 16-layer VGG CNN trained to classify the ImageNet dataset [12], [70].

All of the code (https://github.com/pauljaffe/vam) and data (https://doi.org/10.5281/zenodo.10775514) used to train the VAMs and reproduce our results are publicly-available.

Results

Models capture human behavioral data

We fitted a separate VAM to LIM data (visual stimuli, RTs, choices) from each of 75 Lumosity users (participants). We selected participants who had practiced Lost in Migration extensively (≥ 25,000 trials) and used data from a later stage of practice to minimize learning effects (≥ each participant’s 50th gameplay). These participants varied in age (23–87 years) and in their behavior (mean RT, accuracy, congruency effects), allowing us to examine how well our framework captured individual differences.

To assess the model fits, we compared the mean RT, accuracy, RT congruency effect (incongruent trial mean RT minus congruent trial mean RT), and accuracy congruency effect (congruent trial accuracy minus incongruent trial accuracy) of each model/participant pair on a set of holdout stimuli, separate from those used to train the models (Figs. 2 and S1). For each behavioral summary statistic, the responses of the fitted models were highly correlated with those of the participants (Pearson’s r > 0.75), with slopes close to unity (Fig. 2B-E, statistics in figure legend). We found that the RT congruency effect could be attributed to a reduction in the mean target drift rate parameter on incongruent vs. congruent trials, while the accuracy congruency effect could be attributed to a higher mean flanker drift rate on incongruent trials relative to the non-target (other) mean drift rates on congruent trials (Fig. 2F). The latter observation follows from the fact that the drift rates on trial i are sampled from N(v(i), 1), where the v(i) are the mean drift rates shown in Fig. 2F. Since the flanker drift rates on incongruent trials are higher (less negative) than the non-target drift rates on congruent trials, more errors result on incongruent trials, giving rise to the accuracy congruency effect.

Comparison of model/participant behavior.

For panels B-E, each point is one model/participant (n = 75), black line: unity, red line: linear best-fit. A) Example model/participant RT distributions. B) Mean RT (Pearson’s r = 0.99, bootstrap 95% CI = (0.99, 0.99), best-fit slope = 1.07). C) Accuracy (r = 0.91, 95% CI = (0.87, 0.94), slope = 1.15). D) RT congruency effect (r = 0.77, 95% CI = (0.67, 0.86), slope = 1.01). E) Accuracy congruency effect (r = 0.92, 95% CI = (0.88, 0.94), slope = 1.20). F) Drift rates averaged across all trials and models. G) Mean RT vs. age averaged across models. H) Example model/participant mean RT vs. stimulus layout (Pearson’s r = 0.67). I) Example model/participant mean RT vs. horizontal stimulus position (negative values: left of center; Pearson’s r = 0.79). J) Empirical CDF of Pearson’s r between model/participant mean RTs across stimulus feature bins (only participants with significant RT modulation are shown; layout: n = 60 models/participants, x-position: n = 72, y-position: n = 69). Error bars in panels F-I are bootstrap 95% confidence intervals.

We also examined whether the fitted models captured demographic effects, focusing on the well-established slowing of RTs that occurs with age [23], [50], [57]. Mean RTs began increasing around age 50, effects captured by the fitted models (Fig. 2G). Consistent with prior work in a variety of decision-making tasks [20], [57], [67], [72], we found that an age-dependent increase in the non-decision time (t0) component of the response contributed to longer RTs in the models from older adults (Fig. S2A). In contrast to these prior studies, we did not observe increased response caution (measured as b - A) in the models from older adults (Fig. S2B). This discrepancy could be explained by differences in the task design or the fact that our participants had practiced the task substantially more than in typical studies [72]. We also observed an age-dependent reduction in the target drift rates and no age-dependence of the flanker drift rates (Fig. S2C-D), findings that have received mixed support in the literature [3], [20], [57], [67], [72].

One virtue of the VAM is that the model implicitly learns which stimulus properties influence behavior. As a simple demonstration of this, we investigated how two high-level visual features influenced participant behavior—stimulus layout and both horizontal/vertical stimulus position—and whether the fitted models captured these effects. We focused on RT effects, since no participants exhibited a significant effect of layout or horizontal/vertical position on accuracy (p > 0.05 for all stimulus features, chi-squared test), though we note that ceiling effects resulting from the high accuracy of the participants may have hindered our ability to detect these accuracy effects.

The majority of participants exhibited layout-dependent RT biases (p < 0.05 for 60/75 participants, one-way ANOVA; see examples in Fig. 2H and Fig. S1B). We quantified how well the models captured these effects by calculating the Pearson’s r between model/participant mean RTs across each layout for the participants that exhibited significant layout-dependent RT modulation (Fig. 2J). The median Pearson’s r was 0.67, demonstrating good correspondence between model/participant behavior. There was considerable heterogeneity in the particular layout RT biases exhibited by both the participants and models, though responses to trials with the vertical line layout were on average somewhat faster than the other layouts for both participants and models (Figs. S1B and S3A).

The majority of participants also exhibited both horizontal and vertical position-dependent RT biases (horizontal position: p < 0.05 for 72/75 participants, vertical position: p < 0.05 for 69/75 participants, one-way ANOVA; see examples in Fig. 2I and Fig. S1C-D). The fitted models captured these effects adequately: the median Pearson’s r between model/participant mean RTs across stimulus position bins was 0.78 for horizontal position and 0.55 for vertical position (Fig. 2J). While there was substantial variability across both participants and models in the particular position-dependent RT biases they exhibited, most participants/models responded more slowly when the stimuli were positioned close to the horizontal edges and, to a lesser extent, the vertical edges of the task window (Figs. S1C-D and S3B-C).

While the analyses above demonstrate that the mean congruency effects from the participants were captured well by the VAMs, we observed a mismatch between model and participant behavior for other commonly used behavioral indices that take into account additional information from the RT distribution: RT delta plots and conditional accuracy functions [30], [63], [78], [82], [83] (Fig. S4A-B). However, when we fitted a separate mo del to trials from each of n=5 RT quantiles from a given participant’s data, quantitative fits improved and qualitative trends were captured for both delta plots and conditional accuracy functions (Fig. S4C-D, Methods). Thus, a small modification of the VAM training paradigm can be used to investigate how congruency effects change as a function of RT. However, a thorough investigation of these effects is outside the scope of this manuscript, and we focus on the models fit to all trials in the remainder of our study.

In summary, the VAM captured individual differences in RTs, accuracy, and congruency effects, and the dependence of RTs on stimulus layout and position. To lay the groundwork for understanding how the learned visual representations of the models relate to these behavioral effects, we sought to characterize the general properties of these representations that enable proficient execution of the task.

Representations of task-relevant stimulus information

To perform LIM well, the models must learn representations that select the task-relevant information (target direction) and diminish the influence of irrelevant information (flanker direction, stimulus layout, and stimulus position). To investigate these representations, we analyzed a set of stimulus features in each CNN layer, derived from the models’ responses to a holdout set of LIM stimuli (separate from the training stimuli; Fig. 3A, Methods). The features in each layer form an N × Kl matrix, where N is the number of stimuli in the holdout set (5000) and Kl is the number of active units in layer l (a variable fraction of units in each layer did not respond to any stimuli and were excluded from the feature matrix).

Neural representations of target direction.

A) Schematic of the CNN features extracted from each network layer. B) Decoding accuracy of stimulus target direction. C) Normalized mutual information for target direction conveyed by single units, averaged across units. Mutual information was normalized by the entropy of the target direction distribution. D) Dimensionality of target representations as measured by the participation ratio of the target-centered feature covariance matrix. E) Proportion of units exhibiting selectivity for target direction. Panels B-E show the average across n = 75 models; error bars correspond to bootstrap 95% confidence intervals.

We characterized the emergence of target selectivity in the model by quantifying how well the target direction could be decoded from the feature matrix of a given layer with a linear support vector machine [66] (SVM; Fig. 3B). Note that only incongruent trials were used in these analyses, since the classifier could use the flanker direction to classify the target on congruent trials, artificially inflating performance. Decoding performance on holdout data for target direction was near chance (27%) in the first network layer and increased to nearly perfect decoding (≥ 97%) at layer 4, with a sharp increase between the second and third convolutional layers (Fig. 3B). The increase in target decoding accuracy from shallower to deeper network layers is generally consistent with neural recording studies in mammals and other neural networks studies that document more accurate decoding of abstract variables (e.g. object or category identity) in higher visual/cortical regions and deeper neural network layers [7], [47], [76].

We also investigated target selectivity at the single-unit level by quantifying the mutual information between each unit’s activity and target direction [47], [76] (Fig. 3C). In contrast to the population-level decoding results, information for target direction exhibited a non-monotonic “hunchback” profile across the convolutional layers, then increased sharply in the final fully-connected layer. The observation that single-unit information for target direction decreased between the fourth and final convolutional layers while population-level decoding remained high is especially noteworthy in that it implies a transition from representing target direction with specialized “target neurons” to a more distributed, ensemble-level code. Notably, a similar transition in coding properties takes place along the cortical hierarchy, with higher-order cortical regions exhibiting a reduction in units with “pure” selectivity and a corresponding increase in units with mixed selectivity for multiple task or stimulus features [45], [46], [64]. The high proportion of units with mixed selectivity results in a high-dimensional representation in which all task-relevant information can easily be extracted with simple linear decoders [10], [51].

Consistent with these ideas, we found that the dimensionality of target representations as measured by the participation ratio (Methods) increased sharply in the last two convolutional layers (Conv5-Conv6), paralleling the reduction in single-unit information for target direction in these layers (Fig. 3D). The highdimensional representation observed in these layers may enable the VAM to capture the rich stimulusbehavior dependencies present in the participant data. The reduction in dimensionality in the final fully-connected layer, and concomitant increase in single-unit information for target direction, may reflect a strong constraint to select the correct target direction imposed by the task. Notably, the hunchback-shaped profile we observed in the dimensionality of representations has also been observed along visual cortical regions of the rat ventral stream and in other neural network studies [1], [47].

To more explicitly characterize the degree of directional selectivity in each layer, we quantified the proportion of units that responded preferentially to one of the four target directions. We separated units according to the sign of modulation and degree of directional selectivity: “selective (+)” and “selective (-)” units were more (or less) active for one target direction relative to the other three; “complex” units exhibited significant modulation by target direction without a clear directional preference (Methods).

The proportion of units that were selective for a particular direction increased steadily from the second to the fourth convolutional layer, with most units exhibiting positive modulation (Figs. 3E & S5). Between the fourth and final convolutional layers, the proportion of units with positive target direction modulation decreased, while the proportion of units with negative and complex modulation increased.

In summary, a strong representation for target direction emerged in the middle convolutional layers, and was initially supported by a simple low-dimensional code, with information for each target direction concentrated in separate populations of directionally-tuned units. In later convolutional layers, target direction was supported by a more distributed, complex, and high-dimensional code, with weaker directional selectivity at the single-unit level.

Extraction and suppression of distracting stimulus information

How do the models’ visual representations enable target selectivity for stimuli that vary along several irrelevant dimensions? The mammalian visual system solves the analogous problem of visual object recognition through representations that become increasingly invariant to different object positions, sizes, and lighting conditions [14], [66]. To determine whether the models learned representations that share these characteristics, we initially assessed whether the model representations for target direction are indeed invariant (or more generally, tolerant), to variation in flanker direction, stimulus layout, and horizontal/vertical position. Alternatively, the models could have learned a coding scheme in which different representations are responsible for selecting the target in each distracter context, e.g. with units that respond to the conjunction of particular target direction and layout combinations.

To assess the degree of representation tolerance, we quantified how well a linear SVM trained to classify target direction in one distracter context generalized to a different distracter context, where the distracter context is defined by a particular type of task-irrelevant information [4], [66]. For example, to assess the degree of tolerance to flanker direction, we split trials into a training set where the flanker direction was fixed to a particular value (e.g. down), and a generalization set with all other flanker directions (e.g. flanker direction = left/right/up). The performance of the classifier on the generalization set measures the extent to which the target representations are tolerant (or invariant) to variability in a given type of distracting information.

For each type of distracting information, we found that generalization performance was initially near chance (25%) and increased steadily through the network layers (Fig. 4A). The lower generalization performance for flanker direction observed in the deepest network layers (~60%) can be attributed to the robust congruency effects exhibited by the models (Fig. 2D-E): for the vast majority of incongruent error trials, the model chose the flanker direction, implying that flanker direction has a strong impact on model representations.

Suppression of task-irrelevant information and tolerance in task-relevant representations.

A) Decoding accuracy of stimulus target direction in a new distracter context (generalization performance). Context was defined by the values of a given stimulus feature (flanker direction, layout, horizontal/vertical position). B) Decoding accuracy of irrelevant stimulus features. C) Normalized mutual information for irrelevant stimulus features conveyed by single units, averaged across units. For each stimulus feature, the mutual information was normalized by the entropy of the stimulus feature distribution. All panels show the average across n = 75 models; error bars correspond to bootstrap 95% confidence intervals.

The tolerance of target direction representations to variability in irrelevant stimulus features suggests that the irrelevant information was progressively suppressed from shallower to deeper network layers. To examine this explicitly, we quantified how well each irrelevant stimulus feature (flanker direction, stimulus layout, horizontal/vertical position) could be decoded from the activity in each layer using the same SVM classifier methodology as we did for target direction decoding. We found that decoding accuracy for flanker direction and stimulus layout exhibited a hunchback profile in which decoding performance started low in early layers, increased in intermediate layers, and decreased in later layers (Fig. 4B). The decoding accuracy for horizontal and vertical position followed a similar but shifted pattern: there was a steady increase in accuracy until the second-to-last layer, followed by a slight drop in the last layer. Partial (rather than complete) suppression of irrelevant stimulus features is expected given that these features all impact behavior (Fig. 2). Note that the increase in receptive field size that occurs from shallower to deeper layers is necessary but not sufficient for accurate decoding of the irrelevant stimulus features, and does not explain the reduction in decoding accuracy in later layers.

We also examined suppression at the single-unit level by quantifying the mutual information between each unit’s activations and the values of each task-irrelevant stimulus feature [76] (Fig. 4C). The mutual information for a given stimulus feature was normalized by the entropy of the feature distribution to facilitate comparisons between the features [47]. In agreement with the population-level decoding analyses, information for the irrelevant stimulus features exhibited a pronounced hunchback profile in the progression from shallower to deeper network layers. It is noteworthy that the suppression of irrelevant information is more pronounced at the single-unit level in that decoding accuracy remains relatively high in later layers, particularly for flanker direction. This parallels the findings discussed above for target direction, and again suggests a transition from a simple code with populations of units that are selective for particular stimulus features to a more distributed, ensemble-level code.

Orthogonality of task-relevant and irrelevant information predicts behavior

A noteworthy feature of visual attention is that the selectivity and tolerance identified above is not absolute: irrelevant information cannot be filtered out completely, as illustrated by the congruency effects observed in the flanker task and related conflict tasks. A common framework for understanding these behavioral phenomena posits that task-relevant and irrelevant information compete for control over response execution, and that congruency effects arise from incomplete suppression of irrelevant information [78], [83]. Motivated by these theories, we investigated whether the degree of suppression of irrelevant information in the trained models was correlated with congruency effects across the models we analyzed.

To this end, we operationalized suppression with two metrics of the model representations that we investigated above: the accuracy of decoders trained to classify flanker direction from the model representations and the mutual information for flanker direction conveyed by single units. We expected to observe a positive correlation between both of these metrics and both RT and accuracy congruency effects: higher decoder accuracy or mutual information for flanker direction corresponds to less suppression and therefore higher congruency effects. However, we did not observe a significant positive correlation between either decoding accuracy or mutual information and RT or accuracy congruency effects in any of the model layers, with the exception of a single significant positive correlation between flanker mutual information and accuracy congruency effects in layer Conv3 (Fig. S6).

Given this somewhat surprising negative result, we were motivated to consider an alternative account of congruency effects, one that takes into account the relative geometry of the task-irrelevant (flanker) and taskrelevant (target) information. In particular, we considered the possibility that task-relevant and irrelevant information could be orthogonalized in the high-dimensional space of neural activity, such that task-relevant information is shielded from distracter interference. The general idea that neural representations for different types of information can be orthogonalized to prevent interference has received support from a number of neural recording studies [19], [32], [40], [52], [65].

To examine whether target and flanker representations are orthogonalized, we first defined target and flanker subspaces from the target direction and flanker direction classifiers used in the decoding analyses described above (Methods). The target direction classifier for a given network layer implicitly defines four decoding vectors, one for each target direction [4], [40]. We define the subspace of the Kl-dimensional feature space spanned by these four vectors as the target subspace; the flanker subspace is defined analogously.

To measure the orthogonality between target and flanker subspaces, we calculated the average of the cosine of the principal angles between the target and flanker subspaces, a metric we refer to as subspace alignment (Methods). Principal angles generalizes the idea of angles between lines or planes to arbitrary dimensions [31], [88]. The subspace alignment metric has a simple interpretation: it is equal to one if the subspaces are completely parallel and zero if they are completely orthogonal.

We first characterized how target/flanker subspace alignment develops across the layers of the trained models (Fig. 5A). Given that target direction decoding is poor in the first two layers (< 50%; Fig. 3B), the decoding vectors used to define the target subspace are not particularly meaningful, and we do not attempt to interpret the subspace alignment metric in these layers. Beginning at the third convolutional layer, when decoding accuracy for both targets and flankers is high (> 90%), we found that target and flanker subspaces are well-aligned. We interpret the high alignment as evidence that the model has learned a common representation for direction that is shared for both targets and flankers. In later layers, we found that target and flanker representations become increasingly orthogonal, consistent with the view that the processing in later layers acts to reduce interference between task-relevant and irrelevant information.

Orthogonality of target/flanker subspaces predicts accuracy congruency effects.

A) Target/flanker subspace alignment averaged across models. B) Pearson’s correlation coefficient between target/flanker subspace alignment and accuracy congruency effect calculated across models. C) Target/flanker subspace alignment vs. accuracy congruency effect for layers Conv4-FC1. Each point corresponds to one model; the red line is the linear best-fit. For all panels, n = 75 models. Error bars in panels A-B correspond to bootstrap 95% confidence intervals. Asterisks in panel B indicate a significant Pearson’s r (adjusted p-value < 0.05, permutation test with n = 1000 shuffles, Bonferroni correction for 7 comparisons).

If orthogonalizing target and flanker representations reduces interference from the irrelevant (flanker) information, we should observe a positive correlation between subspace alignment and congruency effects across models, since greater alignment results in more interference. Consistent with this idea, we observed a significant positive correlation between subspace alignment and accuracy congruency effects across models in each layer beginning with the fourth convolutional layer (Fig. 5B-C; adjusted p-value < 0.05 for layers 3–7, permutation test, Bonferroni correction for 7 comparisons). In contrast, we did not observe a significant correlation between target/flanker subspace alignment and RT congruency effects in any network layer, suggesting a mechanistic dissociation between RT and accuracy congruency effects (Fig. S7).

Representation geometry of task-optimized models

Researchers who use neural network models to study neural representations typically optimize the model to perform a task, rather than fit the model to behavioral data from the task as we do here [45], [81], [86], [87], cf. [13], [29], [75]. This raises the possibility that these two training paradigms induce different representations in the models. To investigate this, we trained CNNs to perform LIM (i.e. output the direction of the target) by minimizing the standard cross-entropy loss function used in image classification tasks. Other than the training objective, all aspects of the optimization algorithm, CNN architecture, initialization, and training data were the same as those used to train the VAM. We trained one task-optimized model for each VAM, using data from the same participants (n = 75 task-optimized models).

We first compared the behavioral outputs of the task-optimized models and the VAMs. We found that the task-optimized models did not exhibit an accuracy congruency effect (Fig. 6A). Note we could not examine the RT congruency effect since the task-optimized models do not generate RTs. Thus, simply training the model to perform the task is not sufficient to reproduce a behavioral phenomenon widely-observed in conflict tasks. This challenges a core (but often implicit) assumption of the task-optimized training paradigm, namely that training a model to do a task well will result in model representations that are similar to those employed by humans.

Comparison of VAMs and task-optimized models.

A) Accuracy congruency effect. B) Target/flanker subspace alignment. C) Dimensionality of target representations, as measured by the participation ratio of the target-centered feature covariance matrix. D) Normalized mutual information for target/flanker direction conveyed by single units, averaged across units. Mutual information was normalized by the entropy of the target/flanker direction distribution. E) Decoding accuracy of target/flanker direction. F) Proportion of units exhibiting selectivity for target direction in layers Conv5-Conv6. All panels show the average across n = 75 task-optimized models and n = 75 VAMs; error bars correspond to bootstrap 95% confidence intervals. The VAM data shown in panels A-F is the same as that shown in Figs. 2E, 5A, 3D, 3C/4C, 3B/4B, and 3E, respectively.

Above, we showed that VAMs with greater orthogonalization of target and flanker information exhibit smaller accuracy congruency effects (Fig. 5B), providing evidence that the relative geometry of task-relevant and irrelevant representations is a critical determinant of the degree of flanker interference at the behavioral level. Given that the task-optimized models do not exhibit an accuracy congruency effect, we therefore expect that these models would exhibit a higher degree of orthogonalization of target and flanker information. Consistent with this idea, we found that the task-optimized models had lower target/flanker subspace alignment (i.e. higher orthogonalization) for all network layers beginning with the third convolutional layer (Fig. 6B).

A direct consequence of the training paradigms used to train the VAMs is that these models are encouraged to capture dependencies between the stimulus features and behavior (RTs and accuracy), while the task-optimized models are not. As a result, the VAMs may learn richer representations of the stimuli, since a variety of stimulus features—layout, stimulus position, flanker direction—influence behavior (Fig. 2). To investigate this possibility, we compared the dimensionality of target representations between the VAMs and task-optimized models using the participation ratio metric discussed above.

We found that the dimensionality of the VAM and task-optimized model representations was nearly identical for the first four convolutional layers (Fig. 6C). In contrast, for the final two convolutional layers, the VAMs exhibited a substantially more pronounced expansion of dimensionality than the task-optimized models. In the final fully-connected layer, dimensionality decreased sharply for both types of models, and was somewhat higher for the VAMs.

The increased dimensionality of VAMs’ target represenations in later network layers is consistent with the view that these models must learn more complex representations of the stimuli in order to successfully capture stimulus-behavior dependencies. It is also noteworthy that the most striking difference between the dimensionality of the VAMs and task-optimized models occurs during the latter part (layers Conv5-Conv6) of the expansion phase of the hunchback-shaped dimensionality profile discussed above and observed in prior work [1], [47]. In these layers, single-unit information for target and flanker direction—the primary task features—decreases, while population-level decoding of these features remains high (Figs. 3B-C & 4B-C). As discussed above, this dissociation implies a transition from a simple representation of target/flanker direction with separate populations of directionally-tuned units to a more complex and distributed code.

To determine whether the task-optimized models exhibited this change in coding properties, we quantified the single-unit information and population-level decoding accuracy for target/flanker direction in these models. Relative to the VAMs, the task-optimized models had substantially higher single-unit information for both target and flanker direction in layers Conv5-Conv6 (Fig. 6D). The task-optimized models also showed marginally more or roughly equivalent single-unit information for stimulus position in these layers relative to the VAMs, but had less single-unit information for stimulus layout (Fig. S8A).

In contrast, population-level decoding accuracy for target/flanker direction and stimulus position was similar between the task-optimized models and VAMs in layers Conv5-Conv6, though decoding accuracy for stimulus layout was notably lower for the task-optimized models (Fig. 6E & S8B). These results suggest that the task-optimized models maintained a simpler code for target/flanker direction in the later convolutional layers relative to the VAMs, primarily relying on separate populations of directionally-tuned units. Consistent with this idea, the task-optimized models had a higher proportion of simple selective (+) directionally-tuned units and a lower proportion of complex units in layers Conv5-6 relative to the VAMs (Fig. 6F).

Discussion

The dominant models of decision-making, while providing a good quantitative description of psychophysical data, do not incorporate biologically plausible models of the perceptual processes that are essential for many behaviors [17]. On the other hand, neural network models of vision, while capturing core properties of the primate visual system, do not account for many results from behavioral experiments [2], [5], [18], [42], [44]. The VAM addresses both of these limitations by integrating neural network models for visual processing with traditional decision-making models in a unified probabilistic framework that can be fitted to visual stimuli and RT data. Leveraging large-scale data from a task with rich visual stimuli, we demonstrate that our framework captures complex dependencies between stimulus features and behavior at the level of individual participants. We also illustrate how congruency effects—a core behavioral phenomenon observed in conflict tasks—can be explained in terms of the visual representations of the model. Finally, we document several key differences between the representations learned by models fitted to human behavioral data (the VAMs) and those learned by models trained only to do the task.

To perform the task and capture the behavioral data, the VAMs learned representations that—like the mammalian cortex—extract task-relevant information from raw visual inputs. Our analyses of these representations revealed several discrete processing phases that are fruitfully discussed in relation to the changes in target representation dimensionality we observed. Across the layers of the VAM’s CNN, target dimensionality exhibited a prominent hunchback shape profile, corresponding to an initial protracted phase of dimensionality expansion followed by abrupt dimensionality compression in the final network layers. An analogous expansion and subsequent compression of object representation dimensionality has been documented in CNNs trained to classify images and along the rat visual-cortical hierarchy [1], [47], with some notable differences from our work that we highlight below.

The initial phase of dimensionality expansion can be explained in part by the pruning of low-level stimulus features (e.g. contrast and luminosity) that are correlated across stimuli in the dataset [1]. These correlations are particularly strong in our dataset since we used the same background image for each stimulus, resulting in low-dimensional activity in the initial network layers. We speculate that the initial increase in target representation dimensionality (i.e. layers Conv1-Conv4) results from the removal of these correlations, analogous to a whitening transformation [1].

In the early and middle convolutional layers (Conv1-Conv4), we found that the population-level decoding and single-unit information for both task-relevant (target direction) and distracting information (flanker direction, stimulus layout, and position) increased, occurring in parallel with the subtle expansion of dimensionality we observed in these layers. These trends are broadly consistent with observations that the mammalian visual cortex encodes increasingly abstract stimulus properties (e.g. object identity) in downstream brain regions [24], [28], [66]. The fact that information for distracting stimulus features increased in these layers is especially noteworthy, given that this information—the flanker direction, most prominently—impairs performance on the task [16], [73]. While it is tempting to speculate that the VAMs learned to extract distracter information because they were fitted to behavioral data, where the impairments in task performance manifest, the task-optimized models also exhibited increased distracter information in these layers. This suggests an alternative explanation in which the model first extracts more granular, superordinate stimulus representations that group together stimulus features with similar statistics. For example, the initial increase in encoding of target and flanker direction may reflect the representation of a unitary “directional signal” that facilitates the eventual separation of target and flanker direction representations that occurs in later layers. This idea is consistent with our observation that target/flanker subspaces are highly aligned in the middle convolutional layers and become progressively more orthogonal in later network layers.

In the last convolutional layers (Conv5-Conv6), we observed a large expansion in target representation dimensionality, which was notably less pronounced in the task-optimized models compared to the VAMs. The increase in dimensionality may be partly due to an increase in mixed selectivity for multiple stimulus features at the single-unit level, similar to the single-unit coding properties observed in primate PFC [64]. In general agreement with this idea, we observed a reduction in the proportion of units with selectivity for particular target directions in these layers, and a concomitant increase in units that were modulated by multiple target directions. Relative to the VAMs, the task-optimized models showed a higher proportion of directionally-tuned units in these layers. A notable advantage of high-dimensional neural codes composed of units with mixed selectivity is that many different input-output relationships can be implemented with simple decoders [10], [51], [64]. Thus, the higher dimensionality and more complex target direction coding observed in the VAMs relative to the task-optimized models may reflect the fact that the VAMs are trained to capture a potentially large number of dependencies between stimulus attributes and behavior, while the task-optimized models need only extract the target direction. Given the impressive capacity of primates to learn complex tasks and arbitrary stimulus behavior relationships, and the abundance of high-dimensional representations across the cortex, models trained to capture the richness of stimulus-behavior dependencies— such as the VAM—may result in better models of cortical processing than optimizing models to perform tasks.

The increase in dimensionality in layers Conv5-Conv6 may also result from suppression of correlations induced by covarying high-level stimulus features—in particular, target and flanker direction—a transformation that we conjectured to operate on low-level stimulus features in the earlier network layers. Consistent with this, the largest reduction in target/flanker subspace alignment occurs simultaneously with the largest expansion in target representation dimensionality, between layers Conv4 and Conv5. Given that the VAMs and task-optimized models exhibit a comparable reduction in target/flanker subspace alignment between these layers, the larger dimensionality that we observed in the VAMs may be driven primarily by other factors, such as the increase in units with mixed selectivity discussed above.

One of the key strengths of our modeling framework is that neural representations of stimulus features can be directly related to behavioral outputs. We focused in particular on congruency effects since they are ubiquitously observed in conflict tasks and highlight an inherent limitation of selective attention, namely that humans cannot completely filter out distracting information. Congruency effects are often interpreted in the context of a “dual-process” model in which automatic or impulsive processing arising from distracting stimulus information and more controlled or deliberate processing of task-relevant information compete for control over behavior [62], [78], [83]. Congruency effects result from the incomplete suppression of incorrect response activation in the automatic pathway. Building on this framework, a variety of models have been proposed that successfully capture congruency effects and other behavioral phenomena observed in conflict tasks [9], [78], [82].

Our work differs from these prior modeling efforts in two key ways. First, we did not attempt to “build in” a predetermined implementation of congruency effects. Rather, we fitted a relatively unstructured neural network model to human flanker task data and examined how congruency effects emerged [29]. Second, we explicitly modeled how task-relevant (and task-irrelevant) representations are extracted from raw visual stimuli with a biologically-plausible model of the mammalian visual system (a CNN). These features of the VAM allowed us to explore a space of possible explanations for congruency effects that are not readily investigated with prior conflict task models. Each CNN layer is composed of many simple neuron-like processing units, enabling a population-level description of task-relevant and irrelevant representations. Prior connectionist models of conflict tasks are also implemented as networks of interacting processing units [9], but are much reduced in scale relative to the VAM, precluding a characterization of stimulus feature representations in terms of their high-dimensional geometry as we pursue here. Another relevant attribute of CNNs is that the successive convolution and pooling operations in each layer extract increasingly abstract, task-related representations from raw visual inputs, a hallmark of the mammalian ventral visual system. This enabled a rich characterization of the intermediate visual representations that sequentially transform raw visual stimuli to behavioral outputs.

Leveraging these features of the VAM framework, we investigated potential explanations for congruency effects in terms of the population-level activity and representation geometry in each network layer. Motivated by the dual-process model of conflict tasks mentioned above, we initially sought evidence for the suppression of task-irrelevant information, as operationalized by population-level decoding accuracy and single-unit information for flanker direction. While we did find evidence of greater suppression in later vs. middle convolutional layers for both of these metrics, variability in the magnitude of this suppression across models was not related to variability in congruency effects.

Instead, we found that the relative geometry of target and flanker representations was a critical determinant of congruency effects. In particular, models with smaller accuracy congruency effects had more orthogonal (or less aligned) representations of targets and flankers in later network layers, proximal to behavioral outputs. This finding is consistent with the idea that orthogonalization of task-relevant/irrelevant representations shields the task-relevant information from distracters, and parallels findings from neural recording studies in both humans and animals that representations for different sources of information can be orthogonalized in order to prevent interference [19], [32], [40], [52], [65], [84].

To elaborate on this idea, in models with more aligned (less orthogonal) target/flanker representations in intermediate layers, the task-relevant (target) subspace is effectively “contaminated” by task-irrelevant (flanker) information. In the absence of any corrective process, the task-relevant subspace propagates a mixture of both correct (target) and incorrect (flanker) direction signals forward through the network. At the final readout layer of the network, assuming that the target subspace is well-aligned with the readout weights [53], flanker signals within the target subspace then contribute to the drift rate for the (incorrect) flanker direction. This increases the probability of an error, such that models with more aligned target/flanker representations exhibit larger accuracy congruency effects. A corollary of this proposition is that the absolute amount of flanker information in the network—equivalently, the degree to which flanker information has been suppressed—is not necessarily predictive of congruency effects. This may account for our observations that the measures of suppression we considered are not correlated with accuracy congruency effects across models, and that representations for flanker direction do not appear to be strongly suppressed even in later network layers, as evidenced by high decoding accuracy for flanker direction in both the VAMs and task-optimized models.

One apparent limitation of the VAM as presented here is that it does not have dynamics, which seem to be required to explain some observations in the flanker task and related conflict tasks. For example, the RTs on incongruent error trials are typically faster than error RTs for congruent trials and RTs for correct trials [83], an effect that we confirm is also present in the flanker task variant studied here. This observation can be explained by a “shrinking attention spotlight” in which response activation from flankers starts high and diminishes over time, resulting in a higher proportion of errors for faster RTs [82]. While the primary VAMs analyzed in this paper did not capture these error patterns, we show that the simple modification of training separate VAMs on each RT quantile produced error patterns that closely resemble those observed in the participant data. The representations learned by these models could in principle be compared, allowing one to investigate the mechanisms underlying these error phenomena. Future work may aim to incorporate true dynamics into the visual component and decision component of the VAM with recurrent CNNs [22], [49] and the task-DyVA model [29], respectively.

In summary, the VAM is a probabilistic model of psychophysical data that captures how raw sensory inputs are transformed into the abstract representations that guide decisions. Raw (pixel-space) visual stimuli are processed by a biologically-plausible neural network model of vision that outputs the parameters of a traditional decision-making model. Each VAM is fitted to data from a single participant, a feature that allowed us to study how individual differences in behavior emerge from differences in the “brains” of the models. To this end, we found that models with smaller congruency effects had more orthogonal representations for task-relevant and irrelevant information. While we chose to use a CNN to model visual processing, we note that the VAM is not limited to this choice: other sensory encoding models, such as those based on transformer architectures [15], can be readily swapped in to replace the CNN with minimal changes to the underlying VAM implementation. Similarly, the LBA decision-making model we employed is easily replaced with other decision-making models that have a closed-form or easily approximated likelihood, such as the diffusion-decision model and leaky competing accumulator model [43], [48], [60], [80]. In this way, the VAM provides a general probabilistic framework for jointly fitting interpretable decision-making models and expressive neural network architectures of sensory processing with psychophysical data.

Methods

Datasets

We used deidentified Lost in Migration gameplay data from 75 Lumosity users (participants) to train the models included in this paper. The mean (SD) age of this sample at the time of signup was 56.4 (18.6) years; 46.7% identified as female, 48.0% identified as male, and 5.3% did not report their gender. All analysis and modeling was done retrospectively on preexisting Lumosity data that were collected during the normal use of the Lumosity program (at home). All participants consented to the use and disclosure of their deidentified Lumosity data for any purpose as laid out in the Lumosity Privacy Policy (www.lumosity.com/legal/privacy_policy). No statistical methods were used to predetermine sample sizes, though our sample sizes are comparable to or larger than those used in related modeling work [8], [82].

The included participants were selected from a larger pool that met certain inclusion criteria. The selection from this larger pool was done at random, with the following exception: we included all participants under the age of 40 (n = 19) to ensure adequate representation of younger participants. The criteria used to define the larger participant pool were as follows: we required that participants had signed up as Lumosity users between June 28, 2015 and June 30, 2020 (inclusive); that they were between the ages of 18 and 89 at the time of signup (inclusive); that their country of origin was Australia, Canada, New Zealand, or the United States; that their preferred language was English; and that they were not employees of Lumos Labs, Inc. Accounts created for research purposes were also excluded. We also required that all of a given participant’s Lost in Migration gameplays were done on the web (as opposed to mobile) platform. Finally, we required that participants had at least 200 Lost in Migration gameplays and at least 25,000 trials starting with their 50th gameplay. One additional participant with near chance accuracy (33%) on Lost in Migration was excluded from the larger participant pool.

Trials with very short and very long RTs were excluded and did not count toward the 25,000 trial minimum. Specifically, we excluded trials less than or equal to 250ms and trials classified as outliers from a criterion based on the median absolute deviation from the median (the MAD): trials with an absolute deviation from the median RT of more than 10 times a given participant’s MAD were excluded.

The final datasets from each participant used for model training consist of the first 25,000 non-outlier trials starting with their 50th gameplay. Data from the first 49 gameplays were excluded to reduce learning-related variability. Approximately 50% of trials were congruent (vs. incongruent).

Modeling framework

Linear ballistic accumulator (LBA) model

The decision-making component of our modeling framework is the LBA model [8], tailored to Lost in Migration. Evidence for each of the four possible response directions is accumulated linearly and independently at a response-specific drift rate dk. Commitment to a decision occurs when one of the accumulators reaches a fixed threshold parameter b. The RT on a given trial is the duration of this evidence accumulation process plus a constant non-decision time parameter t0 that includes both sensory processing and motor execution. Response variability comes from two sources. First, on each trial, the drift rate dk for each response k ∈ {1, 2, 3, 4} is sampled independently from a Gaussian distribution with mean vk and a common SD s. We fix s to 1 for all models to ensure that the LBA parameters are identifiable [11], [25]. Second, the initial evidence for each accumulator is sampled independently from a uniform distribution on the interval [0, A], where A is a model parameter.

Visual accumulator model (VAM): overview

The VAM is a generalization of the LBA model in which the drift rate means on trial i depend on the stimuli s(i) through a CNN. For a dataset with N trials, the task data are x = {response times t, choices c, stimuli s}, where , and . We adopt a Bayesian framework and model the joint density of the task data x and LBA parameters θ = {b, A, t0} as:

where the drift rate means v(i) depend on a CNN with parameters ζ (described below). The factor on the left of Equation (1) is the likelihood of the LBA model, which has a closed-form solution [8]. The factor on the right is the prior distribution over the LBA parameters, specified as a standard multivariate Gaussian p(θ) ~ N(0, I). We express the joint density more compactly as:

To fit the VAM, i.e. learn the posterior distribution over the LBA parameters p(θ|x) and simultaneously optimize the CNN parameters ζ, we apply automatic differentiation variational inference (ADVI) [38]. Rather than attempting to sample from the true posterior directly as in Markov chain Monte Carlo (MCMC) methods, variational inference introduces an approximate posterior density q(θ; ϕ) and minimizes the Kullback-Leibler (KL) divergence from p(θ|x) to q(θ; ϕ) by optimizing ϕ. Here, we specify the approximate posterior q(θ; ϕ) as a multivariate Gaussian with mean μ and unconstrained covariance matrix Ʃ (thus ϕ = {μ, Ʃ}).

We use variational inference since it scales well to large datasets and can handle complicated models (such as the VAM), in contrast to MCMC methods. Leveraging automatic differentiation software (JAX [6]), we automate the calculation of the derivatives of the variational objective (described below) with respect to both the CNN parameters ζ and variational parameters ϕ.

Variable transformations

Note that the LBA parameters θ = {b, A, t0} are restricted to be nonnegative, while μ ∈ ℝ3 and Ʃ is restricted to be positive semidefinite. However, to optimize the variational objective (defined below), we require that ϕ and θ have the same support [38]. To achieve this, we transform ϕ and θ to both have support on the real line. Specifically, we reparameterize Ʃ using the Cholesky decomposition as Ʃ = LLT, where L is a lower-triangular matrix with entries in ℝ (so that now ϕ = {μ, L}). For the LBA parameters, in addition to mapping them to the real line, we must also enforce the constraint that the threshold b is always greater than the parameter A controlling the range of the initial evidence distribution. We enforce this by defining and taking the log of , A, and t0 to map them to the real line [11]. We define the transformed parameters as and the transformation that maps θ to θ* as T.

VAM objective function

To fit the VAM, we maximize the evidence lower bound (ELBO):

This is the standard ELBO optimized in variational inference, with an additional dependence on the CNN parameters ζ. The ELBO is a lower bound on the marginal likelihood of the data p(x), and maximizing the ELBO is equivalent to minimizing the KL divergence from p(θ|x) to q(θ; ϕ). Writing the ELBO in terms of the transformed variables θ*, we have:

The term log | det | is a Jacobian adjustment for the transformation T-1 that maps θ* to θ, required to ensure that the transformed density integrates to one. For the transformation T-1, with θ* vectorized as , the Jacobian is given by:

Taking the log of the absolute value of the determinant of gives the required adjustment:

We will apply stochastic gradient ascent to maximize the ELBO objective function given by Equation (4), using Monte Carlo (MC) estimates of the expectation and AD to calculate gradients with respect to ϕ and ζ. However, the gradients of the ELBO with respect to ϕ cannot be calculated directly by AD, since the expectation in Equation (4) is taken with respect to q(θ*; ϕ), which depends on ϕ. We work around this by applying the reparameterization trick [33], [61], which expresses θ* as a differentiable transformation of an auxiliary noise variable ϵ ~ p(ϵ), where p(ϵ) does not depend on ϕ. In particular, we set p(ϵ) to be a standard multivariate Gaussian, p(ϵ) = N(ϵ; 0, I), and define the transformation , where μ and L are the mean and covariance parameters of the Gaussian approximating density q(θ *; ϕ).

Now we estimate the expectation in Equation (4) with MC samples from p(ϵ). Since the expectation no longer depends on ϕ, we can use AD directly on these MC estimates to obtain unbiased gradients of the ELBO. We estimate the ELBO for each data point using independent MC samples, since this reduces the variance of the gradients relative to using the same set of MC samples for a given batch of data [35]. Thus the ELBO objective for the ith data point is estimated by:

VAM inference algorithm and other training details

The VAM training/inference algorithm is summarized in Algorithm 1. The size of the training set was 16,250 samples (65% of the total dataset) for each model. An additional validation set of 3,750 samples (15%) was used to monitor model training. The remaining 5,000 samples (20%) were used to evaluate model performance (the holdout set). We set the batch size M to 256 and the number of MC samples used to estimate the ELBO L to 10. We used the Adam optimizer [34] with the following hyperparameters for model training: learning rate = 1 e — 3, β1 = 0.9, and β2 = 0.999. The same random seed was used to initialize the parameters of all models. Models were trained using a single NVIDIA GeForce RTX 3060 Ti GPU. Each model took approximately one hour to train.

For a small fraction of attempted model training runs (10/85 = 11.8%), the model either failed to exceed chance accuracy (3/10 models) or converged to a state in which almost all trials had exclusively negative drift rates (7/10 models). These models were not included in summary analyses.

VAM inference algorithm

Convolutional neural network (CNN)

We used the same seven-layer CNN architecture for all models (six convolutional layers followed by one fully-connected hidden layer). The number of channels/units used in the seven layers was as follows: 64, 64, 128, 128, 128, 256, 1024. The output layer (after the fully-connected layer) has four channels, one for each drift rate. Each convolutional layer used a 3×3 pixel kernel (stride 1, same padding). Each convolutional layer was followed by a ReLU nonlinearity, instance normalization [79], and a max-pooling layer (2×2 pixel window, stride 2), in that order. The fully-connected hidden layer was followed by a ReLU nonlinearity and a dropout layer (dropout rate set to 0.5).

To speed up model training, the first two convolutional layers were initialized with the parameters from a larger CNN trained on an image classification task. Specifically, the pretrained model used a 16-layer VGG architecture and was trained on the ImageNet dataset [12], [70]. These first two layers were trainable (i.e. not fixed to their initial values). The weights of the other convolutional layers, fully-connected layer and output layer were initialized randomly using the LeCun normal initializer [36]; the biases were initialized with zeros.

Image preprocessing and data augmentation

The image stimuli used to train the VAM differed from the original Lost in Migration stimuli in a few ways. Information about the current score and time remaining visible at the top of the game window was removed. We also used the same blue background image for all stimuli (the background in the original Lost in Migration changes over the course of the gameplay). Finally, the stimuli were resized from 640×480 pixels (widthxheight) to 128×128 pixels.

We used data augmentation techniques commonly used in other image classification training paradigms to improve the generalization ability of the models. Specifically, for each batch of training data, each image was independently and randomly translated by a small amount. The size of the translation (in pixels) was drawn from a uniform distribution on the interval [0,1] (vertical) and [0,2] (horizontal), rounded to the nearest pixel (horizontal and vertical translations were sampled independently). We also applied a variation of a random elastic image deformation [68] as implemented in the Augmax Python package [26]. Specifically, we used the Warp transformation in Augmax and set the strength parameter to 3 and the coarseness parameter to 32. The transformation was applied to each image independently with a probability of 0.75.

Task-optimized models

The task-optimized CNN models were trained by minimizing the cross-entropy loss function, where the correct label was defined as the direction of the target bird. Other than the loss function, all aspects of the training process were the same as those used for the VAM (CNN architecture, optimizer settings, initialization, etc.). For each VAM we trained, we trained one task-optimized model using the same training data (n = 75 task-optimized models).

Models trained on separate RT quantiles

For the conditional accuracy function and delta plot results presented in Fig. S4C-D, we trained models on data from n = 10 randomly selected participants using a modified training paradigm. For each participant, we calculated the quintiles (i.e. the 0.2, 0.4, 0.6, and 0.8 quantiles) of their RT distribution and trained a separate VAM on the trials within each RT quantile bin (five models per participant). As for the VAMs used in the original training paradigm, 65% of the data within each RT quantile bin was used for training, 15% was used to monitor model training, and the remaining 20% was designated as holdout data used to evaluate model performance. We generated RTs and choices from the holdout stimuli for each of the five models from a given participant, then combined these outputs to compare with the participant’s data.

For a subset of attempted training runs (7/57 = 12.3%, 7 different participants), the model failed to converge (near chance accuracy for 6/7 models, 69% accuracy for 1/7 models). In these cases, we trained a new model from the same participant and RT bin, but used a different random seed to initialize the model parameters and partition the training set vs. holdout set (all such retrained models converged successfully). We did not observe a clear association between training success rate and RT quantile bin (number of failed models in RT quantile bins 1-5: 0, 2, 1, 2, 2).

Analysis methods: overview

We analyzed all of the VAMs after 12,800 parameter update steps (corresponding to the 200th training epoch), by which point training had converged. We analyzed all of the task-optimized models after 1,600 parameter update steps (the 25th training epoch) since these models converged much more quickly. The VAMs trained on data from a single RT quantile bin were all analyzed after 6,500 parameter update steps (the 500th training epoch, given the smaller training set size) since these models also converged more quickly than those trained with the entire dataset. All analyses were conducted using a holdout set of n = 5000 LIM stimuli/RTs/choices (i.e., data that was not used to train the models). To generate RTs and choices from the trained models, the LIM stimuli from the holdout set were provided as inputs to the CNN, which output mean drift rates for each stimulus. We denote the drift rate mean for the kth accumulator on trial i as . We used the LBA model to sample RTs and choices using these mean drift rates, where the drift rate for the kth accumulator on trial i, , is sampled from a normal distribution with mean and SD = 1. For a very small fraction of trials (mean ± s.e.m. percent excluded trials: 0.080 ± 0.011%, n = 75 models), all of the sampled drift rates were negative, and thus the RT and choice were undefined. These trials were excluded from all analyses.

All error bars shown in the figures correspond to bootstrap 95% confidence intervals calculated using 1000 bootstrap samples.

Behavior analyses

The RT congruency effect was defined as the mean RT on incongruent trials minus the mean RT on congruent trials, where only correct trials were included in the calculation. The accuracy congruency effect was defined as the accuracy on congruent trials minus the accuracy on incongruent trials.

To calculate the RT delta plots for a given model/participant, we first calculated the RT deciles (0.1, 0.2, …, 0.9 quantiles) separately for congruent and incongruent trials, using only correct trials. Within each RT decile, we then calculated the RT congruency effect and mean RT (average of congruent and incongruent mean RTs), forming a delta plot for that participant/model. These mean RTs and congruency effects were then averaged across participants.

To calculate the conditional accuracy functions for a given model/participant, we calculated the accuracy and mean RT of trials within each RT quintile (0.2, 0.4, 0.6, and 0.8 quantiles) separately for congruent and incongruent trials. As above, these measures were then averaged across participants.

The models/participants included in the analysis of how RT varied with stimulus layout and horizontal/vertical position were those that exhibited significant modulation of RT by one or more of these stimulus features. Significant modulation was determined by running an ANOVA on the RTs in each stimulus feature bin (e.g. for each layout or each horizontal position bin), using a threshold p-value of 0.05. For horizontal stimulus position, we used bins of width = 50 pixels in the original 640×480 pixel window space, except for the leftmost and rightmost bins which had width = 60 pixels. For vertical stimulus position, we used bins of width = 25 pixels.

The number of participants exhibiting significant modulation of RT was 60 for stimulus layout, 72 for horizontal position, and 69 for vertical position (out of 75 total). To determine whether a given participant exhibited significant modulation of accuracy by a given stimulus feature, we used a chi-squared test for equality of proportions across stimulus feature bins (threshold p-value = 0.05). No participants exhibited significant modulation of accuracy for any of the stimulus features.

LBA parameters

For analyses of the fitted LBA parameters t0, b, and A, we used the maximum a posteriori (MAP) estimates of these parameters, corresponding to the mean parameter vector of the learned Gaussian posterior density q(θ; ϕ). For analyses of the mean target and flanker drift rates, we provided the LIM stimuli from the holdout set as inputs to the fitted CNN, which generates the mean drift rates as its outputs. The non-target/non-flanker (other) drift rates were calculated by averaging the drift rates from the two non-target/non-flanker accumulators on incongruent trials or the three non-target accumulators on congruent trials.

CNN features

The model features used in the analyses of CNN representations were derived from the unit activations in each layer elicited by the holdout image set. The activations were processed just after the ReLU nonlinearity. The features in the convolutional layers are defined as the maximum value of each channel (i.e., the maximum was calculated across the spatial dimensions) [27], yielding a N × Kl feature matrix, where N is the number of images in the holdout set (5000) and Kl is the number of active channels in layer l. For the fully-connected layer, the features were defined as the N × Kl matrix of activations.

Only incongruent trials were used for all of the analyses involving the CNN features described below, for the following reasons. For the selectivity and tolerance analyses in which we decoded target or flanker direction from the model features in each layer, described in more detail below, note that the target and flanker directions are by definition the same on congruent trials. As such, a classifier trained to decode target direction from congruent trials could achieve perfect accuracy using information in the model features attributable to the flankers, rather than targets. Using only incongruent trials ensured that such cross-contamination could not occur. We used only incongruent trials for all of the other analyses of the model features simply for convenience.

Stimulus feature decoding

To assess how well particular stimulus features could be decoded from the activity in each layer, we trained a linear SVM to classify the values of that stimulus feature using the N × Kl activation (feature) matrix in each layer. The activation matrix was standardized before training the classifiers, such that each column had zero mean and unit variance. Horizontal and vertical stimulus positions were discretized to enable classification using the same bins as were used in the behavioral analyses.

The classification task was done in a standard “one-vs-rest” setting: for each value of a given stimulus feature, one sub-classifier was trained on a binary classification task with the chosen value as one class and all other values as the other class, yielding one classifier for each value of the stimulus feature (e.g. four for target direction, seven for stimulus layout). To determine the decision of the combined classifier for a given image, we generated predictions from each sub-classifier and assigned the decision to the sub-classifier with the largest (most confident) prediction. We assessed the overall decoding accuracy of each SVM on a separate test image set. Note that both the training set and test set for the SVMs were subsets of the holdout image set (i.e., the images that were not used to train the VAMs). The SVMs were trained using the LinearSVC model in the scikit-learn Python package [54] with the squared hinge loss function and L2 regularization with the penalty parameter C set to 1.0 (the default settings).

Tolerance

To assess the tolerance of model representations to variation in a given stimulus feature (flanker direction, stimulus layout, horizontal position, and vertical position), we trained a linear SVM to classify target direction using stimuli with the chosen stimulus feature fixed to one value (the training context), and assessed the generalization performance of the SVM on stimuli that contained all other values of that stimulus feature (the generalization context). For example, to assess tolerance to stimulus layout, we trained one SVM to classify target direction using stimuli with the vertical line layout, and assessed the generalization performance of that classifier on stimuli with the six other layouts. We trained one such SVM for each of the seven stimulus layouts, and averaged the generalization performance across these seven SVMs to derive the overall generalization performance measure for a given model and network layer. Other details of the classifiers were the same as those used for the decoding analyses described above.

Target/flanker subspace alignment. To calculate the target/flanker subspace alignment metric, we first defined target and flanker subspaces using the SVM classifiers that we trained for the target/flanker decoding analyses. Specifically, for each of the four classifiers, which were trained to classify stimuli as a given target or flanker direction vs. all other target or flanker directions using the model features from a given layer, we extracted the vector orthogonal to the decision hyperplane. In agreement with prior work [4], [40], we refer to these vectors as decoding vectors for a given target/flanker direction. Let and denote the target decoding row vector and flanker decoding row vector for the kth direction, respectively. We define the matrices formed with these four decoding vectors filling the rows as the target subspace matrix Xtarg and the flanker subspace matrix Xflnk. These matrices have dimensions 4 × Kl, where Kl is the number of feature dimensions for layer l. Each matrix therefore spans a subspace of the full Kl-dimensional feature space. Our goal is to determine whether the target and flanker subspaces are orthogonal.

To do so, we employ principal angles between subspaces [31], [88], which generalizes the more intuitive notion of angles between lines or planes to arbitrary dimensions. To calculate the principal angles, we require orthonormal bases for the target and flanker subspaces, which we determine using a reduced singular value decomposition (SVD):

The rows of the 4 × Kl matrices and form an orthonormal basis for the target and flanker subspaces, respectively. The cosines of the principal angles are given by the singular values of . The average of these singular values is our subspace alignment metric, which ranges from zero (completely orthogonal subspaces) to one (completely aligned/parallel subspaces).

Participation ratio

We measured the dimensionality of target representations for a given layer with the participation ratio (PRl) [21], defined as:

where Kl is the number of features in layer l and λ1 ≥ … ≥ λi ≥ … ≥ λKl are the eigenvalues of the targetcentered feature covariance matrix for layer l. The target-centered features were obtained by subtracting the centroid of the feature matrix for each target direction from the corresponding trials in the feature matrix [56].

The participation ratio is a continuous measure of dimensionality ranging from 1 to Kl. The minimum (PRl = 1) is obtained when all of the feature variance is concentrated in a single dimension, such that λi = 0 for i ≥ 2. The maximum (PRl = Kl) is obtained when the variance is evenly spread across the Kl feature dimensions, such that all Kl eigenvalues are equal.

Mutual information

The activity of each unit in response to the holdout image set was discretized into 10 equally-sized bins, yielding a unit-specific activation distribution pX (x). Stimulus features (horizontal/vertical position, layout, target/flanker direction) were discretized if the values were continuous, or used as is if not, yielding a stimulus feature distribution pY (y). Discretization of the continuous variables was done as described in the decoding methods section. The joint probability mass function of the unit activity and stimulus feature is denoted by p(X,Y) (x, y). For a given unit and stimulus feature, the mutual information is given by:

The mutual information for a given stimulus feature was normalized by the entropy of the feature distribution to facilitate comparisons between the features [47]. The entropy is given by:

Single-unit modulation by target direction

To identify units that were modulated by target direction, we ran a one-way ANOVA on the z-scored activity of each unit, where each group in the ANOVA was determined by the activity of that unit in response to stimuli for a given target direction (the activity of each unit was z-scored across the stimuli). Units with an ANOVA p-value < 0.001 were defined as significantly modulated by target direction. These units were further split into three subtypes based on their degree of selectivity and sign of modulation: selective (+), selective (-), and complex units.

We first selected units that had significantly higher or lower activation for one direction relative to the other three directions, as assessed by Tukey’s HSD test (p < 0.05). Within this population, some units had both significantly higher activation for one direction relative to the other three and significantly lower activation for one direction relative to the other three. Units that did vs. did not have this property were handled separately. In the former (simpler) case, the units that only had significantly higher activation for one direction relative to the other three were defined as selective (+) units; the units that only had significantly lower activation for one direction relative to the other three were defined as selective (-) units.

In the latter (more complicated) case, we compared the magnitude of the activation for the positive and negative modulation directions with a rank-sum test. Units for which the magnitude of activity for the positive modulation direction was significantly greater (p < 0.05) than the magnitude of activity for the negative modulation direction were defined as selective (+) units. Analogous criteria were used to define selective (-) units (with the signs reversed). Units within this pool with rank-sum p-value > 0.05 were defined as complex units. Finally, units that were significantly modulated by target direction but that did not meet any of the criteria described above were also defined as complex units.

Data and code availability

All of the code (https://github.com/pauljaffe/vam) and data (https://doi.org/10.5281/zenodo.10775514) used to train the VAMs and reproduce our results are publicly-available without restrictions.

Acknowledgements

We thank M. Steyvers, M. Robinson, D. Yamins and members of the NeuroAILab, the Lumos Labs research team, and members of the Poldrack Lab for helpful conversations and comments on this work. We also thank A. Kaluszka for providing graphics files of the Lost and Migration stimuli. The authors received no specific funding for this work.

Author contributions

P.I.J. designed research, performed research, and wrote the paper. P.I.J. and G.X.S.R. analyzed data. P.I.J., G.X.S.R., R.J.S., P.G.B., and R.A.P. edited the paper. R.J.S. and R.A.P. provided resources. R.J.S., P.G.B., and R.A.P. supervised research and provided input at all stages.

Competing interests

R.J.S. is employed by Lumos Labs, Inc. and owns stock in the company. P.I.J., G.X.S.R., P.G.B., and R.A.P. have no competing interests.

Supporting Information

Example model/participant RT distributions and dependence of RTs on stimulus features.

A) Example model/participant RT distributions (all trials). B) Examples of model/participant mean RT vs. stimulus layout. C) Examples of model/participant mean RT vs. horizontal stimulus position (negative values: left of center). D) Examples of model/participant mean RT vs. vertical stimulus position (negative values: above center). For all panels, error bars correspond to bootstrap 95% confidence intervals.

Age dependence of LBA parameters.

For all panels, we tested age-dependence with a oneway ANOVA and report Bonferroni-adjusted p-values, corrected for 4 comparisons (n = 75 models). We also report adjusted p-values from a post-hoc comparison of the 20–29 vs. 70–89 age groups conducted with Tukey’s HSD. Error bars correspond to bootstrap 95% confidence intervals. A) Non-decision time parameter t0 (F (5, 69) = 13.3, p < 1e-7). Tukey’s HSD for 20–29 vs. 70–89 age groups: p < 1e-8. B) Response caution (b - A; F (5, 69) = 0.49, p = 1.0). C) Mean target drift rate (F (5, 69) = 3.4, p = 0.026). Tukey’s HSD for 20–29 vs. 70–89 age groups: p = 0.002. D) Mean flanker drift rate (F (5, 69) = 0.72, p = 1.0).

Dependence of RTs on stimulus layout and position.

For each participant/model, we calculated the mean RT in each stimulus feature bin, then subtracted the average of these mean RTs from each bin. The panels show the average of these centered RTs across all participants with significant modulation of RT for that particular stimulus feature. For all panels, we conducted a one-way ANOVA for both models/participants and report Bonferroni-adjusted p-values, corrected for 3 comparisons (n = 75 models). We also report results from post-hoc comparisons between select feature bins conducted with Tukey’s HSD. Error bars correspond to bootstrap 95% confidence intervals. A) RT vs. stimulus layout (models: F (6, 53) = 7.43, p < 1e-6, RTs for the vertical line layout were significantly faster (Tukey’s HSD adjusted p-value < 0.05) than RTs from all other layouts except ‘>‘; participants: F (6, 53) = 23.1, p < 1e-22; RTs for the vertical line layout were significantly faster than RTs from all other layouts). B) RT vs. horizontal stimulus position (negative values: left of center; models: F (7, 64) = 16.8, p < 1e-18, RTs for the leftmost and rightmost position bins were significantly slower than RTs from all intermediate position bins; participants: F (7, 64) = 72.6, p < 1e-73; RTs for the leftmost and rightmost position bins were significantly slower than RTs from all intermediate position bins). C) RT vs. vertical stimulus position (negative values: above center; models: F (5, 66) = 17.2, p < 1e-14, RTs for the topmost and bottommost position bins were significantly slower than RTs from the two centermost position bins; participants: F(5, 66) = 113.3, p < 1e-74; RTs for the topmost and bottommost position bins were significantly slower than RTs from the two centermost position bins).

RT delta plots and conditional accuracy functions.

A) RT delta plots for participants and models trained with the standard training paradigm (all RT data; n = 75 models/participants). B) Conditional accuracy functions for participants and models trained with the standard training paradigm. C) RT delta plots for participants and models trained separately on data in each RT quantile, then combined (n=10 combined models/participants). D) Conditional accuracy functions for participants and models trained separately on data in each RT quantile, then combined. For all panels, error bars correspond to bootstrap 95% confidence intervals.

Activity of all selective (+) units for one example model.

Each row shows the activity of one unit for 100 randomly selected stimuli, sorted by target direction. The activity of each unit was centered and normalized by the activity of the stimulus with the largest magnitude activation. The small number of selective (+) units in layer Conv1 are not shown.

Absence of correlation between flanker suppression metrics and congruency effects.

All panels show the Pearson’s correlation coefficient between the specified suppression and behavior metrics, calculated across models. A) Flanker direction decoding accuracy vs. accuracy congruency effect. B) Mutual information for flanker direction conveyed by single units vs. accuracy congruency effect. C) Flanker direction decoding accuracy vs. RT congruency effect. D) Mutual information for flanker direction conveyed by single units vs. RT congruency effect. For all panels, n = 75 models, error bars correspond to bootstrap 95% confidence intervals. Asterisks indicate a significant Pearson’s r (adjusted p-value < 0.05, permutation test with n = 1000 shuffles, Bonferroni correction for 7 comparisons).

Absence of correlation between target/flanker subspace alignment and RT congruency effect.

Pearson’s correlation coefficient between target/flanker subspace alignment and RT congruency effect across models (n = 75 models, error bars correspond to bootstrap 95% confidence intervals). The correlation was not significant for any layer (adjusted p-value > 0.05, permutation test with n = 1000 shuffles, Bonferroni correction for 7 comparisons).

Additional analysis of VAMs and task-optimized models.

A) Normalized mutual information for stimulus layout and horizontal/vertical stimulus position conveyed by single units, averaged across units. Mutual information was normalized by the entropy of the corresponding stimulus feature distribution. B) Decoding accuracy of stimulus layout and horizontal/vertical stimulus position. All panels show the average across n = 75 task-optimized models and n = 75 VAMs; error bars correspond to bootstrap 95% confidence intervals. The VAM data shown in panels A and B is the same as that shown in Figs. 4C and 4B, respectively.