Using recurrent neural network to estimate irreducible stochasticity in human choice behavior

eLife assessment

In this study, Ger and colleagues present a valuable new technique that uses recurrent neural networks to distinguish between model misspecification and behavioral stochasticity when interpreting cognitive-behavioral model fits. Simulations provide solid evidence for the validity of this technique and broadly support the claims of the article, although more work is needed to understand its applicability to real behavioral experiments. This technique addresses a long-standing problem that is likely to be of interest to researchers pushing the limits of cognitive computational modeling.

https://doi.org/10.7554/eLife.90082.3.sa0

Significance of the findings:

Valuable: Findings that have theoretical or practical implications for a subfield

Landmark
Fundamental
Important
Valuable
Useful

Strength of evidence:

Solid: Methods, data and analyses broadly support the claims with only minor weaknesses

Exceptional
Compelling
Convincing
Solid
Incomplete
Inadequate

During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife Assessments

Abstract
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Theoretical computational models are widely used to describe latent cognitive processes. However, these models do not equally explain data across participants, with some individuals showing a bigger predictive gap than others. In the current study, we examined the use of theory-independent models, specifically recurrent neural networks (RNNs), to classify the source of a predictive gap in the observed data of a single individual. This approach aims to identify whether the low predictability of behavioral data is mainly due to noisy decision-making or misspecification of the theoretical model. First, we used computer simulation in the context of reinforcement learning to demonstrate that RNNs can be used to identify model misspecification in simulated agents with varying degrees of behavioral noise. Specifically, both prediction performance and the number of RNN training epochs (i.e., the point of early stopping) can be used to estimate the amount of stochasticity in the data. Second, we applied our approach to an empirical dataset where the actions of low IQ participants, compared with high IQ participants, showed lower predictability by a well-known theoretical model (i.e., Daw’s hybrid model for the two-step task). Both the predictive gap and the point of early stopping of the RNN suggested that model misspecification is similar across individuals. This led us to a provisional conclusion that low IQ subjects are mostly noisier compared to their high IQ peers, rather than being more misspecified by the theoretical model. We discuss the implications and limitations of this approach, considering the growing literature in both theoretical and data-driven computational modeling in decision-making science.

Introduction

Humans’ behavior is thought to arise from a set of multidimensional, complex, and latent cognitive processes. Theoretical computational models allow researchers to handle this complexity by putting forward a set of mathematical descriptions that map assumed latent cognitive processes to observed behavior (Daw, 2011; Smith and Ratcliff, 2004). By fitting a theoretical model (also sometimes termed ‘symbolic’ models) to the observed behavior, the researcher is able to draw conclusions regarding the estimation and architecture of the latent cognitive processes (Wilson and Collins, 2019). While this approach has already led to substantial scientific findings (Montague et al., 2012; Rescorla, 1972), it still faces a major challenge since the true underlying model is always left unknown (Box, 1979). Specifically, theoretical computational models usually stem from strict theoretical assumptions that can differ from the true cognitive mechanism that led to the observed behavior, a problem termed ‘model misspecifications’ (Beck et al., 2012; Nassar and Frank, 2016). Moreover, the issue of model misspecification is even more critical when considering that most studies in the field focus on group differences and therefore might neglect individual variation in the expressed models (Stephan et al., 2009; Rigoux et al., 2014). In the current study, we address the fundamental methodological problem of theoretical model misspecification by using well-established theory-independent models, a family of highly flexible models that require no prior theoretical specification.

Imagine that you are working on a theoretical question that you have addressed well by assembling a computational model that, after much effort, can replicate and mimic empirical observations. Still, some data will not be accurately predicted by the model, leaving an open question regarding the possibility that better specification of the model (i.e., changing some mechanisms, adding additional processes) would better capture the pattern in your observations. Currently, to address the issue of model misspecification, methodology in theoretical computational science calls for researchers to perform a model comparison analysis (Wilson and Collins, 2019). Here, the researcher is encouraged to put forward a set of different candidate models and quantify which model is most plausible given the observed data (Palminteri et al., 2017). Yet, even after a rigorous process of model comparison, the best-fitting model will still have what seems like room for improvement. This leaves an unanswered question regarding the possibility that yet another different model that was not described by the researcher might better explain and predict the observations (Eckstein et al., 2021; McElreath, 2020).

The distance in terms of variance between observed and predicted model behavior is typically termed the predictive gap. The predictive gap between individuals’ behavior and model prediction can be attributed to two factors: The first is model misspecification, as mentioned above. Here, the individual’s true behavior-generating model is different from the one suggested by the researcher (Beck et al., 2012; Nassar and Frank, 2016). The second factor is stochasticity, which refers to true noise or the natural randomness in human behavior (Faisal et al., 2008; Findling et al., 2019). The underlying assumption is that some variance in behavior is unpredictable and therefore represents irreducible variance. The assumption of irreducible variance is well accepted across scientific disciplines (Griffiths and Schroeter, 2018), and it suggests that even Laplace’s demon (Gleick, 2011), a metaphorical demon that is assumed to have access to all mechanisms and processes in the universe, would not be able to produce perfect predictions. Both model misspecification and irreducible noise influence the predictive gap, yet identifying the contribution of each factor may lead to different implications. A predictive gap attributed mostly to model misspecification suggests that the researcher needs to increase the space of candidate models to further reduce the gap. However, a predictive gap attributed mostly to irreducible noise suggests that such improvement is mostly out of reach.

More recently, data-driven approaches based on neural networks have emerged as an alternative modeling paradigm in cognitive research (Dezfouli et al., 2019b; Song et al., 2021). These networks, which are models trained to learn directly from data rather than relying on theoretical assumptions about human behavior, have been shown to surpass typical theoretical models in pure action prediction (Dezfouli et al., 2019b; Song et al., 2021). The key property of these models is that a priori, the models’ parameters do not directly map to some underlying cognitive process. Instead, the free parameters (i.e., weights) are iteratively adjusted during training to improve the network’s predictive capability for a desired objective function (LeCun et al., 2015). Furthermore, unlike classical theoretical models, neural networks are overparameterized models that include a large number of learnable free parameters. This property allows the network high flexibility and the ability to approximate a wide range of functions (Siegelmann and Sontag, 1992; Hornik et al., 1989), including functions believed to arise from human cognition (Barak, 2017). For example, Dezfouli et al., 2019b trained a recurrent neural network (RNN) to predict future human actions in a two-armed bandit task. Their study indicated that the RNN is capable of surpassing baseline RL models in action prediction. In another work, Fintz et al., 2022 trained an RNN model to predict future human actions in a four-armed bandit task. They showed that the RNN was able to capture atypical behaviors such as choice alternations and win-shift-lose-stay, which common cognitive models failed to capture. Finally, Song et al., 2021 trained an RNN model to predict the choices of humans in a reinforcement learning (RL) task that included a rich space of possible states and actions. They showed that the RNN was able to outperform the best-known cognitive model.

However, the high flexibility of neural networks also comes with the disadvantage that these networks are often considered black box models. Despite attempts to interpret the networks’ latent space (Dezfouli et al., 2019a), it is still not clear how to efficiently interpret behavior using them, which is a major goal in cognitive research (Hasson et al., 2020; Yarkoni and Westfall, 2017). In this study, we aim to leverage the network’s high flexibility and predictive capability to address the problem of identifying misspecification in theoretical computational models, using two different estimates.

Predictive performance

We assume that, for a fixed dataset, the flexibility of neural networks would enable them to capture the mapping between independent variables and dependent variables, up to a point where the remaining predictive gap is primarily due to noise rather than model misspecification (see Figure 1a). Therefore, we hypothesize that across different true generative models, neural networks will reach a predictive gap that closely resembles the predictive gap that remains when the true generative model is known. As a result, theory-independent models can serve as an upper benchmark against which the new data can be predicted based on previous observations. If, for some individuals, the theoretical model is severely misspecified, we expect to observe a significant improvement in our ability to predict unseen data using theory-independent models. However, if an individual is simply exhibiting a noisier pattern of behavior, we anticipate little to no improvement when using theory-independent models.

Figure 1

Download asset Open asset

Hypothetical illustration for using theory-independent models to explore the fit of theoretical models to human behavior.

(a) Predictive performance – we illustrate the predictive gap for three hypothetical models: first, the true data-generating model (i.e., forever unknown to the researcher; green) where the remaining gap is due to an irreducible noise component in the individual’s behavior. Second, a hypothetical alternative theoretical model (e.g., specified by a researcher; purple), where the remaining predictive gap reflects both irreducible noise and model misspecification. Finally, a hypothetical theory-independent model (turquoise) is assumed to reflect a predictive gap that is mainly due to model misspecification compared with the alternative model, yet does not provide a clear theoretical interpretation. We, therefore, assume that theory-independent models can be used to inform researchers about the amount of improvement that can be further gained by assembling additional theoretical models. (b) Point of early stopping – when training a network, we can examine its performance against a validation set as a function of the number of epochs used for training the parameters (x-axis). The point of early stopping reflects the maximum number of training epochs with the best predictive performance, just before the network starts to overfit the data (i.e., learn patterns that are due to noise; indicated by a yellow star). Here, we illustrate two hypothetical learning curves reflecting the point of early stopping for low-noise (purple line) and high-noise (blue line) datasets. Specifically, we illustrate the notion that the point of early stopping can reflect the amount of noise in the data, so a lower point of early stopping reflects noisier data (considering a fixed number of observations for the two datasets and network size).

Point of early stopping

Neural networks are prone to overfitting, where they fit too closely to the training data and fail to generalize to new datasets (LeCun et al., 2015). To mitigate overfitting, a common technique called early stopping is used (Bishop and Nasrabadi, 2006). In early stopping, the network is trained for multiple epochs, and the predictive performance on a validation set is monitored. Initially, the performance improves as the number of epochs increases, but there comes a point where the network starts learning patterns that reflect noise rather than true underlying patterns in the data. The point of early stopping is determined as the epoch at which the best prediction is achieved. We propose that the point of early stopping largely reflects the amount of noise in the data. With a fixed dataset and network size, an earlier stopping point indicates noisier data (see Figure 1b). The rationale behind this hypothesis is that, under certain realistic assumptions, the probability of stopping at an earlier epoch is higher for a sequence with more noise. This is because the noisy sequence has a lower ratio of true signal to noise, resulting in less information learned by the algorithm over the course of training. Using the point of early stopping to estimate irreducible noise has advantages, including the potential to require fewer data compared to using network predictive accuracy to estimate model misspecification. Estimating the upper bound of the prediction accuracy of the true data generation model using predictive accuracy typically necessitates three sets of data per individual (training, validation, and test). This approach may also require initial auxiliary data for pre-training the recurrent neural network (RNN) to achieve optimal results. However, estimating the point of early stopping can be accomplished with only two sets of data per individual (training and validation). Additionally, this approach may benefit from initializing the network with random weights instead of a pre-trained RNN, as it allows for estimating the maximum possible variance in the optimal epoch estimate among individuals.

In the present study, our aim is to estimate model misspecification in individual participants. We propose a method that utilizes theory-independent models, specifically RNNs, which possess high flexibility in learning complex features from data without manual engineering. In Study 1, we conducted simulations involving three groups, each consisting of $N = 100$ simulated agents performing a two-step RL decision task. Each group of agents followed a specific data-generating model, and the agents differed in the level of true noise present in their actions. We assumed three hypothetical labs, with each lab having knowledge of only one data-generating model, thus misspecifying two-thirds of the agents. We demonstrated that the predictive performance of a pre-trained RNN using three datasets per agent (training, validation, test) could be used to classify whether an agent was misspecified by the lab’s theoretical model. Additionally, we showed that the number of optimal training epochs for an RNN with random weights and fewer data (training and validation, without a test set) could serve as an estimate for comparing the amount of noise in agent data, given a fixed dataset and RNN architecture. Next, in Study 2, we analyzed an empirical dataset with $N = 54$ participants who completed three sessions of a two-step decision task in a lab setting. We examined the fit of a well-known theoretical model (Daw’s hybrid model) to participants with different IQ levels. We found that the theoretical model showed a systematically poorer fit to participants with low IQ compared to those with high IQ. By utilizing RNN predictive performance and optimal epochs estimates, we classified the source of the low theoretical model fit in low IQ participants. Our findings converged to suggest that the percentage of model miss-specification did not differ between low and high IQ individuals. Instead, low IQ individuals exhibited noisier decision-making compared to their high IQ counterparts. We propose that theory-independent models, such as RNNs, can be valuable in classifying model misspecification. However, we acknowledge the limitations of our approach and discuss potential directions for future research.

Results

Study 1: Simulation study

We begin by applying our method to simulated data, where different types of artificial agents governed by distinct generating models performed a two-step task (Daw et al., 2011). This simulation allowed us to evaluate our approach using artificial data, where we knew the true underlying data-generating model and the level of true noise for each agent. Our main objective was to assess the ability of theory-independent models to inform us about the factors contributing to a poorer fit of a theoretical model, specifically model misspecification or irreducible noise. To accomplish this, we introduced three theoretical models (Hybrid model-based/model-free, Habit, and K-dominant hand) and two theory-independent models (logistic regression [LR] and RNN). We generated data from the three theoretical models and fitted all five models to the data from each respective theoretical model. Subsequently, we used the predictive capability of the RNN and LR models, as well as the number of optimal training epochs for the RNN, to estimate and distinguish between model misspecification and stochastic behavior (see ‘Materials and methods’ for more details regarding the task, theoretical and theory-independent models, and model fitting procedure).

Classification of model misspecification using predictive performance

Our first aim was to assess the ability of theory-independent models to predict agent choice data across each theoretical model. For each agent in the artificial dataset and for each cross-validation (CV) round, we used one block (training set) to estimate optimal parameters (separately for each model). Then, the optimal parameters were used to predict the agent’s first-stage choice data on a withheld block (test set; see ‘Materials and methods’ for full details). We repeated this procedure for three rounds and averaged the predictive score over all withheld blocks (i.e., $n L P_{i}^{m}$ ; see Equation 22). Namely, for each agent we obtained five predictive scores, corresponding to the five models mentioned above (Hybrid, Habit, K-DH, RNN, LR). We found that across all theoretical models, RNN achieved the second-best predictive score, second only to the true generative model (see Figure 2a). This suggests that RNN is flexible enough to approximate a wide range of different behavioral models. Moreover, we found that the LR model achieved a poorer predictive score of agent choice data, implying that the model is less expressive compared with RNN.

Figure 2 with 2 supplements see all

Download asset Open asset

Classification of model misspecification using the predictive performance of theory-independent models.

Here, we tested our ability to identify model misspecification of theoretical models using recurrent neural network (RNN) and logistic regression. (a) Across three theoretical models (Hybrid, Habit, K-DH),we simulated 100 agents and predicted agents’ actions using the same three theoretical models, RNN and logistic regression. We present on the y-axis differences in negative log-probability estimates for each fitted model against the true generative model. For example, the left panel depicts the difference in log-probability estimates across all five models for 100 agents simulated using the hybrid model. As expected, across all three generative theoretical models, RNN achieved the best performance score, second only to the true generative theoretical model (black lines represent 95%CI). (b) We calculated a confusion matrix for a hypothetical lab that is familiar only with one theoretical model and uses RNN or logistic regression to try and conclude whether a certain agent shows a high predictive gap due to model misspecification. Each cell represents the average value across three classification rounds. We assumed one lab theoretical model as the true data-generating model in each round. Agents were then classified into two classes (assumed lab model or unknown model) based on the difference in predictive scores between the assumed theoretical model and a theory-independent model. For example, the top-left cell in the left confusion matrix indicates the percentage of agents (averaged over three classification rounds) better predicted by their true data-generating model than by the RNN. Results show good classification by RNN (accuracy ≈ 0.86), and logistic regression (accuracy ≈ 0.70), with better performance for RNN compared with logistic regression. Overall, these results suggest that RNN/logistic regression can be used to some extent to inform researchers regarding model misspecification when using theoretical computational modeling.

Next, we performed a hypothetical experiment in which three hypothetical labs estimated individual performance using a theoretical model. Each hypothetical lab was assumed to be aware of only one data-generating model, namely, ‘Hybrid-lab’, ‘Habit-lab’, and ‘K-DH-lab’. The notion here is that we illustrate a hypothetical situation where unbeknownst to the research lab, subjects are performing the task under different models. For example, the hypothetical Hybrid-lab is assumed to be aware only of the hybrid model and wrongly assumes that all subjects acted according to the Hybrid model. In such a case, all agents with a true generative model of Habit and K-DH models will be misspecified. However, we assumed that the lab has no knowledge regarding these alternative models, and thus will fit the hybrid model to all agents. We were interested in testing the extent to which theory-independent models (RNN, LR) can help such a lab to classify if a certain agent might be better explained by another unknown model compared to the hybrid model. For this aim, we performed three classification rounds, where at each round we assumed one hypothetical lab classified each agent into one of two classes: lab theoretical model or unknown alternative model. The classification was done based on the difference between the $n L P_{i}^{m}$ scores of the assumed theoretical model and the two theory-independent models of each agent. We averaged across all classification rounds and present two confusion matrices, classification by RNN and by LR (see Figure 2b). Like before, we found that RNN achieved a higher classification accuracy of 86% compared to the LR model, which reached only 70% accuracy. This finding supports our claim that RNN can signify if a certain subject might be better explained by another unknown model. Importantly, to illustrate the robustness of our results, we provide a supplementary analysis where the exact same analysis was repeated across 100 and 500 observations per agent and found very similar results (see Figure 2—figure supplement 1). Furthermore, to assert that this effect is not due to averaging across all classification rounds in the supplementary information, we provide the confusion matrices for each hypothetical lab (Hybrid, Habit, K-DH). We found that across all hypothetical labs the RNN achieved a higher true negative rate and a lower true positive rate compared to the LR model (see Figure 2—figure supplement 2).

Using the number of optimal epochs to estimate noise

When fitting an RNN, we estimate the number of optimal training epochs that minimize both underfitting and overfitting. We reasoned that for a fixed number of observation and network parameters, the point of early stopping (henceforward, ‘optimal epochs’) should also reflect the amount of noise/information in the behavioral data. Specifically, we hypothesized that the probability of stopping in an earlier epoch is higher for noisier agents since this probability is determined by the ratio between the true signal learned and the noise. To test our hypothesis, we examined the correlations between the number of optimal epochs and the amount of true noise each agent holds (see Figure 3a). As expected, we found that agents with high levels of true noise in their data also showed lower number of optimal epochs (i.e., required less RNN parameter training) compared with less noisy agents (see Figure 3a; linear correlation coefficient $r = - . 67, - . 74, - . 62$ for the Hybrid, Habit, and K-DH agents, respectively; p<0.001 across all correlation). Importantly, we found that this result holds regardless of the agent’s true underlying data-generating model, suggesting that the point of early stopping may be used as an index for the amount of true noise each participant holds. We performed the same analysis with different network sizes and different number of observations (i.e., trials) per agent and found very similar results (see Figure 3—figure supplement 1). Therefore, we conclude that the number of optimal epochs reflects the amount of information in the observed behavior. Note that the number of optimal epochs is not an absolute estimate since on its own it can be influenced by other factors, including the number of observations and the size of the net.

Figure 3 with 3 supplements see all

Download asset Open asset

Optimal epoch relation to true noise.

For each agent, we iteratively trained the recurrent neural network (RNN) while examining its predictive ability. We then recorded for each agent the number of optimal epochs, which is the number of RNN training iterations that minimized underfitting and overfitting. (a) We found a strong association between the number of optimal RNN training epochs (y-axis) and the agent’s true noise level. This result demonstrates that the number of optimal RNN epochs can be used as a proxy for the amount of information in a certain dataset (given a fixed number of observations and network size). (b) For illustration purposes, we plotted the RNN loss curves for agents with low vs. high levels of noise. Specifically, we present loss curves where for each RNN training epoch (x-axis) we estimate the predictive accuracy using a validation dataset. We estimated the loss curves of 300 artificial agents (from three theoretical models) and divided them into two groups according to their true noise (high vs. low). We show that the point of optimal epoch (early stopping; denoted yellow star) was higher to agents with low internal noise.

Study 2: Empirical study

We now present an application of our method to an existing dataset, where humans performed an identical two-step task as reported in Study 1, at three different time points (Kiddle et al., 2018; see ‘Materials and methods’). We predict participants’ behavior using a well-established hybrid model (see ‘Hybrid model’) and demonstrate that low IQ is associated with a higher predictive gap. We put our method to use and examine whether action prediction of RNN can improve the predictive gap of the theoretical model.

Specifically, we assume that if low IQ participants are more frequently misspecified by the hybrid model, then RNN will show a greater reduction of the predictive gap for low compared with high IQ individuals (see Figure 4a). However, another possibility is that the higher predictive gap of low compared with high IQ participants in the theoretical hybrid model is mostly due to the noisier behavior of the low IQ participants’ part. In that case, RNN should have a similar contribution to the predictive gap, regardless of IQ (see Figure 4b).

Figure 4

Download asset Open asset

Prediction scenarios for the use of recurrent neural network (RNN) to examine model misspecification as a function of IQ.

To demonstrate the ability to use RNN to identify model misspecification, we illustrate two hypothetical associations between IQ and the difference in predictive performance for a theoretical hybrid model vs. RNN. (a) Prediction for a scenario where there is a higher frequency of model misspecification among low vs. high IQ individuals: here, we assume that low IQ participants are more frequently misspecified by the theoretical hybrid model compared with high IQ individuals. Therefore, the prediction here is that for low IQ individuals the RNN will provide higher predictive gap improvement compared with high IQ individuals. (b) Prediction for a scenario where there is equal frequency of model misspecification among low vs. high IQ individuals: here, we assume that the frequency of model misspecification is similar across levels of IQ. Therefore, we predict no association between IQ and the change in predictive accuracy for RNN vs. the theoretical hybrid model.

Furthermore, we estimate the number of optimal epochs for each individual and examine the correlation with IQ. If low IQ participants’ behavior is less predicted by the theoretical model mostly due to them being misspecified, then we should not observe any correlation between the optimal number of epochs and IQ (since across IQ participants are assumed to have a similar amount of information in their behavioral data). However, if the behavior of low IQ participants is noisier compared with high IQ, then we should expect a negative correlation between the number of epochs and IQ.

Predictive performance

We began our investigation by estimating the correlation between IQ and the predictive score of the hybrid model. We found a positive correlation between individuals’ IQ and the hybrid predictive score [see Figure 5a; linear correlation coefficient $r = 0.28$ , p>0.05, 95% CI $(0.014, 0.548)$ ] so that lower IQ was associated with a higher predictive gap. This finding can reflect a systematic misspecification of low IQ individuals by the hybrid model. Alternatively, it might be that low IQ is associated with noisy behavior, which then led to a higher predictive gap. To address this question, we predicted individuals’ actions using RNN and examined whether the association between individuals’ predictive gap and IQ is attenuated when using RNN. If the association between IQ and predictive gap of the theoretical hybrid model is due to higher rates of model misspecification for low compared with high IQ individuals, then RNN would attenuate this association. Specifically, RNN should be able to overcome the misspecification issue the hybrid model might have, thus leading to a similar predictive gap across IQ levels. However, if the association between IQ and predictive gap of the theoretical hybrid model is mostly due to noisy behavior in low compared with high IQ, then we should observe the same correlation between predictive gap and IQ for both the theoretical hybrid model and RNN (see Figure 5b, linear correlation coefficient $r = 0.07$ , p=0.58).

Figure 5 with 1 supplement see all

Download asset Open asset

Empirical association of the predictive score for the theoretical hybrid model, recurrent neural network (RNN), and IQ.

(a) Association between IQ and the predictive score obtained from a theoretical hybrid model, showing that low IQ individuals are associated with lower predictive accuracy (shaded area corresponds to the 95% confidence interval for the regression line). We assumed that if this association is mostly due to higher rates of model misspecification in low compared with high IQ individuals, then RNN should overcome this difficulty and show similar predictive accuracy across both models. (b) Difference in the predictive score for the theoretical hybrid model vs. RNN. Across different IQ scores, we found no significant difference in the predictive score of one of the models over the other. This finding suggests that the lower predictive score of low IQ individuals is not due to systematic model misspecification, but rather due to noisier behavior (shaded turquoise/purple areas signify better predictive score of the Hybrid/RNN models, respectively; shaded dark area corresponds to the 95% confidence interval for the regression line). (c) Posterior distribution for the interaction term from a Bayesian regression, suggesting no effect for the paired interaction of model type (hybrid vs. RNN) and IQ on the model’s predictive score. This null effect suggests that the association between IQ and predictive accuracy was similar for both hybrid and RNN models (soiled/dashed purple lines indicate median/zero, respectively; lower solid black line indicates 95%_HDI). These results suggest that the higher predictive gap of the theoretical model for low compared with high IQ individuals is mostly due to individual differences in the levels of behavioral noise rather than systematic model misspecification.

To examine whether RNN’s predictive accuracy shows an attenuated association with IQ compared to the theoretical hybrid predictive accuracy, we estimated the paired interaction of model type (RNN vs. Hybrid) and IQ on predictive accuracy estimates. Specifically, we fitted a hierarchical Bayesian regression model, where the dependent variable was the individual’s predictive score (measured in negative log probability) that was predicted using the individual’s IQ score, model type (coded as –1 for hybrid and 1 for RNN), and their paired interaction. We found a main effect for IQ, such that higher IQ individuals had higher log probability scores (see Figure 5—figure supplement 1a; posterior median = 0.614; 95%_HDI between 0.022 and 1.208; probability of direction [pd] 97.9%). Moreover, we also found a main effect for model type, such that the RNN model obtained on average a higher predictive score than the hybrid (see Figure 5—figure supplement 1b; posterior median = 9.577; 95%_HDI between 6.94 and 12.2; pd 100%).

Importantly, we did not find support for an interaction effect of IQ × model type on individuals’ predictive scores, suggesting that the association between IQ and predictive accuracy was similar for the RNN and hybrid models (see Figure 5c; posterior median = –0.11; 95% HDI between –0.35 and 0.13; pd 81.1%). This result suggests that RNN did not improve the predictive accuracy of the hybrid model more for low vs. high IQ individuals. Specifically, the lack of interaction between model type and IQ can be taken as evidence that the frequency of model misspecification is not different in low vs. high IQ individuals. To further assert our finding, we now examine the association between IQ and the number of optimal RNN epochs as a proxy for the amount of noise in the individual’s data.

Optimal epoch relation with IQ

The comparison of the predictive performance of the hybrid vs. RNN models suggested that low IQ individuals showed noisier behavior compared to their high IQ peers. To further validate our finding that lower IQ individuals’ choice data is noisier (rather than more frequently misspecified), we examined the number of optimal RNN training epochs for each participant. Here, we trained an individual RNN model for each subject initialized with random weights. We then examined the relationship between an individual’s IQ score and their corresponding number of optimal RNN training epochs (i.e., the point at which we stopped training the RNN to avoid overfitting). We found a significant positive correlation, such that higher IQ individuals had a higher number of optimal training epochs [see Figure 6a; linear correlation coefficient $r = 0.33$ , p<0.05, 95% C (0.072, 0.597)]. To further illustrate this finding, we divided participants into two groups according to their IQ score – low/high (using the median as the threshold). We then recorded the prediction of the validation data of each participant throughout the training procedure and averaged within each of the two groups (see Figure 6b). We found that the optimal epoch of the high IQ group was significantly greater ( $M = 332.80$ , $S D = 178.06$ ) than that of the low IQ group ( $M = 227.34$ , $S D = 206.90$ ; $t (52) = 2.011$ , p<0.05). These findings suggest noisier behavior for low compared to high IQ individuals.

Figure 6

Download asset Open asset

Association between IQ and individual’s optimal epoch estimates.

(a) Relation between the number of optimal recurrent neural network (RNN) training epochs (y-axis) and IQ score (x-axis; shaded area corresponds to the 95% confidence interval for the regression line). We found that IQ significantly correlates with individuals’ optimal RNN epoch estimates, such that lower IQ participants required fewer RNN training epochs to reach the optimal training point. (b) Loss curve of validation data averaged over the low IQ group (IQ < 110; purple) and high IQ group (IQ > 110; turquoise). The point of optimal epoch (early stopping) is denoted by a yellow star. We found that the optimal epoch for the high IQ group was significantly higher than that for the low IQ group. Overall, when combined with our simulation study demonstrating the association between the number of optimal RNN epochs and true noise (for a fixed number of observations and network size; see Figure 3), these results suggest noisier decision-making for low IQ individuals compared to high IQ individuals. This finding, along with the results obtained from the RNN predictive accuracy (see Figure 5), suggests that low IQ individuals are not more frequently misspecified compared to their high IQ peers.

Discussion

Developing a theoretical computational model requires the researcher to state theoretical assumptions regarding the investigated process, assumptions that might differ from the true data-generating process, and thus exposed to the problem of model misspecification. Here, we sought to construct a method that tackles this problem in the context of theoretical computational models of human decision-making behavior. Taking advantage of the high flexibility of theory-independent models (i.e., RNN) that are not theoretically constrained by assumptions of behavior, we proposed a method to indicate if a predictive gap observed in a single-agent choice data is mostly due to model misspecification or rather an inherent stochasticity in the behavior being modeled. We further suggest that the number of training epochs used to reach an optimal balance between under and overfitting for RNN can be used as a relative estimate of noise in the individual behavior.

In Study 1, we validated our approach using simulated data from artificial agents generated by different theoretical models in a two-step multi-armed bandit task. By comparing the predictive performance of the RNN model with each theoretical model, we demonstrated the RNN’s ability to determine whether a predictive gap observed in an individual agent is primarily due to model misspecification or inherent stochasticity in their behavior. Additionally, we showed that the point of early stopping in the RNN training process is strongly associated with the level of true irreducible stochasticity in an agent’s behavior. A higher number of optimal training epochs is indicative of less noise in the agent’s behavior, given a fixed amount of data and network size. One practical advantage of using the point of early stopping is that it can be estimated using only two sets of data (training and validation), unlike predictive accuracy, which requires an additional test dataset.

Next, in Study 2, we applied these methods to an empirical dataset of human participants performing the same two-step task. We found that Daw’s hybrid model (Daw et al., 2011), a well-known theoretical model, exhibited a consistently poorer fit to participants with low IQ compared to those with high IQ. To investigate the source of this poorer fit, we individually fitted an RNN model to each participant and compared the predictive performance of both the hybrid model and the RNN model, as well as the point of early stopping for each participant. We observed that the RNN model showed a similar reduction in behavior prediction for both low and high IQ individuals, consistent with the hybrid model. Additionally, the RNN’s optimal epoch estimates were systematically higher for high IQ individuals. These findings provide preliminary evidence suggesting that the behavior of low IQ participants is characterized by greater inherent noise, rather than being more misspecified by the hybrid model, when compared to their high IQ counterparts.

The use of theory-independent neural network models fitted directly to behavioral data in decision-making tasks has garnered significant attention in recent years (Dezfouli et al., 2019a; Dezfouli et al., 2019b; Fintz et al., 2022; Peterson et al., 2021; Song et al., 2021). Previous research has highlighted the advantages of this approach by demonstrating that RNNs are capable of capturing behavioral features that traditional theoretical models struggle to replicate. In a groundbreaking study by Dezfouli et al., 2019b, it was shown that an RNN model fitted to participants performing a two-armed bandit task outperformed a typical RL model in pure action prediction. This finding has been extended to more complex decision-making tasks such as the four-armed bandit (Fintz et al., 2022) and build-your-own-stimulus tasks (Song et al., 2021). While most studies have primarily focused on the high predictive performance of RNN models, there are also examples of leveraging RNNs to gain theoretical insights. For instance, Dezfouli et al., 2019a developed an autoencoder model utilizing an RNN encoder to map each participant’s behavior into a low-dimensional latent space, which facilitated the study of individual differences and variation in decision-making strategies. In another study, Song et al., 2021 proposed an RNN model trained with an embedding specific to each participant, showing that these embeddings were effective in capturing differences in various cognitive variables. Thus, previous research has demonstrated that RNNs, unconstrained by specific theoretical assumptions, have the ability to capture and predict human behavior, resulting in remarkably low predictive gaps.

We believe that our work contributes to the growing body of research in the field, making four key contributions. First, through simulation, we have validated the assertion that RNNs can achieve nearly optimal fit and approximate various behavioral models without the need for prior theoretical specification. Second, we introduce a novel application of the point of early stopping in RNN training. Traditionally used to prevent overfitting, we demonstrate its utility as a unique individual measure of the amount of inherent noise present in both artificial and human behavioral choice data. Unlike predictive accuracy, which requires group-level pre-training and three datasets per individual (training, validation, and test), the point of early stopping only necessitates two datasets (training and validation) and can be employed without empirical RNN pre-training. This allows for a more flexible estimate of stopping points between individuals. Third, we present an application of the RNN model to human choice data in the context of a multistage two-step decision task, expanding upon previous research that has predominantly focused on single-stage tasks. Lastly, we utilize the RNN model to investigate the relationship between model misspecification and individuals’ intelligence scores.

Our work also addresses the question of inherent randomness or irreducible stochasticity in human behavior. The question of irreducible stochasticity has broader implications not only for cognitive science but also as a fundamental question in general science. We provide provisional evidence suggesting that individuals with lower intelligence scores exhibit noisier decision patterns that cannot be reduced by any model. However, several questions remain unanswered, such as identifying the specific stages of the decision-making process that underlie this behavioral variability. In the current study, we primarily focus on noise that arises from the deliberation process. This aligns with previous studies that have proposed the concept of ‘decision acuity’ (Moutoussis et al., 2021), which represents a dimensional structure in core decision-making strongly related to the inverse temperature parameter (β). It suggests that decision variability originates from differences in reward sensitivities. However, another line of research has proposed the idea of ‘computation noise’, which refers to random noise inherent to the learning process that corrupts the value updating of each action (Findling et al., 2019; Findling and Wyart, 2021). Further studies are needed to elucidate the extent to which these two factors contribute to the noisier decision patterns observed in individuals with lower intelligence. Another open question pertains to identifying the neural correlates and mechanisms underlying this variability. Nevertheless, the significant advantage of the current method is that researchers can use the number of optimal RNN epochs to estimate the amount of noise in observations without specifying a theoretical mechanism. This capability enhances the interpretation and utilization of theoretical models.

Several limitations should be considered in our proposed approach. First, fitting a data-driven neural network is evidently not enough to produce a comprehensive theoretical description of the data generation mechanisms. Currently, best practices for cognitive modeling (Wilson and Collins, 2019) require identifying under what conditions the model struggles to predict the data (e.g., using posterior predictive checks), and describing a different theoretical model that could account for these disadvantages in prediction. However, identifying conditions where the model shortcomings in predictive accuracy are due to model misspecifications rather than noisier behavior is a challenging task. We propose leveraging data-driven RNNs as a supplementary tool, particularly when they significantly outperform existing theoretical models, followed by refined theoretical modeling to provide insights into what processes were misspecified in the initial modeling effort.

Second, although we observed a robust association between the optimal number of epochs and true noise across varying network sizes and dataset sizes (see Figure 3—figure supplement 1), additional factors such as network architecture and other model parameters (e.g., learning rate, see Figure 3—figure supplement 3) might influence this estimation. Further research is required to allow us to better understand how and why different factors change the number of optimal epochs for a given dataset before it can be applied with confidence to empirical investigations.

Third, the empirical dataset used in our study consisted of data collected from human participants at a single time point, serving as the training set for our RNN. The test set data, collected with a time interval of ∼6 and 18 months, introduced the possibility of changes in participants’ decision-making strategies over time. In our analysis, we neglected any possible changes in participants’ decision-making strategies during that time, changes that may lead to poorer generalization performance of our approach. Thus, further studies are needed to eliminate such possible explanations.

Fourth, our simulations, albeit illustrative, were confined to known models, necessitating in silico validation before extrapolating the efficacy of our approach to other model classes and tasks. Our aim was to showcase the potential benefits of using a data-driven approach, particularly when faced with unknown models. However, whether RNNs will provide optimal fits for tasks with more complex rules and long-term sequential dependencies remains uncertain.

Finally, while positive outcomes where RNNs surpass theoretical models can prompt insightful model refinement, caution is warranted in directly equating RNN performance with that of the generative model, as seen in our simulations (e.g., Figure 2). We highlight that our empirical findings depict a more complex scenario, wherein the RNN enhanced the predictive accuracy for all participants uniformly. Notably, we also provide evidence supporting a null effect among individuals, with no consistent difference in RNN improvement over the theoretical model based on IQ. Although it remains conceivable that a different data-driven model could systematically heighten the predictive accuracy for lower IQ individuals in this task, such a possibility seems less probable in light of the current findings.

To sum up, we show that both the predictive gap and the point of early stopping of neural networks can be used to estimate whether a certain predictive gap found for a theoretical model is mostly due to model misspecification or irreducible noise. We hope this work will lead to further studies exploring the utilization of neural networks to enhance theoretical computational models.

Model	Hybrid	Habit	K-DH
Parameter range	$w \sim U (0, 1)$
	$α_{1, 2} \sim U (0, 1)$	$α_{1, 2} \sim U (0, 1)$	$p_{0, 1, 2, 3} \sim U (0, 1)$
	$β_{1, 2} \sim U (0, 10)$	$β_{1, 2} \sim U (0, 10)$
	$λ \sim U (0, 1)$
# agents	100	100	100
# trials	200	200	200
# blocks	3	3	3

Share this article

Cite this article

Hypothetical illustration for using theory-independent models to explore the fit of theoretical models to human behavior.

Classification of model misspecification using the predictive performance of theory-independent models.

Optimal epoch relation to true noise.

Prediction scenarios for the use of recurrent neural network (RNN) to examine model misspecification as a function of IQ.

Empirical association of the predictive score for the theoretical hybrid model, recurrent neural network (RNN), and IQ.

Association between IQ and individual’s optimal epoch estimates.

Two-step task.

Simulation specification for Study 1.

Author details

Yoav Ger

Contribution

For correspondence

Competing interests

Moni Shahar

Contribution

Competing interests

Nitzan Shahar

Contribution

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism