Gated recurrence enables simple and accurate sequence prediction in stochastic, changing, and structured environments

  1. Cédric Foucault
  2. Florent Meyniel (corresponding author)
  1. Cognitive Neuroimaging Unit, INSERM, CEA, Université Paris-Saclay, NeuroSpin center, France
  2. Sorbonne Université, Collège Doctoral, France

Abstract

From decision making to perception to language, predicting what is coming next is crucial. It is also challenging in stochastic, changing, and structured environments; yet the brain makes accurate predictions in many situations. What computational architecture could enable this feat? Bayesian inference makes optimal predictions but is prohibitively difficult to compute. Here, we show that a specific recurrent neural network architecture enables simple and accurate solutions in several environments. This architecture relies on three mechanisms: gating, lateral connections, and recurrent weight training. Like the optimal solution and the human brain, such networks develop internal representations of their changing environment (including estimates of the environment’s latent variables and the precision of these estimates), leverage multiple levels of latent structure, and adapt their effective learning rate to changes without changing their connection weights. Being ubiquitous in the brain, gated recurrence could therefore serve as a generic building block to predict in real-life environments.

Editor's evaluation

There has been a longstanding interest in developing normative models of how humans handle latent information in stochastic and volatile environments. This study examines recurrent neural network models trained on sequence-prediction tasks analogous to those used in human cognitive studies. The results demonstrate that such models lead to highly accurate predictions for challenging sequences in which the statistics are non-stationary and change at random times. These novel and remarkable results open up new avenues for cognitive modelling.

https://doi.org/10.7554/eLife.71801.sa0

Introduction

Being able to correctly predict what is coming next is advantageous: it enables better decisions (Dolan and Dayan, 2013; Sutton and Barto, 1998), a more accurate perception of our world, and faster reactions (de Lange et al., 2018; Dehaene et al., 2015; Saffran et al., 1996; Sherman et al., 2020; Summerfield and de Lange, 2014). In many situations, predictions are informed by a sequence of past observations. In that case, the prediction process formally corresponds to a statistical inference that uses past observations to estimate latent variables of the environment (e.g. the probability of a stimulus) that then serve to predict what is likely to be observed next. Specific features of real-life environments make this inference a challenge: they are often partly random, changing, and structured in different ways. Yet, in many situations, the brain is able to overcome these challenges and shows several aspects of the optimal solution (Dehaene et al., 2015; Dolan and Dayan, 2013; Gallistel et al., 2014; Summerfield and de Lange, 2014). Here, we aim to identify the computational mechanisms that could enable the brain to exhibit these aspects of optimality in these environments.

We start by unpacking two specific challenges which arise in real-life environments. First, the joint presence of randomness and changes (i.e. the non-stationarity of the stochastic process generating the observations) poses a well-known tension between stability and flexibility (Behrens et al., 2007; Soltani and Izquierdo, 2019; Sutton, 1992). Randomness in observations requires integrating information over time to derive a stable estimate. However, when a change in the estimated variable is suspected, it is better to limit the integration of past observations to update the estimate more quickly. The prediction should thus be adaptive, that is, dynamically adjusted to promote flexibility in the face of changes and stability otherwise. Past studies have shown that the brain does so in many contexts: perception (Fairhall et al., 2001; Wark et al., 2009), homeostatic regulation (Pezzulo et al., 2015; Sterling, 2004), sensorimotor control (Berniker and Kording, 2008; Wolpert et al., 1995), and reinforcement learning (Behrens et al., 2007; Iglesias et al., 2013; Soltani and Izquierdo, 2019; Sutton and Barto, 1998).

Second, the structure of our environment can involve complex relationships. For instance, the sentence beginnings "what science can do for you is..." and "what you can do for science is..." call for different endings even though they contain the same words, illustrating that prediction takes into account the ordering of observations. Such structures appear not only in human language but also in animal communication (Dehaene et al., 2015; Hauser et al., 2001; Robinson, 1979; Rose et al., 2004), and all kinds of stimulus-stimulus and stimulus-action associations in the world (Saffran et al., 1996; Schapiro et al., 2013; Soltani and Izquierdo, 2019; Sutton and Barto, 1998). Such a structure is often latent (i.e. not directly observable) and it governs the relationship between observations (e.g. words forming a sentence, stimulus-action associations). These relationships must be leveraged by the prediction, making it more difficult to compute.

In sum, the randomness, changes, and latent structure of real-life environments pose two major challenges: that of adapting to changes and that of leveraging the latent structure. Two commonly used approaches offer different solutions to these challenges. The Bayesian approach makes it possible to derive statistically optimal predictions for a given environment knowing its underlying generative model. This optimal solution is a useful benchmark and has some descriptive validity since, in some contexts, organisms behave close to optimally (Ma and Jazayeri, 2014; Tauber et al., 2017) or exhibit several qualitative aspects of the optimal solution (Behrens et al., 2007; Heilbron and Meyniel, 2019; Meyniel et al., 2015). However, a specific Bayes-optimal solution only applies to a specific generative model (or class of models [Tenenbaum et al., 2011]). This mathematical solution also does not in general lead to an algorithm of reasonable complexity (Cooper, 1990; Dagum and Luby, 1993). Bayesian inference therefore says little about the algorithms that the brain could use, and the biological basis of those computations remains mostly unknown, with only a few, highly debated proposals (Fiser et al., 2010; Ma et al., 2006; Sahani and Dayan, 2003).

Opposite to the Bayes-optimal approach is the heuristics approach: solutions that are easy to compute and accurate in specific environments (Todd and Gigerenzer, 2000). However, heuristics lack generality: their performance can be quite poor outside the environment that suits them. In addition, although simple, their biological implementation often remains unknown (besides the delta-rule [Eshel et al., 2013; Rescorla and Wagner, 1972; Schultz et al., 1997]).

Those two approaches leave open the following questions: Is there a general, biologically feasible architecture that enables, in different environments, solutions that are simple, effective, and that reproduce the qualitative aspects of optimal prediction observed in organisms? If so, what are its essential mechanistic elements?

Our approach stands in contrast with the elegant closed-form but intractable mathematical solutions offered by Bayesian inference, and the simple but specialized algorithms offered by heuristics. Instead, we look for general mechanisms under the constraints of feasibility and simplicity. We used recurrent neural networks because they can offer a generic, biologically feasible architecture able to realize different prediction algorithms (see LeCun et al., 2015; Saxe et al., 2021 and Discussion). We used small network sizes in order to produce simple (i.e. low-complexity, memory-bounded) solutions. We tested their generality using different environments. To determine the simplest architecture sufficient for effective solutions and derive mechanistic insights, we considered different architectures that varied in size and mechanisms. For each one, we instantiated several networks and trained them to approach their best possible prediction algorithm in a given environment. We treated the training procedure as a methodological step without claiming it to be biologically plausible. To provide interpretability, we inspected the networks’ internal model and representations, and tested specific optimal aspects of their behavior—previously reported in humans (Heilbron and Meyniel, 2019; Meyniel et al., 2015; Nassar et al., 2010; Nassar et al., 2012)—which demonstrate the ability to adapt to changes and leverage the latent structure of the environment.

Results

The framework: sequence prediction and network architectures

All our analyses confront simulated agents with the same general problem: sequence prediction. It consists of predicting, at each time step of a sequence (where one time step represents one observation), the probability distribution over the value of the next observation given the previous observations (here we used binary observations coded as ‘0’ and ‘1’) (Figure 1a). The environment generates the sequence, and the agent’s goal is to make the most accurate predictions possible in this environment. Below, we introduce three environments. All of them are stochastic (observations are governed by latent probabilities) and changing (these latent probabilities change across time), and thus require dynamically adapting the stability-flexibility tradeoff. They also feature increasing levels of latent structure that must be leveraged, making the computation of predictions more complex.

Figure 1 with 1 supplement
Problem to solve and network architectures.

(a) Sequence prediction problem. At each time step t, the environment generates one binary observation xt. The agent receives it and returns a prediction pt: its estimate of the probability that the next observation will be one given the observations collected so far. The agent’s goal is to make the most accurate predictions possible. The agent can measure its accuracy by comparing its prediction pt with the actual value observed at the next time step xt+1, allowing it to learn from the observations without any external supervision. (b) Common three-layer template of the recurrent neural network architectures. Input connections transmit the observation to the recurrent units and output connections allow the prediction to be read from the recurrent units. (c) Three key mechanisms of recurrent neural network architectures. Gating allows for multiplicative interaction between activities. Lateral connections allow the activities of different recurrent units i and j to interact. Recurrent weight training allows the connection weights of recurrent units to be adjusted to the training environment. i’ may be equal to i. (d) The gated recurrent architecture includes all three mechanisms: gating, lateral connections, and recurrent weight training. Each alternative architecture includes all but one of the three mechanisms.

How do agents learn to make predictions that fit a particular environment? In real life, agents often do not benefit from any external supervision and must rely only on the observations. To do so, they can take advantage of an intrinsic error signal that measures the discrepancy between their prediction and the actual value observed at the next time step. We adopted this learning paradigm (often called unsupervised, self-supervised, or predictive learning in machine learning [Elman, 1991; LeCun, 2016]) to train our agents in silico. We trained the agents by exposing them to sequences generated by a given environment and letting them adjust their parameters to improve their prediction (see Materials and methods).
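As an illustrative sketch (not the paper’s code), the intrinsic error signal can be scored with the standard Bernoulli log likelihood of each prediction against the next observation, in the spirit of Equation 1:

```python
import numpy as np

def prediction_log_likelihood(predictions, observations):
    """Bernoulli log likelihood of each prediction p_t scored against the
    next observation x_{t+1}: the intrinsic, self-supervised signal that
    can drive learning without external supervision."""
    p = np.asarray(predictions[:-1], dtype=float)  # p_t for t = 0..T-2
    x = np.asarray(observations[1:], dtype=float)  # x_{t+1}
    eps = 1e-12                                    # numerical safety
    return np.sum(x * np.log(p + eps) + (1 - x) * np.log(1 - p + eps))
```

Maximizing this quantity over a training set of sequences (e.g. by gradient ascent on the network parameters) is one standard way to implement the predictive-learning paradigm described above.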

During testing, we kept the parameters of the trained agents frozen, exposed them to new sequences, and performed targeted analyses to probe whether they exhibit specific capabilities and better understand how they solve the problem.

Our investigation focuses on a particular class of agent architectures known as recurrent neural networks. These are well suited for sequence prediction because recurrence allows inputs to be processed sequentially while carrying information over time in recurrent activity. The network architectures we used all followed the same three-layer template, consisting of one input unit whose activity codes for the current observation, one output unit whose activity codes for the prediction about the next observation, and a number of recurrent units that are fed by the input unit and project to the output unit (Figure 1b). All architectures had self-recurrent connections.

We identified three mechanisms of recurrent neural network architectures that endow a network with specific computational properties which have proven advantageous in our environments (Figure 1c). One mechanism is gating, which allows for multiplicative interactions between the activities of units. A second mechanism is lateral connectivity, which allows the activities of different recurrent units to interact with each other. A third mechanism is the training of recurrent connection weights, which allows the dynamics of recurrent activities to be adjusted to the training environment.

To get mechanistic insight, we compared an architecture that included all three mechanisms, to alternative architectures that were deprived of one of the three mechanisms but maintained the other two (Figure 1d; see Materials and methods for equations). Here, we call an architecture with all three mechanisms ‘gated recurrent’, and the particular architecture we used is known as GRU (Cho et al., 2014; Chung et al., 2014). When deprived of gating, multiplicative interactions between activities are removed, and the architecture reduces to that of a vanilla recurrent neural network also known as the Elman network (Elman, 1990). When deprived of lateral connections, the recurrent units become independent of each other, thus each recurrent unit acts as a temporal filter on the input observations (with possibly time-varying filter weights thanks to gating). When deprived of recurrent weight training, the recurrent activity dynamics become independent of the environment and the only parameters that can be trained are those of the output unit; this architecture is thus one form of reservoir computing (Tanaka et al., 2019). In the results below, unless otherwise stated, the networks all had 11 recurrent units (the smallest network size beyond which the gated recurrent network showed no substantial increase in performance in any of the environments), but the results across architectures are robust to this choice of network size (see the last section of the Results).
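To make the three mechanisms concrete, the following NumPy sketch of one GRU step (the standard formulation of Cho et al., 2014; not necessarily the exact implementation used here) shows where each enters: the gates z and r implement gating (multiplicative interactions), the off-diagonal recurrent weights implement lateral connections, and W, U, and b are the parameters adjusted by recurrent weight training:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, W, U, b):
    """One GRU step for a scalar input x and hidden state h of n units.
    W: input weights (each shape (n,)), U: recurrent weights (each (n, n)),
    b: biases (each (n,)). The gates z and r multiply activities (gating);
    off-diagonal entries of the U matrices are the lateral connections."""
    z = sigmoid(W['z'] * x + U['z'] @ h + b['z'])            # update gate
    r = sigmoid(W['r'] * x + U['r'] @ h + b['r'])            # reset gate
    h_cand = np.tanh(W['h'] * x + U['h'] @ (r * h) + b['h']) # candidate
    return (1 - z) * h + z * h_cand

def readout(h, w_out, b_out):
    """Prediction p_t read from the recurrent units by the output unit."""
    return sigmoid(w_out @ h + b_out)
```

Removing the gates recovers a vanilla (Elman) update; constraining the U matrices to be diagonal removes lateral connections; freezing W, U, and b after random initialization yields a reservoir.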

Performance in the face of changes in latent probabilities

We designed a first environment to investigate the ability to handle changes in a latent probability (Figure 2a; see Figure 1—figure supplement 1 for a graphical model). In this environment we used the simplest kind of latent probability: p(1), the probability of occurrence (or base rate) of the observation being 1 (note that p(0) = 1−p(1)), here called ‘unigram probability’. The unigram probability suddenly changed from one value to another at so-called ‘change points’, which could occur at any time, randomly with a given fixed probability.
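The generative process of this environment can be sketched as follows (whether a change point may occur at the very first time step is an assumption of this sketch):

```python
import numpy as np

def generate_unigram_sequence(n_steps, p_change, rng=None):
    """Changing unigram environment: at each time step, a change point
    occurs with fixed probability p_change, redrawing p(1) uniformly in
    [0, 1]; a binary observation is then sampled with probability p(1)."""
    rng = np.random.default_rng(rng)
    p1 = rng.uniform()                      # initial unigram probability
    observations, latent = [], []
    for _ in range(n_steps):
        if rng.uniform() < p_change:        # change point
            p1 = rng.uniform()
        latent.append(p1)
        observations.append(int(rng.uniform() < p1))
    return np.array(observations), np.array(latent)
```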

Gated recurrent networks perform quasi-optimally in the face of changes in latent probabilities.

(a) Sample sequence of observations (dots) and latent unigram probability (line) generated in the changing unigram environment. At each time step, a binary observation is randomly generated based on the latent unigram probability, and a change point can occur with a fixed probability, suddenly changing the unigram probability to a new value uniformly drawn in [0,1]. (b) Prediction performance in the changing unigram environment. For each type of agent, 20 trained agents (trained with different random seeds) were tested (dots: agents; bars: average). Their prediction performance was measured as the % of optimal log likelihood (0% being chance performance and 100 % optimal performance, see Equation 1 for the log likelihood) and averaged over observations and sequences. The gated recurrent network significantly outperformed every other type of agent (p < 0.001, two-tailed two independent samples t-test with Welch’s correction for unequal variances).

This environment, here called ‘changing unigram environment’, corresponds for instance to a simple oddball task (Aston-Jones et al., 1997; Kaliukhovich and Vogels, 2014; Ulanovsky et al., 2004), or the probabilistic delivery of a reward with abrupt changes in reward probabilities (Behrens et al., 2007; Vinckier et al., 2016). In such an environment, predicting accurately is difficult due to the stability-flexibility tradeoff induced by the stochastic nature of the observations (governed by the unigram probability) and the possibility of a change point at any moment.

To assess the networks’ prediction accuracy, we compared the networks with the optimal agent for this specific environment, that is, the optimal solution to the prediction problem determined using Bayesian inference. This optimal solution knows the environment’s underlying generative process and uses it to compute, via Bayes’ rule, the probability distribution over the possible values of the latent probability given the past observation sequence, p(p^env_{t+1} | x_0, ..., x_t), known as the posterior distribution. It then outputs as prediction the mean of this distribution. (For details see Materials and methods and Heilbron and Meyniel, 2019).
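A grid-discretized version of such an ideal observer is straightforward to sketch (the exact timing of the change-point transition relative to the prediction, and the grid resolution, are assumptions here; see Heilbron and Meyniel, 2019 for the exact model):

```python
import numpy as np

def optimal_unigram_agent(observations, p_change, n_grid=100):
    """Grid-based Bayesian filter for the changing unigram environment.
    Maintains a posterior over p(1) and returns, after each observation,
    the posterior mean used as the prediction for the next observation."""
    grid = np.linspace(0.005, 0.995, n_grid)   # discretized values of p(1)
    posterior = np.ones(n_grid) / n_grid       # uniform prior
    predictions = []
    for x in observations:
        # Likelihood of the current observation under each value of p(1)
        posterior *= grid if x == 1 else (1.0 - grid)
        posterior /= posterior.sum()
        # A change point may occur before the next observation:
        # mix the posterior with the uniform redraw distribution
        posterior = (1 - p_change) * posterior + p_change / n_grid
        predictions.append(np.dot(grid, posterior))
    return np.array(predictions)
```

The posterior maintained here is also what defines the precision used later in the paper (its standard deviation can be read off the same grid).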

We also compared the networks to two types of heuristics which perform very well in this environment: the classic 'delta-rule' heuristic (Rescorla and Wagner, 1972; Sutton and Barto, 1998) and the more accurate 'leaky' heuristic (Gijsen et al., 2021; Heilbron and Meyniel, 2019; Meyniel et al., 2016; Yu and Cohen, 2008) (see Materials and methods for details). To test the statistical reliability of our conclusions, we trained separately 20 agents of each type (each type of network and each type of heuristic).
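Both heuristics admit one-line updates. The sketches below use common formulations; the cited papers may parameterize them differently:

```python
import numpy as np

def delta_rule(observations, alpha=0.1):
    """Delta-rule heuristic: update with a fixed learning rate alpha."""
    p, predictions = 0.5, []
    for x in observations:
        p = p + alpha * (x - p)
        predictions.append(p)
    return np.array(predictions)

def leaky_integration(observations, leak=0.9):
    """'Leaky' heuristic: exponentially discounted event counts,
    turned into a probability with Laplace smoothing."""
    n1 = n0 = 0.0
    predictions = []
    for x in observations:
        n1 = leak * n1 + x
        n0 = leak * n0 + (1 - x)
        predictions.append((n1 + 1) / (n1 + n0 + 2))
    return np.array(predictions)
```

Neither rule adapts its integration window to suspected change points, which is precisely the limitation examined in the next section.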

We found that even with as few as 11 units, the gated recurrent networks performed quasi-optimally. Their prediction performance was 99% of optimal (CI ±0.1%), with 0% corresponding to chance level (Figure 2b). Being only 1% short of optimal, the gated recurrent networks outperformed the delta rule and leaky agents, which performed 10 times and 5 times further from optimal, respectively (Figure 2b).

For mechanistic insight, we tested the alternative architectures deprived of one mechanism. Without either gating, lateral connections, or recurrent weight training, the average performance was respectively 6 times, 4 times, and 12 times further from optimal (Figure 2b), that is, the level of a leaky agent or worse. The drops in performance remain similar when considering only the best network of each architecture instead of the average performance (Figure 2b, compare rightmost dots across rows).

These results show that small gated recurrent networks can achieve quasi-optimal predictions and that the removal of one of the mechanisms of the gated recurrent architecture results in a systematic drop in performance.

Adaptation to changes through the adjustment of the effective learning rate

In a changing environment, the ability to adapt to changes is key. Networks exposed to more changing environments during training updated their predictions more overall during testing, similarly to the optimal agent (see Figure 3—figure supplement 1) and, to some extent, humans (Behrens et al., 2007, Figure 2e; Findling et al., 2021, Figure 4c). At a finer timescale, the moment-by-moment updating of the predictions also showed sensible dynamics around change points.

Figure 3a illustrates a key difference in behavior between, on the one hand, the optimal agent and the gated recurrent network, and on the other hand, the heuristic agents: the dynamics of their update differ. This difference is particularly noticeable when recent observations suggest that a change point has just occurred: the optimal agent quickly updates the prediction by giving more weight to the new observations; the gated recurrent network behaves the same but not the heuristic agents. We formally tested this dynamic updating around change points by measuring the moment-by-moment effective learning rate, which normalizes the amount of update in the prediction by the prediction error (i.e. the difference between the previous prediction and the actual observation; see Materials and methods, Equation 2).
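Concretely, if p_t denotes the prediction made after observing x_t (i.e. a prediction about x_{t+1}), the effective learning rate of Equation 2 can be computed as:

```python
import numpy as np

def effective_learning_rate(predictions, observations):
    """Moment-by-moment effective learning rate (cf. Equation 2):
    the prediction update normalized by the prediction error.
    The error is nonzero as long as predictions stay strictly in (0, 1),
    since observations are binary."""
    p = np.asarray(predictions, dtype=float)
    x = np.asarray(observations, dtype=float)
    update = p[1:] - p[:-1]   # p_{t+1} - p_t
    error = x[1:] - p[:-1]    # x_{t+1} - p_t
    return update / error
```

For a delta-rule agent this quantity is constant and equal to alpha by construction, which is why its learning-rate profile in Figure 3b is flat.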

Figure 3 with 1 supplement
Gated recurrent but not alternative networks adjust their moment-by-moment effective learning rate around changes like the optimal agent.

(a) Example prediction sequence illustrating the prediction updates of different types of agents. Within each type of agent, the agent (out of 20) yielding median performance in Figure 2b was selected for illustration purposes. Dots are observations, lines are predictions. (b) Moment-by-moment effective learning rate of each type of agent. 20 trained agents of each type were tested on 10,000 sequences whose change points were locked at the same time steps, for illustration purposes. The moment-by-moment effective learning rate was measured as the ratio of prediction update to prediction error (see Materials and methods, Equation 2), and averaged over sequences. Lines and bands show the mean and the 95 % confidence interval of the mean.

Gated recurrent networks turned out to adjust their moment-by-moment effective learning rate as the optimal agent did, showing the same characteristic peaks, at the same time and with almost the same amplitude (Figure 3b, top plot). By contrast, the effective learning rate of the delta-rule agents was (by construction) constant, and that of the leaky agents changed only marginally.

When one of the mechanisms of the gated recurrence was taken out, the networks’ ability to adjust their effective learning rate was greatly degraded (but not entirely removed) (Figure 3b, bottom plots). Without gating, without lateral connections, or without recurrent weight training, the amplitude was lower (showing both a lower peak value and a higher baseline value), and the peak occurred earlier.

This shows that gated recurrent networks can reproduce a key aspect of optimal behavior: the ability to adapt the update of their prediction to change points, which is lacking in heuristic agents and alternative networks.

Internal representation of precision and dynamic interaction with the prediction

Beyond behavior, we sought to determine whether a network’s ability to adapt to changes relied on idiosyncratic computations or followed the more general principle of precision-weighting derived from probability theory. According to this principle, the precision of the current prediction (calculated in the optimal agent as the negative logarithm of the standard deviation of the posterior distribution over the latent probability, see Equation 3 in Materials and methods) should influence the weight of the current prediction relative to the next observation in the updating process: for a given prediction error, the lower the precision, the higher the subsequent effective learning rate. This precision-weighting principle results in an automatic adjustment of the effective learning rate in response to a change, because the precision of the prediction decreases when a change is suspected.

In line with this principle, human participants can estimate not only the prediction but also its precision as estimated by the optimal agent (Boldt et al., 2019, Figure 2; Meyniel et al., 2015, Figure 4B), and this precision indeed relates to the participants’ effective learning rate (McGuire et al., 2014, Figure 2C and S1A; Nassar et al., 2010, Figure 4C and 3B; Nassar et al., 2012, Figures 5 and 7c).

We tested whether a network could represent this optimal precision too, by trying to linearly read it from the network’s recurrent activity (Figure 4a). Note that the networks were trained only to maximize prediction accuracy (not to estimate precision). Yet, in gated recurrent networks, we found that the read precision on left-out data was highly accurate (Figure 4a, left plot: the median Pearson correlation with the optimal precision is 0.82), and correlated with their subsequent effective learning rate as in the optimal agent (Figure 4a, right plot: the median correlation for gated recurrent networks is –0.79; for comparison, it is –0.88 for the optimal agent).
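This readout analysis amounts to a multiple linear regression from recurrent activity to the optimal log precision, evaluated on left-out data; a minimal sketch (variable names are ours):

```python
import numpy as np

def fit_precision_readout(activity, log_precision):
    """Fit a linear readout (multiple linear regression with intercept)
    from recurrent activity (T, n_units) to the optimal log precision (T,)."""
    X = np.column_stack([activity, np.ones(len(activity))])
    coefs, *_ = np.linalg.lstsq(X, log_precision, rcond=None)
    return coefs

def read_precision(activity, coefs):
    """Apply a fitted readout to (possibly left-out) recurrent activity."""
    X = np.column_stack([activity, np.ones(len(activity))])
    return X @ coefs

def readout_accuracy(read, target):
    """Pearson correlation between the read and optimal precision."""
    return np.corrcoef(read, target)[0, 1]
```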

Gated recurrent networks have an internal representation of the precision of their estimate that dynamically interacts with the prediction following the precision-weighting principle.

(a) Left to right: Schematic of the readout of precision from the recurrent activity of a network (obtained by fitting a multiple linear regression from the recurrent activity to the log precision of the optimal posterior distribution); Accuracy of the read precision (calculated as its Pearson correlation with the optimal precision); Pearson correlation between the read precision and the network’s subsequent effective learning rate (the optimal value was calculated from the optimal agent’s own precision and learning rate); Example sequence illustrating their anti-correlation in the gated recurrent network. In both dot plots, large and small dots show the median and individual values, respectively. (b) Dynamics of the optimal posterior (left) and the network activity (right) in three sequences (green, yellow, and pink). The displayed dynamics are responses to a streak of 1s after different sequences of observations (with different generative probabilities as shown at the bottom). The optimal posterior distribution is plotted as a color map over time (dark blue and light green correspond to low and high probability densities, respectively) and as a line plot at two times: on the left, the time t_start just before the streak of 1s, and on the right, a time t_A/t_B/t_C when the prediction (i.e. mean) is approximately equal in all three cases; note that the precision differs. The network activity was projected onto the two-dimensional subspace spanned by the prediction and precision vectors (for the visualization, the precision axis was orthogonalized with respect to the prediction axis). In the gated recurrent network, the arrow Δp shows the update to the prediction performed in the next three time steps starting at the time t_A/t_B/t_C defined from the optimal posterior.
Like the optimal posterior and unlike the network without gating, the gated recurrent network represents different levels of precision at an equal prediction, and the lower the precision, the higher the subsequent update to the prediction—a principle called precision-weighting. In all example plots (a–b), the displayed network is the one of the 20 that yielded the median read precision accuracy.

To better understand how precision information is represented and how it interacts with the prediction dynamically in the network activity, we plotted the dynamics of the network activity in the subspace spanned by the prediction and precision vectors (Figure 4b). Such visualization captures both the temporal dynamics and the relationships between the variables represented in the network, and has helped understand network computations in other works (Mante et al., 2013; Sohn et al., 2019). Here, two observations can be made.

First, in the gated recurrent network (Figure 4b, second plot from the right), the trajectories are well separated along the precision axis (for the same prediction, the network can represent multiple precisions), meaning that the representation of precision is not reducible to the prediction. By contrast, in the network without gating (Figure 4b, rightmost plot), these trajectories highly overlap, which indicates that the representation of precision and prediction are mutually dependent. To measure this dependence, we computed the mutual information between the read precision and the prediction of the network, and it turned out to be very high in the network without gating (median MI = 5.2) compared to the gated recurrent network (median MI = 0.7) and the optimal agent (median MI = 0.6) (without lateral connections, median MI = 1.3; without recurrent weight training, median MI = 1.9), confirming that gating is important to separate the precision from the prediction.
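The mutual information between two scalar read-outs can be estimated with a simple histogram (plug-in) estimator; this is only a sketch, as the paper does not specify its estimator, units, or binning:

```python
import numpy as np

def mutual_information(a, b, n_bins=20):
    """Histogram-based estimate of the mutual information (in bits)
    between two scalar time series, e.g. the read precision and the
    prediction of a network. Sensitive to the choice of n_bins."""
    joint, _, _ = np.histogram2d(a, b, bins=n_bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b
    nz = pxy > 0                          # skip empty cells
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```

High mutual information between read precision and prediction (as in the network without gating) indicates that the two representations are confounded rather than independent.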

Second, in the gated recurrent network, the precision interacts dynamically with the prediction in a manner consistent with the precision-weighting principle: for a given prediction, the lower the precision, the larger the subsequent updates to the prediction (Figure 4b, vertical dotted line indicates the level of prediction and arrows the subsequent updates).

These results indicate that in the network without gating, precision is confounded with prediction and the correlation between precision and effective learning rate is spuriously driven by the prediction itself, whereas in the network with gating, there is a genuine representation of precision beyond the prediction itself, which interacts with the updating of predictions. However, we have so far only provided correlational evidence; to show that the precision represented in the network plays a causal role in the subsequent prediction update, we need to perform an intervention that acts selectively on this precision.

Causal role of precision-weighting for adaptation to changes

We tested whether the internal representation of precision causally regulated the effective learning rate in the networks using a perturbation experiment. We designed perturbations of the recurrent activity that induced a controlled change in the read precision, while leaving the networks’ current prediction unchanged to control for the effect of the prediction error (for the construction of the perturbations, see Figure 5 bottom left diagram and legend, and Materials and methods). These perturbations caused significant changes in the networks’ subsequent effective learning rate, commensurate with the induced change in precision, as predicted by the principle of precision-weighting (Figure 5, middle plot). Importantly, this causal relationship was abolished in the alternative networks that lacked one of the mechanisms of the gated recurrent architecture (Figure 5, right three plots; the slope of the effect was significantly different between the gated recurrent network group and any of the alternative network groups, two-tailed two independent samples t-test, all t(38) > 4.1, all p < 0.001, all Cohen’s d > 1.3).
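Constructing such a perturbation is a short linear-algebra exercise. Given fitted readout vectors w_pred and w_prec (hypothetical names for the prediction and precision readout weights), the perturbation must be orthogonal to w_pred (leaving the prediction unchanged) while shifting the linear precision readout by exactly delta:

```python
import numpy as np

def build_perturbation(w_pred, w_prec, delta):
    """Perturbation of the recurrent activity that leaves the prediction
    readout unchanged (orthogonal to w_pred) while changing the linear
    precision readout by exactly delta. Assumes w_prec is not parallel
    to w_pred (otherwise no such perturbation exists)."""
    u_pred = w_pred / np.linalg.norm(w_pred)
    # Component of the precision vector orthogonal to the prediction vector
    v = w_prec - (w_prec @ u_pred) * u_pred
    # Scale so that adding the perturbation shifts the read precision by delta
    return delta * v / (v @ w_prec)
```

Adding the returned vector to the recurrent state changes the read precision by delta while keeping the current prediction, and hence the next prediction error, constant.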

Precision-weighting causally determines the adjustment of the effective learning rate in gated recurrent networks only.

Causal test of a network’s precision on its effective learning rate. The recurrent activity was perturbed to induce a controlled change δ in the read precision, while keeping the prediction at the current time step—and thus the prediction error at the next time step—constant. This was done by making the perturbation vector orthogonal to the prediction vector and making its projection onto the precision vector equal to δ (bottom left diagram). We measured the perturbation’s effect on the subsequent effective learning rate as the difference in learning rate ‘with perturbation’ minus ‘without perturbation’ at the next time step (four plots on the right). Each dot (and joining line) corresponds to one network. ***: p < 0.001, n.s.: p > 0.05 (one-tailed paired t-test).

These results show that the gated recurrent networks’ ability to adapt to changes indeed relies on their precision-dependent updating and that such precision-weighting does not arise without all three mechanisms of the gated recurrence.

Leveraging and internalizing a latent structure: bigram probabilities

While the changing unigram environment already covers many tasks in the behavioral and neuroscience literature, real-world sequences often exhibit more structure. To study the ability to leverage such structure, we designed a new stochastic and changing environment in which the sequence of observations is no longer generated according to a single unigram probability, p(1), but according to two ‘bigram probabilities’ (also known as transition probabilities), p(0|0) and p(1|1), which denote the probability of occurrence of a 0 after a 0 and of a 1 after a 1, respectively (Figure 6a; see Figure 1—figure supplement 1 for a graphical model). These bigram probabilities also change randomly, with independent change points.

Figure 6 with 1 supplement
Gated recurrent networks correctly leverage and internalize the latent bigram structure.

(a) Schematic of the changing bigram environment’s latent probabilities (left) and sample generated sequence (right, dots: observations, lines: latent bigram probabilities). At each time step, a binary observation is randomly generated according to the relevant latent bigram probability, p(0|0) or p(1|1) depending on the previous observation. p(0|0) denotes the probability of occurrence of a 0 after a 0 and p(1|1) that of a 1 after a 1 (note that p(1|0) = 1 − p(0|0) and p(0|1) = 1 − p(1|1)). At any time step, each of the two bigram probabilities can suddenly change to a new value uniformly drawn in [0,1], randomly with a fixed probability and independently from each other. (b) Example prediction sequence illustrating each network’s ability or inability to change prediction according to the local context, compared to the optimal prediction (dots: observations, lines: predictions). (c) Prediction performance of each type of agent in the changing bigram environment. 20 new agents of each type were trained and tested as in Figure 2b but now in the changing bigram environment (dots: agents; bars: average). The gated recurrent network significantly outperformed every other type of agent (p < 0.001, two-tailed two independent samples t-test with Welch’s correction for unequal variances). (d) Internalization of the latent structure as shown on an out-of-sample sequence: the two bigram probabilities are simultaneously represented in the gated recurrent network (top), and closely follow the optimal estimates (bottom). The readouts were obtained through linear regression from the recurrent activity to four estimates separately: the log odds of the mean and the log precision of the optimal posterior distribution on p(0|0) and p(1|1). In (b) and (d), the networks (out of 20) yielding median performance were selected for illustration purposes.

This ‘changing bigram environment’ is well motivated because there is ample evidence that bigram probabilities play a key role in sequence knowledge in humans and other animals (Dehaene et al., 2015) even in the face of changes (Bornstein and Daw, 2013; Meyniel et al., 2015).
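As a concrete illustration of this generative process, the following sketch samples a sequence from the changing bigram environment. The change-point probability `p_change` is a free parameter of the environment; the value below is illustrative, not the one used in the paper.

```python
import numpy as np

def generate_bigram_sequence(T, p_change=0.03, seed=0):
    """Binary sequence generated from two latent bigram probabilities,
    p(1|1) and p(0|0), each resampled uniformly in [0, 1] at its own
    independent change points."""
    rng = np.random.default_rng(seed)
    p11, p00 = rng.uniform(), rng.uniform()
    x = [int(rng.uniform() < 0.5)]  # first observation is unbiased
    for _ in range(T - 1):
        # Independent change points for each bigram probability.
        if rng.uniform() < p_change:
            p11 = rng.uniform()
        if rng.uniform() < p_change:
            p00 = rng.uniform()
        # The relevant bigram probability depends on the previous observation.
        p_next_is_1 = p11 if x[-1] == 1 else 1.0 - p00
        x.append(int(rng.uniform() < p_next_is_1))
    return x
```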

We assessed how well the networks could leverage the latent bigram structure after having been trained in this environment. For comparison, we tested the optimal agent for this environment as well as two groups of heuristics: delta-rule and leaky estimation of unigram probabilities (as in Figure 2b), and now also delta rule and leaky estimation of bigram probabilities (see Materials and methods for details).
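As one concrete reading of the delta-rule estimation of bigram probabilities (the exact parameterization used in the paper is given in Materials and methods; the learning rate below is illustrative):

```python
def delta_rule_bigram(seq, lr=0.1):
    """Delta-rule estimates of p(next=1 | previous=k) for k in {0, 1};
    the prediction at each step is the estimate matching the previous
    observation."""
    p = {0: 0.5, 1: 0.5}  # uninformative initial estimates
    predictions = []
    for prev, nxt in zip(seq[:-1], seq[1:]):
        predictions.append(p[prev])
        # Only the estimate conditioned on the observed context is updated.
        p[prev] += lr * (nxt - p[prev])
    return predictions
```

Note the contrast with the unigram heuristics of Figure 2b, which maintain a single estimate regardless of the previous observation and therefore cannot capture context-dependent predictions.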

The gated recurrent networks achieved 98 % of optimal prediction performance (CI ±0.3%), outperforming the heuristic agents estimating bigram probabilities, and even more so those estimating a unigram probability (Figure 6c). To demonstrate that this was due to their internalization of the latent structure, we also tested the gated recurrent networks that had been trained in the changing unigram environment: their performance was much worse (Figure 6—figure supplement 1).

At the mechanistic level, all three mechanisms of the gated recurrence are important for this ability to leverage the latent bigram structure. Not only does performance drop when one of these mechanisms is removed (Figure 6c), but the drop is also much larger than that observed in the changing unigram environment (without gating: –11.2 % [CI ±1.5 % calculated by Welch’s t-interval] in the bigram environment vs. –5.5 % [CI ±0.6%] in the unigram environment; without lateral connections: –18.5 % [CI ±1.8%] vs. –2.9 % [CI ±0.2%]; without recurrent weight training: –29.9 % [CI ±1.6%] vs. –11.0 % [CI ±2.1%]; for every mechanism, there was a significant interaction effect between the removal of the mechanism and the environment on performance, all F(1,76) > 47.9, all p < 0.001).

Figure 6b illustrates the gated recurrent networks’ ability to correctly incorporate the bigram context into their predictions, compared to networks lacking one of the mechanisms of the gated recurrence. While a gated recurrent network aptly changes its prediction from one observation to the next according to the preceding observation, as the optimal agent does, the other networks fail to show such context-dependent behavior, sometimes even changing their prediction away from the optimal agent.

Altogether these results show that gated recurrent networks can leverage the latent bigram structure, but this ability is impaired when one mechanism of the gated recurrence is missing.

Is the networks’ representation of the latent bigram structure impenetrable or easily accessible? We tested the latter possibility by trying to linearly read out the optimal estimate of each of the latent bigram probabilities from the recurrent activity of a gated recurrent network (see Materials and methods). Arguing in favor of an explicit representation, we found that the read estimates of each of the latent bigram probabilities on left-out data were highly accurate (Pearson correlation with the optimal estimates, median and CI: 0.97 [0.97, 0.98] for each of the two bigram probabilities).

In addition to the point estimates of the latent bigram probabilities, we also tested whether a network maintained some information about the precision of each estimate. Again, we assessed the possibility to linearly read out the optimal precision of each estimate and found that the read precisions on left-out data were quite accurate (Pearson correlation with the optimal precisions, median and CI: 0.77 [0.74, 0.78] for one bigram probability and 0.76 [0.74, 0.78] for the other probability).
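Such linear readouts amount to ordinary least-squares decoding. A minimal sketch (variable names are hypothetical: `H` stands for a time × units matrix of recurrent activity and `y` for the optimal estimate being decoded):

```python
import numpy as np

def linear_readout_accuracy(H_train, y_train, H_test, y_test):
    """Fit a linear map (with intercept) from recurrent activity to an
    optimal estimate by least squares, then score the read estimate on
    left-out data by Pearson correlation."""
    X = np.column_stack([H_train, np.ones(len(H_train))])
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    y_read = np.column_stack([H_test, np.ones(len(H_test))]) @ w
    return np.corrcoef(y_read, y_test)[0, 1]
```

Evaluating on left-out data, as here, is what justifies calling a high correlation an explicit representation rather than an overfit of the regression.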

Figure 6d illustrates the striking resemblance between the estimates read from a gated recurrent network and the optimal estimates. Furthermore, it shows that the network successfully disentangles one bigram probability from the other, since the read estimates can evolve independently of each other (for instance, during the first 20 time steps, the estimate of p(1|1) changes while that of p(0|0) does not, since only 1s are observed). It is particularly interesting that both bigram probabilities are simultaneously represented, given that only one of them is relevant for the moment-by-moment prediction read by the network’s output unit (whose weights cannot change during the sequence).

We conclude that gated recurrent networks internalize the latent bigram structure in such a way that both bigram probabilities are available simultaneously, even though only one of the two is needed at any one time for the prediction.

Leveraging a higher-level structure: inference about latent changes

In real life, latent structures can also exhibit different levels that are organized hierarchically (Bill et al., 2020; Meyniel et al., 2015; Purcell and Kiani, 2016). To study the ability to leverage such a hierarchical structure, we designed a third environment in which, in addition to bigram probabilities, we introduced a higher-level factor: the change points of the two bigram probabilities are now coupled, rather than independent as they were in the previous environment (Figure 7a; Figure 1—figure supplement 1 shows the hierarchical structure). Due to this coupling, from the agent’s point of view, the likelihood that a change point has occurred depends on the observations about both bigrams. Thus, optimal prediction requires the ability to make a higher-level inference: having observed that the frequency of one of the bigrams has changed, one should not only suspect that the latent probability of this bigram has changed but also transfer this suspicion of a change to the latent probability of the other bigram, even without any observations about that bigram.
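The only difference between this hierarchical environment and the previous one is whether the two bigram probabilities share their change points. A minimal sketch of the two generative variants (the change probability is illustrative):

```python
import numpy as np

def sample_bigram_probs(T, p_change=0.03, coupled=True, seed=0):
    """Trajectories of the two latent bigram probabilities, with change
    points either shared (coupled) or drawn independently."""
    rng = np.random.default_rng(seed)
    p11, p00 = rng.uniform(), rng.uniform()
    traj = []
    for _ in range(T):
        if coupled:
            # A single change point resamples both probabilities at once.
            if rng.uniform() < p_change:
                p11, p00 = rng.uniform(), rng.uniform()
        else:
            # Each probability has its own change points.
            if rng.uniform() < p_change:
                p11 = rng.uniform()
            if rng.uniform() < p_change:
                p00 = rng.uniform()
        traj.append((p11, p00))
    return traj
```

Under the coupled variant, evidence that one bigram probability has changed is also evidence that the other has, which is precisely what licenses the higher-level inference described above.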

Gated recurrent but not alternative networks leverage a higher-level structure, distinguishing the case where change points are coupled vs. independent.

Procedure to test the higher-level inference: (a) For each network architecture, 20 networks were trained on sequences where the change points of the two latent bigram probabilities are coupled and 20 other networks were trained on sequences where they are independent (the plots show an example training sequence for each case); (b) The networks were then tested on sequences designed to trigger the suspicion of a change point in one bigram probability and measure their inference about the other bigram probability: |p_after − p_before| should be larger when the agent assumes change points to be coupled rather than independent. The plot shows an example test sequence. Red, blue, solid, and dashed lines: as in (c), except that only the gated recurrent network (out of 20) yielding median performance is shown for illustration purposes. (c) Change in prediction about the unobserved bigram probability of the networks trained on coupled change points (red) and independent change points (blue) for each network architecture, averaged over sequences. Solid lines and bands show the mean and the 95 % confidence interval of the mean over networks. Dotted lines show the corresponding values of the optimal agent for the two cases. Only the gated recurrent architecture yields a significant difference between networks trained on coupled vs. independent change points (one-tailed two independent samples t-test, ***: p < 0.001, n.s.: p > 0.05).

Such a transfer has been reported in humans (Heilbron and Meyniel, 2019, Figure 5B). A typical situation is when a streak of repetitions is encountered (Figure 7b): if a long streak of 1s was deemed unlikely, it should trigger the suspicion of a change point such that p(1|1) is now high, and this suspicion should be transferred to p(0|0) by partially resetting it. This reset is reflected in the change between the prediction following the 0 just before the streak and that following the 0 just after the streak (Figure 7b, |p_after − p_before|).

We tested the networks’ ability for higher-level inference in the same way, by exposing them to such streaks of repetitions and measuring their change in prediction about the unobserved bigram before and after the streak. More precisely, we compared the change in prediction of networks trained in the environment with coupled change points to that of networks trained in the environment with independent change points, since the higher-level inference should only be made in the coupled case.

We found that gated recurrent networks trained in the coupled environment changed their prediction about the unobserved bigram significantly more than networks trained in the independent environment, and this was true across a large range of streak lengths (Figure 7c, top plot). The mere presence of this effect is particularly impressive given that the coupling makes very little difference in terms of raw performance (Figure 6—figure supplement 1, the networks trained in either the coupled or the independent environment perform very similarly when tested in either environment). All mechanisms of the gated recurrence are important to achieve this higher-level inference since the networks deprived of either gating, lateral connections, or recurrent weight training did not show any effect, no matter the streak length (Figure 7c, bottom three plots; for every mechanism, there was a significant interaction effect between the removal of the mechanism and the training environment on the change in prediction over networks and streak lengths, all F(1,6076) > 43.2, all p < 0.001).

These results show that gated recurrent networks but not alternative networks leverage the higher level of structure where the change points of the latent probabilities are coupled.

Gated recurrence enables simple solutions

Finally, we highlight the small number of units sufficient to perform quasi-optimally in the increasingly structured environments that we tested: the above-mentioned results were obtained with 11 recurrent units. It turns out that gated recurrent networks can reach a similar performance with even fewer units, especially in simpler environments (Figure 8a and b, left plot). For instance, in the unigram environment, gated recurrent networks reach 99 % of their asymptotic performance with no more than 3 units.

Figure 8 with 1 supplement
Low-complexity solutions are uniquely enabled by the combination of gating, lateral connections, and recurrent weight training.

(a and b) Prediction performance of each network architecture in the changing unigram environment and the changing bigram environment, respectively, as a function of the number of recurrent units (i.e. space complexity) of the network. For each network architecture and each number of units, 20 networks were trained using hyperparameters that had been optimized prior to training, and prediction performance was measured as the % of optimal log likelihood on new test sequences. Solid lines, bands, and dashed lines show the mean, 95 % confidence interval of the mean, and maximum performance, respectively. At the maximum displayed number of units, all of the alternative architectures have exceeded the complexity of the 11-unit gated recurrent network shown on the left and in previous Figures, both in terms of the number of units and the number of trained parameters (indicated on the twin x-axes), but none of them have yet reached its performance.

By contrast, without either gating, lateral connections, or recurrent weight training, even when the networks are provided with more units to match the number of trained parameters in the 11-unit gated recurrent networks, they are unable to achieve similar performance (Figure 8a and b, right three plots, the twin x-axes indicate the number of units and trained parameters).

With an unlimited number of units, at least in the case without gating (i.e. a vanilla RNN, short for recurrent neural network), the networks could in principle achieve such performance since they are universal approximators of dynamical systems (Cybenko, 1989; Schäfer and Zimmermann, 2006). However, our results indicate that this could require a very large number of units even in the simplest environment tested here (see Figure 8a and b, without gating at 1000 units). Indeed, the slow growth of the vanilla RNNs’ performance with the number of units is well described by a power law of the form (100 − p) = c(1/N)^α, where p is the % of optimal performance and N is the number of units. We fitted this law in the unigram environment using the obtained performance from 2 to 45 units and it yielded a goodness-of-fit of R² = 92.4% (fitting was done by linear regression on the logarithms of N and (100 − p)). To further confirm the validity of the power law, we then extrapolated to 1000 units and found that the predicted performance was within 0.2 % of the obtained performance for networks of this size (predicted: 97.8%, obtained: 97.6%). Based on this power law, more than 10^4 units would be needed for the vanilla RNN to reach the performance exhibited by the GRU with only 11 units.
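The power-law fit amounts to a linear regression in log-log coordinates, as described above. A sketch (demonstrated on synthetic data, not the paper's measurements):

```python
import numpy as np

def fit_power_law(n_units, perf):
    """Fit (100 - p) = c * (1/N)^alpha by linear regression of
    log(100 - p) on log(1/N); returns c, alpha, and a predictor."""
    x = np.log(1.0 / np.asarray(n_units, dtype=float))
    y = np.log(100.0 - np.asarray(perf, dtype=float))
    alpha, log_c = np.polyfit(x, y, 1)  # slope = alpha, intercept = log c
    predict = lambda N: 100.0 - np.exp(log_c) * (1.0 / N) ** alpha
    return np.exp(log_c), alpha, predict
```

Extrapolation then consists of evaluating `predict` at a number of units outside the fitted range, as done for the 1000-unit check described in the text.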

Note that, in terms of computational complexity, the number of units is a fair measure of space complexity (i.e. the amount of memory) across the architectures we considered, since in all of them it is equal to the number of state variables (one state variable h_i per unit, see Materials and methods). What varies across architectures is the number of trained parameters, that is, the degrees of freedom that can be used during training to achieve different dynamics. Still, the conclusion remains the same when an alternative network exceeds the complexity of an 11-unit gated recurrent network in both its number of units and its number of trained parameters.

Therefore, it is the specific computational properties provided by the combination of the three mechanisms that afford effective low-complexity solutions.

Discussion

We have shown that the gated recurrent architecture enables simple and effective solutions: with only 11 units, the networks perform quasi-optimally in environments fraught with randomness, changes, and different levels of latent structure. Moreover, these solutions reproduce several aspects of optimality observed in organisms, including the adaptation of their effective learning rate, the ability to represent the precision of their estimation and to use it to weight their updates, and the ability to represent and leverage the latent structure of the environment. By depriving the architecture of one of its mechanisms, we have shown that three of them are important to achieve such solutions: gating, lateral connections, and the training of recurrent weights.

Can small neural networks behave like Bayesian agents?

A central and much-debated question in the scientific community is whether the brain can perform Bayesian inference (Knill and Pouget, 2004; Bowers and Davis, 2012; Griffiths et al., 2012; Rahnev and Denison, 2018; Lee and Mumford, 2003; Rao and Ballard, 1999; Sanborn and Chater, 2016; Chater et al., 2006; Findling et al., 2019; Wyart and Koechlin, 2016; Soltani and Izquierdo, 2019; Findling et al., 2021). From a computational viewpoint, there exists no tractable solution (even approximate) for Bayesian inference in an arbitrary environment, since it is NP-hard (Cooper, 1990; Dagum and Luby, 1993). Being a bounded agent (Simon, 1955; Simon, 1972), the brain cannot solve Bayesian inference in its most general form. The interesting question is whether the brain can perform Bayesian inference in some environments that occur in real life. More precisely, by ‘perform Bayesian inference’ one usually means that it performs computations that satisfy certain desirable properties of Bayesian inference, such as taking into account a certain type of uncertainty and a certain type of latent structure (Courville et al., 2006; Deroy et al., 2016; Griffiths et al., 2012; Knill and Pouget, 2004; Ma, 2010; Ma and Jazayeri, 2014; Tauber et al., 2017). In this study, we selected specific properties and showed that they can indeed be satisfied when using specific (not all) neural architectures.

In the changing unigram and changing bigram environments, our results provide an existence proof: there exist plausible solutions that are almost indistinguishable from Bayesian inference (i.e. the optimal solution). They exhibit qualitative properties of Bayesian inference that have been demonstrated in humans but are lacking in heuristic solutions, such as the dynamic adjustment of the effective learning rate (Behrens et al., 2007; Nassar et al., 2010; Nassar et al., 2012), the internal representation of latent variables and the precision of their estimates (Boldt et al., 2019; Meyniel et al., 2015), the precision-weighting of updates (McGuire et al., 2014; Nassar et al., 2010; Nassar et al., 2012), and the ability for higher-level inference (Bill et al., 2020; Heilbron and Meyniel, 2019; Purcell and Kiani, 2016).

The performance we obtained with the gated recurrent architecture is consistent with the numerous other successes it produced in other cognitive neuroscience tasks (Wang et al., 2018; Yang et al., 2019; Zhang et al., 2020). Our detailed study reveals that it offers quasi-optimal low-complexity solutions to new and difficult challenges, including those posed by bigram and higher-level structures and latent probabilities that change unpredictably anywhere in the unit interval. We acknowledge that further generalization to additional challenges remains to be investigated, including the use of more than two categories of observations or continuous observations, and latent structures with longer range dependencies (beyond bigram probabilities).

Minimal set of mechanisms

What are the essential mechanistic elements that enable such solutions? We show that it suffices to have recurrent units of computation equipped with three mechanisms: (1) input, self, and lateral connections which enable each unit to sum up the input with their own and other units’ prior value before a non-linear transformation is applied; (2) gating, which enables multiplicative interactions between activities at the summation step; (3) the training of connection weights.

One of the advantages of such mechanisms is their generic character: they do not include any components specifically designed to perform certain probabilistic operations or estimate certain types of latent variables, as often done in neuroscience (Echeveste et al., 2020; Fusi et al., 2007; Jazayeri and Movshon, 2006; Ma et al., 2006; Pecevski et al., 2011; Soltani and Wang, 2010). In addition, they allow adaptive behavior only through recurrent activity dynamics, without involving synaptic plasticity as in other models (Farashahi et al., 2017; Fusi et al., 2005; Iigaya, 2016; Schultz et al., 1997). This distinction has implications for the timescale of adaptation: in the brain, recurrent dynamics and synaptic plasticity often involve short and long timescales, respectively. Our study supports this view: recurrent dynamics allow the networks to quickly adapt to a given change in the environment (Figure 3), while synaptic plasticity allows the training process to tune the speed of this adaptation to the frequency of change of the environment (Figure 3—figure supplement 1).

Our findings suggest that these mechanisms are particularly advantageous to enable solutions with low computational complexity. Without one of them, it seems that a very large number of units (i.e. a large amount of memory) would be needed to achieve comparable performance (Figure 8) (note that universal approximation bounds in vanilla RNNs can be very large in terms of number of units [Barron, 1993; Cybenko, 1989; Schäfer and Zimmermann, 2006]). These mechanisms thus seem to be key computational building blocks to build simple and effective solutions. This efficiency can be formalized as the minimum number of units sufficient for near-optimal performance (as in Orhan and Ma, 2017 who made a similar argument), and it is important for the brain since the brain has limited computational resources (often quantified by the Shannon capacity, i.e. the number of bits that can be transmitted per unit of time, which here amounts to the number of units) (Bhui et al., 2021; Lieder and Griffiths, 2019). Moreover, simplicity promotes our understanding, and it is with the same goal of understanding that others have used model reduction in large networks (Dubreuil et al., 2020; Jazayeri and Ostojic, 2021; Schaeffer et al., 2020).

Since we cannot exhaustively test all possible parameter values, better solutions may exist that were not discovered during training. However, to maximize the chances that the best possible performance was achieved after training, we conducted an extensive hyperparameter optimization, repeated for each environment, architecture, and several numbers of units, until the Bayesian optimization yielded no further improvement (see Materials and methods).

Biological implementations of the mechanisms

What biological elements could implement the mechanisms of the gated recurrence? Recurrent connections are ubiquitous in the brain (Douglas and Martin, 2007; Hunt and Hayden, 2017); the lesser-known aspect is that of gating. In the next paragraph, we speculate on the possible biological implementations of gating, broadly defined as a mechanism that modulates the effective weight of a connection as a function of the network state (and not limited to the very specific form of gating of the GRU).

In neuroscience, many forms of gating have been observed, and they can generally be grouped into three categories according to the neural process that supports them: neural circuits, neural oscillations, and neuromodulation. In neural circuits, a specific pathway can be gated through inhibition/disinhibition by inhibitory (GABAergic) neurons. This has been observed in microscopic circuits, e.g. in pyramidal neurons a dendritic pathway can be gated by interneurons (Costa et al., 2017; Yang et al., 2016), or in macroscopic circuits, for example in basal ganglia-thalamo-cortical circuits a cortico-cortical pathway can be gated by the basal ganglia and the mediodorsal nucleus of thalamus (O’Reilly, 2006; O’Reilly and Frank, 2006; Rikhye et al., 2018; Wang and Halassa, 2021; Yamakawa, 2020). In addition to inhibition/disinhibition, an effective gating can also be achieved by a large population of interacting neurons taking advantage of their nonlinearity (Beiran et al., 2021; Dubreuil et al., 2020). Regarding neural oscillations, experiments have shown that activity in certain frequency bands (typically, alpha and beta) can gate behavioral and neuronal responses to the same stimulus (Baumgarten et al., 2016; Busch et al., 2009; Hipp et al., 2011; Iemi et al., 2019; Klimesch, 1999; Mathewson et al., 2009). One of the most influential accounts is known as ‘pulsed inhibition’ (Hahn et al., 2019; Jensen and Mazaheri, 2010; Klimesch et al., 2007): a low-frequency signal periodically inhibits a high-frequency signal, effectively silencing the high-frequency signal when the low-frequency signal exceeds a certain threshold. Finally, the binding of certain neuromodulators to certain receptors of a synapse changes the gain of its input-output transfer function, thus changing its effective weight. This has been demonstrated in neurophysiological studies implicating noradrenaline (Aston-Jones and Cohen, 2005; Salgado et al., 2016; Servan-Schreiber et al., 1990), dopamine (Moyer et al., 2007; Servan-Schreiber et al., 1990; Stalter et al., 2020; Thurley et al., 2008), and acetylcholine (Gil et al., 1997; Herrero et al., 2008) (see review in Thiele and Bellgrove, 2018).

We claim that gated recurrence provides plausible solutions for the brain because its mechanisms can all be biologically implemented and lead to efficient solutions. However, since each mechanism admits multiple biological implementations, the mapping between artificial units and biological neurons is not straightforward: one unit may map to a large population of neurons (e.g. a brain area), or even to a microscopic, subneuronal component (e.g. at the dendritic level).

Training: Its role and possible biological counterpart

Regarding the training, our results highlight that it is important to adjust the recurrent weights and thus the network dynamics to the environment (and not fix them as in reservoir computing [Tanaka et al., 2019]), but we make no claims about the biological process that leads to such adjustment in brains. It could occur during development (Sherman et al., 2020), the life span (Lillicrap et al., 2020), or the evolution process (Zador, 2019) (these possibilities are not mutually exclusive). Although our training procedure may not be accurate for biology as a whole, two aspects of it may be informative for future research. First, it relies only on the observation sequence (no supervision or reinforcement), leveraging prediction error signals, which have been found in the brain in many studies (den Ouden et al., 2012; Eshel et al., 2013; Maheu et al., 2019). Importantly, in predictive coding (Rao and Ballard, 1999), the computation of prediction errors is part of the prediction process; here we are suggesting that it may also be part of the training process (as argued in O’Reilly et al., 2021). Second, relatively few iterations of training suffice (Figure 8—figure supplement 1, in the order of 10–100; for comparison, Wang et al., 2018 reported training for 40,000 episodes in an environment similar to ours).

Suboptimalities in human behavior

In this study we have focused on some aspects of optimality that humans exhibit in the three environments we explored, but several aspects of their behavior are also suboptimal. In the laboratory, their behavior is often at best qualitatively Bayesian but quantitatively suboptimal. For example, although they adjust their effective learning rate to changes, the base value of their learning rate and their dynamic adjustments may depart from the optimal values (Nassar et al., 2010; Nassar et al., 2012; Prat-Carrabin et al., 2021). They may also not update their prediction on every trial, unlike the optimal solution (Gallistel et al., 2014; Khaw et al., 2017). Finally, there is substantial interindividual variability which does not exist in the optimal solution (Khaw et al., 2021; Nassar et al., 2010; Nassar et al., 2012; Prat-Carrabin et al., 2021). In the future, these suboptimalities could be explored using our networks by making them suboptimal in three ways (among others): by stopping training before quasi-optimal performance is reached (Caucheteux and King, 2021; Orhan and Ma, 2017), by constraining the size of the network or its weights (with hard constraints or with regularization penalties) (Mastrogiuseppe and Ostojic, 2017; Sussillo et al., 2015), or by altering the network in a certain way, such as pruning some of the units or some of the connections (Blalock et al., 2020; Chechik et al., 1999; LeCun et al., 1990; Srivastava et al., 2014), or introducing random noise into the activity (Findling et al., 2021; Findling and Wyart, 2020; Legenstein and Maass, 2014). In this way, one could perhaps reproduce the quantitative deviations from optimality while preserving the qualitative aspects of optimality observed in the laboratory.

Implications for experimentalists

If already trained gated recurrent networks exist in the brain, then one can be used in a new but similar enough environment without further training. This is an interesting possibility because, in laboratory experiments mirroring our study, humans perform reasonably well with almost no training but explicit task instructions given in natural language, along with a wealth of prior experience (Gallistel et al., 2014; Heilbron and Meyniel, 2019; Khaw et al., 2021; Meyniel et al., 2015; Peterson and Beach, 1967). In favor of the possibility to reuse an existing solution, we found that a gated recurrent network can still perform well in conditions different from those it was trained in: across probabilities of change points (Figure 3—figure supplement 1) and latent structures (Figure 6—figure supplement 1, from bigram to unigram).

In this study, we adopted a self-supervised training paradigm to see if the networks could in principle discover the latent structure from the sequences of observations alone. However, in laboratory experiments, humans often do not have to discover the structure since they are explicitly told what structure they will face, and the experiment starts only after ensuring that they have understood it; this precludes a comparison with our networks in terms of training in this setting (see a similar argument in Orhan and Ma, 2017). In the future, it could be interesting to study the ability of gated recurrent networks to switch from one structure to another after having been informed of the current structure as humans do in these experiments. One possible way would be to give a label that indicates the current structure as additional input to our networks, as in Yang et al., 2019.

One of our findings may be particularly interesting to experimentalists: in a gated recurrent network, the representations of latent probabilities and the precision of these probability estimates (sometimes referred to as confidence [Boldt et al., 2019; Meyniel et al., 2015], estimation uncertainty [McGuire et al., 2014; Payzan-LeNestour et al., 2013], or epistemic uncertainty [Amini et al., 2020; Friston et al., 2015; Pezzulo et al., 2015]) are linearly readable from recurrent activity, the form of decoding most frequently used in neuroscience (Haxby et al., 2014; Kriegeskorte and Diedrichsen, 2019). These representations arise spontaneously, and their emergence seems to come from the computational properties of gated recurrence together with the need to perform well in a stochastic and changing environment. This yields an empirical prediction: if such networks can be found in the brain, then latent probability estimates and their precision should also be decodable in brain signals, as already found in some studies (Bach et al., 2011; McGuire et al., 2014; Meyniel, 2020; Meyniel and Dehaene, 2017; Payzan-LeNestour et al., 2013; Tomov et al., 2020).

Materials and methods

Sequence prediction problem

Request a detailed protocol

The sequence prediction problem to be solved is the following. At each time step, an agent receives as input a binary-valued 'observation', $x_t \in \{0,1\}$, and gives as output a real-valued 'prediction', $p_t \in [0,1]$, which is an estimate of the probability that the value of the next observation is equal to 1, $p(x_{t+1}=1)$. Coding the prediction in terms of the observation being 1 rather than 0 is inconsequential since one can be deduced from the other: $p(x_{t+1}=1) = 1 - p(x_{t+1}=0)$. The agent’s objective is to make predictions that maximize the (log) likelihood of the observations in the sequence, which technically corresponds to the negative of the binary cross-entropy cost function:

(1) $\mathcal{L}(p;x) = \sum_{t=0}^{T-1} \log\left[x_{t+1}\, p_t + (1 - x_{t+1})(1 - p_t)\right]$
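For concreteness, Equation 1 can be sketched in a few lines of NumPy (an illustrative reimplementation, not the authors' code; the function name is ours):

```python
import numpy as np

def log_likelihood(predictions, observations):
    """Equation 1: log likelihood of the observations under the predictions.

    predictions[t] is p_t, the estimate of p(x_{t+1} = 1);
    observations[t] is x_{t+1}, the observation that p_t predicts.
    """
    p = np.asarray(predictions, dtype=float)
    x = np.asarray(observations, dtype=float)
    # Each term is the probability the agent assigned to the observed value
    return np.sum(np.log(x * p + (1 - x) * (1 - p)))
```

Training minimizes the negative of this quantity (the binary cross-entropy).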

Network architectures

All network architectures consist of a binary input unit, which codes for the current observation, one recurrent layer (sometimes called hidden layer) with a number $N$ of recurrent units, and an output unit, which represents the network’s prediction. Unless otherwise stated, $N = 11$. At every time step, the recurrent unit $i$ receives as input the value of the observation, $x_t$, and the previous activation values of the recurrent units $j$ that connect to $i$, $h_{j,t-1}$. It produces as output a new activation value, $h_{i,t}$, which is a real number. The output unit receives as input the activations of all of the recurrent units and produces as output the prediction $p_t$.

The parameterized function of the output unit is the same for all network architectures:

$p_t = \sigma\!\left(\sum_{i=1}^{N} w_{hp,i}\, h_{i,t} + b_{hp}\right)$

where $\sigma$ is the logistic sigmoid, $w_{hp,i}$ is the weight parameter of the connection from the $i$-th recurrent unit to the output unit, and $b_{hp}$ is the bias parameter of the output unit.

The updating of $h_i$ takes a different form depending on whether gating or lateral connections are included, as described below.

Gated recurrent network

Request a detailed protocol

A gated recurrent network includes both gating and lateral connections. This enables multiplicative interactions between the input and recurrent activity, as well as between the activities of different recurrent units, during the updating of $h_i$. The variant of gating used here is the GRU (Cho et al., 2014; Chung et al., 2014). For convenience of exposition, we introduce, for each recurrent unit $i$, two intermediate variables in the calculation of the update: the reset gate $r_i$ and the update gate $z_i$, each with its own set of weights and biases. The update gate controls the extent to which a unit can change its value from one time step to the next, and the reset gate controls the balance between recurrent activity and input activity in case of update. Note that $r_i$ and $z_i$ do not count as state variables, since the system would be equivalently characterized without them by injecting their expressions into the update equation of $h_i$ below. The update is calculated as follows:

$r_{i,t+1} = \sigma\!\left(w_{xr,i}\, x_{t+1} + b_{xr,i} + w_{hr,ii}\, h_{i,t} + \sum_{j \neq i} w_{hr,ji}\, h_{j,t} + b_{hr,i}\right)$

$z_{i,t+1} = \sigma\!\left(w_{xz,i}\, x_{t+1} + b_{xz,i} + w_{hz,ii}\, h_{i,t} + \sum_{j \neq i} w_{hz,ji}\, h_{j,t} + b_{hz,i}\right)$

$h_{i,t+1} = z_{i,t+1}\, h_{i,t} + (1 - z_{i,t+1}) \tanh\!\left[w_{xh,i}\, x_{t+1} + b_{xh,i} + r_{i,t+1}\left(w_{hh,ii}\, h_{i,t} + \sum_{j \neq i} w_{hh,ji}\, h_{j,t}\right) + b_{hh,i}\right]$

$h_{i,t=-1} = 0$

where $(w_{xr,i}, b_{xr,i}, w_{hr,ji}, b_{hr,i})$, $(w_{xz,i}, b_{xz,i}, w_{hz,ji}, b_{hz,i})$, and $(w_{xh,i}, b_{xh,i}, w_{hh,ji}, b_{hh,i})$ are the connection weights and biases from the input unit and the recurrent units to unit $i$ corresponding to the reset gate, the update gate, and the ungated new activity, respectively.
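One update of these equations can be sketched as follows (an illustrative NumPy version with our own parameter layout, not the layout of the authors' implementation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, params):
    """One update of the gated recurrent layer (the GRU variant used here).

    x: scalar observation in {0, 1}; h: activations of the N recurrent units.
    params: dict of weights/biases whose names mirror the equations above
    (w_x* are input weights, W_h* are NxN recurrent weight matrices).
    """
    p = params
    r = sigmoid(p["w_xr"] * x + p["b_xr"] + p["W_hr"] @ h + p["b_hr"])  # reset gate
    z = sigmoid(p["w_xz"] * x + p["b_xz"] + p["W_hz"] @ h + p["b_hz"])  # update gate
    h_new = np.tanh(p["w_xh"] * x + p["b_xh"] + r * (p["W_hh"] @ h) + p["b_hh"])
    # Convex mixture of old and new activity, controlled by the update gate
    return z * h + (1 - z) * h_new
```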

Another variant of gating is the LSTM (Hochreiter and Schmidhuber, 1997). It incorporates gating mechanisms similar to those of the GRU and can achieve the same performance on our task. We chose the GRU because it is simpler than the LSTM and proved sufficient.

Without gating

Request a detailed protocol

Removing the gating mechanism from the gated recurrent network is equivalent to setting the above variables $r_i$ equal to 1 and $z_i$ equal to 0. This simplifies the calculation of the activations to a single equation, which boils down to a weighted sum of the input and the recurrent units’ activity passed through a non-linearity, as follows:

$h_{i,t+1} = \tanh\!\left[w_{xh,i}\, x_{t+1} + b_{xh,i} + w_{hh,ii}\, h_{i,t} + \sum_{j \neq i} w_{hh,ji}\, h_{j,t} + b_{hh,i}\right]$

Another possibility (not considered here) would be to set the value of $z_i$ to a constant other than 0 and treat this value (which amounts to a time constant) as a hyperparameter.

Without lateral connections

Request a detailed protocol

Removing lateral connections from the gated recurrent network is equivalent to setting the weights $w_{hr,ji}$, $w_{hz,ji}$, and $w_{hh,ji}$ to 0 for all $j \neq i$. This abolishes the possibility of interaction between recurrent units, which simplifies the calculation of the activations as follows:

$r_{i,t+1} = \sigma\!\left(w_{xr,i}\, x_{t+1} + b_{xr,i} + w_{hr,ii}\, h_{i,t} + b_{hr,i}\right)$

$z_{i,t+1} = \sigma\!\left(w_{xz,i}\, x_{t+1} + b_{xz,i} + w_{hz,ii}\, h_{i,t} + b_{hz,i}\right)$

$h_{i,t+1} = z_{i,t+1}\, h_{i,t} + (1 - z_{i,t+1}) \tanh\!\left[w_{xh,i}\, x_{t+1} + b_{xh,i} + r_{i,t+1}\, w_{hh,ii}\, h_{i,t} + b_{hh,i}\right]$

Note that this architecture still contains gating. We could have tested a simpler architecture with neither lateral connections nor gating; however, our point is to demonstrate the specific importance of lateral connections for solving the problem of interest with few units, and the result is all the more convincing if the network lacking lateral connections retains gating (without gating, it would fail even more dramatically).

Without recurrent weight training

Request a detailed protocol

The networks referred to as ‘without recurrent weight training’ have the same architecture as the gated recurrent networks and differ from them only in the way they are trained. Whereas in the other networks all of the weights and bias parameters are trained, in these networks only the weights and bias of the output unit, $w_{hp,i}$ and $b_{hp}$, are trained; the other weights and biases are fixed to the values drawn at initialization.

Environments

An environment is characterized by its data generating process, that is, the stochastic process used to generate a sequence of observations in that environment. Each of the generative processes is described by a graphical model in Figure 1—figure supplement 1 and further detailed below.

Changing unigram environment

Request a detailed protocol

In the changing unigram environment, at each time step, one observation is drawn from a Bernoulli distribution whose probability parameter is the latent variable $p^{\text{env}}_t$. The evolution of this latent variable is described by the following stochastic process.

  • Initially, $p^{\text{env}}_{t=0}$ is drawn from a uniform distribution on $[0,1]$.

  • At the next time step, with probability $p_c$, $p^{\text{env}}_{t+1}$ is drawn anew from a uniform distribution on $[0,1]$ (this event is called a 'change point'); otherwise, $p^{\text{env}}_{t+1}$ remains equal to $p^{\text{env}}_t$. The change point probability $p_c$ is fixed in a given environment.
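A minimal sampler for this generative process might look as follows (illustrative code; the function name and interface are ours):

```python
import numpy as np

def generate_unigram_sequence(T=380, p_c=1/75, seed=None):
    """Sample a sequence from the changing unigram environment.

    Returns the observations x[0..T-1] and the latent p_env trajectory.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(T, dtype=int)
    p_env = np.empty(T)
    p = rng.uniform()  # initial Bernoulli parameter, uniform on [0, 1]
    for t in range(T):
        if t > 0 and rng.uniform() < p_c:
            p = rng.uniform()  # change point: redraw the latent probability
        p_env[t] = p
        x[t] = rng.uniform() < p  # Bernoulli draw
    return x, p_env
```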

Changing bigram environments

Request a detailed protocol

In the changing bigram environments, at each time step, one observation is drawn from a Bernoulli distribution whose probability parameter is either the latent variable $p^{\text{env}}_{1|1,t}$, if the previous observation was equal to 1, or $(1 - p^{\text{env}}_{0|0,t})$ otherwise (at $t = 0$, the previous observation is considered to be equal to 0). The evolution of those latent variables is described by a stochastic process which differs depending on whether the change points are independent or coupled.

  • In both cases, initially, $p^{\text{env}}_{0|0,t=0}$ and $p^{\text{env}}_{1|1,t=0}$ are drawn independently from a uniform distribution on $[0,1]$.

  • In the case of independent change points, at the next time step, with probability $p_c$, $p^{\text{env}}_{0|0,t+1}$ is drawn anew from a uniform distribution on $[0,1]$; otherwise, it remains equal to $p^{\text{env}}_{0|0,t}$. Similarly, $p^{\text{env}}_{1|1,t+1}$ is either drawn anew with probability $p_c$ or remains equal to $p^{\text{env}}_{1|1,t}$ otherwise, and critically, the occurrence of a change point in $p^{\text{env}}_{1|1}$ is independent of the occurrence of a change point in $p^{\text{env}}_{0|0}$.

  • In the case of coupled change points, at the next time step, with probability $p_c$, $p^{\text{env}}_{0|0,t+1}$ and $p^{\text{env}}_{1|1,t+1}$ are both drawn anew and independently from a uniform distribution on $[0,1]$; otherwise, both remain equal to $p^{\text{env}}_{0|0,t}$ and $p^{\text{env}}_{1|1,t}$, respectively.

The changing bigram environment with independent change points and that with coupled change points constitute two distinct environments. When the type of change points is not explicitly mentioned, the default case is independent change points. For conciseness, we sometimes refer to the changing unigram and changing bigram environments simply as ‘unigram’ and ‘bigram’ environments.
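The bigram generative processes can be sketched similarly (illustrative code; the `coupled` flag switches between the two environments, and since each bigram probability is the probability of repeating the previous observation, the draw reduces to "repeat with probability p[prev]"):

```python
import numpy as np

def generate_bigram_sequence(T=380, p_c=1/75, coupled=False, seed=None):
    """Sample observations from a changing bigram environment.

    coupled=True redraws both bigram probabilities at the same change points;
    coupled=False gives each its own independent change points.
    """
    rng = np.random.default_rng(seed)
    p = {0: rng.uniform(), 1: rng.uniform()}  # p[0] = p_{0|0}, p[1] = p_{1|1}
    x = np.empty(T, dtype=int)
    prev = 0  # at t = 0 the previous observation is taken to be 0
    for t in range(T):
        if t > 0:
            if coupled:
                if rng.uniform() < p_c:  # one change point redraws both
                    p = {0: rng.uniform(), 1: rng.uniform()}
            else:
                for k in (0, 1):  # each bigram has its own change points
                    if rng.uniform() < p_c:
                        p[k] = rng.uniform()
        # p_{1|1} is p(repeat | prev=1) and p_{0|0} is p(repeat | prev=0)
        x[t] = prev if rng.uniform() < p[prev] else 1 - prev
        prev = x[t]
    return x
```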

In all environments, unless otherwise stated, the length of a sequence is $T = 380$ observations and the change point probability is $p_c = 1/75$, as in previous experiments done with human participants (Heilbron and Meyniel, 2019; Meyniel et al., 2015).

Optimal solution

Request a detailed protocol

For a given environment among the three possibilities defined above, the optimal solution to the prediction problem can be determined as detailed in Heilbron and Meyniel, 2019. This solution consists in inverting the data-generating process of the environment using Bayesian inference, that is, computing the posterior probability distribution over the values of the latent variables given the history of observation values, and then marginalizing over that distribution to compute the prediction (which is the probability of the next observation given the history of observations). This can be done using a hidden Markov model formulation of the data-generating process where the hidden state includes the values of the latent variables as well as the previous observation in the bigram case, and using the forward algorithm to compute the posterior distribution over the hidden state. Because it would be impossible to compute the probabilities for the infinitely many possible values of the latent variables in the continuous interval [0,1], we discretized the interval into 20 equal-width bins for each of the latent variables. For a more exhaustive treatment, see Heilbron and Meyniel, 2019 and the online code (https://github.com/florentmeyniel/TransitionProbModel).
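As a sketch of this approach in the unigram case, the discretized forward algorithm can be written as follows (an illustrative simplification of the cited method, not the authors' code; the full treatment, including the bigram case, is in the cited repository):

```python
import numpy as np

def optimal_unigram_predictions(x, p_c=1/75, n_bins=20):
    """Optimal predictions in the changing unigram environment, via the
    forward algorithm on a discretized latent probability."""
    grid = (np.arange(n_bins) + 0.5) / n_bins  # bin centers on [0, 1]
    posterior = np.full(n_bins, 1.0 / n_bins)  # uniform prior over p_env
    predictions = []
    for obs in x:
        # Likelihood of the observation under each candidate value of p_env
        lik = grid if obs == 1 else 1 - grid
        posterior = posterior * lik
        posterior /= posterior.sum()
        # Transition: a change point redraws p_env uniformly
        posterior = (1 - p_c) * posterior + p_c / n_bins
        # Marginalize over the posterior to predict p(next observation = 1)
        predictions.append(np.dot(posterior, grid))
    return np.array(predictions)
```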

Heuristic solutions

Request a detailed protocol

The four heuristic solutions used here can be classified into 2 × 2 groups depending on:

  • which kind of variables are estimated: a unigram probability or two bigram probabilities.

  • which heuristic rule is used in the calculation of the estimates: the delta-rule or the leaky rule.

The equations used to calculate the estimates are provided below.

  • Unigram, delta-rule:

  • $\hat{p}_{t+1} = \hat{p}_t + \alpha\,(x_{t+1} - \hat{p}_t), \qquad \hat{p}_{t=-1} = 0.5$

  • Unigram, leaky rule:

  • $n_{0,t+1} = \alpha\, n_{0,t} + (1 - x_{t+1}), \qquad n_{1,t+1} = \alpha\, n_{1,t} + x_{t+1}, \qquad n_{0,t=-1} = n_{1,t=-1} = 0, \qquad \hat{p}_t = \dfrac{n_{1,t} + 1}{n_{1,t} + n_{0,t} + 2}$

  • Bigrams, delta-rule:

  • $\hat{p}_{0|0,t+1} = \hat{p}_{0|0,t} + \alpha\,(1 - x_t)\left[(1 - x_{t+1}) - \hat{p}_{0|0,t}\right], \qquad \hat{p}_{1|1,t+1} = \hat{p}_{1|1,t} + \alpha\, x_t\,(x_{t+1} - \hat{p}_{1|1,t}), \qquad \hat{p}_{0|0,t=-1} = \hat{p}_{1|1,t=-1} = 0.5$

  • Bigrams, leaky rule:

  • $n_{0|0,t+1} = \alpha\, n_{0|0,t} + (1 - x_t)(1 - x_{t+1}), \quad n_{1|0,t+1} = \alpha\, n_{1|0,t} + (1 - x_t)\, x_{t+1}, \quad n_{0|1,t+1} = \alpha\, n_{0|1,t} + x_t\,(1 - x_{t+1}), \quad n_{1|1,t+1} = \alpha\, n_{1|1,t} + x_t\, x_{t+1}$, with $n_{0|0,t=-1} = n_{1|0,t=-1} = n_{0|1,t=-1} = n_{1|1,t=-1} = 0$, and $\hat{p}_{0|0,t} = \dfrac{n_{0|0,t} + 1}{n_{0|0,t} + n_{1|0,t} + 2}, \quad \hat{p}_{1|1,t} = \dfrac{n_{1|1,t} + 1}{n_{1|1,t} + n_{0|1,t} + 2}$

The delta-rule corresponds to the update rule of the Rescorla-Wagner model (Rescorla and Wagner, 1972). The leaky rule corresponds to the mean of an approximate posterior, a Beta distribution whose parameters depend on the leaky counts of observations: $n_1 + 1$ and $n_0 + 1$ (see Meyniel et al., 2016 for more details).

The output prediction value is equal to $\hat{p}_t$ in the unigram case and, in the bigram case, to $\hat{p}_{1|1,t}$ if $x_t = 1$ and $(1 - \hat{p}_{0|0,t})$ otherwise. The parameter $\alpha$ is free; it is fitted on the same training data as the networks and thus adjusted to the training environment.
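The two unigram heuristics can be sketched as follows (illustrative code; the bigram versions apply the same rules per bigram):

```python
import numpy as np

def delta_rule_predictions(x, alpha):
    """Unigram delta-rule: p_hat <- p_hat + alpha * (x - p_hat)."""
    p_hat, out = 0.5, []
    for obs in x:
        p_hat = p_hat + alpha * (obs - p_hat)
        out.append(p_hat)
    return np.array(out)

def leaky_rule_predictions(x, alpha):
    """Unigram leaky rule: predictions from leaky counts of 0s and 1s."""
    n0 = n1 = 0.0
    out = []
    for obs in x:
        n0 = alpha * n0 + (1 - obs)
        n1 = alpha * n1 + obs
        # Mean of the approximate Beta posterior with parameters n1+1, n0+1
        out.append((n1 + 1) / (n1 + n0 + 2))
    return np.array(out)
```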

Training

Request a detailed protocol

For a given environment and a given type of agent among the network types and heuristic types, all the reported results are based on 20 agents, each sharing the same set of hyperparameters and initialized with a different random seed. During training, the parameters of a given agent were adjusted to minimize the binary cross-entropy cost function (see Equation 1). During one iteration of training, the gradients of the cost function with respect to the parameters are computed on a subset of the training data (called a minibatch) using backpropagation through time and are used to update the parameters according to the selected training algorithm. The training algorithm was Adam (Kingma and Ba, 2015) for the network types and stochastic gradient descent for the heuristic types.

For the unigram environment, the analyses reported in Figures 2–5 were conducted after training on a common training dataset of 160 minibatches of 20 sequences. For each of the two bigram environments, the analyses reported in Figures 6 and 7 were conducted after training on a common training dataset (one per environment) of 400 minibatches of 20 sequences. These sizes were sufficient for the validation performance to converge before the end of training for all types of agents.

Parameters initialization

Request a detailed protocol

For all of the networks, the bias parameters are randomly initialized from a uniform distribution on $[-1/N, +1/N]$, and the weights $w_{hp,i}$ are randomly initialized from a normal distribution with mean 0 and standard deviation $1/N$. For all the networks, the weights $w_{xr,i}$, $w_{xz,i}$, $w_{xh,i}$ are randomly initialized from a normal distribution with mean 0 and standard deviation $\sigma_{0,x}$, and the weights $w_{hr,ji}$, $w_{hz,ji}$, $w_{hh,ji}$ are randomly initialized from a normal distribution with standard deviation $\sigma_{0,h}$ and mean 0 for all $j \neq i$, and mean $\mu_{0,h,ii}$ for $j = i$. $\sigma_{0,x}$, $\sigma_{0,h}$, and $\mu_{0,h,ii}$ are hyperparameters that were optimized for a given environment, type of network, and number of units, as detailed in the hyperparameter optimization section (the values resulting from this optimization are listed in Table 1).

Table 1
Selected hyperparameter values after optimization.

(*: fixed value.)

| Environment | Network architecture | $N$ | $\eta_0$ | $\sigma_{0,x}$ | $\sigma_{0,h}$ | $\mu_{0,h,ii}$ |
| --- | --- | --- | --- | --- | --- | --- |
| unigram | gated recurrent network | 3 | 8.00E-02 | 0.02 | 0.02 | 0* |
| unigram | gated recurrent network | 11 | 6.60E-02 | 0.43 | 0.21 | 0* |
| unigram | gated recurrent network | 45 | 4.20E-02 | 1 | 0.02 | 0* |
| unigram | without gating | 3 | 2.50E-02 | 1 | 0.07 | 0* |
| unigram | without gating | 11 | 1.70E-02 | 1 | 0.07 | 0* |
| unigram | without gating | 45 | 7.60E-03 | 1 | 0.08 | 0* |
| unigram | without gating | 1000 | 1.34E-04 | 1 | 0.04 | 0* |
| unigram | without lateral connections | 3 | 5.30E-02 | 0.02 | 0.02 | 1 |
| unigram | without lateral connections | 11 | 2.70E-02 | 1 | 0.02 | 1 |
| unigram | without lateral connections | 45 | 1.30E-02 | 1 | 1 | 1 |
| unigram | without recurrent weight training | 3 | 1.00E-01 | 1.07 | 0.55 | 0* |
| unigram | without recurrent weight training | 11 | 1.00E-01 | 2 | 0.41 | 0* |
| unigram | without recurrent weight training | 45 | 1.00E-01 | 2 | 0.26 | 0* |
| unigram | without recurrent weight training | 474 | 9.60E-03 | 1 | 0.1 | 0* |
| bigram | gated recurrent network | 3 | 6.30E-02 | 0.02 | 1 | 0* |
| bigram | gated recurrent network | 11 | 4.40E-02 | 1 | 0.02 | 0* |
| bigram | gated recurrent network | 45 | 1.60E-02 | 1 | 0.02 | 0* |
| bigram | without gating | 3 | 5.50E-02 | 0.02 | 0.13 | 0* |
| bigram | without gating | 11 | 3.20E-02 | 1 | 0.05 | 0* |
| bigram | without gating | 45 | 8.90E-03 | 1 | 0.06 | 0* |
| bigram | without gating | 1000 | 5.97E-05 | 1 | 0.03 | 0* |
| bigram | without lateral connections | 3 | 4.30E-02 | 1 | 0.02 | 0 |
| bigram | without lateral connections | 11 | 4.30E-02 | 1 | 1 | 0 |
| bigram | without lateral connections | 45 | 2.80E-02 | 1 | 1 | 0 |
| bigram | without recurrent weight training | 3 | 6.60E-02 | 0.73 | 0.55 | 0* |
| bigram | without recurrent weight training | 11 | 1.00E-01 | 2 | 0.45 | 0* |

For the initialization of the parameter $\alpha$ in the heuristic solutions, a random value $r$ is drawn from a log-uniform distribution on the interval $[10^{-2.5}, 10^{-0.5}]$, and the initial value of $\alpha$ is set to $r$ in the delta-rule case or $\exp(-r)$ in the leaky rule case.

Hyperparameter optimization

Request a detailed protocol

Each type of agent had a specific set of hyperparameters to be optimized. For all network types, it included the initial learning rate of Adam, $\eta_0$, and the initialization hyperparameters $\sigma_{0,x}$ and $\sigma_{0,h}$. For the networks without lateral connections specifically, it also included $\mu_{0,h,ii}$ (for those networks, setting it close to 1 can help avoid the vanishing gradient problem during training; Bengio et al., 1994; Sutskever et al., 2013); for the other networks, $\mu_{0,h,ii}$ was set to 0. For the heuristic types, it included only the learning rate of the stochastic gradient descent. A unique set of hyperparameter values was determined for each type of agent, each environment, and, for the network types, each number of units, through the optimization described next.

We used Bayesian optimization (Agnihotri and Batra, 2020) with Gaussian processes and the upper confidence bound acquisition function to identify the best hyperparameters for each network architecture, environment, and number of units. During the optimization, combinations of hyperparameter values were iteratively sampled, each evaluated over 10 trials with different random seeds, for a total of 60 iterations (hence, 600 trials) for a given architecture, environment, and number of units. In each trial, one network was created, trained, and its cross-entropy was measured on independent test data. The training and test datasets used for the hyperparameter optimization procedure were not used in any other analyses. The training datasets contained respectively 160 and 400 minibatches of 20 sequences for the unigram and the bigram environment; the test datasets contained 200 sequences for each environment. We selected the combination of hyperparameter values corresponding to the iteration that led to the lowest mean test cross-entropy over the 10 trials. The selected values are listed in Table 1.

For the heuristic types, we used random search from a log-uniform distribution on the $[10^{-6}, 10^{-1}]$ range over 80 trials to determine the optimal learning rate of the stochastic gradient descent. This led to selecting the value $3 \times 10^{-3}$ for all heuristic types and all three environments.

Performance analyses

Request a detailed protocol

All agents were tested in the environment they were trained in (except for Figure 6—figure supplement 1, which tests cross-environment performance). We used a single test dataset per environment of 1000 sequences, independent of the training dataset. The log likelihood $\mathcal{L}$ of a given agent was measured from its predictions according to Equation 1. The optimal log likelihood $\mathcal{L}_{\text{optimal}}$ was measured from the predictions of the optimal solution for the given environment. The chance log likelihood $\mathcal{L}_{\text{chance}}$ was measured using a constant prediction of 0.5. To facilitate the interpretation of the results, the prediction performance of the agent was expressed as the % of optimal log likelihood, defined as:

$\dfrac{\mathcal{L} - \mathcal{L}_{\text{chance}}}{\mathcal{L}_{\text{optimal}} - \mathcal{L}_{\text{chance}}} \times 100$

To test the statistical significance of a comparison of performance between two types of agents, we used a two-tailed two independent samples t-test with Welch’s correction for unequal variances.

Analysis of the effective learning rate

Request a detailed protocol

The instantaneous effective learning rate of an agent that updates its prediction from $p_t$ to $p_{t+1}$ upon receiving observation $x_{t+1}$ is calculated as:

(2) $\alpha_{t+1} = \dfrac{p_{t+1} - p_t}{x_{t+1} - p_t}, \qquad \alpha_{t=0} = \dfrac{p_0 - 0.5}{x_0 - 0.5}$

We call it ‘effective learning rate’ because, had the agent been using a delta-rule algorithm, it would be equivalent to the learning rate of the delta-rule (as can be seen by rearranging the above formula into an update equation), and because it can be measured even if the agent uses another algorithm.
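Equation 2 can be computed over a whole sequence as follows (illustrative code; note that the ratio is undefined when the previous prediction already equals the observation):

```python
import numpy as np

def effective_learning_rate(predictions, observations):
    """Equation 2: alpha_t = (p_t - p_prev) / (x_t - p_prev), where p_prev is
    the prediction before each update (0.5 before the first observation)."""
    p = np.asarray(predictions, dtype=float)
    x = np.asarray(observations, dtype=float)
    p_prev = np.concatenate(([0.5], p[:-1]))
    # Division by zero can occur if a prediction exactly equals the observation
    return (p - p_prev) / (x - p_prev)
```

As a sanity check, applying this measure to a delta-rule agent recovers the delta-rule's own learning rate at every time step.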

Readout analyses

Request a detailed protocol

The readout of a given quantity from the recurrent units of a network consists of a weighted sum of the activation values of each unit. To determine the weights of the readout for a given network, we ran a multiple linear regression using as input variables the activation of each recurrent unit at a given time step, $h_{i,t}$, and as target variable the desired quantity calculated at the same time step. The regression was run on a training dataset of 900 sequences of 380 observations each (hence, 342,000 samples).
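Such a readout can be fit by ordinary least squares (an illustrative sketch using NumPy's `lstsq`; the exact regression implementation used in the study is not specified here):

```python
import numpy as np

def fit_linear_readout(activations, target):
    """Fit readout weights by least squares: target ~ activations @ w + b.

    activations: (n_samples, N) recurrent unit activities;
    target: (n_samples,) quantity to read out (e.g. log precision).
    Returns the weight vector w and intercept b.
    """
    # Append a column of ones to estimate the intercept jointly
    X = np.column_stack([activations, np.ones(len(activations))])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef[:-1], coef[-1]
```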

In the unigram environment, the precision readout was obtained using as desired quantity the log precision of the posterior distribution over the unigram variable calculated by the optimal solution as previously described, that is, $\psi_t = -\log \sigma_t$, where $\sigma_t$ is the standard deviation of the posterior distribution over $p^{\text{env}}_{t+1}$:

(3) $\sigma_t = \mathrm{SD}\!\left[\, p^{\text{env}}_{t+1} \mid x_0, \ldots, x_t \right]$

In the bigram environment, the readout of the estimate of a given bigram variable was obtained using as desired quantity the log odds of the mean of the posterior distribution over that bigram variable calculated by the optimal solution, and the readout of the precision of that estimate was obtained using the log precision of that same posterior under the above definition of precision.

In Figure 4a, to measure the accuracy of the readout from a given network, we calculated the Pearson correlation between the quantity read from the network and the optimal quantity on a test dataset of 100 sequences (hence, 38,000 samples), independent from any training dataset. To measure the Pearson correlation between the read precision and the subsequent effective learning rate, we used 300 out-of-sample sequences (hence, 114,000 samples). To measure the mutual information between the read precision and the prediction of the network, we also used 300 out-of-sample sequences (114,000 samples).

In Figure 6d, the log odds and log precision were transformed back into mean and standard deviation for visualization purposes.

Dynamics of network activity in the prediction-precision subspace

Request a detailed protocol

In Figure 4b, the network activity (i.e. the population activity of the recurrent units in the network) was projected onto the two-dimensional subspace spanned by the prediction vector and the precision vector. The prediction vector is the vector $w_{hp}$ of the weights from the recurrent units to the output unit of the network. The precision vector is the vector $w_{h\psi}$ of the weights of the precision readout described above. For the visualization, we orthogonalized the precision vector against the prediction vector using the Gram-Schmidt process (i.e. by subtracting from the precision vector its projection onto the prediction vector), and used the orthogonalized precision vector to define the y-axis shown in Figure 4b.
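The projection can be sketched as follows (illustrative code; `w_hp` and `w_hpsi` stand for the prediction and precision weight vectors):

```python
import numpy as np

def prediction_precision_axes(w_hp, w_hpsi):
    """Orthonormal axes of the prediction-precision subspace: the normalized
    prediction vector, and the precision vector orthogonalized against it
    (one Gram-Schmidt step)."""
    u1 = w_hp / np.linalg.norm(w_hp)
    v = w_hpsi - (w_hpsi @ u1) * u1  # remove the component along the prediction axis
    u2 = v / np.linalg.norm(v)
    return u1, u2

def project_activity(h, u1, u2):
    """Coordinates of population activity h in the 2D subspace."""
    return h @ u1, h @ u2
```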

Perturbation experiment to test precision-weighting

Request a detailed protocol

The perturbation experiment reported in Figure 5 is designed to test the causal role of the precision read from a given network on its weighting of the next observation, measured through its effective learning rate. We performed this perturbation experiment on each of the 20 networks that were trained within each of the four architectures we considered. The causal instrument is a perturbation vector q that is added to the network’s recurrent unit activations. The perturbation vector was randomly generated subject to the following constraints:

  • $q \cdot w_{h\psi} = \delta\psi$ is the desired change in precision (we used five levels) that is read from the units’ activities; it is computed by projecting the perturbation onto the weight vector of the precision readout ($w_{h\psi}$; $\cdot$ denotes the dot product);

  • the perturbation $q$ induces no change in the prediction of the network: $q \cdot w_{hp} = 0$, where $w_{hp}$ is the weight vector of the output unit of the network;

  • the perturbation has a constant intensity $c$ across simulations, which we formalize as the norm of the perturbation: $\|q\| = c$.

We describe below the algorithm that we used to generate random perturbations $q$ satisfying these constraints. The idea is to decompose $q$ into two components, both of which leave the prediction unaffected: the first ($q_\psi$) induces a controlled change in precision; the second ($q_r$) does not change the precision but is added to ensure a constant intensity of the perturbation across simulations.

  1. To ensure no change in prediction, we compute $Q$, the subspace of the activation space spanned by all vectors orthogonal to the prediction weight vector $w_{hp}$, as the null space of $w_{hp}$ (i.e. the orthogonal complement of the subspace spanned by $w_{hp}$; dimension $N-1$).

  2. We compute $q_\psi$, the vector component of $Q$ that affects precision, as the orthogonal projection of $w_{h\psi}$ onto $Q$ ($q_\psi$ is thus collinear with the orthogonalized precision axis shown in Figure 4b and described above).

  3. We compute $\beta_\psi$, the coefficient to assign to $q_\psi$ in the perturbation vector to produce the desired change in precision $\delta\psi$, as $\beta_\psi = \dfrac{\delta\psi}{q_\psi \cdot w_{h\psi}}$.

  4. We compute $R$, the subspace spanned by all vector components of $Q$ that do not affect precision, as the null space of $q_\psi$ within $Q$ (dimension $N-2$). A perturbation vector in $R$ therefore leaves both the prediction and the precision unchanged.

  5. We draw a random unit vector $q_r$ within $R$ (by drawing its $N-2$ components at random).

  6. We compute $\beta_r$, the coefficient to assign to $q_r$ in the perturbation vector so that the final perturbation’s norm equals $c$, as $\beta_r = \sqrt{c^2 - \beta_\psi^2 \|q_\psi\|^2}$.

  7. We combine $q_\psi$ and $q_r$ into the final perturbation vector as $q = \beta_\psi q_\psi + \beta_r q_r$.
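Steps 1–7 can be condensed into the following sketch (illustrative code, assuming the desired change is feasible, i.e. $c^2 \geq \beta_\psi^2 \|q_\psi\|^2$):

```python
import numpy as np

def make_perturbation(w_hp, w_hpsi, delta_psi, c, seed=None):
    """Random perturbation q with q @ w_hpsi = delta_psi, q @ w_hp = 0,
    and norm(q) = c, following the steps above."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: component of the precision readout orthogonal to the
    # prediction axis (any vector along it lies in Q, so it leaves the
    # prediction unchanged)
    u1 = w_hp / np.linalg.norm(w_hp)
    q_psi = w_hpsi - (w_hpsi @ u1) * u1
    # Step 3: coefficient producing the desired change in read precision
    beta_psi = delta_psi / (q_psi @ w_hpsi)
    # Steps 4-5: random unit vector orthogonal to both w_hp and q_psi
    u2 = q_psi / np.linalg.norm(q_psi)
    v = rng.normal(size=len(w_hp))
    v -= (v @ u1) * u1 + (v @ u2) * u2
    q_r = v / np.linalg.norm(v)
    # Steps 6-7: scale the residual component so the total norm equals c
    beta_r = np.sqrt(c**2 - beta_psi**2 * (q_psi @ q_psi))
    return beta_psi * q_psi + beta_r * q_r
```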

The experiment was run on a set of 1000 sample time points randomly drawn from 300 sequences. First, the unperturbed learning rate was measured by running the network on all of the sequences. Second, for each sample time point, the network was run unperturbed up until that point, a perturbation vector was randomly generated for the desired change of precision and applied to the network at that point, then the perturbed network was run on the next time point and its perturbed learning rate was measured. This was repeated for each level of change in precision. Finally, for a given change in precision, the change in learning rate was calculated as the difference between the perturbed and the unperturbed learning rate.

For statistical analysis, we ran a one-tailed paired t-test to test whether the population’s mean change in learning rate was higher at one level of precision change than at the next level of precision change. This was done for each of the four consecutive pairs of levels of change in precision.

Test of higher-level inference about changes

Request a detailed protocol

For a given network architecture, higher-level inference about changes was assessed by comparing the population of 20 networks trained in the environment with coupled change points to the population of 20 networks trained in the environment with independent change points.

In Figure 7c, the change in unobserved bigram prediction for a given streak length m was computed as follows. First, prior sequences were generated and each network was run on each of the sequences. We generated initial sequences of 74 observations each, with a probability of 0.2 for the 'observed' bigram (which will render its repetition surprising) and a probability p for the 'unobserved' bigram equal to 0.2 or 0.8 (such probabilities, symmetric and substantially different from the default prior 0.5, should render a change in their inferred value detectable). We crossed all possibilities (0|0 or 1|1 as observed bigram, 0.2 or 0.8 for p) and generated 100 sequences for each (hence 400 sequences in total). Second, at the end of each of these initial sequences, the prediction for the unobserved bigram, $p_{\text{before}}$, was queried by retrieving the output of the network after giving it as input ‘0’ if the unobserved bigram was 0|0 or ‘1’ otherwise. Third, the network was further presented with m repeated observations of the same value: ‘1’ if the observed bigram was 1|1 or ‘0’ otherwise. Finally, after this streak of repetitions, the new prediction for the unobserved bigram, $p_{\text{after}}$, was queried (as before) and we measured its change with respect to the previous query, $|p_{\text{after}} - p_{\text{before}}|$. This procedure was repeated for m ranging from 2 to 75.

For statistics, we ran a one-tailed two independent samples t-test to test whether the mean change in unobserved bigram prediction of the population trained on coupled change points was higher than that of the population trained on independent change points.

Complexity analyses

Request a detailed protocol

The complexity analysis reported in Figure 8 consisted in measuring, for each network architecture and each environment, the performance of optimally trained networks as a function of the number of units N. For optimal training, hyperparameter optimization was repeated at several values of N for each type of network and each environment (the resulting values are listed in Table 1). For the complexity analysis, a grid of equally spaced N values in logarithmic space between 1 and 45 was generated; an additional value of 474 was included specifically for the networks without recurrent weight training so as to match their number of trained parameters to that of an 11-unit gated recurrent network, and an additional value of 1000 was included specifically for the networks without gating to facilitate the extrapolation. For every value on this grid, 20 networks of a given architecture in a given environment were randomly initialized with the set of hyperparameter values determined to be optimal for the nearest neighboring N value in logarithmic space. The performance of these networks after training was evaluated using a new pair of training and test datasets per environment, consisting of 400 minibatches of 20 sequences for training and 1000 sequences for testing.

Statistics

To assess the variability between different agent solutions, we trained 20 agents for each type of agent and each environment. These agents have different random seeds (which changes their parameter initialization and how their training data is shuffled). Throughout the article, we report mean or median over these agents, and individual data points when possible or 95 % confidence intervals (abbreviated as "CI") otherwise, as fully described in the text and figure legends. No statistical methods were used to pre-determine sample sizes but our sample sizes are similar to those reported in previous publications (Masse et al., 2019; Yang et al., 2019). Data analysis was not performed blind to the conditions of the experiments. No data were excluded from the analyses. All statistical tests were two-tailed unless otherwise noted. The data distribution was assumed to be normal, but this was not formally tested. The specific details of each statistical analysis are reported directly in the text.

Code availability

Request a detailed protocol

The code to exhaustively reproduce the analyses of this paper is available at https://github.com/cedricfoucault/networks_for_sequence_prediction and archived on Zenodo with DOI: 10.5281/zenodo.5707498. This code also makes it possible to train new networks with any number of units and to generate Figures 2–7 with those networks.

Data availability

Request a detailed protocol

This paper presents no experimental data. All synthetic data are available in the code repository at https://github.com/cedricfoucault/networks_for_sequence_prediction and archived on Zenodo with DOI: 10.5281/zenodo.5707498.


The following data sets were generated:

  1. Foucault C (2021) GitHub. Networks for sequence prediction. https://github.com/cedricfoucault/networks_for_sequence_prediction

  2. Foucault C (2021) Zenodo. Networks for sequence prediction. https://doi.org/10.5281/zenodo.5707498

References

  1. Conference
    1. Amini A
    2. Schwarting W
    3. Soleimany A
    4. Rus D
    (2020)
    Advances in Neural Information Processing Systems
    Deep Evidential Regression. pp. 14927–14937.
  2. Conference
    1. Cho K
    2. van Merrienboer B
    3. Gulcehre C
    4. Bahdanau D
    5. Bougares F
    6. Schwenk H
    7. Bengio Y
    (2014) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
    Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. pp. 1724–1734.
    https://doi.org/10.3115/v1/D14-1179
  3. Conference
    1. Costa R
    2. Assael IA
    3. Shillingford B
    4. de Freitas N
    5. Vogels T
    (2017)
    Advances in Neural Information Processing Systems
    Cortical microcircuits as gated-recurrent neural networks.
  4. Conference
    1. Kingma DP
    2. Ba J
    (2015)
    3rd International Conference on Learning Representations, ICLR 2015
    Adam: A Method for Stochastic Optimization.
  5. Conference
    1. LeCun Y
    2. Denker J
    3. Solla S
    (1990)
    Advances in Neural Information Processing Systems
    Optimal Brain Damage.
  6. Conference
    1. LeCun Y
    (2016)
    Proc. Speech NIPS
    Predictive learning.
  7.
    1. Lee TS
    2. Mumford D
    (2003) Hierarchical Bayesian inference in the visual cortex
    Journal of the Optical Society of America. A, Optics, Image Science, and Vision 20:1434–1448.
    https://doi.org/10.1364/josaa.20.001434
  8. Conference
    1. Rescorla RA
    2. Wagner AR
    (1972)
    Classical Conditioning II: Current Research and Theory
    A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. pp. 64–99.
  9. Conference
    1. Schaeffer R
    2. Khona M
    3. Meshulam L
    4. Laboratory IB
    5. Fiete IR
    (2020) NeurIPS Proceedings
    Reverse-engineering Recurrent Neural Network solutions to a hierarchical inference task for mice.
    https://doi.org/10.1101/2020.06.09.142745
  10. Book
    1. Schäfer AM
    2. Zimmermann HG
    (2006) Recurrent Neural Networks Are Universal Approximators
    In: Kollias SD, Stafylopatis A, Duch W, Oja E, editors. Artificial Neural Networks – ICANN 2006. Berlin Heidelberg: Springer. pp. 632–640.
    https://doi.org/10.1007/11840817
  11.
    1. Simon HA
    (1972)
    Theories of bounded rationality
    Decision and Organization 1:161–176.
  12.
    1. Srivastava N
    2. Hinton G
    3. Krizhevsky A
    4. Sutskever I
    5. Salakhutdinov R
    (2014)
    Dropout: A simple way to prevent neural networks from overfitting
    The Journal of Machine Learning Research 15:1929–1958.
  13. Conference
    1. Sterling P
    (2004) Allostasis, Homeostasis, and the Costs of Physiological Adaptation
    Principles of allostasis: Optimal design, predictive regulation, pathophysiology, and rational therapeutics. pp. 17–64.
    https://doi.org/10.1017/CBO9781316257081
  14. Conference
    1. Sutskever I
    2. Martens J
    3. Dahl G
    4. Hinton G
    (2013)
    International Conference on Machine Learning
    On the importance of initialization and momentum in deep learning. pp. 1139–1147.
  15. Conference
    1. Sutton R
    (1992)
    Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems
    Gain Adaptation Beats Least Squares. pp. 161–166.
  16. Conference
    1. Yu AJ
    2. Cohen JD
    (2008)
    Advances in Neural Information Processing Systems
    Sequential effects: Superstition or rational behavior? pp. 1873–1880.

Decision letter

  1. Srdjan Ostojic
    Reviewing Editor; Ecole Normale Superieure Paris, France
  2. Michael J Frank
    Senior Editor; Brown University, United States
  3. Srdjan Ostojic
    Reviewer; Ecole Normale Superieure Paris, France
  4. Mehrdad Jazayeri
    Reviewer; Massachusetts Institute of Technology, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Gated recurrence enables simple and accurate sequence prediction in stochastic, changing, and structured environments" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Srdjan Ostojic as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Mehrdad Jazayeri (Reviewer #2).

The three reviewers are enthusiastic about the manuscript, but have found that the main claims need to be contextualised or rephrased to avoid giving an overstated impression. In the absence of any direct comparison with human behavior and/or neural activity, and considering the high degree of abstraction in the model (11 units with abstract computational building blocks), the paper needs a major revision of the Discussion to highlight the gap between the results and neurobiology/behavior.

The Reviewing Editor has drafted a consolidated review to help you prepare a revised submission.

Essential revisions:

1. The most notable weakness of the paper is that it is not clear whether its aim is to develop a neural model that is close to optimal or a neural model that explains how biological brains handle stochasticity and volatility. There is no serious and quantitative comparison to behavior or neural data recorded in humans or animal models. All the comparisons are with other algorithms and reduced GRU networks. One can appreciate these comparisons if the goal is to show that a full GRU network is close to optimal (which, in many cases, it is). But do humans exhibit a similar level of optimality? One possibility would have been to provide some sort of analysis that would show that the types of errors the model makes are in some counterintuitive (or even intuitive) way similar to the types of errors humans make. In some of the papers where certain heuristics were proposed, the entire goal was to explain characteristic sub-optimalities in human behavior. As an example, see the recent paper from the Koechlin group in Nature Human Behavior. More generally, there is no shortage of papers quantifying human behavior in stochastic volatile environments. It would be great to see that the errors humans make in at least some task map onto the errors GRU networks make. Imagine for example that such a comparison would show that human errors are more similar to a lesioned version of the GRU, even though the full GRU is closer to optimal. The natural conclusion for such an observation would be that some of the proposed mechanisms are in fact not at play. In any case, all the reviewers think the comparison to human behavior would be valuable, and should be at minimum extensively discussed.

2. On the importance of gating: a lot of emphasis is put on the necessity of gating (e.g. title, abstract, discussion line 478). But the methods used in the paper cannot demonstrate necessity. Indeed other studies (see e.g. Collins, Sohl-Dickstein and Sussillo, arxiv 2016) have argued that gating in RNNs improves their trainability, but does not increase their capacity. That study argued that large vanilla RNNs are able to reach the same performance as gated RNNs with more extensive training and/or hyper-parameter tuning. The claims and discussion should be revised to reflect this limitation.

3. The biological relevance of gating seems also somewhat over-stated (eg in the abstract): while there is no doubt various forms of gating are present in the nervous system, how they map to the specific time-dependent form used in GRUs is far from clear. The relationship of these gate variables with actual synapses, neurons, or populations of neurons is at best speculative at this point.

4. In terms of comparing to biology, the discussion states that "mapping between artificial units and biological neurons may not be straightforward." But biological and artificial models can still be compared quite effectively in terms of activity in the state space, and these comparisons can help reject hypotheses quite effectively. Training RNNs has been a productive avenue for understanding neural computations in recent years; in many studies, networks of this class are constrained or contrasted with experimental data (Mante and Sussillo et al., 2013, Rajan et al., 2016 or Finkelstein and Fontolan et al., 2021 as some examples). It could have been possible to try to understand the geometry of neural representations of latent variables in network dynamics and how it is learned and depends on the environment. Additionally, by performing dynamical system analysis (see e.g. Sussillo and Barak, 2013 or Dubreuil and Valente et al., bioRxiv as examples) it might be possible to understand the role of gating in the network computations.

5. The focus on very small networks does not necessarily seem relevant when comparing with biological networks (the phrase "reasonably sized networks" on l.479 seems inappropriate). The analysis of network size in Figure 7 goes up to 45 units, which remains very small, and it is difficult to extrapolate the results to larger networks. For instance, large vanilla RNNs implement an effective form of gating based on their non-linearity (Dubreuil et al. 2020), and this mechanism may be able to drastically increase sequence-prediction performance.

6. Another weakness of the paper is that, for each new task, it trains a new GRU. Humans seem to be able to adapt to changes in the latent structure of the generative process without massive retraining. How does this flexibility map onto the proposed scheme? In one of the supplements, cross-task performances have been shown. One notable result is that a GRU trained on a changing bigram with or without coupled change points does quite poorly on the changing unigram. This is an example of failed generalization from a much more complex latent structure to a simpler one, which is indicative of overfitting (to the structure of a generative model – not its parameters). Somewhat counterintuitively, for the GRU model (as well as various other models), the smallest hit on generalization performance occurs when the models are trained on the changing unigram, which is the simplest latent structure considered. This is consistent with several psychophysical studies suggesting that humans may not rely on accurate latent models and may instead rely on simpler heuristics. In the end, is it justified to train new GRUs for each task?

7. Note that LSTMs are able to perform computations similar to the ones in this study, as shown in Wang and Kurth-Nelson et al., 2019.

8. As a more technical point, the comparison with networks without gating does not seem fully fair. Freezing gating effectively reduces the number of time-dependent variables by a factor of 3. Also, when freezing gating, one could treat the gating parameters as fixed hyper-parameters to be optimized, rather than setting them by hand to one.

https://doi.org/10.7554/eLife.71801.sa1

Author response

Essential revisions:

1. The most notable weakness of the paper is that it is not clear whether its aim is to develop a neural model that is close to optimal or a neural model that explains how biological brains handle stochasticity and volatility. There is no serious and quantitative comparison to behavior or neural data recorded in humans or animal models. All the comparisons are with other algorithms and reduced GRU networks. One can appreciate these comparisons if the goal is to show that a full GRU network is close to optimal (which, in many cases, it is). But do humans exhibit a similar level of optimality? One possibility would have been to provide some sort of analysis that would show that the types of errors the model makes are in some counterintuitive (or even intuitive) way similar to the types of errors humans make. In some of the papers where certain heuristics were proposed, the entire goal was to explain characteristic sub-optimalities in human behavior. As an example, see the recent paper from the Koechlin group in Nature Human Behavior. More generally, there is no shortage of papers quantifying human behavior in stochastic volatile environments. It would be great to see that the errors humans make in at least some task map onto the errors GRU networks make. Imagine for example that such a comparison would show that human errors are more similar to a lesioned version of the GRU, even though the full GRU is closer to optimal. The natural conclusion for such an observation would be that some of the proposed mechanisms are in fact not at play. In any case, all the reviewers think the comparison to human behavior would be valuable, and should be at minimum extensively discussed.

The primary aim of our study is to develop neural models that are both close to optimal and simple (i.e. with a small number of units), and to determine under what conditions they can do so, rather than to develop models that can be directly compared with biological brains. Still, the models we develop can inform neuroscience insofar as the tasks we have chosen are tasks that humans and other animals are capable of doing, and in which they show the specific qualitative aspects of optimality that we have investigated (even if they are otherwise suboptimal in several ways). We have modified the Introduction (l. 28, 30, 71) and the Abstract (l. 13) to make our goal clearer. We also now provide further details on several citations throughout the Results by pointing to the relevant figures of previous papers where these qualitative signatures are observed in humans (see l. 197–198, 241, 242–243, 406).

The direct comparison with the brain (behavioral or neural data), and in particular its suboptimalities, remains a very interesting future direction and it was not sufficiently discussed in the previous version of the manuscript. We have added a section in the Discussion dedicated to this topic and have incorporated new elements: see the section "Suboptimalities in human behavior" l. 607. In particular, we have detailed three possible ways to explore suboptimality with the networks: using networks with less training, using networks with fewer units or sparser connections, or using networks that are altered in some way (as suggested by the reviewers).

Note that although there is no shortage of experimental data on learning in stochastic and volatile environments in general, a direct comparison of the data between our study and previous experimental studies can rarely be made, either because the participant responses are categorical choices (often binary) rather than continuous estimates (e.g. Findling, Chopin, and Koechlin, 2021; Findling and Wyart et al. 2019), or because the generative process is very different (such as when observations are sampled from a Gaussian, e.g. Nassar et al. 2010; 2012; Prat-Carrabin et al., 2021). The lack of experimental data suitable for direct comparison is even more pronounced in the case of the changing bigram environments (the second and third environments in our study): the only data we are aware of are those collected in our lab, which have the shortcoming that participant responses are far too infrequent (one question every ~15 observations on average, Meyniel et al. 2015; 2017; 2019; 2020). We intend to acquire new data (including trial-by-trial estimates) to allow such a comparison in the future.

2. On the importance of gating: a lot of emphasis is put on the necessity of gating (e.g. title, abstract, discussion line 478). But the methods used in the paper cannot demonstrate necessity. Indeed other studies (see e.g. Collins, Sohl-Dickstein and Sussillo, arxiv 2016) have argued that gating in RNNs improves their trainability, but does not increase their capacity. That study argued that large vanilla RNNs are able to reach the same performance as gated RNNs with more extensive training and/or hyper-parameter tuning. The claims and discussion should be revised to reflect this limitation.

We agree with the reviewer that our study cannot prove necessity in the strict mathematical sense. Proving necessity would require proving the non-existence of other architectures with similar performance; in practice, we can only compare a limited number of architectures (one could conceive of others), and even within these architectures, we cannot test the infinity of possible parameter values. We had tried to say this in the Discussion paragraph about the minimal set of mechanisms, but based on the reviews we now realize that it was not sufficient. We have rephrased this Discussion paragraph (see l. 544–560), and screened our text to eliminate phrasing suggestive of strict necessity (including in the Abstract, the Introduction, the Results, and the Discussion).

We also agree that a much larger vanilla RNN can achieve the same task performance as a smaller gated RNN. We intended to demonstrate this point through Figure 8 and the related text. To better convey this message, we have rephrased the text (see new paragraph l. 466 and legend l. 463), and have added to Figure 8 a new data point corresponding to a much larger number of units for the vanilla RNN, to facilitate the extrapolation and indicate that a larger vanilla RNN can ultimately approach optimality. We interpret this as evidence of the advantage afforded by gating to perform the computation simply, i.e. with few units (see also our response to comment #5).

This slow growth of the vanilla RNN’s performance with the number of units is well described by a power law. More precisely, if N is the number of units, and p is the % of optimal performance, the law would be: (100 − p) = c (1/N)^α. We fitted this law in the unigram environment with a least-squares linear regression on the logarithms of N and (100 − p) using the data points from 2 to 45 units, and obtained a goodness-of-fit R^2 = 92.4%. We then extrapolated to N = 1000 using the fitted parameters, and found that the predicted performance was within 0.2% of the performance we actually obtained for networks of this size (predicted: 97.8%, obtained: 97.6%), which further confirms the validity of the power law. Based on this power law, more than 10^4 units would be needed for the vanilla RNN to reach the performance of the GRU at 11 units. We have reported this power law analysis in the revised manuscript (see new paragraph l. 466).
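For concreteness, this fitting-and-extrapolation procedure can be sketched as follows. This is an illustrative reimplementation, not the authors' code, and the performance values are placeholders standing in for the vanilla RNN's measured data points:

```python
import numpy as np

# Placeholder data: network sizes N and % of optimal performance p.
# In the paper, these would be the vanilla RNN's measured performances
# for networks of 2 to 45 units (illustrative values here).
N = np.array([2.0, 5.0, 11.0, 22.0, 45.0])
p = np.array([55.0, 70.0, 80.0, 86.0, 90.0])

# Fit (100 - p) = c * (1/N)^alpha by linear least squares in log-log space:
# log(100 - p) = log(c) - alpha * log(N)
x, y = np.log(N), np.log(100.0 - p)
slope, intercept = np.polyfit(x, y, 1)
alpha, c = -slope, np.exp(intercept)

# Goodness of fit (R^2) on the log-transformed data
residuals = y - (intercept + slope * x)
r2 = 1.0 - residuals.var() / y.var()

def predicted_performance(n):
    """Extrapolate the fitted power law to a network of n units."""
    return 100.0 - c * (1.0 / n) ** alpha

print(f"alpha = {alpha:.2f}, R^2 = {r2:.1%}")
print(f"predicted performance at N = 1000: {predicted_performance(1000.0):.1f}%")
```

The extrapolation simply evaluates the fitted law at a larger N; with real data, its reliability depends on the power law continuing to hold beyond the fitted range, which the authors checked against networks of 1000 units.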

Regarding trainability: gating is indeed best known for improving the network’s trainability; however, the fact that gating seems advantageous for performing the computation we’re interested in with few units, and not just for trainability, is one outcome of our study that we find interesting. We tried as much as possible to eliminate the trainability factor and approach the best possible performance for each network architecture by conducting an extensive hyperparameter optimization (repeated for each task, each architecture, and several numbers of units). One indication that this procedure worked well is that a plateau is reached (Snoek et al., 2012): the optimal value was always found in the first three quarters of the procedure (most often in the first half); in the last quarter, the validation performances of the new samples are almost identical (although lower), which contrasts with the highly variable performance of the first samples and indicates that Bayesian optimization does not gain from further exploration. Still, we have modified the text to mention the issue of trainability and better gauge the strength of the claim (see paragraph l. 556).

Our findings are not at odds with Collins et al.'s (2016) argument that gating does not increase the capacity of an RNN, because capacity (as measured in their study) is not what we measured. In their study, capacity was defined either as the number of bits per parameter that the RNN can store about its task during training, or as the number of bits per hidden unit that the RNN can remember about its input history. What we measured, and what we’re interested in, is the capability to perform the specific type of probabilistic inference in the specific type of environments that we have introduced (not to perform any task). In fact, capacity is actually what we want to control for rather than measure: given a certain memory capacity, does a particular architecture perform better than another? (See also our response to comment #5 about simplicity.)

3. The biological relevance of gating seems also somewhat over-stated (eg in the abstract): while there is no doubt various forms of gating are present in the nervous system, how they map to the specific time-dependent form used in GRUs is far from clear. The relationship of these gate variables with actual synapses, neurons, or populations of neurons is at best speculative at this point.

We fully agree, and this is actually what we meant when we listed different possible candidates for gating in biology: it is speculative. We have strengthened this point by now stating it explicitly in the Discussion (see l. 564). What we meant was that, since gating as a computational mechanism seems useful for solving the kind of problems that the brain faces, it is an invitation for us as neuroscientists to examine whether the processes at play in the brain can be interpreted as performing gating; this invitation is all the more welcome given that many forms of gating have already been observed in biology. We also agree that the GRU has a very specific form of gating, and we did not mean to imply that it is only this very specific form that one should consider. When exploring biological substrates, it is therefore important not to be too attached to the precise form of gating of the GRU. We have rephrased the Discussion to stress this point (see l. 564–566) and have provided additional references for the possible biological implementations of gating (l. 573–574 and 574–576).

4. In terms of comparing to biology, the discussion states that "mapping between artificial units and biological neurons may not be straightforward." But biological and artificial models can still be compared quite effectively in terms of activity in the state space, and these comparisons can help reject hypotheses quite effectively. Training RNNs have been a productive avenue for understanding neural computations in the past years, in many studies of this class networks are constrained or contrasted by experimental data (Mante and Sussillo et al., 2013, Rajan et al., 2016 or Finkelstein and Fontolan et al., 2021 as some examples). It could have been possible to try to understand the geometry of neural representations of latent variables in network dynamics and how it is learned and depends on the environment. Additionally, by performing dynamical system analysis (see eg Susillo and Barak, 2013 or Dubreuil and Valente et al., bioRxiv as examples) it might be possible to understand the role of gating in the network computations.

First, concerning the comparison to biology, please see our response to comment #1.

Second, we would like to thank the reviewers for their suggestion, which allowed us to illustrate our point in a different, geometrical, and telling way. We have followed the reviewers’ suggestion and made a new figure (analogous to Figures 2 and 5 in Mante and Sussillo et al., 2013) that illustrates the dynamics of network activity in the state space, with and without gating, and how these relate to the ideal observer behavior—see Figure 4b. This helps to understand the network computations and the difference that gating makes. The geometry of the trajectories shows that, with gating, the network is able to separate the information about the precision of its estimate from the information about the prediction, and to use the former to adapt its rate of update in the latter, whereas without gating, these two are not separated.

This allowed us to see that, in the network without gating, the decoded precision seemed very strongly dependent on the prediction. To quantify this dependence, we computed the mutual information between the decoded precision and the network’s prediction. It turned out to be very high in the network without gating (median MI=5.2) compared to the network with gating (median MI=0.7) and the ideal observer (MI=0.6). Note that the mutual information is not zero in the ideal observer (and the GRU) because precision tends to be higher for more predictable observations (i.e. when the prediction gets closer to 0 or 1). This is consistent with the rest of our results and completes our argument because adaptive behavior leverages the part of precision that is independent of the prediction.
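To illustrate this kind of quantification, here is a minimal histogram-based mutual information estimate between two scalar variables, applied to synthetic data. This is a generic sketch, not the estimator or data used in the paper; the variable names, the synthetic coupling, and the binning choice are our own assumptions:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of the mutual information (in bits)
    between two 1-D continuous variables."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
prediction = rng.uniform(0.0, 1.0, size=10_000)

# A "decoded precision" strongly dependent on the prediction
# (analogous to the network without gating)...
precision_coupled = prediction + 0.01 * rng.normal(size=10_000)
# ...versus one largely independent of it (analogous to the gated network).
precision_decoupled = rng.uniform(0.0, 1.0, size=10_000)

mi_coupled = mutual_information(prediction, precision_coupled)
mi_decoupled = mutual_information(prediction, precision_decoupled)
```

With finite samples, histogram estimators have a small positive bias even for independent variables, which is one reason a nonzero MI (as for the ideal observer) must be interpreted against an appropriate baseline.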

We have incorporated this supplementary analysis and the new figure into our results by splitting the old figure 4 and the corresponding section of the Results into two figures and sections, revamping the text and figures accordingly (see l. 251–295, l. 230, l. 296, Figure 4, and Figure 5), and completing the Methods (l. 870–877 and l. 866–867).

This suggestion also helped us to illustrate the perturbation experiment (see bottom left diagram in Figure 5).

5. The focus on very small network does not necessarily seem relevant when comparing with biologic networks (the phrase "reasonably sized networks" on l.479 seems inappropriate). The analysis of network size in Figure 7 goes until 45 units, which remains very small, and it's difficult to extrapolate the results to larger networks. For instance, large vanilla RNNs implement an effective form of gating based on their non=linearity (Dubreuil et al. 2020), and this mechanism may be able to drastically increase sequence-prediction performance.

Please see our response to comment #1 about our primary goal which is not to develop networks directly comparable with biological neural networks. The phrase "reasonably sized networks" was misleading in that respect and we removed it; thank you for pointing it out.

In response to comment #2, we have added a data point to Figure 8 to facilitate the extrapolation to larger vanilla RNNs.

As for the biological implementation of this gating, we quite agree that it remains an open question: do biological neural networks use a mechanism to perform this gating without many neurons, or do they use a very large number of neurons to perform an effective gating as a vanilla RNN would (these are not mutually exclusive)? We have added the latter to our list of possible biological implementations of gating, along with the references that detail how this effective form of gating can be achieved (Beiran, Dubreuil, Valente, Mastrogiuseppe, Ostojic, Neural Computation 2021; Dubreuil, Valente, Beiran, Mastrogiuseppe, Ostojic, bioRxiv) (l. 574–576).

Regarding our focus on small networks, it is motivated by the desideratum of simplicity, which has two advantages:

1) The reduced model description, which provides better understanding. As scientists, we do not merely want our model to perform the task, we also want to understand how it does it. Constraining the size of the network ensures that the algorithm it performs can be described simply, i.e. with a few effective state variables. Knowing which key computational building blocks enable such simple solutions provides insight into the functioning of the system. This is similar to model reduction approaches as described in (Jazayeri and Ostojic, 2021, last paragraph before the conclusion), such as the reduction to a 2-unit network in (Schaeffer et al., 2020), or the reduction to an effective circuit with 2 internal variables in (Dubreuil et al., 2020).

2) The efficiency of the solution (low-memory, low-computational complexity). This is relevant for the brain insofar as the brain's computational resources are limited (Lieder and Griffiths, 2020). Here by “computational resources” we mean more precisely the amount of memory required for the computation, which is often quantified by the Shannon capacity, i.e. the number of bits that can be transmitted per unit of time (see for example Bates and Jacobs 2020; Bhui, Lai, and Gershman, 2021). In our case, this amounts to the number of units (each unit stores the same number of bits, encoded by the hidden state). Therefore, the minimum number of units sufficient for near-optimal performance gives us a measure of efficiency. (Orhan and Ma, 2017) also used this measure of efficiency.

Given the reviewers’ comments, it seems that this point about simplicity was not sufficiently well conveyed in the previous version of the manuscript. We have modified the Introduction (paragraph l. 73) to better motivate our focus on small networks and relate it to simplicity more explicitly, and have further elaborated on it in the Discussion including the above two advantages (l. 548–555).

6. Another weakness of the paper is that, for each new task, it trains a new GRU. Humans seem to be able to adapt to changes in the latent structure of the generative process without massive retraining. How does this flexibility map onto the proposed scheme? In one of the supplements, cross-task performances have been shown. One notable result is that a GRU trained on a changing bigram with or without coupled change points does quite poorly on the changing unigram. This is an example of failed generalization from a much more complex latent structure to a simpler one, which is indicative of overfitting (to the structure of a generative model – not its parameters). Somewhat counterintuitively, for the GRU model (as well as various other models), the smallest hit on generalization performance occurs when the models are trained on the changing unigram, which is the simplest latent structure considered. This is consistent with several psychophysical studies suggesting that humans may not rely on accurate latent models and may instead rely on simpler heuristics. In the end, is it justified to train new GRUs for each task?

Regarding the cross-task performances, it seems that there was some misunderstanding because our results actually show the opposite: it is the GRU trained in the more complex environment (either of the bigram environments) that generalizes best to the simpler environment (unigram) (Figure 6—figure supplement 1). The reviewer's comment made us realize that this figure was difficult to read in the previous version. We therefore grouped the data differently and present another set of comparisons to highlight this result more clearly: for one GRU trained in a given environment, the performances in the three test environments are now side by side, which allows the reader to better see the generalization performance given one training environment and to compare it with that given a different training environment (see Figure 6—figure supplement 1).

Regarding the question of whether it is justified to train a new GRU for each environment given that humans seem to be able to adapt to the environment without massive retraining: In fact, it would be unfair to compare the GRUs’ generalization performance as presented here with humans’ ability to generalize as observed in our lab, because when humans do this task in the lab, they are explicitly told what the latent structure is (i.e., the generative process of the observations); they do not have to discover it, unlike GRUs. This point was mentioned but was not explicit enough; we now explain it in a new Discussion paragraph (l. 633).

In this study, we focused on the ability to leverage the latent structure during inference rather than the ability to discover this structure during training. From a theoretical point of view, neither the GRU nor humans can be expected to discover the structure purely from the observations without a large sample size, since even an ideal observer model that arbitrates between the two bigram structures in a statistically optimal fashion requires many observations to determine the correct structure—see Heilbron and Meyniel (2019) p.14:

“In our task, the optimal hierarchical model is able to correctly identify the current task structure (coupled vs. uncoupled change points), but only with moderate certainty even after observing the entire experiment presented to one subject (log-likelihood ratios range from 2 to 5 depending on subjects) [one experiment corresponds to 4 sequences i.e. 4*380=1520 observations]. […] We speculate that in real-life situations, some cues or priors inform subjects about the relevant dependencies in their environment; if true, then our experiment in which subjects were instructed about the correct task structure may have some ecological validity.”.

Regarding humans’ ability to flexibly switch from one structure to another without retraining, given a cue about the current structure: it would be interesting to study the same ability in our networks. This could be done by giving the network an additional input that codes for the cue. We now mention this future direction in the new Discussion paragraph (see l. 637–641).
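As a minimal illustration of how such a cue input could be implemented (all names here are hypothetical and not taken from our code), the cue could simply be appended as an extra input dimension, so that the same network receives both the observation and a structure indicator at each time step:

```python
# Hypothetical sketch: feeding a structure cue alongside each observation.
# The network itself is unchanged; only its input dimension grows by one.

def make_inputs(observations, structure_cue):
    """Pair each binary observation with a constant cue coding the
    current latent structure (e.g., 0 = unigram, 1 = bigram)."""
    return [(x, structure_cue) for x in observations]

# The same observation sequence presented under two different cues:
seq = [0, 1, 1, 0]
unigram_inputs = make_inputs(seq, structure_cue=0)  # [(0, 0), (1, 0), (1, 0), (0, 0)]
bigram_inputs = make_inputs(seq, structure_cue=1)   # [(0, 1), (1, 1), (1, 1), (0, 1)]
```

A network trained with such paired inputs across environments could, in principle, learn to condition its inference on the cue, analogous to instructed human subjects.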

7. Note that LSTMs are able to perform similar computations like the ones in this study here as is shown in Wang and Kurt-Nelson et al., 2019.

Thank you for reminding us to mention the LSTM: it is a very popular architecture and many readers are likely to think of it too. We agree: the LSTM incorporates gating mechanisms similar to those of the GRU that allow it to perform the same computation. We have verified this in practice by repeating the hyperparameter optimization, training, and testing procedure with the LSTM: it indeed achieves a performance comparable to the GRU’s (99% in the unigram environment and 98% in the bigram environment; see Author response image 1). We have added a note in the paper mentioning that the LSTM architecture also incorporates gating and can achieve the same performance as the GRU (l. 690–692), and have rephrased our exposition of the architectures to indicate that the GRU is only one particular case of a ‘gated recurrent’ architecture (see l. 136–137).

Author response image 1
At an equal number of units, the LSTM matches the GRU in performance but is more complex.

The reason we had chosen the GRU over the LSTM is that we were looking for the minimal sufficient architecture, and the LSTM is a more complex architecture than the GRU, which turned out to be sufficient. LSTM units are more complex than GRU units in two ways: they have three gates instead of two, and they have an additional state variable called the “cell state” (or “memory cell”) on top of the hidden state. Thus, for the same number of units, the LSTM has not only more parameters than the GRU (at 11 units, 629 parameters for the LSTM versus 475 for the GRU), but also, more importantly, a state space twice as large as that of the GRU and the other architectures we considered (at 11 units, 22 state variables for the LSTM versus 11 for the GRU and the others; see our response to main comment #8 about which variables count as state variables). Moreover, the introduction of the cell state means that we cannot always perform on the LSTM the same analyses and interventions that we perform on the other architectures.
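For readers who want to check these counts, the arithmetic can be reproduced as follows. This is a sketch under assumed conventions (one scalar input per time step, separate input and recurrent bias vectors per gate as in common deep learning libraries, and a linear readout with one weight per unit plus a bias); exact totals can shift by a few parameters depending on these conventions, which is why the figures below land within one of those reported above:

```python
def recurrent_params(n_units, n_inputs, n_blocks):
    """Parameters of a recurrent layer built from n_blocks weight blocks
    (gates plus candidate), each with input weights, recurrent weights,
    and two bias vectors."""
    return n_blocks * (n_units * n_inputs + n_units * n_units + 2 * n_units)

H = 11           # number of units used in the comparison
readout = H + 1  # linear readout: one weight per unit plus a bias

gru_total = recurrent_params(H, 1, 3) + readout   # 3 blocks: reset, update, candidate
lstm_total = recurrent_params(H, 1, 4) + readout  # 4 blocks: input, forget, output, candidate

print(gru_total, lstm_total)  # 474 628 under these conventions
# The LSTM-GRU gap is exactly one extra block of weights:
assert lstm_total - gru_total == recurrent_params(H, 1, 1) == 154
```

The gap of 154 parameters (= 629 − 475) matches one full block at 11 units, confirming that the extra cost of the LSTM comes from its additional gate.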

8. As a more technical point, the comparison with networks without gating does not seem fully fair. Freezing gating effectively reduces the number of time-dependent variables by a factor 3. Also, when freezing gating, one could treat the gating parameters as fixed hyper-parameters to be optimized, rather than setting them by hand to one.

It seems that clarifying the definition of variables is key to answering this question. Removing gating does not reduce the number of state variables of the system, because what we called the “gating variables” (r and z) are not state variables. The hidden state (h) is the only state variable, since it alone suffices to determine the future behavior of the system (for the GRU and the others). Our use of the gating variables is merely for convenience of exposition, to make the GRU more intelligible (by labeling the factors in the equations that correspond to gating). One can equivalently characterize the system without these variables, using a single recurrence equation that contains only the hidden state. We added a note to mention this (l. 683–684). Furthermore, note that even when the size of the state space is tripled, the vanilla RNN without gating does not reach the performance of the GRU (Figure 8).
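To make this point concrete, here is a minimal single-unit GRU step in plain Python (following Cho et al.’s formulation, with hypothetical scalar weights chosen only for illustration). The gates z and r are recomputed from (h, x) at every step and carry no memory of their own, so the update is a function of the hidden state and the input alone:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical scalar weights for a single-unit GRU (illustration only).
W_z, U_z, b_z = 1.0, -0.5, 0.0   # update gate
W_r, U_r, b_r = 0.5, 1.0, 0.0    # reset gate
W_h, U_h, b_h = 1.5, -1.0, 0.0   # candidate activation

def gru_step(h, x):
    """One update of the hidden state. The gates z and r are local
    variables recomputed from (h, x); h is the only state variable."""
    z = sigmoid(W_z * x + U_z * h + b_z)
    r = sigmoid(W_r * x + U_r * h + b_r)
    h_tilde = math.tanh(W_h * x + U_h * (r * h) + b_h)
    return (1 - z) * h + z * h_tilde

# The trajectory is fully determined by the initial h and the inputs.
h = 0.0
for x in [1, 0, 1, 1]:
    h = gru_step(h, x)
```

Substituting the gate expressions into the last line yields a single recurrence h(t+1) = f(h(t), x(t)), which is the characterization without gating variables mentioned above.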

Regarding the possibility of treating the gating parameters as fixed hyperparameters: it is an interesting one. In the case of r, if we’re not mistaken, it should not change anything, because this fixed hyperparameter could be absorbed into the recurrent weights (w’=rw), which are optimized during training. In the case of z, it would amount to treating the time constant of the units as a hyperparameter. We have added a sentence in the Methods to mention this possibility (l. 698).
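The absorption argument can be checked numerically: with a fixed scalar gate r, a candidate update tanh(Wx + U(r·h)) is identical to tanh(Wx + (rU)h), so fixing r to any constant and rescaling the recurrent weight leaves the computation unchanged. A sketch with hypothetical scalar weights:

```python
import math

def candidate(h, x, W, U, r):
    """Candidate activation of a single-unit GRU with a fixed reset gate r."""
    return math.tanh(W * x + U * (r * h))

W, U, r = 0.8, 1.2, 0.3
for h, x in [(0.5, 1.0), (-0.2, 0.0), (1.1, 1.0)]:
    # A fixed gate r with recurrent weight U is equivalent (up to
    # floating-point rounding) to gate 1 with recurrent weight r*U.
    assert abs(candidate(h, x, W, U, r) - candidate(h, x, W, r * U, 1.0)) < 1e-12
```

Since training optimizes U freely, any fixed value of r is therefore redundant, which is why only z (the effective time constant) would act as a genuine hyperparameter.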

https://doi.org/10.7554/eLife.71801.sa2

Article and author information

Author details

  1. Cédric Foucault

    1. Cognitive Neuroimaging Unit, INSERM, CEA, Université Paris-Saclay, NeuroSpin center, Gif sur Yvette, France
    2. Sorbonne Université, Collège Doctoral, Paris, France
    Contribution
    Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Visualization, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-7247-6927
  2. Florent Meyniel

    Cognitive Neuroimaging Unit, INSERM, CEA, Université Paris-Saclay, NeuroSpin center, Gif sur Yvette, France
    Contribution
    Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Visualization, Writing - original draft, Writing - review and editing
    For correspondence
    florent.meyniel@cea.fr
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-6992-678X

Funding

École normale supérieure Paris-Saclay (PhD fellowship "Contrat doctoral spécifique normalien")

  • Cédric Foucault

Agence Nationale de la Recherche (18-CE37-0010-01 "CONFI LEARN")

  • Florent Meyniel

H2020 European Research Council (ERC StG 947105 "NEURAL PROB")

  • Florent Meyniel

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Yair Lakretz for useful feedback, advice, and discussions throughout the project, Alexandre Pouget for his input when starting this project, and Charles Findling for comments on a previous version of the manuscript.

Senior Editor

  1. Michael J Frank, Brown University, United States

Reviewing Editor

  1. Srdjan Ostojic, Ecole Normale Superieure Paris, France

Reviewers

  1. Srdjan Ostojic, Ecole Normale Superieure Paris, France
  2. Mehrdad Jazayeri, Massachusetts Institute of Technology, United States

Publication history

  1. Preprint posted: May 3, 2021 (view preprint)
  2. Received: June 30, 2021
  3. Accepted: December 1, 2021
  4. Accepted Manuscript published: December 2, 2021 (version 1)
  5. Version of Record published: January 6, 2022 (version 2)
  6. Version of Record updated: January 21, 2022 (version 3)
  7. Version of Record updated: February 3, 2022 (version 4)

Copyright

© 2021, Foucault and Meyniel

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

  1. Cédric Foucault
  2. Florent Meyniel
(2021)
Gated recurrence enables simple and accurate sequence prediction in stochastic, changing, and structured environments
eLife 10:e71801.
https://doi.org/10.7554/eLife.71801