Mechanisms of distributed working memory in a large-scale network of macaque neocortex

  1. Jorge F Mejías
  2. Xiao-Jing Wang  Is a corresponding author
  1. Swammerdam Institute for Life Sciences, University of Amsterdam, Netherlands
  2. Center for Neural Science, New York University, United States

Abstract

Neural activity underlying working memory is not a local phenomenon but distributed across multiple brain regions. To elucidate the circuit mechanism of such distributed activity, we developed an anatomically constrained computational model of the large-scale macaque cortex. We found that mnemonic internal states may emerge from inter-areal reverberation, even in a regime where none of the isolated areas is capable of generating self-sustained activity. The mnemonic activity pattern along the cortical hierarchy indicates a transition in space, separating areas engaged in working memory from those that are not. A host of spatially distinct attractor states is found, potentially subserving various internal processes. The model yields testable predictions, including the idea of counterstream inhibitory bias, the role of prefrontal areas in controlling distributed attractors, and the resilience of distributed activity to lesions or inactivation. This work provides a theoretical framework for identifying large-scale brain mechanisms and computational principles of distributed cognitive processes.

Editor's evaluation

The final revision of the manuscript addressed the remaining issues raised by the reviewers. They felt that the paper is an important contribution to the field, providing new and testable insights into the interaction between cortical areas during the memory delay and that the work is likely to become "an influential reference for future modeling efforts" and deserves publication in eLife.

https://doi.org/10.7554/eLife.72136.sa0

Introduction

With the advances of brain connectomics and physiological recording technologies like Neuropixels (Jun et al., 2017; Stringer et al., 2019), an increasingly important challenge in neuroscience is to investigate biological mechanisms and computational principles of cognitive functions that engage many interacting brain regions. Here, the goal is no longer to identify one local parcellated brain region that contributes to or is crucial for a particular function, but to understand how a large-scale brain system with many interacting parts underlies behavior. Currently, there is a dearth of theoretical ideas and established models for understanding distributed brain dynamics and function.

A basic cognitive function recently shown to involve multiple brain areas is working memory, the brain’s ability to retain and manipulate information in the absence of external inputs. Working memory has been traditionally associated with mnemonic delay neural firing in localized brain areas, such as those in the frontal cortex (Funahashi et al., 1989; Fuster, 1973; Goldman-Rakic, 1995; Inagaki et al., 2019; Kopec et al., 2015; Romo et al., 1999) and computational models uncovered the involvement of local recurrent connections and NMDA receptors in the encoding of memory items in selective neural assemblies (Amit and Brunel, 1997; Brunel and Wang, 2001; Compte et al., 2000; Wang et al., 2013; Wang, 1999).

Despite the prominence of the prefrontal cortex as a ‘hub’ for working memory maintenance, self-sustained neural activity during working memory has been found in multiple brain regions, and such highly engaged areas often appear in coactivation (Christophel et al., 2017; Guo et al., 2017; Leavitt et al., 2017; Sreenivasan and D’Esposito, 2019). Previous modeling efforts have been limited to exploring the emergence of sustained activity either in local circuits or in two interacting areas at most (Edin et al., 2009; Guo et al., 2017; Murray et al., 2017b). It is presently not known what biophysical mechanisms could underlie a distributed form of memory-related sustained activity in the large-scale cortex. The observation that mnemonic activity is commonly found in the prefrontal cortex (PFC) does not prove that it is produced locally rather than resulting from multi-regional interactions; conversely, a distributed activity pattern could in principle be a manifestation of sustained inputs broadcast by a local source area that can produce self-sustained activity in isolation. Therefore, understanding the distributed nature of cognitive functions such as working memory is challenging and requires both novel theoretical ideas and multi-area recordings (Edin et al., 2009; Guo et al., 2017).

In this study, we tackle this challenge by building and analyzing an anatomically constrained computational model of the cortical network of the macaque monkey, and investigate a novel scenario in which long-range cortical interactions support distributed activity patterns during working memory. The anatomical data are used to constrain the model not only at the level of long-range connections but also at the level of local circuit connectivity. In particular, the model incorporates differences between individual cortical areas, by virtue of macroscopic gradients of local circuit properties in the large-scale network. The emerging distributed patterns of sustained activity involve many areas across the cortex. They engage temporal, frontal and parietal areas but not early sensory areas, in agreement with a recent meta-analysis of delay period activity in macaque cortex (Leavitt et al., 2017). Sustained firing rates of cortical areas across the hierarchy display a gap, indicative of a transition akin to a bifurcation in cortical space that does not require tuning of bifurcation parameters. Furthermore, the distributed patterns emerge even when individual areas are unable to maintain stable representations, or when other mechanisms such as activity-silent memory traces are considered. Our model predicts that distributed WM patterns (i) require a certain level of inhibition in long-range feedback projections, (ii) can be controlled or inactivated from a small group of areas at the top of the cortical hierarchy, and (iii) increase the robustness of the network to distractors and to simulated inactivation of areas. The concept departs from the classical view of working memory based on local attractors, and sheds new light on recent evidence of distributed activity during cognitive functions.

Results

Our computational model includes 30 areas distributed across all four neocortical lobes (Figure 1A; see Materials and methods for further details). The inter-areal connectivity is based on quantitative connectomic data from tract-tracing studies of the macaque monkey (Markov et al., 2013; Markov et al., 2014a; Markov et al., 2014b; Figure 1—figure supplement 1). For simplicity, each of the cortical areas is modeled as a neural circuit which contains two excitatory populations (selective to sensory stimuli A and B, respectively) and one inhibitory population (Wang, 2002; Wang, 1999; Figure 1B). In addition, the model assumes a macroscopic gradient of outgoing and recurrent synaptic excitation (Chaudhuri et al., 2015; Joglekar et al., 2018; Wang, 2020), so that the level of synaptic strength is specific to each area (Figure 1—figure supplement 2). This gradient is introduced by considering that the number of apical dendritic spines, loci of excitatory synapses, per pyramidal cell increases (Elston, 2007) along the cortical hierarchy as defined by anatomical studies (Felleman and Van Essen, 1991; Markov et al., 2014a; Figure 1C). The gradient of area-specific connection strength was applied to both local recurrent and long-range excitatory outgoing projections. In particular, we denote the maximal strength of local and long-range excitation for the area at the top of the cortical hierarchy by Jmax, which is an important parameter of the model (see below). To allow for the propagation of activity from sensory to association areas, we assumed that inter-areal long-distance outgoing projections preferentially target excitatory neurons in feedforward pathways and inhibitory neurons in feedback pathways, in a graded fashion (Mejias et al., 2016; Figure 1B). We shall refer to the gradual preferential targeting of inhibitory neurons by top-down projections as the ‘counterstream inhibitory bias’ hypothesis.
We assume that the bias of top-down projections towards inhibitory neurons is proportional to the fraction of infragranular projections (see Materials and methods). It is worth noting that exploration of such new hypotheses would not have been possible without a quantitative definition of the cortical hierarchy and biologically realistic circuit modeling. The results provided by this anatomically constrained model, while leading to concrete experimental predictions for macaques, are also robust to small alterations of parameter values and connectivity structure, suggesting the validity of our conclusions under other conditions or in other animal species.
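As a rough numerical illustration of the macroscopic gradient, one can map normalized spine counts onto coupling strengths between a minimum value and Jmax. The spine-count values and the linear interpolation rule below are hypothetical placeholders, not the paper's exact calibration:

```python
import numpy as np

def coupling_gradient(spine_counts, j_min=0.21, j_max=0.42):
    """Map per-area spine counts onto local/long-range coupling strengths.

    Hypothetical sketch: normalize spine counts to [0, 1] and interpolate
    linearly between j_min and the top-of-hierarchy value j_max.
    """
    h = (spine_counts - spine_counts.min()) / (spine_counts.max() - spine_counts.min())
    return j_min + h * (j_max - j_min)

# Placeholder spine counts, ordered from early sensory (V1-like) to prefrontal areas
spines = np.array([600.0, 1300.0, 4200.0, 7500.0, 8300.0])
J = coupling_gradient(spines)  # increases monotonically along the hierarchy
```

Under this sketch, the area at the top of the hierarchy receives exactly Jmax, and all other areas fall below it, consistent with the role of Jmax as the model's key control parameter.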

Figure 1 with 2 supplements see all
Scheme and anatomical basis of the multi-regional macaque neocortex model.

(A) Lateral view of the macaque cortical surface with modelled areas in color. (B) In the model, inter-areal connections are calibrated by mesoscopic connectomic data (Markov et al., 2013); each parcellated area is modeled by a population firing-rate description with two selective excitatory neural pools and an inhibitory neural pool (Wong and Wang, 2006). Recurrent excitation within each selective pool is not shown for clarity. (C) Correlation between spine count data (Elston, 2007) and anatomical hierarchy as defined by layer-dependent connections (Markov et al., 2014a).

Distributed WM is sustained by long-range cortical loops

In local circuit models of working memory (WM) (Compte et al., 2000; Wang, 1999), areas high in the cortical hierarchy make use of sufficiently strong synaptic connections (notably involving NMDA receptors; Wang et al., 2013; Wang, 1999) to generate self-sustained delay activity. Specifically, the strength of local synaptic reverberation must exceed a threshold level (in our model, the local coupling parameter JS must be larger than a critical value of 0.4655) for an isolated local area to produce stimulus-selective mnemonic activity states that coexist with a resting state of spontaneous activity (operating in a multistable regime rather than in a monostable regime, see Figure 2A). However, there is presently no conclusive experimental demonstration that an isolated cortical area like the dorsolateral prefrontal cortex (dlPFC) is indeed capable of generating mnemonic sustained activity. Indeed, recent evidence suggests that thalamocortical support might be needed to achieve sustained activity in dlPFC (Guo et al., 2017). In this study, we first examined the scenario in which all areas, including dlPFC (9/46d) at the top of the hierarchy, have JS values below the critical value for multistability (so that JS ≤ Jmax < 0.4655) and are connected via excitatory long-range projections of global coupling strength G (we set Jmax = 0.42 and G = 0.48 unless specified otherwise) (Figure 2A). In this case, any observed sustained activity pattern must result from inter-areal connection loops. In a model simulation of a visual delayed response task, a transient visual input excites a selective neural pool in the primary visual cortex (V1), which yields activation of other visual areas such as MT during stimulus presentation (Figure 2B, upper left). After stimulus withdrawal, neural activity persists in multiple areas across frontal, temporal and parietal lobes (Figure 2B, top right).
The resulting activation pattern shows substantial agreement with a large body of data, from decades of monkey neurophysiological experiments, reviewed in recent meta-analyses (Christophel et al., 2017; Leavitt et al., 2017) regarding which areas display WM-related activity during the delay period of WM tasks (Figure 2B, bottom right). The activation pattern from the model was stimulus-specific: only the neural pool selective to the presented stimulus in each cortical area displayed elevated sustained activity (Figure 2C; Figure 2—figure supplement 1). We observed cross-area variations of neural dynamics: while areas like 9/46d displayed a sharp binary jump of activity, areas like LIP exhibited a more gradual ramping activity. Such population-level, or neuron-averaged, ramping activity of LIP in our model would correspond to the trial-averaged temporal accumulation of information in decision-making (Shadlen and Newsome, 2001).
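The distinction between a monostable and a multistable local circuit can be illustrated with a one-variable toy rate model, in which a transient stimulus either leaves behind self-sustained activity or not depending on the local coupling. This is a deliberately simplified stand-in with an arbitrary sigmoid and parameters, not the paper's two-pool equations or its 0.4655 threshold:

```python
import numpy as np

def phi(x, theta=1.0, sigma=0.1):
    """Toy sigmoidal transfer function (arbitrary parameters)."""
    return 1.0 / (1.0 + np.exp(-(x - theta) / sigma))

def run_trial(J_s, tau=0.01, dt=0.001, t_total=2.0, pulse=(0.2, 0.7)):
    """Simulate dr/dt = (-r + phi(J_s * r + I_ext)) / tau with a transient pulse."""
    r = 0.0
    for step in range(int(t_total / dt)):
        t = step * dt
        I_ext = 2.0 if pulse[0] <= t < pulse[1] else 0.0
        r += dt / tau * (-r + phi(J_s * r + I_ext))
    return r  # rate long after stimulus offset

weak = run_trial(J_s=0.5)    # below the (toy) bistability threshold: activity decays
strong = run_trial(J_s=2.0)  # above threshold: self-sustained activity persists
```

In the monostable regime the rate relaxes back to near-zero spontaneous activity after the pulse, whereas above the toy threshold the elevated state coexists with the resting state and persists through the delay.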

Figure 2 with 5 supplements see all
Distributed WM sustained via long-range loops in cortical networks.

(A) Bifurcation diagram for an isolated area. Green circles denote the position of each area, with all of them in the monostable regime when isolated. (B) Spatial activity map during visual stimulation (left) and delay period (upper right). For comparison purposes, bottom right map summarizes the experimental evidence of WM-related delay activity across multiple studies (Leavitt et al., 2017), dark blue corresponds to strong evidence and light blue to moderate evidence. (C) Activity of selected cortical areas during the WM task, with a selective visual input of 500ms duration. (D) Firing rate for all areas during the delay period, ranked by firing rate.

Given that selective mnemonic activity is also found in somatosensory WM tasks (Romo et al., 1999), we further tested our model by simulating a simple somatosensory WM task, transiently and selectively stimulating a neural pool in the primary somatosensory cortex. As with visual stimulation, this leads to the emergence of a distributed sustained activity pattern with the same selectivity as the input (Figure 2—figure supplement 2), showing the validity of the distributed WM mechanism across different sensory modalities. At this stage, however, the model is not able to predict differences between attractors evoked by different sensory modalities. To address this, we show that the introduction of a simple gating mechanism allows us to study the involvement of certain areas in the processing of particular input modalities, further refining the model predictions (Figure 2—figure supplement 3). Likewise, for simplicity our model considers NMDA receptors as the only excitatory synaptic dynamics. However, AMPA receptor dynamics may also be important (van Vugt et al., 2020), and can easily be introduced; the model then behaves well even for shorter durations of stimulus presentation (Figure 2—figure supplement 4).

When we plotted the firing rate of stimulus-selective sustained activity across the 30 areas along the hierarchy, our results revealed a separation between the areas displaying sustained activity and those that did not (Figure 2D). This is a novel type of abrupt transition that takes place in space, rather than as a function of a network parameter as in Figure 2A. Here, the relevant parameter is the strength of synaptic excitation, which varies across cortical space in the form of a macroscopic gradient. The transition is robust in two respects. First, the separation between areas appears not only when areas are ranked according to their firing rates, but also when they are ordered by their positions in the anatomical hierarchy or by spine count (Figure 2—figure supplement 5). Second, it does not depend on any fine tuning of parameter values.

Simplified model of distributed working memory

The above model, albeit a simplification of real brain circuits, includes several biologically realistic features, which makes it difficult to identify the essential ingredients for the emergence of distributed WM. For this reason, we developed a minimal model consisting of a fully connected network of excitatory firing-rate nodes (Figure 3A, see Appendix 1). This simplified model allows us to explore the minimal conditions for the emergence of distributed WM, while the full, biologically realistic model provides stronger support for the mechanism under realistic conditions. The network of the simplified model includes a linear gradient of local properties: areas at the beginning of the gradient have weak self-coupling, while areas at the end have strong self-coupling. As in the more elaborate model, self-excitation is too weak to generate bistability in any isolated node.

Figure 3 with 1 supplement see all
Simplified model of distributed WM.

(A) Scheme of our simplified model: a fully connected network of N = 30 excitatory nodes with a gradient of local coupling strength. (B) Population-average firing rate as a function of the global coupling strength, according to numerical simulations (symbols) and a mean-field solution based on first-order (gray line) or second-order statistics (blue). (C) Firing rates of three example individual nodes (symbols denote simulations, lines denote second-order mean-field solutions). (D) Activity of an example node when isolated (light blue) or connected to the network (dark blue). (E) Two forms for the gradient of local coupling strength (lines) compared with the spine count data (symbols). (F) The number of bistable areas in the attractor (grey) is either zero or N when the gradient of local properties saturates as suggested by spine count data. When feedback projections become weaker, resembling inhibitory feedback, the network is able to display distributed WM patterns which involve only a limited number of areas (black).

This simple model allows for a mean-field analytical solution for the network-averaged firing rate R of the form R = ϕ((Jη0 + G)R + I), with ϕ a sigmoidal function, Jη0 the average local coupling across areas, G the inter-areal connection strength, and I a background input current (see Appendix 1 for a full derivation of a two-area example and of the more complete N-area network). The factor Jη0 + G determines whether the above equation has one stable solution (spontaneous firing) or two (spontaneous and sustained firing). As this factor includes both local and global components, the average network firing rate may be bistable even if local couplings are weak, as long as the inter-areal connections are strong enough. This mean-field solution, as well as a more precise second-order version, shows good agreement with numerical simulations and confirms the emergence of distributed activity in the system (Figure 3B). Simulations also show, around G ~ 0.17, the appearance of states in which only areas at the top of the gradient are bistable, indicated by low values of R. Once R is known, the mean-field solution also permits us to predict the emergence of sustained activity in individual nodes (Figure 3C), and to observe how monostable isolated units become bistable when incorporated into the network (Figure 3D). Briefly, clear bistability in R induces bistability in the individual areas with stronger self-coupling. For areas with weak self-coupling, neither possible value of R is enough to induce local bistability, and these areas remain in the monostable regime.
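The role of the combined factor Jη0 + G can be made concrete by counting the fixed points of R = ϕ((Jη0 + G)R + I) numerically: one stable solution when the combined coupling is weak, two stable solutions (separated by an unstable one) when it is strong. The sigmoid and parameter values below are illustrative stand-ins, not the paper's transfer function:

```python
import numpy as np

def phi(x, theta=1.0, sigma=0.1):
    """Toy sigmoidal transfer function (illustrative parameters only)."""
    return 1.0 / (1.0 + np.exp(-(x - theta) / sigma))

def count_fixed_points(W, I=0.0):
    """Count solutions of R = phi(W * R + I) by scanning for sign changes
    of F(R) = phi(W*R + I) - R, where W = J*eta0 + G combines local and
    long-range coupling."""
    R = np.linspace(0.0, 1.2, 2401)
    F = phi(W * R + I) - R
    return int(np.sum(F[:-1] * F[1:] < 0))

n_weak = count_fixed_points(W=0.5)    # one fixed point: spontaneous state only
n_strong = count_fixed_points(W=2.1)  # three: spontaneous, unstable, sustained
```

Because W sums local and global contributions, the bistable (three-fixed-point) regime can be reached by raising G even when the local couplings alone are subthreshold, which is the essence of the distributed WM mechanism.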

The simplified model demonstrates that distributed WM patterns, in which some (but not all) areas display bistability when connected to each other, may emerge in generic networks of excitatory units as long as (i) their long-range connections are strong enough and (ii) the network has a linear gradient of local couplings. When considering biological constraints, however, these two conditions might not be easy to meet. In particular, the area-specific number of spines per neuron increases monotonically but saturates instead of growing linearly (Figure 3E). Introducing this saturating gradient in the simplified model makes the nodes more homogeneous, and as a result the network cannot display distributed WM patterns without indiscriminately activating all nodes (Figure 3F, gray curve). This problem was solved when we assumed that feedforward projections (i.e. those going from lower to higher areas in the gradient) were slightly stronger while feedback projections were slightly weaker, which is consistent with the counterstream inhibitory bias hypothesis. This assumption, needed for saturating gradients, recovers solutions in which only a subset of areas display bistability in the WM patterns (Figure 3F, black curve).

Simplified models can also be used to explore the plausibility of the distributed WM hypothesis in scenarios other than delay activity –for example, activity-silent memory traces (Kamiński and Rutishauser, 2020; Mongillo et al., 2008; Stokes, 2015). We modified the simplified model introduced above by adding short-term synaptic facilitation (STF) to both local and long-range projections of the network (and decreasing the overall synaptic strength to allow the transient enhancements of STF to play a sufficient role; see Appendix 1). In such a model, a slowly decaying transient of synaptic efficacy, susceptible to reactivation during the delay period, is thought to preserve the information in an activity-silent manner. As in the delay-activity model, isolated areas are unable to sustain the information, with or without STF. However, when long-range projections are considered in a network with STF, synaptic efficacies are sustained for long periods of time (as a result of the contribution of both local and long-range interactions), so that areas with strong enough synapses preserve the information via activity-silent memory traces (Figure 3—figure supplement 1).
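The activity-silent idea can be sketched with a Tsodyks–Markram-style facilitation variable: after firing returns to baseline, the synaptic efficacy u decays back only slowly, carrying a hidden memory trace. This is a generic textbook form with placeholder parameters, not the paper's exact STF equations:

```python
import numpy as np

def facilitation_trace(rate_hz=50.0, stim_dur=0.5, delay=1.0,
                       U=0.2, tau_f=1.5, dt=0.001):
    """Evolve du/dt = (U - u)/tau_f + U * (1 - u) * r(t).

    Generic short-term facilitation sketch (placeholder parameters):
    u rises towards ~1 during stimulation, then relaxes back to the
    baseline U with the slow time constant tau_f after activity stops.
    """
    u = U
    for step in range(int((stim_dur + delay) / dt)):
        r = rate_hz if step * dt < stim_dur else 0.0
        u += dt * ((U - u) / tau_f + U * (1.0 - u) * r)
    return u  # efficacy at the end of the delay period

u_end = facilitation_trace()  # still elevated above the baseline U = 0.2
```

Even one second after stimulus offset, u remains well above baseline while the firing rate is already at rest, so a later nonspecific probe input would be transmitted more strongly by the facilitated synapses, revealing the stored item.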

Impact of the counterstream inhibitory bias in the full model

As indicated by the simplified model, introducing differences between feedforward and feedback projections is a key ingredient for achieving realistic patterns of distributed WM in a data-constrained model. In the full, biologically more realistic model –which will be considered for the rest of the study– this asymmetry is introduced via graded preferential targeting of inhibitory neurons by top-down projections (i.e. the counterstream inhibitory bias, or CIB), which prevents indiscriminate sustained activation across all cortical areas (Leavitt et al., 2017; Figure 4A). We systematically varied the strength of the feedback projections targeting the inhibitory populations in our model, and computed the delay-period firing rates of different areas for these cases. We observed that, for strong enough CIB, the overall firing rate of early sensory areas is reduced, while the activity levels of areas high in the hierarchy are maintained at appropriate values (Figure 4B). This also allows distributed WM patterns to emerge over a wide range of the global coupling strength (Figure 4C).

The strength of the counterstream inhibitory bias also has an impact on the overall activity profiles across the brain network. Figure 4D shows activity maps for several CIB levels, revealing that moderate to strong inhibitory feedback agrees well with the experimental evidence.

It is also worth noting that exceptions to the CIB rule may exist in brain networks without compromising the stability of distributed WM attractors. For example, a more balanced targeting would allow for WM-related activity in primary visual areas, which remains a point of controversy in the field (Leavitt et al., 2017). FEF areas 8l and 8m, on the other hand, are not able to sustain delay activity when receiving strong inhibitory feedback (especially from other frontal areas) and had to be excluded from this general rule, although such an exception does not affect the results aside from local effects in FEF (Materials and methods, Figure 4—figure supplement 1).

Figure 4 with 3 supplements see all
Effect of inhibitory feedback on distributed WM for the full model.

(A) Scheme showing a circuit without (left) or with (right) the assumption of counterstream inhibitory bias, or CIB. (B) Firing rate of areas at the bottom and top of the hierarchy (10 areas each, thick lines denote averages) as a function of the CIB strength. (C) Number of areas showing sustained activity in the example distributed activity pattern vs global coupling strength without (grey) and with (black) CIB. (D) Activity maps as a function of the CIB strength. As in Figure 2 B, bottom map denotes the experimental evidence for each area (dark blue denotes strong evidence, light blue denotes moderate evidence).

In Figure 2 and also in the following sections, the strength of the counterstream inhibitory bias was considered proportional to the fraction of infragranular projections, as suggested by anatomical studies (Markov et al., 2014b) and following previous work (Mejias et al., 2016). This results in a very small bias for most of the projections, but enough to produce the desired effect (see Materials and methods for further details). In addition to supporting the emergence of distributed WM, CIB could explain observed top-down inhibitory control effects (Tsushima et al., 2006).

While the macroscopic gradient of excitability is an important property of the model, the particular values of excitatory strength assigned to each area are not critical for the phenomenon (Figure 4—figure supplement 2A-D). Similar conclusions are obtained when the anatomical structure of the cortical network is changed –for example, by randomly shuffling individual projection strength values (Figure 4—figure supplement 2E-H). However, in this case, the duration of the sustained activity in multiple areas may be affected. This suggests that salient statistical features of the structure embedded in the cortical network may play a role in the emergence of distributed activity patterns. The model also predicts the emergence of a hierarchy of timescales across cortical areas (Figure 4—figure supplement 3), in agreement with experimental findings (Murray et al., 2014) and supporting and extending previous computational descriptions (Chaudhuri et al., 2015).

Long-range cortical loops support a large number of different distributed attractors

We realized that a large-scale circuit can potentially display a large number of distributed sustained activity patterns (attractors), some of which may not be accessible by stimulation of a primary sensory area. Note that distinct attractor states are defined here in terms of their spatial patterns, which does not depend on the number of selective excitatory neural pools per area. We developed a numerical approach to identify and count distinct attractors (see Appendix 2 for further details). Our aim is not to exhaustively identify all possible attractors, as the activity space is too large, but to gain insight into how our estimates depend on relevant parameters such as the global coupling strength G or the maximum area-specific synaptic strength Jmax. Five examples of different distributed WM attractors are shown in Figure 5A, where we can appreciate that not all distributed attractors engage cortical areas in all lobes, and that frontal areas are the most commonly involved.

Figure 5
Distributed and local WM mechanisms can coexist in the full model.

(A) Five example distributed attractors of the network model (Jmax = 0.42). (B) Bifurcation diagram of an isolated local area with the four cases considered. (C) Number of attractors (normalized) found via numerical exploration as a function of the global coupling for all four cases. (D) Maximum (peak) number of attractors for each of the cases. (E) Correlation between the size of attractors and the mean firing rate of their constituent areas for Jmax = 0.45 and G = 0.2. (F) Participation index of each area (left, arranged by spine count) and distribution of attractors according to their size (right).

A more detailed analysis included four cases depending on the assumed value of the maximum area-specific synaptic strength Jmax: two cases had Jmax above the bifurcation threshold for isolated areas (0.4655), and the other two had Jmax below it. For the first two cases, having Jmax > 0.4655 means that certain areas high in the hierarchy, such as dlPFC, have strong enough local reverberation to sustain activity independently (i.e. they were ‘intrinsically multistable’ and able to display bistability even when isolated from the network, Figure 5B); however, areas lower in the hierarchy, like 24c and F2, would require long-range support to participate in WM. For the last two cases, in which Jmax < 0.4655, none of the areas was able to display bistability when isolated, but they could contribute to stabilizing distributed WM attractors as in Figure 2. In all four cases, the number of attractors turns out to be an inverted-U function of the global coupling strength G, with an optimal G value maximizing the number of attractors (Figure 5C; curves are normalized to a peak height of one for visualization purposes). This reflects the fact that a minimal global coupling is needed for areas to coordinate and form distributed WM attractors, but for values of G that are too large, all areas follow the majority rule and the diversity and number of possible attractors decrease. The optimal G value shifted towards lower values for increasing Jmax, while the peak number of attractors simultaneously increased (Figure 5D).
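The attractor-counting procedure can be illustrated on a toy excitatory network: relax the dynamics from many random initial conditions, binarize the final rates, and collect the distinct patterns. The sigmoid, the small network size, and the coupling values below are arbitrary illustrative choices; the paper's procedure in Appendix 2 is more elaborate:

```python
import numpy as np

def phi(x, theta=1.0, sigma=0.1):
    """Toy sigmoidal transfer function (arbitrary parameters)."""
    return 1.0 / (1.0 + np.exp(-(x - theta) / sigma))

def find_attractors(J, G, n_init=200, dt=0.001, tau=0.01, t_relax=0.5, seed=0):
    """Relax dr_i/dt = (-r_i + phi(J_i r_i + G * mean_j r_j)) / tau from
    random initial states and collect distinct binarized steady states."""
    rng = np.random.default_rng(seed)
    attractors = set()
    for _ in range(n_init):
        r = rng.uniform(0.0, 1.0, len(J))
        for _ in range(int(t_relax / dt)):
            r += dt / tau * (-r + phi(J * r + G * r.mean()))
        attractors.add(tuple((r > 0.5).astype(int).tolist()))
    return attractors

J = np.linspace(0.8, 1.9, 6)          # toy gradient of local couplings
patterns = find_attractors(J, G=0.3)  # set of distinct attractor patterns
```

Sweeping G in such a toy network gives a feel for the inverted-U dependence described above: too little coupling yields only patterns dominated by the strongest nodes, while too much coupling forces all nodes into the same state and collapses the diversity of patterns.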

Across all four cases and G values considered, we found a significant positive correlation between the number of areas involved in a given attractor and the average firing rate of these areas (Figure 5E), which constitutes an experimentally testable prediction of the distributed model of WM. We also analyzed how distributed WM attractors were constituted for the four different cases (Figure 5F). When the network has a high number of intrinsically multistable areas (i.e. when Jmax>0.4655), attractors tend to only involve these areas and are therefore largely restricted to the areas located at the top of the hierarchy (Figure 5F, bottom left and right panels). On the other hand, when the network has zero or a low number of intrinsically multistable areas (i.e. Jmax<0.4655), attractors typically involve a larger number of areas (as a larger pool of areas is needed to sustain distributed WM attractors, see top right panel in Figure 5F) and the areas involved are more diverse in their composition (Figure 5F, top left panel).

Effects of inactivating areas on distributed attractors

To continue probing the robustness of distributed WM patterns, we tested the effect of inactivating cortical areas in our model during WM tasks, which can be done experimentally using optogenetic methods or by lesioning selected areas. We modeled this by completely and permanently suppressing the firing rate of the inactivated areas, in such a way that each such area becomes a sink of current and does not communicate with other areas. We began by inactivating (or silencing) a given number of randomly selected areas in a visually evoked distributed WM attractor, and found that the number of active areas in the attractor decreases only linearly with the number of inactivated areas (Figure 6A). Furthermore, the areas that remained in the distributed WM pattern decreased their sustained activity level linearly with the number of inactivated areas (Figure 6B). As silencing areas at the top of the hierarchy could in principle have strong effects, we then systematically silenced areas in reverse hierarchical order (i.e., silencing the top area first, then the top and second-from-top areas, etc.), instead of in random order. In this case, the number of active areas decreases somewhat more abruptly (Figure 6C) and, as we will see later, such silencing can prevent the emergence of distributed WM altogether if Jmax is not sufficiently large.
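Inactivation can be mimicked in the same kind of toy network by clamping the lesioned nodes' rates to zero throughout the trial, so they act as current sinks. This is again a schematic excitatory network with arbitrary parameters, not the full 30-area model:

```python
import numpy as np

def phi(x, theta=1.0, sigma=0.1):
    """Toy sigmoidal transfer function (arbitrary parameters)."""
    return 1.0 / (1.0 + np.exp(-(x - theta) / sigma))

def run_with_lesion(J, G, lesioned=(), dt=0.001, tau=0.01, t_total=2.0):
    """Stimulate all intact areas, then relax; lesioned areas are clamped
    at zero so they absorb input but send no output."""
    mask = np.ones(len(J))
    mask[list(lesioned)] = 0.0
    r = np.ones(len(J)) * mask        # transient stimulation of intact areas
    for _ in range(int(t_total / dt)):
        drive = J * r + G * r.mean()  # purely excitatory coupling
        r += dt / tau * (-r + phi(drive))
        r *= mask                     # clamp lesioned areas to zero
    return r

J = np.linspace(0.8, 1.9, 6)
intact = run_with_lesion(J, G=0.3)
after_lesion = run_with_lesion(J, G=0.3, lesioned=(5,))  # silence the top node
```

Because the coupling is purely excitatory, removing a node can only lower the drive to the rest of the network, so the set of active areas after the lesion is never larger than before; silencing nodes high in the toy gradient removes the most drive and hence has the largest effect, echoing the hierarchy dependence described below.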

Figure 6
Effects of lesioning/silencing areas on the activity and number of attractors.

Silencing occurs throughout the full trial for each area indicated here. (A) Number of active areas in the example attractor as a function of the number of (randomly selected) silenced areas. (B) The activity of the areas which remain as part of the attractor decreases with the number of silenced areas. (C) The number of active WM areas decreases faster when areas are incrementally and simultaneously silenced in reverse hierarchical order. (D) When considering all accessible attractors for a given network (G = 0.48, Jmax = 0.42), silencing areas at the top of the hierarchy has a higher impact on the number of surviving attractors than silencing bottom or middle areas. (E) Numerical exploration of the percentage of surviving attractors for silencing areas in different lobes. (F) Silencing areas at the center of the ‘bowtie hub’ has a strong impact on WM (adapted from Markov et al., 2013). (G) Numerical impact of silencing areas in the center and sides of the bowtie on the number of surviving attractors. For panels (E) and (F), areas color-coded in blue/red have the least/most impact when silenced, respectively.

We also carried out a more systematic evaluation of the effect of cortical inactivations, including their effect on attractors that were not directly accessible from sensory stimulation. This analysis revealed that inactivating most areas has only limited consequences on the total number of available distributed attractors, although in general the impact increases with the hierarchical position of the silenced area (Figure 6D). In particular, the overall impact was large when some temporal and prefrontal areas were silenced, and sometimes more than half of the initially available attractors were lost (Figure 6E). Interestingly, and beyond any hierarchical dependence, the temporal and prefrontal areas that had the strongest impact are part of a subset of the anatomical network which has a very high (92%) density of connections between its nodes. This anatomical core, which has sparser connections with the remaining areas (forming the periphery of the network), is known as the anatomical 'bowtie hub' of the macaque cortex identified in anatomical studies (Markov et al., 2013; Figure 6F). Overall, silencing areas at the center of the bowtie had a deeper impact, in terms of the number of attractors surviving the silencing, than silencing areas on the periphery (Figure 6G).

Effects of inactivations and distractors in distributed vs localized WM patterns

Across all analyses performed above, we assumed a relatively large value for the maximum area-specific recurrent strength, Jmax = 0.42, though still below the critical value needed for bistability in isolation (0.4655). To provide clean predictions linked to the distributed WM scenario, in the following sections we studied the case of a strongly distributed WM system with Jmax = 0.26 and G = 0.48, and compared it to networks which rely purely on a localized WM strategy (with Jmax = 0.468, G = 0.21, and feedback projections removed to avoid long-range loops).
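For reference, the two regimes compared in the following sections can be summarized as parameter sets (values taken from the text; the dictionary names are ours):

```python
# Parameter regimes compared in this section (values from the text).
J_CRITICAL = 0.4655  # critical local coupling for bistability in isolation

distributed_wm = {"Jmax": 0.26, "G": 0.48, "feedback_projections": True}
localized_wm = {"Jmax": 0.468, "G": 0.21, "feedback_projections": False}

# The distributed regime keeps every area below the bistability threshold
# and relies on strong inter-areal coupling G; the localized regime pushes
# local coupling above threshold and weakens long-range interactions.
assert distributed_wm["Jmax"] < J_CRITICAL < localized_wm["Jmax"]
assert distributed_wm["G"] > localized_wm["G"]
```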

We first reexamined the effect of inactivations for this strongly distributed WM network. We found that inactivations generally have a stronger effect here than for networks with larger Jmax (as in Figure 6). For example, inactivating key prefrontal areas such as 9/46d (dlPFC) fully prevented the emergence of distributed WM patterns evoked by external stimulation (Figure 7A and B), in agreement with classical prefrontal lesion studies (see Curtis and D’Esposito, 2004, for a review and a discussion of the implications for dlPFC organization). On the other hand, other areas can still be inactivated without disrupting distributed WM. In some cases, inactivating specific areas might even lead to a disinhibition of other areas and a general reinforcement of the attractor (e.g. inactivating 24c leads to a larger and faster response by area STPi, Figure 7B). This is a consequence of the hierarchical relationship between cortical areas and the counterstream inhibitory bias: silencing a top area that effectively inhibits lower areas may release those areas from inhibition and increase their firing.

Effect of silencing areas in localized vs distributed WM.

(A) Full-brain activity maps during the delay period for the control case (left), and lesioning/silencing area 24 (top right) or area 9/46d (bottom right). (B) Traces of selected areas for the three cases in panel A show the effects of silencing each area. (C) For a network displaying localized WM (top row, corresponding to Jmax = 0.468, G = 0.21), a brief inactivation of area 9/46d leads to losing the selective information retained in that area. For a network displaying distributed working memory (middle row, Jmax = 0.26, G = 0.48), a brief inactivation removes the selective information only transiently, and once the external inhibition is removed the selective information is recovered. In spite of this robustness to brief inactivations, distributed WM patterns can be shut down by inhibiting a selective group of frontal areas simultaneously (bottom row, inhibition to areas 9/46v, 9/46d, F7, and 8B). The shut-down input, of strength I = 0.3 and 1 s duration, is provided to the nonselective inhibitory population of each of these four areas.

In addition to permanently inactivating areas, we tested the effects of brief (500 ms to 1 s) inactivations of specific areas, and compared the effects in localized vs distributed WM scenarios. For networks relying on localized WM, areas at the top of the hierarchy maintained their selective information largely independently from each other. Consequently, briefly inactivating area 9/46d would not, for example, have an effect on the sustained activity of other areas such as 8B (Figure 7C, top row). Furthermore, the brief inactivation was enough to remove the information permanently from 9/46d, which remained in the spontaneous state after the inhibition was withdrawn. On the other hand, in a network strongly relying on distributed WM, silencing an area like 9/46d slightly affects the sustained activity in other areas such as 8B (Figure 7C, middle row). However, area 9/46d recovers the encoded selective information once the inhibitory pulse is removed, due to the strong interaction between cortical areas during the delay period. This constitutes a strong prediction for networks which rely on distributed interactions to maintain WM.

The marked resilience of distributed WM attractors to brief inhibitory pulses raises the question of how to shut down the sustained activity once the task has been completed. In many traditional WM models, this is achieved by providing a strong nonspecific excitatory input to the whole WM circuit, which triggers inhibition and drives the activity of the selective populations back to their spontaneous state (Compte, 2006; Compte et al., 2000; Wang, 1999). It is, however, unrealistic to expect that this approach could also be used to shut down distributed WM patterns, as it would require a large-scale synchronous inhibitory pulse to all active areas.

We therefore explored in our model whether more spatially selective signals can shut down distributed patterns of activity. Despite their robustness to sensory distractors, as discussed above, we found that distributed WM activity patterns can be shut down by an excitatory input targeting the inhibitory populations of areas high in the hierarchy. Figure 7C (bottom row) shows how a visually evoked distributed WM attractor is deactivated when we deliver excitatory input to the inhibitory populations of the top four areas of the hierarchy (9/46v, 9/46d, F7, and 8B). These prefrontal areas are spatially localized and thought to be highly engaged in WM maintenance, making them suitable candidates to control the suppression of sustained activity in other cortical areas, such as LIP and 24c. Therefore, despite engaging cortical areas across all four lobes, distributed WM attractors can be controlled and deactivated by localized inhibition of a small set of frontal areas.

Finally, the distributed nature of WM also has implications for the impact of distractors and similar sensory perturbations on the maintenance of selective activity and the overall performance of the network. We simulated a delayed response task with distractors (Figure 8A), in which stimulus A is cued to be maintained in WM and stimulus B is presented as a distractor during the delay period (and vice versa). In the localized WM network, we observed that distractors with the same saliency as the original cue were sufficient to switch the network to the new stimulus, making the network easy to distract (Compte et al., 2000; Figure 8B). For the case of distributed WM, however, the network was highly resilient: distractors with saliency levels similar to the input cues were filtered out by the network, so that working memory storage was preserved (Figure 8C). Overall, we found that localized WM networks can be distracted by stimuli of similar or even weaker strength than the minimum cue input required to encode a WM pattern, while effective distractors need to be about three times as strong in the case of distributed WM networks (Figure 8D). This difference is due to the robustness of a distributed attractor compared to a local circuit mechanism, but also to the effect of the counterstream inhibitory bias, which dampens the propagation of distractor signals (cf. MT responses in Figure 8B and C). This constitutes a key difference between distributed and local WM models that is amenable to experimental validation.

Resistance to distractors in localized vs distributed WM.

(A) Scheme of the WM task with a distractor, with the cue (current pulse of strength IA = 0.3 and duration 500 ms) preceding the distractor (IB = 0.3, 500 ms) by four seconds. (B) Activity traces of selected areas during the task, for a network displaying localized WM (Jmax = 0.468, G = 0.21). (C) Same as panel B, but for a model displaying distributed WM (Jmax = 0.26, G = 0.48). (D) Minimal strength required by the cue (blue) to elicit a sustained activity state, and minimal strength required by the distractor (purple) to remove the sustained activity, both for localized WM (left) and distributed WM (right).

Discussion

The investigation of cognitive functions has traditionally been restricted to operations in local brain circuits, mostly due to the restriction of available precision recording techniques to local brain regions – a problem that recent developments are starting to overcome (Jun et al., 2017; Panichello and Buschman, 2021; Siegel et al., 2015; Stringer et al., 2019). It is therefore imperative to advance the study of distributed cognition with computational models as well, to support experimental progress. In this work, we have presented a large-scale circuit mechanism of distributed working memory, realized by virtue of a new concept of a robust transition in space. The distributed WM scenario is compatible with recent observations of multiple cortical areas participating in WM tasks (Christophel et al., 2017; Leavitt et al., 2017; Sreenivasan and D’Esposito, 2019), even when some of these areas have not been traditionally associated with WM. Importantly, considering distributed WM in a truly large-scale network has revealed phenomena such as the transition in cortical space (Figure 2D), the counterstream inhibitory bias (Figure 4), and the diverse pool of available attractors (Figure 5), which would not emerge in studies of systems of one or two cortical areas as in Edin et al., 2009; Guo et al., 2017; Murray et al., 2017b.

One of the main ingredients of the model is the gradient of excitation across the cortical hierarchy, implemented via an increase of excitatory recurrent connection strength (suggested by existing anatomical evidence on dendritic spines of pyramidal cells across multiple cortical areas; Elston, 2007). The particular values of projection strengths do not impact the emergence of distributed WM patterns, but they influence the performance of individual areas (Figure 4—figure supplement 2E-H). Moreover, we introduce the concept of counterstream inhibitory bias (CIB), which was found to stabilize distributed yet spatially confined mnemonic sustained activity patterns in spite of dense inter-areal connectivity. Evidence compatible with the CIB includes anatomical studies of feedback projections targeting supragranular layers (Markov et al., 2014a; Rockland and Van Hoesen, 1994), which contain multiple types of inhibitory neurons, and electrophysiological studies showing that figure-ground segregation requires at least partially inhibitory feedback (Hupé et al., 1998). The CIB is also compatible with other theoretical frameworks such as predictive coding, which requires inhibitory feedback to minimize prediction errors along the cortical hierarchy (Bastos et al., 2012).

Macroscopic gradients and hierarchical structures have recently been proposed as a general principle for understanding heterogeneities in the cortex (Wang, 2020). A standing challenge is to clarify how structural gradients relate to functional ones – for example, the gradual progression from sensory-related to task-related activity as one ascends the cortical hierarchy, or the higher mixed selectivity in higher cortical areas. For example, it has been shown that gradients of circuit properties in line with hierarchical structures contribute to the emergence of a gradient of time scales across cortex, supporting slow dynamics in prefrontal areas (Chaudhuri et al., 2015; Murray et al., 2014; Wang, 2020) (see also Figure 4—figure supplement 3), and also that a hierarchical organization of functional states could serve as a basis for WM-guided decisions and executive control (Miller et al., 2018; Muhle-Karbe et al., 2020). It is possible that structural gradients play a role not only in other cognitive functions in monkeys, but also in other animals including mice (Fulcher et al., 2019) and humans (Burt et al., 2018; Demirtaş et al., 2019).

Theoretically, the present work is the first to show that graded changes of circuit properties along the cortical hierarchy provide a mechanism to explain qualitatively distinct functions of different cortical areas (whether or not they are engaged in working memory). This is reminiscent of the phenomenon mathematically known as a bifurcation, which denotes the emergence of novel behavior as a result of a quantitative change of a control parameter in a nonlinear dynamical system (Strogatz, 1994). Our model displays a novel form of transition across the cortex, which cannot be simply explained by a parameter change laid out spatially by virtue of a macroscopic gradient, because areas are densely connected with each other in a complex large-scale network. Such a transition implies that a few association areas should exhibit signs of dynamical criticality, akin to water near a transition between gas and liquid states. This will be explored further in the future.

Interestingly, the model uncovered a host of distinct sustained activity attractor states, each with its own transition location in the cortical tissue. They are defined by their spatially distributed patterns in the large-scale cortical system, independent of the number of selective neural pools per area (Figure 5A). Many of these mnemonic activity states are not produced by stimulation of primary sensory areas. These internal attractor states could serve various forms of internal representations, such as those that are not triggered by a particular sensory pathway – or those triggered by sensory input but encoded differently as memories (Mendoza-Halliday and Martinez-Trujillo, 2017). Identifying these internal representations in further detail is beyond the scope of the present study, but uncovering their functional role should be within reach of additional experimental and computational work.

Although our proposal extends the idea of attractor dynamics to the scale of large networks, there are several fundamental differences between our model and standard Hopfield-like models of local attractor dynamics. In a large network in which the areas share a preferred selectivity (i.e. the population ‘A’ in Figure 1B), a Hopfield model would trigger sustained activity of all the populations selective to ‘A’ across the network, which is incompatible with experimental observations (Leavitt et al., 2017). More complex patterns with early sensory areas inactive can be learned by Hopfield models, but only at the expense of modifying the selectivity of population ‘A’ in those areas. On the other hand, our model considers the gradient of properties as a partial solution to this problem and tests the validity of such solutions for realistic local coupling levels, which in turn leads to the prediction of the CIB. Overall, our model constitutes an example of how classical ideas of local circuit dynamics may be translated to large-scale networks, and of the new theoretical insights that such a process brings along.

Extending the model of distributed working memory

The reported model of large-scale cortical networks is, to the best of our knowledge, the first of its kind addressing a cardinal cognitive function in a data-constrained way, and it opens the door to elucidating this and similar complex brain processes in future research. Several avenues may be taken to extend the functionality of the present model. First, it is straightforward to have an arbitrary number of selective neural pools per area (Wang, 1999), which would increase both the selectivity to sensory inputs and the available number of distributed WM attractors. In that case, more complex connections (not necessarily A to A, B to B, etc.) can be investigated, including a distance-dependent ‘ring structure’ (Compte et al., 2000) or random connections (Bouchacourt and Buschman, 2019). Second, the model presented here is limited to 30 cortical areas, and can be expanded to include both additional cortical areas and subcortical structures relevant for working memory, such as thalamic nuclei (Guo et al., 2017; Jaramillo et al., 2019), as their connectivity data become available. Interesting extensions in this sense could involve mouse connectomics, to explore the role of thalamocortical loops (Guo et al., 2017) in sustained activity (although working memory mechanisms could differ between rodents and primates), and human connectomics, to reveal the potential influence of complex network structures on the emergence of distributed attractors (Bassett and Sporns, 2017; Demirtaş et al., 2019; van den Heuvel and Sporns, 2013). Third, electrophysiological recordings from multiple brain areas could be used to further constrain the dynamics of the model. For example, when extending the model to more complex WM tasks involving components of attention or sensorimotor decisions, additional electrophysiological data could improve the model’s predictive power, especially for areas such as V4 or LIP (Panichello and Buschman, 2021; Siegel et al., 2015).
Fourth, the model can be improved by incorporating more biological details such as cortical layers (Mejias et al., 2016), contributions of different neuromodulators, and various types of inhibitory neurons.

Attractor model of working memory and activity-silent state models

In the description adopted here, working memory maintained via selective activity is described as an attractor state (Amit and Brunel, 1997; Compte, 2006; Wang, 2001). Other mechanisms have also been proposed, including the maintenance of memories via feedforward mechanisms and activity-silent mechanisms (Barbosa et al., 2020; Goldman, 2009; Kamiński and Rutishauser, 2020; Miller et al., 2018; Mongillo et al., 2008; Stokes, 2015; Trübutschek et al., 2017; Wolff et al., 2017). Importantly, the attractor framework is not limited to steady states and is fully consistent with temporal variations of delay activity. For instance, during a mnemonic delay a working memory circuit can exhibit stochastic oscillations in the gamma (~40 Hz) frequency range, in which neurons often stop momentarily before resuming spike firing; the temporal gaps of silence are bridged by slow synaptic reverberation (Compte et al., 2000; Lundqvist et al., 2016). Another example is self-sustained repetition of brief bursts of spikes interspersed with long silent epochs (Mi et al., 2017). As discussed in a recent review (Wang, 2021), the real conceptual alternative to attractor states is transient activity that fades away in time while a memory trace remains as a hidden state. The underlying biological mechanisms, such as NMDA receptors at recurrent excitatory synapses or short-term synaptic plasticity, are not fundamentally separate.
A stable (attractor) state does not preclude short-term synaptic facilitation which, as an activity-dependent process, can contribute to the maintenance of an attractor state by supplying sufficient excitatory reverberation (Hempel et al., 2000; Pereira and Wang, 2015; Mi et al., 2017), enhance the robustness of self-sustained mnemonic activity (Hansel and Mato, 2013; Itskov et al., 2011; Mejias and Torres, 2009; Mejias et al., 2012; Pereira and Wang, 2015; Seeholzer et al., 2019), and induce cross-trial serial effects (Barbosa et al., 2020; Bliss et al., 2017). When the combined strength of excitatory-to-excitatory connections and short-term plasticity is sufficient to maintain a mnemonic state, the state is mathematically described as an attractor, no matter how complex its spatiotemporal dynamics may be; otherwise, there is not enough reverberation, and neural firing decays over a sufficiently long delay period and never returns spontaneously (Mongillo et al., 2008). Strictly speaking, only the latter should be referred to as activity-silent. An activity-silent state also depends on spiking activity to refresh hidden memory traces and to read out the stored information content. Short-term plasticity could therefore contribute to activity-silent memory traces but also to self-sustained activity. It is worth noting that the above discussion is limited to a local circuit model. The assumed absence of any input to it is unlikely to hold in real life, where there is always external stimulation from the environment as well as internal processes (e.g. intruding thoughts) in other brain regions that project to the local area under consideration. A mixture of activity-silent states and episodic spikes caused by inputs from the rest of the brain represents an interesting possibility that a local network model is not suited to investigate. Multi-regional modeling as reported here, in interplay with new experiments, will shed light on such a scenario in the future.

Multi-regional network modeling should be extended to explore the complex dynamics and cell-to-cell heterogeneity of neural population activity patterns underlying working memory representations (Druckmann and Chklovskii, 2012; Hussar and Pasternak, 2009; Lim and Goldman, 2013; Murray et al., 2017a; Stokes, 2015; Wimmer et al., 2016). There is no reason to think that the encoding of memory items could not use complex spatiotemporal interactions between brain areas instead of just local interactions. A large-scale implementation of WM also pairs well with recent hypotheses in which memory selectivity is reached via dynamical flexibility instead of content-based attractors, since the large number and heterogeneity of long-range projections would reduce connection overlap and alleviate the limited capacity of these models (Bouchacourt and Buschman, 2019). The use of inter-areal interactions to sustain WM-related activity has been explored in other recent works (Edin et al., 2009; Guo et al., 2017; Murray et al., 2017b); however, these were limited to two-area systems and the models were not anatomically constrained, limiting their predictive power. Frameworks of WM in which oscillations play an active role, for example regarding WM-guided executive control (Miller et al., 2018), may benefit from distributed WM approaches, given the usefulness of previous large-scale brain network models in explaining oscillatory phenomena in the macaque brain (Mejias et al., 2016). Finally, with a simplified model we showed that, when short-term facilitation is incorporated at an appropriate level, it enhances synaptic efficacy in areas at the top of the hierarchy. A more extended consideration of short-term synaptic plasticity, and of the contrast between self-sustained activity and activity-silent states, was reported elsewhere (Froudist-Walsh et al., 2021).

Experimental predictions provided by our model

The distributed WM model presented here yields four experimentally testable predictions in monkey (and potentially rodent) experiments, which can be used to validate our theory. First, the model predicts a positive correlation between the number of areas involved in a WM task and the average firing rate of their sustained activity (Figure 5E). Such a relationship should not occur according to models of localized WM, since activity levels would be largely independent across areas. Only a distributed WM model, in which neurons of similar selectivity (but located in different areas) support each other via long-range projections, would lead to this prediction. This prediction could be tested with neuroimaging experiments, by correlating the level of activation of different brain regions during WM with the number of regions activated. Existing data could be used to carefully test this prediction in future studies. A complementary version of this prediction is that, if areas displaying sustained activity are silenced (e.g. optogenetically), the activity of the other sustained activity areas will decrease (Figure 6B).

Second, our model predicts that areas involved in distributed WM patterns can be briefly silenced without losing the encoded information, which is recovered as soon as the inhibition is removed (Figure 7C, middle row), something that models of localized WM do not predict (Figure 7C, top row; see, however, related effects in continuous attractor models, Seeholzer et al., 2019). As in the first prediction, large-scale interactions across neurons of similar selectivity are a condition for this phenomenon, according to our model. Optogenetic inactivations could be used to test this result.

Third, distributed WM is significantly more robust to distractors than localized WM (Figure 8), due to the intrinsic resilience of distributed attractors and the inhibitory feedback. Behavioral and neuroimaging experiments in macaques should be able to test this, by testing for correlations between the spatial extent of a distributed WM pattern and the robustness of the corresponding trials to distractors.

Fourth, electrophysiological recordings in macaques could test whether FEF areas require support from frontal areas (in the form of strong excitation) to maintain WM-related activity (Figure 4—figure supplement 1). In particular, coactivation between FEF and frontal areas could be correlated with elevated activity in FEF neurons. Although this prediction focuses on a particular set of areas, it should shed light on unclear aspects of FEF dynamics.

In a more general sense, our model predicts a reversed gradient of inhibition and strong large-scale interactions to sustain distributed WM patterns, which may be observed using different experimental approaches. It will also be interesting to see whether the same model is able to account for decision-making processes as well as working memory (Wang, 2002; Wong and Wang, 2006).

Given that our model is constrained using data from the macaque brain, it is interesting to discuss which of our results would extend to other conditions (and, in particular, to other animal models). First, we have shown with the simplified model (Figure 3) that the emergence of distributed WM requires minimal elements and is therefore likely to emerge in the cortical networks of other animals such as rodents or humans. The existence of a CIB (Figure 4) is a requirement for distributed WM as long as the gradient of local properties grows sublinearly, which makes the CIB a plausible condition in other species as well. Our results on the activity and number of attractors (Figure 5) and the effects of silencing (Figures 6 and 7) are highly dependent on the anatomical constraints used, so their validity will need to be tested for rodents and humans. Finally, the results on the robustness to distractors (Figure 8) rely on the presence of distributed activity (with moderate values of local coupling strengths) and the effect of the CIB on incoming distractor signals; as long as these ingredients are present in other species, we should expect these effects to be there as well.

Conceptually, this work revealed a novel mechanism in cortical space to generate differential functions across different cortical areas, a concept that is likely to be generalizable for understanding how distinct cortical areas endowed with a canonical circuit organization are at the same time suited for differential functions (Wang, 2020).

Materials and methods

Anatomical data

Request a detailed protocol

The anatomical connectivity data used here have been gathered in an ongoing tract tracing study in macaque and have been described in detail elsewhere (Markov et al., 2013; Markov et al., 2014a; Markov et al., 2014b; Mejias et al., 2016). Briefly, a retrograde tracer injected into a given target area labels neurons in a number of source areas projecting to that target. By counting the number of labeled neurons in a given source area, Markov et al. defined the fraction of labeled neurons (FLN) from that source to the target area. FLN serves as a proxy for the ‘connection strength’ between two cortical areas, which yields the connectivity pattern of the cortical network (Figure 1—figure supplement 1A-B). In addition, Markov et al. also measured the number of labeled neurons in the supragranular layers of a given source area. Dividing this number by the total number of labeled neurons in that area defines the fraction of supragranular labeled neurons (SLN) from that source to the target area (Figure 1—figure supplement 1C-D).
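The FLN computation can be illustrated with hypothetical labeled-neuron counts (the matrix below is invented for illustration; actual values come from the tracer experiments at https://core-nets.org):

```python
import numpy as np

# Hypothetical labeled-neuron counts for three areas: counts[t, s] is the
# number of neurons labeled in source area s after a tracer injection in
# target area t (diagonal excluded; values invented for illustration).
counts = np.array([[0.0, 120.0, 30.0],
                   [400.0, 0.0, 80.0],
                   [50.0, 200.0, 0.0]])

# FLN(t <- s): fraction of all neurons labeled by injection t found in
# source area s, so each row of the FLN matrix sums to one.
fln = counts / counts.sum(axis=1, keepdims=True)
```

Normalizing per injection in this way makes FLN comparable across target areas, regardless of the absolute number of labeled neurons each injection produces.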

SLN values may be used to build a well-defined anatomical hierarchy (Felleman and Van Essen, 1991; Markov et al., 2014b). Source areas located lower (higher) than the target area in the anatomical hierarchy, as defined in Felleman and Van Essen, 1991, display a progressively higher (lower) proportion of labeled neurons in the supragranular layer. As a consequence, the lower (higher) the source area relative to the target area, the higher (lower) the SLN values of the source-to-target projection. By performing a logistic regression on the SLN data to accommodate each area in its optimal position in the anatomical hierarchy (Chaudhuri et al., 2015), we assign a hierarchical value hi to each area ‘i’.
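A simplified version of this hierarchy fit can be sketched as a least-squares problem under a logit link. The actual study uses a generalized linear model (Chaudhuri et al., 2015), so this is only an approximation of the procedure, with the sign convention that feedforward (low-to-high) projections have high SLN:

```python
import numpy as np

def fit_hierarchy(sln_pairs, n_areas):
    """Least-squares hierarchy from pairwise SLN values (simplified sketch).

    sln_pairs: list of (source, target, SLN) tuples with 0 < SLN < 1.
    Solves h_target - h_source ~ logit(SLN) over all observed pairs, with
    the mean of h pinned to zero to remove the arbitrary global offset.
    """
    A, b = [], []
    for s, t, sln in sln_pairs:
        row = np.zeros(n_areas)
        row[t], row[s] = 1.0, -1.0
        A.append(row)
        b.append(np.log(sln / (1.0 - sln)))  # logit link
    A.append(np.ones(n_areas))  # pin mean(h) = 0
    b.append(0.0)
    h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return h
```

With self-consistent SLN data this recovers the relative hierarchical values exactly; with noisy data the least-squares solution gives the best-fitting placement of each area.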

Iterating these measurements across other anatomical areas yields an anatomical connectivity matrix with weighted directed connections and an embedded structural hierarchy. The 30 cortical areas used to build our data-constrained large-scale brain network are, in hierarchical order: V1, V2, V4, DP, MT, 8m, 5, 8l, 2, TEO, F1, STPc, 7A, 46d, 10, 9/46v, 9/46d, F5, TEpd, PBr, 7m, LIP, F2, 7B, ProM, STPi, F7, 8B, STPr and 24c. Finally, data on wiring distances between cortical areas are available for this dataset as well, allowing us to consider communication time lags when necessary (we found, however, that introducing time lags in this way does not have a noticeable impact on the dynamics of our model). The connectivity data used here are available to other researchers from https://core-nets.org.

The corresponding 30 × 30 matrices of FLN and SLN are shown in Figure 1—figure supplement 1B, D. Areas in these matrices are arranged following the anatomical hierarchy, which is computed with the SLN values and a generalized linear model (Chaudhuri et al., 2015; Mejias et al., 2016). Surgical and histology procedures followed European requirements 86/609/EEC and were approved by the ethics committee of the Rhone-Alpes region.

In addition to the data on FLN and SLN across 30 cortical areas, we used further data to constrain the area-to-area differences in the large-scale brain network. In particular, we collected data on the total spine count of layer 2/3 pyramidal neuron basal dendrites across different cortical areas, as the spine count constitutes a proxy for the density of synaptic connections within a given cortical area (Elston, 2007). A full list of all area-specific spine density values considered and their sources is given in Table 1. For data obtained from old monkeys, we apply an age correction factor to compensate for the decrease of spine counts with age; a plausible estimate is a ~30% decrease for a 10-year age difference (Duan et al., 2003; Young et al., 2014). See Figure 1—figure supplement 2 for the effect of this correction on the overall gradient established by the spine count data, and for the correlation of this gradient with the SLN hierarchy.

Table 1
Spine count data from basal dendrites of layer 2/3 pyramidal neurons in young (~2y o) macaque, acquired from the specified literature.
| Rank in SLN hierarchy | Area name | Measured spine count | Age correction factor | Source |
|---|---|---|---|---|
| 1 | V1 | 643 | 1 | Elston et al., 1999; Elston and Rosa, 1997 |
| 2 | V2 | 1201 | 1 | Elston and Rosa, 1997 |
| 3 | V4 | 2429 | 1 | Elston and Rosa, 1998b |
| 4 | DP | - | - | |
| 5 | MT | 2077 | 1 | Elston et al., 1999 |
| 6 | 8m | 3200 | 1.30 | Elston and Rosa, 1998b |
| 7 | 5 | 4689 | 1 | Elston and Rockland, 2002 |
| 8 | 8l | 3200 | 1.30 | Elston and Rosa, 1998b |
| 9 | 2 | - | - | |
| 10 | TEO | 4812 | 1 | Elston and Rosa, 1998b |
| 11 | F1 | - | - | |
| 12 | STPc | 8337 | 1 | Elston et al., 1999 |
| 13 | 7A | 2572 | 1 | Elston and Rosa, 1997; Elston and Rosa, 1998a |
| 14 | 46d | 6600 | 1.15 | Estimated from Elston, 2007 |
| 15 | 10 | 6488 | 1.15 | Elston et al., 2011 |
| 16 | 9/46v | 7800 | 1.15 | Estimated from Elston, 2007 |
| 17 | 9/46d | 7800 | 1.15 | Estimated from Elston, 2007 |
| 18 | F5 | - | - | |
| 19 | TEpd | 7260 | 1 | Elston et al., 1999 |
| 20 | PBr | - | - | |
| 21 | 7m | 2294 | 1.30 | Elston, 2001 |
| 22 | LIP | 2316 | 1 | Elston and Rosa, 1997; Elston and Rosa, 1998a |
| 23 | F2 | - | - | |
| 24 | 7B | 6841 | 1 | Elston and Rockland, 2002 |
| 25 | ProM | - | - | |
| 26 | STPi | 8337 | 1 | Elston et al., 1999 |
| 27 | F7 | - | - | |
| 28 | 8B | - | - | |
| 29 | STPr | 8337 | 1 | Elston et al., 1999 |
| 30 | 24c | 6825 | 1.15 | Elston et al., 2005 |

Experimental evidence of WM-related activity across cortical areas


To compare the results of our model with existing evidence, we generated brain maps highlighting areas for which experimental evidence of WM-related activity during the delay period has been found. Following the data collected in recent review studies (Christophel et al., 2017; Leavitt et al., 2017), we distinguish three categories. First, areas with strong WM evidence (for which at least two studies support WM-related activity, or for which only supporting studies are known) are shown in dark blue in the maps of Figures 2 and 4. Second, areas with moderate evidence (for which substantial positive and negative evidence exists) are shown in light blue. Finally, areas for which strong negative evidence exists (more than two studies with negative evidence, or absence of any positive studies) are left grey in the map. Alternative criteria have only small effects on the resulting maps, and the general results are robust to such variations.

Computational model: local neural circuit


We describe the neural dynamics of the local microcircuit representing a cortical area with the Wong-Wang model (Wong and Wang, 2006). In its three-variable version, this model describes the temporal evolution of the firing rate of two input-selective excitatory populations as well as the evolution of the firing rate of an inhibitory population. All populations are connected to each other (see Figure 1A). The model is described by the following equations:

(1) dS_A/dt = -S_A/τ_N + γ (1 - S_A) r_A
(2) dS_B/dt = -S_B/τ_N + γ (1 - S_B) r_B
(3) dS_C/dt = -S_C/τ_G + γ_I r_C

Here, S_A and S_B are the NMDA conductances of the selective excitatory populations A and B, respectively, and S_C is the GABAergic conductance of the inhibitory population. Values for the constants are τ_N = 60 ms, τ_G = 5 ms, γ = 1.282 and γ_I = 2. The variables r_A, r_B and r_C are the mean firing rates of the two excitatory populations and the inhibitory population, respectively. They are obtained by solving, at each time step, the transcendental equation r_i = φ_i(I_i) (where φ_i is the transfer function of the population, detailed below), with I_i being the input to population ‘i’, given by

(4) I_A = J_s S_A + J_c S_B + J_EI S_C + I_0A + I_net,A + x_A(t)
(5) I_B = J_c S_A + J_s S_B + J_EI S_C + I_0B + I_net,B + x_B(t)
(6) I_C = J_IE S_A + J_IE S_B + J_II S_C + I_0C + I_net,C + x_C(t)

In these expressions, J_s and J_c are the self- and cross-coupling between excitatory populations, respectively, J_EI is the coupling from the inhibitory population to either of the excitatory ones, J_IE is the coupling from either of the excitatory populations to the inhibitory one, and J_II is the self-coupling strength of the inhibitory population. The parameters I_0i with i = A, B, C are background inputs to each population. Parameters are J_s = 0.3213 nA, J_c = 0.0107 nA, J_IE = 0.15 nA, J_EI = -0.31 nA, J_II = -0.12 nA, I_0A = I_0B = 0.3294 nA and I_0C = 0.26 nA. Later we will modify some of these parameters in an area-specific manner (in particular J_s and J_IE) to introduce a gradient of properties across the cortical hierarchy. The term I_net,i denotes the long-range input coming from other areas in the network, which we keep at zero for now and detail later. Sensory stimulation can be introduced as extra pulse currents of strength I_pulse = 0.3 and duration T_pulse = 0.5 s (unless specified otherwise).

The last term xi(t) with i = A, B, C is an Ornstein-Uhlenbeck process, which introduces some level of stochasticity in the system. It is given by

(7) τ_noise dx_i/dt = -x_i + √τ_noise σ_i ξ_i(t)

Here, ξi(t) is a Gaussian white noise, the time constant is τnoise=2 ms and the noise strength is σA,B=0.005 nA for excitatory populations and σC=0 for the inhibitory one.

The transfer function φ_i(I), which transforms the input current into a firing rate, takes the following form for the excitatory populations (Abbott and Chance, 2005):

(8) φ_A,B(I) = (a I - b) / (1 - exp[-d (a I - b)])

The values for the parameters are a = 135 Hz/nA, b = 54 Hz, and d = 0.308 s. For the inhibitory population a similar function can be used, but for convenience we choose a threshold-linear function:

(9) φ_C(I) = [ (1/g_I)(c_1 I - c_0) + r_0 ]_+

The notation [x]_+ denotes rectification. The values for the parameters are g_I = 4, c_1 = 615 Hz/nA, c_0 = 177 Hz and r_0 = 5.5 Hz. Finally, it is sometimes useful for simulations (although not a requirement) to replace the transcendental equation r_i = φ_i(I_i) by its analogous differential equation, of the form

(10) τ_r dr_i/dt = -r_i + φ_i(I_i)

The time constant can take a typical value of τr=2 ms.
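A minimal Euler integration of the isolated-area model (Equations 1–10, noise omitted) can be sketched as follows, using the parameter values listed above:

```python
import math

# Minimal Euler integration of an isolated area (Eqs. 1-10), noise omitted.
tauN, tauG, gamma, gammaI = 0.060, 0.005, 1.282, 2.0          # s, s, unitless
Js, Jc, JIE, JEI, JII = 0.3213, 0.0107, 0.15, -0.31, -0.12    # nA
I0E, I0C = 0.3294, 0.26                                       # nA
a, b, d = 135.0, 54.0, 0.308              # excitatory transfer fn. (Eq. 8)
gI, c1, c0, r0 = 4.0, 615.0, 177.0, 5.5   # inhibitory transfer fn. (Eq. 9)

def phi_E(I):
    x = a * I - b
    return x / (1.0 - math.exp(-d * x))

def phi_C(I):
    return max((c1 * I - c0) / gI + r0, 0.0)

dt = 0.0005                                # 0.5 ms Euler step
SA = SB = SC = 0.1
rA = rB = rC = 0.0
for _ in range(int(4.0 / dt)):             # 4 s of spontaneous activity
    IA = Js * SA + Jc * SB + JEI * SC + I0E      # Eq. 4, no long-range input
    IB = Jc * SA + Js * SB + JEI * SC + I0E      # Eq. 5
    IC = JIE * (SA + SB) + JII * SC + I0C        # Eq. 6
    rA, rB, rC = phi_E(IA), phi_E(IB), phi_C(IC)
    SA += dt * (-SA / tauN + gamma * (1 - SA) * rA)    # Eq. 1
    SB += dt * (-SB / tauN + gamma * (1 - SB) * rB)    # Eq. 2
    SC += dt * (-SC / tauG + gammaI * rC)              # Eq. 3

print(rA, rB)    # settles at a low spontaneous rate, identical for A and B
```

With these parameter values the isolated area is monostable (J_s is below the bifurcation point discussed later), so a transient input pulse added to I_A would elevate the rates only temporarily.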

Computational model: gradient of synaptic strengths


Before considering the large-scale network and the inter-areal connections, we look into the area-to-area heterogeneity to be included in the model.

Our large-scale cortical system consists of N = 30 local cortical areas, for which inter-areal connectivity data are available. Each cortical area is described by a Wong-Wang model of three populations like the one described in the previous section. Instead of assuming areas to be identical to each other, here we consider some of the natural area-to-area heterogeneity that has been found in anatomical studies. For example, work from Elston, 2007 has identified a gradient of dendritic spine density, from low spine numbers (~600) found in early sensory areas to large spine counts (~9000) found in higher cognitive areas. On the other hand, EPSP amplitudes are comparable in early sensory (~1.7 ± 1.3 mV) and higher cognitive areas (~0.55 ± 0.43 mV). The combination of these findings suggests an increase of local recurrent strength as we move from sensory to association areas. In addition, cortical areas are distributed along an anatomical hierarchy (Felleman and Van Essen, 1991; Markov et al., 2014a). The position of a given area ‘i’ within this hierarchy, namely h_i, can be computed with a generalized linear model using data on the SLN (fraction of supragranular labeled neurons) projecting to and from that area. In particular, we assigned hierarchical values to each area such that the difference in values predicts the SLN of a projection. Concretely, we assign a value H_i to each area A_i so that SLN(A_j → A_i) ~ f(H_i - H_j), with ‘f’ being a logistic function. The final hierarchical values are then obtained by normalizing h_i = H_i/H_max. Further details on the regression are provided elsewhere (Chaudhuri et al., 2015; Markov et al., 2014b).

In the following, we assign the incoming synaptic strength (both local and long-range) of a given area as a linear function of the dendritic spine count values observed in anatomical studies, with age-related corrections when necessary. Alternatively, when spine count data are not available for a given area, we use its position in the anatomical hierarchy, which displays a high correlation with the spine count data, as a proxy for the latter. After this process, the large-scale network displays a gradient of local and long-range recurrent strength, with sensory/association areas showing weak/strong local connectivity, respectively. We denote the value of a given area i along this gradient as h_i, normalized between zero (bottom of the gradient, area V1) and one (top of the gradient). In summary:

(11) J_s^i = J_min + (J_max - J_min) h_i

We therefore assume a gradient of values of J_s, going from J_min to J_max. Having large values of J_s in association areas affects the spontaneous activity of these areas, even without considering inter-areal coupling. A good way to keep the spontaneous firing rate of these areas within physiologically realistic limits is to impose that the spontaneous activity fixed point is the same for all areas (Murray et al., 2017b). To introduce this into the model, we take into account that the solutions in the spontaneous state are symmetric: S_A = S_B = S (we assume zero noise for simplicity). The current entering either of the excitatory populations is then (assuming I_0A = I_0B = I_0):

(12) I = (J_s + J_c) S + J_EI S_C + I_0

Assuming fast dynamics for r_C and S_C (mediated by GABA) compared to S_A and S_B (mediated by NMDA), we obtain the approximate expression for S_C:

(13) S_C ≈ τ_G γ_I r_C = 2 S J_IE ζ + β

with

(14) ζ = τ_G γ_I c_1 / (g_I - J_II τ_G γ_I c_1)
(15) β = τ_G γ_I (c_1 I_0C + g_I r_0 - c_0) / (g_I - J_II τ_G γ_I c_1)

The equation for the excitatory current then has the form

(16) I = (J_s + J_c) S + 2 J_EI J_IE ζ S + J_EI β + I_0

To maintain the excitatory input (and therefore the spontaneous activity level S) constant while varying J_s across areas, we just have to keep the quantity J_s + J_c + 2 J_EI J_IE ζ ≡ J_0 constant (for the original parameters of the isolated area described above, we obtain J_0 = 0.2112 nA). A good choice, but not the only one, is to let the strength of excitatory synapses onto inhibitory neurons, J_IE, scale with J_s accordingly:

(17) J_IE = (J_0 - J_s - J_c) / (2 J_EI ζ)

This linear relationship ensures that the spontaneous solution is the same for all areas in the network. Note that deviations from this linear relationship would simply lead to different areas having slightly different spontaneous activity levels, and would not substantially affect our main results.

Since J_IE needs to be non-negative, the linear relationship above imposes a minimum value of J_min = 0.205 nA for J_s. The particular maximum value of J_s, namely J_max, determines the type of WM model we assume. Since the bifurcation point of an isolated area is at J_s = 0.4655 nA for this set of parameter values, setting J_max below that value implies that all areas in the network are monostable in isolation. In this situation, any sustained activity displayed by the model is a consequence of a global, cooperative effect due to inter-areal interactions. On the other hand, having J_max above the bifurcation point means that some areas will be multistable when isolated, that is, intrinsically bistable and compatible with classical WM theories.

Unless specified otherwise, we assume a range of Jmin = 0.21 nA and Jmax = 0.42 nA (i.e. below the critical value), so that the model displays distributed WM without having any intrinsically bistable areas.
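The gradient of Equations 11 and 17 can be sketched as follows; the normalized values h are a uniform ramp here, standing in for the actual spine-count/hierarchy data:

```python
# Sketch of the synaptic gradient (Eqs. 11 and 17). The normalized values h
# are a uniform ramp here, standing in for the spine-count/hierarchy data.
N = 30
Jmin, Jmax, Jc = 0.205, 0.42, 0.0107          # nA
J0, JEI = 0.2112, -0.31                       # nA
tauG, gammaI, c1, gI, JII = 0.005, 2.0, 615.0, 4.0, -0.12
zeta = tauG * gammaI * c1 / (gI - JII * tauG * gammaI * c1)   # Eq. 14

h = [i / (N - 1) for i in range(N)]           # bottom (V1) = 0, top = 1
Js = [Jmin + (Jmax - Jmin) * hi for hi in h]                  # Eq. 11
JIE = [(J0 - js - Jc) / (2 * JEI * zeta) for js in Js]        # Eq. 17

# The compensation keeps Js + Jc + 2*JEI*JIE*zeta = J0 in every area, so all
# areas share the same spontaneous fixed point despite the gradient in Js.
check = [js + Jc + 2 * JEI * jie * zeta for js, jie in zip(Js, JIE)]
print(min(JIE) >= 0, max(abs(c - J0) for c in check) < 1e-9)   # True True
```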

Computational model: inter-areal projections


We now consider the inter-areal projections connecting isolated areas to form the large-scale cortical network. Assuming that inter-areal projections stem only from excitatory neurons (as inhibitory projections tend to be local in real circuits) and that such projections are selective for excitatory neurons, the network or long-range input term arriving at each of the populations of a given area x from all other cortical areas is given by

(18) I_net,A^x = G Σ_y W_xy SLN_xy S_A^y
(19) I_net,B^x = G Σ_y W_xy SLN_xy S_B^y
(20) I_net,C^x = G Z Σ_y W_xy (1 - SLN_xy) (S_A^y + S_B^y)

Here, G is the global coupling strength, Z is a balancing factor, and W is the connectivity matrix (more details given below). In these equations, a superindex denotes the cortical area and a subindex the particular population within each area. The sum in all equations runs over all cortical areas of the network (N = 30). Excitatory populations A and B receive long-range inputs from equally selective units from other areas, while inhibitory populations receive inputs from both excitatory populations. Therefore, neurons in population A of a given area may be influenced by A-selective neurons of other areas directly, and by B-selective neurons of other areas indirectly, via local interneurons.

G is the global coupling strength, which controls the overall long-range projection strength in the network (G = 0.48 unless specified otherwise). Z is a factor that takes into account the relative balance between long-range excitatory and inhibitory projections. Setting Z = 1 means that both excitatory and inhibitory long-range projections are equally strong, but this does not guarantee that their effect is balanced in the target area, due to the effect of local connections. Following previous work (Murray et al., 2017b), we choose to impose a balance condition that guarantees that, if populations A and B have the same activity level, their net effect on other areas will be zero – thereby highlighting the selectivity aspect of the circuits. Again, deviations from this balance condition do not strongly affect our results beyond the appearance of small differences between populations A and B. Given that the transfer function of the inhibitory populations is linear and that their rate-conductance relationship is approximately linear, it can be shown that

(21) Z = 2 c_1 τ_G γ_I J_EI / (c_1 τ_G γ_I J_II - g_I)

Aside from global scaling factors, the effect of long-range projections from area y to area x is influenced by two factors. The first one, W_xy, is the anatomical projection strength as revealed by tract-tracing data (Markov et al., 2013). We use the fraction of labeled neurons (FLN) from area y to area x to constrain our projection values with anatomical data. We rescale these strengths to translate the broad range of FLN values (over five orders of magnitude) into a range more suitable for our firing rate models. We use a rescaling that maintains the proportions between projection strengths, and therefore the anatomical information, which reads

(22) W_xy = k_1 (FLN_xy)^k_2

Here, the values of the rescaling are k_1 = 1.2 and k_2 = 0.3. The same qualitative behavior can be obtained from the model if other parameter values, or other rescaling functions, are used, as long as the network is set into a standard working regime (i.e. signals propagate across areas, global synchronization is avoided, etc.). FLN values are also normalized so that Σ_y FLN_xy = 1. While in-degree heterogeneity might impact network dynamics (de Franciscis et al., 2011; Roxin, 2011), this normalization was done to better control the heterogeneity levels of each area, and to minimize confounding factors such as the uncertainty in injection volumes of tract-tracing experiments and the influence of potential homeostatic mechanisms. In addition, and as done for the local connections, we introduce a gradient of long-range projection strengths using the spine count data: W_xy → (J_s^x / J_max) W_xy, so that long-range projections display the same gradient as the local connectivity presented above.
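Equations 18–22 can be sketched as follows; the FLN and SLN matrices are random placeholders here, standing in for the tract-tracing data:

```python
import random

# Sketch of the long-range projection terms (Eqs. 18-22); the FLN and SLN
# matrices here are random placeholders for the tract-tracing data.
random.seed(1)
N, G, k1, k2 = 30, 0.48, 1.2, 0.3
c1, tauG, gammaI = 615.0, 0.005, 2.0
JEI, JII, gI = -0.31, -0.12, 4.0

Z = 2 * c1 * tauG * gammaI * JEI / (c1 * tauG * gammaI * JII - gI)   # Eq. 21

FLN = [[0.0 if x == y else random.random() for y in range(N)] for x in range(N)]
for x in range(N):                      # normalize so that sum_y FLN_xy = 1
    s = sum(FLN[x])
    FLN[x] = [v / s for v in FLN[x]]

W = [[k1 * v ** k2 if v > 0 else 0.0 for v in row] for row in FLN]   # Eq. 22
SLN = [[random.random() for _ in range(N)] for _ in range(N)]

SA = [random.random() for _ in range(N)]    # placeholder NMDA conductances
SB = [random.random() for _ in range(N)]

def net_inputs(x):
    """Long-range inputs onto populations A, B and C of area x (Eqs. 18-20)."""
    IA = G * sum(W[x][y] * SLN[x][y] * SA[y] for y in range(N) if y != x)
    IB = G * sum(W[x][y] * SLN[x][y] * SB[y] for y in range(N) if y != x)
    IC = G * Z * sum(W[x][y] * (1 - SLN[x][y]) * (SA[y] + SB[y])
                     for y in range(N) if y != x)
    return IA, IB, IC

IA, IB, IC = net_inputs(0)
print(round(Z, 3))    # ~0.805 for the local-circuit parameters above
```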

The second factor that needs to be taken into account is the directionality of signal propagation across the hierarchy. Feedforward (FF) projections that are preferentially excitatory constitute a reasonable assumption which facilitates signal transmission from sensory to higher areas. On the other hand, having feedback (FB) projections of a preferentially inhibitory nature contributes to the emergence of realistic distributed WM patterns (Figure 4) (see also previous work: Markov et al., 2014b; Tsushima et al., 2006). This feature can be introduced, in a gradual manner, by linking the different inter-areal projections to the SLN data, which provide a proxy for the FF/FB nature of a projection (SLN = 1 means purely FF, and SLN = 0 means purely FB). In the model, we assume a linear dependence on SLN for projections to excitatory populations and on (1 - SLN) for projections to inhibitory populations, as shown above.

Following recent evidence of frontal networks having primarily strong excitatory loops (Markowitz et al., 2015), it is convenient to ensure that the SLN-driven modulation of FB projections between frontal areas is not too large, so that interactions between these areas are never strongly inhibitory. In practice, such a constraint is only necessary for projections from frontal areas to 8l and 8m (which are part of the frontal eye fields) and has little effect on the behavior of our model otherwise. The introduction of this limitation has two minor consequences: (i) it allows areas 8l and 8m to exhibit a higher level of sustained activity during distributed WM – as their hierarchical position and recurrent strength are otherwise not strong enough to sustain activity, as previously suggested in anatomical studies (Markov et al., 2013; Markov et al., 2014a) – and (ii) it slightly shifts the transition point in cortical space (see Figure 4—figure supplement 1). Unless specified otherwise (and in Figure 4, where the limitation is not considered, for a cleaner study of the effects of inhibitory feedback), we consider that the SLN-driven modulation of FB projections to 8l and 8m is never larger than 0.4.

Deviations from our general assumptions could occur in other areas –for example, a slightly stronger CIB value to primary somatosensory areas could prevent sustained activity in area 2, as the evidence of such activity is still controversial (Lemus et al., 2010; Rossi-Pool et al., 2016; Zhou and Fuster, 1996).

Gating mechanism


To implement a simple gating mechanism which modulates areas receptive to a particular type of input, we assume that, when the gate of a given area is ‘open’, the strength of incoming synaptic projections effectively increases by a quantity gs. This reflects, in a simplified way, existing gating mechanisms based on the activation of input-specific dendritic compartments, in which activation of a specific dendritic branch increases the effect of synaptic afferents targeting such dendritic branch (Yang et al., 2016). The effects of such gating mechanism are shown in Figure 2—figure supplement 3.

Appendix 1

Simplified computational model: two areas

To provide a deeper intuition of the nature of distributed WM and the model ingredients which are fundamental for the phenomenon, we describe here a simplified version of our model which is suitable for theoretical analysis. We will first introduce a version of the model with two interconnected excitatory nodes, each of them following a rate dynamics:

(23) τ dr_1/dt = -r_1 + φ(J_s r_1 + J_c r_2 + I)
(24) τ dr_2/dt = -r_2 + φ(J_c r_1 + J_s r_2 + I)

Here, r_1 and r_2 are the firing rates of nodes 1 and 2, which display self-connections of strength J_s and cross-connections of strength J_c. Both areas have a relaxation time constant τ and receive a background input current I. The transfer function is threshold-linear, that is, φ(I) = I - θ if I > θ and zero otherwise.

The equation for the fixed point of the dynamics can be easily written as a function of the average firing rate of both units, namely R, which leads to the equation

(25) R = (J_s + J_c) R + I - θ

The above is only true in the linear regime, while for the subthreshold regime we get the trivial solution R = 0. Solving Eq. 25 leads to the following expression

(26) R = (θ - I) / (J_s + J_c - 1)

Equation 26 permits a positive solution for R if both numerator and denominator are positive (or both negative, although that solution is less interesting biologically). Interestingly, if the cross-connection strength is large enough, we can have a fixed-point solution with nonzero R even if J_s < 1, which corresponds to the case in which the nodes would be monostable if isolated from each other. Distributed sustained activity can therefore emerge even in the simple case of two coupled excitatory nodes.
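This fixed-point solution can be checked numerically; the parameter values below are illustrative, not taken from the paper:

```python
# Numerical check of the symmetric fixed point (Eq. 26) for the two-node
# model; the parameter values below are illustrative, not from the paper.
def phi(I, theta=1.0):
    return max(I - theta, 0.0)       # threshold-linear transfer function

Js, Jc, I, theta = 0.6, 0.6, 0.5, 1.0    # Js < 1: each node monostable alone

# Isolated node (Jc = 0): Eq. 26 gives a negative R, i.e. no active state.
R_isolated = (theta - I) / (Js - 1.0)

# Coupled nodes: Js + Jc > 1, so Eq. 26 yields a positive rate.
R = (theta - I) / (Js + Jc - 1.0)

# Self-consistency: R must satisfy R = phi((Js + Jc) R + I).
residual = abs(R - phi((Js + Jc) * R + I, theta))
print(R_isolated < 0, R, residual)    # True, R = 2.5, residual ~ 0
```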

Simplified computational model: N areas

We now present the case of a large number N of connected excitatory nodes. We consider a fully connected network for simplicity and N = 30 nodes for the simulations. The activity of the nodes is given by

(27) τ dr_i/dt = -r_i + φ( J η_i r_i + (1/N) Σ_{j≠i} G r_j + I )

Here, the term η_i increases linearly with the node index and introduces a gradient of connectivity strength across the network, with minimum value η_1 = 0.55 and maximum value η_N = 0.85. Other parameter values chosen for simulations are: τ = 20 ms, I = 4.81, J = 0.91. For the transfer function, we choose a sigmoidal function (to demonstrate robustness of the phenomenon observed above for the threshold-linear case):

(28) φ(I) = S_max / (1 + exp[-S_sat (I - I_0)])

Parameters chosen for simulations are S_max = 60, S_sat = 0.1, I_0 = 30. For these parameters, an isolated node is bistable only if η ≳ 0.88, which is above our chosen η_N value and implies that all nodes are monostable in isolation.
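This bistability threshold can be checked numerically by iterating the fixed-point map of an isolated node from low and high initial conditions; a sketch:

```python
import math

# Check the bistability threshold of an isolated node, r = phi(J*eta*r + I),
# by iterating the fixed-point map from low and high initial conditions.
Smax, Ssat, I0 = 60.0, 0.1, 30.0
J, I = 0.91, 4.81

def phi(x):
    return Smax / (1.0 + math.exp(-Ssat * (x - I0)))    # Eq. 28

def fixed_point(eta, r):
    for _ in range(5000):            # plain fixed-point iteration
        r = phi(J * eta * r + I)
    return r

def bistable(eta):
    lo = fixed_point(eta, 0.0)       # branch reached from silence
    hi = fixed_point(eta, Smax)      # branch reached from saturation
    return hi - lo > 5.0             # two distinct stable branches?

etas = [0.80 + 0.005 * k for k in range(41)]     # scan 0.80 .. 1.00
threshold = next(e for e in etas if bistable(e))
print(round(threshold, 3))    # close to the eta ~ 0.88 quoted above
```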

This model admits an approximate mean-field solution if we define the network-average firing rate as R = (1/N) Σ_i r_i. Averaging Equation 27 over units and using standard mean-field approximations such as ⟨f(x)⟩ ≈ f(⟨x⟩), we arrive at

(29) τ dR/dt = -R + φ( J ⟨η r⟩ + G R + I )

To estimate the average of the product η r over units, we can follow two approaches. First, we can assume independence between the two variables, so that ⟨η r⟩ ≈ ⟨η⟩⟨r⟩ = η_0 R, with η_0 being the mean value of η. We refer to this as the first-order mean-field solution in the text; it is given by the following equations:

(30) τ dR/dt = -R + φ( (J η_0 + G) R + I )
(31) τ dr_i/dt = -r_i + φ( J η_i r_i + G R + I )

It is important to notice that Eq. 30 is the actual self-consistent mean-field equation, which can be solved numerically to find the fixed-point solutions of our system. Eq. 31, on the other hand, is useful because it yields the fixed-point solution for any node ‘i’ connected to the network, and allows us to explore the effect of the local heterogeneity η.

As an alternative to this solution, we can consider additional terms in the dependence between the heterogeneity and the firing rate of the nodes. For example, we can assume that the firing rate of units grows proportionally to both η and G, a plausible approximation that takes into account both the heterogeneity and the global interactions. We can write this as

(32) r(η) ≈ α r_0 + β G η

This leads to the following expression:

(33) ⟨η r⟩ ≈ α r_0 ⟨η⟩ + β G ⟨η²⟩ ≈ α r_0 η_0 + β G η_1 η_N

The mean-field solution obtained, which we denote here as the second-order solution, is given by

(34) τ dR/dt = -R + φ( (J α η_0 + G) R + J β G η_1 η_N + I )
(35) τ dr_i/dt = -r_i + φ( J η_i r_i + G R + I )

A choice of α = 0.94 and β = 10 (which fulfills the requirements of a perturbative approach, since α ≈ 1 and β η_1 η_N ~ 1) provides good results, as Figure 3 shows. However, both the first- and second-order solutions predict the emergence of distributed WM, and the choice of one or the other has only minor implications.

Simplified model for activity-silent memory traces

Similar to the simplified model above, we consider a network of N = 30 nodes whose dynamics is described by

(36) τ dr_i/dt = -r_i + φ( J u_i η_i r_i + (1/N) Σ_{j≠i} G u_j r_j + I )

Both local and long-range projections are now modulated by a variable u_i, accounting for short-term synaptic facilitation (STF), described by

(37) du_i/dt = (P - u_i)/τ_fac + P (1 - u_i) r_i

Transfer function and other parameters are as above, unless specifically mentioned (see Figure 3—figure supplement 1).
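An Euler sketch of the facilitation dynamics (Equation 37) under a constant presynaptic rate follows; since the excerpt does not list values for P and τ_fac, the ones below are illustrative only:

```python
# Euler sketch of the facilitation variable (Eq. 37) under a constant rate.
# P and tau_fac are not given in this excerpt; these values are illustrative.
P, tau_fac, dt = 0.2, 1.5, 0.001       # baseline release prob., s, s

def steady_state(r):
    # Setting du/dt = 0 gives u* = P (1 + tau_fac r) / (1 + P tau_fac r).
    return P * (1 + tau_fac * r) / (1 + P * tau_fac * r)

r = 20.0                               # constant presynaptic rate (spikes/s)
u = P                                  # start from the baseline value
for _ in range(int(10.0 / dt)):
    u += dt * ((P - u) / tau_fac + P * (1 - u) * r)    # Eq. 37

print(round(u, 4), round(steady_state(r), 4))   # both ~0.8857: facilitated
```

The steady state approaches 1 for high rates, so a strong transient input leaves a synaptic trace that outlives the spiking activity itself.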

Appendix 2

Data analysis: Overview

We developed a numerical method to estimate the number of stable distributed WM attractors for a particular set of parameter values of our model. This method, which follows simplified density-based clustering principles, is used to obtain the results shown in Figures 5 and 6. To allow for a cleaner estimation, we do not include noise in the neural dynamics during these simulations.

Our large-scale cortical network has 30 areas, each of them having two selective excitatory populations A and B. Simply assuming that each area can reach one of three possible states (sustained activity in A, sustained activity in B, or spontaneous activity) means that our model can potentially display up to 3^30 ≈ 2 × 10^14 attractor combinations. This number can be even larger if we take into account the precise firing rate reached by each area rather than simply its sustained/non-sustained status. Since it is not possible to fully explore this extremely large number of possible attractors, we devised a strategy based on exploring a sample of the input space of the model. The core idea is to stimulate the model with a certain input pattern (targeting randomized areas) and register the fixed point to which the dynamics of the model converges. By repeating this process with a large number of input combinations and then counting the number of different attractors in the resulting pool of fixed points, we obtain an estimate of the number of attractors for a particular set of parameter values.

Data analysis: Stimulation protocol

A given input pattern is defined as a current pulse of fixed strength (I_pulse = 0.2) and duration (T_pulse = 1 s) which reaches a certain number P of cortical areas. Only one population (A or B, randomized) in each area receives the input, and the P cortical areas receiving the input are randomly selected from the top 16 areas of the spine count gradient. This decreases the number of potential input combinations we have to deal with, by acknowledging that areas with stronger recurrent connections (such as 9/46d) are more likely to be involved in distributed WM patterns than those with weaker connections (such as MT). P can take any value between one and Pmax = 16, and we run a certain number of trials (see below) for each of them. Different values of I_pulse and T_pulse, as well as setting the randomly selected areas at a high-rate initial condition instead of providing an external input, have also been explored and lead to qualitatively similar results. Similar approaches using random perturbations in high-dimensional systems have been successfully applied in the past (Sussillo and Barak, 2013).

It is also important to consider that not all values of P have the same number of input combinations. For example, P = 1 allows for 2 × 16 = 32 different input combinations (if we discriminate between populations A and B), while P = 2 allows for 2² × (16 × 15/2) = 480 input combinations, and so on. For a given value of P, the number of possible input combinations N_c is given by

(38) N_c = 2^P C(Pmax, P) = 2^P Pmax! / [ (Pmax - P)! P! ]

By summing the values of N_c for P = 1, …, Pmax, we obtain around 43 million input combinations, which is still too many trials to simulate for a single model configuration. To simplify further, we apply a scaling factor Fc on top of N_c to bring these numbers down to levels reasonable for simulation. We use Fc = 0.0002 (or 0.02% of all possible combinations) for our calculations, which brings the total number of simulated input combinations down to around 9000. Other options, such as decreasing Pmax and using a larger scaling factor (Pmax = 12, Fc = 0.01, or 1% of all possible combinations), also give good results. Since the rescaling can have a strong impact for small P (yielding a number of trials smaller than one), we ensure at least one trial in those cases.
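The combinatorics above can be verified directly (Equation 38 and the Fc rescaling):

```python
from math import comb

# Combinatorics of the stimulation protocol (Eq. 38) and the Fc rescaling.
Pmax, Fc = 16, 0.0002

Nc = {P: 2 ** P * comb(Pmax, P) for P in range(1, Pmax + 1)}   # Eq. 38
total = sum(Nc.values())          # equals 3**16 - 1 = 43,046,720 combinations

# Rescale by Fc, guaranteeing at least one trial for small P.
trials = {P: max(1, round(Fc * n)) for P, n in Nc.items()}

print(Nc[1], Nc[2], total, sum(trials.values()))   # 32 480 43046720 ~8600
```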

To guarantee the stability of the fixed points obtained during these simulations, we simulate the system during a time window of 30 s (which is much larger than any other time scale in the system), and check that the firing rates have not fluctuated during the last 10 s before we register the final state of the system as a fixed point.

Data analysis: Estimating the number of attractors

The final step is to count how many different attractors have been reached by the system, by analyzing the pool of fixed points obtained from the simulations. A simple way to do this is to consider that, for any fixed point, the state of each area can be classified as sustained activity in population A (i.e. mean firing rate above a threshold of 10 spikes/s), sustained activity in population B, or spontaneous activity (both A and B below 10 spikes/s). This turns each fixed point into a vector of 30 discrete states, and the number of unique vectors in the pool of fixed points can be quickly obtained using standard routines in Matlab (e.g. the ‘unique’ function).

A more refined way to count the number of attractors, which we use in this work, is to use a distance criterion to discriminate between an attractor candidate and any previously identified attractors. Once the first attractor (i.e. the first fixed point analyzed) has been identified, we test whether the next fixed point is the same as the first by computing the (normalized, squared) Euclidean distance E_d between them:

(39) E_d = (1/n) Σ_{i=1}^{n} (r_i^new - r_i^old)²

where n = 30 is the total number of areas in the network (only one of the populations, A or B, needs to be considered here). If E_d is larger than a certain threshold ε, we consider it a new attractor. We choose ε = 0.01, which roughly means that two fixed points are considered different attractors if, for example, the activity of one of their cortical areas differs by ~0.5 spikes/s while the activity in all other areas is identical. The particular value of ε does not have a strong impact on the results (aside from the fact that smaller values of ε give more resolution to find attractors). When several attractors have been identified, each new candidate is compared against all of them using the same method.

Both the first and the second method to count attractors deliver qualitatively similar results (in terms of the dependence of the number of attractors with model parameters), although as expected the second method yields larger numbers due to its higher discriminability.
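Both counting methods can be sketched on synthetic fixed points (the original analysis was done in Matlab; this Python version is an equivalent illustration):

```python
# Both counting methods, illustrated on synthetic fixed points. Each fixed
# point lists per-area rates (rA, rB) for n = 30 areas; the paper's analysis
# used Matlab, this is an equivalent sketch.
n, thr, eps = 30, 10.0, 0.01

fixed_points = [
    [(35.0, 0.1)] * 5 + [(0.5, 0.2)] * 25,    # A-selective activity, 5 areas
    [(0.1, 35.0)] * 5 + [(0.5, 0.2)] * 25,    # B-selective activity, 5 areas
    [(35.0, 0.1)] * 5 + [(0.5, 0.2)] * 25,    # exact duplicate of the first
    [(35.04, 0.1)] * 5 + [(0.5, 0.2)] * 25,   # nearly identical to the first
]

# Method 1: discretize each area into A / B / spontaneous, count unique vectors.
def discretize(fp):
    return tuple("A" if ra > thr else "B" if rb > thr else "0" for ra, rb in fp)

n_discrete = len({discretize(fp) for fp in fixed_points})

# Method 2: normalized squared distance on population-A rates (Eq. 39).
def dist(f1, f2):
    return sum((a1 - a2) ** 2 for (a1, _), (a2, _) in zip(f1, f2)) / n

attractors = []
for fp in fixed_points:
    if all(dist(fp, att) > eps for att in attractors):
        attractors.append(fp)

print(n_discrete, len(attractors))    # 2 2: both methods find two attractors
```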

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Modelling code has been uploaded to ModelDB.

The following data sets were generated
    1. Mejias JF
    (2022) ModelDB
    ID 267295. Distributed working memory in large-scale macaque brain model.


Decision letter

  1. Tatiana Pasternak
    Reviewing Editor; National Institute of Neurological Disorders and Stroke, United States
  2. Tirin Moore
    Senior Editor; Stanford University, United States

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Mechanisms of distributed working memory in a large-scale network of macaque neocortex" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Tirin Moore as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this decision letter to help you prepare a revised submission.

The manuscript was well received by the two reviewers who felt that the large-scale distributed model of working memory has major strengths and fills an important gap in the literature. However, they also had a number of reservations and made suggestions which must be addressed for the manuscript to be accepted for publication in eLife.

Both reviewers were concerned about the lack of consideration of recent work documenting the existence of silent delay activity. The concern is that the proposed model relies heavily and exclusively on persistent attractor states, and that the manuscript creates the impression that these states represent the only current thinking about working memory.

The reservations raised by the two reviewers are summarized below and along with the original critiques will provide the guide to the revision, should you decide to revise the manuscript.

1. Both reviewers recommend that the work be re-framed by taking into account newer studies, and asked the authors to "consider changing the tone of the manuscript, so that it doesn't come across as if persistent attractors are state-of-the-art thinking about working memory".

2. Abstract, Introduction and Discussion

Please clarify in the Introduction and in the Discussion that you are testing one model of working memory, and acknowledge that there are other models that consider more complex dynamics of activity recorded during the delay. Also, please incorporate into the Abstract and Introduction the effects of deactivation and the resistance to distractors tested by your model.

3. Please address the point questioning the idea that "anatomical constraints" actually play a critical role in the model. If they are indeed critical to the model, provide documentation.

4. Consider moving the simplified model into the Supplementary Materials.

5. Discuss whether and how the proposed model explains extensive reports of silent periods in delay activity. A related issue is to what extent the proposed model depends on persistent activity and whether it can incorporate short-term plasticity (STP).

6. In the Discussion please expand the point already made in the manuscript that "silent activity periods associated with silent WM (Masse et al., 2019; Mongillo et al., 2008; Stokes, 2015) could also be due to distributed WM effects".

7. Please provide the definition of "persistent" activity and consider the recommendation to change "persistent" to "elevated" or to "delay" activity. Please also address the comment that past observations of "persistent" activity were based on activity averaged across trials, rather than on a trial-by-trial basis.

Reviewer #1 (Recommendations for the authors):

Overall, the interesting findings seem to be obfuscated by seemingly less relevant ones. For instance, both the abstract and introduction seem to ignore a main finding, which is the deactivation of the attractor by inactivating the top area. CIB and greater resilience to distractors are not properly introduced either (though mentioned in the abstract). Instead, the authors give relevance to a concept already in the literature, i.e., bistability accomplished through inter-area connectivity (Edin et al., PNAS; Guo et al., Nature), and to the fact that the model is “anatomically constrained”, which it is not clear is indeed the case.

Anatomical constraints.

The abstract reads: “we developed an anatomically constrained model” but it is not clear in what ways the anatomical data constrains the main model. Indeed, in a supplementary figure and in the simplified model the authors show that it does not seem to matter much: “Similar conclusions can be obtained when the anatomical structure of the cortical network is changed -for example, by randomly shuffling individual projection strength values”. This raises the question of which of the new insights depend on this and other “biological constraints”. Namely: “counterstream inhibitory bias”, superiority of distributed WM in resisting distractors, deactivation of the global attractor by silencing a top layer, the relationship of inactivation with specific areas, etc. Each finding should be accompanied by how it is affected by including or not specific “biological constraints”. If some, like anatomical connectivity, are not critical, then they obfuscate the main findings and worsen the overall readability of the paper (which is very good, but a bit long) and could be removed or moved to a supplementary figure showing unequivocally in what ways the data do (not) constrain the model. It is somewhat acknowledged in the paper that the main driver of the findings is the gradient of recurrent excitation (“As a matter of fact, the relevant parameter here is the strength of synaptic excitation that varies across cortical space, in the form of a macroscopic gradient”), but this is not very clear at times. Again, if this is indeed the case, less emphasis should be given to the anatomical “constraints” and instead the relevant feature should be spelled out clearly and early on (in the abstract and intro).

Simplified model.

It is not very clear what we gain with this model, other than to show that similar findings are achieved with homogeneous coupling (instead of the heterogeneous coupling from experimental data; see above). The model is motivated by “The above model, albeit a simplification of real brain circuits, includes several biologically realistic features, which makes it difficult to identify essential ingredients for the emergence of distributed WM.” This seems a good reason to remove the “biological constraints” from the “full model” (see above). Additionally, because of where this model is introduced in the paper, it becomes unclear whether the simplified or the full model is used in the following figures. This could be improved if the simplified model were introduced only in the supplementary material or in a subpanel with a clear title. At a minimum, all captions should say whether the full or simplified model was used.

Previous experimental literature.

Overall we feel that several studies were not properly considered. For instance, Guo et al., Nature is not cited properly, nor discussed. Note for example that also in this paper there was a model – in addition to clear empirical evidence – with different areas and similar concepts as the ones that are explored here.

Likewise, both "Cortical information flow during flexible sensorimotor decisions" by Markus Siegel et al., Science, and Panichello and Buschman, Nature, were not considered in this study. In both studies, the authors recorded from several areas across the hierarchy (from visual cortex to PFC) during WM and DM, so they seem extremely relevant, especially for constraining the model further in future studies. For example, Panichello and Buschman show clear WM codes in V4, not present in the current model. Another example: "We observed (…) a sharp binary jump of activity, areas like LIP exhibited a more gradual ramping activity, resembling temporal accumulation of information in decision-making (Shadlen and Newsome, 2001)". Siegel et al., Science, show very convincing evidence that this is actually not the case, and the model does not seem to match the latencies reported here.

Of course, this mismatch between data and model is not very important and it does not reduce the value of the current model, but the authors should consider toning down claims like "strong agreement with" or "an excellent agreement with a large body of data, from decades of monkey neurophysiological" which occur throughout the study. The model is a great proof of concept that provides several important insights, but it is far from being in "strong agreement" with what happens in the brain.

Somatosensory WM.

Relatedly, the authors perform an experiment that simulates "somatosensory WM". While the question as to which areas trigger the global attractor is interesting and would deserve to be explored further, the way this is framed (i.e. studying different WM modalities) is misleading and should be adapted. Figure 2—figure supplement 2 shows that the same global attractor is engaged irrespective of which area is stimulated. The evidence points otherwise (see Christophel et al., 2017). For example, Figure 2—figure supplement 2 shows persistent activity in IT, which would not be expected for somatosensory WM?

Inactivations.

It would be nice to have a schematic of when this inactivation is performed (which we think is throughout the trial), like in Figure 7. It seems that the point made in Figure 6C needs the areas to be silenced in the opposite direction (i.e. hierarchical order) to be conclusive. Panel F seems important, as well as the result in G, but it is very confusing. We would consider simplifying it to show more clearly the relevant features/points made. Again: how much the findings here (in particular the "bowtie" analyses) depend on the "anatomical constraints" is unclear.

"which is in agreement with classical prefrontal lesion studies (Curtis and D'Esposito, 2004)": the cited paper does not do what the authors did in the model. This line should be removed or better explained.

"In some cases, inactivating specific areas might even lead to a disinhibition of other areas and to a general reinforcement of the attractor". Again, unclear why this is. Does this depend on gradient of recurrent excitation, hierarchy location or anatomical connectivity?

Relationship with other mechanisms of working memory.

This paragraph, while important, seems a bit incomplete in the current form. In particular the part where activity-silent is discussed. The results presented here seem to depend strongly on the persistent activity hypothesis of working memory. Does it make sense to think about distributed attractors through short-term plasticity? The relationship with silent activity does not seem straightforward and this discussion failed to illuminate it.

In the next paragraph, the authors say "This also means that silent activity periods associated with silent WM (Masse et al., 2019; Mongillo et al., 2008; Stokes, 2015) could also be due to distributed WM effects. Optogenetic inactivations could be used to test this result." This is an interesting idea, but could be expanded a bit more. Intriguingly, the authors cite papers (Masse et al., 2019; Mongillo et al., 2008) on local circuits with actual activity-silent mechanisms. Instead, the authors should cite empirical evidence of silent periods, of which the model proposed here offers an alternative view. For example: Wolff et al., Nature Neuroscience (human occipital cortex), Barbosa et al., Nature Neuroscience (monkey PFC), and Akrami et al., Nature (rodent PPC), etc.

Reviewer #2 (Recommendations for the authors):

I am not suggesting that the authors overhaul their model and start over. But a re-write (and some changing of terms, see below) would serve them well. I would encourage the authors to consider changing the tone of the manuscript so that it doesn't come across as if persistent attractors are state-of-the-art thinking about working memory. I suggest more up-front acknowledgment of the newer developments (as opposed to a single paragraph near the end of the Discussion) and that their work will focus on mechanisms that allow average activity to remain elevated. Right now, it reads as if "persistent activity" is everything, with a disclaimer near the end.

Finally, I encourage the authors to not use the term "persistent activity" (try elevated or sustained elevated activity or just "delay activity"). As noted above, there is evidence against persistent activity. But more to the point, there is little or no evidence for persistent activity. Virtually all of the work purporting such evidence averaged neural activity across multiple trials. Across-trial averaging masks more complex dynamics like gaps of no spiking. One cannot conclude persistent firing from averaged data. It can only be addressed in real time at the single-trial level. Also, there is no definition of "persistent". Is it a spike every 5 ms? Every 10 ms? Every 100 ms? Using a term like "persistent activity" when it is not well defined and for which there is little direct evidence muddies the waters and does not do a service to the field.

Other comments:

One cannot help but wonder how the hierarchical trends discussed here relate to other hierarchical trends. For example, there is a gradual progression from sensory-related activity to task-relevant activity as one ascends the hierarchy. Or the greater mixed selectivity in higher cortex. Maybe those are separate issues. But if the authors have any insights into how their model contributes to them, it would certainly add value to their manuscript.

Page 4: "LIP exhibited a more gradual ramping activity, resembling temporal accumulation of information in decision making (Shadlen and Newsome, 2001)". Again, this was state-of-the-art like a decade ago. It ignores more recent work by Pillow, Shenoy, and others showing that the ramp-up is not gradual. When examined on the single-trial level, the activity is instead a series of discrete state changes. This does not take anything away from the elegant and important work of Shadlen and Newsome, without which the newer work would not have been possible. But, again, by focusing on older, not newer, work, the authors are not giving a full account of where we are in 2021.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Mechanisms of distributed working memory in a large-scale network of macaque neocortex" for further consideration by eLife. Your revised article has been evaluated by Tirin Moore (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewers were largely satisfied with the revised manuscript with one exception. They were concerned about the discussion of the role of plasticity in Attractor and Activity-Silent models (lines 559-566). It was felt that the work referred to in the Wang recent review which showed more spiking activity during manipulation of working memory, did not rule out synaptic plasticity. Furthermore, it was pointed out that Activity-Silent models also predict that spiking may be used to "ping" the network and read out the memories. In this case, the role of plasticity is to HELP the spiking, not to replace it.

To address this concern, this section should be modified. One option would be to provide clear evidence that synaptic plasticity only holds for Activity-Silent models and is not required by the attractor models. This can be done by citing specific references with the appropriate simulations/data or by providing new simulations/data.

Alternatively, this paragraph should be modified by allowing short-term plasticity to play a role in both types of models.

Reviewer #1 (Recommendations for the authors):

With most of my initial concerns addressed and the inclusion of interesting, new simulations, I fully support the publication of the manuscript in the current form.

Reviewer #2 (Recommendations for the authors):

The authors' revisions are mostly adequate.

However, the statements that activity-silent models "(1) it cannot filter out distractors that occur later in time than behavioral relevant stimuli, (2) it does not have a severely limited capacity (a characteristic of working memory) and (3) it is incapable of internal manipulation of information" are not true.

The activity-silent models can explain all of this. Synaptic weight changes are driven and refreshed by spiking. Thus, they have the same features and same control as attractor-state models. 1. Distractors can be filtered out by controlling spiking. 2. They do have a severe capacity limitation due to limitations in the spiking refresh rate. Multiple memories cannot be in the active state at the same time. That leads to capacity limitations. 3. Manipulation of WM is achieved by controlling spiking episodes, just like the attractor-state models.

The issue is that in testing the activity-silent models, the authors have shifted too much of the burden to synapses alone. That is a misrepresentation of the activity-silent models. It is easy to refute a model if one makes a straw model of it. In the activity-silent models, synapses don't do everything. They help activity by briefly carrying the memories between spiking episodes. That is why they are also referred to as "synaptic attractor" models. Because they also involve attractor states, they have many of the same features and mechanisms as attractor-state models. As a wise colleague recently said, the attractor-state and synaptic-attractor models are more similar than different. The characterization that the former can explain a variety of WM phenomena but the latter cannot is not accurate.

I think this is a valuable review. It is well-written. Attractor dynamics are indeed important and the review offers important insights. But surely these insights can be offered without misrepresenting other models.

https://doi.org/10.7554/eLife.72136.sa1

Author response

Reviewer #1 (Recommendations for the authors):

Overall the interesting findings seem to be obfuscated by seemingly not so relevant ones. For instance, both the abstract and introduction seem to ignore a main finding, which is the deactivation of the attractor by inactivating the top area. CIB and more resilience to distractors is not properly introduced, either (however mentioned in the abstract). Instead, the authors give relevance to a concept already in the literature, ie bistability accomplished through inter-area connectivity (Edin et al., PNAS, Guo et al., Nature) and to the fact that the model is "anatomically constrained", which it is not clear that is indeed the case.

We acknowledge that some important findings were not properly highlighted in our previous version. We have now highlighted more explicitly, in the introduction, our findings regarding CIB, distractor resilience, and the control/inactivation of distributed attractors by prefrontal areas (also in the abstract now, together with the other two).

We consider, however, that the focus on the distributed bistability via large-scale cortical networks is well placed, given that while the concept was already present in the literature (e.g. Christophel et al., 2017, Leavitt et al., 2017, Guo et al., 2017) our model is the first one to provide a mechanistic and anatomically-constrained explanation about the wide multi-area distribution reported in Christophel et al., 2017 and Leavitt et al., 2017. This goes beyond relatively simpler two-area interactions (Edin et al., 2009, Guo et al., 2017, Murray et al., 2017), which can’t explain the wide activity distribution and lack the spatial extension to predict global trends like the robust transient or the counterstream inhibitory bias.

We have also elaborated on the anatomical constraints of our model and discussed previous related work (Edin et al., 2009, Guo et al., 2017), as discussed below.

Anatomical constraints.

The abstract reads: "we developed an anatomically constrained model" but it is not clear in what ways the anatomical data constrains the main model. Indeed, in a supplementary figure and in the simplified model the authors show that it does not seem to matter much "Similar conclusions can be obtained when the anatomical structure of the cortical network is changed -for example, by randomly shuffling individual projection strength values". This raises the question of which of the new insights depend on this and other "biological constraints". Namely: "counterstream inhibitory bias", superiority of distributed WM in resisting distractors, deactivation of global attractor by silencing a top layer, inactivation relationship with specific areas, etc. Each finding should be accompanied by how they are affected by including or not specific "biological constraints". If some, like anatomical connectivity, are not critical, then they obfuscate the main findings and worsen the overall readability of the paper (which is very good, but a bit long) and could be removed or moved to a supplementary figure showing unequivocally in what ways it does (not) constrain the model. It is somewhat acknowledged in the paper that the main driver of the findings is the gradient of recurrent excitation ("As a matter of fact, the relevant parameter here is the strength of synaptic excitation that varies across cortical space, in the form of a macroscopic gradient"), but this is not very clear at times. Again, if this is indeed the case, less emphasis should be given to the anatomical "constraints" and instead the relevant feature should be spelled out clearly and early on (in the abstract and intro)

It is important to clarify the differences between anatomically-constrained model and anatomically-constrained result. Our model is clearly anatomically constrained, in the sense that we use anatomical data to determine many of its parameters, such as the connectivity matrix or some local properties. However, having an anatomically-constrained model doesn’t necessarily imply that all results from such a model will depend critically on the particular anatomical values used. Indeed, some flexibility in parameter values should be expected to allow, for instance, inter-subject variability, or even robustness of the results across species. In general, results which are sensitive to changes in structural assumptions might be valid predictions for macaques, while results more resilient to these changes (like the emergence of distributed WM patterns, as Figure 3 shows) are more likely to occur in other species too.

Likewise, anatomical constrains will be more important for some brain areas but less for others. For example, the local synaptic strength assumed for V1 is not very important, as long as it remains at the bottom of the hierarchy.

In the supplementary figure mentioned by the reviewer (Figure 4—figure supplement 2e-h), we indeed observe that we obtain the same global result (i.e. emergence of distributed WM patterns) when individual projection strengths are shuffled. However, we also observe a clear effect on the results for individual areas (see panel 'f', for example). More precisely, we see, as also stated in the text, that the duration of persistent activity for some areas is affected; other areas stop participating in the distributed attractor. Therefore, the anatomical constraint of projection strengths has a quantitative effect (areas may drop from the distributed WM pattern if their connectivity is altered), and so this constraint cannot be dropped without altering the predictions of the model.

A similar point can be made about the results of the simplified model: they indicate that the precise anatomical connectivity is not necessary to have distributed activity patterns in a generic network, but the particular connectivity matters if we consider differences between concrete areas. Without the anatomical connectivity, we would not be able to make any claims about how areas like LIP, 8B or 9/46d participate or not in the distributed WM attractors, as shown in Figures 2 or 6 for example.

Following the reviewer's advice, but also with this consideration in mind, we have added a sentence in the first paragraph of the results to summarize our clarifications from above. We have also carefully revised all the results of the manuscript and indicated, in a new paragraph in the discussion (page 15), the dependence of the results on the anatomical constraints. In particular:

– The CIB results in Figure 4 have already been shown (in Figure 3) to be the result of the concrete shape of the local strength gradient (linear vs spine count), and also not strongly dependent on the connectivity values (as similar results are found with the simplified model).

– The emergence and properties of a large number of attractors (Figure 5) are strongly dependent on the gradients, connectivity and also other choices of the model (in particular, the consideration of only two selective populations per area in the model). Therefore, although we expect that a large number of attractors will be found using, for example, connectomes for other species, we can’t predict their numbers.

– For a similar reason, the particular effects of silencing areas on the activity and number of attractors (Figures 6 and 7) will depend on the gradient and connectivity properties. These results will be therefore valid for macaque brains, but may differ for other species in which, for example, a bowtie structure is not present.

– The resistance of distributed WM patterns to distractors in Figure 8 is a property inherently linked to the existence of distributed WM patterns, moderate values of Jmax, and the CIB, so as long as these conditions are present, models of other animals' brains and/or conditions will also display a similar behavior.

Simplified model.

It is not very clear what we gain with this model, if not to show that with homogeneous coupling (instead of heterogeneous from experimental data, see above) similar findings are achieved. The model is motivated by "The above model, albeit a simplification of real brain circuits, includes several biologically realistic features, which makes it difficult to identify essential ingredients for the emergence of distributed WM." This seems a good reason to remove the "biological constraints" from the "full model" (see above). Additionally, because of where this model is introduced in the paper, it becomes unclear when the simplified or the full model is used in the following figures. This could be improved if the simplified model was introduced only in the supplementary material or in a subpanel with a clear title. At a minimum, all captions should say if the full/simplified model was used.

We apologize if the benefits of the simplified model were not properly explained, but we do not think it is a good idea to “remove the biological constraints from the full model”. These are two different and complementary levels of description, and they both provide useful information to the reader.

In particular, the simplified model tells us about the basic ingredients needed to have distributed WM (in a generic network, with simplified dynamics, etc.), as explicitly stated on page 5: strong enough long-range projections and a linear gradient of local couplings (or a CIB, if the gradient is sublinear). Such a synthesis is not easy to extract from the full model, and it could be useful, for example, for generalizing to other systems in future studies (for example, the brains of rodents or humans). On the other hand, the full model is needed to assess the validity of our claims in more realistic scenarios. If we only considered the results of the simplified model, it would not be clear whether these results would hold when adding more realistic considerations (i.e. a real connectivity matrix, inhibitory populations, etc.).

A good example from the text: the simplified model shows that distributed WM emerges for a linear gradient of local recurrent strengths (Figure 3b,c). When we introduce the data about dendritic spines (Figure 3e,f), we discover that the soft saturation present in these data turns the distributed WM pattern into an unrealistic all-or-none situation. This leads us to identify CIB (or weaker excitatory FB projections, in the case of the simplified model) as a solution to obtain realistic WM patterns in the full model (as later explored in Figure 4). The study of the simplified model alone, without data to constrain the gradient, would not have explored these effects, and the study of the full model alone might have overlooked the critical importance of CIB.

We therefore consider that both the full model and the simplified model provide important information to the reader, and we have a strong preference to maintain Figure 3 in the main text. In addition, we have now introduced a variation of the simplified model (new Figure 3 —figure supplement 1) to explore the case of distributed activity-silent memory. See response to another related comment below.

We have improved our justification for the use of the simplified model (page 5). We have also indicated clearly in the main text (page 6) that, after Figure 3, the full model is used for the rest of the paper. Furthermore, we added ‘full model’ in captions of Figures 4 and 5 to further emphasize this.

Previous experimental literature.

Overall we feel that several studies were not properly considered. For instance, Guo et al., Nature is not cited properly, nor discussed. Note for example that also in this paper there was a model – in addition to clear empirical evidence – with different areas and similar concepts as the ones that are explored here.

Likewise, both "Cortical information flow during flexible sensorimotor decisions" Markus Siegel et al., Science and also Panichello and Buschman, Nature were not considered in this study. In both studies, they recorded from several areas across the hierarchy (from visual cortex to PFC) during WM and DM, so they seem to be extremely relevant, especially to constrain the model further in future studies. For example, Panichello and Buschman show clear WM codes in V4, not present in the current model. Another example: “We observed (…) a sharp binary jump of activity, areas like LIP exhibited a more gradual ramping activity, resembling temporal accumulation of information in decision-making (Shadlen and Newsome, 2001)”. Siegel et al., Science show very convincing evidence that this is actually not the case and the model does not seem to match the latencies reported here.

Of course, this mismatch between data and model is not very important and it does not reduce the value of the current model, but the authors should consider toning down claims like "strong agreement with" or "an excellent agreement with a large body of data, from decades of monkey neurophysiological" which occur throughout the study. The model is a great proof of concept that provides several important insights, but it is far from being in "strong agreement" with what happens in the brain.

Although we had previously cited Guo et al. multiple times in our manuscript, including an explicit mention of its importance for extending our model to thalamocortical loops in the future, we agree that this paper required more consideration. In addition to very relevant experimental evidence, it also presented a computational model, although one consisting only of two interacting areas (similar in scope to Murray, Jaramillo and Wang, 2017). In this sense, it provides a valuable starting point and we now cite it in the introduction, where other two-area WM models (Edin et al., 2009, Murray et al., 2017) are acknowledged. We also included it, together with Murray et al., 2017, in a new sentence in the discussion stating that these works previously explored inter-areal interactions to sustain WM in more limited (i.e. two-area) models.

In addition, we have now cited the studies of Siegel et al. and Panichello and Buschman. We agree with the reviewer about the importance of these works for our conclusions. Even though the tasks are different from the one we simulate with our model (and therefore their results are not directly comparable), we think that they should guide future improvements of the model once ingredients such as attention and sensorimotor integration are carefully taken into account, and we have indicated this in the text (page 13). The sentence about LIP has been modified, also in response to a comment from Reviewer 2. We have also toned down overly strong claims throughout the manuscript, including replacing mentions of 'strong agreement' with just 'substantial agreement' or simply 'agreement'.

Somatosensory WM.

Relatedly, the authors perform an experiment that simulates “somatosensory WM”. While the question as to which areas trigger the global attractor is interesting and would deserve to be explored further, the way this is framed (i.e. studying different WM modalities) is misleading and should be adapted. Figure 2—figure supplement 2 shows that the same global attractor is engaged irrespective of which area is stimulated. The evidence points otherwise (see Christophel et al., 2017). For example, Figure 2—figure supplement 2 shows persistent activity in IT, which would not be expected for somatosensory WM.

We agree: current experimental evidence suggests that stimulation of different areas (or via different modalities) should not necessarily lead to the same attractors, as happens in the model. We believe that the reason for this “convergence to the same attractor” is the absence of a gating mechanism in the model, which would allow some areas to participate in modality-specific global attractors. This is already tested in an extension of our model, shown in Figure 2—figure supplement 4, which aims to provide an explanation for such differences. We have rewritten the text of this result (page 5) so that the need for additional considerations is now clearer to readers. As a result of this change, Figure 2—figure supplements 3 and 4 have been swapped.

Inactivations.

It would be nice to have a schematic of when this inactivation is performed (which we think is throughout the trial), like in Figure 7. It seems that the point made in Figure 6C needs the areas to be silenced in the opposite direction (i.e. hierarchical order) to be conclusive. Panel F seems important, as well as the result in G, but it is very confusing. We would consider simplifying it to show more clearly the relevant features/points made. Again: how much of the findings (in particular the “bowtie” analyses) here depends on the “anatomical constraints” is unclear.

As the reviewer suspects, the inactivation is performed throughout the trial; we have now stated this clearly in the Figure 6 caption. Panel 6C indeed shows the effect of silencing the areas in the opposite hierarchical order, as the reviewer says, and the silencing is also trial-long: first a trial with the last area silenced, then a trial with the last and second-to-last areas silenced, etc. The whole purpose of this approach is to test the resilience of attractors when specifically targeting top hierarchical areas for inactivation. And as we discussed in the point above, these results are strongly dependent on the anatomical data used; results will not necessarily be the same for connectomes of rodents or humans, for example.

We have clarified the results of Figure 6C, F and G in the main text (pages 8-9) and in the figure caption. We have also improved the clarity of the figures and explanations of Figure 6F and G (including the bowtie analysis and how it depends on the anatomical data).

“which is in agreement with classical prefrontal lesion studies (Curtis and D’Esposito, 2004)”. The cited paper does not do what the authors did in the model. This line should be removed or better explained.

We have corrected this sentence.

“In some cases, inactivating specific areas might even lead to a disinhibition of other areas and to a general reinforcement of the attractor”. Again, unclear why this is. Does this depend on gradient of recurrent excitation, hierarchy location or anatomical connectivity?

The described effect has its origin in the hierarchical relationships between areas and the CIB: silencing a top area which is inhibiting lower ones may release the lower areas from that inhibition and increase their activity. We have included a new sentence to clarify this, on page 9.
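To make the disinhibition logic concrete, here is a minimal two-unit rate-model sketch (not the published large-scale model; the weights and background drive are illustrative values of our own choosing) in which a higher area sends net-inhibitory, CIB-like feedback to a lower area:

```python
def f(x):
    # threshold-linear transfer function
    return max(x, 0.0)

def simulate(silence_top, t_total=2000.0, dt=1.0, tau=20.0):
    """Two rate units: a 'top' area sends net-inhibitory (CIB-like)
    feedback to a 'lower' area, which in turn excites the top area."""
    r_low, r_top = 1.0, 1.0
    i_ext = 2.0    # constant background drive (illustrative value)
    w_self = 0.3   # weak local recurrence
    w_fb = -1.0    # top -> lower feedback, net inhibitory (CIB)
    w_ff = 0.5     # lower -> top feedforward, net excitatory
    for _ in range(int(t_total / dt)):
        top = 0.0 if silence_top else r_top
        r_low += dt * (-r_low + f(w_self * r_low + w_fb * top + i_ext)) / tau
        r_top += dt * (-r_top + f(w_self * r_top + w_ff * r_low + i_ext)) / tau
    return r_low

# With the top area intact, its feedback suppresses the lower area;
# silencing the top area disinhibits it and raises its steady-state rate.
print(simulate(silence_top=True) > simulate(silence_top=False))
```

In this toy setting the lower unit is held near zero by the top unit's feedback, and removing that feedback lets it settle at a much higher rate, mirroring the disinhibition described above.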

Relationship with other mechanisms of working memory.

This paragraph, while important, seems a bit incomplete in its current form, in particular the part where activity-silent mechanisms are discussed. The results presented here seem to depend strongly on the persistent-activity hypothesis of working memory. Does it make sense to think about distributed attractors through short-term plasticity? The relationship with activity-silent mechanisms does not seem straightforward, and this discussion failed to illuminate it.

In the next paragraph, the authors say “This also means that silent activity periods associated with silent WM (Masse et al., 2019; Mongillo et al., 2008; Stokes, 2015) could also be due to distributed WM effects. Optogenetic inactivations could be used to test this result.” This is an interesting idea, but could be expanded a bit more. Intriguingly, the authors cite papers (Masse et al., 2019; Mongillo et al., 2008) on local circuits with an actual activity-silent mechanism. Instead, the authors should cite empirical evidence of silent periods, of which the model proposed here offers an alternative view: for example, Wolff et al., Nature Neuroscience (human occipital cortex), Barbosa et al., Nature Neuroscience (monkey PFC), and Akrami et al., Nature (rodent PPC).

In response to this comment, and also to the concerns of Reviewer #2, we have included new simulation results (new Figure 3—figure supplement 1) showing how activity-silent memory traces could also emerge in a distributed fashion. In this case, the short-term synaptic efficacy is maintained over longer periods of time by the long-range interactions between brain areas, rather than by local recurrent input. We have rewritten parts of the text throughout the manuscript (particularly in the introduction and discussion) to make more explicit the generality of our proposed mechanism for distributed WM beyond persistent activity, and also to improve the discussion of silent WM. We have also expanded our proposed optogenetic test.

Regarding the optogenetic inactivations, we meant to suggest that they could be used to test whether information encoded in WM areas could survive short inactivations. We realize that the previous writing was confusing, and since we are addressing the activity-silent phenomenon elsewhere in the text, we have now removed this sentence.

Reviewer #2 (Recommendations for the authors):

I am not suggesting that the authors overhaul their model and start over. But a rewrite (and some changing of terms, see below) would serve them well. I would encourage the authors to consider changing the tone of the manuscript so that it doesn't come across as if persistent attractors are state-of-the-art thinking about working memory. I suggest a more up-front acknowledgment of the newer developments (as opposed to a single paragraph near the end of the Discussion) and that their work will focus on mechanisms that allow average activity to remain elevated. Right now, it reads as if "persistent activity" is everything, with a disclaimer near the end.

Following the reviewer’s advice, we have adapted the text throughout the manuscript (especially in the introduction and discussion) to put our proposal in a more general light and to acknowledge other WM mechanisms from the beginning. Our work now reads as primarily focused on the persistent-activity hypothesis for practical reasons, while highlighting that our distributed WM proposal can be applied more generally.

To elaborate on this point a bit further, we have also investigated, using a variation of our simplified model, whether our distributed WM proposal could facilitate the emergence of activity-silent memory traces as well. More precisely, we reduced the overall synaptic strength in our simplified model, included short-term synaptic facilitation in both local and long-range connections, and tested whether activity-silent memory traces can also benefit from inter-areal interactions, so that ‘silent’ distributed attractors may emerge. This seems to be the case, as we show in the new Figure 3—figure supplement 1. Our model shows that silent memory traces emerge when brain areas are allowed to support each other, but fade away if we only consider isolated areas, as in the persistent-activity case. While this result opens the door for more realistic simulations, we hope that it suffices to make a point about the generality of the distributed WM hypothesis proposed here.
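As a rough illustration of this idea, the following toy sketch (our own simplification, not the Figure 3—figure supplement 1 model; it has no bursting dynamics, and all parameter values are arbitrary) couples two rate units, each with a Tsodyks-Markram-like facilitation variable u. Local recurrence alone is too weak to sustain activity, but inter-areal coupling keeps the units weakly active after the cue, which in turn keeps the synaptic-efficacy trace elevated:

```python
import numpy as np

def run(coupled, dt=1.0, t_cue=300.0, t_total=1800.0):
    """Two identical rate units, each with a facilitation variable u.
    Local recurrence J alone is subcritical; long-range coupling G lets
    the areas keep each other active, refreshing the efficacy trace u."""
    tau_r, tau_f, U = 20.0, 1000.0, 0.1
    J = 0.9                      # local recurrent strength (too weak alone)
    G = 0.4 if coupled else 0.0  # inter-areal coupling strength
    r = np.zeros(2)
    u = np.full(2, U)
    for step in range(int(t_total / dt)):
        cue = 1.0 if step * dt < t_cue else 0.0  # transient cue to both areas
        drive = J * u * r + G * (u * r)[::-1] + cue
        r += dt * (-r + np.tanh(np.maximum(drive, 0.0))) / tau_r
        u += dt * ((U - u) / tau_f + 0.5 * U * (1.0 - u) * r)
    return u.mean()  # synaptic-efficacy trace at the end of the delay

# The efficacy trace outlasts the delay only when areas support each other.
print(run(coupled=True) > run(coupled=False) + 0.2)
```

With coupling, the facilitation variable stays near its ceiling throughout the delay; for isolated areas, activity decays after the cue and u relaxes back toward baseline with time constant tau_f.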

Finally, I encourage the authors not to use the term "persistent activity" (try "elevated activity", "sustained elevated activity" or just "delay activity"). As noted above, there is evidence against persistent activity. But more to the point, there is little or no evidence for persistent activity. Virtually all of the work purporting such evidence averaged neural activity across multiple trials. Across-trial averaging masks more complex dynamics like gaps of no spiking. One cannot conclude persistent firing from averaged data. It can only be addressed in real time at the single-trial level. Also, there is no definition of "persistent". Is it a spike every 5 ms? Every 10 ms? Every 100 ms? Using a term like "persistent activity" when it is not well defined and for which there is little direct evidence muddies the waters and does not do a service to the field.

This is a very interesting point, although we think that the situation might be a bit different in our case. Our model considers the macroscopic activity of brain areas, which in a real brain would be obtained by averaging over the activity of many individual neurons in the same circuit. While single-neuron persistent activity is limited as an explanation for WM, the ‘population persistent activity’ used in our model is more plausible, and it could also arise via more flexible mechanisms that allow for dynamic and highly variable single-neuron activity (Goldman, Neuron 2009).

In addition, we consider that persistent activity should not be understood as a constant, fixed value of all firing rates, but as the opposite of decaying transient activity. This has recently been discussed in Wang, TiNS 2021, and it could indeed constitute a working definition of the term ‘persistent activity’. In general, persistent activity can incorporate complex dynamics and variability in the firing rates during the delay epoch, a feature that has been addressed in previous studies. For example, Compte et al., 2000 (see panel A in Author response image 1) already showed the presence of rhythmic variability during the persistent-activity period. Such rhythmic activity is similar to the periodic bursts required for activity-silent mechanisms; as a matter of fact, Mongillo et al., 2008 showed such an example (see panel B) with a model parameter change slightly different from the one corresponding to the activity-silent regime.

Author response image 1

Nonetheless, we have followed this advice and adapted the terminology so that the terms ‘sustained activity’ or ‘sustained delay activity’ are used by default, and we have included a new section in the discussion (page 13) where these issues are explained.

Other comments:

One cannot help but wonder how the hierarchical trends discussed here relate to other hierarchical trends. For example, there is a gradual progression from sensory-related activity to task-relevant activity as one ascends the hierarchy. Or the greater mixed selectivity in higher cortex. Maybe those are separate issues. But if the authors have any insights into how their model contributes to them, it would certainly add value to their manuscript.

These are indeed relevant issues, as our work establishes a partial connection between structural gradients (dendritic spine count, position in the SLN-defined anatomical hierarchy) and functional ones (persistent activity being more common in higher than in lower areas of the hierarchy). Although the insight provided by our work is limited, the previous version of our manuscript already attempted (in the discussion) to relate our work to other hierarchical trends. We have now extended that paragraph to include the example of sensory- and task-related gradients and the mixed-selectivity examples that the reviewer mentioned (page 11).

Page 4: "LIP exhibited a more gradual ramping activity, resembling temporal accumulation of information in decision making (Shadlen and Newsome, 2001)". Again, this was state-of-the-art like a decade ago. It ignores more recent work by Pillow, Shenoy, and others showing that the ramp-up is not gradual. When examined on the single-trial level, the activity is instead a series of discrete state changes. This does not take anything away from the elegant and important work of Shadlen and Newsome, without which the newer work would not have been possible. But, again, by focusing on older, not newer, work, the authors are not giving a full account of where we are in 2021.

Following the point about population activity in our previous comment, we think it is important to clarify that our model focuses on population-level (rather than neuron-level) dynamics, and therefore our description is not invalidated by the recent work of Pillow, Shenoy and others. We have modified the sentence highlighted by the reviewer to make this parallelism clearer in the text (page 4).

[Editors’ note: what follows is the authors’ response to the second round of review.]

Reviewer #1 (Recommendations for the authors):

With most of my initial concerns addressed and the inclusion of interesting new simulations, I fully support the publication of the manuscript in its current form.

We thank Reviewer 1 for their work to improve our manuscript.

Reviewer #2 (Recommendations for the authors):

The authors' revisions are mostly adequate.

However, the statements that activity-silent models "(1) it cannot filter out distractors that occur later in time than behavioral relevant stimuli, (2) it does not have a severely limited capacity (a characteristic of working memory) and (3) it is incapable of internal manipulation of information" are not true.

The activity-silent models can explain all of this. Synaptic weight changes are driven and refreshed by spiking. Thus, they have the same features and same control as attractor-state models. 1. Distractors can be filtered out by controlling spiking. 2. They do have a severe capacity limitation due to limitations in the spiking refresh rate. Multiple memories cannot be in the active state at the same time. That leads to capacity limitations. 3. Manipulation of WM is achieved by controlling spiking episodes, just like the attractor-state models.

The issue is that in testing the activity-silent models, the authors have shifted too much of the burden to synapses alone. That is a misrepresentation of the activity-silent models. It is easy to refute a model if one makes a straw model of it. In the activity-silent models, synapses don't do everything. They help activity by briefly carrying the memories between spiking episodes. That is why they are also referred to as "synaptic attractor" models. Because they also involve attractor states, they have many of the same features and mechanisms as attractor-state models. As a wise colleague recently said, the attractor-state and synaptic-attractor models are more similar than different. The characterization that the former can explain a variety of WM phenomena but the latter cannot is not accurate.

I think this is a valuable review. It is well written. Attractor dynamics are indeed important and the review offers important insights. But surely these insights can be offered without misrepresenting other models.

We thank the Reviewer for elaborating on this point and helping us clarify the text. The Reviewer is correct that all three limitations mentioned above disappear when a combination of short-term plasticity and spiking activity is considered.

Importantly, we did not mean to imply that self-sustained activity and short-term facilitation are incompatible. As a matter of fact, short-term facilitation is part of the recurrent synaptic interactions that need to be sufficiently strong to maintain self-sustained firing. The attractor scenario is differentiated from the activity-silent scenario not by the nature of the underlying biological feedback mechanism (e.g. NMDA-receptor-dependent transmission or synaptotagmin-7-dependent synaptic facilitation), but by whether that mechanism is above a threshold strength.

To clarify this point and avoid misrepresenting the activity-silent models, we have slightly modified the title of the corresponding subsection (to ‘Attractor model of working memory and activity-silent state models’), and replaced the sentences mentioned by the Reviewer (page 13) with the following:

“Another example is self-sustained repetition of brief bursts of spikes interspersed with long silent time epochs (Mi et al., 2017). [...] Short-term plasticity could therefore contribute to activity-silent memory traces but also to self-sustained activity.”
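The threshold distinction drawn above can be caricatured with a single rate unit (an illustrative toy of ours, not the model in the paper): the same brief cue leaves either a self-sustained attractor state or a decaying transient, depending solely on whether the recurrent strength J exceeds a critical value.

```python
import math

def delay_rate(J, dt=1.0, t_cue=300.0, t_total=2000.0, tau=20.0):
    """One rate unit with fixed synaptic strength J: a brief cue leaves
    either a self-sustained state (J above threshold) or a decaying
    transient (J below threshold)."""
    r = 0.0
    for step in range(int(t_total / dt)):
        cue = 1.0 if step * dt < t_cue else 0.0
        r += dt * (-r + math.tanh(max(J * r + cue, 0.0))) / tau
    return r

print(delay_rate(J=1.5) > 0.5)   # suprathreshold recurrence: attractor persists
print(delay_rate(J=0.8) < 0.05)  # subthreshold: activity decays after the cue
```

In the subthreshold regime, any remaining memory trace would have to be carried by a slower variable such as facilitated synaptic efficacy, which is the sense in which the same feedback mechanism supports either scenario.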

https://doi.org/10.7554/eLife.72136.sa2

Article and author information

Author details

  1. Jorge F Mejías

    Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, Netherlands
    Contribution
    Conceptualization, Formal analysis, Funding acquisition, Investigation, Software, Validation, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID iD: 0000-0002-8096-4891
  2. Xiao-Jing Wang

    Center for Neural Science, New York University, New York, United States
    Contribution
    Conceptualization, Formal analysis, Funding acquisition, Investigation, Validation, Writing – original draft, Writing – review and editing
    For correspondence
    xjwang@nyu.edu
    Competing interests
    No competing interests declared
ORCID iD: 0000-0003-3124-8474

Funding

National Institutes of Health (R01MH062349)

  • Xiao-Jing Wang

Office of Naval Research (N00014-17-1-2041)

  • Xiao-Jing Wang

Simons Foundation

  • Xiao-Jing Wang

Human Brain Project SGA 3 (945539)

  • Jorge F Mejías

National Science Foundation (2015276)

  • Xiao-Jing Wang

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Rishidev Chaudhuri, John Murray and Jorge Jaramillo for their support during the development of this work, and Henry Kennedy for providing the connectivity dataset.

Senior Editor

  1. Tirin Moore, Stanford University, United States

Reviewing Editor

  1. Tatiana Pasternak, National Institute of Neurological Disorders and Stroke, United States

Publication history

  1. Preprint posted: September 8, 2019 (view preprint)
  2. Received: July 12, 2021
  3. Accepted: January 19, 2022
  4. Version of Record published: February 24, 2022 (version 1)

Copyright

© 2022, Mejías and Wang

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


  1. Jorge F Mejías
  2. Xiao-Jing Wang
(2022)
Mechanisms of distributed working memory in a large-scale network of macaque neocortex
eLife 11:e72136.
https://doi.org/10.7554/eLife.72136
