Fundamental bound on the persistence and capacity of short-term memory stored as graded persistent activity
Abstract
It is widely believed that persistent neural activity underlies short-term memory. Yet, as we show, the degradation of information stored directly in such networks behaves differently from human short-term memory performance. We build a more general framework where memory is viewed as a problem of passing information through noisy channels whose degradation characteristics resemble those of persistent activity networks. If the brain first encodes the information appropriately before passing it into such networks, the information can be stored substantially more faithfully. Within this framework, we derive a fundamental lower bound on recall precision, which declines with storage duration and number of stored items. We show that human performance, though inconsistent with models involving direct (uncoded) storage in persistent activity networks, can be well fit by the theoretical bound. This finding is consistent with the view that if the brain stores information in patterns of persistent activity, it might use codes that minimize the effects of noise, motivating the search for such codes in the brain.
https://doi.org/10.7554/eLife.22225.001
Introduction
Short-term memory, which refers to the brain’s temporary buffer of readily usable information, is considered to be a critical component of general intelligence (Conway et al., 2003). Despite considerable interest in understanding the neural mechanisms that limit short-term memory, the issue remains relatively unsettled. Human working memory is a complex phenomenon, involving not just short-term memory but executive selection and processing, operating on multiple timescales and across multiple brain areas (Jonides et al., 2008). In this study, we restrict ourselves to obtaining limits on short-term memory performance purely due to noise in persistent activity networks, if analog information is stored directly into these networks, or if it is first well-encoded to make the stored states robust to ongoing noise.
Short-term memory experiments quantify the precision of memory recall. Typically in such experiments, subjects are briefly presented with sensory inputs, which are then removed. After a delay the subjects are asked to estimate from memory some feature of the input. Consistent with everyday experience, memory capacity is severely limited, restricted to just a handful of items (Miller, 1956), and recall performance is worse when there are more items to be remembered. Persistence can also be limited, though forgetting over time is a less severe constraint than capacity: several experiments show that recall performance declines with delay (Luck and Vogel, 1997; Jonides et al., 2008; Barrouillet et al., 2009; Barrouillet et al., 2011; Barrouillet et al., 2012; Pertzov et al., 2013; Wilken and Ma, 2004; Bays et al., 2011; Pertzov et al., 2017; Anderson et al., 2011), at least when many items are stored in memory.
Efforts in experimental and theoretical psychology to understand the nature of these memory constraints (Atkinson and Shiffrin, 1968) have led to quantification of human memory performance, and to phenomenological models that can fit limitations in capacity (Zhang and Luck, 2008; Bays and Husain, 2008; van den Berg et al., 2012) or in persistence (Wilken and Ma, 2004; Barrouillet et al., 2012). They have also led to controversy: about whether memory consists of discrete ‘slots’ for a limited maximum number of items (Miller, 1956; Cowan, 2001; Zhang and Luck, 2008) or is more continuously allocable across a larger, variable number of items (van den Berg et al., 2012; Bays and Husain, 2008); about whether forgetting in short-term memory can be attributed in part to some inherent temporal decay of an activity or memory variable over time (Barrouillet et al., 2012; Campoy, 2012; Ricker and Cowan, 2014; Zhang and Luck, 2009) or is, as more widely supported, primarily due to interference across stored items (Lewandowsky et al., 2009).
These controversies have been difficult to resolve in part because different experimental paradigms lend support to different models, while in some cases the resolution of memory performance data is not high enough to adjudicate between models. In addition, psychological models of memory performance make little contact with its neural underpinnings; thus, it is difficult to mediate between them on the basis of mechanism or electrophysiological studies.
On the mechanistic side, persistent neural activity has been widely hypothesized to form the substrate for short-term memory. The hypothesis is based on a corpus of electrophysiological work establishing a link between short-term memory and persistent neural activity (Funahashi, 2006; Smith and Jonides, 1998; Wimmer et al., 2014). Neural network models of analog persistent activity predict a degradation of information over time (Compte et al., 2000; Brody et al., 2003; Boucheny et al., 2005; Burak and Fiete, 2009; Fung et al., 2010; Mongillo et al., 2008; Burak and Fiete, 2012; Wei et al., 2012), because of noise in synaptic and neural activation. If individual analog features are assumed to be directly stored as variables in such persistent activity networks, the time course of degradation of persistent activity should directly predict the time course of degradation in short-term memory performance. However, these models do not typically consider the direct storage of multiple variables (but see Wei et al., 2012), and in general their predictions have not been directly compared against human psychophysics experiments in which the memory load and delay period are varied.
In the present work, we make the following contributions: (1) Generate psychophysics predictions for information degradation as a function of delay period and number of stored items, if information is stored directly, without recoding, in persistent activity neural networks of a fixed total size; (2) Generate psychophysics predictions (through the use of joint source-channel coding theory) for a model that assumes information is restructured by encoding and decoding stages before and after storage in persistent activity neural networks; (3) Compare these models to new measurements (Pertzov et al., 2017) of human memory performance on an analog task as the demands on both maintenance duration and capacity are varied.
We show that the direct storage predictions are at odds with human memory performance. We propose that noisy storage systems, such as persistent activity networks, may be viewed as noisy channels through which information is passed, to be accessed at another time. We use the theory of channel coding and joint source-channel coding to derive the information-theoretic upper bound on the achievable accuracy of short-term memory as a function of time and number of items to be remembered, assuming a core of graded persistent activity networks. According to the channel coding view, the brain might strategically restructure information before storing it, to use the available neurons in a way that minimizes the impact of noise upon the ability to retrieve that information later. We apply our framework, which requires the assumption of additional encoding and decoding stages in the memory process, to psychophysical data obtained using the technique of delayed estimation (Ma et al., 2014), which provides a sensitive measure of short-term memory recall using a continuous, analog response space, rather than discrete (Yes/No) binary recall responses.
We show that empirical results are in substantially better agreement with the functional form of the theoretical bound than with predictions from a model of direct storage of information in persistent activity networks.
Our treatment of the memory problem is distinct from other recent approaches rooted in information theory (Brady et al., 2009; Sims et al., 2012), which consider only source coding: they assume that internal representations have a limited number of states, then compute the minimal distortion achievable in representing an analog variable with these limited states, after redundancy reduction and other compression. In those approaches, all representations are noise-free. By contrast, our central focus is precisely on noise and its effects on memory degradation over time, because the stored states are assumed to diffuse or random-walk across the set of possible stored states. The emphasis on representation with noise involves channel coding as the central element of our analysis.
Our present work is also complementary to efforts to understand short-term memory as rooted in variables other than persistent activity, for instance the possibility that short-term synaptic plasticity, through facilitation (Mongillo et al., 2008; Barak and Tsodyks, 2014; Mi et al., 2017), might ‘silently’ (Stokes, 2015) store short-term memory, which is reactivated and accessed through intermittent neural activity (Lundqvist et al., 2016).
Results
Analog measurement of human short-term memory
We consider data from subjects performing a delayed estimation task (Figure 1—source data 1). We briefly summarize the paradigm and the main findings; a more detailed description can be found in Pertzov et al. (2017). Subjects view a display with several ($K$) differently colored and oriented bars that are subsequently removed for the storage (delay) period. Following the storage period, subjects are cued by one of the colored bars in the display, now randomly oriented, and asked to rotate it to its remembered orientation. Bar orientations in the display were drawn randomly from the uniform distribution over all angles (thus the range of orientations lies in the circular interval $[0,\pi ]$) and the report of the subject was recorded as an analog value, to allow for more detailed and quantitative comparisons with theory (van den Berg et al., 2012). Importantly, both the number of items ($K$) and the storage duration ($T$) were varied.
When only a single item had to be remembered, the length of the storage interval had no statistically significant influence on the distribution of responses over the intervals considered (Figure 1B, with different delays marked by different shades and line styles; errors $<10$ degrees, effect of delay: $F(3,36)=1.3,p=0.3$; errors between $30$ and $50$ degrees: $F(3,36)=0.2,p=0.9$). By contrast, response accuracy degraded significantly with delay duration when there were 6 items in the stimulus (Figure 1C; true orientation subtracted from all responses to provide a common center at 0 degrees). The number of very precise responses decreased (errors $<10$ degrees, effect of delay: $F(3,36)=6.15,p=0.002$), with a corresponding increase in the number of trials with large errors (e.g. errors between $30$ and $50$ degrees, effect of delay: $F(3,36)=5.4,p=0.004$).
Overall, the squared error in recalling an item’s orientation (Figure 1D), averaged over subjects, increased with delay duration ($F(3,27)=49,p<0.001$) and also with item number ($F(3,27)=48,p<0.001$). The data show a clear interaction between storage interval duration and set size ($F(9,81)=17,p<0.001$), apparent as steeper degradation slopes for larger set sizes. In summary, for a small number of items (e.g. $K=1,2$), increasing the storage duration does not strongly affect performance, but for any fixed delay, increasing item number has a more profound effect.
Finally, at all tested delays and item numbers, the squared errors are much smaller than the squared range of the circular variable, and any sublinearities in the curves cannot be attributed to the inevitable saturation of a growing variance on a circular domain (Figure 1—figure supplement 1).
Information degradation in persistent activity networks
In this and all following sections, we start from the hypothesis that persistent neural activity underlies short-term information storage in the brain. The hypothesis is founded on evidence of a relationship between the stored variable and specific patterns of elevated (or depressed) neural activity (Taube, 1998; Aksay et al., 2001) that persist into the memory storage period and terminate when the task concludes, and on findings that fluctuations in delay-period neural activity can be predictive of variations in memory performance (Funahashi, 2006; Smith and Jonides, 1998; Blair and Sharp, 1995; Miller et al., 1996; Romo et al., 1999; Supèr et al., 2001; Harrison and Tong, 2009; Wimmer et al., 2014).
Neural network models like the ring attractor generate an activity bump that is a steady state of the network and thus persists when the input is removed, Figure 2A. All rotations of the canonical activity bump form a one-dimensional continuum of steady states, Figure 2B. Relatively straightforward extensions of the ring network can generate 2D or higher-dimensional manifolds of persistent states. However, any noise in network activity, for instance in the form of stochastic spiking (Softky and Koch, 1993; Shadlen and Newsome, 1994), leads to lateral random drift along the manifold in the form of a diffusive (Ornstein-Uhlenbeck) random walk (Compte et al., 2000; Brody et al., 2003; Boucheny et al., 2005; Wu et al., 2008; Burak and Fiete, 2009; Fung et al., 2010; Burak and Fiete, 2012), Figure 2C–D.
A defining feature of such random walks is that the squared deviation of the stored state relative to its initial value will grow linearly with elapsed time over short times, Figure 2D, with a proportionality constant $2\mathcal{D}$ (where $\mathcal{D}$ is the diffusivity) that depends on quantities like the size of the network and the peak firing rate of neurons (Burak and Fiete, 2012).
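The linear growth of squared deviation can be checked numerically. The following is a minimal sketch, assuming pure diffusion with no restoring force (the short-time regime described above) on an unbounded line rather than a ring; the function name and all parameter values are illustrative choices, not taken from the study.

```python
import numpy as np

def simulate_drift(diffusivity, dt, n_steps, n_trials, seed=0):
    """Simulate diffusive drift of a stored state.

    Returns the time points and the mean squared deviation (across trials)
    of the state from its initial value.
    """
    rng = np.random.default_rng(seed)
    # Each time step adds independent Gaussian noise of variance 2*D*dt.
    steps = rng.normal(0.0, np.sqrt(2.0 * diffusivity * dt),
                       size=(n_trials, n_steps))
    paths = np.cumsum(steps, axis=1)
    times = dt * np.arange(1, n_steps + 1)
    msd = np.mean(paths ** 2, axis=0)
    return times, msd

D = 0.01  # illustrative diffusivity
times, msd = simulate_drift(diffusivity=D, dt=0.01, n_steps=200, n_trials=10000)
slope = np.polyfit(times, msd, 1)[0]  # fitted growth rate of squared deviation
print(f"fitted slope = {slope:.4f}, expected 2D = {2 * D:.4f}")
```

The fitted slope should come out close to $2\mathcal{D}$, up to sampling error from the finite number of simulated trials.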
Memory modeled as direct storage in persistent activity networks
Suppose that the variables in a short-term memory task were directly transferred to persistent activity neural networks with a manifold of fixed points that matched the topology of the represented variable. Thus, $K$ circular variables would be stored, entry-by-entry, in $K$ 1-dimensional (1D) ring networks (Ben-Yishai et al., 1995). (Alternatively, the $K$ variables could be stored in a single network with a $K$-dimensional manifold of stable states, as described in the Appendix; the performance in neural costs and in fit to the data of this version of direct storage is worse than with storage in $K$ 1D networks, thus we focus on banks of 1D networks.)
When $N$ neural resources (e.g. composed of $N$ sets of $M$ neurons each, for a total of $NM$ neurons) are split into $K$ networks, each network is left with $N/K$ resources ($NM/K$ neurons in our example) for storage of a 1D variable. We know from Burak and Fiete (2012) that the diffusivity of the state in each of these 1D persistent activity networks will scale as the inverse of the number of neurons and of the peak firing rate per neuron. In other words, the diffusion coefficient is given by $\overline{\mathcal{D}}(K,N)=\mathcal{D}K/N$, where $\mathcal{D}$ is a diffusivity parameter independent of $K,N$ (but $\mathcal{D}\propto 1/M$). So long as the squared error remains small compared to the squared range of the variable, it will grow linearly in time at a rate given by $2\overline{\mathcal{D}}(K,N)$ (indeed, in the psychophysical data, the squared error remains small compared to the squared range of the angular variable; see Figure 1—figure supplement 1). Therefore the mean squared error (MSE) is given by:
$$\mathrm{MSE}_{\mathrm{direct}}(K,T)=2\overline{\mathcal{D}}(K,N)\,T=\frac{2\mathcal{D}}{N}\,KT\qquad (1)$$
The only free parameter in the expression for MSE as a function of time and item number is the ratio $N/2\mathcal{D}$. Because the inverse diffusivity parameter $1/\mathcal{D}$ scales with the number of neurons ($M$ in our example) when $N,K$ are held fixed, the product $N/(2\mathcal{D})$ is proportional to the total number of neurons ($N/(2\mathcal{D})\propto NM$). This ratio therefore functions as a combined neural resource parameter.
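As an illustration of how the direct storage prediction depends on its single resource parameter, the short sketch below evaluates the linear growth law $2\overline{\mathcal{D}}(K,N)\,T=KT/(N/2\mathcal{D})$. The function name and the numerical values are hypothetical, and the stored variable is assumed to be normalized to unit range so that the resource $N/2\mathcal{D}$ carries units of seconds.

```python
def direct_storage_mse(n_items, delay_s, resource_s):
    """Direct-storage squared error above baseline: K * T / (N / (2D)).

    resource_s is the combined resource N/(2D), in seconds (the stored
    variable is assumed normalized to unit range).
    """
    return n_items * delay_s / resource_s

# Hypothetical resource value, for illustration only.
resource = 100.0
for k in (1, 2, 4, 6):
    print(k, [direct_storage_mse(k, t, resource) for t in (1.0, 2.0, 3.0)])
```

Because the expression is a product of $K$ and $T$, doubling either the item number or the delay doubles the predicted error, which is the linear dependence tested against the data in the next section.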
Direct storage is a poor model of memory performance
To fit the theory of direct storage to psychophysics data, we find a single best-fit value (with weighted least-squares) of the free parameter $N/2\mathcal{D}$ across all item numbers and storage durations. For each item number curve, the fits are additionally anchored to the shortest storage period point ($T=100$ ms), which serves as a proxy for baseline performance at zero delay. Such baseline errors close to zero delay (which may be due to limitations in sensory perception, attentional constraints, constraints on the rate of information encoding (loading) into memory, or other factors) are not the subject of the present study, which seeks to describe how performance will deteriorate over time relative to the zero-delay baseline, as a function of storage duration and item number.
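One way such a one-parameter anchored fit could be carried out is sketched below. It assumes a specific anchoring form that the text does not spell out: each item-number curve starts from its measured value at $T_0=100$ ms and then grows as $K(T-T_0)/(N/2\mathcal{D})$. Under that assumption the weighted least-squares solution has a closed form. The data in the example are synthetic and noise-free, generated from the model itself purely to check the code; they are not the measurements of the study, and the delays used are placeholders.

```python
import numpy as np

def fit_resource(n_items, delays_s, mse, baseline, sigma, t0=0.1):
    """Weighted least-squares estimate of the resource N/(2D), in seconds.

    Assumed model (an illustrative anchoring choice):
        mse = baseline + n_items * (delays_s - t0) / resource
    """
    x = np.asarray(n_items, float) * (np.asarray(delays_s, float) - t0)
    y = np.asarray(mse, float) - np.asarray(baseline, float)
    w = 1.0 / np.asarray(sigma, float) ** 2  # inverse-variance weights
    inverse_resource = np.sum(w * x * y) / np.sum(w * x * x)
    return 1.0 / inverse_resource

# Synthetic check data generated from the model (not experimental data).
true_resource = 50.0
K = np.repeat([1, 2, 4, 6], 3)        # item numbers
T = np.tile([1.0, 2.0, 3.0], 4)       # placeholder delays in seconds
base = 0.01 * K                       # placeholder baselines at t0
y = base + K * (T - 0.1) / true_resource
fitted = fit_resource(K, T, y, base, sigma=np.full(K.shape, 0.05))
print(f"recovered resource: {fitted:.3f} s")
```

With real data, the per-point uncertainties `sigma` would come from the variability of the subject-averaged squared errors.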
As can be seen in Figure 3A, the direct storage theory provides a poor match to human memory performance ($p$-values that the data occur by sampling from the model, excluding the $100$ ms timepoint: $0.07,0.38,<10^{-4}$ for 1 item; $0.39,<10^{-4},0.2$ for 2 items; $0.09,0.29,0.08$ for 4 items, and $<10^{-3},<10^{-4},<10^{-4}$ for 6). These $p$-values strongly suggest rejection of the model.
Does the direct storage model fail mostly because its dependence on time and item number is linear, while the data exhibit some nonlinear effects at the largest delays? On the contrary, direct storage fails to fit the data even at short delays when the performance curves are essentially linear (see the systematic underestimation of squared error by the model over $\le 2$ second delays in the 4- and 6-item curves). If anything, the slight sublinearity in the 6-item curve at longer delays tends to bring it closer to the other curves and thus to the model, so its effect is to slightly reduce the discrepancy between the data and fits from direct storage theory.
One view of the results, obtained by selecting model parameters to best match the 6-item curve, is that direct storage theory predicts an insufficiently strong improvement in performance with decreasing item number, Figure 3B ($p$-values for the direct-storage model when fit to the 6-item responses: $<10^{-3},10^{-3},<10^{-4}$ for 1 item; $<10^{-2},<10^{-4},<10^{-4}$ for 2 items; $0.76,<10^{-2},2\times 10^{-3}$ for 4 items; $0.22,0.39,0.38$ for 6, excluding the $100$ ms delay timepoint; the $p$-values for the 1- and 2-item curves strongly suggest rejection of the model).
Information-theoretic bound on memory performance with well-coded storage
Even if information storage in persistent activity networks is a central component of short-term memory, describing the storage step is not a sufficient account of memory. This fact is widely appreciated in memory psychophysics, where it has been observed that variations in attention, motivation, and other factors also affect memory performance (Atkinson and Shiffrin, 1968; Matsukura et al., 2007). Here we propose that, even discounting these complex factors, direct storage of a set of continuous variables into persistent activity networks with the same total dimension of stable states lacks generality as a model of memory because it does not consider how pre-encoding of information could affect its subsequent degradation, Figure 3C–E. This omission could help account for the mismatch between predictions from direct storage and human behavior, Figure 3A–B.
Storing information in noisy persistent activity networks means that after a delay there will be some information loss, as described above. Mathematically, information storage in a noisy medium is equivalent to passing the information through a noisy information channel. To allow for high-fidelity communication through a noisy channel, it is necessary to first appropriately encode the signal, Figure 3F. Encoding for error control involves the addition of appropriate forms of redundancy tailored to the channel noise. As shown by Shannon (1948), very different levels of accuracy can be achieved with different forms of encoding for the same amount of coding redundancy and channel noise. Thus, predictions for memory performance after good encoding may differ substantially from the predictions from direct storage even though the underlying storage networks (channels) are identical.
Thus, a more general theory of information storage for short-term memory in the brain would consider the effects of arbitrary encoder-decoder pairs that sandwich the noisy storage stage, Figure 3G. In such a three-stage model, information to be stored is first passed to an encoder, which performs all necessary encoding. Encoding strategies may include source coding or compression of the data as well as, critically, channel coding: the addition of redundancy tailored to the noise in the channel so that, subject to constraints on how much redundancy can be added, the downstream effects of channel noise are minimized (Shannon, 1948). The coded information is stored in persistent activity networks, Figure 3H. Finally, the information is accessed by a decoder or readout, Figure 3G. Here, we derive a bound on the best performance that can be achieved by any coding or decoding strategy, if the storage step involves graded persistent activity.
The encoder transforms the $K$-dimensional input variable into an $N$-dimensional codeword, to be stored in a bank of storage networks with an $N$-dimensional manifold of persistent activity states (in the form of $N$ networks with a 1-dimensional manifold each, or 1 network with an $N$-dimensional manifold, or something in between). To equalize resource use for the persistent activity networks in both direct storage and coded storage models of memory, the $N$ stored states have a diffusivity $\mathcal{D}$ each, in contrast to the diffusivity of $\mathcal{D}K/N$ each for $K$ states (compare Figure 3D–E and G–H). The storage step is equivalent to passage of information through additive Gaussian information channels, with variance proportional to the storage duration $T$ and to the diffusivity. The decoder error-corrects the output of the storage stage and inverts the code to provide an estimate of the stored variable. (For more details, see Materials and methods and Appendix.)
We can use information theory to derive the minimum achievable recall error over all possible encoder-decoder structures, for the given statistics of the variable to be remembered and the noise in the storage information channels. In particular, we use joint source-channel coding theory to first consider at what rate information can be conveyed through a noisy channel for a given level of noise and coding redundancy, then obtain the minimal achievable distortion (recall error) for that information rate (see Materials and methods and Appendix). We obtain the following lower bound on the recall error:
This result is the theoretical lower bound on MSE achievable by any system that passes information through a noisy channel with the specified statistics: a Gaussian additive channel noise of zero mean and variance $2\mathcal{D}T$ per channel use, a codeword of dimension $N$, and a variable to be transmitted (stored) of dimension $K$, with entries that lie in the range $[0,\mathrm{\Phi}]$. The bound becomes tight asymptotically (for large $N$), but for small $N$ it remains a strict lower bound. Although the potential for decoding errors is reduced at smaller $N$, the qualitative dependence of performance on item number and delay should remain the same (Appendix and Polyanskiy et al., 2010). The bound is derived by dividing the total resources (defined here, as in the direct storage case, as the ratio $N/2\mathcal{D}$) evenly across all stored items (details in Appendix), similar to a ‘continuous resource’ conception of memory. The same theoretical treatment will admit different resource allocations; for instance, one could split the resources into a fixed number of pieces and allocate those to a (sub)set of the presented items, more similar to the ‘discrete slots’ model.
A heuristic derivation of the result above can be obtained by first noting that the capacity of a Gaussian channel with a given signal-to-noise ratio ($SNR$) is ${I}_{Gauss}=\frac{1}{2}\mathrm{log}(1+SNR)$. The summed capacity of $N$ channels, spread across the $K$ items of the stored variable, produces ${I}_{\text{per-item}}=\frac{N}{K}{I}_{Gauss}$. The variance of a scalar within the unit interval represented by $I$ bits of information is bounded below by ${e}^{-2I}$. Inserting ${I}_{\text{per-item}}$ into the variance expression and $SNR=1/2\mathcal{D}T$ into ${I}_{Gauss}$ yields Equation 2, up to scaling prefactors. The Appendix provides more rigorous arguments that the bound we derive is indeed the best that can theoretically be achieved.
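The heuristic steps can be written out explicitly. The sketch below assumes that information is measured in nats (natural logarithm), so that the exponential variance bound applies directly, and it omits the scaling prefactors supplied by the full derivation in the Appendix:

$$
{I}_{\text{per-item}}=\frac{N}{K}\cdot \frac{1}{2}\mathrm{log}\left(1+\frac{1}{2\mathcal{D}T}\right)
\quad\Longrightarrow\quad
{e}^{-2{I}_{\text{per-item}}}={\left(1+\frac{1}{2\mathcal{D}T}\right)}^{-N/K}.
$$

Written this way, the item number $K$ enters through the exponent, whereas the storage duration $T$ enters only through the base of the power.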
Equation 2 exhibits some characteristic features, including, first, a joint dependence on the number of stored items and the storage duration. According to this expression, the time course of memory decay depends on the number of items. This effect arises because items compete for the same limited memory resources and when an item is allocated fewer resources it is more susceptible to the effects of noise over time. Second, the scaling with item number is qualitatively different from the scaling with storage duration: increasing the number of stored items degrades performance much more steeply than increasing the storage interval, because item number is in the exponent. For a single memorized feature or item, the decline in accuracy with storage interval duration is predicted to be weak. On the other hand, increasing the number of memorized items while keeping the storage duration fixed should lead to a rapid deterioration in memory accuracy.
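The asymmetry between item number and delay can be illustrated numerically. The sketch below evaluates only the functional form suggested by the heuristic derivation, $(1+1/(2\mathcal{D}T))^{-N/K}$, with the prefactor omitted and the variable assumed normalized to unit range. The parameter values are taken from the resource estimate quoted later in the text for $N=5$ ($N/2\mathcal{D}\approx 32$ s) and are used here for illustration only; the function name is hypothetical.

```python
def coded_bound_shape(n_items, delay_s, n_networks, inv_2d_s):
    """Functional form of the coded-storage bound, prefactor omitted.

    Computes (1 + 1/(2*D*T)) ** (-N/K), where inv_2d_s = 1/(2D) in seconds.
    """
    snr = inv_2d_s / delay_s  # signal-to-noise ratio 1/(2*D*T)
    return (1.0 + snr) ** (-n_networks / n_items)

N = 5
INV_2D = 32.0 / N  # 1/(2D) in seconds, from N/(2D) ~ 32 s at N = 5

base = coded_bound_shape(1, 1.0, N, INV_2D)          # K = 1, T = 1 s
more_items = coded_bound_shape(6, 1.0, N, INV_2D)    # K = 6, T = 1 s
longer_delay = coded_bound_shape(1, 6.0, N, INV_2D)  # K = 1, T = 6 s

print(f"K=1, T=1 s: {base:.2e}")
print(f"K=6, T=1 s: {more_items:.2e}")
print(f"K=1, T=6 s: {longer_delay:.2e}")
```

Under these assumptions, a six-fold increase in item number raises the bound by considerably more than a six-fold increase in delay does, in line with the item number appearing in the exponent.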
We next consider whether the performance of an optimal encoder (given this lower bound) can be distinguished from the direct storage model based on human performance data. The two predictions differ in their dependence upon the number of independent storage channels or networks, $N$, which we do not know how to control in human behavior. Equally important, since Equation 2 provides a theoretical limit on performance, it is of interest to learn whether human behavior approximates the limit, and where it might deviate from it.
Comparison of theoretical bound with human performance
In comparing the psychophysical data to the theoretical bound on short-term memory performance, there are two unknown parameters, $1/2\mathcal{D}$ (the inverse diffusivity in each persistent activity network) and $N$ (the number of such networks), both of which scale linearly with the neural resource of neuron number. The product of these parameters corresponds to total neural resource exactly as in the direct storage case. We fit Equation 2 to human performance data, assuming as in the direct storage model that the total neural resource is fixed across all item numbers and delay durations, and setting the 100 ms delay values of the theoretical curves to their empirical values.
The resulting best fit between theory and human behavior is excellent (Figure 4E; $p$ values that the data means may occur by sampling from the model, excluding the $T=100$ ms timepoints: $0.99,0.07,0.75$ for 1 item; $0.46,0.07,0.60$ for 2 items; $0.54,0.24,0.43$ for 4 items; $0.89,0.38,0.32$ for 6; all values are larger than 0.05, most much more so. These $p$ values indicate a significantly better fit to data than obtained with the direct storage model).
If we penalize the well-coded storage model for its extra parameter compared to direct storage ($1/2\mathcal{D}$ and $N$, versus the single parameter $\mathcal{D}/N$ for the direct storage model) through the Bayesian Information Criterion (BIC), a likelihood-based hypothesis comparison test (which penalizes model parameters more stringently than the Akaike Information Criterion, AIC), the evidence remains very strongly in favor of the well-coded memory storage model compared to direct storage ($\mathrm{\Delta}\mathrm{B}\mathrm{I}\mathrm{C}\approx 99\gg 10$, where 10 is the cutoff for ‘very strong’ support) (Kass and Raftery, 1995). In fact, according to the BIC, the discrepancy in the quality of fit to the data between the models is so great that the increased parameter cost of the well-coded memory model barely perturbs the evidence in its favor. Additional statistical controls are given in the Appendix: jackknife cross-validation of the two models (Figure 3—figure supplement 1, Figure 3—figure supplement 2), exclusion of the $T=100$ ms point on the grounds that it might represent iconic memory recall rather than short-term memory (Figure 3—figure supplement 3), and redefinition of the number of items in memory to take into account the colors and orientations of the objects (Figure 3—figure supplement 4). The results are qualitatively unchanged, and the controls also do not result in large quantitative deviations in the extracted parameters (discussed below).
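To see why the extra parameter barely affects the comparison, it helps to look at the size of the BIC penalty itself. The sketch below uses the standard form $\mathrm{BIC}=-2\ln L+k\ln n$ and assumes Gaussian errors of known variance, so that $-2\ln L$ equals the weighted sum of squared residuals ($\chi^2$) up to a constant. The number of fitted data points, $n=12$, is an assumption made here (four item numbers times three delays, with the 100 ms points excluded); the function name is hypothetical.

```python
import math

def bic(chi2, n_params, n_points):
    """BIC up to an additive constant, assuming Gaussian errors of known variance."""
    return chi2 + n_params * math.log(n_points)

n_points = 12  # assumption: 4 item numbers x 3 delays, 100 ms points excluded

# Penalty difference between a two-parameter and a one-parameter model
# when both fit equally well (same chi-squared).
extra_penalty = bic(0.0, 2, n_points) - bic(0.0, 1, n_points)
print(f"BIC penalty for one extra parameter: {extra_penalty:.2f}")
```

Under these assumptions the penalty for the second parameter is $\ln 12\approx 2.5$, which is small next to the reported $\mathrm{\Delta}\mathrm{BIC}\approx 99$.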
The two-dimensional parameter space for fitting the theory to the data contains a one-dimensional manifold of reasonable solutions, Figure 4A (dark blue valley), most of which provide better fits to the data than the direct storage model. Some of these different fits to the data are shown in Figure 4B. At large values of $N$, the manifold is roughly a hyperbola in $\mathrm{log}N$ and $\mathrm{log}(1/2\mathcal{D})$, suggesting that the logarithms of the two neural resource parameters can roughly trade off with each other; indeed, the total resource use in the one-dimensional solution valley is roughly constant at large $N$, Figure 4C (gray curves). However, at smaller $N$, the resource use drops with increasing $N$. The fits are not equally good along the valley of reasonable solutions, and the best fit lies near $N=5$ independent networks or channels (for jackknife cross-validation fits, see Figure 3—figure supplement 1, Figure 3—figure supplement 2; the best fits for the coded model can be closer to $N=10$, so the figure obtained for the number of memory networks should be taken as an order-of-magnitude estimate rather than an exact value). Resource use in the valley declines with increasing $N$ to its asymptotic constant value (thus larger $N$ would yield bigger representational efficiencies); however, by $N=5$, resource use is already close to its final asymptotic value, so the gains of increasing the number of separate memory networks beyond $N=5$ to $10$ diminish. The theory also provides good fits to individual subject performance for all ten subjects, using parameter values within a factor of 10 (and usually much less than a factor of 10) of each other (see Appendix).
Comparison of neural resource use in direct and well-coded storage models of memory
Finally, we compare the neural resources required for storage in the (best-fit) direct storage model with those required in the well-coded storage model. We quantify the neural resources required for well-coded storage as the product of the number of networks $N$ with the inverse diffusive coefficient $1/2\mathcal{D}$. This is proportional to the number of neurons required to implement storage. To replicate human behavior, coded storage requires resources totaling $N/2\mathcal{D}\approx 32$ (in units of seconds) for $N=5$, and $N/2\mathcal{D}\approx 22$ (s) for $N=10$, corresponding to the parameter settings for the fits in Figures 4C and 5B (center), respectively. By contrast, uncoded storage requires a 40-fold increase in $N$ or a 40-fold decrease in the diffusive growth rate in squared error, $2\mathcal{D}$, per network (or a corresponding increase in the product, $N/2\mathcal{D}$), because $N/2\mathcal{D}\approx 1215$ (s) under direct storage, to produce the best-fit result of Figure 3A. Thus, well-coded storage requires substantially fewer resources in the persistent activity networks for similar performance (assuming best fits of each produce similar performance). Equivalently, a memory system with good encoding can achieve substantially better performance with the same total storage resources than if information were directly stored in persistent activity networks.
This result on the disparity in resource use between uncoded and coded information storage is an illustration of the power of strong errorcorrecting codes. Confronted with the prospect of imperfect information channels, finitely many resources, and the need to store or transmit information faithfully, one may take two different paths.
The first option is to split the total resources into $K$ storage bins, into which the $K$ variables are stored; when there are more variables, there are more bins and each variable receives a smaller bin. The other is to store $N$ quantities in $N$ bins regardless of $K$, by splitting each of the $K$ variables into $N$ pieces and assigning a piece from each of the different variables to one bin; when there are more variables, each variable gets a smaller piece of the bin. In the former approach, which is similar to the direct storage scenario, increasing $N$ would lead to improvements in the fidelity of each of the $K$ channels, Figure 4D. In the latter approach, which is the strong coding strategy, increasing $N$ would increase the number of channels while keeping their fidelity fixed, Figure 4B. The latter ultimately yields a more efficient use of the same total resources in terms of the final quality of performance, especially for larger values of $N$, at least without considering the cost of the encoding and decoding steps.
If we hold the total resource $N/2\mathcal{D}\propto NM$ fixed, the lowest achievable MSE (Equation 2 ) in the wellcoded memory model is reached for maximally large $N$ and thus maximally large $\mathcal{D}$. However, human memory performance appears to be bestfit by $N\sim 10$. It is not clear, if our model does capture the basic architecture of the human memory system, why the memory system might operate in a regime of relatively small $N$. First, note that for increasing $N$, the total resource cost by $N=10$ is already down to within 10$\%$ of the minimum resource cost reached at much larger $N$. Second, note that the theory is derived under the ‘diffusive’ memory storage assumption: that within a storage network, information loss is diffusive. Thus, the assumption implicitly made while varying the parameter $N$ in Figure 4C is that as the number of networks ($N$) is increased, the diffusivity $\mathcal{D}\propto 1/M$ per network will simply increase in proportion to keep $NM$ fixed. However, the dynamics of persistent activity networks do not remain purely diffusive once the resource per network drops below a certain level: a new kind of nondiffusive error can start to become important (Schwab DJ & Fiete I (in preparation)). In this regime, the effective diffusivity in the network can grow much faster than the inverse network size. The nondiffusive errors produce large, nonlocal errors (which may be consistent with ‘pure guessing’ or ‘sudden death’ errors sometimes reported in memory psychophysics [Zhang and Luck, 2009]). It is possible that the memory networks operate in a regime where each channel (memory network) is allocated enough resources to mostly avoid nondiffusive errors, and this limits the number of networks.
Discussion
Key contributions
We have provided a fundamental lowerbound on the error of recall in shortterm memory as a function of item number and storage duration, if information is stored in graded persistent activity networks (our noisy channels). This bound on performance with an underlying graded persistent activity mechanism provides a reference point for comparison with human performance regardless of whether the brain employs strong encoding and decoding processes in its memory systems. The comparison can yield insights into the strategies the brain does employ.
Next, we used empirical data from analog measurements of memory error as a function of both temporal delay and the number of stored items. Using results from the theory of diffusion on continuous attractor manifolds in neural networks, we derived an expression for memory performance if the memorized variables were stored directly in graded persistent activity networks. The resulting predictions did not match human performance. The mismatch invites further investigation into whether and how directstorage models can be modified to account for real memory performance.
Finally, we found that the bound from theory provided an (unexpectedly) good match to human performance, Figure 4. We are not privy to the actual values of the parameters $N,1/2\mathcal{D}$ in the brain and it is possible the brain uses a value of, to take an arbitrary example, $\approx 5\times N$ to achieve a performance reached with $N$ in Equation 2, which would be (quantitatively) ‘suboptimal’. Nevertheless, the possibility that the brain might perform qualitatively according to the functional form of the theoretical bound is highly nontrivial: As we have seen, the addition of appropriate encoding and decoding systems can change the degradation in accuracy from scaling polynomially ($\sim 1/N$) in the number of neurons, as in direct storage, to scaling exponentially ($\sim {e}^{-\alpha N}$ for some $\alpha >0$). This is a startling possibility that requires more rigorous examination in future work.
Are neural representations consistent with exponentially strong codes?
Typical population codes for analog variables, as presently understood, exhibit linear gains in performance with $N$; such codes involve neurons with singlebump or ramplike tuning curves that are offset or scaled copies of one another. For related reasons, persistent activity networks with such tuning curves also exhibit linear gains in memory performance with $N$ (Burak and Fiete, 2012). These ‘classical population codes’ are ubiquitous in the sensory and motor peripheries as well as some cognitive areas. So far, the only example of an analog neural code known in principle to be capable of exponential scaling with $N$ is the periodic, multiscale code for location in grid cells of the mammalian entorhinal cortex (Hafting et al., 2005; Sreenivasan and Fiete, 2011; Mathis et al., 2012) : with this code, animals can represent an exponentially large set of distinct locations at a fixed local spatial resolution using linearly many neurons (Fiete et al., 2008; Sreenivasan and Fiete, 2011).
A literal analogy with grid cells would imply that all such codes should look periodic as a function of the represented variable, with a range of periods. A more general view is that the exponential capacity of the grid cell code results from two related features: First, no one group of grid cells with a common spatial tuning period carries full information about the coded variable (the spatial location of the animal) – location cannot be uniquely specified by the spatially periodic group response even in the absence of any noise. Second, the partial location information in different groups is independent because of the distinct spatial periods across groups (Sreenivasan and Fiete, 2011). In this more general view, strong codes need not be periodic, but there should be multiple populations that encode different, independent ‘parts’ of the same variable, which would be manifest as different subpopulations with diverse tuning profiles, and mixed selectivity to multiple variables.
It remains to be seen whether neural representations for shortterm visual memory are consistent with strong codes. Intriguingly, neural responses for shortterm memory are diverse and do not exhibit tuning that is as simple or uniform as typical for classical population codes (Miller et al., 1996; Fuster and Alexander, 1971; Romo et al., 1999; Wang, 2001; Funahashi, 2006; Fuster and Jervey, 1981; Rigotti et al., 2013). An interesting prediction of the wellcoded model, amenable to experimental testing, is that the representation within a memory channel must be in an optimized format, and that this format is not necessarily the same format that information was initially presented in. The brain would have to perform a transformation from stimulusspace into a wellcoded form, and one might expect to observe this transition of the representation at encoding. (See, e.g., recent works (Murray et al., 2017; Spaak et al., 2017), which show the existence of complex and heterogeneous dynamic transformations in primate prefrontal cortex during working memory tasks.) The less orthogonal the original stimulus space is to noise during storage and the more optimized the code for storage to resist degradation, the more different the mnemonic code will be from the sampleevoked signal. Studies that attempt to decode a stimulus from delayperiod neural or BOLD activity on the basis of tuning curves obtained from the stimulusevoked period are wellsuited to test this question (Zarahn et al., 1999; Courtney et al., 1997; Pessoa et al., 2002; Jha and McCarthy, 2000; Miller et al., 1996; Baeg et al., 2003; Meyers et al., 2008; Stokes et al., 2013) : If it is possible to use early stimulusevoked responses to accurately decode the stimulus over the delayperiod (Zarahn et al., 1999; Courtney et al., 1997; Pessoa et al., 2002; Jha and McCarthy, 2000; Miller et al., 1996), it would suggest that information is not recoded for noise resistance. 
Conversely, a representation that is reshaped during the delay period relative to the stimulus-evoked response (Baeg et al., 2003; Meyers et al., 2008; Stokes et al., 2013) might support the possibility of recoding for storage.
On the other hand, the encoding and decoding steps for strong codes add considerable complexity to the storage task, and it is unclear whether these steps can be performed efficiently so that the efficiencies of these codes are not nullified by their costs. In light of our current results, it will be interesting to further probe with neurophysiological tools whether storage for shortterm visual memory is consistent with strong neural codes. With psychophysics, it will be important to compare human performance and the informationtheoretic bound in greater detail. On the theoretical side, studying the decoding complexity of exponential neural codes is a topic of ongoing work (Fiete et al., 2014; Chaudhuri and Fiete, 2015), where we find that nonsparse codes made up of a product of many constraints on small subsets of the codewords might be amenable to strong error correction through simple neural dynamics.
Relationship to existing work and questions for the future
Compared to other informationtheoretic considerations of memory (Brady et al., 2009; Sims et al., 2012), the distinguishing feature of our approach is our focus on neuron or circuitlevel noise and the fundamental limits such noise will impose on persistence.
Our theoretical framework permits the incorporation of many additional elements: Variable allocation of resources during stimulus presentation based on task complexity, perceived importance, attention, and information loading rate, may all be incorporated into the present framework. This can be achieved by modeling $1/2\mathcal{D}$ and $N$ as dependent functions (e.g. as done in [van den Berg et al., 2012; Sims et al., 2012; Elmore et al., 2011]) rather than independent parameters, and by exploiting the flexibility allowed by our model in uneven resource allocation across items in the display (Materials and methods).
The memory psychophysics literature contains evidence of more complex memory effects, including a type of response called ‘sudden death’ or pure guessing (Zhang and Luck, 2009; Anderson et al., 2011). These responses are characterized by not being localized around the true value of the cued variable, and contribute a uniform or pedestal component to the response distribution. Other studies show that these apparent pedestals may not be a separate phenomenon and can, at least in some cases, be modeled by a simple growth in the variance over a bounded (circular) variable of a unimodal response distribution that remains centered at the cue location (van den Berg et al., 2012; Bays, 2014; Ma et al., 2014). In our framework, good encoding ensures that for noise below a threshold, the decoder can recover an improved estimate of the stored variable; however, strong codes exhibit sharp threshold behavior as the noise in the channel is varied smoothly. Once the noise per channel grows beyond the threshold, socalled catastrophic or threshold errors will occur, and the errors will become nonlocal: this phenomenon will look like sudden death in the memory report. In this sense, an optimal coding and decoding framework operating on top of continuously diffusing states in memory networks is consistent with the existence of sudden death or pure guessinglike responses, even without a distinct underlying mechanistic process in the memory networks themselves. We note, however, that the fits to the data shown here were all in the belowthreshold regime.
Another complex effect in memory psychophysics is misbinding, in which one or more of the multiple features (color, orientation, size, etc.) of an item are mistakenly associated with those from another item. This work should be viewed as a model of singlefeature memory. Very recently, there have been attempts to model misbinding (Matthey et al., 2015). It may be possible to extend the present model in the direction of (Matthey et al., 2015) by imagining the memory networks to be multidimensional attractors encoding multiple features of an item.
It will be important to understand whether, in the direct coding model, modifications with plausible biological interpretations can lead to significantly better agreement with the data. From a purely curve-fitting perspective, the model requires stronger-than-linear improvement in recall accuracy with declining item number, and one might thus convert the combined resource parameter $N/\mathcal{D}$ in Equation 1 into a function that varies inversely with $K$. This step would result in a better fit, but would correspond in the direct storage model to an increased allocation of total memory resources when the task involves fewer items, an implausible modification. Alternatively, if multiple items are stored within a single persistent activity network, collision effects can limit performance for larger item numbers (Wei et al., 2012), but a quantitative result on performance as a function of delay time and item number remains to be worked out. Further examination of the types of data we have considered here, with respect to predictions that would result from a memory model dependent on direct storage of variables into persistent activity network(s), should help further the goal of linking short-term memory performance with neural network models of persistent activity.
Finally, note that our results stem from considering a specific hypothesis about the neural substrates of shortterm memory (that memory is stored in a continuum of persistent activity states) and from the assumption that forgetting in shortterm memory is undesirable but neural resources required to maintain information have a cost. It will also be interesting to consider the possibility of information storage in discrete rather than graded persistent activity states, with appropriate discretization of analog information before storage. Such storage networks will yield different bounds on memory performance than derived here (Koulakov et al., 2002; Goldman et al., 2003; Fiete et al., 2014), which should include the existence of small analog errors arising from discretization at the encoding stage, with little degradation over time because of the resistance of discrete states to noise. Also of great interest is to obtain predictions about degradation of shortterm memory in activitysilent mechanisms such as synaptic facilitation (Barak and Tsodyks, 2014; Mi et al., 2017; Stokes, 2015; Lundqvist et al., 2016). A distinct alternate perspective on the limited persistence of shortterm memory is that forgetting is a design feature that continually clears the memory buffer for future use and that limited memory allows for optimal search and computation that favors generalization instead of overfitting (Cowan, 2001). In this view, neural noise and resource constraints are not bottlenecks and there may be little imperative to optimize neural codes for greater persistence and capacity. To this end, it will be interesting to consider predictions from a theory in which limited memory is a feature, against the predictions we have presented here from the perspective that the neural system must work to avoid forgetting.
Materials and methods
Human psychophysics experiments
Ten neurologically normal subjects (age range $19$ to $35$ yr) participated in the experiment after giving informed consent. All subjects reported normal or corrected-to-normal visual acuity. Stimuli were presented at a viewing distance of $60$ cm on a $21$” CRT monitor. Each trial began with the presentation of a central fixation cross (white, ${0.8}^{\circ}$ diameter) for $500$ milliseconds, followed by a memory array consisting of $1$, $2$, $4$, or $6$ oriented bars (${2}^{\circ}\times {0.3}^{\circ}$ of visual angle) presented on a grey background on an imaginary circle (radius ${4.4}^{\circ}$) around fixation with equal inter-item distances (centre to centre). The colors of the bars in each trial were randomly selected out of eight easily distinguishable colors. The stimulus display was followed by a blank delay of $0.1,1,2$ or $3$ seconds and at the end of each sequence, recall for one of the items was tested by displaying a ‘probe’ bar of the same color with a random orientation. Subjects were instructed to rotate the probe using a response dial (Logitech Intl. SA) to match the remembered orientation of the item of the same color in the sequence, henceforth termed the target. Each of the participants performed between $11$ and $15$ blocks of $80$ trials. Each block consisted of $20$ trials for each of the $4$ possible item numbers, consisting of $5$ trials for each delay duration.
Overview of theoretical framework and key steps
Channel coding and channel rate
Consider transmitting information about $K$ scalar variables in the form of codewords of power 1 (i.e., ${\sum}_{k=1}^{K}{P}^{(k)}=1$, where ${P}^{(k)}$ is the average power allocated to encode item $k$, with the average taken over $N$ different channel uses, so that the average power actually used is $\frac{1}{N}{\sum}_{i=1}^{N}{({X}_{i}^{(k)})}^{2}\le {P}^{(k)}$). The number of channel uses, $N$, is equivalent in our memory framework to the number of parallel memory channels, each of which introduces a Gaussian white noise of variance $2\mathcal{D}T$. The rate of growth of variance of the variable stored in persistent activity networks, $2\mathcal{D}$, is derived in Burak and Fiete (2012); here, when we refer to this diffusivity, it is in dimensionless units where the variable is normalized by its range.
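The information throughput (i.e., the information rate per channel use, also known as channel rate) for such channels is bounded by (see Appendix for details):
$${\sum}_{k\in \mathcal{S}}{R}^{(k)}\le \frac{1}{2}\mathrm{log}\left(1+\frac{{\sum}_{k\in \mathcal{S}}{P}^{(k)}}{2\mathcal{D}T}\right),\qquad (3)$$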
where $\mathcal{S}$ refers to any subset of the $K$ items, $\{1,\mathrm{\cdots},K\}$. Equation 3 defines an entire region of information rates that are achievable: the total encoding power or the total channel rate, or both, may be allocated to a single item, or distributed across multiple items. Thus, the expression of Equation 3 is compatible with interpretations of memory as either a continuous or a discrete resource (van den Berg et al., 2012; Zhang and Luck, 2008). (E.g., setting ${P}^{(k)}=0$ for any $k\ge 5$ would correspond to a $4$-slot conceptualization of short-term memory. Distributing ${P}^{(k)}=1/K$ for any variable number $K$ of statistically similar items would more closely describe a continuous resource model.) For both conceptualizations, this framework would allow us to consider, if the experimental setup warranted, different allocations of power ${P}^{(k)}$ and information rates across the encoded items.
For the delayed orientation matching task considered here, all presented items have equal complexity and a priori importance, so the relevant case is ${P}^{(k)}=1/K$ for all $k=1,\mathrm{\cdots},K$, together with equal-rate allocation, ${R}^{(1)}=\mathrm{\cdots}={R}^{(K)}$, resulting in the following bound on per-item or per-feature information throughput in the noisy channel (see Appendix for more detail):
$${R}^{(k)}\le \frac{1}{2K}\mathrm{log}\left(1+\frac{1}{2\mathcal{D}T}\right).\qquad (4)$$
Next we consider how this bound on information rate in turn constrains the reconstruction error of the source variable (i.e., the $K$variable vector to be memorized, $\overrightarrow{\varphi}$).
Source coding and rate-distortion theory
At a source coder that compresses a source variable, rate-distortion theory relates the source rate to the distortion in reconstructing the source, at least for specific source distributions and specific error (distortion) metrics. For instance, if the source variables are each drawn uniformly from the interval $[0,\mathrm{\Phi}]$, then the mean-squared error in reconstructing the source, ${D}_{\mathrm{MSE}}$, is related to the source rate $R$ through the rate-distortion function (see Appendix):
Joint source-channel coding
If the source rate is set to equal the maximal channel rate of Equation 4, and we then use the expression of Equation 5 from rate-distortion theory, we obtain the predicted bound on distortion in the source variable after source coding and channel transmission. This predicted distortion bound is given in Equation 2. In general problems of information transmission through a noisy channel, it is not necessarily jointly optimal to separately derive the optimal channel rate and the optimal distortion for a given source rate, and then to set the source rate to equal the maximal channel rate; the total distortion of the source passed through the channel need not be lower-bounded by the resulting expression. However, in our case of interest the two-step procedure described above, deriving first the channel capacity and then inserting the capacity into the rate-distortion equation, yields a tight bound on distortion for the memory framework.
This concludes the basic derivation, in outline form, of the main theoretical result of the manuscript. The Supplementary Information supplies more steps and detail.
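As a rough numerical illustration of the two-step procedure outlined above, the following Python sketch combines an equal-power Gaussian channel rate with a rate-distortion bound. It is an assumption of this sketch, not a restatement of Equation 2, that the distortion bound takes the Shannon-lower-bound form for a source uniform on $[0,\mathrm{\Phi}]$, namely ${D}_{\mathrm{MSE}}\ge ({\mathrm{\Phi}}^{2}/2\pi e)\,{2}^{-2R}$ with $R$ the total number of bits per item summed over the $N$ channels. The function names are hypothetical. The values $N=10$ and $1/2\mathcal{D}=2.28$ are the well-coded parameters quoted in the text, and $\mathrm{\Phi}=1$ reflects the normalization of the variable by its range.

```python
import math


def per_item_rate_bits(n_channels, n_items, inv_two_d, delay):
    """Bits per item summed over all channels, with power 1/K per item.

    inv_two_d is 1/(2D); the per-channel noise variance is 2*D*T, so the
    signal-to-noise ratio at total power 1 is inv_two_d / delay.
    """
    snr = inv_two_d / delay
    return n_channels / (2.0 * n_items) * math.log2(1.0 + snr)


def mse_lower_bound(phi, n_channels, n_items, inv_two_d, delay):
    """Assumed Shannon lower bound on MSE for a source uniform on [0, phi]."""
    rate = per_item_rate_bits(n_channels, n_items, inv_two_d, delay)
    return phi ** 2 / (2.0 * math.pi * math.e) * 2.0 ** (-2.0 * rate)


if __name__ == "__main__":
    # Parameter values quoted in the text for the well-coded model.
    N, INV_2D, PHI = 10, 2.28, 1.0
    for delay in (1.0, 2.0, 3.0):
        for k in (1, 2, 4, 6):
            bound = mse_lower_bound(PHI, N, k, INV_2D, delay)
            print(f"T={delay:.0f} s, K={k}: MSE bound = {bound:.3e}")
```

Under these assumptions the bound simplifies to $({\mathrm{\Phi}}^{2}/2\pi e){(1+1/2\mathcal{D}T)}^{-N/K}$, which grows with both the delay $T$ and the item number $K$, in line with the qualitative behavior described in the text.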
Fitting of theory to data
In all fits of theory to data (for direct and well-coded storage), we assume that recall error at the shortest storage interval of 100 ms reflects baseline errors unrelated to the temporal loss of recall accuracy from noisy storage that is the focus of the present work. Under the assumption that this early (‘initial’) error is independent of the additional errors accrued over the storage period, it is appropriate to treat the baseline ($T=100$ ms) MSE as an additive contribution to the rest of the MSE (the variance of the sum of independent random variables is the sum of their variances). For this reason, we are justified in treating the $T=100$ ms errors as given by the data and setting these points as the initial offsets of the theory curves, which go on to explain the temporal (item-dependent) degradation of information placed in noisy storage.
The curves are fit by minimizing the summed weighted squared error of the theoretical prediction in fitting the subjectaveraged performance data over all item numbers and storage durations. The theoretical predictions are given by Equation 1 for direct storage and Equation 2 for wellcoded storage. The weights in the weighted leastsquares are the inverse SEMs for each (item, storage duration) pair. The parameters of the fit are $N/2\mathcal{D}$ (direct storage model) or $N$ and $2\mathcal{D}$ (wellcoded model). The parameter value selected is common across all item numbers and storage durations. The $p$ values given in the main paper quantify how likely the data means are to have been based on samples from a Gaussian distribution centered on the theoretical prediction.
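The fitting procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' fitting code: the function names are hypothetical, the search is a plain grid search over a single parameter, and `toy_model` is a placeholder increment model that stands in for Equation 1 or Equation 2 (neither is reproduced here). The weights are the inverse SEMs, and the baseline at the shortest delay is added as an offset, as the text describes.

```python
def weighted_sse(predicted, observed, sems):
    """Summed squared error with weights equal to the inverse SEMs."""
    return sum((p - o) ** 2 / s for p, o, s in zip(predicted, observed, sems))


def fit_one_parameter(model, conditions, observed, sems, baseline, candidates):
    """Grid search for the single parameter value that minimizes the cost.

    conditions: list of (item_number, delay) pairs.
    baseline:   dict mapping item_number to the MSE at the shortest delay,
                used as an additive offset for the theory curve.
    model:      callable(param, item_number, delay) giving the additional
                MSE accrued during storage.
    """
    best_value, best_cost = None, float("inf")
    for value in candidates:
        predicted = [baseline[k] + model(value, k, t) for k, t in conditions]
        cost = weighted_sse(predicted, observed, sems)
        if cost < best_cost:
            best_value, best_cost = value, cost
    return best_value, best_cost


def toy_model(rate, item_number, delay):
    """Placeholder increment model (hypothetical, not Equation 1 or 2)."""
    return rate * item_number * delay


if __name__ == "__main__":
    # Synthetic round-trip check: data generated from the placeholder model.
    conditions = [(k, t) for k in (1, 2, 4, 6) for t in (1.0, 2.0, 3.0)]
    baseline = {1: 0.01, 2: 0.02, 4: 0.04, 6: 0.06}
    observed = [baseline[k] + toy_model(0.5, k, t) for k, t in conditions]
    sems = [1.0] * len(conditions)
    grid = [0.1 * i for i in range(1, 11)]
    print(fit_one_parameter(toy_model, conditions, observed, sems, baseline, grid))
```

A two-parameter fit, as for $N$ and $2\mathcal{D}$ in the well-coded model, would follow the same pattern with a nested loop or a numerical optimizer in place of the one-dimensional grid.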
Model comparison with the Bayesian information criterion
The Bayesian Information Criterion (BIC) is a likelihood-based method for model comparison, with a penalty term that takes into account the number of parameters used in the candidate models. BIC is a Bayesian model comparison method, as discussed in Kass and Raftery (1995).
Given data $x$ that are (assumed to be) drawn from a distribution in the exponential family and a model $M(\overrightarrow{\theta})$ with associated parameters $\overrightarrow{\theta}$ ($\overrightarrow{\theta}$ is a vector of $k$ parameters), the BIC is given by:
$$\mathrm{BIC}=-2\,\mathrm{ln}\,\widehat{L}+k\,\mathrm{ln}\,n,$$
where $n$ is the number of observations, and $\widehat{L}$ is the likelihood of the model (with parameters $\overrightarrow{\theta}$ selected by maximum likelihood). The smaller the BIC, the better the model. The more positive the difference
$$\mathrm{\Delta}\mathrm{BIC}=\mathrm{BIC}({M}_{2})-\mathrm{BIC}({M}_{1})$$
between a pair of models ${M}_{1}({\overrightarrow{\theta}}_{1})$ and ${M}_{2}({\overrightarrow{\theta}}_{2})$ (with associated parameters ${\overrightarrow{\theta}}_{1},{\overrightarrow{\theta}}_{2}$, respectively, possibly of different dimensions ${k}_{1}\ne {k}_{2}$), the stronger the evidence for ${M}_{1}$.
To obtain the BIC for the direct and coded models, the model distributions are taken to be Gaussians whose means (for each item and delay) are given by the theoretical results of Equations 1 and 2, respectively, and whose variance is given by the empirically measured data variance across trials and subjects, computed separately per item and delay. We used the parameters $N=10,1/2\mathcal{D}=2.28$ for the well-coded storage model, and $(2\mathcal{D}/N)=3.24\times {10}^{7}$ for the direct storage model, to obtain $\mathrm{\Delta}\mathrm{B}\mathrm{I}\mathrm{C}=172.67$. The empirical response variance is computed over each trial for each subject, for a total of $n=660$ observations for each $(T,K)$ or (delay interval, item number) pair. The number of parameters is $k=1$ for direct storage and $k=2$ for well-coded storage. Setting the parameter numbers to $k=1+4$ and $k=2+4$ to take into account the 4 values of response errors at the shortest delay at $T=100$ ms does not change the $\mathrm{\Delta}\mathrm{B}\mathrm{I}\mathrm{C}$ score because the score is dominated by the likelihood term, so that these changes in the parameter penalty term have negligible effect.
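The BIC comparison described above can be sketched as follows. This is a minimal Python illustration using the standard definition $\mathrm{BIC}=-2\,\mathrm{ln}\,\widehat{L}+k\,\mathrm{ln}\,n$ and a Gaussian likelihood with a given mean and variance; the function names are hypothetical, and the numbers in the demo are arbitrary placeholders rather than values from the data.

```python
import math


def gaussian_log_likelihood(observations, mean, variance):
    """Log-likelihood of i.i.d. observations under a Gaussian model."""
    n = len(observations)
    squared = sum((x - mean) ** 2 for x in observations)
    return -0.5 * n * math.log(2.0 * math.pi * variance) - squared / (2.0 * variance)


def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: -2 ln(L) + k ln(n); smaller is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def delta_bic(loglik_1, k_1, loglik_2, k_2, n_obs):
    """BIC(M2) - BIC(M1); positive values favor model M1."""
    return bic(loglik_2, k_2, n_obs) - bic(loglik_1, k_1, n_obs)


if __name__ == "__main__":
    # Arbitrary placeholder log-likelihoods, for illustration only.
    loglik_coded, loglik_direct, n_obs = -100.0, -150.0, 660
    print(delta_bic(loglik_coded, 2, loglik_direct, 1, n_obs))
    # Adding the same 4 baseline parameters to both models:
    print(delta_bic(loglik_coded, 2 + 4, loglik_direct, 1 + 4, n_obs))
```

In this sketch, adding the same number of extra parameters to both models leaves the difference unchanged, because the penalty terms differ only through ${k}_{2}-{k}_{1}$ when $n$ is the same for both.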
Appendix
Joint source-channel coding and memory: justification and main results
Noisy information channels as a component of short-term memory systems
Noisy information channels have traditionally been used to model communication systems: in satellite or cellphone communications, the transmitted information is degraded during passage from one point to another (Shannon, 1959; Wang, 2001; Cover and Thomas, 1991). Such transmission and degradation over space is referred to as a channel use. However, noisy channels are apt descriptors of any system in which information is put in to be accessed at a different place or a different time, with loss occurring in between (Shannon, 1959; Wang, 2001; Cover and Thomas, 1991). Thus, hard drives are channels, with the main channel noise being the probability of random bit flips (from high-energy cosmic rays). Similarly, neural short-term memory systems store information and are subject to unavoidable loss because of the stochasticity of neural spiking and synaptic activation. In this sense, noise-induced loss in persistent activity networks is like passing the stored information through a noisy channel.
Channel coding
In channel coding, a message is first encoded to add redundancy, then transmitted through the noisy channel, and finally decoded at the decoder. Here, we establish the terminology and basic results from Shannon’s noisy channel coding theory (Shannon, 1959; Cover and Thomas, 1991), which are used in the main paper.
First, consider a task that involves storing or communicating a simple message, $q$, where $q$ is a uniformly distributed index taking one of $Q$ values: $q\in \{1,\mathrm{\cdots},Q\}$. The message $q$ is encoded according to a deterministic vector function (an encoding function), to generate the $N$-dimensional vector $\mathbf{x}(q)=({x}_{1}(q),{x}_{2}(q),\mathrm{\cdots},{x}_{N}(q))$, Figure 1. This is the channel-coding step. The codeword $\mathbf{x}(q)$, which is redundant, is sent through the noisy channel, which produces an output $\mathbf{y}$ according to some conditional distribution $p(\mathbf{y}|\mathbf{x})$ ($\mathbf{y}$ is an $N$-dimensional vector; the channel is specified by the distribution $p(\mathbf{y}|\mathbf{x})$). In a memoryless channel (no feedback from the decoder at the end of the channel back to the encoder at the mouth of the channel), the channel obeys
$$p(\mathbf{y}|\mathbf{x})={\prod}_{n=1}^{N}p({y}_{n}|{x}_{n}),$$
where all distributions $p({y}_{n}|{x}_{n})$ represent an identical distribution that defines the channel (Cover and Thomas, 1991). In this setup, transmission of the scalar source variable $q$ involves $N$ independent channel uses.
The decoder constructs a mapping $\mathbf{y}\to \{1,\mathrm{\cdots},Q\}$, to make an estimate $\widehat{q}$ of the received message from the channel outputs $\mathbf{y}$. If $\widehat{q}\ne q$, the decoder has made an error. The error probability is the probability that $q$ is decoded incorrectly, averaged over all $q$. This scenario, in which the message $q$ is a single number (representing one of the messages to be communicated) and the decoder receives a single number (observation) from each channel use, is referred to as point-to-point communication (Cover and Thomas, 1991).
If the decoder can correctly decode $q$, the channel communication rate (also known as the rate per channel use), which quantifies how many information bits (about $q$) are transmitted per entry of the coded message $\mathbf{x}$, is given by $R={\mathrm{log}}_{2}(Q)/N$. Shannon showed in his noisy channel coding theorem (Shannon, 1959; Cover and Thomas, 1991) that for any channel, in the limit $N\to \mathrm{\infty}$, it is possible in principle to communicate error-free through the channel at any rate up to the channel capacity $C$, defined by:
$$C=\underset{p(x)}{\mathrm{max}}\,I(X;Y),$$
where $I(X;Y)$ is the mutual information between the channel input $X$ and output $Y$, and the maximum is taken over input distributions $p(x)$.
For specific channels, it is possible to explicitly compute the channel capacity in terms of interesting parameters of the channel model and encoder; below, we will state such results for our channels of interest, for subsequent use in our theoretical analysis.
Point-to-point Gaussian channel with a power constraint
For a scalar quantity transmitted over an additive Gaussian white noise channel of variance $2\mathcal{D}T$, with an average power constraint $P$ for representing the codewords (i.e., $\frac{1}{N}{\sum}_{i=1}^{N}{{x}_{i}}^{2}\le P$), the channel capacity, or maximum rate at which information can be transmitted without error, is given by (Cover and Thomas, 1991):
$$C=\frac{1}{2}\mathrm{log}\left(1+\frac{P}{2\mathcal{D}T}\right).$$
Gaussian multiple-access channel
Next, suppose the message is itself multidimensional (of dimension $K$), so that the message is $\mathbf{\mathbf{q}}=({q}^{1},\mathrm{\cdots},{q}^{K})$. (In a memory task, these $K$ variables may correspond to different features of one item, or one feature each of multiple items, or some distribution of features and items. All features of all items are simply considered as elements of the message, appropriately ordered.)
The general framework for such a scenario is the multiple-access channel (MAC). In a MAC, separate encoders each encode one message element ${q}^{k}$ ($k=1,\mathrm{\cdots},K$), as an $N$-dimensional codeword ${\mathbf{x}}^{k}({q}^{k})$. The full message $\mathbf{q}$ is thus represented by a set of $K$ different $N$-dimensional codewords, $\mathbf{X}(\mathbf{q})=({\mathbf{x}}^{1}({q}^{1}),\mathrm{\cdots},{\mathbf{x}}^{K}({q}^{K}))$. The power of each encoder is limited to ${P}^{(k)}$ with a constraint on the summed power (we assume ${\sum}_{k=1}^{K}{P}^{(k)}\le 1$). The encoded outputs are transmitted through a channel with a single receiver at the end.
As before, we consider the channel to be Gaussian. In this Gaussian MAC model, the channel output $\mathbf{\mathbf{y}}$ is a single $N$dimensional vector, like the output in the pointtopoint communication case (Cover and Thomas, 1991). The MAC channel is defined by the distribution $p(\mathbf{\mathbf{y}}\mathbf{\mathbf{X}})=p(\mathbf{\mathbf{y}}{\mathbf{\mathbf{x}}}^{1},\mathrm{\cdots},{\mathbf{\mathbf{x}}}^{K})$. For a Gaussian MAC, $p(\mathbf{\mathbf{y}}\mathbf{\mathbf{X}})$ is a Gaussian distribution with mean equal to ${\sum}_{k=1}^{K}{\mathbf{\mathbf{x}}}^{k}$ and variance equal to the noise variance. The decoder is tasked with reconstructing all $K$ elements of $\mathbf{\mathbf{q}}$ from the $N$dimensional $\mathbf{\mathbf{y}}$.
The probability of error is defined as the average probability of error across all $K$ entries of the message. The fundamental limit on information transmission over the MAC is not a single number, but a region in a $K$dimensional space: It is possible to allocate power and thus rates differentially to different entries of the message $\mathbf{\mathbf{q}}$, and information capacity varies based on allocation. Through Shannon’s channel coding theorem, the region of achievable information rates for the Gaussian MAC with noise variance $2\mathcal{D}T$ is given by:
where $\mathcal{S}$ refers to any subset of $\{1,\mathrm{\cdots},K\}$, and we represent the summed rate for a given $\mathcal{S}$ as ${R}^{\mathcal{S}}={\sum}_{k\in \mathcal{S}}{R}^{(k)}$. In memory tasks, we assume the total power constraint is constant, regardless of the number of items, and $K$ corresponds to the number of items. Thus, power allocation per item will generally vary (decrease) with item number.
To summarize, we have a fundamental limit on information transmission rates in a Gaussian multiple-access channel as described above.
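The rate region of the Gaussian MAC can be checked numerically. The following is a minimal sketch, assuming the standard subset constraints (the summed rate over any subset of encoders is at most $\frac{1}{2}\log_2$ of one plus that subset's summed power divided by the noise variance); the function name and the example numbers are hypothetical.

```python
import math
from itertools import combinations


def in_gaussian_mac_region(rates, powers, noise_variance, tol=1e-12):
    """Return True if the rate tuple satisfies every subset constraint."""
    if len(rates) != len(powers):
        raise ValueError("rates and powers must have the same length")
    n_encoders = len(rates)
    for size in range(1, n_encoders + 1):
        for subset in combinations(range(n_encoders), size):
            summed_rate = sum(rates[k] for k in subset)
            summed_power = sum(powers[k] for k in subset)
            bound = 0.5 * math.log2(1.0 + summed_power / noise_variance)
            if summed_rate > bound + tol:
                return False
    return True


if __name__ == "__main__":
    K = 3
    noise_variance = 0.5          # stands in for 2*D*T; illustrative value
    powers = [1.0 / K] * K        # unit total power split equally
    sum_capacity = 0.5 * math.log2(1.0 + sum(powers) / noise_variance)
    equal_rates = [sum_capacity / K] * K
    print(in_gaussian_mac_region(equal_rates, powers, noise_variance))
    print(in_gaussian_mac_region([0.5, 0.2, 0.2], powers, noise_variance))
```

The first call tests the equal-rate point on the sum-capacity boundary; the second gives one encoder more rate than its own power allows, so it falls outside the region.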
Capacity of a Gaussian MAC with equal per-item rate equals point-to-point channel capacity
The summed information rate through a Gaussian MAC is maximized when the per-item rate is equal across items. Moreover, at this equal-rate per-item point, the Gaussian MAC model corresponds directly to a point-to-point Gaussian (AWGN) channel coding model, where the channel input has an average power constraint $P=\sum_{k}P^{(k)}$, with $P^{(k)}$ the power constraint on the channel input of the $k$th encoder of the original Gaussian MAC model. In this equivalent AWGN model, a single encoder is responsible for transmitting all $K$ message elements, dividing the point-to-point channel capacity equally among them. The maximum information rate in a point-to-point AWGN channel is $(1/2)\log(1+\mathrm{SNR})$, and therefore the information rate per item, if the rate is divided evenly over all $K$ items, is $R^{(k)}=(1/2K)\log(1+\mathrm{SNR})$. This capacity can be achieved by setting the input of the AWGN point-to-point channel to the $N$-dimensional vector $\mathbf{x}=\sum_{k=1}^{K}\mathbf{x}^{k}(q^{k})$, where $\mathbf{x}^{k}(q^{k})$ are the $K$ vectors of length $N$ generated by the encoders of the Gaussian MAC. The $i$th component of $\mathbf{x}$ is $x_{i}=\sum_{k=1}^{K}x_{i}^{k}(q^{k})$, where $x_{i}^{k}(q^{k})$ is the $i$th element of the vector $\mathbf{x}^{k}$ that encodes the message element $q^{k}$. Therefore, $x_{i}$ contains information about all components of the message (a joint representation of message elements).
Comparing the expression for the Gaussian MAC information rate with the capacity result from the corresponding point-to-point Gaussian channel, $R^{(k)}=(1/2K)\log(1+\mathrm{SNR})$, it is clear that the equal-rate per-item Gaussian MAC can achieve the same (optimal) information rate per item as the point-to-point AWGN channel.
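The joint representation $x_i=\sum_k x_i^k(q^k)$ can be sketched as follows. The random Gaussian codebooks are an assumption made purely for illustration (the text does not specify an encoder), and the codeword length is taken large only so that the empirical power estimate is stable; all names are hypothetical.

```python
import random


def make_codebook(n_messages, n_dims, power, rng):
    """One encoder: each message index maps to a random codeword of the given power."""
    scale = power ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_dims)] for _ in range(n_messages)]


def joint_codeword(codebooks, message):
    """Superpose the K per-item codewords: x_i = sum_k x_i^k(q^k)."""
    n_dims = len(codebooks[0][0])
    return [
        sum(codebook[q][i] for codebook, q in zip(codebooks, message))
        for i in range(n_dims)
    ]


def average_power(codeword):
    return sum(value * value for value in codeword) / len(codeword)


rng = random.Random(0)            # fixed seed for reproducibility
K, Q, N = 4, 4, 5000              # items, messages per item, codeword length
powers = [1.0 / K] * K            # per-encoder powers summing to 1
codebooks = [make_codebook(Q, N, p, rng) for p in powers]
x = joint_codeword(codebooks, message=(3, 0, 2, 1))
print(f"average power of joint codeword: {average_power(x):.3f}")
```

With independent zero-mean codewords, the power of the superposed codeword is close to the summed power $\sum_k P^{(k)}$, which is the power constraint of the equivalent point-to-point channel.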
Figure 4B of our main manuscript may be viewed as depicting the AWGN point-to-point channel, with a scalar input $x_{i}$ to each of the $N$ memory networks (AWGN channels). It is interesting to note that both the AWGN channel and Gaussian MAC models suggest that the brain might encode distinct items independently but then store them jointly.
Point-to-point communication through a Gaussian channel with a peak amplitude constraint
Suppose the codewords are amplitude-limited, rather than collectively power-limited, so that each element satisfies $|x_{i}|\le A$ for some amplitude $A$. If we consider each entry of the codeword as being stored in a persistent activity network, then the maximal range of each codeword entry is constrained, rather than just the average power across entries. In this sense, amplitude-constrained channels may be more apt descriptors than power-constrained channels.
For comparison with the capacity of a Gaussian channel with a power constraint $P$, we set without loss of generality $A=\sqrt{P}$. Then, for a scalar quantity transmitted with this amplitude constraint over an additive Gaussian white noise channel of variance $2\mathcal{D}T$, the channel capacity is similar to that of the power-constrained Gaussian channel, but with the cost of a modest multiplicative prefactor $c$ that is smaller than, but close to, 1 (Softky and Koch, 1993; Raginsky, 2008):
$$C_{\mathrm{amp}}=\frac{c}{2}\log_{2}\left(1+\frac{P}{2\mathcal{D}T}\right).$$
If the SNR ($=\frac{P}{2\mathcal{D}T}$) is such that $\sqrt{\mathrm{SNR}}<1.05$, then $c\in [0.8,1]$ (Raginsky, 2008). Therefore, the channel capacity of the amplitude-constrained Gaussian channel can be 80% or more of the channel capacity of the corresponding power-constrained Gaussian channel. In any case, the power-constrained Gaussian channel capacity expression is a good upper bound on the capacity of the amplitude-constrained version of that channel.
Joint source-channel coding
In memory experiments, it is not possible to directly measure information throughput in the internal storage networks. Rather, a related quantity that can be measured, and is thus the quantity of interest, is the accuracy of recall. In this section, we describe how the general bound on information throughput in the storage networks, derived in the previous section, can be used to strictly upper-bound the accuracy of recall in a specific class of memory tasks.
Consider a task that involves storing or communicating a variable $\varphi $. This variable is known as the information source. The information source may be analog or discrete, and uniform or not. To remove redundancies in the source distribution, or possibly to compress the inputs even further (at the cost of losing information), the source may be passed through a source-coding step. (For instance, the real interval $[-1,1]$ can be compressed through binary quantization into one bit by assigning the subinterval $[-1,0]$ to the point $0$, and $[0,1]$ to $1$, at the expense of precision.) The output of the source coder is known as the message, which was the assumed input to the noisy channel in the sections above. The message is a uniformly distributed index $q$, taking one of $Q$ values, $q\in \{1,\cdots,Q\}$. The source rate is the number of bits allocated per source symbol, or ${\log}_{2}(Q)$.
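The binary quantization example can be made concrete. The sketch below is a hedged illustration: it assumes a uniform source on the interval, a uniform $Q$-level quantizer, and reconstruction at the centre of each cell (none of which the text prescribes), and it estimates the resulting mean squared error on a deterministic grid of source values.

```python
import math


def quantize(value, lo, hi, levels):
    """Source coding: map a value in [lo, hi] to an index in {0, ..., levels - 1}."""
    width = (hi - lo) / levels
    index = int((value - lo) / width)
    return min(max(index, 0), levels - 1)


def reconstruct(index, lo, hi, levels):
    """Decode an index to the centre of its quantization cell."""
    width = (hi - lo) / levels
    return lo + (index + 0.5) * width


def quantizer_mse(lo, hi, levels, samples=100_000):
    """Mean squared error for a uniform source, evaluated on an evenly spaced grid."""
    total = 0.0
    for i in range(samples):
        value = lo + (i + 0.5) * (hi - lo) / samples
        estimate = reconstruct(quantize(value, lo, hi, levels), lo, hi, levels)
        total += (value - estimate) ** 2
    return total / samples


if __name__ == "__main__":
    for levels in (2, 4, 8):
        rate_bits = math.log2(levels)            # source rate log2(Q)
        mse = quantizer_mse(-1.0, 1.0, levels)
        print(f"Q = {levels}: rate = {rate_bits:.0f} bit(s), MSE = {mse:.5f}")
```

For this simple quantizer the error of a source of range $\Phi$ is $(\Phi/Q)^2/12=\Phi^2 2^{-2R}/12$, so each additional bit of source rate reduces the squared error fourfold.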
For discrete, memoryless point-to-point Gaussian channels, Shannon’s separation theorem (Shannon, 1959; Cover and Thomas, 1991) holds, which means that to obtain minimal distortion of a source variable that must be communicated through a noisy channel, it is optimal to separately compute the channel information rate, then set the source rate to equal the channel rate. Rate-distortion theory from source coding will then specify the lower bound on distortion with this scheme. Because the separation theorem holds for the point-to-point AWGN channel considered above, and because the point-to-point AWGN rate equals the maximal summed MAC rate, we can apply the separation theorem to our memory framework and then use rate-distortion theory to compute the lower bound on distortion.
To minimize distortion according to the separation theorem, we therefore set the source rate ${\log}_{2}(Q)$ to equal the maximum number of bits that may be transmitted error-free over the channel. With this choice, all messages are transmitted without error in the channel. Then, we apply rate-distortion theory to determine the minimum distortion achievable for the allocated source rate. For a given source rate allocation, the distortion depends on several factors: the statistics of the source (e.g. whether it is uniform, Gaussian, etc.), the source coding scheme, and the distortion measure (e.g. mean absolute error (an L1 norm), mean squared error (an L2 norm), or another metric that quantifies the difference between the true source and its estimate). Closed-form expressions for minimum achievable distortion do not exist for arbitrary sources and distortion metrics, but crucially, there are some useful bounds on specific distortion measures including the mean squared error, which is our focus.
Mean squared error (MSE) distortion
For arbitrary source distributions, the relationship between source rate ($R$ bits per source symbol) and minimum MSE distortion ($D_{\mathrm{MSE}}(R)$) at that rate is given by:
$$h(\varphi )-\frac{1}{2}\log\left(2\pi e\,D_{\mathrm{MSE}}(R)\right)\le R\le \frac{1}{2}\log\left(\frac{\sigma_{\varphi}^{2}}{D_{\mathrm{MSE}}(R)}\right),$$
where $h\left(\varphi \right)$ is the differential entropy of the source, ${\sigma}_{\varphi}^{2}$ is the variance of the source, and $\log$ is in base 2. The inequality on the right is saturated (becomes an equality) for a Gaussian source (Cover and Thomas, 1991). The inequality on the left is the Shannon Lower Bound (Sims et al., 2012) on MSE distortion for arbitrary memoryless sources, and it, too, is saturated for a Gaussian source (Cover and Thomas, 1991).
Specializing the above expression to a uniform source over the interval $[0,\Phi]$, we have $h(\varphi )=\log(\Phi)$, and ${\sigma}_{\varphi}^{2}={\Phi}^{2}/12$. Thus, we obtain
$$\frac{1}{2}\log\left(\frac{\Phi^{2}}{2\pi e\,D_{\mathrm{MSE}}(R)}\right)\le R\le \frac{1}{2}\log\left(\frac{\Phi^{2}}{12\,D_{\mathrm{MSE}}(R)}\right).$$
Inverting the inequalities above to obtain bounds on the MSE distortion, we have
$$\frac{\Phi^{2}}{2\pi e}\,2^{-2R}\le D_{\mathrm{MSE}}(R)\le \frac{\Phi^{2}}{12}\,2^{-2R}.$$
Note that the upper and lower bounds are identical in form (proportional to ${\Phi}^{2}{2}^{-2R}$) up to a constant prefactor that lies in $[1/2\pi e,1/12]$. Thus, the lower bound on distortion is given by
$$D_{\mathrm{MSE}}(R)=\alpha_{\mathrm{MSE}}\,\frac{\Phi^{2}}{2\pi e}\,2^{-2R},$$
where ${\alpha}_{\mathrm{MSE}}$ is an unknown constant of size about $1$, somewhere in the range $[1,2\pi e/12]$.
Now, we set the information rate $R$ for the source (bits per source symbol) in the equation above to match the maximum rate for error-free transmission in the noisy storage information channel. The maximum number of bits that can be stored error-free is $N$ times the channel capacity given in Equation 4, because Equation 4 represents the information capacity for each channel use, and each of the $N$ storage networks represents one channel use. Thus, we have $R=N{R}^{(k)}(T)$, where ${R}^{(k)}(T)$ is given in Equation 4 (with unit total power, ${R}^{(k)}(T)=\frac{1}{2K}\log_{2}\left(1+\frac{1}{2\mathcal{D}T}\right)$), and the minimum MSE distortion is:
$$D_{\mathrm{MSE}}=\alpha_{\mathrm{MSE}}\,\frac{\Phi^{2}}{2\pi e}\left(1+\frac{1}{2\mathcal{D}T}\right)^{-N/K}.$$
Because we are interested in the lower bound on error, we set ${\alpha}_{\mathrm{MSE}}$ to the lower bound of its range, ${\alpha}_{\mathrm{MSE}}=1$, so that we obtain the expression given in the main paper (Equation 2):
$$D_{\mathrm{MSE}}\ge \frac{\Phi^{2}}{2\pi e}\left(1+\frac{1}{2\mathcal{D}T}\right)^{-N/K}.$$
Indeed, any other choice of ${\alpha}_{\mathrm{MSE}}$ within its range $[1,2\pi e/12]$ does not qualitatively affect our subsequent results in the main paper.
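The steps of this derivation can be traced numerically. The sketch below is a minimal illustration under the assumptions stated in the text (uniform source of range $\Phi$, unit total power so that $\mathrm{SNR}=1/2\mathcal{D}T$, and $R=NR^{(k)}(T)$ with $R^{(k)}(T)=\frac{1}{2K}\log_2(1+\mathrm{SNR})$); the function name and parameter values are hypothetical and are not the fitted values.

```python
import math


def coded_mse_lower_bound(delay, n_items, n_networks, diffusivity,
                          phi=1.0, alpha_mse=1.0):
    """MSE bound for well-coded storage of K items in N networks for time T."""
    snr = 1.0 / (2.0 * diffusivity * delay)                  # P / (2 D T), P = 1
    rate_per_item = math.log2(1.0 + snr) / (2.0 * n_items)   # bits per channel use
    total_rate = n_networks * rate_per_item                  # R = N * R^(k)(T)
    prefactor = alpha_mse * phi ** 2 / (2.0 * math.pi * math.e)
    return prefactor * 2.0 ** (-2.0 * total_rate)


if __name__ == "__main__":
    diffusivity, n_networks = 0.05, 8    # illustrative values only
    for n_items in (1, 2, 4):
        for delay in (1.0, 3.0):
            bound = coded_mse_lower_bound(delay, n_items, n_networks, diffusivity)
            print(f"K = {n_items}, T = {delay:.0f} s: D_MSE >= {bound:.3e}")
```

Setting `alpha_mse` anywhere in $[1, 2\pi e/12]$ rescales the bound by a constant of order one without changing its dependence on $T$, $K$, or $N$.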
To summarize, we derived the bound given in Equation 16 by separately combining two different bounds: the lower bound on achievable distortion at a source for a given source rate, and the upper bound on information throughput in a noisy information channel. This combination of two separate bounds, where each bound did not take into account the statistics of the other process (the source bound was computed independently of the channel and the channel bound independently of the source), is in general suboptimal. It is tight (optimal) in this case only because the uniform source and Gaussian channel obey the conditions of Shannon’s separation theorem, also known as the joint source-channel coding theorem (Cover and Thomas, 1991; Wang, 2001; MacKay, 2002; Shannon, 1959; Viterbi and Omura, 1979).
Bound on recall accuracy for amplitude-constrained channels
As noted in Section 2 of the Appendix, the power-constrained channel capacity is an upper bound for the amplitude-constrained channel capacity (amplitude $A=\sqrt{P}$). It follows that the lower bound on distortion for power-constrained channels, Equation 16, is also a lower bound on distortion for the amplitude-constrained channel. Further, because the channel capacity of an amplitude-constrained Gaussian channel is of the same form as the capacity of a power-constrained Gaussian channel, with a prefactor $c$ that is close to 1, the specific expression for MSE distortion is modified to be:
$$D_{\mathrm{MSE}}\ge \frac{\Phi^{2}}{2\pi e}\left(1+\frac{1}{2\mathcal{D}T}\right)^{-cN/K}.$$
Because $N$ is a free parameter of the theory, we may simply renormalize $cN$ to equal $N$. Thus, the theoretical prediction obtained for a power-constrained channel is the same in functional form as that for an amplitude-constrained channel.
In comparing the theoretical prediction against the predictions of direct storage in persistent activity networks, however, we should take into account the factor $c$, noting that producing an effective value of $N$ requires $N/c$ networks, which is greater than $N$ because $c<1$.
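The renormalization of $cN$ to $N$ can be checked with a short sketch. It assumes that the amplitude constraint enters the bound only by multiplying the exponent $N/K$ by the capacity prefactor $c$; the function names and numbers are illustrative.

```python
import math


def coded_bound(delay, n_items, n_networks, diffusivity,
                capacity_fraction=1.0, phi=1.0):
    """MSE bound with the exponent N/K scaled by the capacity prefactor c."""
    snr = 1.0 / (2.0 * diffusivity * delay)
    exponent = capacity_fraction * n_networks / n_items
    return phi ** 2 / (2.0 * math.pi * math.e) * (1.0 + snr) ** (-exponent)


def networks_needed(n_effective, capacity_fraction):
    """Number of amplitude-constrained networks giving an effective value N."""
    return n_effective / capacity_fraction


if __name__ == "__main__":
    c = 0.8                       # lower end of the quoted range for c
    n_effective = 10
    n_actual = networks_needed(n_effective, c)
    power_bound = coded_bound(2.0, 3, n_effective, 0.1)
    amplitude_bound = coded_bound(2.0, 3, n_actual, 0.1, capacity_fraction=c)
    print(f"networks needed: {n_actual:.1f}")
    print(f"bounds match: {math.isclose(power_bound, amplitude_bound)}")
```

With $c=0.8$, matching the performance of ten power-constrained networks takes 12.5 amplitude-constrained ones, which is the correction applied when comparing resource use against direct storage.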
Non-asymptotic considerations
Many of the numerical fits in the paper involve values of $N$ that are not large: $N$ is of order 10. When transmitting information with smaller $N$, the error-free information rate is lower (Polyanskiy et al., 2010); conversely, when transmitting at rates close to capacity with smaller numbers of channel uses ($N$), there can be decoding errors. In deriving our bound on distortion from joint source-channel coding theory, we inserted the asymptotic value of the information rate (the capacity) into the rate-distortion function and assumed that information transmission at that rate would be error-free. If errors occur, the resulting distortion will be higher. It is important to note that, even far from the asymptotic limit in $N$, the derived lower bound on distortion in Equation 16 remains a strict lower bound; non-asymptotic effects can raise the overall error, not lower it.
Nevertheless, it is of interest to consider how distortion may be modified for values of $N$ that are not asymptotically large. One would write the total non-asymptotic MSE distortion (${D}_{\mathrm{MSE}}^{\sim \mathrm{asymp}}$) as the sum of terms:
$${D}_{\mathrm{MSE}}^{\sim \mathrm{asymp}}=(1-{p}_{e})\,{D}_{\mathrm{MSE}}+{p}_{e}\,{D}_{e}.$$
Here, ${D}_{\mathrm{MSE}}$ is the error-free distortion bound derived above, ${p}_{e}$ is the probability of error in the non-asymptotic regime, and ${D}_{e}$ is the distortion in case of error. If an error resulted in total loss of information about the transmitted (coded) variable, ${D}_{e}$ would scale as ${\Phi}^{2}$, independent of $N$ or other parameters in the problem. The only dependence on $N$ would then enter through the probability of error, ${p}_{e}$. The probability of error vanishes exponentially with $N$ (Polyanskiy et al., 2010), and can be small even for relatively small values of $N$. The second term is in practice a small contributor to the MSE. Alternatively, one can ask how small $N$ can be, and how far below the asymptotic capacity the rate must be, to enable information transmission at or below a given error rate. Analytical and numerical results in Polyanskiy et al. (2010) show that at SNR values lower than the estimated SNR in the memory system model ($\mathrm{SNR}=P/2\mathcal{D}T=1/2\mathcal{D}T\approx 2.2$ dB at $T=3$ sec and $\mathrm{SNR}\approx 4$ dB at $T=2$ sec; while Figure 6 in Polyanskiy et al. (2010) has $\mathrm{SNR}=0$ dB and ${p}_{e}={10}^{-3}$), it is possible to remain within a factor of $1/3$ of the asymptotic information capacity with $N<10$. Thus, the non-asymptotic expectation is that the information transmission rate should be scaled down from the asymptotically achievable information rate (the capacity) by some factor $c$ (in this case, $c\sim 3$). Thus, through Equation 15, we see that the bound on distortion will remain the same as in Equation 2 of the main manuscript, with the replacement of $N/K$ in the exponent by $N/cK$. In other words, the previous values of the fit parameter $N$ in the fits would actually correspond to $cN$. Thus, achieving a given level of performance non-asymptotically takes $c$ times more resources (where $c$ scales slowly with $1/N$) than achieving it asymptotically.
To summarize, the bound on distortion given in Equation 16 is still a strict lower bound on distortion in the regime where $N$ is not asymptotically large; moreover, the functional form of the bound can remain largely the same in the non-asymptotic regime because the error probability is small for modest $N$. In addition, it is possible to achieve a given low error probability at a fixed SNR by simply decreasing the information rate, which increases distortion in a way that is effectively the same as increasing the value of the free parameter $N$.
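A hedged sketch of the non-asymptotic decomposition follows. It assumes the total distortion is the mixture $(1-p_e)D_{\mathrm{MSE}}+p_eD_e$, and it takes $D_e=\Phi^2/6$, the mean squared difference between two independent uniform draws on $[0,\Phi]$, as one hypothetical way of modelling total loss of information; the text itself only states that $D_e$ scales as $\Phi^2$.

```python
def nonasymptotic_mse(error_free_mse, error_probability, phi=1.0, loss_constant=1.0 / 6.0):
    """Mix the error-free distortion with the distortion incurred on a decoding error."""
    if not 0.0 <= error_probability <= 1.0:
        raise ValueError("error_probability must lie in [0, 1]")
    distortion_on_error = loss_constant * phi ** 2   # D_e, assumed to scale as Phi^2
    return ((1.0 - error_probability) * error_free_mse
            + error_probability * distortion_on_error)


if __name__ == "__main__":
    error_free_mse = 0.01          # illustrative value in units of Phi^2
    for p_e in (0.0, 1e-3, 1e-2):
        total = nonasymptotic_mse(error_free_mse, p_e)
        print(f"p_e = {p_e:g}: total MSE = {total:.5f}")
```

Whenever $D_e$ exceeds the error-free distortion, the mixture can only raise the total error, consistent with the bound remaining a strict lower bound at finite $N$.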
Direct (uncoded) storage in persistent activity networks
Modeling short-term memory as direct storage of variables in persistent activity networks produces results that are inconsistent with the data, as shown in the main paper. To obtain predictions for persistence and capacity through direct storage in persistent activity networks, first consider storing a single circular orientation variable, for a single bar in the delayed orientation matching task, as a bump in one ring network (Ben-Yishai et al., 1995; Amit, 1992; Zhang, 1996). The ring network would have neurons from all the $N$ storage networks in our short-term memory system pooled together; thus, the network is $N$ times larger. The mean squared error of a variable stored in a continuous attractor neural network with stochastic neural spiking grows linearly with the storage interval $T$ over short intervals (with ‘short’ defined as all intervals before the root-mean-squared error has grown to be an appreciable fraction of the range of the variable, $2\pi $). Let $\varphi /\Phi$ be the coded variable, with $\varphi \in [0,\Phi]$. If the rate of growth of error in the individual storage networks of the main paper is $2\mathcal{D}$ (recall that $\overline{D}=D/P$, where $D$ is the coefficient of diffusion (Burak and Fiete, 2012); thus, the quantity $\overline{D}$ describes the rate at which the stored variable drifts away from its initial value, normalized by the squared range of the variable, per unit power of the representation; alternatively, we may think of the total representational power as being normalized to 1 in all cases), then the rate of growth of squared error in the single ring network is $2\mathcal{D}/N$ (Burak and Fiete, 2012). The factor of $N$ enters because, if all other quantities are held fixed, the diffusion coefficient in continuous attractor memory networks is inversely proportional to network size.
Thus, the squared error in the variable at short times $T$ is given by $\langle {(\varphi (T)-\varphi (0))}^{2}\rangle /{\Phi}^{2}=2\mathcal{D}T/N$. In other words, we have
$$D_{\mathrm{MSE}}=\langle {(\varphi (T)-\varphi (0))}^{2}\rangle ={\Phi}^{2}\,\frac{2\mathcal{D}T}{N}.$$
Next, consider storing $K$ scalar variables, with each component ranging in $[0,\Phi]$, and each represented in one of $K$ different small networks constructed from the single storage network above. Each small network thus has $1/K$ of the size of the single network. Relative to Equation 20 above, we therefore have
$$D_{\mathrm{MSE}}={\Phi}^{2}\,\frac{2K\mathcal{D}T}{N}.$$
In other words, for memory systems involving direct storage in persistent activity networks without special encoding, we expect the squared error to grow linearly with $K$ and $T$. The prediction of uncoded storage in persistent activity networks can be compared directly with the prediction from encoded storage (Equation 2), because they involve the same parameters and the same resource use in the memory networks. While adding a proper encoding stage can reduce storage errors exponentially in $N$, uncoded storage results in decreases with $N$ that are merely polynomial (more specifically, scaling as ${N}^{-1}$).
Finally, one may consider directly storing the $K$-dimensional variable in a single persistent activity network that is a $K$-dimensional ring network (a $K$-torus). In this situation, the neurons have to be arranged so that the number of neurons per linear dimension of the network scales as ${N}^{1/K}$. Thus, the rate of growth of squared error along each dimension of the network scales as $2\mathcal{D}/{N}^{1/K}$, and we have
$$D_{\mathrm{MSE}}={\Phi}^{2}\,\frac{2\mathcal{D}T}{{N}^{1/K}}.$$
This scaling with $T$ remains linear, while the improvement in squared error with $N$ is weaker than the scaling in Equation 21, which in turn is weaker than the scaling in Equation 2, and consequently produces worse fits to the data than does Equation 21. Therefore, we have chosen to contrast the better of the two scenarios of direct (uncoded) storage, Equation 21, against the predictions of the theory of short-term memory proposed in this work.
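The three scalings discussed in this section can be compared side by side. The sketch below assumes the short-time expressions $\Phi^2\,2K\mathcal{D}T/N$ for $K$ separate networks and $\Phi^2\,2\mathcal{D}T/N^{1/K}$ for a $K$-torus, together with the coded bound $\frac{\Phi^2}{2\pi e}(1+1/2\mathcal{D}T)^{-N/K}$; function names and parameter values are illustrative, and the direct-storage formulas hold only while the error is a small fraction of the range.

```python
import math


def direct_mse_separate(delay, n_items, n_networks, diffusivity, phi=1.0):
    """K items, each in its own network of size N/K (error grows as K*T/N)."""
    return phi ** 2 * 2.0 * diffusivity * delay * n_items / n_networks


def direct_mse_torus(delay, n_items, n_networks, diffusivity, phi=1.0):
    """K items in one K-dimensional torus (error grows as T / N**(1/K))."""
    return phi ** 2 * 2.0 * diffusivity * delay / n_networks ** (1.0 / n_items)


def coded_mse_bound(delay, n_items, n_networks, diffusivity, phi=1.0):
    """Lower bound for well-coded storage (error falls exponentially in N/K)."""
    snr = 1.0 / (2.0 * diffusivity * delay)
    return phi ** 2 / (2.0 * math.pi * math.e) * (1.0 + snr) ** (-n_networks / n_items)


if __name__ == "__main__":
    delay, n_items, diffusivity = 2.0, 2, 0.05    # illustrative values only
    for n_networks in (4, 8, 16):
        print(
            f"N = {n_networks:2d}: "
            f"separate = {direct_mse_separate(delay, n_items, n_networks, diffusivity):.4f}, "
            f"torus = {direct_mse_torus(delay, n_items, n_networks, diffusivity):.4f}, "
            f"coded >= {coded_mse_bound(delay, n_items, n_networks, diffusivity):.2e}"
        )
```

Doubling $N$ halves the error of the separate-network scheme, reduces the torus error by only a factor $2^{1/K}$, and multiplies the coded bound by $(1+\mathrm{SNR})^{-N/K}$.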
Comparison of direct storage against coded storage in power- or amplitude-constrained channels
In the main text, we compared not only how the predictions of coded versus direct storage compare with each other as a function of $T$ and $K$, but also the total resource use needed to achieve a given performance with the two different models of storage. In the latter comparison, we derive the total neural resource, $N/2\mathcal{D}$, required in the two schemes. We report that direct storage requires a $\sim 40$-fold larger $N/2\mathcal{D}$ than coded storage, basing our results on the expression for coded storage in power-constrained channels. As noted in Section 3 of the Appendix, the effective $N$ for an amplitude-constrained channel, which might be a more apt constraint for persistent activity networks with bounded ranges, is actually $N/c$, where $c$ is a prefactor close to but smaller than 1 that represents the fractional loss in channel capacity incurred by enforcing an amplitude rather than a power constraint. As described in Raginsky (2008) (see also related work in Softky and Koch, 1993), the cost of replacing a power constraint by an amplitude constraint is modest, with $c\in [0.8,1]$ for an appropriate regime of channel SNR (this is the regime of SNR for our fits to the data). Thus, even with an amplitude constraint for the coded memory scenario, direct storage would require a $\sim 30$-fold larger $N/2\mathcal{D}$.
Performance of individual subjects and comparison with theory
Here, we supply the data from individual subjects, as well as fits of the theory of Equation 2 and the direct storage model 1 to their performance.
The individual subject responses and the fits of the well-coded storage model are shown in Figure 4—figure supplement 1. We first plot the quality-of-fit or energy surface of the fits of the well-coded model to the individual subject data (top two rows in Figure 4—figure supplement 1), as the two parameters of the model are varied. These individual-subject solution spaces look qualitatively similar to the across-subject aggregates reported in the main manuscript. All subjects exhibit a 1D manifold of ‘good’ parameter settings, along which the model provides a reasonable match to the data. The quality of fit along the 1D manifold (valley) is shown in the next two rows of Figure 4—figure supplement 1; based on the local minima of these curves, we infer the optimal settings of $N$ and $1/2\mathcal{D}$ for each subject. The differences between individuals emerge in that the best $N$ values range between 2 and 20, although for most subjects the best values range between 4 and 11. Subjects whose optimal $N$ deviates from this narrower range have essentially flat valleys between $N=2$ and $N=20$ (Figure 4—figure supplement 1), and thus the choice of $N$ is not strongly constrained.
The minimum fit errors are necessarily larger than the minimum fit errors for the across-subject averaged data, because of the higher variability of individual subject data (fewer trials per subject than total trials across subjects). Nevertheless, the normalized squared errors of the fits can be quite low, and the theory provides good fits to the psychophysics data for the individual subjects.
We also fit the individual subject data to the direct storage models, to be able to compare the predictions from the two models (Figure 4—figure supplement 2). We then compute the Bayesian Information Criterion score for both the direct storage model and the well-coded storage model, and report the $\Delta \mathrm{BIC}$ score for hypothesis comparison (Figure 4—figure supplement 2). Positive (negative) $\Delta \mathrm{BIC}$ scores indicate support for the well-coded (direct) storage model, and an absolute value of 10 or greater indicates very strong support. Note that the $\Delta \mathrm{BIC}$ scores for the individual subjects are much smaller in magnitude than the aggregate scores for all pooled data in the main manuscript, because the data set for individual subjects is smaller and has less statistical strength. Nevertheless, there is very strong support ($\Delta \mathrm{BIC}>10$) for the well-coded model in 4 out of 10 subjects, close to strong support for direct storage in 2 out of 10 subjects ($\Delta \mathrm{BIC}\approx -10$), positive support for direct storage in 2 subjects, and essentially insignificant support ($\Delta \mathrm{BIC}\approx 2$) in the 2 remaining subjects.
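As a hedged illustration of the model comparison, the sketch below computes a $\Delta \mathrm{BIC}$ score from the residual sums of squares of two least-squares fits. It assumes Gaussian residuals, for which the BIC reduces (up to a constant shared by both models) to $n\ln(\mathrm{RSS}/n)+k\ln n$; the fitting procedure, parameter counts, and numbers used here are hypothetical and are not taken from the paper.

```python
import math


def bic_least_squares(rss, n_points, n_params):
    """BIC of a least-squares fit with Gaussian residuals, up to an additive constant."""
    if rss <= 0 or n_points <= 0:
        raise ValueError("rss and n_points must be positive")
    return n_points * math.log(rss / n_points) + n_params * math.log(n_points)


def delta_bic(rss_direct, rss_coded, n_points, k_direct, k_coded):
    """Positive values favour the well-coded model, negative values direct storage."""
    return (bic_least_squares(rss_direct, n_points, k_direct)
            - bic_least_squares(rss_coded, n_points, k_coded))


if __name__ == "__main__":
    # Hypothetical residuals for one subject with 10 data points.
    score = delta_bic(rss_direct=2.0, rss_coded=1.0, n_points=10, k_direct=2, k_coded=2)
    verdict = "very strong" if abs(score) >= 10 else "weaker"
    print(f"Delta BIC = {score:.2f} ({verdict} support)")
```

Because the penalty term grows only logarithmically with the number of data points while the fit term grows linearly, pooled data sets give much larger $|\Delta \mathrm{BIC}|$ than individual subjects for the same per-point difference in fit quality.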
References
2. Modeling brain function: The world of attractor neural networks. Cambridge University Press.
4. Human memory: A proposed system and its control processes. The Psychology of Learning and Motivation 2:89–195. https://doi.org/10.1016/S00797421(08)604223
6. Working models of working memory. Current Opinion in Neurobiology 25:20–24. https://doi.org/10.1016/j.conb.2013.10.008
7. Time causes forgetting from working memory. Psychonomic Bulletin & Review 19:87–92. https://doi.org/10.3758/s1342301101928
8. Working memory span development: a time-based resource-sharing model account. Developmental Psychology 45:477–490. https://doi.org/10.1037/a0014615
9. Further evidence for temporal decay in working memory: reply to Lewandowsky and Oberauer (2009). Journal of Experimental Psychology: Learning, Memory, and Cognition 37:1302–1317. https://doi.org/10.1037/a0022933
12. Noise in neural populations accounts for errors in working memory. Journal of Neuroscience 34:3632–3645. https://doi.org/10.1523/JNEUROSCI.320413.2014
14. Anticipatory head direction signals in anterior thalamus: evidence for a thalamocortical circuit that integrates angular head motion to compute head direction. Journal of Neuroscience 15:6260–6270.
15. A continuous attractor network model without recurrent excitation: maintenance and integration in the head direction cell system. Journal of Computational Neuroscience 18:205–227. https://doi.org/10.1007/s108270056559y
16. Compression in visual working memory: using statistical regularities to form more efficient memory representations. Journal of Experimental Psychology: General 138:487–502. https://doi.org/10.1037/a0016797
17. Basic mechanisms for graded persistent activity: discrete attractors, continuous attractors, and dynamic representations. Current Opinion in Neurobiology 13:204–211. https://doi.org/10.1016/S09594388(03)000503
18. Accurate path integration in continuous attractor network models of grid cells. PLoS Computational Biology 5:e1000291. https://doi.org/10.1371/journal.pcbi.1000291
20. Evidence for decay in verbal short-term memory: a commentary on Berman, Jonides, and Lewis (2009). Journal of Experimental Psychology: Learning, Memory, and Cognition 38:1129–1136. https://doi.org/10.1037/a0026934
21. Using expander codes to construct Hopfield networks with exponential capacity. CoSyNe Meeting Abstract II78, Salt Lake City, UT, USA.
23. Working memory capacity and its relation to general intelligence. Trends in Cognitive Sciences 7:547–552. https://doi.org/10.1016/j.tics.2003.10.005
26. The magical number 4 in short-term memory: a reconsideration of mental storage capacity. Behavioral and Brain Sciences 24:87–114. https://doi.org/10.1017/S0140525X01003922
27. Visual short-term memory compared in rhesus monkeys and humans. Current Biology 21:975–979. https://doi.org/10.1016/j.cub.2011.04.031
28. What grid cells convey about rat location. Journal of Neuroscience 28:6858–6871. https://doi.org/10.1523/JNEUROSCI.568407.2008
29. A binary Hopfield network with information rate and applications to grid cell decoding. Proceedings of the 2nd Workshop on Biological Distributed Algorithms.
30. Prefrontal cortex and working memory processes. Neuroscience 139:251–261. https://doi.org/10.1016/j.neuroscience.2005.07.003
37. Prefrontal Activity during Delayed-response Tasks Requiring Response Selection and Preparation. Proceedings of Cognitive Neuroscience Society.
38. The mind and brain of short-term memory. Annual Review of Psychology 59:193–224. https://doi.org/10.1146/annurev.psych.59.103006.093615
39. Bayes Factors. Journal of the American Statistical Association 90:773–795. https://doi.org/10.1080/01621459.1995.10476572
41. No temporal decay in verbal short-term memory. Trends in Cognitive Sciences 13:120–126. https://doi.org/10.1016/j.tics.2008.12.003
45. Information Theory, Inference & Learning Algorithms. New York: Cambridge University Press.
46. Resolution of nested neuronal representations can be exponential in the number of neurons. Physical Review Letters 109:018103. https://doi.org/10.1103/PhysRevLett.109.018103
47. Attention effects during visual short-term memory maintenance: Protection or prioritization? Perception & Psychophysics 69:1422–1434. https://doi.org/10.3758/BF03192957
48. A probabilistic palimpsest model of visual short-term memory. PLOS Computational Biology 11:e1004003. https://doi.org/10.1371/journal.pcbi.1004003
49. Dynamic population coding of category information in inferior temporal and prefrontal cortex. Journal of Neurophysiology 100:1407–1419. https://doi.org/10.1152/jn.90248.2008
51. Neural mechanisms of visual working memory in prefrontal cortex of the macaque. Journal of Neuroscience 16:5154–5167.
55. Rapid forgetting prevented by retrospective attention cues. Journal of Experimental Psychology: Human Perception and Performance 39:1224–1231. https://doi.org/10.1037/a0030947
56. Rapid forgetting results from competition over time between items in visual working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition 43:528–536. https://doi.org/10.1037/xlm0000328
58. Channel coding rate in the finite blocklength regime. IEEE Transactions on Information Theory 56:2307–2359. https://doi.org/10.1109/TIT.2010.2043769
59. On the information capacity of Gaussian channels under small peak power constraints. IEEE, 10.1109/ALLERTON.2008.4797569.
60. Differences between presentation methods in working memory procedures: a matter of working memory consolidation. Journal of Experimental Psychology: Learning, Memory, and Cognition 40:417–428. https://doi.org/10.1037/a0034301
63. Noise, neural codes and cortical organization. Current Opinion in Neurobiology 4:569–579. https://doi.org/10.1016/09594388(94)900590
64. A Mathematical Theory of Communication. Bell System Technical Journal 27:379–423. https://doi.org/10.1002/j.15387305.1948.tb01338.x
65. Coding theorems for a discrete source with a fidelity criterion. Institute of Radio Engineers, International Convention Record, part 4 7:142–163.
66. An ideal observer analysis of visual working memory. Psychological Review 119:807–830. https://doi.org/10.1037/a0029856
68. The information capacity of amplitude- and variance-constrained scalar Gaussian channels. Information and Control 18:203–219. https://doi.org/10.1016/S00199958(71)903469
69. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience 13:334–350.
70. Stable and Dynamic Coding for Working Memory in Primate Prefrontal Cortex. The Journal of Neuroscience 37:6503–6516. https://doi.org/10.1523/JNEUROSCI.336416.2017
71. Grid cells generate an analog error-correcting code for singularly precise neural computation. Nature Neuroscience 14:1330–1337. https://doi.org/10.1038/nn.2901
73. 'Activity-silent' working memory in prefrontal cortex: a dynamic coding framework. Trends in Cognitive Sciences 19:394–405. https://doi.org/10.1016/j.tics.2015.05.004
75. Evaluation of rate-distortion functions for a class of independent identically distributed sources under an absolute-magnitude criterion. IEEE Transactions on Information Theory 21:59–64. https://doi.org/10.1109/TIT.1975.1055335
76. Head direction cells and the neurophysiological basis for a sense of direction. Progress in Neurobiology 55:225–256. https://doi.org/10.1016/S03010082(98)000045
78. The source-channel separation theorem revisited. IEEE Transactions on Information Theory 41:44–54. https://doi.org/10.1109/18.370119
80. Synaptic reverberation underlying mnemonic persistent activity. Trends in Neurosciences 24:455–463. https://doi.org/10.1016/S01662236(00)018683
82. A detection theory account of change detection. Journal of Vision 4:11–1135. https://doi.org/10.1167/4.12.11
84. Dynamics and computation of continuous attractors. Neural Computation 20:994–1025. https://doi.org/10.1162/neco.2008.1006378
85. Temporal isolation of the neural correlates of spatial mnemonic processing with fMRI. Cognitive Brain Research 7:255–268. https://doi.org/10.1016/S09266410(98)000299
86. Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: a theory. Journal of Neuroscience 16:2112–2126.
88. Sudden death and gradual decay in visual working memory. Psychological Science 20:423–428. https://doi.org/10.1111/j.14679280.2009.02322.x
Decision letter

Lila Davachi, Reviewing Editor; New York University, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Fundamental bound on the persistence and capacity of short-term memory stored as graded persistent activity" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and David Van Essen as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Tim Buschman (Reviewer #1); John D Murray (Reviewer #2); Brad Postle (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The manuscript presents an information-theoretic computational model of STM that suggests an intriguing new way that information may be coded in working memory. The theoretical framework developed here constitutes an important advance in linking neural circuit mechanisms to testable psychophysical behavior (here, working memory precision as a function of duration and load). The quantitative fit of the model to human behavior is compelling and bolsters the relevance of the theoretical advances.
The reviewers were all in agreement about the potential impact of the work presented. But all also agreed that further discussion of the proposed models and their implications should be added to place this work more thoroughly in the broader context of the field. Some specific suggestions are made below. Furthermore, there are several detailed questions regarding aspects of the model that should also be addressed. I have edited and appended below the revisions that are essential to include in a revision.
Please address the following in a revision:
1) One reviewer noted that 'it takes too long to get to the point in the manuscript at which the reader knows, well, what the main point of the paper will be. It's not in the title, not in the Abstract, and, indeed, not clearly articulated until subsection “Information-theoretic bound on memory performance with well-coded storage” of the manuscript.' The first part of the manuscript is taken up with a lengthy exposition of why and how direct storage models are unsatisfactory. For a general-interest journal, one would want the central idea to be clearly articulated in one of the first paragraphs in the paper (not to mention in the Abstract), then the demonstration that direct storage models are insufficient to be dispatched within a few short paragraphs. Perhaps some of this could be accomplished in part by moving some of the text and analyses to figure legends? As things stand, the figures with their minimalist legends are inscrutable. One idea would be to display the panel from Figure 4E side-by-side with 3A and B, to permit a side-by-side comparison of the different approaches. Indeed, Figures 3 and 4 could be merged, together with much of the text between them.
2) A second major absence from the Introduction, which will raise concerns by many familiar with the current literature, is the near absence of any consideration of the growing number of suggestions that STM might be accomplished by mechanisms other than sustained activity. To name just a few, there's a recent TICS paper by Stokes that is explicitly devoted to this idea, there are several theoretical accounts by Tsodyks, Barak, and colleagues (nicely summarized in a recent Current Opinion review), and there's the nonlinear dynamical systems model of Lundqvist and colleagues, recently illustrated with data from Miller's group.
If some variant of these "activity-silent" accounts is correct, are the ideas presented in this manuscript irrelevant, or are there principles from the present theory that would apply? Additionally/alternatively, are there principles from the present theory that might apply to sustained activity supporting a behavior other than STM?
3) Some of the writing contains incomplete or misleading assertions. For example, the idea that there are constraints on the amount of time that information can be held in STM ignores the fact that a classically held hallmark of STM is precisely that it is not sensitive to the passage of time, per se. (Two examples are from Keppel and Underwood, and many demonstrations of prolonged retention of information in STM in anterograde amnesic patients.) Indeed, puzzlingly, one of the papers cited by the authors to substantiate their assertion is entitled "No temporal decay in verbal short-term memory."
4) The manuscript makes no contact with the growing literature of multivariate analyses of data from STM tasks, from nonhuman and human electrophysiology, and from human fMRI. Some of these studies show the ability to decode the contents of STM from delay-period activity with decoders trained on sample-evoked signal. Others suggest that the neural code may be dynamic, with minimal if any cross-temporal generalization (i.e., "off-diagonal" decoding). How does the proposed theory relate to this empirical literature? Without reference to these broader literatures, the present manuscript might be more suitable for a more specialized computational journal.
5) The authors argue that the currently accepted model of working memory predicts a linear increase in mean-squared error (MSE) over time and load (MSE ~ (load)*(time)). In contrast, they find a sublinear increase in MSE with time (Figure 3A and 3B). This sublinearity is well fit by the well-coded model. However, some of this nonlinearity could be due to other, less-capacity-limited, forms of memory at very short time delays. For example, iconic memory, thought to have an extremely high capacity, is likely still available at 100 ms (some might argue for longer). This could lead to a reduction in the MSE at the lowest time delays. Ideally the authors would control for this using masking stimuli. Alternatively, the authors could control for this by excluding the very short delays from the analysis (possibly increasing the maximum memory delay if needed for fits).
6) As with many working memory paradigms, it is not entirely clear how to define the working memory load in the current task. It seems subjects must remember multiple pieces of information per memorandum (e.g. both color and orientation) in all cases except for the single item. This would suggest memory load is actually 1, 4, 8, and 12. Does this nonlinearity account for the poor fit of the linear "direct coding" model? It seems like it might not, given the poor fit in Figure 3B but it would still be worth testing the two models with different values for memory load. Similarly, recent work has suggested some degree of independence of working memory load across the two visual hemifields. Again, this would suggest only the balanced displays can be directly compared (e.g. 2, 4, and 6 items). Does the well-coded model still provide a better fit if the analysis is restricted to these three conditions?
7) The authors appropriately use BIC to perform model comparison. However, model comparison criteria often penalize parameters to different degrees. Did the authors also find the well-coded model generalized to a withheld dataset better than the direct coding model?
8) Recent work has debated whether errors during working memory are due, in part, to guessing or not (e.g. Luck, Awh, Vogel, Bays, etc). In fact, Steve Luck argues for no increase in variance with load (or time?), instead only an increase in guess rate. If fitting a circular Gaussian to the distribution do the authors find an increase in variance or an increase in baseline (or both)? Related to this, it isn't clear to me how the pure 'sudden-death' framework matches with the diffusivity arguments made here. It seems that perhaps the well-coded model could explain the existence of complete failures to remember if the signal diffuses too much, but the model would still argue for some diffusion of memory over time. This doesn't seem consistent with the current model. I know the authors attempt to address this in the Discussion section of the current manuscript but I would encourage the authors to clarify their position.
9) This study uses the coauthors' human psychophysical data from Pertzov et al., 2016 Journal of Experimental Psychology. That study decomposed errors into three sources: (1) noisy representation; (2) misbinding or nontarget responses; and (3) random guessing. They reported that all three of these components increased with higher load and with longer delays. How do these prior findings relate to the present study? Are these different sources subsumed by the present model? Or are these important features that the present model (in the diffusive regime) does not account for? Does the present model produce only the first type of errors? The Authors mention that in another regime of the model, nondiffusive errors can produce pure guessing errors. Can the model speak to the mechanisms of misbinding errors? Please include discussion of this point.
10) Regarding the implications for neural representations: The Authors discuss that one prediction of the model would be signatures of exponentially strong codes in neural representations. As I understand it, one way this could be implemented is that each of the N memory networks has a different spatial period for its periodic coding, as in the case of grid cells. The other feature of the present model is that for multi-item working memory, a memory network contains signals for all of the K items. It would be helpful if the Authors can clarify what the implications on neural representations are for this feature of distributed multi-item coding. Does this imply that single neurons would show mixed selectivity for multiple items? Please include discussion of this point.
https://doi.org/10.7554/eLife.22225.016
Author response
Please address the following in a revision:
1) One reviewer noted that 'it takes too long to get to the point in the manuscript at which the reader knows, well, what the main point of the paper will be. It's not in the title, not in the Abstract, and, indeed, not clearly articulated until subsection “Information-theoretic bound on memory performance with well-coded storage” of the manuscript.' The first part of the manuscript is taken up with a lengthy exposition of why and how direct storage models are unsatisfactory. For a general-interest journal, one would want the central idea to be clearly articulated in one of the first paragraphs in the paper (not to mention in the Abstract), then the demonstration that direct storage models are insufficient to be dispatched within a few short paragraphs. Perhaps some of this could be accomplished in part by moving some of the text and analyses to figure legends? As things stand, the figures with their minimalist legends are inscrutable. One idea would be to display the panel from Figure 4E side-by-side with 3A and B, to permit a side-by-side comparison of the different approaches. Indeed, Figures 3 and 4 could be merged, together with much of the text between them.
We have now edited the Abstract and Introduction to convey what the manuscript is about much earlier in the text. Please see the new introductory paragraph: "In the present work, we make the following contributions: 1) Generate psychophysics predictions for information degradation as a function of delay period and number of stored items, if information is stored directly, without recoding, in persistent activity neural networks of a given size over given time interval; 2) Generate psychophysics predictions (through the use of joint source-channel coding theory) for a model that assumes information is restructured by encoding and decoding stages before and after storage in persistent activity neural networks; 3) Compare these models to new analog measurements (Pertzov et al., 2016) of human memory performance on an analog task as the demands on both maintenance duration and capacity are varied."
Please note that the early results of the manuscript are to establish the theoretical predictions for direct storage in persistent activity networks. To our knowledge, these predictions about degradation as a function of time and item number with direct storage have not been made explicit before, and so are one part of our results (if they had been made before it would have been easy to shorten this section and replace it with a citation). It is equally important to state the framework, formalism (including resource use parameters, etc.), and results for the direct storage model in the main results for comparison with the framework and parameters of the well-coded model, so that it is clear that we are making a fair comparison.
The figure captions are fairly long, and in merging plots as well as clarifying the captions as suggested, they have become slightly longer. Thus, moving more of the text of the results to the figure captions is not ideal. We have edited and shortened the direct storage Results section, but have not eviscerated it as we feel it is an integral part of our main result. As suggested, we have also merged Figures 3 and 4, to make a direct comparison between the different models easier for the reader.
2) A second major absence from the Introduction, which will raise concerns by many familiar with the current literature, is the near absence of any consideration of the growing number of suggestions that STM might be accomplished by mechanisms other than sustained activity. To name just a few, there's a recent TICS paper by Stokes that is explicitly devoted to this idea, there are several theoretical accounts by Tsodyks, Barak, and colleagues (nicely summarized in a recent Current Opinion review), and there's the nonlinear dynamical systems model of Lundqvist and colleagues, recently illustrated with data from Miller's group.
If some variant of these "activity-silent" accounts is correct, are the ideas presented in this manuscript irrelevant, or are there principles from the present theory that would apply? Additionally/alternatively, are there principles from the present theory that might apply to sustained activity supporting a behavior other than STM?
We thank the reviewers very much for this comment. Indeed, we did not explicitly discuss activity-silent accounts of STM in our Introduction or Discussion (other than providing a reference to Mongillo and Tsodyks 2008, a model of how synaptic facilitation can aid in the robustness of short-term memory). Given recent experimental and modeling results in this direction, which are starting to form a compelling alternative STM mechanism to persistent activity mechanisms, it is important to mention these accounts.
We have added a brief standalone passage in the Introduction, stating that our current work is complementary to efforts to explain STM in terms of synaptic facilitation/activity-silent mechanisms. In this passage, we cite the work of Mi, Katkov and Tsodyks, 2016; Barak and Tsodyks, 2014; Stokes, 2015 and Lundqvist et al., 2016.
With respect to the question about whether our model would apply to activity-silent mechanisms: In citing the model of Mongillo and Tsodyks, 2008 in our earlier manuscript, we had considered the possibility of synaptic facilitation as a source of a longer cellular time-constant to serve as the basis of STM, but we viewed that model as another persistent activity model, with activity supporting facilitation and facilitation supporting elevated activity. The facilitation process lent a slower intrinsic time-constant to the persistent activity feedback loop, thus providing a more robust/less fine-tuned way to generate persistent activity. Such a model would be subject to the same diffusion/drift problems as persistent activity models, qualitatively speaking (but quantitatively with lower noise or a slower diffusion time-constant), and thus subject to similar degradation as considered in our present work.
The newer models cited in the paragraph may exhibit different dynamics, and be subject to different types of noise, in which case the general principle of restructuring information to improve memory would still hold, but the functional form of error versus number of items and N could be somewhat different. However, if the synaptic facilitation states in these models were subject to a Gaussian drift (e.g. if the facilitation states are analog-valued and some biophysical noise process drives a random walk through the set of possible states even in the absence of neural activity), then they too could be treated as a bank of information channels with Gaussian noise, and our theory would potentially extend to them, but with different parameters.
Since there are not yet good models of noise in the synaptic facilitation variable, for instance, and the effects of such noise on collective network memory states, we cannot yet directly compute a theoretical bound on memory performance for these mechanisms. However, that is definitely a future interest; with more theoretical work on modeling sources of noise in the activity-silent mechanisms, it will be possible to apply a similar theoretical framework to obtain bounds on memory performance with and without good encoding.
3) Some of the writing contains incomplete or misleading assertions. For example, the idea that there are constraints on the amount of time that information can be held in STM ignores the fact that a classically held hallmark of STM is precisely that it is not sensitive to the passage of time, per se. (Two examples are from Keppel and Underwood, and many demonstrations of prolonged retention of information in STM in anterograde amnesic patients.) Indeed, puzzlingly, one of the papers cited by the authors to substantiate their assertion is entitled "No temporal decay in verbal short-term memory."
Indeed, as the reviewers note, early studies have emphasized the temporal robustness of STM, and compared to “iconic” memory, STM is much less susceptible to forgetting. Consistent with this, our experimental results clearly demonstrate that single items are remembered with very little degradation over time, and the effects of increasing item number are stronger than the effects of increasing delay on memory performance.
However, there is performance degradation over time, especially for more items. We do not ourselves model pure temporal decay as a mechanism for memory loss, so it was not our intention to convey this in the Introduction. The source of confusion was our phrasing and references. We are now more careful in making a distinction between performance degradation over time versus the possible mechanisms for such degradation (which could include noise or interference or, less likely according to the literature, pure temporal decay mechanisms). Please see our edits.
4) The manuscript makes no contact with the growing literature of multivariate analyses of data from STM tasks, from nonhuman and human electrophysiology, and from human fMRI. Some of these studies show the ability to decode the contents of STM from delay-period activity with decoders trained on sample-evoked signal. Others suggest that the neural code may be dynamic, with minimal if any cross-temporal generalization (i.e., "off-diagonal" decoding). How does the proposed theory relate to this empirical literature? Without reference to these broader literatures, the present manuscript might be more suitable for a more specialized computational journal.
Our formalism indicates that the representation within a memory channel must be in an optimised format, and that this format is not necessarily the same format that information was initially presented in. According to the information-theoretic view, the brain must perform a transformation from stimulus space into an optimally coded form, and one might expect to observe this transition of the representation at encoding. The less optimal the original stimulus space, the more different the mnemonic code will likely be from the sample-evoked signal.
This insight by the reviewer constitutes a potential key prediction of the model, that in domains that are already combinatorially structured, neural representations should remain similar throughout the delay period, whereas in domains amenable to compression at encoding, neural codes during the delay will appear dynamic or at least different from the stimulus-evoked signal. We now include a discussion of this point in the manuscript (Discussion section, paragraph beginning "It remains to be seen whether neural representations for short-term visual memory are consistent …."), also citing papers in the literature that variously show either stable, conserved coding during the delay or varying, different states during the delay.
5) The authors argue that the currently accepted model of working memory predicts a linear increase in mean-squared error (MSE) over time and load (MSE ~ (load)*(time)). In contrast, they find a sublinear increase in MSE with time (Figure 3A and 3B). This sublinearity is well fit by the well-coded model. However, some of this nonlinearity could be due to other, less-capacity-limited, forms of memory at very short time delays. For example, iconic memory, thought to have an extremely high capacity, is likely still available at 100 ms (some might argue for longer). This could lead to a reduction in the MSE at the lowest time delays. Ideally the authors would control for this using masking stimuli. Alternatively, the authors could control for this by excluding the very short delays from the analysis (possibly increasing the maximum memory delay if needed for fits).
Thank you for this comment. Please note that the real problem with the direct storage (linear) model is not so much that the function in time is linear, as that even the average slopes of the different item-number curves versus time are not fit by the slopes in the linear model: That is, if we fit the 1-item versus time data, then the predicted slope of the 6-item versus time curve is far lower than the average slope of the actual data.
This can be seen in Figure 3A. If we attempt to fit all the curves simultaneously as well as possible, again the slopes of the fits in time are far from the mean slopes of the curves, leaving aside the question of sublinearity.
If we understand, the reviewer is suggesting the following scenario: Consider some process that has linear degradation of information in time (e.g. like direct storage of information into persistent activity networks). Add to this model the assumption that the 100 ms timepoint is due to iconic memory. After excluding this 100 ms point, the uncoded model might provide a much better fit than it has so far, and it might also be more competitive with the coded model.
We now perform this analysis, and find that the uncoded model still fails to simultaneously fit the 1- and 6-item versus time data, and remains a substantially poorer fit than the coded model fit to the same data. Excluding the 100 ms timepoint therefore does not change these qualitative comparisons.
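The linear scaling discussed in this comment can be made concrete with a small simulation. The sketch below is illustrative only and is not the fitting code used for the manuscript. It assumes that K items share N independent memory networks equally, that each network's stored value diffuses as an unbounded Gaussian random walk with diffusivity D (wrap-around of the circular variable is ignored), and that recall averages the networks assigned to an item. Under those assumptions the squared error is 2DTK/N, linear in both delay T and load K. The function names and all parameter values are hypothetical.

```python
import numpy as np

def direct_storage_mse(n_items, delay, n_networks=12, diffusivity=0.05,
                       n_trials=4000, n_steps=50, seed=0):
    """Simulated squared recall error for uncoded (direct) storage.

    Assumption: the n_networks are split equally among the items, and each
    network's stored value performs an independent Gaussian random walk.
    """
    rng = np.random.default_rng(seed)
    nets_per_item = n_networks // n_items      # assumed equal resource split
    dt = delay / n_steps
    # Each step adds noise with variance 2 * diffusivity * dt.
    steps = rng.normal(0.0, np.sqrt(2 * diffusivity * dt),
                       size=(n_trials, nets_per_item, n_steps))
    drift = steps.sum(axis=2)                  # accumulated error per network
    estimate_error = drift.mean(axis=1)        # readout averages the networks
    return float(np.mean(estimate_error ** 2))

def predicted_mse(n_items, delay, n_networks=12, diffusivity=0.05):
    """Closed form under the same assumptions: 2 * D * T * K / N."""
    return 2 * diffusivity * delay / (n_networks // n_items)

for k in (1, 2, 4, 6):
    print(k, direct_storage_mse(k, delay=1.0), predicted_mse(k, delay=1.0))
```

Because the 1-item and 6-item curves differ only by the factor K in this scheme, fixing the slope from the 1-item data also fixes the slope for every other load, which is the constraint the response above refers to.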
6) As with many working memory paradigms, it is not entirely clear how to define the working memory load in the current task. It seems subjects must remember multiple pieces of information per memorandum (e.g. both color and orientation) in all cases except for the single item. This would suggest memory load is actually 1, 4, 8, and 12. Does this nonlinearity account for the poor fit of the linear "direct coding" model? It seems like it might not, given the poor fit in Figure 3B but it would still be worth testing the two models with different values for memory load. Similarly, recent work has suggested some degree of independence of working memory load across the two visual hemifields. Again, this would suggest only the balanced displays can be directly compared (e.g. 2, 4, and 6 items). Does the well-coded model still provide a better fit if the analysis is restricted to these three conditions?
This is an excellent suggestion. We have now redefined the item numbers from (1, 2, 4, 6) to (1, 4, 8, 12) and redone the fits. We find that our qualitative conclusions remain unchanged.
7) The authors appropriately use BIC to perform model comparison. However, model comparison criteria often penalize parameters to different degrees. Did the authors also find the well-coded model generalized to a withheld dataset better than the direct coding model?
Thank you for another good question. To address this, we redid the analysis by excluding one timepoint across all item-number curves, then asked how well the curves obtained from fitting the other timepoints predicted the error for the held-out datapoint. We repeated this for another timepoint. This is like a leave-one-out or jackknife cross-validation procedure. We find that the well-coded model predicts the withheld datapoints with smaller error than the uncoded/direct coding model.
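The held-out-timepoint procedure described above can be sketched as follows. This is a hedged illustration, not the analysis code behind the reported result: the data are synthetic, the delays and loads are placeholder values, and the "sublinear" model is a generic stand-in (error growing as K·sqrt(T)) rather than the theoretical bound derived in the manuscript. The sketch also loops over every delay, whereas the response above describes holding out two timepoints. All function names are hypothetical.

```python
import numpy as np

def loo_timepoint_error(delays, loads, mse, features):
    """Hold out one delay at a time (across all loads), fit the remaining
    points by least squares, and return the mean squared prediction error
    on the held-out points."""
    T, K = np.meshgrid(delays, loads)          # both have shape (n_loads, n_delays)
    errors = []
    for j in range(len(delays)):
        train = np.ones(len(delays), dtype=bool)
        train[j] = False
        X_train = features(K[:, train].ravel(), T[:, train].ravel())
        coef, *_ = np.linalg.lstsq(X_train, mse[:, train].ravel(), rcond=None)
        X_test = features(K[:, j], T[:, j])
        errors.append(np.mean((X_test @ coef - mse[:, j]) ** 2))
    return float(np.mean(errors))

def linear_features(K, T):
    # Direct-storage form: MSE = b + a * K * T
    return np.column_stack([np.ones_like(T), K * T])

def sublinear_features(K, T):
    # Placeholder sublinear form (NOT the paper's bound): MSE = b + a * K * sqrt(T)
    return np.column_stack([np.ones_like(T), K * np.sqrt(T)])

# Synthetic illustration only; these are not the experimental data.
rng = np.random.default_rng(1)
delays = np.array([0.1, 1.0, 2.0, 3.0])        # seconds, hypothetical
loads = np.array([1.0, 2.0, 4.0, 6.0])
T, K = np.meshgrid(delays, loads)
mse = 0.02 + 0.05 * K * np.sqrt(T) + rng.normal(0, 0.002, T.shape)

err_linear = loo_timepoint_error(delays, loads, mse, linear_features)
err_sublinear = loo_timepoint_error(delays, loads, mse, sublinear_features)
print(err_linear, err_sublinear)
```

Since the synthetic data are generated from the sublinear form, the sublinear model wins here by construction; the point of the sketch is the hold-out mechanics, not the outcome.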
8) Recent work has debated whether errors during working memory are due, in part, to guessing or not (e.g. Luck, Awh, Vogel, Bays, etc). In fact, Steve Luck argues for no increase in variance with load (or time?), instead only an increase in guess rate. If fitting a circular Gaussian to the distribution do the authors find an increase in variance or an increase in baseline (or both)? Related to this, it isn't clear to me how the pure 'sudden-death' framework matches with the diffusivity arguments made here. It seems that perhaps the well-coded model could explain the existence of complete failures to remember if the signal diffuses too much, but the model would still argue for some diffusion of memory over time. This doesn't seem consistent with the current model. I know the authors attempt to address this in the Discussion section of the current manuscript but I would encourage the authors to clarify their position.
Thank you for the opportunity to clarify. The direct storage model, which involves only diffusion, does not include a nonlinear "sudden-death" process. Instead, the error of recall will simply grow, continuously and monotonically, over time. It is still possible in this model that a noisy, discrete-in-time experiment will result in the appearance of a sudden-death event where there really is only continuous degradation in the underlying system (e.g. beyond some threshold of memory degradation, noise in the report or observation will make the memory appear to be "gone"). On the other hand, if information is stored in a well-coded way, according to some good error-correcting code, we would expect inherently sharp threshold behavior: such codes display a characteristic level of noise below which they can effectively suppress most error, and above which they are guaranteed to fail, and then their errors are large. Thus, the model would predict a relatively small accumulation of error over some interval, followed by a superlinear increase in squared error. We now clarify this point in the manuscript.
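The threshold behavior described above can be illustrated with a toy periodic code. The sketch below is an assumption-laden stand-in and not the code or bound analyzed in the manuscript: a value x in [0, 1) is stored as three phases (3x, 4x, 5x mod 1), loosely in the spirit of grid-cell-like modular codes, each phase is corrupted by Gaussian noise, and x is recovered by a brute-force nearest-codeword search. The multipliers, noise levels, and function name are hypothetical choices.

```python
import numpy as np

def coded_recall_mse(noise_sd, multipliers=(3, 4, 5), n_trials=500,
                     n_candidates=600, seed=0):
    """Squared recall error for a toy modular (periodic) code with Gaussian
    phase noise and nearest-codeword decoding over a grid of candidates."""
    rng = np.random.default_rng(seed)
    m = np.asarray(multipliers, dtype=float)
    x = rng.random(n_trials)                              # stored values in [0, 1)
    noise = rng.normal(0.0, noise_sd, size=(n_trials, m.size))
    phases = np.mod(np.outer(x, m) + noise, 1.0)          # noisy stored phases
    candidates = np.arange(n_candidates) / n_candidates
    cand_phases = np.mod(np.outer(candidates, m), 1.0)    # codeword per candidate
    # Circular distance between each noisy phase vector and every codeword.
    diff = phases[:, None, :] - cand_phases[None, :, :]
    diff = (diff + 0.5) % 1.0 - 0.5
    decoded = candidates[np.argmin((diff ** 2).sum(axis=2), axis=1)]
    err = (decoded - x + 0.5) % 1.0 - 0.5                 # circular recall error
    return float(np.mean(err ** 2))

for sd in (0.02, 0.05, 0.1, 0.2, 0.3):
    print(sd, coded_recall_mse(sd))
```

At low noise the decoder stays on the correct branch of the code and the error remains small; once the noise is large enough to push the stored phases closer to a distant codeword, the decoded value jumps and the squared error rises steeply rather than in proportion to the noise variance.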
9) This study uses the coauthors' human psychophysical data from Pertzov et al., 2016 Journal of Experimental Psychology. That study decomposed errors into three sources: (1) noisy representation; (2) misbinding or nontarget responses; and (3) random guessing. They reported that all three of these components increased with higher load and with longer delays. How do these prior findings relate to the present study? Are these different sources subsumed by the present model? Or are these important features that the present model (in the diffusive regime) does not account for? Does the present model produce only the first type of errors? The Authors mention that in another regime of the model, nondiffusive errors can produce pure guessing errors. Can the model speak to the mechanisms of misbinding errors? Please include discussion of this point.
Re. misbinding: The current model does not address this source of error. We now clarify this fact in the text. Our model in its present form is presented for a single feature dimension, in this case orientation, and thus it does not consider the binding problem/binding errors. Note that our model could in principle be extended to take into account the joint storage of two or more features per item, by representing those features as part of a higher-dimensional continuous attractor network, as in the joint population code model considered by Matthey, Bays and Dayan (2015); this is certainly of future interest to us (but of course outside the scope of the current work). We have now added a note about this point in the Discussion.
Re. sudden death: we have now clarified in the Discussion how sudden death can be consistent with our framework: "In our framework, good encoding ensures that for noise below a threshold, the decoder can recover an improved estimate of the stored variable; however, strong codes exhibit sharp threshold behavior as the noise in the channel is varied smoothly. […] We note, however, that the fits to the data shown here were all in the below-threshold regime."
10) Regarding the implications for neural representations: The Authors discuss that one prediction of the model would be signatures of exponentially strong codes in neural representations. As I understand it, one way this could be implemented is that each of the N memory networks has a different spatial period for its periodic coding, as in the case of grid cells. The other feature of the present model is that for multiitem working memory, a memory network contains signals for all of the K items. It would be helpful if the Authors can clarify what the implications on neural representations are for this feature of distributed multiitem coding. Does this imply that single neurons would show mixed selectivity for multiple items? Please include discussion of this point.
Good question. It is difficult to imagine any scenario involving nonmixed selectivity for items in a strongcoding scheme, thus indeed mixed selectivity would be a prediction of such a such a scheme. We already had a longer discussion of the question of tuning curves for strong codes under the heading "Are neural representations consistent with exponentially strong codes?" in the Discussion. We have now added a comment about mixed selectivity there.
https://doi.org/10.7554/eLife.22225.017
Article and author information
Author details
Funding
National Science Foundation (IIS1464349)
 Onur Ozan Koyluoglu
Israel Science Foundation (1747/14)
 Yoni Pertzov
MRC Clinician Scientist Fellowship (MR/P00878X)
 Sanjay Manohar
National Institute for Health Research (Oxford Biomedical Centre)
 Masud Husain
Wellcome Trust
 Masud Husain
National Science Foundation (IIS1148973)
 Ila R Fiete
Simons Foundation
 Ila R Fiete
Howard Hughes Medical Institute (Faculty Scholar Award)
 Ila R Fiete
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Ethics
Human subjects: The study reported here conforms to the Declaration of Helsinki and all procedures were approved by the ethics committee of the National Hospital for Neurology and Neurosurgery (NHNN) prior to the study commencing. Research Ethics Committee number (ERC) 04/Q0406/60. Personal information about individuals was password protected and saved in compliance with the Data Protection Act 1998 (DPA).
Reviewing Editor
 Lila Davachi, New York University, United States
Publication history
 Received: October 8, 2016
 Accepted: August 25, 2017
 Accepted Manuscript published: September 7, 2017 (version 1)
 Version of Record published: September 20, 2017 (version 2)
Copyright
© 2017, Koyluoglu et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
 1,645 page views
 298 downloads
 4 citations
Article citation count generated by polling the highest count across the following sources: PubMed Central, Scopus, Crossref.