Abstract
Behavior relies on the ability of sensory systems to infer properties of the environment from incoming stimuli. The accuracy of inference depends on the fidelity with which behaviorally relevant properties of stimuli are encoded in neural responses. High-fidelity encodings can be metabolically costly, but low-fidelity encodings can cause errors in inference. Here, we discuss general principles that underlie the tradeoff between encoding cost and inference error. We then derive adaptive encoding schemes that dynamically navigate this tradeoff. These optimal encodings tend to increase the fidelity of the neural representation following a change in the stimulus distribution, and reduce fidelity for stimuli that originate from a known distribution. We predict dynamical signatures of such encoding schemes and demonstrate how known phenomena, such as burst coding and firing rate adaptation, can be understood as hallmarks of optimal coding for accurate inference.
https://doi.org/10.7554/eLife.32055.001
Introduction
Biological systems must make inferences about the environment in order to successfully plan and accomplish goals. Inference is the process of estimating behaviorally relevant properties of the environment from low-level sensory signals registered by neurons in the early sensory periphery (Kersten and Schrater, 2002). Many perceptual tasks, such as color perception (Brainard et al., 2006), visual speed estimation (Weiss et al., 2002), or sound localization (Fischer and Peña, 2011; Młynarski, 2015), can be understood as probabilistic inference. All these tasks rely on the estimation of features (such as the speed of an object) that are not explicitly represented by low-level sensory stimuli (such as light signals incident on photoreceptors).
To accurately perform inference, the nervous system can construct an internal model that relates incoming sensory stimuli to behaviorally relevant properties of the environment (Kersten and Schrater, 2002; Kersten et al., 2004; Fiser et al., 2010; Rao et al., 2002; Coen-Cagli et al., 2015). As the environment changes, this internal model must be continually updated with new stimuli (Wark et al., 2009; DeWeese and Zador, 1998; Nassar et al., 2010; Lochmann et al., 2012; Deneve, 2008), and therefore the accuracy of this internal model depends on the fidelity with which incoming stimuli are encoded in neural responses.
The process of encoding sensory stimuli, however, is metabolically expensive (Laughlin et al., 1998; Mehta and Schwab, 2012; Balasubramanian et al., 2001; Harris et al., 2012; Attwell and Laughlin, 2001; Levy and Baxter, 1996), and a large body of evidence suggests that sensory systems have evolved to reduce the energetic costs of stimulus coding (Laughlin et al., 1998; Laughlin and Sejnowski, 2003; Hermundstad et al., 2014). These findings provide empirical support for the efficient coding hypothesis (Barlow, 1961), which postulates that sensory systems minimize metabolic cost while maximizing the amount of information that is encoded about a stimulus (van Hateren, 1992; Olshausen and Field, 1996; Laughlin, 1981).
The goal of maximizing stimulus information does not reflect the fact that different stimuli can have different utility to a system for making inferences about the environment (Tishby et al., 2000; Palmer et al., 2015; Geisler et al., 2009; Burge and Geisler, 2015). The relative utility of a stimulus is determined by the potential impact that it can have on the system’s belief about the state of the environment; stimuli that sway this belief carry high utility, while stimuli that do not affect this belief are less relevant. Moreover, physically different stimuli can exert the same influence on the observer’s belief and can therefore be encoded in the same neural activity pattern without affecting the inference process. Such an encoding strategy decreases the fidelity of the neural representation by using the same activity pattern to represent many stimuli, and consequently reduces the amount of metabolic resources required to perform inference.
When the distribution of stimuli changes in time, as in any natural environment, both the belief about the environment (DeWeese and Zador, 1998) and the relative impact of different stimuli on this belief also change in time. Any system that must perform accurate inference with minimal energy must therefore dynamically balance the cost of encoding stimuli with the error that this encoding can introduce in the inference process. While studies have separately shown that sensory neurons dynamically adapt to changing stimulus distributions in manners that reflect either optimal encoding (Fairhall et al., 2001) or inference (Wark et al., 2009), the interplay between these two objectives is not understood.
In this work, we develop a general framework for relating low-level sensory encoding schemes to the higher-level processing that ultimately supports behavior. We use this framework to explore the dynamic interplay between efficient encoding, which serves to represent the stimulus with minimal metabolic cost, and accurate inference, which serves to estimate behaviorally relevant properties of the stimulus with minimal error. To illustrate the implications of this framework, we consider three neurally plausible encoding schemes in a simple model environment. Each encoding scheme reflects a different limitation on the representational capacity of neural responses, and consequently each represents a different strategy for reducing metabolic costs. We then generalize this framework to a visual inference task with natural stimuli.
We find that encoding schemes optimized for inference differ significantly from encoding schemes that are designed to accurately reconstruct all details of the stimulus. The latter produce neural responses that are more metabolically costly, and the resulting inference process exhibits qualitatively different inaccuracies.
Together, these results predict dynamical signatures of encoding strategies that are designed to support accurate inference, and differentiate these strategies from those that are designed to reconstruct the stimulus itself. These dynamical signatures provide a new interpretation of experimentally observed phenomena such as burst coding and firing-rate adaptation, which we argue could arise as a consequence of a dynamic tradeoff between coding cost and inference error.
Results
A general framework for dynamically balancing coding cost and inference error
Sensory systems use internal representations of external stimuli to build and update models of the environment. As an illustrative example, consider the task of avoiding a predator (Figure 1A, left column). The predator is signaled by sensory stimuli, such as patterns of light intensity or chemical odorants, that change over time. To avoid a predator, an organism must first determine whether a predator is present, and if so, which direction the predator is moving, and how fast. This inference process requires that incoming stimuli first be encoded in the spiking activity of sensory neurons. This activity must then be transmitted to downstream neurons that infer the position and speed of the predator.
Not all stimuli will be equally useful for this task, and the relative utility of different stimuli could change over time. When first trying to determine whether a predator is present, it might be crucial to encode stimulus details that could discriminate fur from grass. Once a predator has been detected, however, the details of the predator’s fur are not relevant for determining its position and speed. If encoding stimuli is metabolically costly, energy should be devoted to encoding those details of the stimulus that are most useful for inferring the quantity at hand.
We formalize this scenario within a general Bayesian framework that consists of three components: (i) an environment, which is parameterized by a latent state ${\theta}_{t}$ that specifies the distribution $p\left({x}_{t}|{\theta}_{t}\right)$ of incoming sensory stimuli ${x}_{t}$, (ii) an adaptive encoder, which maps incoming stimuli ${x}_{t}$ onto neural responses ${y}_{t}$, and (iii) an observer, which uses these neural responses to update an internal belief about the current and future states of the environment. This belief is summarized by the posterior distribution $p\left({\theta}_{t}|{y}_{\tau \le t}\right)$ and is constructed by first decoding the stimulus from the neural response, and then combining the decoded stimulus with the prior belief $p\left({\theta}_{t-1}|{y}_{\tau <t}\right)$ and knowledge of environment dynamics. A prediction about the future state of the environment can be computed in an analogous manner by combining the posterior distribution with knowledge of environment dynamics (Materials and methods, Figure 1—figure supplement 1). This prediction is then fed back upstream and used to adapt the encoder.
In order to optimize and assess the dynamics of the system, we use the point values ${\hat{\theta}}_{t}$ and ${\overrightarrow{\theta}}_{t+1}$ as an estimate of the current state and prediction of the future state, respectively. The optimal point estimate is computed by averaging the posterior and is guaranteed to minimize the mean squared error between the estimated state ${\hat{\theta}}_{t}$ and the true state ${\theta}_{t}$, regardless of the form of the posterior distribution (Robert, 2007).
In stationary environments with fixed statistics, incoming stimuli can have varying impact on the observer’s belief about the state of the environment, depending on the uncertainty in the observer’s belief (measured by the entropy of the prior distribution, $H\left[p\left({\theta}_{t-1}|{x}_{\tau <t}\right)\right]$), and on the surprise of a stimulus given this belief (measured by the negative log probability of the stimulus given the current prediction, $-\mathrm{log}\left[p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)\right]$). We quantify the impact of a single stimulus ${x}_{t}^{\ast}$ by measuring the mean squared difference between the observer’s estimate before and after observing the stimulus: ${\left({\hat{\theta}}_{t}\left({x}_{t}^{\ast}\right)-{\hat{\theta}}_{t-1}\right)}^{2}$. When the observer is certain about the state of the environment or when a stimulus is consistent with the observer’s belief, the stimulus has little impact on the observer’s belief (Figure 1B, illustrated for mean estimation of a stationary Gaussian distribution). Conversely, when the observer is uncertain or when the new observation is surprising, the stimulus has a large impact.
The process of encoding stimuli in neural responses can introduce additional error in the observer’s estimate. Some mappings from stimuli onto responses will not alter the observer’s estimate, while other mappings can significantly distort this estimate. We measure the error induced by encoding a stimulus ${x}_{t}$ in a response ${y}_{t}$ using the mean squared difference between the estimates constructed with each input: ${\left({\hat{\theta}}_{t}\left({x}_{t}\right)-{\hat{\theta}}_{t}\left({y}_{t}\right)\right)}^{2}$. At times when the observer is certain, it is possible to encode many different stimuli in the same neural response without affecting the observer’s estimate. However, when the observer is uncertain, some encodings can induce high error, particularly when mapping a surprising stimulus onto an expected neural response, or vice versa. These neural responses can in turn have varying impact on the observer’s belief about the state of the environment.
The qualitative features of this relationship between surprise, uncertainty, and the dynamics of inference hold across a range of stimulus distributions and estimation tasks (Figure 1D). The specific geometry of this relationship depends on the underlying stimulus distribution and the estimated parameter. In some scenarios, surprise alone is not sufficient for determining the utility of a stimulus. For example, when the goal is to infer the spread of a distribution with a fixed mean, a decrease in spread would generate stimuli that are closer to the mean and therefore less surprising than expected. In this case, a simple function of surprise can be used to assess when stimuli are more or less surprising than predicted: $H\left[p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)\right]+\mathrm{log}\left[p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)\right]$, where $H\left[p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)\right]$ is the entropy, or average surprise, of the predicted stimulus distribution. We refer to this as centered surprise, which is closely related to the information-theoretic notion of typicality (Cover and Thomas, 2012).
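As a hedged illustration of centered surprise, the following sketch evaluates the entropy of a Gaussian prediction plus the log probability of a stimulus under that prediction. The function names, the use of NumPy, and the choice of natural logarithms are assumptions made for this example; they are not taken from the original analysis.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Natural-log density of a Gaussian N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def gaussian_entropy(sigma):
    """Differential entropy (nats) of N(mu, sigma^2), i.e. its average surprise."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

def centered_surprise(x, mu_pred, sigma_pred):
    """Entropy of the predicted distribution plus log probability of x.

    Positive values mean x is less surprising than the predicted average;
    negative values mean x is more surprising than the predicted average.
    """
    return gaussian_entropy(sigma_pred) + gaussian_log_pdf(x, mu_pred, sigma_pred)
```

For a Gaussian prediction this quantity reduces to $(1-{z}^{2})/2$, where $z$ is the deviation of the stimulus from the predicted mean in units of the predicted standard deviation. It is therefore zero at one standard deviation, positive closer to the mean, and negative in the tails, so samples drawn from a narrower distribution than predicted have positive centered surprise on average.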
Together, the relative impact of different stimuli and the error induced by mapping stimuli onto neural responses shape the dynamics of inference. In what follows, we extend this intuition to nonstationary environments, where we show that encoding schemes that are optimized to balance coding cost and inference error exploit these relationships to devote higher coding fidelity at times when the observer is uncertain and stimuli are surprising.
Adaptive coding for inference in nonstationary environments
To make our considerations concrete, we model an optimal Bayesian observer in a two-state environment (Figure 2A). Despite its simplicity, this model has been used to study the dynamics of inference in neural and perceptual systems and can generate a range of complex behaviors (DeWeese and Zador, 1998; Wilson et al., 2013; Nassar et al., 2010; Radillo et al., 2017; Veliz-Cuba et al., 2016). Within this model, the state variable ${\theta}_{t}$ switches randomly between a ‘low’ state ($\theta ={\theta}^{L}$) and a ‘high’ state ($\theta ={\theta}^{H}$) at a small but fixed hazard rate $h$ (we use $h=0.01$). We take ${\theta}_{t}$ to specify either the mean or the standard deviation of a Gaussian stimulus distribution, and we refer to these as ‘mean-switching’ and ‘variance-switching’ environments, respectively. At each point in time, a single stimulus sample ${x}_{t}$ is drawn randomly from this distribution. This stimulus is encoded in a neural response and used to update the observer’s belief about the environment. For a two-state environment, this belief is fully specified by the posterior probability ${P}_{t}^{L}$ that the environment is in the low state at time $t$. The predicted distribution of environmental states can be computed based on the probability that the environment will switch states in the next timestep: ${P}_{t+1}^{L}={P}_{t}^{L}\left(1-h\right)+\left(1-{P}_{t}^{L}\right)h$. The posterior can then be used to construct a point estimate of the environmental state at time $t$: ${\hat{\theta}}_{t}={P}_{t}^{L}{\theta}^{L}+\left(1-{P}_{t}^{L}\right){\theta}^{H}$ (the point prediction ${\overrightarrow{\theta}}_{t+1}$ can be constructed from the predicted distribution ${P}_{t+1}^{L}$ in an analogous manner). For small hazard rates (as considered here), the predicted distribution of environmental states is very close to the current posterior, and thus the prediction ${\overrightarrow{\theta}}_{t+1}$ can be approximated by the current estimate ${\hat{\theta}}_{t}$.
Note that although the environmental states are discrete, the posterior distributions, and the point estimates constructed from them, are continuous (Materials and methods).
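The observer described above can be sketched in a few lines. The example below is a minimal illustration for a mean-switching environment in which the observer receives the stimulus directly (that is, ${y}_{t}={x}_{t}$, with no encoding loss). The function names, the initial belief of 0.5, and the specific state values used in any call are assumptions made for this sketch.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def predict(p_low, h=0.01):
    """Predicted probability of the low state after one possible switch."""
    return p_low * (1 - h) + (1 - p_low) * h

def update(p_low_pred, x, mu_low, mu_high, sigma):
    """Bayes update of the low-state probability given one stimulus x."""
    like_low = gaussian_pdf(x, mu_low, sigma)
    like_high = gaussian_pdf(x, mu_high, sigma)
    numerator = like_low * p_low_pred
    return numerator / (numerator + like_high * (1 - p_low_pred))

def point_estimate(p_low, theta_low, theta_high):
    """Posterior-mean estimate of the environmental state."""
    return p_low * theta_low + (1 - p_low) * theta_high

def run_observer(stimuli, mu_low, mu_high, sigma, h=0.01, p0=0.5):
    """Track the point estimate over a stimulus sequence (assumed start p0)."""
    p_low = p0
    estimates = []
    for x in stimuli:
        # The prior at each step is the prediction from the previous posterior.
        p_low = update(predict(p_low, h), x, mu_low, mu_high, sigma)
        estimates.append(point_estimate(p_low, mu_low, mu_high))
    return np.array(estimates)
```

Because the point estimate is a weighted average of the two state values, it varies continuously between them even though the underlying states are discrete.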
We consider three neurally plausible encoding schemes that reflect limitations in representational capacity. In one scheme, the encoder is constrained in the total number of distinct responses it can produce at a given time, and uses a discrete set of neural response levels to represent a stimulus (‘discretization’; Figure 2B–D). In a second scheme, the encoder is constrained in dynamic range and temporal acuity, and filters incoming stimuli in time (‘temporal filtering’; Figure 2E–G). Finally, we consider an encoder that is constrained in the total amount of activity that can be used to encode a stimulus, and must therefore selectively encode certain stimuli and not others (‘stimulus selection’; Figure 2H–J). For each scheme, we impose a global constraint that controls the maximum fidelity of the encoding. We then adapt the instantaneous fidelity of the encoding subject to this global constraint. We do so by choosing the parameters of the encoding to minimize the error in inference, ${\left({\hat{\theta}}_{t}\left({x}_{t}\right)-{\hat{\theta}}_{t}\left({y}_{t}\right)\right)}^{2}$, when averaged over the predicted distribution of stimuli, $p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)$. (In what follows, we will use ${\hat{\theta}}_{t}$ and ${\overrightarrow{\theta}}_{t+1}$ to denote the estimates and predictions constructed from the neural response ${y}_{t}$. When differentiating between ${\hat{\theta}}_{t}\left({x}_{t}\right)$ and ${\hat{\theta}}_{t}\left({y}_{t}\right)$, we will use the shorthand notation ${\hat{\theta}}_{x,t}$ and ${\hat{\theta}}_{y,t}$, respectively). We compare this minimization to one in which the goal is to reconstruct the stimulus itself; in this case, the error in reconstruction is given by ${({x}_{t}-{y}_{t})}^{2}$. In both cases, the goal of minimizing error (in either inference or reconstruction) is balanced with the goal of minimizing metabolic cost.
Because the encoding is optimized based on the internal prediction of the environmental state, the entropy of the neural response will depend on how closely this prediction aligns with the true state of the environment. The entropy specifies the minimal number of bits required to accurately represent the neural response (Cover and Thomas, 2012), and becomes a lower bound on energy expenditure if each bit requires a fixed metabolic cost (Sterling and Laughlin, 2015). We therefore use the entropy of the response as a general measure of the metabolic cost of encoding.
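To make the cost measure concrete, the sketch below estimates the entropy, in bits, of a sequence of discrete responses from their empirical frequencies. This is only an illustration of the quantity being discussed; the function name and the plug-in (frequency-based) estimator are assumptions of this example.

```python
import numpy as np

def response_entropy_bits(responses):
    """Plug-in entropy estimate (bits) of a sequence of discrete response labels."""
    _, counts = np.unique(np.asarray(responses), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A response sequence that always uses the same level has zero entropy, whereas one that uses four levels equally often requires two bits per response.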
We expect efficient encoding schemes to operate on uncertainty and surprise. The observer’s uncertainty, given by $H\left[{P}_{t}^{L}\right]=-{P}_{t}^{L}\mathrm{log}{P}_{t}^{L}-\left(1-{P}_{t}^{L}\right)\mathrm{log}\left(1-{P}_{t}^{L}\right)$, is largest when the posterior is near 0.5, and the observer believes that the environment is equally likely to be in either state. The degree to which incoming stimuli are surprising depends on the entropy of the stimulus distribution, and on the alignment between this distribution and the observer’s belief. When the mean of the Gaussian distribution is changing in time, the entropy is constant, and surprise depends symmetrically on the squared difference between the true and predicted mean, ${\left(\mu -\overrightarrow{\mu}\right)}^{2}$. When the variance is changing, the entropy is also changing in time, and centered surprise depends asymmetrically on the ratio of true and predicted variances, ${\sigma}^{2}/{\overrightarrow{\sigma}}^{2}$. As a result, encoding strategies that rely on stimulus surprise should be symmetric to changes in mean but asymmetric to changes in variance.
To illustrate the dynamic relationship between encoding and inference, we use a ‘probe’ environment that switches between two states at fixed intervals of $1/h$ timesteps. This specific instantiation is not unlikely given the observer’s model of the environment (DeWeese and Zador, 1998) and allows us to illustrate average behaviors over many cycles of the environment.
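A probe environment of this kind can be generated as follows. This is a sketch under stated assumptions: the state values, the choice to begin each cycle in the low state, the fixed seed, and the function names are all illustrative and are not specified in the text.

```python
import numpy as np

def probe_states(n_cycles, h=0.01, theta_low=-1.0, theta_high=1.0):
    """Alternate between low and high states every 1/h timesteps.

    Each cycle is assumed to start in the low state.
    """
    period = int(round(1 / h))
    one_cycle = np.concatenate(
        [np.full(period, theta_low), np.full(period, theta_high)]
    )
    return np.tile(one_cycle, n_cycles)

def sample_stimuli(states, mode="mean", fixed=1.0, seed=0):
    """Draw one Gaussian stimulus per timestep.

    mode="mean":     states give the mean; `fixed` is the standard deviation.
    mode="variance": states give the standard deviation (must be positive);
                     `fixed` is the mean.
    """
    rng = np.random.default_rng(seed)
    if mode == "mean":
        return rng.normal(loc=states, scale=fixed)
    if mode == "variance":
        return rng.normal(loc=fixed, scale=states)
    raise ValueError("mode must be 'mean' or 'variance'")
```

Averaging any quantity of interest across the repeated cycles then gives the cycle-averaged behavior.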
Encoding via discretization
Neurons use precise sequences of spikes (Roddey et al., 2000) or discrete firing rate levels (Laughlin, 1981) to represent continuous stimuli. This inherent discreteness imposes a fundamental limitation on the number of distinct neural responses that can be used to represent a continuous stimulus space. Many studies have argued that sensory neurons make efficient use of limited response levels by appropriately tuning these levels to match the steady-state distribution of incoming stimuli (e.g. Laughlin, 1981; Balasubramanian and Berry, 2002; Gjorgjieva et al., 2017).
Here, we consider an encoder that adaptively maps an incoming stimulus ${x}_{t}$ onto a discrete set of neural response levels $\left\{{y}_{t}^{i}\right\}$ (Figure 2B). Because there are many more stimuli than levels, each level must be used to represent multiple stimuli. The number of levels reflects a global constraint on representational capacity; fewer levels indicate a stronger constraint and result in a lower fidelity encoding.
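The encoder can adapt this mapping by expanding, contracting, and shifting the response levels to devote higher fidelity to different regions of the stimulus space. We consider an optimal strategy in which the response levels are chosen at each timestep to minimize the predicted inference error, subject to a constraint on the number of levels:
$$\left\{{y}_{t}^{i}\right\}=\underset{\left\{{y}^{i}\right\}}{\mathrm{arg\,min}}\;{\left\langle {\left({\hat{\theta}}_{x,t}-{\hat{\theta}}_{y,t}\right)}^{2}\right\rangle}_{p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)}$$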
When the mean of the stimulus distribution is changing over time, we define these levels with respect to the raw stimulus value ${x}_{t}$. When the variance is changing, we define these levels with respect to the absolute deviation from the mean, $\left|{x}_{t}-\mu \right|$ (where we take $\mu =0$). The predicted inference error induced by encoding a stimulus ${x}_{t}$ in a response ${y}_{t}$ changes over time as a function of the observer’s prediction of the environmental state (Figure 2C–D). Because some stimuli have very little effect on the estimate at a given time, they can be mapped onto the same neural response level without inducing error in the estimate (white regions in Figure 2C–D). The optimal response levels are chosen to minimize this error when averaged over the predicted distribution of stimuli.
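The predicted inference error for a given set of response levels can be approximated numerically. The sketch below does this for a mean-switching environment under several assumptions that are not taken from the text: each level is represented by the mean of the predicted stimulus distribution within its bin, the observer updates its belief as if the decoded value were the stimulus, the predicted stimulus distribution is a Gaussian centered on the point prediction, and the integral is approximated on a fixed grid. All function names and default parameter values are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def estimate_after(x, p_low, mu_low, mu_high, sigma):
    """Point estimate of the mean after a Bayes update on a decoded value x."""
    like_low = gaussian_pdf(x, mu_low, sigma)
    like_high = gaussian_pdf(x, mu_high, sigma)
    post_low = like_low * p_low / (like_low * p_low + like_high * (1 - p_low))
    return post_low * mu_low + (1 - post_low) * mu_high

def predicted_inference_error(edges, p_low, mu_low=-1.0, mu_high=1.0, sigma=1.0):
    """Average squared difference between estimates built from x and from y.

    `edges` are sorted bin boundaries; N edges define N + 1 response levels.
    The average is taken over the predicted stimulus distribution.
    """
    grid = np.linspace(-8.0, 8.0, 4001)
    mu_pred = p_low * mu_low + (1 - p_low) * mu_high
    weights = gaussian_pdf(grid, mu_pred, sigma)
    weights = weights / weights.sum()
    bins = np.digitize(grid, edges)
    theta_x = estimate_after(grid, p_low, mu_low, mu_high, sigma)
    error = 0.0
    for b in np.unique(bins):
        in_bin = bins == b
        w = weights[in_bin]
        if w.sum() == 0:
            continue
        # Assumed decoder: the level's value is the weighted mean of its bin.
        y = np.sum(w * grid[in_bin]) / w.sum()
        theta_y = estimate_after(y, p_low, mu_low, mu_high, sigma)
        error += np.sum(w * (theta_x[in_bin] - theta_y) ** 2)
    return float(error)

def best_threshold(p_low, candidates=np.linspace(-3.0, 3.0, 61)):
    """Grid search for the single boundary (two levels) with the lowest error."""
    errors = [predicted_inference_error(np.array([c]), p_low) for c in candidates]
    return float(candidates[int(np.argmin(errors))])
```

With these assumptions, a two-level encoder induces far less error when the observer is nearly certain of the state than when its belief is near 0.5, which is consistent with the observation that many stimuli can share a response level at times of high certainty.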
The relative width of each level is a measure of the resolution devoted to different regions of the stimulus space; narrower levels devote higher resolution (and thus higher fidelity) to the corresponding regions of the stimulus space. The output of these response levels is determined by their alignment with the true stimulus distribution. An encoding that devotes higher resolution to stimuli that are likely to occur in the environment will produce a higher entropy rate (and thus higher cost), because many different response levels will be used with relatively high frequency. In contrast, if an encoding scheme devotes high resolution to surprising stimuli, very few response levels will be used, and the resulting entropy rates will be low.
When designed for accurate inference, we find that the optimal encoder devotes its resolution to stimuli that are surprising given the current prediction of the environment (Figure 3B). In a mean-switching environment (left column of Figure 3), stimuli that have high surprise fall within the tails of the predicted stimulus distribution. As a result, when the observer’s prediction is accurate, the bulk of the stimulus distribution is mapped onto the same response level (Figure 3B, left), and entropy rates are low (blue curve in Figure 3D, left). When the environment changes abruptly, the bulk of the new stimulus distribution is mapped onto different response levels. This results in a large spike in entropy rate, which enables the observer to quickly adapt its estimate to the change (blue curve in Figure 3E, left).
In a variance-switching environment (right column of Figure 3), stimuli that have high centered surprise fall either within the tails of the predicted stimulus distribution (when variance is low), or within the bulk (when variance is high). As a result, entropy rates are low in the low-variance state, but remain high during the high-variance state (blue curve in Figure 3D, right).
When designed for accurate reconstruction of the stimulus, we find that the optimal encoder devotes its resolution to stimuli that are likely given the current prediction of the environmental state (Figure 3C). As a result, entropy rates are high when the observer’s prediction is accurate, regardless of the environment (green curves in Figure 3D). Entropy rates drop when the environment changes, because likely stimuli become mapped onto the same response level. This drop slows the observer’s detection of changes in the environment (green curve in Figure 3E, left). An exception occurs when the variance abruptly increases, because likely stimuli are still given high resolution by the encoder following the change in the environment.
Whether optimizing for inference or stimulus reconstruction, the entropy rate, and thus the coding cost, changes dynamically over time in a manner that is tightly coupled with the inference error. The average inference error can be reduced by increasing the number of response levels, but this induces a higher average coding cost (Figure 3F). As expected, a strategy optimized for inference achieves lower inference error than a strategy optimized for stimulus reconstruction (across all numbers of response levels), but it also does so at significantly lower coding cost.
Encoding via temporal filtering
Neural responses have limited gain and temporal acuity, a feature that is often captured by linear filters. For example, neural receptive fields are often characterized as linear temporal filters, sometimes followed by a nonlinearity (Bialek et al., 1990; Roddey et al., 2000). The properties of these filters are known to dynamically adapt to changing stimulus statistics (e.g. Sharpee et al., 2006; Sharpee et al., 2011), and numerous theoretical studies have suggested that such filters are adapted to maximize the amount of information that is encoded about the stimulus (van Hateren, 1992; Srinivasan et al., 1982).
Here, we consider an encoder that implements a very simple temporal filter (Figure 2E):
$${y}_{t}={\alpha}_{t}{x}_{t}+\left(1-{\alpha}_{t}\right){x}_{t-1},$$
where ${\alpha}_{t}\in [0.5,1]$ is a coefficient that specifies the shape of the filter and controls the instantaneous fidelity of the encoding. When ${\alpha}_{t}=0.5$, the encoder computes the average of current and previous stimuli by combining them with equal weighting, and the fidelity is minimal. When ${\alpha}_{t}=1$, the encoder transmits the current stimulus with perfect fidelity (i.e. ${y}_{t}={x}_{t}$). In addition to introducing temporal correlations, the filtering coefficient changes the gain of the response ${y}_{t}$ by rescaling the inputs $\{{x}_{t},{x}_{t-1}\}$.
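The filter can be sketched as follows, assuming it takes the form ${y}_{t}={\alpha}_{t}{x}_{t}+\left(1-{\alpha}_{t}\right){x}_{t-1}$, which matches the two limiting cases described above. The function name and the handling of the first sample (which has no predecessor) are assumptions of this example.

```python
import numpy as np

def temporal_filter(x, alpha):
    """Apply y_t = alpha_t * x_t + (1 - alpha_t) * x_{t-1} to a stimulus sequence.

    `alpha` may be a scalar or one coefficient per timestep, each in [0.5, 1].
    The first sample is treated as its own predecessor (an assumed boundary
    condition; the text does not specify one).
    """
    x = np.asarray(x, dtype=float)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=float), x.shape)
    if np.any(alpha < 0.5) or np.any(alpha > 1.0):
        raise ValueError("alpha must lie in [0.5, 1]")
    x_prev = np.concatenate([x[:1], x[:-1]])
    return alpha * x + (1 - alpha) * x_prev
```

For independent stimuli with variance ${\sigma}^{2}$, the response variance under this form is $\left({\alpha}_{t}^{2}+{\left(1-{\alpha}_{t}\right)}^{2}\right){\sigma}^{2}$, which falls from ${\sigma}^{2}$ at ${\alpha}_{t}=1$ to ${\sigma}^{2}/2$ at ${\alpha}_{t}=0.5$. This is one way to see how the coefficient changes the gain of the response.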
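The encoder can adapt ${\alpha}_{t}$ in order to manipulate the instantaneous fidelity of the encoding (Figure 2E). We again consider an optimal strategy in which the value of ${\alpha}_{t}$ is chosen at each timestep to minimize the predicted inference error, subject to a constraint on the predicted entropy rate of the encoding:
$${\alpha}_{t}=\underset{\alpha}{\mathrm{arg\,min}}\;{\left\langle {\left({\hat{\theta}}_{x,t}-{\hat{\theta}}_{y,t}\right)}^{2}\right\rangle}_{p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)}+\beta H\left[{y}_{t}|{\overrightarrow{\theta}}_{t}\right]$$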
Both terms depend on the strength of averaging ${\alpha}_{t}$ and on the observer’s belief ${P}_{t}^{L}$ about the state of the environment (Figure 2F–G). The inference error depends on belief through the observer’s uncertainty; when the observer is uncertain, strong averaging yields a low fidelity representation. When the observer is certain, however, incoming stimuli can be strongly averaged without impacting the observer’s estimate. The entropy rate depends on belief through the predicted entropy rate (variance) of the stimulus distribution; when the predicted entropy rate is high, incoming stimuli are more surprising on average. The multiplier $\beta $ reflects a global constraint on representational capacity; larger values of $\beta $ correspond to stronger constraints and reduce the maximum fidelity of the encoding. This, in turn, results in a reduction in coding fidelity through a decrease in gain and an increase in temporal correlation.
When designed for accurate inference, we find that the optimal encoder devotes higher fidelity at times when the observer is uncertain and the predicted stimulus variance is high. In a mean-switching environment, the stimulus variance is fixed (Figure 4A, left), and thus the fidelity depends only on the observer’s uncertainty. This uncertainty grows rapidly following a change in the environment, which results in a transient increase in coding fidelity (Figure 4B, left) and a rapid adaptation of the observer’s estimate (Figure 4D, left). This estimate is highly robust to the strength of the entropy constraint; even when incoming stimuli are strongly averaged (${\alpha}_{t}=0.5$), the encoder transmits the mean of two consecutive samples, which is precisely the statistic that the observer is trying to estimate.
In a variance-switching environment, the predicted stimulus variance also changes in time (Figure 4A, right). This results in an additional increase in fidelity when the environment is in the high- versus low-variance state, and an asymmetry between the filter responses for downward versus upward switches in variance (Figure 4B, right). Both the encoder and the observer are slower to respond to changes in variance than to changes in mean, and the accuracy of the inference is more sensitive to the strength of the entropy constraint (Figure 4D, right).
When designed to accurately reconstruct the stimulus, the fidelity of the optimal encoder depends only on the predicted stimulus variance. In a mean-switching environment, the variance is fixed (Figure 4A), and thus the fidelity is flat across time. In a variance-switching environment, the fidelity increases with the predicted variance of incoming stimuli, not because variable stimuli are more surprising, but rather because they are larger in magnitude and can lead to higher errors in reconstruction (Figure 4C). As the strength of the entropy constraint increases, the encoder devotes proportionally higher fidelity to high-variance stimuli because they have a greater impact on reconstruction error.
Encoding via stimulus selection
Sensory neurons show sparse activation during natural stimulation (Vinje and Gallant, 2000; Weliky et al., 2003; DeWeese and Zador, 2003), an observation that is often interpreted as a signature of coding cost minimization (Olshausen and Field, 2004; Sterling and Laughlin, 2015). In particular, early and intermediate sensory neurons may act as gating filters, selectively encoding only highly informative features of the stimulus (Rathbun et al., 2010; Miller et al., 2001). Such a selection strategy reduces the number of spikes transmitted downstream.
Here, we consider an encoder that selectively transmits only those stimuli that are surprising and are therefore likely to change the observer’s belief about the state of the environment. When the observer’s prediction is inaccurate, the predicted average surprise $H\left[p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)\right]$ will differ from the true average surprise $H\left[p\left({x}_{t}|{\theta}_{t}\right)\right]$ by an amount equal to the KL-divergence of the predicted from the true stimulus distributions (Materials and methods). In principle, this difference could be used to selectively encode stimuli at times when the observer’s estimate is inaccurate.
In practice, however, the encoder does not have access to the entropy of the true stimulus distribution. Instead, it must measure surprise directly from incoming stimulus samples. The measured surprise of each incoming stimulus sample is given by its negative log probability, $-\mathrm{log}\left[p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)\right]$. We consider an encoder that compares the predicted surprise to a running average of the measured surprise. In this way, the encoder can heuristically assess whether a change in the stimulus distribution has occurred by computing the ‘misalignment’ ${M}_{t}$ between the predicted and measured stimulus distributions:
$${M}_{t}=\frac{1}{T}\sum _{\tau =t-T+1}^{t}\left(-\mathrm{log}\left[p\left({x}_{\tau}|{\overrightarrow{\theta}}_{\tau}\right)\right]\right)-H\left[p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)\right]$$
The misalignment is computed over a time window $T$, which ensures that the observer’s prediction does not gradually drift from the true value in cases where surprising stimuli are not indicative of a change in the underlying stimulus distribution (we use $T=10$). Because the misalignment signal is directly related to the surprise of incoming stimuli, it is symmetric to upward and downward switches in the mean of the stimulus distribution, but it is asymmetric to switches in variance and has a larger magnitude in the highvariance state (shown analytically in Figure 2I–J).
The misalignment signal is both non-stationary and non-Gaussian. Optimizing an encoding scheme based on this signal would require deriving the corresponding optimal observer model, which is difficult to compute in the general case. We instead propose a heuristic (albeit suboptimal) solution, in which the encoder selectively encodes the current stimulus with perfect fidelity (${y}_{t}={x}_{t}$) when recent stimuli are sufficiently surprising and the magnitude of the misalignment signal exceeds a threshold $V$ (Figure 2H). When the magnitude of the misalignment signal falls below the threshold, stimuli are not encoded (${y}_{t}=\varnothing$). At these times, the observer does not receive any information about incoming stimuli, and instead marginalizes over its internal prediction to update its estimate (Materials and methods). The value of the threshold reflects a constraint on overall activity; higher thresholds result in stricter criteria for stimulus selection, which decreases the maximum fidelity of the encoding.
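The gating rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the stimulus distribution is taken to be Gaussian, the observer's prediction (`mu_pred`, `sigma_pred`) is held fixed rather than updated from encoded stimuli, the misalignment is taken as the windowed mean of the measured surprise minus the predicted entropy, and all function names and the threshold value are hypothetical.

```python
import math

def gaussian_entropy(sigma):
    """Predicted average surprise: differential entropy of N(mu, sigma^2), in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def surprise(x, mu, sigma):
    """Measured surprise -log p(x | mu, sigma) of a single Gaussian sample."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)

def select_stimuli(stimuli, mu_pred, sigma_pred, window=10, threshold=1.0):
    """Return y_t = x_t when |M_t| exceeds the threshold, else None (not encoded).

    Assumption: the prediction is fixed; in the full model it would be
    updated by the observer whenever a stimulus is transmitted.
    """
    predicted = gaussian_entropy(sigma_pred)
    recent = []   # surprise of the most recent `window` samples
    encoded = []
    for x in stimuli:
        recent.append(surprise(x, mu_pred, sigma_pred))
        recent = recent[-window:]
        misalignment = sum(recent) / len(recent) - predicted
        encoded.append(x if abs(misalignment) > threshold else None)
    return encoded

# A step in the stimulus from 0 to 3 under a prediction of N(0, 1).
stimuli = [0.0] * 10 + [3.0] * 10
encoded = select_stimuli(stimuli, mu_pred=0.0, sigma_pred=1.0)
first_encoded = next(i for i, y in enumerate(encoded) if y is not None)
print("switch at t = 10, first encoded stimulus at t =", first_encoded)
```

In this toy setting, several surprising samples must accumulate in the window before the misalignment crosses threshold, so transmission begins a few timesteps after the switch rather than immediately.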
When the mean of the stimulus distribution changes in time, very few stimuli are required to maintain an accurate estimate of the environmental state (Figure 5A–B, left). When the environment changes abruptly, the observer’s prediction is no longer aligned with the environment, and the misalignment signal increases until incoming stimuli are encoded and used to adapt the observer’s prediction. Because it requires several stimulus samples for the misalignment to exceed threshold, there is a delay between the switch in the environment and the burst of encoded stimuli. This delay, which is proportional to the size of the threshold, slows the observer’s detection of the change (Figure 5C, left).
When the variance changes in time, the average surprise of incoming stimuli also changes in time. When the variance abruptly increases, the misalignment signal grows both because the observer’s prediction is no longer accurate, and because the average surprise of the incoming stimulus distribution increases. A large proportion of stimuli are transmitted, and the observer quickly adapts to the change. If the threshold is sufficiently high, however, the observer’s prediction never fully aligns with the true state. When the variance abruptly decreases, the incoming stimulus distribution is less surprising on average, and therefore a greater number of stimulus samples is needed before the misalignment signal exceeds threshold. As a result, the observer is slower to detect decreases in variance than increases (Figure 5C, right).
Dynamical signatures of adaptive coding
The preceding sections examined the dynamics of optimal encoding strategies as seen through the internal parameters of the encoder itself. The alignment between these internal parameters and the external dynamics of the environment determines the output response properties of each encoder. It is these output response properties that would give experimental access to the underlying encoding scheme, and that could potentially be used to distinguish an encoding scheme optimized for inference from one optimized for stimulus reconstruction.
To illustrate this, we simulate output responses of each encoder to repeated presentations of the probe environment. In the case of discretization, we use a simple entropy coding procedure to map each of four response levels to four spike patterns ($\left[00\right],\left[01\right],\left[10\right],\left[11\right]$) based on the probability that each response level will be used given the distribution of incoming stimuli, and we report properties of the estimated spike rate (see spike rasters in Figure 6A; Materials and methods). In the cases of filtering and stimulus selection, we report properties of the response ${y}_{t}$.
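One way such a mapping could work is sketched below. It is an assumption for illustration rather than the procedure detailed in Materials and methods: response levels are ranked by how often they are used, and the patterns with the fewest spikes are given to the most probable levels, so that the expected spike count is minimized. The function names and the example probabilities are hypothetical, and the tie between $\left[01\right]$ and $\left[10\right]$ is broken arbitrarily.

```python
def assign_spike_patterns(level_probs):
    """Map each response level to a two-bin spike pattern.

    The most probable level receives the pattern with the fewest spikes.
    """
    patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]  # ordered by spike count
    ranked = sorted(range(len(level_probs)), key=lambda i: -level_probs[i])
    return {level: patterns[rank] for rank, level in enumerate(ranked)}

def expected_spike_count(level_probs, mapping):
    """Average number of spikes emitted per encoded stimulus."""
    return sum(p * sum(mapping[level]) for level, p in enumerate(level_probs))

# Hypothetical usage probabilities for four response levels.
probs = [0.1, 0.6, 0.25, 0.05]
mapping = assign_spike_patterns(probs)
print(mapping)
print(expected_spike_count(probs, mapping))
```

Under this assignment the spike rate rises whenever the stimulus distribution shifts probability toward rarely used response levels, which is one way a change in the environment could appear as a change in firing rate.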
We find that encodings optimized for inference typically show transient changes in neural response properties after a switch in the environment, followed by a return to baseline. This is manifested in a burst in firing rates in the case of discretization, and a burst in response variability in the cases of filtering and stimulus selection. Filtering is additionally marked by a transient decrease in the temporal correlation of the response. The magnitude of these transient changes relative to baseline is most apparent in the case of mean estimation, where the variability in the environment remains fixed over time. Because periods of higher variability in the environment are intrinsically more surprising, baseline response properties change during variance estimation, and bursts relative to baseline are less pronounced. Nevertheless, we see a transient decrease in temporal correlation in the case of filtering, and a transient increase in response variability in the case of stimulus selection, following switches in variance.
The same dynamical features are not observed in encoders optimized for stimulus reconstruction. For mean estimation, firing rates and response variability remain nearly constant over time, despite abrupt changes in the mean of the stimulus distribution. Discretization shows a brief rise and dip in firing rate following a switch, which has been observed experimentally (Fairhall et al., 2001). For variance estimation, response properties show sustained (rather than transient) changes following a switch.
Differences in response properties are tightly coupled to the speed and accuracy of inference, as mediated by the feedforward and feedback interactions between the encoder and the observer. Note that these measures of speed and accuracy (as well as the comparisons made in Figures 3E, 4D, and 5C) intrinsically favor encodings optimized for inference; we therefore restrict our comparison to this set of encodings. We find that both the speed and accuracy of inference are symmetric to changes in the mean of the stimulus distribution, but asymmetric to changes in variance. This is qualitatively consistent with the optimal Bayesian observer in the absence of encoding (DeWeese and Zador, 1998). We find that encoding schemes optimized for inference have a more significant impact on the speed and accuracy of variance estimation than of mean estimation. Interestingly, the speed of variance adaptation deviates from optimality in a manner that could potentially be used to distinguish between encoding strategies. In the absence of encoding, the ideal observer is faster to respond to increases than to decreases in variance. We find that encoding via stimulus selection increases this asymmetry, encoding via discretization nearly removes this asymmetry, and encoding via filtering reverses this asymmetry.
Together, these observations suggest that both the dynamics of the neural response and the patterns of deviation from optimal inference could be used to infer features of the underlying sensory coding scheme. Moreover, these results suggest that an efficient system could prioritize some encoding schemes over others, depending on whether the goal is to reconstruct the stimulus or infer its underlying properties, and if the latter, whether this goal hinges on speed, accuracy, or both.
Adaptive coding for inference under natural conditions
The simplified task used in previous sections allowed us to explore the dynamic interplay between encoding and inference. To illustrate how this behavior might generalize to more naturalistic settings, we consider a visual inference task with natural stimuli (Figure 7A, Materials and methods). In particular, we model the estimation of variance in local curvature in natural image patches, a computation similar to the putative function of neurons in V2 (Ito and Komatsu, 2004). As before, the goal of the system is to infer a change in the statistics of the environment from incoming sensory stimuli. We consider a sequence of image patch stimuli drawn randomly from a local region of a natural image; this sequence could be determined by, for example, saccadic fixations. Each image patch is encoded in the responses of a population of sensory neurons using a well-known sparse-coding model (Olshausen and Field, 1996). After adapting to natural stimulus statistics, the basis functions of each model neuron resemble receptive fields of simple cells in V1. A downstream observer decodes the stimulus from this population response and normalizes its contrast. The contrast-normalized patch is then projected onto a set of curvature filters. The variance in the output of these filters is used as an estimate of the underlying statistics of the image region. Both the computation of local image statistics and visual sensitivity to curvature are known to occur in V2 (Freeman et al., 2013; Ito and Komatsu, 2004; Yu et al., 2015).
The encoder reconstructs each stimulus subject to a sparsity constraint $\lambda $; large values of $\lambda $ decrease the population activity at the cost of reconstruction accuracy (Figure 7—figure supplement 1). In contrast to the encoding models discussed previously, this encoder is explicitly optimized to reconstruct each stimulus, rather than to support accurate inference. Even in this scenario, however, the observer can manipulate the sparsity of the population response to decrease resource use while maintaining an accurate estimate of the environmental state. It has been proposed that early sensory areas, such as V1, could manipulate the use of metabolic resources depending on top-down task demands (e.g. Rao and Ballard, 1999).
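The tradeoff controlled by $\lambda $ can be illustrated with a deliberately simplified sketch. It assumes an orthonormal basis, for which minimizing $\frac{1}{2}{\Vert x-\sum_i a_i \phi_i\Vert}^{2}+\lambda \sum_i |a_i|$ has a closed-form solution given by soft-thresholding the projections of the stimulus onto the basis. The sparse-coding model used here instead relies on a learned, overcomplete basis that requires an iterative solver; the two-dimensional basis, the stimulus, and all function names below are hypothetical.

```python
import math

def soft_threshold(value, lam):
    """Shrink a coefficient toward zero by lam, clipping at zero."""
    return math.copysign(max(abs(value) - lam, 0.0), value)

def sparse_encode(x, basis, lam):
    """Sparse coefficients for an ORTHONORMAL basis (closed-form solution)."""
    projections = [sum(b_k * x_k for b_k, x_k in zip(b, x)) for b in basis]
    return [soft_threshold(c, lam) for c in projections]

def reconstruct(coeffs, basis):
    """Linear reconstruction of the stimulus from its coefficients."""
    dim = len(basis[0])
    return [sum(a * b[k] for a, b in zip(coeffs, basis)) for k in range(dim)]

def squared_error(x, x_hat):
    return sum((u - v) ** 2 for u, v in zip(x, x_hat))

s = 1 / math.sqrt(2)
basis = [(s, s), (s, -s)]   # hypothetical orthonormal basis
x = (2.0, 1.8)              # hypothetical stimulus

for lam in (0.1, 0.5):
    a = sparse_encode(x, basis, lam)
    activity = sum(abs(a_i) for a_i in a)
    error = squared_error(x, reconstruct(a, basis))
    print(f"lambda={lam}: activity={activity:.3f}, error={error:.3f}")
```

Raising $\lambda $ in this sketch silences the weakly driven unit and lowers total activity, while the reconstruction error grows; an observer that adjusts $\lambda $ over time is choosing a point along this tradeoff.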
We model a change in the stimulus distribution by a gaze shift from one region of the image to another (Figure 7B). This shift induces an increase in the variance of curvature filters. Following this change, the observer must update its estimate of local curvature using image patches drawn from the new image region. We empirically estimated the impact of stimulus surprise and observer uncertainty on this estimation and found it to be consistent with results based on model environments (Figure 7D; compare with Figure 1B). Surprising stimuli that project strongly on curvature filters exert a large impact on inference, while expected stimuli (characterized by low centered surprise) exert little impact (Figure 7C–D, F). Similarly, individual stimuli exert a larger impact on the estimate when the observer is uncertain than when the observer is certain (Figure 7D–E).
The system can modulate the sparsity of the population response based on uncertainty and surprise. To illustrate this, we simulated neural population activity in response to a change in each of these quantities (Figure 7E and F, respectively). To do this, we selected a sequence of 45 image patches, 5 of which were chosen to have high centered surprise (Figure 7F; red marker) or to correspond to an observer with high uncertainty (Figure 7E; red marker). An increase in either surprise or uncertainty requires a higher fidelity response to maintain an approximately constant level of inference error. This results in a burst of population activity (blue traces in Figure 7E–F). Similar population bursts were recently observed in V1 in response to violations of statistical regularities in stimulus sequences (Homann et al., 2017). When optimized for constant reconstruction error, the sparsity of the population response remains fixed in time. The resulting population response does not adapt, and instead fluctuates around a constant value determined by $\lambda $ (green traces in Figure 7E–F).
Discussion
Organisms rely on incoming sensory stimuli to infer behaviorally relevant properties of their environment, and hierarchical inference is postulated to be a computational function of a broad range of neural circuits (Lee and Mumford, 2003; Fiser et al., 2010). Representing and transmitting these stimuli, however, is energetically costly, and such costs are known to constrain the design and function of the nervous system (Sterling and Laughlin, 2015). Here, we explored the interplay between efficient encoding and accurate inference, and we identified two general principles that can be used to balance these objectives. First, when the environment is changing over time, the relative utility of incoming stimuli for inference can also change. Second, physically different signals can exert similar influence on the observer’s model of the environment and can therefore be encoded in the same neural representation without negatively affecting the inference process.
We introduced a general theoretical framework that could exploit these two principles in order to dynamically reduce metabolic costs while maintaining accurate inferences about the environment. This framework employs a well-known computational motif consisting of a feedback loop between an observer and an encoder. We demonstrated that when the goal is accurate inference, the encoder can optimally adapt depending on the uncertainty in the observer’s belief about the state of the environment, and on the surprise of incoming stimuli given this belief. This optimal adaptation enables the system to efficiently infer high-level features from low-level inputs, which we argue is a broad goal of neural circuits across the brain. We therefore expect this framework to bear relevance for many different stages of sensory processing, from the periphery through the midbrain to central brain areas.
Transient increases in fidelity signal salient changes in the environment
To maintain low metabolic costs, we found that encoders optimized for inference adapt their encoding strategies in response to the changing utility of incoming stimuli. This adaptation was signaled by elevated periods of response variability, temporal decorrelation, or total activity. Transient, burst-like changes in each of these properties served to increase the fidelity of the neural response, and enabled the system to quickly respond to informative changes in the stimulus distribution. In the nervous system, bursts of high-frequency activity are thought to convey salient changes in an organism’s surroundings (Marsat et al., 2012). For example, in the lateral line lobe of the weakly electric fish, neurons burst in response to electric field distortions similar to those elicited by prey (Oswald et al., 2004), and these bursts are modulated by predictive feedback from downstream neurons (Marsat et al., 2012). Similarly, in the auditory system of the cricket, bursts signal changes in frequency that are indicative of predators, and the amplitude of these bursts is closely linked to the amplitude of behavioral responses (Sabourin and Pollack, 2009; Marsat and Pollack, 2006). In the visual system, retinal ganglion cells fire synchronously in response to surprising changes in the motion trajectory of a stimulus (Schwartz et al., 2007), and layer 2/3 neurons in primary visual cortex show transient elevated activity in response to stimuli that violate statistical regularities in the environment (Homann et al., 2017). Neurons in IT cortex show strong transient activity in response to visual stimuli that violate predicted transition rules (Meyer and Olson, 2011), and recent evidence suggests that single neurons in IT encode latent probabilities of stimulus likelihood during behavioral tasks (Bell et al., 2016).
In thalamus, burst firing is modulated by feedback from cortex (Halassa et al., 2011) and is thought to signal the presence of informative stimuli (Lesica and Stanley, 2004; Miller et al., 2001; Rathbun et al., 2010). In the auditory forebrain of the zebra finch, neural activity is better predicted by the surprise of a stimulus than by its spectrotemporal content (Gill et al., 2008), and brief synchronous activity is thought to encode a form of statistical deviance of auditory stimuli (Beckers and Gahr, 2012). We propose that this broad range of phenomena could be indicative of an active data selection process controlled by a top-down prediction of an incoming stimulus distribution, and could thus serve as an efficient strategy for encoding changes in the underlying statistics of the environment. While some of these phenomena appear tuned to specific stimulus modulations (such as those elicited by specific types of predators or prey), we argue that transient periods of elevated activity and variability more generally reflect an optimal strategy for efficiently inferring changes in high-level features from low-level input signals.
In some cases, it might be more important to reconstruct details of the stimulus itself, rather than to infer its underlying cause. In such cases, we found that the optimal encoder maintained consistently higher firing rates and more heterogeneous response patterns. In both the cricket (Sabourin and Pollack, 2010) and the weakly electric fish (Marsat et al., 2012), heterogeneous neural responses were shown to encode stimulus details relevant for evaluating the quality of courtship signals (in contrast to the bursts of activity that signal the presence of aggressors). While separate circuits have been proposed to implement these two different coding schemes (inferring the presence of an aggressor versus evaluating the quality of a courtship signal), these two strategies could in principle be balanced within the same encoder. The signatures of adaptation that distinguish these strategies could alternatively be used to identify the underlying goal of a neural encoder. For example, neurons in retina can be classified as ‘adapting’ or ‘sensitizing’ based on the trajectory of their firing rates following a switch in stimulus variance (Kastner and Baccus, 2011). These trajectories closely resemble the response entropies of encoders optimized for inference or reconstruction, respectively (right panel of Figure 3D). A rigorous application of the proposed framework to the identification of neural coding goals is a subject of future work.
Importantly, whether the goal is inference or stimulus reconstruction, the encoders considered here were optimized based on predictive feedback from a downstream unit and thus both bear similarity to hierarchical predictive coding as formulated by Rao and Ballard (1999). The goal, however, crucially determines the difference between these strategies: sustained heterogeneous activity enables reconstruction of stimulus details, while transient bursts of activity enable rapid detection of changes in their underlying statistics.
Periods of stationarity give rise to ambiguous stimulus representations
A central idea of this work is that stimuli that are not useful for a statistical estimation task need not be encoded. This was most notably observed during periods in which an observer maintained an accurate prediction of a stationary stimulus distribution. Here, different stimuli could be encoded by the same neural response without impacting the accuracy of the observer’s prediction. This process ultimately renders stimuli ambiguous, and it predicts that the discriminability of individual stimuli should decrease over time as the system’s internal model becomes aligned with the environment (Materials and methods, Figure 6—figure supplement 1). Ambiguous stimulus representations have been observed in electrosensory pyramidal neurons of the weakly electric fish, where adaptation to the envelope of the animal’s own electric field (a second-order statistic analogous to the variance step considered here) reduces the discriminability of specific amplitude modulations (Zhang and Chacron, 2016). Similarly, in the olfactory system of the locust, responses of projection neurons to chemically similar odors are highly distinguishable following an abrupt change in the odor environment, but become less distinguishable over time (Mazor and Laurent, 2005). The emergence of ambiguous stimulus representations has recently been observed in human perception of auditory textures that are generated from stationary sound sources such as flowing water, humming wind, or large groups of animals (McDermott et al., 2013). Human listeners are readily capable of distinguishing short excerpts of sounds generated by such sources. Surprisingly, however, when asked to tell apart long excerpts of auditory textures, performance sharply decreases. We propose that this steady decrease in performance with excerpt duration reflects adaptive encoding for accurate inference, where details of the stimulus are lost over time in favor of their underlying statistical summary.
Efficient use of metabolic resources yields diverse signatures of suboptimal inference
We used an ideal Bayesian observer to illustrate the dynamic relationship between encoding and inference. Ideal observer models have been widely used to establish fundamental limits of performance on different sensory tasks (Geisler et al., 2009; Geisler, 2011; Weiss et al., 2002). The Bayesian framework in particular has been used to identify signatures of optimal performance on statistical estimation tasks (Simoncelli, 2009; Robert, 2007), and a growing body of work suggests that neural systems explicitly perform Bayesian computations (Deneve, 2008; Fiser et al., 2010; Ma et al., 2006b; Rao et al., 2002). In line with recent studies (Wei and Stocker, 2015; Ganguli and Simoncelli, 2014), we examined the impact of limited metabolic resources on such probabilistic neural computations.
While numerous studies have identified signatures of near-optimal performance in both neural coding (e.g. Wark et al., 2009) and perception (e.g. Burge and Geisler, 2015; Weiss et al., 2002), the ideal observer framework can also be used to identify deviations from optimality. Such deviations have been ascribed to noise (Geisler, 2011) and suboptimal neural decoding (Putzeys et al., 2012). Here, we propose that statistical inference can deviate from optimality as a consequence of efficient, resource-constrained stimulus coding. We observed deviations from optimality in both the speed and accuracy of inference, and we found that some of these deviations (namely asymmetries in the speed of variance adaptation) could potentially be used to differentiate the underlying scheme that was used to encode incoming stimuli. It might therefore be possible to infer underlying adaptation strategies by analyzing patterns of suboptimal inference.
Limitations and future work
We discussed general principles that determine optimal encoding strategies for accurate inference, and we demonstrated the applicability of these principles in simple model systems. Understanding the applicability in more complex settings and for specific neural systems requires further investigation.
Complexity of the environment
We considered a simple non-stationary environment whose dynamics varied on a single timescale. These dynamics were parameterized by a single latent variable that specified either the mean or the variance of a Gaussian stimulus distribution. These first- and second-order moments are basic properties of an input distribution and often correspond to interpretable, physical properties such as luminance or local contrast. Similar stimulus distributions have been used to study a range of neural and perceptual dynamics, including adaptation of fly visual neurons to changes in luminance and contrast (Fairhall et al., 2001), neural representations of electric field modulations in the weakly electric fish (Zhang and Chacron, 2016), and human perceptual decision making (Nassar et al., 2010). Here, we used this simple environment to probe the dynamics of encoding schemes optimized for inference. We found that optimal encoding schemes respond strongly to changes in the underlying environmental state, and thereby carry information about the timescale of environmental fluctuations. In natural settings, signals vary over a range of temporal scales, and neurons are known to be capable of adapting to multiple timescales in their inputs (Lundstrom et al., 2008; Wark et al., 2009). We therefore expect that more complex environments, for example those in which the environmental state can both switch between distinct distributions and fluctuate between values within a single distribution, will require that the encoder respond to environmental changes on multiple timescales.
In all such cases, we expect the dimensionality of the latent variable space to determine the lower bound on coding costs for inference. Even in the limit of highly complex models, however, we expect accurate inference and reconstruction to impose qualitatively different constraints on neural response properties.
Diversity of sensory encoding schemes
We considered three encoding schemes that approximate known features of neural responses, and as such could be implemented broadly across the brain. Discretization is a nonlinear encoding scheme that specifies a finite set of instantaneous response levels (such as spiking patterns or discriminable firing rates) and provides a good model of retinal ganglion cell responses (e.g. Koch et al., 2004). Temporal filtering, on the other hand, is a linear encoding scheme that forms the basis of a broad class of linear-nonlinear (LN) models. These models have been used to describe neural responses in a range of systems (Sharpee, 2013), and can capture temporal dependencies in the neural response. To more closely approximate spiking nonlinearities observed in real neurons, the linear output of this encoder could be followed by a nonlinearity whose parameters are also adapted over time, thereby enabling the system to more strongly suppress irrelevant stimuli. Finally, our model of stimulus selection implements a form of gating, whereby unsurprising stimuli are not encoded. This nonlinear encoding scheme produces bimodal responses (either strongly active or completely silent), and we would therefore expect such a mechanism to be useful when transmitting signals over long distances. This scheme can also be viewed as implementing a partitioning of the stimulus space into surprising and unsurprising stimuli, similar to discretization.
In order to achieve optimal bounds on performance, the parameters of each encoding scheme were computed and updated on each timestep. While it is known that neural systems can adapt on timescales approaching physical limits (Fairhall et al., 2001), it is possible that more complex neural circuits might implement a heuristic version of this adaptation that operates on slower timescales.
Together, these approaches provide a framework for studying adaptive coding across a broad class of neural encoding schemes. This framework can be implemented with other encoding schemes, such as population or spike-time coding. In such cases, we expect that the principles identified here, including increased coding fidelity during periods of uncertainty or surprise, will generalize across encoding schemes to determine optimal strategies of adaptation.
Robustness to noise
Noise can arise at different stages of neural processing and can alter the faithful encoding and transmission of stimuli to downstream areas (Roddey et al., 2000; Brinkman et al., 2016). Individual neurons and neural populations can combat the adverse effects of noise by appropriately tuning their coding strategies, for example by adjusting the gain or thresholds of individual neurons (van Hateren, 1992; Gjorgjieva et al., 2017), introducing redundancies between neural responses (Doi and Lewicki, 2014; Tkacik et al., 2010; Moreno-Bote et al., 2014; Abbott and Dayan, 1999; Sompolinsky et al., 2001), and forming highly distributed codes (Denève and Machens, 2016; Deneve and Chalk, 2016). Such optimal coding strategies depend on the source, strength, and structure of noise (Brinkman et al., 2016; Tkacik et al., 2010; van Hateren, 1992; Kohn et al., 2016), and can differ significantly from strategies optimized in the absence of noise (Doi and Lewicki, 2014).
Noise induced during encoding stages can affect downstream computations, such as the class of inference tasks considered here. To examine its impact on optimal inference, we injected additive Gaussian noise into the neural response transmitted from the discretizing encoder to the observer. We found that the accuracy of inference was robust to low levels of noise, but degraded quickly once the noise variance approached the degree of separation between environmental states (Figure 3—figure supplement 2). Although this form of Gaussian transmission noise was detrimental to the inference process, previous work has argued that noise-related variability, if structured appropriately across a population of encoders, could support representations of the probability distributions required for optimal inference (Ma et al., 2006a). Moreover, we expect that the lossy encoding schemes developed here could be beneficial in combating noise injected prior to the encoding step, as they can guarantee that metabolic resources are not wasted in the process of representing noise fluctuations.
Ultimately, the source and degree of noise can impact both the goal of the system and the underlying coding strategies. Here, we considered the goal of optimally inferring changes in environmental states. However, in noisy environments where the separation between latent environmental states is low, a system might need to remain stable in the presence of noise, rather than flexible to environmental changes. We expect the optimal balance between stability and flexibility to be modulated by the spread of the stimulus distribution relative to the separation between environmental states. A thorough investigation of potential sources of noise, and their impact on the balance between efficient coding and optimal inference, is the subject of future work.
Measures of optimal performance
To measure the optimal bound on inference error, we used the mean squared difference between point estimates derived in the presence and absence of an encoding step. This metric is general and makes no assumptions about the form of the posterior distribution (Jaynes, 2003; Robert, 2007). Other measures, such as KL-divergence, could be used to capture not only changes in point estimates, but also changes in uncertainty underlying these estimates.
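The difference between these two kinds of measure can be made concrete with a small sketch. The posteriors below are hypothetical discrete distributions over three environmental states, chosen so that their means coincide while their spreads differ; the function names and the direction of the KL-divergence (posterior without encoding relative to posterior with encoding) are assumptions made for illustration.

```python
import math

def mean_squared_difference(estimates_with_encoding, estimates_without_encoding):
    """Mean squared difference between two sequences of point estimates."""
    pairs = list(zip(estimates_with_encoding, estimates_without_encoding))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

def posterior_mean(states, probs):
    """Point estimate taken as the mean of a discrete posterior."""
    return sum(s * p for s, p in zip(states, probs))

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on a shared support."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

states = [-1.0, 0.0, 1.0]
without_encoding = [0.1, 0.8, 0.1]   # confident posterior (hypothetical)
with_encoding = [0.4, 0.2, 0.4]      # uncertain posterior, same mean (hypothetical)

point_error = mean_squared_difference(
    [posterior_mean(states, with_encoding)],
    [posterior_mean(states, without_encoding)],
)
divergence = kl_divergence(without_encoding, with_encoding)
print("squared difference of point estimates:", point_error)
print("KL-divergence between posteriors:", divergence)
```

In this example the point estimates agree, so the squared difference is zero, while the KL-divergence is positive because the encoding step has broadened the posterior.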
Connections to existing theoretical frameworks
Efficient coding of task-relevant information has been studied before, primarily within the framework of the Information Bottleneck (IB) method (Tishby et al., 2000; Chechik et al., 2005; Strouse and Schwab, 2016). The IB framework provides a general theoretical approach for extracting task-relevant information from sensory stimuli, and it has been successfully applied to the study of neural coding in the retina (Palmer et al., 2015) and in the auditory cortex (Rubin et al., 2016). In parallel, Bayesian Efficient Coding (BEC) has recently been proposed as a framework through which a metabolically constrained sensory system could minimize an arbitrary error function that could, as in IB, be chosen to reflect task-relevant information (Park and Pillow, 2017). However, neither framework (IB nor BEC) explicitly addresses the issue of adaptive sensory coding in non-stationary environments, where the relevance of different stimuli can change in time. Here, we frame general principles that constrain the dynamic balance between coding cost and task relevance, and we pose neurally plausible implementations.
Our approach bears conceptual similarities to the predictive coding framework proposed by Rao and Ballard (1999), in which low-level sensory neurons support accurate stimulus reconstruction by encoding the residual error between an incoming stimulus and a top-down prediction of the stimulus. Our encoding schemes similarly use top-down predictions to encode useful deviations in the stimulus distribution. Importantly, however, the goal here was not to reconstruct the stimulus itself, but rather to infer the underlying properties of a changing stimulus distribution. To this end, we considered encoding schemes that could use top-down predictions to adaptively adjust their strategies over time based on the predicted utility of different stimuli for supporting inference.
This work synthesizes different theoretical frameworks in an effort to clarify their mutual relationship. In this broad sense, our approach aligns with recent studies that aim to unify frameworks such as efficient coding and Bayesian inference (Park and Pillow, 2017), as well as concepts such as efficient, sparse, and predictive coding (Chalk et al., 2017).
Outlook
Efficient coding and probabilistic inference are two prominent frameworks in theoretical neuroscience that address the separate questions of how stimuli can be encoded at minimal cost, and how stimuli can be used to support accurate inferences. In this work, we bridged these two frameworks within a dynamic setting. We examined optimal strategies for encoding sensory stimuli while minimizing the error that such encoding induces in the inference process, and we contrasted these with strategies designed to optimally reconstruct the stimulus itself. These two goals could correspond to different regimes of the same sensory system (Balasubramanian et al., 2001), and future work will explore strategies for balancing these regimes depending on task requirements. In order to test the implications of this work for physiology and behavior, it will be important to generalize this framework to more naturalistic stimuli, noisy encodings, and richer inference tasks. At present, our results identify broad signatures of a dynamical balance between metabolic costs and task demands that could potentially explain a wide range of phenomena in both neural and perceptual systems.
Materials and methods
A. Optimal Bayesian inference with adaptively encoded stimuli
We describe a class of discrete-time environmental stimuli ${x}_{t}$ whose statistics are completely characterized by a single time-varying environmental state variable ${\theta}_{t}$.
We then consider the scenario in which these stimuli are encoded in neural responses, and it is these neural responses that must be used to construct the posterior probability over environmental states. In what follows, we derive the optimal Bayesian observer for computing this posterior given the history of neural responses. The steps of this estimation process are summarized in Figure 1—figure supplement 1.
In a full Bayesian setting, the observer should construct an estimate of the stimulus distribution, $p\left({x}_{t}\right)$, by marginalizing over its uncertainty in the estimate of the environmental state ${\theta}_{t}$ (i.e. by computing $p\left({x}_{t}\right)=\int d{\theta}_{t}\,p\left({x}_{t}|{\theta}_{t}\right)p\left({\theta}_{t}\right)$). For simplicity, we avoid this marginalization by assuming that the observer’s belief is well-summarized by the average of the posterior, which is captured by the point value ${\hat{\theta}}_{t}=\int d{\theta}_{t}\,{\theta}_{t}\,p\left({\theta}_{t}\right)$ for estimation, and ${\overrightarrow{\theta}}_{t+1}=\int d{\theta}_{t+1}\,{\theta}_{t+1}\,p\left({\theta}_{t+1}\right)$ for prediction. The average of the posterior is an optimal scalar estimate that minimizes the mean squared error between the estimated and true states of the environment, and is known to provide a good description of both neural (DeWeese and Zador, 1998) and perceptual (Nassar et al., 2010) dynamics. The observer then uses these point values to condition its prediction of the stimulus distribution, $p\left({x}_{t}|{\overrightarrow{\theta}}_{t}\right)$. Conditioning on a point estimate guarantees that the observer’s prediction of the environment belongs to the same family of distributions as the true environment. This is not guaranteed to be the case when marginalizing over uncertainty in ${\theta}_{t}$. For example, if the posterior assigns non-zero probability mass to two different mean values of a unimodal stimulus distribution, the predicted stimulus distribution could be bimodal, even if the true stimulus distribution is always unimodal. We verified numerically that the key results of this work are not affected by approximating the full marginalization with point estimates.
When the timescale of the environment dynamics is sufficiently slow, the point prediction ${\overrightarrow{\theta}}_{t+1}$ can be approximated by the point estimate ${\hat{\theta}}_{t}$. In the two-state environments considered here, the predicted probability that the environment is in the low state at time $t+1$, given the posterior ${P}_{t}^{L}$ at time $t$, is equal to ${P}_{t+1}^{L}={P}_{t}^{L}(1-h)+(1-{P}_{t}^{L})h$, where $h$ is the hazard rate (DeWeese and Zador, 1998). For the small hazard rate used here ($h=0.01$), ${P}_{t+1}^{L}=0.99{P}_{t}^{L}+0.01(1-{P}_{t}^{L})$, and the estimate ${\hat{\theta}}_{t}$ is therefore a very close approximation of the prediction ${\overrightarrow{\theta}}_{t+1}$. All results presented in the main text were computed using this approximation (i.e. ${\overrightarrow{\theta}}_{t+1}\approx {\hat{\theta}}_{t}$). With this approximation, the optimal Bayesian observer computes the approximate posterior distribution $p\left({\theta}_{t}|{y}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$, conditioned on the history of neural responses ${y}_{\tau \le t}$ and the history of point estimates ${\hat{\theta}}_{\tau <t}$. In the remainder of the Materials and methods, we will formulate all derivations and computations in terms of the history of past estimates (up to and including time $t-1$), with the understanding that these estimates can be used as approximate predictions of the current state at time $t$.
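The size of this approximation error can be checked with a short numerical sketch. The function names (`predict_low`, `point_value`) and the state values $\theta^L=-1$, $\theta^H=1$ are illustrative assumptions introduced here, not part of the original analysis; only the update rule and the hazard rate $h=0.01$ are taken from the text.

```python
import numpy as np

def predict_low(p_low, h):
    """Predicted probability of the low state at t+1, given the posterior at t."""
    return p_low * (1.0 - h) + (1.0 - p_low) * h

def point_value(p_low, theta_low, theta_high):
    """Posterior-mean summary of a two-state belief."""
    return p_low * theta_low + (1.0 - p_low) * theta_high

h = 0.01                             # hazard rate used in the text
theta_low, theta_high = -1.0, 1.0    # illustrative state values (assumption)
p_grid = np.linspace(0.0, 1.0, 101)  # candidate posteriors P_t^L

estimate = point_value(p_grid, theta_low, theta_high)                    # hat-theta_t
prediction = point_value(predict_low(p_grid, h), theta_low, theta_high)  # predicted theta_{t+1}

# Largest gap between prediction and estimate over all possible posteriors
max_gap = np.max(np.abs(prediction - estimate))
print(max_gap)
```

Substituting the update rule into the point value shows that the gap equals $h\,(1-2{P}_{t}^{L})({\theta}^{L}-{\theta}^{H})$, so its magnitude never exceeds $h\,|{\theta}^{H}-{\theta}^{L}|$.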
With these simplifications, the general steps of the inference process can be broken down as follows:
Encoder: maps incoming stimuli ${x}_{\tau \le t}$ onto a neural response ${y}_{t}$ by sampling from the ‘encoding distribution’ $p\left({y}_{t}|{x}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$
Decoder: uses Bayes’ rule to compute the conditional distribution of a stimulus ${x}_{t}$ given the neural response ${y}_{t}$, which we refer to as the ‘decoding distribution’ $p\left({x}_{t}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$
Observer: uses the neural response ${y}_{t}$ to update the posterior $p\left({\theta}_{t}|{y}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$. This can be broken down into the following steps, in which the observer:
Combines the previous posterior $p\left({\theta}_{t-1}|{y}_{\tau <t},{\hat{\theta}}_{\tau <t-1}\right)$ with knowledge of environment dynamics $p\left({\theta}_{t}|{\theta}_{t-1}\right)$ to compute the probability distribution of ${\theta}_{t}$ given all past data, $p\left({\theta}_{t}|{y}_{\tau <t},{\hat{\theta}}_{\tau <t-1}\right)$
Uses Bayes’ rule to incorporate a new stimulus ${x}_{t}$ and form $p\left({\theta}_{t}|{x}_{t},{y}_{\tau <t},{\hat{\theta}}_{\tau <t-1}\right)$
Marginalizes over the uncertainty in ${x}_{t}$ using the decoding distribution $p\left({x}_{t}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$, thereby obtaining the updated posterior $p\left({\theta}_{t}|{y}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$ (which can be averaged to compute the point estimate ${\hat{\theta}}_{t}$)
Combines the updated posterior with knowledge of environment dynamics $p\left({\theta}_{t+1}|{\theta}_{t}\right)$ to generate a predicted distribution of environmental states $p\left({\theta}_{t+1}|{y}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$ (which can be averaged to compute the point prediction ${\overrightarrow{\theta}}_{t+1}$)
Feedback loop: sends the prediction back upstream to update the encoder.
In what remains of this section, we derive the general equations for the full inference process in the presence of both encoding and decoding. In Section B, we derive the specific forms of the inference equations in a simplified, two-state environment. We first focus on the general equations of the observer model (Section B.2). We then describe the forms of the encoding and decoding distributions implemented by the three different encoding schemes considered in this paper, and detail how the parameters of each encoder can be optimized based on the observer’s prediction of the environmental state (Sections B.3–B.6). In Section C, we describe the numerical approximations used to simulate the results presented in the main paper.
A.1. Environment dynamics
We consider a nonstationary environment with Markovian dynamics. The dynamics of the environmental state variable ${\theta}_{t}$ are then specified by the distribution $p\left({\theta}_{t}|{\theta}_{t-1}\right)$. At each time $t$, the value of ${\theta}_{t}$ specifies the distribution of stimuli $p\left({x}_{t}|{\theta}_{t}\right)$.
A.2. Encoder
We consider an encoder that maps incoming stimuli ${x}_{\tau \le t}$ onto a neural response ${y}_{t}$. We assume that the encoder has access to the history of estimates ${\hat{\theta}}_{\tau <t}$ (fed back from a downstream observer) to optimally encode incoming stimuli via the ‘encoding distribution’, $p\left({y}_{t}|{x}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$.
A.3. Decoder
Because the observer does not have direct access to the stimulus, it must first decode the stimulus from the neural response. We assume that the decoder has access to the instantaneous neural response ${y}_{t}$ and the history of past estimates ${\hat{\theta}}_{\tau <t}$. The decoder must use these signals to marginalize over past stimuli ${x}_{\tau <t}$ and compute the probability of the response ${y}_{t}$ conditioned on the current stimulus ${x}_{t}$ (this probability will later be used to update the observer’s posterior):
The decoder must then invert this distribution (using Bayes’ rule) to estimate the probability of the stimulus ${x}_{t}$ given the response ${y}_{t}$ and past estimates ${\hat{\theta}}_{\tau <t}$:
where we have written the distribution in the denominator as a normalization constant obtained by integrating the numerator:
In what follows, we refer to $p\left({x}_{t}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$ (defined in Equation 6) as the ‘decoding distribution’.
A.4. Observer
The optimal observer should use the decoding distribution to marginalize over its uncertainty about the true value of the stimulus ${x}_{t}$ and thereby obtain the posterior probability of ${\theta}_{t}$ given past responses ${y}_{\tau \le t}$ and past estimates ${\hat{\theta}}_{\tau <t}$. To do this, we first write an expression for the probability of ${\theta}_{t}$ given all data up to (but not including) the current timestep:
where the prior is taken to be the posterior from the last timestep, and the distribution $p\left({\theta}_{t}|{\theta}_{t-1}\right)$ governs the dynamics of the environment.
This distribution can then be combined with a new stimulus ${x}_{t}$:
As before, we have written the distribution in the denominator as a normalization constant obtained by integrating the numerator:
Finally, we marginalize over the unknown value of the signal ${x}_{t}$ using the decoding distribution $p\left({x}_{t}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$ to obtain the updated posterior distribution:
To form a prediction about the future state of the environment, the observer should combine its belief $p\left({\theta}_{t}|{y}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$ about the current state of the environment with the knowledge $p\left({\theta}_{t+1}|{\theta}_{t}\right)$ of the environment dynamics in a manner analogous to Equation 8.
A.5. Computing point estimates
The posterior can be used to compute a point estimate ${\hat{\theta}}_{t}$ and prediction ${\overrightarrow{\theta}}_{t+1}$ of the environmental state:
The point estimate given in Equation 12 is referred to in the main text as '${\hat{\theta}}_{y,t}$'. We distinguish this from the point estimate '${\hat{\theta}}_{x,t}$', which was derived in DeWeese and Zador (1998) in the absence of encoding/decoding.
B. Model environments
B.1. Environment dynamics
We consider a two-state environment in which the state ${\theta}_{t}$ can take one of two values, ${\theta}^{L}$ and ${\theta}^{H}$. At each timestep, the environment can switch states with a constant probability $h$, referred to as the ‘hazard rate’. The hazard rate fully specifies the dynamics of the environment:
where ${z}_{t}$ is a binary random variable equal to 1 with probability $h$ and 0 with probability $1-h$.
We take ${\theta}_{t}$ to parametrize either the mean $\mu $ or the standard deviation $\sigma $ of a Gaussian stimulus distribution:
B.2. Observer
In a two-state environment, the posterior distribution $p\left({\theta}_{t}|{y}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$ can be summarized by a single value ${P}_{t}^{L}=p\left({\theta}_{t}={\theta}^{L}|{y}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$, which is the probability that the environment is in the low state at time $t$.
Given the posterior ${P}_{t-1}^{L}$ at the previous timestep, the distribution for ${\theta}_{t}$ given all past data is given by:
where $h$ is the a priori probability that a switch occurred at the current timestep. This distribution can then be combined with a new stimulus ${x}_{t}$:
The variables (${\mu}_{L}$, ${\sigma}_{L}$) and (${\mu}_{H}$, ${\sigma}_{H}$) correspond to the mean and standard deviation of the stimulus distribution in the low and high states, respectively, and their values vary depending on the type of environment (mean-switching versus variance-switching).
To obtain the updated posterior ${P}_{t}^{L}$, we marginalize over the decoding distribution $p\left({x}_{t}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$:
The posterior can be used to construct a new point estimate ${\hat{\theta}}_{t}$ of the environmental state:
where $1-{P}_{t}^{L}={P}_{t}^{H}$ is the probability that the environment is in the high state at time $t$. Note that although the environmental states are discrete, the optimal Bayesian observer maintains a continuous estimate ${\hat{\theta}}_{t}$.
To form a prediction about the future state of the environment, the observer first combines the posterior ${P}_{t}^{L}$ with knowledge of environment dynamics (in a manner analogous to Equation 16), and then computes a point prediction (in a manner analogous to Equation 19):
For small hazard rates (as considered here), the predicted value ${\overrightarrow{\theta}}_{t+1}$ is very close to the current estimate ${\hat{\theta}}_{t}$. For simplicity, we approximate the prediction ${\overrightarrow{\theta}}_{t+1}$ by the estimate ${\hat{\theta}}_{t}$. This estimate is then fed back upstream and used to update the encoder. In the general case, however, one should compute the full predicted distribution of environmental states via Equation 20, and use this distribution to optimize the encoder.
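The two-state update described in this section can be sketched in a few lines. The sketch below assumes the observer has direct access to the stimulus ${x}_{t}$ (that is, it omits the marginalization over the decoding distribution), and the function names and the `(mu, sigma)` tuple convention are hypothetical choices made for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def update_posterior(p_low_prev, x, h, low, high):
    """One observer step for a two-state environment.

    low and high are (mu, sigma) tuples for the low and high states.
    The observer is assumed to see x directly (no encoding/decoding).
    """
    # Propagate the previous posterior through the switching dynamics
    prior_low = p_low_prev * (1.0 - h) + (1.0 - p_low_prev) * h
    # Apply Bayes' rule with Gaussian likelihoods for each state
    num = prior_low * gaussian_pdf(x, *low)
    return num / (num + (1.0 - prior_low) * gaussian_pdf(x, *high))

def point_estimate(p_low, theta_low, theta_high):
    """Posterior mean of the environmental state."""
    return p_low * theta_low + (1.0 - p_low) * theta_high

# Usage: a variance-switching environment with sigma_L = 1, sigma_H = 2
low, high = (0.0, 1.0), (0.0, 2.0)
p = 0.5
for x in [0.1, -0.3, 3.5]:
    p = update_posterior(p, x, h=0.01, low=low, high=high)
    print(p, point_estimate(p, theta_low=1.0, theta_high=2.0))
```

Because the posterior is a single number in this setting, the point estimate varies continuously between ${\theta}^{L}$ and ${\theta}^{H}$ even though the underlying states are discrete.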
B.3. Encoder/decoder
The posterior (given in Equation 18) is a function of the decoding distribution $p\left({x}_{t}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$, which depends on the encoding distribution $p\left({y}_{t}|{x}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$ through Equations 5–6. In what follows, we derive the encoding and decoding distributions for the three encoding schemes considered in this paper. All three encoding schemes are noiseless; as a result, the encoding distribution $p\left({y}_{t}|{x}_{\tau \le t},{\hat{\theta}}_{\tau <t}\right)$ reduces to a delta function in each case. This encoding distribution can then be used to derive the decoding distribution, from which it is straightforward to compute the posterior ${P}_{t}^{L}$ via Equation 18 (and similarly any estimates and predictions derived from the posterior).
Each encoding scheme considered here was parametrized by one or more encoding parameters. In two of the three encoding schemes, these parameters were chosen to minimize an error function $E(x,y)$, subject to a constraint on the fidelity of the encoding. We defined this error function with respect to inference or stimulus reconstruction:
where ${\hat{\theta}}_{y}$ was defined in Equation 12, and ${\hat{\theta}}_{x}$ was derived in DeWeese and Zador (1998).
B.4. Limited neural response levels: encoding via discretization
B.4.1. Encoder
Here, we consider a quantization (instantaneous discretization) of the stimulus space that maps the current stimulus ${x}_{t}$ onto one of a discrete set of values $\left\{{y}_{t}^{i}\right\}$, where $i=1,2,\dots ,N$ labels distinct response levels. This mapping is performed deterministically by choosing the response level that minimizes the instantaneous error $E\left({x}_{t},\left\{{y}_{t}^{i}\right\}\right)$:
We can therefore write the encoding distribution as a delta function:
where the set of response levels $\left\{{y}_{t}^{i}\right\}$ implicitly contains the dependence on ${\hat{\theta}}_{t-1}$.
B.4.2. Decoder
The decoder must estimate the probability of a stimulus ${x}_{t}$, given that the observed response was ${y}_{t}^{i}$. In principle, the response ${y}_{t}^{i}$ could have been generated by any stimulus in the range $[{y}_{t}^{i,L},{y}_{t}^{i,H}]$, where ${y}_{t}^{i,L}$ and ${y}_{t}^{i,H}$ are the lower and upper bounds of the bin represented by level ${y}_{t}^{i}$, respectively.
The decoding distribution can then be written as a truncated Gaussian distribution:
where $Z\left({y}_{t}^{i,L},{y}_{t}^{i,H},{\hat{\theta}}_{t-1}\right)$ is a normalization constant. For simplicity, we approximated this truncated Gaussian distribution with a delta function:
We verified numerically that this approximation did not impact our results.
B.4.3. Determining the optimal response levels
At each point in time, the optimal set of response levels ${\left\{{y}_{t}^{i}\right\}}^{\ast}$ was found by minimizing the following equation:
subject to a hard constraint on the number of response levels. When optimizing for mean-switching environments, we defined the error function with respect to the raw stimulus and neural response (i.e. $E=E(x,y)$). When optimizing for variance-switching environments, we defined the error function with respect to the absolute value of the stimulus and neural response (i.e. $E=E\left(|x|,|y|\right)$). We computed $\langle E\left({x}_{t},\left\{{y}_{t}^{i}\right\}\right)\rangle$ numerically; see Section C.3.1.
B.5. Limited gain and temporal acuity: encoding via temporal filtering
B.5.1. Encoder
In this encoding scheme, we consider a simple temporal filter parameterized by the coefficient ${\alpha}_{t}$. This filter linearly combines current (${x}_{t}$) and past (${x}_{t-1}$) stimuli:
The encoding distribution is then given by:
where the filtering coefficient ${\alpha}_{t}$ implicitly contains the dependence on ${\hat{\theta}}_{t-1}$.
B.5.2. Decoder
The encoding is a function of both current and past stimuli. The decoder, however, only has access to the current response ${y}_{t}$. In order to estimate the probability that this response was generated by the stimulus ${x}_{t}$, the decoder must first use the internal estimates ${\hat{\theta}}_{\tau <t}$ to marginalize over uncertainty in past stimuli ${x}_{\tau <t}$. This was first outlined in Equation 5, which reduces here to:
The decoder can then use Bayes’ rule to invert this distribution and determine the probability of the stimulus ${x}_{t}$ given the response ${y}_{t}$:
In its current form, this decoding distribution is written as a Gaussian over the variable ${y}_{t}$. Ultimately, the observer must use this decoding distribution to marginalize over uncertainty in ${x}_{t}$. In Appendix 1, we walk through the algebra needed to rewrite this distribution as a Gaussian over ${x}_{t}$. The final form of this distribution is given by:
B.5.3. Determining the optimal filter coefficient
The optimal filtering coefficient ${\alpha}_{t}^{\ast}$ was found by minimizing the following equation:
The error term, $\langle E\left({x}_{t},{y}_{t}\right)\rangle$, was computed numerically; see Section C.3.2. The entropy term, $H\left({y}_{t},{y}_{t+1}|{\hat{\theta}}_{\tau <t}\right)$, can be computed analytically (see Appendix 2 for details):
B.6. Limited neural activity: encoding via dynamic stimulus selection
B.6.1. Encoder
In this encoding scheme, the encoder uses the misalignment signal ${M}_{t}$ to determine whether or not to encode and transmit the stimulus ${x}_{t}$. If the magnitude of the misalignment signal exceeds the threshold $V$, the stimulus is encoded and transmitted. Otherwise, the stimulus is not encoded, and a ‘null symbol’ is transmitted to the observer. For the purposes of computing the encoding and decoding distributions, we use ${y}_{t}=0$ to denote the null symbol (in the main text, we denoted the null symbol by ${y}_{t}=\varnothing $).
This encoding is a deterministic mapping of the stimulus ${x}_{t}$ onto the response ${y}_{t}$, dependent upon the misalignment signal ${M}_{t}$. The encoding distribution can thus be written in a probabilistic form as a mixture of two delta functions:
where ${M}_{t}$ implicitly contains the dependence on ${\hat{\theta}}_{t-1}$.
B.6.2. Decoder
In this scheme, the form of the decoding distribution depends on whether or not the encoder transmits the stimulus ${x}_{t}$. If the stimulus was encoded and transmitted, there is no uncertainty in its value, and the decoding distribution is a delta function about ${y}_{t}$. If the stimulus was not encoded and the null symbol was instead transmitted, the decoder can only assume that the stimulus came from the estimated stimulus distribution $p\left({x}_{t}|{\hat{\theta}}_{t-1}\right)$.
The decoding distribution therefore takes the following form:
B.6.3. Determining the misalignment signal
In defining this encoding scheme, our aim was to construct a heuristic ‘misalignment’ signal that would alert the encoder to a change in the stimulus distribution. One candidate is a signal that tracks the average surprise of incoming stimuli, given the internal estimate of the environmental state.
The surprise associated with a single stimulus ${x}_{t}$ is equal to the negative log probability of the stimulus given the estimate ${\hat{\theta}}_{t-1}$:
The average surprise of incoming stimuli, obtained by averaging over the true stimulus distribution $p\left({x}_{t}|{\theta}_{t}\right)$, is equal to the cross-entropy between the true and estimated stimulus distributions:
where the second term in Equation 39 is the Kullback-Leibler divergence of the estimated stimulus distribution from the true stimulus distribution.
The cross-entropy is equal to the entropy of the true stimulus distribution when the observer’s estimate is accurate (i.e., when ${\hat{\theta}}_{t-1}={\theta}_{t}$), and increases with the inaccuracy of the observer’s estimate. To construct a signal that deviates from zero (rather than from the entropy of the stimulus distribution) whenever the observer’s estimate is inaccurate, we subtract the estimated entropy $H\left({x}_{t};{\hat{\theta}}_{t-1}\right)$ from the cross-entropy to define the ‘misalignment signal’:
The magnitude of this signal is large whenever the average surprise of incoming stimuli differs from the estimated surprise, and monotonically increases as a function of the difference between the true and estimated states of the environment. In the case of a Gaussian distribution, the misalignment signal reduces to:
where ${\mu}_{t}$ and ${\sigma}_{t}$ are the mean and standard deviation of the true stimulus distribution, respectively, and ${\hat{\mu}}_{t-1}$ and ${\hat{\sigma}}_{t-1}$ are the estimated values of the same parameters. The analytical values of this misalignment signal are plotted in Figure 2I–J.
In practice, the encoder does not have access to the parameters of the true stimulus distribution, and must therefore estimate the misalignment signal directly from incoming stimulus samples. This is discussed in more detail in Section C.3.3.
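For reference, the Gaussian form of the misalignment signal can be worked out directly from the definitions above (cross-entropy minus the entropy of the estimated distribution). The derivation below is a sketch that assumes natural logarithms and uses the symbols defined in this section; its layout may differ from the equation as originally typeset.

```latex
\begin{aligned}
\text{cross-entropy:}\quad
  & -\int dx_t\, p(x_t|\theta_t)\,\log p(x_t|\hat{\theta}_{t-1})
    = \tfrac{1}{2}\log\!\left(2\pi\hat{\sigma}_{t-1}^{2}\right)
    + \frac{\sigma_t^{2} + \left(\mu_t - \hat{\mu}_{t-1}\right)^{2}}{2\hat{\sigma}_{t-1}^{2}} \\
\text{estimated entropy:}\quad
  & H\!\left(x_t;\hat{\theta}_{t-1}\right)
    = \tfrac{1}{2}\log\!\left(2\pi e\,\hat{\sigma}_{t-1}^{2}\right)
    = \tfrac{1}{2}\log\!\left(2\pi\hat{\sigma}_{t-1}^{2}\right) + \tfrac{1}{2} \\
\text{difference:}\quad
  & M_t = \frac{\sigma_t^{2} + \left(\mu_t - \hat{\mu}_{t-1}\right)^{2}}{2\hat{\sigma}_{t-1}^{2}} - \frac{1}{2}
\end{aligned}
```

Under these assumptions the signal vanishes when the estimated mean and standard deviation match the true values, is positive when the stimuli are more surprising than expected, and is negative when the estimated variance exceeds the true variance, which is why the encoder thresholds its magnitude.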
C. Numerical simulations
C.1. Environment parameters
All results were generated using a probe environment in which the state ${\theta}_{t}$ switched between two fixed values, ${\theta}^{L}$ and ${\theta}^{H}$, every 100 time samples (corresponding to a hazard rate of $h=0.01$). A single cycle of this probe environment consists of 200 time samples, for which the environment is in the low state $({\theta}_{t}={\theta}^{L})$ for the first 100 time samples and in the high state $\left({\theta}_{t}={\theta}^{H}\right)$ for the second 100 time samples. In the main text, we averaged results over multiple cycles of the probe environment.
For the mean-switching environment, the state ${\theta}_{t}$ parametrized the mean of the stimulus distribution and switched between $\mu ={\theta}^{L}=-1$ and $\mu ={\theta}^{H}=1$. The standard deviation was fixed to $\sigma =1$. For the variance-switching environment, ${\theta}_{t}$ parametrized the standard deviation of the stimulus distribution and switched between $\sigma ={\theta}^{L}=1$ and $\sigma ={\theta}^{H}=2$. The mean was fixed to $\mu =0$.
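The probe environment is straightforward to generate. The following sketch uses hypothetical helper names (`probe_environment`, `draw_stimuli`) and the variance-switching values stated above; it illustrates the construction and is not the authors' simulation code.

```python
import numpy as np

def probe_environment(n_cycles, theta_low, theta_high, half_period=100):
    """State sequence: half_period samples in the low state, then half_period in the high state."""
    cycle = np.concatenate([np.full(half_period, theta_low),
                            np.full(half_period, theta_high)])
    return np.tile(cycle, n_cycles)

def draw_stimuli(theta, kind, rng):
    """Draw one Gaussian stimulus per time step.

    kind='mean':     theta sets the mean, sigma is fixed to 1
    kind='variance': theta sets the standard deviation, mu is fixed to 0
    """
    if kind == "mean":
        return rng.normal(loc=theta, scale=1.0)
    if kind == "variance":
        return rng.normal(loc=0.0, scale=theta)
    raise ValueError("kind must be 'mean' or 'variance'")

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
theta = probe_environment(n_cycles=3, theta_low=1.0, theta_high=2.0)
x = draw_stimuli(theta, "variance", rng)
print(theta.shape, x.shape)
```

Averaging results over cycles then amounts to reshaping any per-timestep quantity to `(n_cycles, 200)` and taking the mean along the first axis.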
C.2. Updating the posterior
On each timestep, a single stimulus ${x}_{t}$ was drawn randomly from $p\left({x}_{t}|{\theta}_{t}\right)$. The stimulus was encoded, decoded, and used to update the posterior ${P}_{t}^{L}$. Updating the posterior requires marginalizing over the decoding distribution $p\left({x}_{t}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$ (as given by Equation 11). We approximated this marginalization numerically via Monte Carlo simulation. At each time step, we generated 200 samples from the decoding distribution specified by each encoding scheme (for reference, the decoding distributions are given in Equations 26, 32, and 36). Individual samples were then used to compute separate estimates of the posterior, and the resulting set of estimates was averaged over samples. Results were robust to the number of samples used, provided that this number exceeded 50. In the case of encoding via discretization, we found that results were not sensitive to the inclusion of this marginalization step. We therefore computed all results for the discretization encoding scheme in the absence of marginalization by using the neural response ${y}_{t}$ to directly update the posterior. This posterior forms the basis of all estimates ${\hat{\theta}}_{t}$ and predictions ${\overrightarrow{\theta}}_{t+1}$.
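The sample-based marginalization can be sketched as follows. As an example, the decoding distribution is taken to be the estimated stimulus distribution, which is the case described for the null symbol in Section B.6.2; the helper names and the particular prior value are assumptions made for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def posterior_given_x(prior_low, x, low, high):
    """Posterior probability of the low state for a known stimulus value x."""
    num = prior_low * gaussian_pdf(x, *low)
    return num / (num + (1.0 - prior_low) * gaussian_pdf(x, *high))

def marginalized_posterior(prior_low, decoded_samples, low, high):
    """Average the per-sample posteriors over samples from the decoding distribution."""
    return np.mean(posterior_given_x(prior_low, decoded_samples, low, high))

rng = np.random.default_rng(1)              # fixed seed, chosen arbitrarily
low, high = (0.0, 1.0), (0.0, 2.0)          # variance-switching: (mu, sigma) per state
prior_low = 0.6                             # belief after applying the dynamics (illustrative)
sigma_hat = prior_low * 1.0 + (1.0 - prior_low) * 2.0  # point estimate of sigma

# Null symbol received: sample 200 stimuli from the estimated stimulus distribution
samples = rng.normal(0.0, sigma_hat, size=200)
p_low = marginalized_posterior(prior_low, samples, low, high)
print(p_low)
```

When the decoding distribution is a delta function (for example, when the stimulus is transmitted exactly), all samples coincide and the average reduces to the ordinary Bayesian update for that stimulus value.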
C.3. Optimizing the encoding
For two of the three encoding schemes (discretization and temporal filtering), the estimate ${\hat{\theta}}_{t-1}$ was used to optimize a set of encoding parameters (the set of neural response levels $\left\{{y}_{t}^{i}\right\}$ in the case of discretization, and the filtering coefficient ${\alpha}_{t}$ in the case of temporal filtering). To perform these optimizations, we discretized the posterior ${P}_{t}^{L}$ into 100 values equally spaced between 0 and 1. This resulted in a set of 100 discretized values of the estimated state ${\hat{\theta}}_{bin}$. We found the optimal encoding parameters for each value of ${\hat{\theta}}_{bin}$ (described in detail in the following sections); this resulted in 100 sets of optimal response levels (given a fixed number of levels), and 100 values of the filtering coefficient $\alpha $ (given a fixed constraint strength $\beta $). On each timestep of the simulation, the true estimate ${\hat{\theta}}_{t}$ was mapped onto the closest discretized value ${\hat{\theta}}_{bin}$. The corresponding encoding parameters were then used to encode the incoming stimulus ${x}_{t}$. Additional details of each optimization procedure are described in the following sections.
C.3.1. Limited neural response levels: encoding via discretization
Response levels were chosen to optimize the following objective function:
The optimal set of response levels ${\left\{{y}_{t}^{i}\right\}}^{\ast}$ was found numerically using Lloyd’s algorithm (Cover and Thomas, 2012) (often referred to as K-means clustering). The algorithm takes the following as inputs: a set of points to be clustered $\left\{x\right\}$ (corresponding to stimulus samples), a number of quantization levels $N$ (corresponding to the number of neural response levels), and a distortion measure $d(x,y)$ (corresponding to the error function $E(x,y)$). The goal of the algorithm is to find a quantization (what we referred to as a discretization of the stimulus space) that minimizes the average value of the distortion.
The values of the quantization levels, ${y}^{1},\dots ,{y}^{N}$, are first randomly initialized. The algorithm then proceeds in two steps:
Each point $x$ is assigned to a quantization level ${y}^{i}$ that yields the smallest distortion $d(x,{y}^{i})$.
Each quantization level is replaced by the average value of the points assigned to it.
The two steps are iterated until convergence.
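The two steps above can be sketched for the one-dimensional case with the squared-error distortion $d(x,y)=(x-y)^{2}$. The function name `lloyd_1d`, the initialization from randomly chosen data points, and the stopping rule are assumptions of this sketch; the inference-based distortion would additionally require computing the estimates ${\hat{\theta}}_{x}$ and ${\hat{\theta}}_{y}$ inside the assignment step.

```python
import numpy as np

def lloyd_1d(points, n_levels, n_iter=100, seed=0):
    """Lloyd's algorithm in 1-D with squared-error distortion.

    Returns the sorted quantization levels.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initialization: pick n_levels distinct data points
    levels = rng.choice(points, size=n_levels, replace=False)
    for _ in range(n_iter):
        # Step 1: assign each point to the level with the smallest distortion
        assign = np.argmin((points[:, None] - levels[None, :]) ** 2, axis=1)
        # Step 2: replace each level by the mean of the points assigned to it
        new_levels = levels.copy()
        for i in range(n_levels):
            members = points[assign == i]
            if members.size > 0:  # leave empty levels where they are
                new_levels[i] = members.mean()
        if np.allclose(new_levels, levels):
            break  # levels stopped moving
        levels = new_levels
    return np.sort(levels)

# Usage: quantize samples from a unit Gaussian into 4 response levels
rng = np.random.default_rng(0)
print(lloyd_1d(rng.normal(0.0, 1.0, size=5000), n_levels=4))
```

The bin boundaries follow from the final assignment: with squared error, each boundary lies midway between adjacent levels.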
We computed a set of optimal quantization levels (optimal response levels) for each of the 100 discretized values of ${\hat{\theta}}_{bin}$ (described above). For each value of ${\hat{\theta}}_{bin}$, we generated a training dataset $\left\{x\right\}$ consisting of 50,000 values drawn from the estimated stimulus distribution $p\left({x}_{t}|{\hat{\theta}}_{bin}\right)$. We determined the boundaries of each quantization level (i.e., the values ${y}^{i,L}$ and ${y}^{i,H}$ that bounded the set of stimuli that were mapped to the same quantization level) by assigning points in the training dataset to the quantization level ${y}^{i}$ that minimized $d(x,{y}^{i})$.
To compute optimal quantization levels for stimulus reconstruction, we used the standard distortion measure $d(x,y)={(x-y)}^{2}$; in this case, the algorithm is guaranteed to converge to a local optimum. To compute optimal quantization levels for inference, we defined the distortion measure to be $d\left(x,y\right)={\left({\hat{\theta}}_{x}-{\hat{\theta}}_{y}\right)}^{2}$. The algorithm is not guaranteed to converge to an optimum in this case, but we found empirically that the algorithm converged to a local optimum (Figure 3—figure supplement 1). Moreover, the two distortion measures did not produce equivalent results.
C.3.2. Limited gain and temporal acuity: encoding via temporal filtering
The optimal filtering coefficient was chosen to minimize the following objective function:
where, as before, we choose $E\left(x,y\right)={\left({\hat{\theta}}_{x}-{\hat{\theta}}_{y}\right)}^{2}$ when optimizing for inference, and $E(x,y)={(x-y)}^{2}$ when optimizing for reconstruction.
The joint entropy $H\left({y}_{t},{y}_{t+1}|{\hat{\theta}}_{\tau <t}\right)$ can be determined analytically, as derived in Section B.5.3. We approximated the error term, ${\langle E\left({x}_{t},{y}_{t}\right)\rangle}_{p\left({x}_{t}|{\hat{\theta}}_{t-1}\right)}$, numerically. To do so, we first discretized $\alpha $ into 50 values evenly spaced between 0 and 1 (corresponding to 50 discrete values of ${\alpha}_{bin}$). As described above, we also discretized the posterior ${P}_{t}^{L}$ into 100 values (corresponding to 100 discrete values of ${\hat{\theta}}_{bin}$). For each combination of ${\alpha}_{bin}$ and ${\hat{\theta}}_{bin}$, we generated 50,000 pairs of stimulus samples $({x}_{t-1},{x}_{t})$ from the distribution $p\left({x}_{t}|{\hat{\theta}}_{t-1}\right)$. Each sample was used to compute values of the estimates ${\hat{\theta}}_{x}$ and ${\hat{\theta}}_{y}$. The errors ${\left({\hat{\theta}}_{x}-{\hat{\theta}}_{y}\right)}^{2}$ and ${({x}_{t}-{y}_{t})}^{2}$ were then averaged over all 50,000 stimulus pairs.
The optimal value ${\alpha}_{t}^{\ast}$ was then chosen as the value of ${\alpha}_{bin}$ that minimized the objective in Equation 44 for a given choice of the error function $E(x,y)$ and constraint strength $\beta $.
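The grid search can be illustrated with a simplified sketch for the reconstruction error only. Several ingredients below are assumptions rather than statements from the source: the filter is taken to be $y_t=\alpha x_t+(1-\alpha)x_{t-1}$, the objective is taken to be the mean error plus $\beta$ times the joint entropy, stimuli are treated as independent Gaussian draws with a single estimated mean and standard deviation, and the joint entropy is computed from the covariance of $(y_t,y_{t+1})$ under those assumptions rather than from the expression in Appendix 2.

```python
import numpy as np

def joint_entropy(alpha, sigma):
    """Entropy of (y_t, y_{t+1}) for i.i.d. Gaussian stimuli (assumed filter form)."""
    a, b = alpha, 1.0 - alpha
    var = sigma**2 * (a**2 + b**2)   # Var(y_t) = Var(y_{t+1})
    cov = sigma**2 * a * b           # Cov(y_t, y_{t+1}) through the shared stimulus x_t
    det = var**2 - cov**2
    return np.log(2.0 * np.pi * np.e) + 0.5 * np.log(det)

def reconstruction_error(alpha, mu, sigma, rng, n_pairs=5000):
    """Monte Carlo estimate of <(x_t - y_t)^2> over stimulus pairs."""
    x_prev, x_curr = rng.normal(mu, sigma, size=(2, n_pairs))
    y = alpha * x_curr + (1.0 - alpha) * x_prev
    return np.mean((x_curr - y) ** 2)

def optimal_alpha(beta, mu=0.0, sigma=1.0, n_bins=50, seed=0):
    """Pick the grid value of alpha minimizing error + beta * entropy (assumed objective)."""
    rng = np.random.default_rng(seed)
    alphas = np.linspace(0.0, 1.0, n_bins)
    objective = [reconstruction_error(a, mu, sigma, rng) + beta * joint_entropy(a, sigma)
                 for a in alphas]
    return alphas[int(np.argmin(objective))]

print(optimal_alpha(beta=0.0), optimal_alpha(beta=100.0))
```

Under these assumptions, an unconstrained encoder ($\beta=0$) passes the current stimulus through unchanged, whereas a strongly constrained encoder averages the current and previous stimuli almost equally, which minimizes the joint entropy of successive responses.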
C.3.3. Limited neural activity: encoding via dynamic stimulus selection
The misalignment signal, derived in Section B.6.3, was defined in terms of the relative alignment between the true stimulus distribution, $p\left({x}_{t}|{\theta}_{t}\right)$, and the estimated stimulus distribution, $p\left({x}_{t}|{\hat{\theta}}_{t-1}\right)$. When the parameters of the true stimulus distribution are known, the value of this signal can be computed analytically via Equation 40. However, when the system does not have access to the stimulus distribution (as is the case here), this signal must be estimated directly from incoming stimulus samples. We consider a scenario in which the encoder can approximate Equation 40 by computing a running average of the stimulus surprise:
where $T$ specifies the number of timebins used to estimate the average surprise. All results in the main text were generated using $T=10$ timebins.
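A running average of surprise over the most recent $T$ time bins can be sketched as below. This is an illustration only: it assumes a Gaussian estimated stimulus distribution with mean `mu_hat` and variance `sigma2_hat` (hypothetical parameter names), and it averages over fewer than $T$ bins at the start of the sequence. The exact expression used in the study is the one given by the equation referenced above.

```python
import numpy as np

def gaussian_surprise(x, mu_hat, sigma2_hat):
    """Negative log probability of x under N(mu_hat, sigma2_hat)."""
    return 0.5 * np.log(2 * np.pi * sigma2_hat) + (x - mu_hat)**2 / (2 * sigma2_hat)

def running_average_surprise(x, mu_hat, sigma2_hat, T=10):
    """Mean surprise over the most recent T time bins (fewer at the start)."""
    s = gaussian_surprise(np.asarray(x, dtype=float), mu_hat, sigma2_hat)
    out = np.empty_like(s)
    for t in range(len(s)):
        out[t] = s[max(0, t - T + 1): t + 1].mean()
    return out
```

A larger $T$ gives a smoother but slower estimate of the misalignment between the true and estimated distributions.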
C.4. The role of surprise and uncertainty
Figure 1B–D illustrated the relative impact of different stimuli on the observer’s estimate of an environmental state $\theta $, which is modulated by the observer’s uncertainty and the surprise of incoming stimuli (for numerical values of the color ranges in Figure 1B–D, see Figure 1—figure supplement 2).
To illustrate this, we considered optimal Bayesian estimation of the location $\mu $ and scale ${\alpha}^{2}/2$ of a generalized Gaussian stimulus distribution:
Our derivation is analogous to that outlined in Murphy (2007) for estimating the mean of a Gaussian stimulus distribution.
We consider a snapshot of the inference process, when the observer’s prior is centered around a fixed estimate of the location ($\hat{\mu}=0$) or scale (${\hat{\alpha}}^{2}/2=1$). When estimating location, we fix the scale parameter to be $\alpha =\sqrt{2}$ (corresponding to a Gaussian distribution with variance ${\sigma}^{2}={\alpha}^{2}/2=1$ when $\beta =2$). When estimating scale, we fix the location parameter to be $\mu =0$. In both cases, we consider three different values of the shape parameter: $\beta =1,2,10$.
The surprise of a single stimulus observation is quantified by the negative log probability of the stimulus value given the observer’s estimate. We consider 100 evenly spaced values of surprise between 1 and 10. For each value of surprise, we compute the stimulus value ${x}_{t}^{\ast}$ that elicits that surprise.
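The mapping from a surprise value to a stimulus value can be sketched as follows. The sketch assumes the standard parameterization of the generalized Gaussian density, $p(x)=\frac{\beta }{2\alpha \mathrm{\Gamma}(1/\beta )}\mathrm{exp}\left(-{\left(|x-\mu |/\alpha \right)}^{\beta }\right)$, which is consistent with the variance ${\alpha}^{2}/2$ quoted above for $\beta =2$; the function names are hypothetical. Values of surprise below the minimum attainable surprise (the surprise at $x=\mu $) have no solution and are returned as NaN.

```python
import math

def log_norm(alpha, beta):
    """Log of the normalizing constant beta / (2 * alpha * Gamma(1/beta))."""
    return math.log(beta) - math.log(2 * alpha) - math.lgamma(1 / beta)

def surprise(x, mu, alpha, beta):
    """Negative log density of the generalized Gaussian at x."""
    return -log_norm(alpha, beta) + (abs(x - mu) / alpha) ** beta

def stimulus_for_surprise(s, mu, alpha, beta):
    """Stimulus x* >= mu whose surprise equals s; NaN if s is below the minimum."""
    excess = s + log_norm(alpha, beta)   # surprise in excess of its value at x = mu
    if excess < 0:
        return float("nan")
    return mu + alpha * excess ** (1 / beta)
```

Because the density is symmetric about $\mu $, the mirror image $2\mu -{x}_{t}^{\ast}$ elicits the same surprise.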
The observer’s uncertainty is captured by the entropy of the prior distribution. When estimating the location parameter, the natural conjugate prior is the Gaussian distribution $\mathcal{N}\left(\mu ;{\mu}_{0},{\sigma}_{0}^{2}\right)$ with mean ${\mu}_{0}=\hat{\mu}$ (we take this mean to be the observer’s point estimate of the environmental state before observing a stimulus sample ${x}_{t}^{\ast}$, that is, ${\hat{\theta}}_{t-1}=\hat{\mu}$). The entropy of the prior distribution depends only on its variance: $H=\frac{1}{2}\mathrm{log}\left(2\pi e{\sigma}_{0}^{2}\right)$. We consider 100 evenly spaced values of the entropy between 0 and 0.7. For each value of entropy, we compute the value ${\sigma}_{0}^{2}=\mathrm{exp}\left(2H\right)/2\pi e$ that elicits that entropy.
When estimating the scale parameter, the natural conjugate prior is the inverse gamma distribution $p(\alpha ;{\alpha}_{0},{\beta}_{0})$ with mean $\hat{\alpha}={\beta}_{0}/\left({\alpha}_{0}-1\right)$ (we take ${\hat{\theta}}_{t-1}={\hat{\alpha}}^{2}/2$ to be the observer’s estimate of the environmental state before observing ${x}_{t}^{\ast}$). The entropy of the prior depends on both ${\alpha}_{0}$ and ${\beta}_{0}$: $H={\alpha}_{0}+\mathrm{log}\left({\beta}_{0}\mathrm{\Gamma}\left({\alpha}_{0}\right)\right)-\left(1+{\alpha}_{0}\right)\mathrm{\Psi}\left({\alpha}_{0}\right)$. We fix ${\beta}_{0}=\hat{\alpha}\left({\alpha}_{0}-1\right)$. We note that the entropy is non-monotonic in ${\alpha}_{0}$; we restrict our analysis to values ${\alpha}_{0}>2$ where both the mean and the variance of the prior are well-defined, and the entropy is monotonic. We again consider 100 evenly spaced values of the entropy between 0 and 0.7. For each value of entropy, we compute the value ${\alpha}_{0}$ that elicits that entropy.
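Because the entropy of the inverse gamma prior has no closed-form inverse in ${\alpha}_{0}$, the value of ${\alpha}_{0}$ for a given entropy has to be found numerically. The sketch below uses bisection on ${\alpha}_{0}>2$, where the entropy decreases with ${\alpha}_{0}$ once ${\beta}_{0}$ is tied to ${\alpha}_{0}$ as described above. It is an illustration rather than the procedure used in the study: the function names are hypothetical, and the digamma function is approximated by a central difference of `math.lgamma` to keep the example free of external dependencies.

```python
import math

def digamma(a, h=1e-5):
    """Central-difference approximation to the digamma function."""
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def prior_entropy(alpha0, alpha_hat):
    """Entropy of the inverse gamma prior with beta0 = alpha_hat * (alpha0 - 1)."""
    beta0 = alpha_hat * (alpha0 - 1)
    return (alpha0 + math.log(beta0) + math.lgamma(alpha0)
            - (1 + alpha0) * digamma(alpha0))

def alpha0_for_entropy(H, alpha_hat, lo=2.0, hi=1e3, n_iter=200):
    """Bisection on alpha0 > 2, where the entropy decreases with alpha0."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if prior_entropy(mid, alpha_hat) > H:
            lo = mid   # entropy still too high: move to larger alpha0
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The bracket `[lo, hi]` is an assumption chosen to cover entropies between 0 and 0.7 for $\hat{\alpha}=\sqrt{2}$; other settings may need a different bracket.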
For each combination of prior uncertainty and surprise, we computed the posterior either over the location parameter, or over the scale parameter. We then computed the squared difference between the average value of the prior and the average value of the posterior (${\left({\hat{\mu}}_{t-1}-{\hat{\mu}}_{t}\left({x}_{t}^{\ast}\right)\right)}^{2}$ in the case of location estimation, and ${\left({\hat{\alpha}}_{t-1}^{2}/2-{\hat{\alpha}}_{t}{\left({x}_{t}^{\ast}\right)}^{2}/2\right)}^{2}$ in the case of scale estimation), and we used this squared difference as a measure of the impact of a single stimulus observation on the observer’s estimate of location or scale. When reporting the results in Figure 1B–D, we separately scaled heatmaps for each stimulus distribution (Laplace, Gaussian, and flat) and for each estimated parameter (location and scale); numerical ranges of these heatmaps are given in Figure 1—figure supplement 2.
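For the Gaussian case ($\beta =2$) of location estimation, the impact computation can be sketched with the standard conjugate update for a Gaussian likelihood with known variance. This is a minimal illustration under that assumption; the Laplace and flat cases are not conjugate and are not covered here. The function names are hypothetical.

```python
import numpy as np

def prior_variance_from_entropy(H):
    """Invert H = 0.5 * log(2*pi*e*sigma0^2) for the prior variance."""
    return np.exp(2 * H) / (2 * np.pi * np.e)

def location_impact(x_star, mu0, sigma0_sq, sigma_sq=1.0):
    """Squared shift of the posterior mean after observing x_star."""
    # Conjugate update for a Gaussian likelihood with known variance sigma_sq.
    mu_post = (sigma0_sq * x_star + sigma_sq * mu0) / (sigma0_sq + sigma_sq)
    return (mu0 - mu_post) ** 2

entropies = np.linspace(0.0, 0.7, 100)    # prior uncertainty
surprises = np.linspace(1.0, 10.0, 100)   # stimulus surprise
# Stimulus value eliciting each surprise under N(0, 1), i.e. mu_hat = 0.
x_star = np.sqrt(2 * (surprises - 0.5 * np.log(2 * np.pi)))
# Rows index entropy, columns index surprise.
impact = location_impact(x_star[None, :], 0.0,
                         prior_variance_from_entropy(entropies)[:, None])
```

In this sketch the impact grows with both prior uncertainty and surprise, since the posterior mean shifts by a fraction ${\sigma}_{0}^{2}/({\sigma}_{0}^{2}+{\sigma}^{2})$ of the distance to the stimulus.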
C.5. Generating spike rasters
Figure 6A showed simulated spike rasters for an encoding scheme with limited neural response levels. To generate these rasters, a stimulus sample ${x}_{t}$ was randomly drawn from the true stimulus distribution $p\left({x}_{t}|{\theta}_{t}\right)$. This stimulus was then mapped onto one of $N=4$ neural response levels. Each response level was assigned a binary spike pattern from the set $\left\{[00],[10],[01],[11]\right\}$, where $1$ or $0$ correspond to presence or absence of a spike, respectively. Patterns were assigned to response levels $\left\{{y}_{t}^{i}\right\}$ according to the probability $p\left({y}_{t}^{i}|{\theta}_{t}\right)$ that a particular level would be used to encode incoming stimuli. In this way, the pattern with the fewest spikes ($\left[00\right]$) was assigned to the response level with the highest probability, and the pattern with the most spikes ($\left[11\right]$) was assigned to the level with the lowest probability. This strategy (called ‘entropy coding’) achieves the shortest average encoding of the input by using the fewest number of spikes (Cover and Thomas, 2012). We simulated spike patterns for 800 cycles of the probe environment using the set of response levels optimized for inference or stimulus reconstruction.
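The assignment of spike patterns to response levels can be sketched as follows. The sketch assumes that the level probabilities are already known, and it breaks the tie between the two single-spike patterns arbitrarily (in list order); the function names are hypothetical.

```python
# Candidate two-bin spike patterns, ordered from fewest to most spikes.
PATTERNS = [(0, 0), (1, 0), (0, 1), (1, 1)]

def assign_patterns(level_probs):
    """Map each response level to a pattern: likelier levels get fewer spikes."""
    # Level indices sorted from most to least probable (Python's sort is stable).
    order = sorted(range(len(level_probs)), key=lambda i: -level_probs[i])
    return {level: PATTERNS[rank] for rank, level in enumerate(order)}

def raster(levels, assignment):
    """Concatenate the spike patterns of a sequence of response levels."""
    return [bit for lvl in levels for bit in assignment[lvl]]
```

For example, with level probabilities `[0.1, 0.5, 0.15, 0.25]`, level 1 is assigned the silent pattern and level 0 the two-spike pattern, so the most frequently used level costs no spikes.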
C.6. Computing metamer probabilities
We estimated the probability of a metamer as a function of the alignment between the true state of the environment $\theta $ and the observer’s prediction $\overrightarrow{\theta}$. We say that two stimuli ${x}_{t}^{1}$ and ${x}_{t}^{2}$ are metamers (i.e., they are indistinguishable to the observer) if in the process of encoding they become mapped onto the same neural response level ${y}^{M}$ (i.e., ${y}_{t}^{1}={y}_{t}^{2}={y}^{M}$). The probability of a metamer, $p\left({y}_{t}^{1}={y}_{t}^{2}|{\theta}_{t},{\hat{\theta}}_{t-1}\right)$, depends on both the true and predicted states of the environment. We numerically estimated this probability for a mean-switching environment in the low state ($\theta ={\theta}^{L}$). We generated 100 values of ${\hat{\theta}}_{t-1}$, evenly spaced between ${\theta}^{L}$ and ${\theta}^{H}$. For each value of ${\hat{\theta}}_{t-1}$, we drew 100,000 pairs of samples from the stimulus distribution $p\left({x}_{t}|{\theta}_{t}={\theta}^{L}\right)$. We encoded each stimulus by mapping it onto the corresponding response level ${y}_{t}$ (using an encoder with eight response levels, optimized as described in Section C.3.1). If both stimuli in the pair were mapped onto the same response level, we counted the trial as a metamer. The total probability of a metamer was computed as the proportion of all trials that resulted in metamers.
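The Monte Carlo counting procedure can be sketched as below. The optimized eight-level encoder of Section C.3.1 is not reproduced here; `make_uniform_encoder` is a hypothetical stand-in that uses equal-width bins centered on the prediction, so the sketch illustrates only how metamers are counted and should not be expected to reproduce the reported dependence on alignment. A Gaussian stimulus distribution with mean `theta_true` is also an assumption.

```python
import numpy as np

def metamer_probability(encode, theta_true, sigma=1.0, n_pairs=100_000, seed=0):
    """Fraction of stimulus pairs that `encode` maps onto the same response level."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.normal(theta_true, sigma, size=(2, n_pairs))
    return float(np.mean(encode(x1) == encode(x2)))

def make_uniform_encoder(theta_pred, n_levels=8, width=4.0):
    """Hypothetical stand-in encoder: equal-width bins centered on the prediction."""
    edges = theta_pred + np.linspace(-width / 2, width / 2, n_levels - 1)
    return lambda x: np.digitize(x, edges)   # level index in 0..n_levels-1
```

Any callable that maps an array of stimuli to an array of response levels can be passed as `encode`, including an encoder optimized for inference or for reconstruction.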
C.7. The role of transmission noise
To better understand the influence of noise on the inference process, we analyzed the behavior of the discretization encoding scheme in the presence of noise. Gaussian noise with variance ${\sigma}_{n}^{2}$ was added to the response ${y}_{t}$ of the encoder prior to computing the estimate ${\hat{\theta}}_{t}$ (Figure 3—figure supplement 2A–B). This form of noise can be viewed as neuronal noise introduced in the transmission of the stimulus representation to downstream areas. The performance of the optimal observer (Figure 3—figure supplement 2C) was relatively robust at low noise levels (up to ${\sigma}_{n}^{2}=0.4$), but decreased substantially at high noise levels. A more thorough investigation of the role of noise on optimal inference and encoding strategies is a subject of future work.
C.8. Measuring speed and accuracy of inference
Figure 6D–E compared the accuracy and speed of inference across different encoding schemes and environments. Accuracy was computed separately for the high and low states ($\theta ={\theta}^{H}$ and $\theta ={\theta}^{L}$, respectively) using the posterior ${P}_{t}^{L}$. For each time point, we first computed the average value of ${P}_{t}^{L}$ across many cycles of the probe environment (500 cycles for discretization, and 800 cycles for filtering and thresholding, corresponding to the average trajectories of $\hat{\theta}$ shown in Figures 3–5).
If the observer’s estimate is accurate, ${P}_{t}^{L}$ should be close to one when the environment is in the low state, and $(1-{P}_{t}^{L})$ should be close to one when the environment is in the high state. We therefore computed the time-averaged values ${\langle {P}_{t}^{L}\rangle}_{t}$ and ${\langle \left(1-{P}_{t}^{L}\right)\rangle}_{t}$ to measure the accuracy in the low and high states, respectively. Time averages were computed over the final 10 timesteps in the high or low state, respectively, corresponding to the adapted portion of the inference process.
Speed was computed separately for downward versus upward switches in the environment by measuring the number of time samples required for the posterior to stabilize after a switch. We used the time-averaged value ${\langle {P}_{t}^{L}\rangle}_{t}$ (again averaged over the final 10 timesteps) as a measure of the final value of the posterior in the low state. We then counted the number of timesteps after a downward switch before the posterior came within 0.05 of this value, and we used the inverse of this time as a measure of speed. We computed the speed of response to an upward switch in an analogous manner.
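The two measures can be sketched as follows, given the cycle-averaged trajectory of ${P}_{t}^{L}$ within one state. The sketch assumes that the first element of the trajectory is the first timestep after the switch (an indexing convention not specified above); the function names are hypothetical.

```python
import numpy as np

def accuracy(p_low, state="low", n_final=10):
    """Time-averaged posterior mass on the true state over the final timesteps."""
    tail = np.asarray(p_low, dtype=float)[-n_final:]
    return float(np.mean(tail if state == "low" else 1.0 - tail))

def speed_after_downward_switch(p_low, tol=0.05, n_final=10):
    """Inverse of the number of timesteps until P_t^L is within tol of its final value."""
    p = np.asarray(p_low, dtype=float)
    final = p[-n_final:].mean()
    within = np.flatnonzero(np.abs(p - final) <= tol)
    n_steps = within[0] + 1   # p[0] is taken as the first timestep after the switch
    return 1.0 / n_steps
```

The speed after an upward switch follows by applying the same function to $1-{P}_{t}^{L}$ over the high state.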
C.9. Natural image simulation
Figure 7 illustrated a model visual inference task performed on natural images. Within this model, the encoder implemented a sparse encoding of individual image patches (Olshausen and Field, 1996) using 32 basis functions ${\varphi}_{i}$. The basis functions were chosen to minimize the following cost function:
where ${\overrightarrow{x}}_{t}$ is an image patch, ${\overrightarrow{y}}_{t}$ is the neural population response, $n$ indexes pixels in an image patch, and $i$ indexes neurons in the population. The first term imposes a cost on reconstruction error between the image patch ${\overrightarrow{x}}_{t}$ and the reconstructed patch ${\hat{x}}_{t}=\sum {y}_{i,t}{\varphi}_{i}$. The second term imposes a penalty for large population responses. The parameter $\lambda $ imposes a constraint on the fidelity of the encoding by controlling the overall sparsity of the population response.
The set of basis functions was trained on 50,000 image patches of size $16\times 16$ pixels. Image patches were drawn randomly from the van Hateren database (van Hateren and Ruderman, 1998). During training, the value of the sparsity parameter $\lambda $ was set to 0.1.
A sparse representation ${\overrightarrow{y}}_{t}$ was inferred for each image patch ${\overrightarrow{x}}_{t}$ via gradient descent on the cost function in Equation 47 (Olshausen and Field, 1996). An image reconstruction ${\hat{x}}_{t}$ was computed from the sparse representation (Figure 7—figure supplement 1A). The reconstructed patch was contrast normalized by dividing each pixel value by the standard deviation across the set of pixel values. The normalized image patch was projected onto four curvature filters ${C}_{j}$, resulting in four curvature coefficients ${v}_{j,t}$. Curvature filters were hand-designed to bear coarse, qualitative resemblance to curvature-selective receptive fields in V2. The set of four curvature coefficients was used to update the posterior distribution over variance, analogous to the Bayesian estimation of variance described in Section C.4.
Image areas 1 (low curvature) and 2 (high curvature) in Figure 7 were chosen to be $200\times 200$ pixels in size. For illustrative purposes, they were selected to generate a relatively large difference in the variance of curvature filters, which would require a substantial update of the Bayesian estimate. During all simulations, the mean of the prior (corresponding to the observer’s point estimate ${\hat{\theta}}_{t-1}$) was fixed to 5.3, equal to the variance of filter outputs in image area 1.
To numerically compute the impact of a stimulus on the estimate as a function of the observer’s uncertainty (prior variance) and centered surprise (Figure 7D), a set of 5,000 image patches was drawn randomly from image area 2. Image patches were then sorted according to their centered surprise and divided into 5 groups that uniformly spanned the range of centered surprise in the set. The variance of the prior was chosen to be one of 5 equally spaced values between 0.018 and 0.18. For each value of prior variance and for each group of stimuli with a given centered surprise, we computed the change in the observer’s estimate before and after incorporating the population response ${\overrightarrow{y}}_{t}$: ${\left({\hat{\theta}}_{t-1}-{\hat{\theta}}_{t}\left({\overrightarrow{y}}_{t}\right)\right)}^{2}$.
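The inference of a sparse representation and the subsequent contrast normalization can be sketched as below. Two assumptions are made loudly here: the cost is taken to be $\frac{1}{2}{\Vert \overrightarrow{x}-\mathrm{\Phi}\overrightarrow{y}\Vert }^{2}+\lambda {\Vert \overrightarrow{y}\Vert }_{1}$ (the exact penalty in Equation 47 may differ), and the minimization uses a proximal-gradient (ISTA) step rather than plain gradient descent, because the L1 penalty is not differentiable at zero. The function names and the matrix `Phi` (columns are basis functions) are hypothetical.

```python
import numpy as np

def infer_sparse_code(x, Phi, lam, n_steps=200):
    """Minimize 0.5*||x - Phi @ y||^2 + lam*||y||_1 over y by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant of the gradient
    y = np.zeros(Phi.shape[1])
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ y - x)           # gradient of the reconstruction term
        z = y - step * grad
        y = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return y

def contrast_normalize(patch):
    """Divide each pixel by the standard deviation across the patch."""
    return patch / np.std(patch)
```

A reconstruction is then `Phi @ y`, which would be contrast normalized before projection onto the curvature filters. Larger values of `lam` drive more coefficients to zero, giving a sparser and lower-fidelity code.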
We used a similar approach to numerically compute the inference error as a function of the sparsity parameter $\lambda$ and the centered surprise (Figure 7F). We chose 5 equally spaced values of $\lambda$ between 0.1 and 10. We then randomly drew 5,000 image patches from image area 2. Image patches were again sorted according to their centered surprise and were divided into 5 groups that uniformly spanned the range of centered surprise in the set. We then computed the average inference error for each value of $\lambda$ and each stimulus group. An analogous procedure was used to determine the inference error as a function of the sparsity parameter $\lambda$ and the observer’s uncertainty (Figure 7E).
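The grouping step can be sketched as follows. "Uniformly spanned the range" is interpreted here as equal-width bins between the smallest and largest centered surprise in the set, which is an assumption; the function names are hypothetical. Groups that happen to contain no patches would yield NaN averages.

```python
import numpy as np

def group_by_range(values, n_groups=5):
    """Assign each value to one of n_groups equal-width bins spanning its range."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_groups + 1)
    # Use interior edges only, so that the maximum falls in the last group.
    return np.digitize(values, edges[1:-1])

def mean_per_group(groups, errors, n_groups=5):
    """Average a per-patch quantity (e.g. inference error) within each group."""
    errors = np.asarray(errors, dtype=float)
    return np.array([errors[groups == g].mean() for g in range(n_groups)])
```

Repeating `mean_per_group` for each value of $\lambda$ gives a coarse 5-by-5 grid of average errors, which is why the resulting maps change abruptly between neighboring cells.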
The abrupt changes in impact and inference error that can be seen in Figure 7D–F are a result of the coarse partitioning of the set of image patches into a small number of groups. In comparison, the results in Figure 1B–D were computed analytically with continuous values of surprise and uncertainty, and therefore show smooth variations in impact and error.
Simulated population responses (Figure 7E–F) were generated by selecting a random subset of 45 image patches with a given centered surprise, or specified values of uncertainty. Image patches were then encoded with a sparsity value of either $\lambda =0.1$ or $\lambda =10$ (corresponding to the inference errors marked with red and white circles). 40 image patches were encoded with the higher value of $\lambda$, and 5 image patches were encoded with the lower value of $\lambda$. For illustrative purposes, the image patches were arranged such that the first and last 20 patches corresponded to high values of $\lambda$ (white), while the middle 5 patches corresponded to low values of $\lambda$ (red). High and low values of $\lambda$ were chosen to generate similar average inference error for the given values of centered surprise and uncertainty.
Centered surprise was computed for each image patch ${\overrightarrow{x}}_{t}$ as follows:
where $H\left({v}_{j,t}|{\hat{\theta}}_{t-1}\right)=\frac{1}{2}\mathrm{log}\left(2\pi e{\hat{\theta}}_{t-1}^{2}\right)$ is the entropy of the Gaussian distribution of curvature coefficients given the prior estimate ${\hat{\theta}}_{t-1}$.
Appendix 1
Here, we provide a detailed derivation of the decoding distribution for the filtering encoder (described in Section B.5.2).
To simplify Equation 31, we rewrite the first Gaussian as a function of ${\alpha}_{t}{x}_{t}$ (for notational simplicity, we will write ${\sigma}^{\prime 2}={\left(1-{\alpha}_{t}\right)}^{2}{\hat{\sigma}}_{t-1}^{2}$):
We can now pull out the factor of ${\alpha}_{t}$ (again, for notational simplicity, we will write ${\mu}^{\prime}={y}_{t}-\left(1-{\alpha}_{t}\right){\hat{\mu}}_{t-1}$):
where ${\mu}^{\prime \prime}={\mu}^{\prime}/{\alpha}_{t}=\left({y}_{t}-\left(1-{\alpha}_{t}\right){\hat{\mu}}_{t-1}\right)/{\alpha}_{t}$ and ${\sigma}^{\prime \prime 2}={\sigma}^{\prime 2}/{\alpha}_{t}^{2}={\left(1-{\alpha}_{t}\right)}^{2}{\hat{\sigma}}_{t-1}^{2}/{\alpha}_{t}^{2}$. Equation 49 can now be written as a Gaussian over ${x}_{t}$:
This allows us to combine the two distributions in Equation 31:
where:
Because the function $f\left({y}_{t},{\hat{\theta}}_{\tau <t}\right)$ does not depend on ${x}_{t}$, we can trivially obtain $Z\left({y}_{t},{\hat{\theta}}_{\tau <t}\right)$ by integrating over ${x}_{t}$ (as given by Equation 7):
The remaining terms in Equation 52 are given by:
Putting everything together, the final form of Equation 31 becomes:
For $\frac{1}{2}\le {\alpha}_{t}\le 1$, we can see that: $0\le \left(1-{\alpha}_{t}\right)\left(2{\alpha}_{t}-1\right)\le \frac{1}{8}$ and $\frac{1}{2}\le \left(1-2{\alpha}_{t}+2{\alpha}_{t}^{2}\right)\le 1$.
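Two steps of this appendix can be checked numerically: the rescaling of a Gaussian in ${\alpha}_{t}{x}_{t}$ into a Gaussian in ${x}_{t}$ (which introduces a factor $1/{\alpha}_{t}$), and the bounds on the two coefficients stated above. The sketch below is a sanity check under the definitions of ${\mu}^{\prime}$, ${\sigma}^{\prime 2}$, ${\mu}^{\prime \prime}$ and ${\sigma}^{\prime \prime 2}$ given in this appendix, not a reproduction of the full derivation; the function names and test values are hypothetical.

```python
import numpy as np

def normal_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z."""
    return np.exp(-(z - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def rescaled_gaussian_gap(alpha, y, mu_prev, var_prev, x):
    """Max |N(alpha*x; mu', s'^2) - N(x; mu'', s''^2) / alpha| over the points x."""
    var_p = (1 - alpha) ** 2 * var_prev   # sigma'^2
    mu_p = y - (1 - alpha) * mu_prev      # mu'
    lhs = normal_pdf(alpha * x, mu_p, var_p)
    rhs = normal_pdf(x, mu_p / alpha, var_p / alpha ** 2) / alpha
    return float(np.max(np.abs(lhs - rhs)))

def coefficient_ranges(n=1001):
    """Ranges of (1-a)(2a-1) and (1-2a+2a^2) on a grid over [1/2, 1]."""
    a = np.linspace(0.5, 1.0, n)
    first = (1 - a) * (2 * a - 1)
    second = 1 - 2 * a + 2 * a ** 2
    return (first.min(), first.max()), (second.min(), second.max())
```

The first coefficient peaks at ${\alpha}_{t}=3/4$, where it equals $1/8$, and the second is smallest at ${\alpha}_{t}=1/2$. The rescaling check requires ${\alpha}_{t}<1$, since ${\sigma}^{\prime 2}$ vanishes at ${\alpha}_{t}=1$.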
Appendix 2
Here we provide a detailed derivation of the entropy of the output of the filtering encoder (described in Section B.5.3).
To compute $H\left({y}_{t},{y}_{t+1}|{\hat{\theta}}_{\tau <t}\right)$, we assume that the encoder has access to the history of estimates ${\hat{\theta}}_{\tau <t}$, and that it uses the most recent estimate ${\hat{\theta}}_{t-1}$ as an approximate prediction of future states (i.e., ${\hat{\theta}}_{t-1}\approx {\overrightarrow{\theta}}_{t}\approx {\overrightarrow{\theta}}_{t+1}$).
For reference, the entropy of a normal distribution with variance ${\sigma}^{2}$ is $H=\frac{1}{2}\mathrm{log}\left(2\pi e{\sigma}^{2}\right)$.
We want to compute $H\left({y}_{t},{y}_{t+1}|{\hat{\theta}}_{\tau <t}\right)$:
where ${y}_{t}={\alpha}_{t}{x}_{t}+\left(1-{\alpha}_{t}\right){x}_{t-1}$ is the output of the encoder, and ${\alpha}_{t}\in \left[0.5,1\right]$ is the filtering coefficient.
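For intuition, the decomposition into a marginal and a conditional entropy can be sketched numerically. The sketch assumes that successive stimuli are independent Gaussian samples with variance ${\hat{\sigma}}_{t-1}^{2}$ and that the same coefficient ${\alpha}_{t}$ is applied at both timesteps; it uses standard results for jointly Gaussian variables and is not claimed to match the final expression of this appendix term by term. The function names are hypothetical.

```python
import numpy as np

def entropy_normal(var):
    """Entropy of a normal distribution with the given variance."""
    return 0.5 * np.log(2 * np.pi * np.e * var)

def filtered_output_entropy(alpha, var_prev):
    """H(y_t) + H(y_{t+1} | y_t) for y_t = alpha*x_t + (1-alpha)*x_{t-1}, x i.i.d. Gaussian."""
    var_y = (alpha ** 2 + (1 - alpha) ** 2) * var_prev   # Var[y_t] = Var[y_{t+1}]
    cov = alpha * (1 - alpha) * var_prev                 # both outputs share x_t
    var_cond = var_y - cov ** 2 / var_y                  # Var[y_{t+1} | y_t]
    return entropy_normal(var_y) + entropy_normal(var_cond)
```

At ${\alpha}_{t}=1$ the two outputs are independent and the joint entropy is twice the entropy of a single stimulus; smaller values of ${\alpha}_{t}$ introduce correlations between successive outputs and lower the joint entropy.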
To compute each of the terms in Equation 58, we need to compute $p\left({y}_{t}|{\hat{\theta}}_{\tau <t}\right)$ and $p\left({y}_{t+1}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$. The first of these distributions is given by:
whose entropy is given by:
The second of these distributions can be written as:
Noting that $p\left({y}_{t+1}|{x}_{t+1},{x}_{t},{\hat{\theta}}_{\tau <t}\right)=\delta \left({y}_{t+1}-\left({\alpha}_{t}{x}_{t+1}+\left(1-{\alpha}_{t}\right){x}_{t}\right)\right)$, the first term in the integral in Equation 61 is given by:
The second term in the integral in Equation 61 is given by:
Combining the two terms, we have:
where
Putting these terms back into the integral in Equation 61 gives:
The conditional entropy $H\left({y}_{t+1}|{y}_{t},{\hat{\theta}}_{\tau <t}\right)$ is determined by the variance in this distribution:
Combining the two entropy terms in Equations 60 and 67, we get:
Data availability
All data generated or analysed during this study are included in the manuscript and supporting files.
References

An energy budget for signaling in the grey matter of the brain. Journal of Cerebral Blood Flow & Metabolism 21:1133–1145. https://doi.org/10.1097/00004647-200110000-00001

A test of metabolically efficient coding in the retina. Network: Computation in Neural Systems 13:531–552. https://doi.org/10.1088/0954-898X_13_4_306

Metabolically efficient information processing. Neural Computation 13:799–815. https://doi.org/10.1162/089976601300014358

Possible principles underlying the transformations of sensory messages. Sensory Communication pp. 217–234. https://doi.org/10.7551/mitpress/9780262518420.003.0013

Large-scale synchronized activity during vocal deviance detection in the zebra finch auditory forebrain. Journal of Neuroscience 32:10594–10608. https://doi.org/10.1523/JNEUROSCI.6045-11.2012

Encoding of stimulus probability in macaque inferior temporal cortex. Current Biology 26:2280–2290. https://doi.org/10.1016/j.cub.2016.07.007

Reading a neural code. Advances in Neural Information Processing Systems pp. 36–43.

How do efficient coding strategies depend on origins of noise in neural circuits? PLOS Computational Biology 12:e1005150. https://doi.org/10.1371/journal.pcbi.1005150

Information bottleneck for gaussian variables. Journal of Machine Learning Research 6:165–188.

Flexible gating of contextual influences in natural vision. Nature Neuroscience 18:1648–1655. https://doi.org/10.1038/nn.4128

Efficiency turns the table on neural encoding, decoding and noise. Current Opinion in Neurobiology 37:141–148. https://doi.org/10.1016/j.conb.2016.03.002

Bayesian spiking neurons I: inference. Neural Computation 20:91–117. https://doi.org/10.1162/neco.2008.20.1.91

Efficient codes and balanced networks. Nature Neuroscience 19:375–382. https://doi.org/10.1038/nn.4243

Asymmetric dynamics in optimal variance adaptation. Neural Computation 10:1179–1202. https://doi.org/10.1162/089976698300017403

Binary coding in auditory cortex. Advances in Neural Information Processing Systems pp. 117–124.

A simple model of optimal population coding for sensory systems. PLoS Computational Biology 10:e1003761. https://doi.org/10.1371/journal.pcbi.1003761

Owl's behavior and neural representation predicted by Bayesian inference. Nature Neuroscience 14:1061–1066. https://doi.org/10.1038/nn.2872

Statistically optimal perception and learning: from behavior to neural representations. Trends in Cognitive Sciences 14:119–130. https://doi.org/10.1016/j.tics.2010.01.003

A functional and perceptual signature of the second visual area in primates. Nature Neuroscience 16:974–981. https://doi.org/10.1038/nn.3402

Efficient sensory encoding and bayesian inference with heterogeneous neural populations. Neural Computation 26:2103–2134. https://doi.org/10.1162/NECO_a_00638

Contributions of ideal observer theory to vision research. Vision Research 51:771–781. https://doi.org/10.1016/j.visres.2010.09.027

What's that sound? auditory area CLM encodes stimulus surprise, not intensity or intensity changes. Journal of Neurophysiology 99:2809–2820. https://doi.org/10.1152/jn.01270.2007

Representation of angles embedded within contour stimuli in area v2 of macaque monkeys. Journal of Neuroscience 24:3313–3324. https://doi.org/10.1523/JNEUROSCI.4364-03.2004

Coordinated dynamic encoding in the retina using opposing forms of plasticity. Nature Neuroscience 14:1317–1322. https://doi.org/10.1038/nn.2906

Object perception as bayesian inference. Annual Review of Psychology 55:271–304. https://doi.org/10.1146/annurev.psych.55.090902.142005

Efficiency of information transmission by retinal ganglion cells. Current Biology 14:1523–1530. https://doi.org/10.1016/j.cub.2004.08.060

Correlations and neuronal population information. Annual Review of Neuroscience 39:237–256. https://doi.org/10.1146/annurev-neuro-070815-013851

A simple coding procedure enhances a neuron’s information capacity. Zeitschrift für Naturforschung c 36:910–912.

Hierarchical Bayesian inference in the visual cortex. Journal of the Optical Society of America A 20:1434–1448. https://doi.org/10.1364/JOSAA.20.001434

Encoding of natural scene movies by tonic and burst spikes in the lateral geniculate nucleus. Journal of Neuroscience 24:10731–10740. https://doi.org/10.1523/JNEUROSCI.3059-04.2004

Energy efficient neural codes. Neural Computation 8:531–543. https://doi.org/10.1162/neco.1996.8.3.531

Perceptual inference predicts contextual modulations of sensory responses. Journal of Neuroscience 32:4179–4195. https://doi.org/10.1523/JNEUROSCI.0817-11.2012

Fractional differentiation by neocortical pyramidal neurons. Nature Neuroscience 11:1335–1342. https://doi.org/10.1038/nn.2212

Bayesian inference with probabilistic population codes. Nature Neuroscience 9:1432–1438. https://doi.org/10.1038/nn1790

Cellular and circuit properties supporting different sensory coding strategies in electric fish and other systems. Current Opinion in Neurobiology 22:686–692. https://doi.org/10.1016/j.conb.2012.01.009

A behavioral role for feature detection by sensory bursts. Journal of Neuroscience 26:10542–10547. https://doi.org/10.1523/JNEUROSCI.2221-06.2006

A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001. pp. 416–423. https://doi.org/10.1109/iccv.2001.937655

Summary statistics in auditory perception. Nature Neuroscience 16:493–498. https://doi.org/10.1038/nn.3347

Energetic costs of cellular computation. PNAS 109:17978–17982. https://doi.org/10.1073/pnas.1207814109

Feature selectivity and interneuronal cooperation in the thalamocortical system. The Journal of Neuroscience 21:8136–8144. https://doi.org/10.1523/JNEUROSCI.21-20-08136.2001

Conjugate Bayesian Analysis of the Gaussian Distribution 1:16.

Sparse coding of sensory inputs. Current Opinion in Neurobiology 14:481–487. https://doi.org/10.1016/j.conb.2004.07.007

Parallel processing of sensory input by bursts and isolated spikes. Journal of Neuroscience 24:4351–4362. https://doi.org/10.1523/JNEUROSCI.0459-04.2004

A new perceptual bias reveals suboptimal population decoding of sensory responses. PLoS Computational Biology 8:e1002453. https://doi.org/10.1371/journal.pcbi.1002453

Evidence accumulation and change rate inference in dynamic environments. Neural Computation 29:1561–1610. https://doi.org/10.1162/NECO_a_00957

Spike timing and information transmission at retinogeniculate synapses. Journal of Neuroscience 30:13558–13566. https://doi.org/10.1523/JNEUROSCI.0909-10.2010

The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Science & Business Media.

Assessing the performance of neural encoding models in the presence of noise. Journal of Computational Neuroscience 8:95–112. https://doi.org/10.1023/A:1008921114108

The representation of prediction error in auditory cortex. PLOS Computational Biology 12:e1005058. https://doi.org/10.1371/journal.pcbi.1005058

Behaviorally relevant burst coding in primary sensory neurons. Journal of Neurophysiology 102:1086–1091. https://doi.org/10.1152/jn.00370.2009

Temporal coding by populations of auditory receptor neurons. Journal of Neurophysiology 103:1614–1621. https://doi.org/10.1152/jn.00621.2009

Two-dimensional adaptation in the auditory forebrain. Journal of Neurophysiology 106:1841–1861. https://doi.org/10.1152/jn.00905.2010

Computational identification of receptive fields. Annual Review of Neuroscience 36:103–120. https://doi.org/10.1146/annurev-neuro-062012-170253

Population coding in neuronal systems with correlated noise. Physical Review E 64:051904. https://doi.org/10.1103/PhysRevE.64.051904

Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society B: Biological Sciences 216:427–459. https://doi.org/10.1098/rspb.1982.0085

Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society B: Biological Sciences 265:2315–2320. https://doi.org/10.1098/rspb.1998.0577

A theory of maximizing sensory information. Biological Cybernetics 68:23–29. https://doi.org/10.1007/BF00203134

A Bayesian observer model constrained by efficient coding can explain 'anti-Bayesian' percepts. Nature Neuroscience 18:1509–1517. https://doi.org/10.1038/nn.4105

Motion illusions as optimal percepts. Nature Neuroscience 5:598–604. https://doi.org/10.1038/nn0602-858

A mixture of delta-rules approximation to bayesian inference in change-point problems. PLoS Computational Biology 9:e1003150. https://doi.org/10.1371/journal.pcbi.1003150
Decision letter

Stephanie Palmer, Reviewing Editor; University of Chicago, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Adaptive coding for dynamic sensory inference" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
The manuscript proposes a modeling framework to unify efficient coding of sensory stimuli with Bayesian inference of a latent, behaviorally relevant variable (e.g. an environmental state that generates the sensory stimulus). The main idea is to replace the traditional objective for efficient coding – quality of the sensory stimulus reconstruction – with an "inference cost", namely the discrepancy between the estimate of the latent state based on the neural code and the estimate of the Bayesian ideal observer. Optimizing for these two objectives, under the constraint of limited metabolic resources, gives qualitatively different results. In particular, optimizing for inference cost produces encoding schemes that are less metabolically expensive and more accurate (with respect to the latent state).
Overall, the reviewers found this to be a very nice contribution towards a unifying framework for these ideas. The manuscript clearly states and accurately addresses this research question. The figures are well laid out, visually appealing, and informative.
However, several revisions are requested to 1) show a relation to biological data, 2) demonstrate the generality of the results, 3) more clearly state what areas of the brain this work applies to, and 4) improve clarity of the presentation by reducing the overall length of the article.
Essential revisions:
1) The work should include more explicit relation to biological data, both published results and the authors' predictions from their work on what one might expect to see in data. Please provide some examples that link predictions of the proposed framework to published data.
Example: when discussing metamers, the authors state that "stimuli become less distinguishable to the observer as its model of the environment becomes more accurate". How does this relate to the empirical observation that visual metamers are predominant in the periphery of the visual field where resolution is low and presumably the model less accurate, but can be resolved by foveal inspection where resolution is higher and the model more accurate?
Another example: could the predictions of Figure 6 and the final paragraph of subsection “Limited gain and temporal acuity” be related to data on the stimulus dependence of adaptation in the visual cortex?
2) The manuscript needs to address the generality of these results. A very simple stimulus model with switching between two states was used; how do these results extend to more complex stimuli? Does the work here predict that these results are general and, if so, for which stimulus conditions? Substantive text revisions and additional computational results will be needed to satisfy this point.
The authors discuss the dynamical signatures that distinguish the three encoding schemes (Figure 9). To make a higher impact, they should also provide or suggest examples from biology where one of the three schemes may be more or less likely. Where would one expect to see mean vs variance changing environments? What would one have to measure to observe the dynamical signatures in Figure 9? Can the authors make some predictions/hypotheses about this?
In addition, in the context of the experimental system where the three schemes might apply, how likely is it that one would be able to distinguish between each scheme? In the panels in Figure 9, some of the differences between the different schemes are very small so they might not be measurable with great precision in specific experiments.
In particular, the reviewers noted that actual sensory stimuli are high-dimensional, and behaviorally relevant latent states (and behavioral output itself) are low-dimensional. The fact that the code optimized for inference leads to better accuracy and efficiency, compared to the code optimized for reconstruction, is true for the latent variable that has been deemed behaviorally relevant, but the result would probably be different in a more realistic generative model in which there are also other latent variables that jointly generate high-dimensional stimuli. Also, as we know from natural image and sound statistics, stimulus distributions are not well described by Gaussians or even mixtures of Gaussians. Related to this, the definition of inference cost as the expected squared error makes sense if assuming Gaussian posteriors, but perhaps the authors should use something more general to encompass a broader range of cases, like D_KL (understanding that's effectively what the authors have done for the Gaussian case).
In the Discussion, the authors claim they have "addressed the issues of both tractability and nonstationarity…", but it is not clear if this is because of the simplifying assumptions that were made for the generative model of stimuli and for the inference cost.
3) It was not clear to the reviewers if this is a theory that applies to the sensory periphery, to the cortex, or to both. The resource limitations considered here make sense if the dimensionality of the stimulus x and neural code y are the same. That would seem to be appropriate for a theory of the periphery, but then optimizing a peripheral code in complex ways for behavioral outputs may be implausible. Conversely, if this applies to cortex, we know dim(y)>>dim(x) and the choice of resource limitations might not be as relevant in practice: e.g. wouldn't a large enough population code overcome the problem of discrete response levels? It has been argued that in cortex the major challenge is the computational complexity of the inferences that need to be performed (Beck et al., 'not noisy just wrong'), and that approximate inference may be a more important constraint than resource limitations.
4) The paper would benefit from some streamlining to reduce the number of figures and the overall length of the paper. Repetitive text should be condensed or eliminated (example: Results section, fourth paragraph is a repeat of earlier statements). Overall, the Introduction could be significantly condensed. It is suggested that Box 1 be moved to the Materials and methods section, because there is significant overlap with it and Figure 1. The figures could be streamlined to some extent, perhaps a few could be combined to reduce the total figure count. At times the manuscript was hard to read, as long paragraphs are spent describing the mechanics of the effects (which are instead very clearly illustrated in the figures). The authors should consider shortening those descriptions.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Adaptive coding for dynamic sensory inference" for further consideration at eLife. Your revised article has been favorably evaluated by Timothy Behrens (Senior editor), a Reviewing editor, and two reviewers.
Thank you for your extensive and substantive revision of your paper. The manuscript has been improved, and reviewers were largely satisfied with your changes, but there are some remaining issues that need to be addressed before acceptance, as outlined below:
1) On the question of resource limitations, the reviewers agree that your argument re: sparse coding is valid, but it does raise questions about noise robustness (connected to point 2 below) and cell death, which might both argue for distributed/redundant coding. Perhaps discussing/citing Deneve's recent work on degenerate population codes is appropriate here. Please add some text to the discussion that addresses this.
2) Regarding adding noise to your model: Please include a discussion of the impact of noise on the structure of neural coding along with your text clearly stating that you have left the work of a thorough exploration of these topics for another study. Relevant literature to discuss/cite are, e.g. Zohary et al., 1994; Abbott and Dayan, 1999; Sompolinsky et al., 2001; Ma et al., 2006; Moreno-Bote et al., 2014; a review article by Kohn et al., 2016; as well as Gjorgjieva et al. bioRxiv 2017 (currently cited in the wrong place – should be moved to the discussion of the first framework). Please add some text to the Discussion that addresses this. If possible, the reviewers urge the authors to include one example of the effects of noise (a simple additive Gaussian or Poisson noise source) on their model results.
Points 1 and 2 are connected and that should also be made clear in the added Discussion text.
3) Please address in the Discussion how your results might be thought of in the context of dynamic environments that are stationary – i.e. the stimulus changes in time, and might *also* switch states, but any given state fluctuates.
https://doi.org/10.7554/eLife.32055.020
Author response
Essential revisions:
1) The work should include more explicit relation to biological data, both published results and the authors' predictions from their work on what one might expect to see in data. Please provide some examples that link predictions of the proposed framework to published data.
Example: when discussing metamers, the authors state that "stimuli become less distinguishable to the observer as its model of the environment becomes more accurate". How does this relate to the empirical observation that visual metamers are predominant in the periphery of the visual field where resolution is low and presumably the model less accurate, but can be resolved by foveal inspection where resolution is higher and the model more accurate?
Another example: could the predictions of Figure 6 and the final paragraph of subsection “Limited gain and temporal acuity” be related to data on the stimulus dependence of adaptation in the visual cortex?
First, we have added a new analysis that extends our adaptive coding framework to a more naturalistic inference task. We model this task after computations that are known to occur in the visual pathway. This includes a sparse coding model that mimics receptive fields in V1, and a projection onto curvature filters that mimics computations in V2. The output of V2 filters is used to adapt the sparsity of the population code in V1. We use this model to infer changes in local curvature of a natural image when gaze shifts from one region of the image to another. We show that this model exhibits bursts of population activity when stimuli (local image patches) are surprising, or when the observer is uncertain, consistent with the general principles that we now use to frame the paper. We link this result to a recent study that finds bursts of activity in V1 in response to stimuli that violate statistical regularities in the environment. These results can be found in a new section entitled “Adaptive coding for inference under natural conditions”, and the corresponding figure (now Figure 7).
Second, we have clarified two key signatures of encoding schemes optimized for inference: bursts to signal salient changes in the environment, and ambiguous stimulus representations when the environment is stationary. We cite a broad range of published work that provides evidence for both of these dynamical signatures; this work spans several different sensory modalities and different stages of processing. Where available, we provide evidence that these dynamics are modulated by predictive feedback from downstream areas (consistent with the feedback projections that we use to adapt the encoding scheme) and are relevant for behavior. These citations can be found in two sections of the Discussion entitled “Transient increases in fidelity signal salient changes in the environment” and “Periods of stationarity give rise to ambiguous stimulus representations”.
In response to this and other comments, we have revised our discussion of metamers. We now remove the term metamer in favor of a description of the effect, namely that in stationary environments, physically different stimuli will be increasingly likely to be perceived as similar as the observer’s model becomes aligned with the environment. We think that this phenomenon, in which the discriminability of stimuli decreases over time, is consistent with the observation of auditory metamers (as discussed in the Discussion section entitled “Periods of stationarity give rise to ambiguous stimulus representations”). It might be possible to extend our framework to the study of visual metamers, as the reviewers propose. Here, the notion of “accuracy” that the reviewers mention is related to the resolution of receptive fields in the fovea relative to the periphery. This is similar to the high- and low-resolution response levels in our discretization scheme, which we propose should be dynamically shifted over time to improve the accuracy of the observer’s model (which is distinct from the resolution of response levels used to construct this model). We think that the scenario of visual metamers could map more naturally onto an active sensing scheme in which the visual system can shift its high foveal resolution to different parts of a visual scene in order to extract information about the underlying spatiotemporal statistics. Active sensing is beyond the scope of the current paper, and as such, we have chosen not to elaborate on this example. We do, however, think that this is an interesting direction for future work.
2) The manuscript needs to address the generality of these results. A very simple stimulus model with switching between two states was used; how do these results extend to more complex stimuli? Does the work here predict that these results are general and, if so, for which stimulus conditions? Substantive text revisions and additional computational results will be needed to satisfy this point.
We have added two new sets of analyses (now shown in Figure 1 and Figure 7) and have made substantive changes to the text in order to address the generality of our results. First, we have identified and clearly stated two important principles that shape efficient coding for inference: 1) the relative utility of incoming stimuli for inference can change over time, and 2) physically different stimuli can exert a similar influence on the observer’s model of the environment, and can therefore be encoded in the same neural response without affecting the inference process. We show that both principles are shaped by uncertainty in the observer’s belief about the state of the environment, and by the surprise of incoming stimuli given this belief (Figure 1B–C). We then show that the qualitative features of this relationship between surprise, uncertainty, and the dynamics of inference hold for the estimation of both location (analogous to mean) and scale (analogous to variance) of a generalized Gaussian distribution (Figure 1D). The parameters of the generalized Gaussian distribution can be varied to generate many specific distributions (including Laplace, Gaussian, and flat), and as such can capture statistical properties of natural stimuli (see, e.g., [1]). Finally, we show that this qualitative relationship is observed in a more realistic scenario using natural image stimuli and modeled after computations in the visual pathway (Figure 7).
We note, however, that the detailed geometry of the relationship between surprise, uncertainty, and inference can change depending on the specific model of the environment. Developing a full Bayesian observer model for more naturalistic stimuli is an interesting direction for future work, but we anticipate that such a model will rely on surprise and uncertainty in a manner that is qualitatively similar to the systems explored here.
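To make the family of distributions mentioned above concrete, the following is a minimal sketch of a generalized Gaussian density and of the surprise of a stimulus under it. It is an illustration only, not the code used for Figure 1; the function names and the parameterization (location, scale, shape) are assumptions introduced here, following the standard exponential power form.

```python
import math

def generalized_gaussian_pdf(x, location=0.0, scale=1.0, shape=2.0):
    """Density of a generalized Gaussian (exponential power) distribution.

    Assumed parameterization: shape = 1 gives a Laplace density,
    shape = 2 gives a Gaussian with standard deviation scale / sqrt(2),
    and large shape approaches a flat density on
    [location - scale, location + scale].
    """
    norm = shape / (2.0 * scale * math.gamma(1.0 / shape))
    return norm * math.exp(-(abs(x - location) / scale) ** shape)

def surprise(x, location=0.0, scale=1.0, shape=2.0):
    """Negative log-probability (in nats) of stimulus x under the density."""
    return -math.log(generalized_gaussian_pdf(x, location, scale, shape))
```

Under this parameterization, a stimulus far from the assumed location yields a larger surprise than one close to it, for any finite shape parameter.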
The authors discuss the dynamical signatures that distinguish the three encoding schemes (Figure 9). To make a higher impact, they should also provide or suggest examples from biology where one of the three schemes may be more or less likely. Where would one expect to see mean vs variance changing environments? What would one have to measure to observe the dynamical signatures in Figure 9? Can the authors make some predictions/hypotheses about this?
In addition, in the context of the experimental system where the three schemes might apply, how likely is it that one would be able to distinguish between each scheme? In the panels in Figure 9, some of the differences between the different schemes are very small so they might not be measurable with great precision in specific experiments.
In response to these and other comments, we have revised our claims about the dynamical signatures of these encoding strategies. We agree that some of the differences between encoding schemes are very small. We now highlight the fact that all three encoding schemes produce qualitatively similar response properties when optimized for inference, and these response properties differ from those observed when the same encoding schemes are optimized for stimulus reconstruction. We highlight these similarities and differences in a new figure (Figure 6), and we have significantly revised the accompanying text (which can be found in the Results section entitled “Dynamical signatures of adaptive coding”). In the Discussion, we have added numerous examples of where similar dynamical signatures have been observed experimentally. These examples span both physiology and behavior, and they encompass many different sensory modalities, including vision, audition, and olfaction (see the sections entitled “Transient increases in fidelity signal salient changes in the environment” and “Periods of stationarity give rise to ambiguous stimulus representations”). Several of these studies use simple distributions of stimuli in which the mean or the variance of the stimulus distribution is switching over time. We specifically highlight examples of these types of stimulus environments, and we discuss the utility of these environments for studying inference.
However, as before, we do observe that the three encoding schemes can impact the inference process in qualitatively different ways. Rather than demonstrating this with a visual comparison between inference trajectories (as was previously shown in Figure 9), we quantify these differences in a set of new figure panels, Figure 6D–E (which also addresses a later comment). We then show that the asymmetry in the speed of inference for upward versus downward switches in variance takes a qualitatively different form for each encoding scheme: one encoding scheme accentuates this asymmetry, another nearly removes it, and a third reverses it (as compared to the optimal Bayesian model in the absence of an encoding). We believe that this difference in the relative speed of responses to upward versus downward switches provides a stronger test of the underlying encoding scheme, without needing to rely on small quantitative differences. These findings are discussed in the Results section entitled “Dynamical signatures of adaptive coding”.
We feel that it would be a misrepresentation of this work to claim that individual encoding schemes are particular to certain brain regions or stages of neural processing. Rather, we view each encoding scheme as a simplification of a particular neural computation, which can be implemented in different parts of the nervous system. When introducing each encoding scheme, we provide examples of other studies that have used similar models to describe neural computations. In the Discussion, we now highlight the features of each encoding scheme, and we hypothesize conditions under which each scheme might be useful.
In particular, the reviewers noted that actual sensory stimuli are high-dimensional, and behaviorally relevant latent states (and behavioral output itself) are low-dimensional. The fact that the code optimized for inference leads to better accuracy and efficiency, compared to the code optimized for reconstruction, is true for the latent variable that has been deemed behaviorally relevant, but the result would probably be different in a more realistic generative model in which there are also other latent variables that jointly generate high-dimensional stimuli.
By construction, we expect a code optimized for inference to yield better inference accuracy than a code optimized for reconstruction (regardless of the dimensionality of the latent space). We now clarify this in the text:
“As expected, a strategy optimized for inference achieves lower inference error than a strategy optimized for stimulus reconstruction (across all numbers of response levels), but it also does so at significantly lower coding cost.”
We agree that in a more realistic scenario in which the latent space is higher dimensional, the cost of encoding for inference could increase. However, we argue that even in complex latent spaces, an encoding scheme optimized for inference should adapt based on uncertainty and surprise and will therefore exhibit qualitatively different features than an encoding scheme optimized for reconstruction. We now highlight this in the Discussion:
“In such cases, we expect the dimensionality of the latent variable space to determine the lower bound on coding costs for inference. Even in the limit of highly complex models, however, we expect accurate inference and reconstruction to impose qualitatively different constraints on neural response properties.”
Also, as we know from natural image and sound statistics, stimulus distributions are not well described by Gaussians or even mixtures of Gaussians. Related to this, the definition of inference cost as the expected squared error makes sense if assuming Gaussian posteriors, but perhaps the authors should use something more general to encompass a broader range of cases, like D_KL (understanding that's effectively what the authors have done for the Gaussian case).
As described above, we have shown that the qualitative relationship between surprise, uncertainty, and inference dynamics extends beyond a Gaussian distribution. In particular, we demonstrate that the same qualitative relationship holds for the estimation of the location and scale of a generalized Gaussian distribution with a range of different parameters (corresponding to Laplace, Gaussian, and flat distributions). We have also shown that these principles apply to a more naturalistic inference scenario using natural image patches.
We also stress the difference between the stimulus distribution (p(x_{t}|θ_{t})) and the prior and posterior distributions over parameter values (p(θ_{t}|y_{τ < t}) and p(θ_{t}|y_{τ ≤ t}), respectively). While in our simulations, the stimulus distribution is indeed a mixture of Gaussians, the majority of the prior and posterior distributions considered here are non-Gaussian: the posterior distributions used in the mean- and variance-switching environments are bimodal, and the prior and posterior distributions over scale parameters in Figure 1 are inverse-gamma distributions. Moreover, the choice of mean squared error (MSE) as a measure of inference cost does not make any assumptions about the shape of the posterior. MSE is a cost function that is guaranteed to be minimized by the mean of the posterior distribution, regardless of the form of the posterior [2, 3]. In this sense, MSE is a fully general cost function and does not reflect any particular assumptions about Gaussianity. We now motivate our use of MSE as a measure of inference cost, and we highlight its generality:
“In order to optimize and assess the dynamics of the system, we use the point values θ̂_{t} and θ→_{t+1} as an estimate of the current state and prediction of the future state, respectively. The optimal point estimate is computed by averaging the posterior and is guaranteed to minimize the mean squared error between the estimated state θ̂_{t} and the true state θ_{t}, regardless of the form of the posterior distribution.”
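The statement that the posterior mean minimizes the expected squared error for any posterior shape can be checked numerically. The sketch below uses a hypothetical bimodal posterior over two states (the state values and probabilities are illustrative assumptions, not values from the paper) and compares the posterior mean against a brute-force search over candidate estimates.

```python
def posterior_mean(states, probs):
    """Point estimate obtained by averaging the posterior."""
    return sum(s * p for s, p in zip(states, probs))

def expected_squared_error(estimate, states, probs):
    """Expected squared error of a point estimate under the posterior."""
    return sum(p * (estimate - s) ** 2 for s, p in zip(states, probs))

# Hypothetical bimodal posterior over two environmental states.
states = [-1.0, 1.0]
probs = [0.3, 0.7]

mean_estimate = posterior_mean(states, probs)

# Brute-force search over a grid of candidate point estimates.
candidates = [i / 100.0 for i in range(-150, 151)]
best = min(candidates, key=lambda c: expected_squared_error(c, states, probs))
```

Here the grid search and the posterior mean agree, even though the posterior is bimodal and the mean does not coincide with either state.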
As mentioned by the reviewer, there are other measures of inference cost, including the KL-divergence of the posterior from the prior distribution. Such a measure would take into account not only the difference in the mean of the posterior, but also a change of uncertainty after incorporating a new stimulus sample. We have noted this explicitly in the Discussion, and we agree that this is an interesting generalization to be explored in future work:
“Other measures, such as KL-divergence, could be used to capture not only changes in point estimates, but also changes in uncertainty underlying these estimates.”
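As a small illustration of this distinction, the sketch below computes the KL-divergence of a posterior from a prior for a discrete belief. The three-state belief is a hypothetical example chosen so that the posterior mean is unchanged while the uncertainty shrinks; it is not one of the environments simulated in the paper.

```python
import math

def kl_divergence(posterior, prior):
    """KL-divergence D_KL(posterior || prior), in nats, for discrete beliefs."""
    return sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0.0)

def belief_mean(belief, states):
    """Point estimate (mean) of a discrete belief over the given states."""
    return sum(b * s for b, s in zip(belief, states))

# Hypothetical three-state example: a new stimulus sharpens the belief
# around the middle state without shifting its mean.
states = [-1.0, 0.0, 1.0]
prior = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
posterior = [0.1, 0.8, 0.1]

mean_shift = belief_mean(posterior, states) - belief_mean(prior, states)
information_gain = kl_divergence(posterior, prior)
```

In this example the point estimate does not move, so a cost based on the change in the mean registers nothing, whereas the KL-divergence is strictly positive because the belief has become more certain.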
In the Discussion, the authors claim they have "addressed the issues of both tractability and nonstationarity…", but it is not clear if this is because of the simplifying assumptions that were made for the generative model of stimuli and for the inference cost.
As described above, our choice of inference cost does not make any assumptions about the form of the inference model. Moreover, we have demonstrated that surprise and uncertainty shape inference across a range of different models of stimulus generation. Nevertheless, we have revised this sentence in the Discussion so as not to overstate our claims:
“Here, we frame general principles that constrain the dynamic balance between coding cost and task relevance, and we pose neurally plausible implementations.”
3) It was not clear to the reviewers if this is a theory that applies to the sensory periphery, to the cortex, or to both. The resource limitations considered here make sense if the dimensionality of the stimulus x and neural code y are the same. That would seem to be appropriate for a theory of the periphery, but then optimizing a peripheral code in complex ways for behavioral outputs may be implausible. Conversely, if this applies to cortex, we know dim(y)>>dim(x) and the choice of resource limitations might not be as relevant in practice: e.g. wouldn't a large enough population code overcome the problem of discrete response levels? It has been argued that in cortex the major challenge is the computational complexity of the inferences that need to be performed (Beck et al., 'not noisy just wrong'), and that approximate inference may be a more important constraint than resource limitations.
We postulate that the principles discussed in this paper can bear relevance for the entire sensory hierarchy, from periphery to central areas. We believe this to be the case based on two observations. First, neurons in all brain regions perform computations that operate on input from upstream areas, and these computations can frequently be described as probabilistic inference (e.g. [4, 5]). Second, energy limitations shape neuronal communication across the nervous system [6]. Our framework specifies how to bridge these two widespread phenomena. We now address this directly at the beginning of the Discussion.
The fact that cortex is high dimensional does not mean that resource limitations are irrelevant; one could alternatively argue that efficient energy use (at the single neuron level) becomes even more important as the system increases in size. The adaptive coding schemes discussed in this paper could be applied at the single neuron level, or they could be formulated for population codes. In either case, the number of neurons in a population places an upper bound on energy expenditure. By appropriately adapting its neural code, however, the system can operate well below this limit. In fact, it has been argued that the sparse activity observed in cortex (particularly during natural stimulation) is a demonstration of this type of efficiency [7, 8, 6].
While the majority of the paper focuses on cases where the dimensionality of the stimulus and the neural code are the same, the general principles of this framework apply to stimuli and representations of arbitrary dimensionality. We address this issue by simulating a population of model neurons responding to natural image patches (Figure 7). For simplicity, we chose the dimensionality of the neural response to be lower than the dimensionality of the stimulus; however, we expect the qualitative features of these results to hold for scenarios in which the dimensionality of the neural response is larger than the dimensionality of the stimulus. This is because the observer’s belief is most strongly affected by the surprise of incoming stimuli during periods of uncertainty, regardless of the dimensionality of the neural population in which these stimuli are encoded.
4) The paper would benefit from some streamlining to reduce the number of figures and the overall length of the paper. Repetitive text should be condensed or eliminated (example: Results section, fourth paragraph is a repeat of earlier statements). Overall, the Introduction could be significantly condensed. It is suggested that Box 1 be moved to the Materials and methods section, because there is significant overlap with it and Figure 1. The figures could be streamlined to some extent, perhaps a few could be combined to reduce the total figure count. At times the manuscript was hard to read, as long paragraphs are spent describing the mechanics of the effects (which are instead very clearly illustrated in the figures). The authors should consider shortening those descriptions.
We have reduced the original set of 9 figures to 6 figures (Figures 1–6), we have added one additional figure (Figure 7) to help address concerns 1–3 above, and we have moved Box 1 to the beginning of the Materials and methods section (now labeled “Figure 1—figure supplement 1”). This has reduced the number of graphical elements in the main text from 10 to 7. We have additionally streamlined the text by removing redundant statements, shortening descriptions of the mechanics of the effects, and condensing the Introduction. This streamlining was done throughout the text, so we do not enumerate the changes here, but we have marked all changes to the text in a separate document.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
1) On the question of resource limitations, the reviewers agree that your argument re: sparse coding is valid, but it does raise questions about noise robustness (connected to point 2 below) and cell death, which might both argue for distributed/redundant coding. Perhaps discussing/citing Deneve's recent work on degenerate population codes is appropriate here. Please add some text to the discussion that addresses this.
2) Regarding adding noise to your model: Please include a discussion of the impact of noise on the structure of neural coding along with your text clearly stating that you have left the work of a thorough exploration of these topics for another study. Relevant literature to discuss/cite are, e.g. Zohary et al., 1994; Abbott and Dayan, 1999; Sompolinsky et al., 2001; Ma et al., 2006; Moreno-Bote et al., 2014; a review article by Kohn et al., 2016; as well as Gjorgjieva et al. bioRxiv 2017 (currently cited in the wrong place – should be moved to the discussion of the first framework). Please add some text to the Discussion that addresses this. If possible, the reviewers urge the authors to include one example of the effects of noise (a simple additive Gaussian or Poisson noise source) on their model results.
Points 1 and 2 are connected and that should also be made clear in the added Discussion text.
We have revised the Discussion to include a section dedicated to the discussion of noise robustness. Within this section, we have cited a broad body of literature (including references highlighted by the reviewers) that addresses the role of noise on the structure of the neural code. While we agree that cell death is an interesting source of potential fragility within neural population codes, we feel that this is outside the scope of the present study. We have nevertheless cited Deneve’s recent work on degenerate population codes in the context of noise robustness.
As requested, we have run additional simulations to demonstrate the effects of additive Gaussian noise on our results. These results are shown in Figure 3—figure supplement 2, and are highlighted in the Discussion. We find that the accuracy of the inference process is robust to low levels of noise, but degrades significantly when noise levels approach the separation between latent states in the environment.
We acknowledge directly in the text that this example is one of many potential sources of noise, each of which can have differing effects on the structure of optimal codes. As mentioned by the reviewers, a thorough investigation of these issues is the subject of future work, and we now state this directly in the text.
These changes can be found in the following paragraphs:
“Noise can arise at different stages of neural processing, and can alter the faithful encoding and transmission of stimuli to downstream areas [9, 10]. Individual neurons and neural populations can combat the adverse effects of noise by appropriately tuning their coding strategies, for example by adjusting the gain or thresholds of individual neurons [11, 12], introducing redundancies between neural responses [13, 14, 15, 16, 17], and forming highly distributed codes [18, 19]. Such optimal coding strategies depend on the source, strength, and structure of noise [10, 14, 11, 20], and can differ significantly from strategies optimized in the absence of noise [13].
“Noise induced during encoding stages can affect downstream computations, such as the class of inference tasks considered here. To examine its impact on optimal inference, we injected additive Gaussian noise into the neural response transmitted from the encoder to the observer. We found that the accuracy of inference was robust to low levels of noise, but degraded quickly once the noise variance approached the degree of separation between environmental states (Figure 3—figure supplement 2). Although this form of Gaussian transmission noise was detrimental to the inference process, previous work has argued that noiserelated variability, if structured appropriately across a population of encoders, could support representations of the probability distributions required for optimal inference [21]. Moreover, we expect that the lossy encoding schemes developed here could be beneficial in combating noise injected prior to the encoding step, as they can guarantee that metabolic resources are not wasted in the process of representing noise fluctuations.
“Ultimately, the source and degree of noise can impact both the goal of the system and the underlying coding strategies. Here, we considered the goal of optimally inferring changes in environmental states. However, in noisy environments where the separation between latent environmental states is low, a system might need to remain stable in the presence of noise, rather than flexible to environmental changes. We expect the optimal balance between stability and flexibility to be modulated by the spread of the stimulus distribution relative to the separation between environmental states. A thorough investigation of potential sources of noise, and their impact on the balance between efficient coding and optimal inference, is the subject of future work.”
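The qualitative effect described in these paragraphs can be reproduced in a toy setting. The sketch below is not the simulation behind Figure 3—figure supplement 2; it is a simplified illustration under several assumptions made here: a two-state mean-switching environment with states at -1 and +1, a fixed hazard rate, an identity encoding (the response is the stimulus itself), and an observer that is unaware of the added transmission noise. All function names and parameter values are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def simulate_inference_error(noise_std, n_steps=5000, hazard=0.01,
                             states=(-1.0, 1.0), stim_std=1.0, seed=0):
    """Mean squared error of a Bayesian observer tracking a switching mean.

    Additive Gaussian noise with standard deviation noise_std is injected
    into the response passed to the observer. The observer assumes
    noise-free responses (an assumption of this sketch).
    """
    rng = random.Random(seed)
    state_idx = 0
    belief = 0.5  # observer's probability that the state is states[1]
    squared_error = 0.0
    for _ in range(n_steps):
        # The environment switches state with probability `hazard`.
        if rng.random() < hazard:
            state_idx = 1 - state_idx
        stimulus = rng.gauss(states[state_idx], stim_std)
        response = stimulus + rng.gauss(0.0, noise_std)
        # Prediction step: account for a possible switch.
        prior = belief * (1.0 - hazard) + (1.0 - belief) * hazard
        # Update step: Bayes' rule with the noise-free likelihood.
        like_hi = gaussian_pdf(response, states[1], stim_std)
        like_lo = gaussian_pdf(response, states[0], stim_std)
        belief = prior * like_hi / (prior * like_hi + (1.0 - prior) * like_lo)
        # Posterior-mean point estimate of the current state.
        estimate = belief * states[1] + (1.0 - belief) * states[0]
        squared_error += (estimate - states[state_idx]) ** 2
    return squared_error / n_steps
```

Calling `simulate_inference_error` with a small `noise_std` (for example 0.1) and with a `noise_std` larger than the separation between the two states (for example 3.0) gives a rough sense of how inference error grows with transmission noise in this simplified setting.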
3) Please address in the Discussion how your results might be thought of in the context of dynamic environments that are stationary – i.e. the stimulus changes in time, and might *also* switch states, but any given state fluctuates.
We have revised our discussion of the simple environment model to highlight the possibility that the environmental state could both fluctuate in time and switch states:
“Here, we used this simple environment to probe the dynamics of encoding schemes optimized for inference. We found that optimal encoding schemes respond strongly to changes in the underlying environmental state, and thereby carry information about the timescale of environmental fluctuations. In natural settings, signals vary over a range of temporal scales, and neurons are known to be capable of adapting to multiple timescales in their inputs. We therefore expect that more complex environments, for example those in which the environmental state can both switch between distinct distributions and fluctuate between values within a single distribution, will require that the encoder respond to environmental changes on multiple timescales.”
[1] Y. Karklin and M. S. Lewicki, “A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals,” Neural Computation, vol. 17, no. 2, pp. 397–423, 2005.
[2] E. T. Jaynes, Probability theory: The logic of science. Cambridge University Press, 2003.
[3] C. Robert, The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.
[4] T. S. Lee and D. Mumford, “Hierarchical Bayesian inference in the visual cortex,” JOSA A, vol. 20, no. 7, pp. 1434–1448, 2003.
[5] R. Coen-Cagli, A. Kohn, and O. Schwartz, “Flexible gating of contextual influences in natural vision,” Nature Neuroscience, vol. 18, no. 11, p. 1648, 2015.
[6] P. Sterling and S. Laughlin, Principles of neural design. MIT Press, 2015.
[7] M. R. DeWeese and A. M. Zador, “Binary coding in auditory cortex,” in Advances in Neural Information Processing Systems, pp. 117–124, 2003.
[8] W. E. Vinje and J. L. Gallant, “Sparse coding and decorrelation in primary visual cortex during natural vision,” Science, vol. 287, no. 5456, pp. 1273–1276, 2000.
[9] W. Bialek, F. Rieke, R. de Ruyter van Steveninck, and D. Warland, Spikes: exploring the neural code. MIT Press, 1997.
[10] B. A. Brinkman, A. I. Weber, F. Rieke, and E. Shea-Brown, “How do efficient coding strategies depend on origins of noise in neural circuits?,” PLoS Computational Biology, vol. 12, no. 10, p. e1005150, 2016.
[11] J. Van Hateren, “A theory of maximizing sensory information,” Biological Cybernetics, vol. 68, no. 1, pp. 23–29, 1992.
[12] J. Gjorgjieva, M. Meister, and H. Sompolinsky, “Optimal sensory coding by populations of on and off neurons,” bioRxiv, p. 131946, 2017.
[13] E. Doi and M. S. Lewicki, “A simple model of optimal population coding for sensory systems,” PLoS Computational Biology, vol. 10, no. 8, p. e1003761, 2014.
[14] G. Tkačik, J. S. Prentice, V. Balasubramanian, and E. Schneidman, “Optimal population coding by noisy spiking neurons,” Proceedings of the National Academy of Sciences, vol. 107, no. 32, pp. 14419–14424, 2010.
[15] R. Moreno-Bote, J. Beck, I. Kanitscheider, X. Pitkow, P. Latham, and A. Pouget, “Information-limiting correlations,” Nature Neuroscience, vol. 17, pp. 1410–1417, 2014.
[16] L. Abbott and P. Dayan, “The effect of correlated variability on the accuracy of a population code,” Neural Computation, vol. 11, no. 1, pp. 91–101, 1999.
[17] H. Sompolinsky, H. Yoon, K. Kang, and S. M, “Population coding in neuronal systems with correlated noise,” Physical Review E, vol. 64, p. 051904, 2001.
[18] S. Denève and C. Machens, “Efficient codes and balanced networks,” Nature Neuroscience, vol. 19, pp. 375–382, 2016.
[19] S. Denève and M. Chalk, “Efficiency turns the table on neural encoding, decoding and noise,” Current Opinion in Neurobiology, vol. 37, pp. 141–148, 2016.
[20] A. Kohn, R. Coen-Cagli, I. Kanitscheider, and A. Pouget, “Correlations and neuronal population information,” Annual Review of Neuroscience, vol. 39, pp. 237–256, 2016.
[21] W. Ma, J. Beck, P. Latham, and A. Pouget, “Bayesian inference with probabilistic population codes,” Nature Neuroscience, vol. 9, pp. 1432–1438, 2006.
https://doi.org/10.7554/eLife.32055.021
Article and author information
Author details
Funding
National Science Foundation (STC Award CCF-1231216)
 Wiktor F Mlynarski
Howard Hughes Medical Institute
 Ann M Hermundstad
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank John Briguglio, Vivek Jayaraman, Yarden Katz, Emily Mackevicius, and Josh McDermott for useful discussions and feedback on the manuscript. WM was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. AMH was supported by the Howard Hughes Medical Institute.
Reviewing Editor
 Stephanie Palmer, University of Chicago, United States
Publication history
 Received: September 15, 2017
 Accepted: April 11, 2018
 Version of Record published: July 10, 2018 (version 1)
Copyright
© 2018, Młynarski et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.