Invariant neural subspaces maintained by feedback modulation

  1. Laura B Naumann (corresponding author)
  2. Joram Keijser
  3. Henning Sprekeler
  1. Modelling of Cognitive Processes, Technical University of Berlin, Germany
  2. Bernstein Center for Computational Neuroscience, Germany

Abstract

Sensory systems reliably process incoming stimuli in spite of changes in context. Most recent models attribute this context invariance to the extraction of increasingly complex sensory features in hierarchical feedforward networks. Here, we study how context-invariant representations can be established by feedback rather than feedforward processing. We show that feedforward neural networks modulated by feedback can dynamically generate invariant sensory representations. The required feedback can be implemented as a slow and spatially diffuse gain modulation. The invariance is not present on the level of individual neurons, but emerges only on the population level. Mechanistically, the feedback modulation dynamically reorients the manifold of neural activity and thereby maintains an invariant neural subspace in spite of contextual variations. Our results highlight the importance of population-level analyses for understanding the role of feedback in flexible sensory processing.

Editor's evaluation

One of the key questions in sensory neuroscience is how cortical networks extract invariant percepts from variable sensory inputs. While much of the literature focuses on the role of feedforward hierarchical processing for extracting invariant percepts, this study proposes a novel implementation based on top-down feedback. The article analyses the underlying mechanism based on an invariant subspace and presents instantiations of this mechanism at different levels of biophysical realism.

https://doi.org/10.7554/eLife.76096.sa0

Introduction

In natural environments, our senses are exposed to a colourful mix of sensory impressions. Behaviourally relevant stimuli can appear in varying contexts, such as variations in lighting, acoustics, stimulus position, or the presence of other stimuli. Different contexts may require different responses to the same stimulus, for example, when the behavioural task changes (context dependence). Alternatively, the same response may be required for different stimuli, for example, when the sensory context changes (context invariance). Recent advances have elucidated how context-dependent processing can be performed by recurrent feedback in neural circuits (Mante et al., 2013; Wang et al., 2018a; Dubreuil et al., 2020). In contrast, the role of feedback mechanisms in context-invariant processing is not well understood.

In the classical view, stimuli are hierarchically processed towards a behaviourally relevant percept that is invariant to contextual variations. This is achieved by extracting increasingly complex features in a feedforward network (Kriegeskorte, 2015; Zhuang et al., 2021; Yamins and DiCarlo, 2016). Models of such feedforward networks have been remarkably successful at learning complex perceptual tasks (LeCun et al., 2015), and they account for various features of cortical sensory representations (DiCarlo and Cox, 2007; Kriegeskorte et al., 2008; DiCarlo et al., 2012; Hong et al., 2016; Cichy et al., 2016). Yet, these models neglect feedback pathways, which are abundant in sensory cortex (Felleman and Van Essen, 1991; Markov et al., 2014) and shape sensory processing in critical ways (Gilbert and Li, 2013). Incorporating these feedback loops into models of sensory processing increases their flexibility and robustness (Spoerer et al., 2017; Alamia et al., 2021; Nayebi et al., 2021) and improves their fit to neural data (Kar et al., 2019; Kietzmann et al., 2019; Nayebi et al., 2021). At the neuronal level, feedback is thought to modulate rather than drive local responses (Sherman and Guillery, 1998), for instance, depending on behavioural context (Niell and Stryker, 2010; Vinck et al., 2015; Kuchibhotla et al., 2017; Dipoppa et al., 2018).

Here, we investigate the hypothesis that feedback modulation provides a neural mechanism for context-invariant perception. To this end, we trained a feedback-modulated network model to perform a context-invariant perceptual task and studied the resulting neural mechanisms. We show that the feedback modulation does not need to be temporally or spatially precise and can be realised by feedback-driven gain modulation in rate-based networks of excitatory and inhibitory neurons. To solve the task, the feedback loop dynamically maintains an invariant subspace in the population representation (Hong et al., 2016). This invariance is not present at the single-neuron level. Finally, we find that the feedback conveys a nonlinear representation of the context itself, which can be hard to discern by linear decoding methods.

These findings corroborate that feedback-driven gain modulation of feedforward networks enables context-invariant sensory processing. The underlying mechanism links single-neuron modulation with its function at the population level, highlighting the importance of population-level analyses.

Results

As a simple instance of a context-invariant task, we considered a dynamic version of the blind source separation problem. The task is to recover unknown sensory sources, such as voices at a cocktail party (McDermott, 2009), from sensory stimuli that are an unknown mixture of the sources. In contrast to the classical blind source separation problem, the mixture can change in time, for example, when the speakers move around, thus providing a time-varying sensory context. Because the task requires a dynamic inference of the context, it cannot be solved by feedforward networks (Figure 1—figure supplement 1) or standard blind source separation algorithms (e.g. independent component analysis; Bell and Sejnowski, 1995; Hyvärinen and Oja, 2000). We hypothesised that this dynamic task can be solved by a feedforward network that is subject to modulation from a feedback signal. In our model, the feedback signal is provided by a modulatory system that receives both sensory stimuli and network output (Figure 1a).
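The structure of this task can be sketched in a few lines of code. The waveforms, the range of the mixing coefficients, and the context length below are illustrative choices, not the exact values used in our simulations:

```python
import numpy as np

def make_dynamic_bss_data(T=6000, context_len=1000, seed=0):
    """Generate a dynamic blind source separation dataset:
    fixed sources, piecewise-constant (context-dependent) mixing."""
    rng = np.random.default_rng(seed)
    t = np.arange(T) * 0.01
    # Two sources: compositions of sines with different frequencies
    s = np.stack([
        np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.sin(2 * np.pi * 3.0 * t),
        np.sin(2 * np.pi * 1.7 * t) + 0.5 * np.sin(2 * np.pi * 4.3 * t),
    ])
    # Each context is a new positive mixing matrix A_t ("source location")
    x = np.empty_like(s)
    for start in range(0, T, context_len):
        A = rng.uniform(0.3, 1.0, size=(2, 2))
        x[:, start:start + context_len] = A @ s[:, start:start + context_len]
    return s, x

s, x = make_dynamic_bss_data()
```

Because the mixing matrix changes between contexts, a network with fixed weights cannot map x back to s; the mapping itself must be inferred online.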

Figure 1 with 6 supplements.

Dynamic blind source separation by modulation of feedforward connections.

(a) Schematic of the feedforward network model receiving feedback modulation from a modulator (a recurrent network). (b) Top: sources (s1,2), sensory stimuli (x1,2), and network output (y1,2) for two different source locations (contexts). Bottom: deviation of output from the sources. (c) Top: modulated readout weights across six contexts (source locations); dotted lines indicate the true weights of the inverted mixing matrix. Bottom: deviation of readout from target weights. (d) Correlation between the sources and the sensory stimuli (left), the network outputs (centre), and calculation of the signal clarity (right). Error bars indicate standard deviation across 20 contexts. (e) Violin plot of the signal clarity for different noise levels in the sensory stimuli across 20 different contexts.

Dynamic blind source separation by modulation of feedforward weights

Before we gradually take this to the neural level, we illustrate the proposed mechanism in a simple example, in which the modulatory system provides a time-varying multiplicative modulation of a linear two-layer network (see ‘Materials and methods’). For illustration, we used compositions of sines with different frequencies as source signals (s, Figure 1b, top). These sources were linearly mixed to generate the sensory stimuli (x) that the network received as input; x=Ats (Figure 1a and b). The linear mixture (At) changed over time, akin to varying the location of sound sources in a room (Figure 1a). These locations provided a time-varying sensory context that changed on a slower timescale than the sources themselves. The feedforward network had to recover the sources from the mixed sensory stimuli. To achieve this, we trained the modulator to dynamically adjust the weights of the feedforward network (W0) such that the network output (y) matches the sources:

y=Wtx=(MtW0)x
Mt=modulator(history of x,y).

Because the modulation requires a dynamic inference of the context, the modulator is a recurrent neural network. The modulator was trained using supervised learning. Afterwards, its weights were fixed and it no longer had access to the target sources (see ‘Materials and methods,’ Figure 8). The modulator therefore had to use its recurrent dynamics to determine the appropriate modulatory feedback for the time-varying context, based on the sensory stimuli and the network output. Put differently, the modulator had to learn an internal model of the sensory data and the contexts, and use it to establish the desired context invariance in the output.
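The information flow above can be sketched as an untrained toy model. The layer sizes, the tanh recurrence, the elementwise reading of Mt ∘ W0, and the 1 + W·h parameterisation of the modulation are illustrative assumptions, not the trained architecture used in the paper:

```python
import numpy as np

class ModulatedNetwork:
    """Linear two-layer network whose weights are multiplicatively
    modulated by a small recurrent modulator (untrained sketch)."""
    def __init__(self, n_in=2, n_out=2, n_hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.normal(0, 1 / np.sqrt(n_in), (n_out, n_in))   # base weights
        self.W_rec = rng.normal(0, 1 / np.sqrt(n_hidden), (n_hidden, n_hidden))
        self.W_in = rng.normal(0, 1.0, (n_hidden, n_in + n_out))
        self.W_mod = rng.normal(0, 0.1, (n_out * n_in, n_hidden))   # small modulation
        self.h = np.zeros(n_hidden)   # modulator state
        self.y = np.zeros(n_out)      # previous network output

    def step(self, x):
        # Modulator sees the stimuli and the previous output (history of x, y)
        self.h = np.tanh(self.W_rec @ self.h
                         + self.W_in @ np.concatenate([x, self.y]))
        M = 1.0 + self.W_mod @ self.h            # multiplicative modulation M_t
        W_t = M.reshape(self.W0.shape) * self.W0  # W_t = M_t ∘ W_0 (elementwise)
        self.y = W_t @ x
        return self.y
```

In the paper, the modulator's weights are trained by supervised learning and then fixed; here they are random, so the sketch only illustrates the information flow, not the solution.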

After learning, the modulated network disentangled the sources, even when the context changed (Figure 1b, Figure 1—figure supplement 1a and b). Context changes produced a transient error in the network’s output, but it quickly resumed matching the sources (Figure 1b, bottom). The transient errors occur because the modulator needs time to infer the new context from the time-varying inputs before it can provide the appropriate feedback signal to the feedforward network (Figure 1—figure supplement 6a, compare with Figure 1—figure supplement 1g–i). The modulated feedforward weights inverted the linear mixture of sources by switching on the same timescale (Figure 1c).

To quantify how well the sources were separated, we measured the correlation coefficient of the outputs with each source over several contexts. Consistent with a clean separation, we found that each of the two outputs strongly correlated with only one of the sources. In contrast, the sensory stimuli showed a positive average correlation with both sources, as expected given the positive linear mixture (Figure 1d, left). We defined the signal clarity as the absolute difference between an output's correlations with the first and the second source, normalised by the sum of the two correlations and averaged over the two outputs (Figure 1d, right; see ‘Materials and methods’). The signal clarity thus quantifies the degree of signal separation, with a value close to 1 indicating a clean separation as in Figure 1d. Note that the signal clarity of the sensory stimuli is around 0.5 and can be used as a reference.
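This measure can be computed directly from correlation coefficients. The following is one reading of the definition in the text; the exact normalisation used for the figures is given in ‘Materials and methods’:

```python
import numpy as np

def signal_clarity(outputs, sources):
    """Per-output |corr with source 1 - corr with source 2|,
    normalised by the sum of the two correlations, averaged over outputs."""
    clarities = []
    for y in outputs:
        c = [abs(np.corrcoef(y, s)[0, 1]) for s in sources]
        clarities.append(abs(c[0] - c[1]) / (c[0] + c[1]))
    return float(np.mean(clarities))

# Toy check: clean separation vs. a positive linear mixture
t = np.linspace(0, 10, 1000)
s = np.stack([np.sin(t), np.sin(1.7 * t)])
A = np.array([[1.0, 0.8], [0.7, 1.0]])   # positive mixing matrix
```

For the unmixed sources the clarity is close to 1; for the mixture A @ s, each channel correlates with both sources and the clarity drops towards the stimulus baseline.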

We next probed the network’s robustness by adding noise to the sensory stimuli. We found that the signal clarity gradually decreased with increasing noise levels, but only degraded to chance performance when the signal-to-noise ratio was close to 1 (1.1 dB, Figure 1e, Figure 1—figure supplement 2e). The network performance did not depend on the specific source signals (Figure 1—figure supplement 3) or the number of sources (Figure 1—figure supplement 4) as long as it had seen them during training. Yet, because the network had to learn an internal model of the task, we expected a limited degree of generalisation to new situations. Indeed, the network was able to interpolate between source frequencies seen during training (Figure 1—figure supplement 5), but failed on sources and contexts that were qualitatively different (Figure 1—figure supplement 6b–d). The specific computations performed by the modulator are therefore idiosyncratic to the problem at hand. Hence, we did not investigate the internal dynamics of the modulator in detail, but concentrated on its effect on the feedforward network.

Since feedback-driven modulation enables flexible context-invariant processing in a simple abstract model, we wondered how this mechanism might be implemented at the neural level. For example, how does feedback-driven modulation function when feedback signals are slow and imprecise? And how does the modulation affect population activity? In the following, we will gradually increase the model complexity to account for biological constraints and pinpoint the population-level mechanisms of feedback-mediated invariance.

Invariance can be established by slow feedback modulation

Among the many modulatory mechanisms, even the faster ones are believed to operate on timescales of hundreds of milliseconds (Bang et al., 2020; Molyneaux and Hasselmo, 2002), raising the question of whether feedback-driven modulation is sufficiently fast to compensate for dynamic changes in environmental context.

To investigate how the timescale of modulation affects the performance in the dynamic blind source separation task, we trained network models in which the modulatory feedback had an intrinsic timescale that forced it to be slow. We found that the signal clarity degraded only when this timescale was on the same order of magnitude as the timescale of contextual changes (Figure 2a). Note that timescales in this model are relative and could be arbitrarily rescaled. While slower feedback modulation produced a larger initial error (Figure 2b and c), it also reduced the fluctuations in the readout weights, such that they more closely followed the optimal weights (Figure 2b). This speed-accuracy trade-off explains the lower and more variable signal clarity for slow modulation (Figure 2a): because the signal clarity was measured over the whole duration of a context, the transient onset error outweighed the reduced fluctuations.
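The intrinsic timescale can be modelled as a first-order low-pass filter on the raw modulation signal. The following sketch, with illustrative parameters, reproduces the speed-accuracy trade-off: a slow filter lags behind context switches but averages out fluctuations.

```python
import numpy as np

def slow_modulation(m_target, tau, dt=1.0):
    """First-order low-pass filter: the feedback m relaxes towards
    the raw modulation signal m_target with intrinsic timescale tau."""
    m = np.empty_like(m_target)
    m[0] = m_target[0]
    for t in range(1, len(m_target)):
        m[t] = m[t - 1] + (dt / tau) * (m_target[t] - m[t - 1])
    return m

# A context switch (step) corrupted by fast fluctuations
rng = np.random.default_rng(0)
raw = np.concatenate([np.zeros(500), np.ones(500)]) + 0.1 * rng.normal(size=1000)
fast = slow_modulation(raw, tau=2.0)
slow = slow_modulation(raw, tau=100.0)
```

Right after the switch the fast filter has already converged while the slow one still lags (the transient onset error), but within a context the slow filter's output fluctuates far less.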

Figure 2 with 1 supplement.
The network model is not sensitive to slow feedback modulation.

(a) Signal clarity in the network output for varying timescales of modulation relative to the intervals at which the source locations change. (b) Modulated readout weights across four source locations (contexts) for fast (top) and slow (centre) feedback modulation; dotted lines indicate the optimal weights (the inverse of the mixing matrix). Bottom: deviation of the readout weights from the optimal weights for fast and slow modulation. Colours correspond to the relative timescales in (a). Fast and slow timescales are 0.001 and 1, respectively. (c) Mean deviation of readout from optimal weights within contexts; averaged over 20 contexts. Colours code for timescale of modulation (see (a)). (d, e) Same as (a) but for models in which the modulatory system only received the sensory stimuli x or the network output y, respectively.

To determine architectural constraints on the modulatory system, we asked how these results depended on the input it received. So far, the modulatory system received the feedforward network’s inputs (the sensory stimuli) and its outputs (the inferred sources, see Figure 1a), but are both of these necessary to solve the task? We found that when the modulatory system only received the sensory stimuli, the model could still learn the task, though it was more sensitive to slow modulation (Figure 2d, Figure 2—figure supplement 1). When the modulatory system had to rely on the network output alone, task performance was impaired even for fast modulation (Figure 2e, Figure 2—figure supplement 1). Thus, while the modulatory system is more robust to slow modulation when it receives the network output, the output is not sufficient to solve the task.

Taken together, these results show that the biological timescale of modulatory mechanisms does not pose a problem for flexible feedback-driven processing as long as the feedback modulation changes on a faster timescale than variations in the context. In fact, slow modulation can increase processing accuracy by averaging out fluctuations in the feedback signal. Nevertheless, slow modulation likely requires the modulatory system to receive both input and output of the sensory system it modulates.

Invariance can be established by spatially diffuse feedback modulation

Neuromodulators are classically believed to diffusely affect large areas of the brain. Furthermore, signals in the brain are processed by populations of neurons. We wondered if the proposed modulation mechanism is consistent with such biological constraints. We therefore extended the network model such that the sensory stimuli are projected to a population of 100 neurons. A fixed linear readout of this population determined the network output. The neurons in the population received spatially diffuse modulatory feedback (Figure 3a) such that the feedback modulation affected neighbouring neurons similarly. We here assume that all synaptic weights to a neuron receive the same modulation, such that the feedback performs a gain modulation of neural activity (Ferguson and Cardin, 2020). The spatial specificity of the modulation was determined by the number of distinct feedback signals and their spatial spread (Figure 3b, Figure 3—figure supplement 1a).
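One way to implement such diffuse modulation is to map a small number of feedback signals onto per-neuron gains through normalised spatial kernels. The Gaussian kernel shape, the neuron positions on a line, and the default parameters below are illustrative assumptions:

```python
import numpy as np

def diffuse_gain(feedback, n_neurons=100, width=0.2):
    """Spread a few feedback signals over a neural population:
    each neuron's gain is a spatially weighted mix of the signals."""
    feedback = np.asarray(feedback, dtype=float)
    pos = np.linspace(0, 1, n_neurons)           # neuron positions
    centres = np.linspace(0, 1, len(feedback))   # feedback signal centres
    # kernel[i, j]: influence of feedback signal j on neuron i
    kernel = np.exp(-0.5 * ((pos[:, None] - centres[None, :]) / width) ** 2)
    kernel /= kernel.sum(axis=1, keepdims=True)  # normalise per neuron
    return kernel @ feedback                     # one gain per neuron
```

With a broad width, neighbouring neurons receive nearly identical modulation, so the effective feedback is low-dimensional even though every neuron is modulated.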

Figure 3 with 1 supplement.
Feedback modulation in the model can be spatially diffuse.

(a) Schematic of the feedforward network with a population that receives diffuse feedback-driven modulation. (b) Spatial spread of the modulation mediated by four modulatory feedback signals with a width of 0.2. (c) Top: per neuron modulation during eight different contexts. Bottom: corresponding deviation of the network output from sources. (d) Mean signal clarity across 20 contexts for different numbers of feedback signals; modulation width is 0.2. Error bars indicate standard deviation. Purple triangle indicates default parameters used in (c). (e) Same as (d) but for different modulation widths; number of feedback signals is 4. The modulation width ‘∞’ corresponds to uniform modulation across the population.

This population-based model with less specific feedback modulation could still solve the dynamic blind source separation task. The diffuse feedback modulation switched when the context changed, but was roughly constant within contexts (Figure 3c), as in the simple model. The effective weight from the stimuli to the network output also inverted the linear mixture of the sources (Figure 3—figure supplement 1d, compare with Figure 1c).

We found that only a few distinct feedback signals were needed for a clean separation of the sources across contexts (Figure 3d). Moreover, the feedback could have a spatially broad effect on the modulated population without degrading the signal clarity (Figure 3e, Figure 3—figure supplement 1), consistent with the low dimensionality of the context.

We conclude that, in our model, neuromodulation does not need to be spatially precise to enable flexible processing. Given that the suggested feedback-driven modulation mechanism works for slow and diffuse feedback signals, it could in principle be realised by neuromodulatory pathways present in the brain.

Invariance emerges at the population level

Having established that slow and spatially diffuse feedback modulation enables context-invariant processing, we next investigated the underlying mechanisms at the single-neuron and population level. Given that the readout of the population activity was fixed, it is not clear how the context-dependent modulation of single neurons could give rise to a context-independent network output. One possible explanation is that some of the neurons are context-invariant and are exploited by the readout. However, a first inspection of neural activity indicated that single neurons are strongly modulated by context (Figure 4a). To quantify this, we determined the signal clarity for each neuron at each stage of the feedforward network, averaged across contexts (Figure 4b). As expected, the signal clarity was low for the sensory stimuli. Intriguingly, the same was true for all neurons of the modulated neural population, indicating no clean separation of the sources at the level of single neurons. Although most neurons had a high signal clarity in some of the contexts, there was no group of neurons that consistently represented one or the other source (Figure 4c). Furthermore, the average signal clarity of the neurons did not correlate with their contribution to the readout (Figure 4d). Since single-neuron responses were not invariant, context invariance must arise at the population level.

Invariance emerges at the population level.

(a) Population activity in two contexts. (b) Violin plot of the signal clarity in the sensory stimuli (x), neural population (z), and network output (y), computed across 20 different contexts. (c) Signal clarity of single neurons in the modulated population for different contexts. (d) Correlation between average signal clarity over contexts and magnitude of neurons’ readout weight. Corresponding Pearson r and p-value are indicated in the panel. (e) Violin plot of the linear decoding performance of the sources from different stages of the feedforward network, computed across 20 contexts. The decoder was trained on a different set of 20 contexts.

To confirm this, we asked how well the sources could be decoded at different stages of the feedforward network. We trained a single linear decoder of the sources on one set of contexts and tested its generalisation to novel contexts. We found that the decoding performance was poor for the sensory stimuli (Figure 4e), indicating that these did not contain a context-invariant representation. In contrast, the sources could be decoded with high accuracy from the modulated population.

This demonstrates that while individual neurons were not invariant, the population activity contained a context-invariant subspace. In fact, the population had to contain an invariant subspace because the fixed linear readout of the population was able to extract the sources across contexts. However, the linear decoding approach shows that this subspace can be revealed from the population activity itself with only a few contexts and no knowledge of how the neural representation is used downstream. The same approach could therefore be used to reveal context-invariant subspaces in neural data from population recordings. Note that the learned readout and the decoder obtained from population activity are not necessarily identical due to the high dimensionality of the population activity compared to the sources.

Feedback reorients the population representation

The question remains how exactly the context-invariant subspace is maintained by feedback modulation. In contrast to a pure feedforward model of invariant perception (Kriegeskorte, 2015; Yamins and DiCarlo, 2016), feedback-mediated invariance requires time to establish after contextual changes. Experimentally, hallmarks of this adaptive process should be visible when comparing the population representations immediately after a change and at a later point in time. Our model allows us to cleanly separate the early and late representations by freezing the feedback signals in the initial period after a contextual change (Figure 5a), thereby disentangling the effects of feedback and context on population activity.

Figure 5 with 1 supplement.
Feedback reorients the population representation.

(a) Network output (top) and feedback modulation (bottom) for two contexts. The feedback modulation is frozen for the initial period after the context changes. (b) Population activity in the space of the two readout axes and the first principal component. Projection onto the readout is indicated at the bottom (see (c)). The signal representation is shown for different phases of the experiment. Left: context 1 with intact feedback; centre: context 2 with frozen feedback; right: context 2 with intact feedback. The blue plane spans the population activity subspace in context 1 (left). (c) Same as (b), but projected onto the readout space (dotted lines in (b)). The light blue trace corresponds to the sources. (d) Left: change in subspace orientation across 40 repetitions of the experiment, measured by the angle between the original subspace and the subspace for context changes (ctx change), feedback modulation (FB mod), and feedback modulation for similar contexts (ctx close) or dissimilar contexts (ctx far). Right: two-dimensional context space, defined by the coefficients in the mixing matrix. Arrows indicate similar (light blue) and dissimilar contexts (purple). (e) Distance between pairs of contexts versus the angle between population activity subspaces for these contexts. Circles indicate similar contexts (from the same side of the diagonal, see (d)) and triangles dissimilar contexts (from different sides of the diagonal). Pearson r and p-value indicated in the panel.

The simulated experiment consisted of three stages. First, the feedback was intact for a particular context and the network outputs closely tracked the sources. Second, the context was changed but the feedback modulation was frozen at the same value as before. As expected, this produced deviations of the output from the sources. Third, for the same context the feedback modulation was turned back on, which reinstated the source signals in the output. In this experiment, we used pure sines as signals for visualisation purposes (Figure 5a and c). To visualise the population activity in the three stages of the experiment, we considered the space of the two readout dimensions and the first principal component (Figure 5b). We chose this space rather than, for example, the first three principal components (Figure 5—figure supplement 1), because it provides an intuitive illustration of the invariant subspace.

Because the sources were two-dimensional, the population activity followed a pattern within a two-dimensional subspace (Figure 5b, left; Figure 5—figure supplement 1a). For intact feedback, this population activity matched the sources when projected onto the readout (Figure 5c, left). Changing the context while freezing the feedback rotated and stretched this representation within the same subspace, such that the readout did not match the sources (Figure 5b and c, centre). Would turning the feedback modulation back on simply reverse this transformation to re-establish an invariant subspace? We found that this was not the case. Instead, the feedback rotated the representation out of the old subspace (Figure 5b, right), thereby reorienting it into the invariant readout (Figure 5c, right).

To quantify the transformation of the population representation, we repeated this experiment multiple times and determined the angle between the neural subspaces. Consistent with the illustration in Figure 5b, changing the context did not change the subspace orientation, whereas unfreezing the feedback caused a consistent reorientation (Figure 5d). The magnitude of this subspace reorientation depended on the similarity of the old and new context. Similar contexts generally evoked population activity with similar subspace orientations (Figure 5d and e). This highlights that there is a consistent mapping between contexts and the resulting low-dimensional population activity.
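The angle between neural subspaces can be computed from the principal angles between the top principal components of two activity matrices. A sketch of this analysis (reporting the largest principal angle of a two-dimensional subspace is an illustrative choice; the paper's exact angle measure may differ):

```python
import numpy as np

def subspace_angle(Z1, Z2, dim=2):
    """Largest principal angle (degrees) between the top-`dim`
    PCA subspaces of two (time x neurons) activity matrices."""
    def top_pcs(Z):
        Zc = Z - Z.mean(axis=0)
        _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
        return Vt[:dim].T                    # (neurons, dim), orthonormal

    U, V = top_pcs(Z1), top_pcs(Z2)
    # Cosines of the principal angles are the singular values of U^T V
    sv = np.clip(np.linalg.svd(U.T @ V, compute_uv=False), -1.0, 1.0)
    return float(np.degrees(np.arccos(sv.min())))
```

Identical subspaces give an angle of 0°; a rotation of the representation out of its old plane, as induced by unfreezing the feedback, gives an angle up to 90°.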

In summary, the role of feedback-driven modulation in our model is to reorient the population representation in response to changing contexts such that an invariant subspace is preserved.

The mechanism generalises to a hierarchical Dalean network

So far, we considered a linear network, in which neural activity could be positive and negative. Moreover, feedback modulation could switch the sign of the neurons’ downstream influence, which is inconsistent with Dale’s principle. We wondered if the same population-level mechanisms would operate in a Dalean network, in which feedback is implemented as a positive gain modulation. Although gain modulation is a broadly observed phenomenon that is attributed to a range of cellular mechanisms (Ferguson and Cardin, 2020; Salinas and Thier, 2000), its effect at the population level is less clear (Shine et al., 2021).

We extended the feedforward model as follows (Figure 6a): first, all neurons had positive firing rates. Second, we split the neural population (z in the previous model) into a ‘lower-level’ (zL) and ‘higher-level’ population (zH). The lower-level population served as a neural representation of the sensory stimuli, whereas the higher-level population was modulated by feedback. This allowed a direct comparison between a modulated and an unmodulated neural population. It also allowed us to include Dalean weights between the two populations. Direct projections from the lower-level to the higher-level population were excitatory. In addition, a small population of local inhibitory neurons provided feedforward inhibition to the higher-level population. Third, the modulation of the higher-level population was implemented as a local gain modulation that scaled the neural responses. As a specific realisation of gain modulation, we assumed that feedback targeted inhibitory interneurons (e.g. in layer 1; Abs et al., 2018; Ferguson and Cardin, 2020; Cohen-Kashi Malina et al., 2021) that mediate the modulation in the higher-level population (e.g. via presynaptic inhibition; Pardi et al., 2020; Naumann and Sprekeler, 2020). This means that stronger feedback decreased the gain of neurons (Figure 6b). We will refer to these modulatory interneurons as modulation units m (green units in Figure 6a).
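A minimal sketch of such a gain function and its effect on rates follows. The sigmoidal form and its parameters are illustrative assumptions; any monotonically decreasing map of feedback into (0, 1), such as a release probability under presynaptic inhibition, would serve:

```python
import numpy as np

def gain(feedback, g_max=1.0, slope=2.0):
    """Gain (e.g. release probability) that decreases monotonically
    with the activity of the modulatory interneurons."""
    return g_max / (1.0 + np.exp(slope * np.asarray(feedback, dtype=float)))

def modulated_rates(drive, feedback):
    """Rectified (non-negative) rates of the higher-level population,
    scaled multiplicatively by the feedback-dependent gain."""
    return gain(feedback) * np.maximum(np.asarray(drive, dtype=float), 0.0)
```

Because the gain is strictly positive, the modulation rescales but never sign-flips a neuron's downstream influence, keeping the network consistent with Dale's principle.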

Figure 6 with 1 supplement.
Feedback-driven gain modulation in a hierarchical rate network.

(a) Schematic of the Dalean network comprising a lower- and higher-level population (zL and zH), a population of local inhibitory neurons (blue), and diffuse gain modulation mediated by modulatory interneurons (green). (b) Decrease in gain (i.e. release probability) with stronger modulatory feedback. (c) Top: modulation of neurons in the higher-level population for 10 different contexts. Bottom: corresponding deviation of outputs y from sources s. (d) Histogram of neuron-specific release probabilities averaged across 20 contexts (filled, light green) and during two different contexts (yellow and dark green, see (c)). (e) Violin plot of signal clarity at different stages of the Dalean model: sensory stimuli (x), lower-level (zL) and higher-level population (zH), modulatory units (m), and network output (y), computed across 20 contexts (compare with Figure 4a). (f) Violin plot of linear decoding performance of the sources from the same stages as in (e) (compare with Figure 4d). (g) Feedback modulation reorients the population activity (compare with Figure 5d).

We found that this biologically more constrained model could still learn the context-invariant processing task (Figure 6—figure supplement 1a and b). Notably, the network’s performance did not depend on specifics of the model architecture, such as the target of the modulation or the number of inhibitory neurons (Figure 6—figure supplement 1c–e). In analogy to the previous model, the gain modulation of individual neurons changed with the context and thus enabled the flexible processing required to account for varying context (Figure 6c). The average gain over contexts was similar across neurons, whereas within a context the gains were broadly distributed (Figure 6d).

To test whether the task is solved by the same population-level mechanism, we repeated our previous analyses at the single-neuron and population level. Indeed, all results generalised to the Dalean network with feedback-driven gain modulation (compare with Figures 4–6). Single neurons in the higher- and lower-level population were not context-invariant (Figure 6e), but the higher-level population contained a context-invariant subspace (Figure 6f). This was not the case for the lower-level population, underscoring that invariant representations do not just arise from projecting the sensory stimuli into a higher dimensional space. Instead, the invariant subspace in the higher-level population was again maintained by the feedback modulation, which reoriented the population activity in response to context changes (Figure 6g).

Feedback conveys a nonlinear representation of the context

Since single neurons in the higher-level population were not invariant to context, the population representation must also contain contextual information. Indeed, contextual variables could be linearly decoded from the higher-level population activity (Figure 7a). In contrast, decoding the context from the lower-level population gave much lower accuracy. This shows that the contextual information is not just inherited from the sensory stimuli but conveyed by the feedback via the modulatory units. We therefore expected that the modulatory units themselves would contain a representation of the context. To our surprise, decoding accuracy on the modulatory units was low. This seems counterintuitive, especially since the modulatory units clearly covaried with the contextual variables (Figure 7b). To understand these seemingly conflicting results, we examined how the context was represented in the activity of the modulation units.

Figure 7. Feedback conveys a nonlinear representation of the context.

(a) Linear decoding performance of the context (i.e. mixing) from the network. (b) Context variables (e.g. source locations, top) and activity of modulatory interneurons (bottom) over contexts; one of the modulatory interneurons is silent in all contexts. (c) Left: activity of the three active modulatory interneurons (see (b)) for different contexts. The context variables are colour-coded as indicated on the right. (d) Performance of different decoders trained to predict the context from the modulatory interneuron activity. Decoder types are a linear decoder, a decoder on a quadratic expansion, and a linear decoder trained to predict the inverse of the mixing matrix.

We found that the modulation unit activity did encode the contextual variables, albeit in a nonlinear way (Figure 7c). The underlying reason is that the feedback modulation needs to remove contextual variations, which requires nonlinear computations. Specifically, the blind source separation task requires an inversion of the linear mixture of sources. Consistent with this idea, nonlinear decoding approaches performed better (Figure 7d). In fact, the modulatory units contained a linear representation of the ‘inverse context’ (i.e. the inverse mixing matrix, see ‘Materials and methods’).

In summary, the higher-level population provides a linear representation not only of the stimuli, but also of the context. In contrast, the modulatory units contain a nonlinear representation of the context, which cannot be extracted by linear decoding approaches. We speculate that if contextual feedback modulation is mediated by interneurons in layer 1, they should represent the context in a nonlinear way.

Discussion

Accumulating evidence suggests that sensory processing is strongly modulated by top-down feedback projections (Gilbert and Li, 2013; Keller and Mrsic-Flogel, 2018). Here, we demonstrate that feedback-driven gain modulation of a feedforward network could underlie stable perception in varying contexts. The feedback can be slow, spatially diffuse, and low-dimensional. To elucidate how the context invariance is achieved, we performed single-neuron and population analyses. We found that invariance was not evident at the single-neuron level, but only emerged in a subspace of the population representation. The feedback modulation dynamically transformed the manifold of neural activity patterns such that this subspace was maintained across contexts. Our results provide further support that gain modulation at the single-cell level enables nontrivial computations at the population level (Failor et al., 2021; Shine et al., 2021).

Invariance in sensory processing

As an example of context-invariant sensory processing, we chose a dynamic variant of the blind source separation task. This task is commonly illustrated by a mixture of voices at a cocktail party (Cherry, 1953; McDermott, 2009). For auditory signals, bottom-up mechanisms of frequency segregation can provide a first processing step for the separation of multiple sound sources (Bronkhorst, 2015; McDermott, 2009). However, separating more complex sounds requires additional active top-down processes (Parthasarathy et al., 2020; Oberfeld and Klöckner-Nowotny, 2016). In our model, top-down feedback guides the source separation itself, while the selection of a source would occur at a later processing stage – consistent with recent evidence for ‘late selection’ (Brodbeck et al., 2020; Har-Shai Yahav and Zion Golumbic, 2021).

Although blind source separation is commonly illustrated with auditory signals, the suggested mechanism of context-invariant perception is not limited to a given sensory modality. The key nature of the task is that it contains stimulus dimensions that need to be encoded (the sources) and dimensions that need to be ignored (the context). In visual object recognition, for example, the identity of visual objects needs to be encoded, while contextual variables such as size, location, orientation, or surround need to be ignored. Neural hallmarks of invariant object recognition are present at the population level (DiCarlo and Cox, 2007; DiCarlo et al., 2012; Hong et al., 2016), and to some extent also on the level of single neurons (Quiroga et al., 2005). Classically, the emergence of invariance has been attributed to the extraction of invariant features in feedforward networks (Riesenhuber and Poggio, 1999; Wiskott and Sejnowski, 2002; DiCarlo and Cox, 2007; Kriegeskorte, 2015), but recent work also highlights the role of recurrence and feedback (Gilbert and Li, 2013; Kar et al., 2019; Kietzmann et al., 2019; Thorat et al., 2021). Here, we focused on the role of feedback, but clearly, feedforward and feedback processes are not mutually exclusive and likely work in concert to create invariance. Their relative contribution to invariant perception requires further studies and may depend on the invariance in question.

Similarly, how invariance can be learned will depend on the underlying mechanism. The feedback-driven mechanism we propose is reminiscent of meta-learning consisting of an inner and an outer loop (Hochreiter et al., 2001; Wang et al., 2018b). In the inner loop, the modulatory system infers the context to modulate the feedforward network accordingly. This process is unsupervised. In the outer loop, the modulatory system is trained to generalise across contexts. Here, we performed this training using supervised learning, which requires the modulatory system to experience the sources in isolation (or at least obtain an error signal). Such an identification of the individual sources could, for example, be aided by other sensory modalities (McDermott, 2009). However, the optimisation of the modulatory system does not necessarily require supervised learning. It could also be guided by task demands via reinforcement learning or by unsupervised priors such as a non-Gaussianity of the outputs.

Mechanisms of feedback-driven gain modulation

There are different ways in which feedback can affect local processing. Here, we focused on gain modulation (McAdams and Maunsell, 1999; Reynolds and Heeger, 2009; Vinck et al., 2015). Neuronal gains can be modulated by a range of mechanisms (Ferguson and Cardin, 2020; Shine et al., 2021). In our model, the mechanism needs to satisfy a few key requirements: (i) the modulation is not uniform across the population, (ii) it operates on a timescale similar to that of changes in context, and (iii) it is driven by a brain region that has access to the information needed to infer the context.

Classical neuromodulators such as acetylcholine (Disney et al., 2007; Kawai et al., 2007), dopamine (Thurley et al., 2008), or serotonin (Azimi et al., 2020) are signalled through specialised neuromodulatory pathways from subcortical nuclei (van den Brink et al., 2019). These neuromodulators can control the neural gain depending on behavioural states such as arousal, attention, or expectation of rewards (Ferguson and Cardin, 2020; Hasselmo and McGaughy, 2004; Bayer and Glimcher, 2005; Polack et al., 2013; Kuchibhotla et al., 2017). Their effect is typically thought to be brain-wide and long-lasting, but recent advances in measurement techniques (Sabatini and Tian, 2020; Lohani et al., 2020) indicate that it could be area- or even layer-specific, and vary on sub-second timescales (Lohani et al., 2020; Bang et al., 2020; Poorthuis et al., 2013; Pinto et al., 2013).

More specific feedback projections arrive in layer 1 of the cortex, where they target the distal dendrites of pyramidal cells and inhibitory interneurons (Douglas and Martin, 2004; Roth et al., 2016; Marques et al., 2018). Dendritic input can change the gain of the neural transfer function on fast timescales (Larkum et al., 2004; Jarvis et al., 2018). The spatial scale of the modulation will depend on the spatial spread of the feedback projections and the dendritic arborisation. Feedback to layer 1 interneurons provides an alternative mechanism of local gain control. In particular, neuron-derived neurotrophic factor-expressing (NDNF) interneurons in layer 1 receive a variety of top-down feedback projections and produce GABAergic volume transmission (Abs et al., 2018), thereby downregulating synaptic transmission (Miller, 1998; Laviv et al., 2010). This gain modulation can act on a timescale of hundreds of milliseconds (Branco and Staras, 2009; Urban-Ciecko et al., 2015; Cohen-Kashi Malina et al., 2021; Molyneaux and Hasselmo, 2002), and, although generally considered diffuse, can also be synapse type-specific (Chittajallu et al., 2013).

The question remains where in the brain the feedback signals originate. Our model requires the responsible network to receive feedforward sensory input to infer the context. In addition, feedback inputs from higher-level sensory areas to the modulatory system allow a better control of the modulated network state. Higher-order thalamic nuclei are ideally situated to integrate different sources of sensory inputs and top-down feedback (Sampathkumar et al., 2021) and mediate the resulting modulation by targeting layer 1 of lower-level sensory areas (Purushothaman et al., 2012; Roth et al., 2016; Sherman, 2016). In our task setting, the inference of the context requires the integration of sensory signals over time and therefore recurrent neural processing. For this kind of task, thalamus may not be the site of contextual inference because it lacks the required recurrent connectivity (Halassa and Sherman, 2019). However, contextual inference may be performed by higher-order cortical areas and could either be relayed back via the thalamus or transmitted directly, for example, via cortico-cortical feedback connections.

Testable predictions

Our model makes several predictions that could be tested in animals performing invariant sensory perception. Firstly, our model indicates that invariance across contexts may only be evident at the neural population level, but not at the single-cell level. Probing context invariance at different hierarchical stages of sensory processing may therefore require population recordings and corresponding statistical analyses such as neural decoding (Glaser et al., 2020). Secondly, we assumed that this context invariance is mediated by feedback modulation. The extent to which context invariance is enabled by feedback at a particular level of the sensory hierarchy could be studied by manipulating feedback connections. Since layer 1 receives a broad range of feedback inputs from different sources, this may require targeted manipulations. If no effect of feedback on context invariance is found, this may indicate either that feedforward mechanisms dominate or that the invariance in question is inherited from an earlier stage, in which it may well be the result of feedback modulation. Given that feedback is more pronounced in higher cortical areas (McAdams and Maunsell, 1999; Pardi et al., 2020), we expect that feedback plays a larger role for the more complex forms of invariance further up the sensory processing hierarchy. Thirdly, for feedback to mediate context invariance, the feedback projections need to contain a representation of the contextual variables. Our findings suggest, however, that detecting this representation may require a nonlinear decoding method. Finally, a distinguishing feature of feedback and feedforward mechanisms is that feedback mechanisms take more time. We found that immediately following a sudden contextual change, the neuronal representation initially changes within the manifold associated with the previous context. Later, the feedback reorients the manifold to re-establish the invariance on the population level.
Whether these dynamics are a signature of feedback processing or also present in feedforward networks will be an interesting question for future work.

Comparison to prior work

Computational models have implicated neuronal gain modulation in a variety of functions (Salinas and Sejnowski, 2001; Reynolds and Heeger, 2009). Even homogeneous changes in neuronal gain can achieve interesting population effects (Shine et al., 2021), such as an orthogonalisation of sensory responses (Failor et al., 2021). More heterogeneous gain modulation provides additional degrees of freedom that enable, for example, attentional modulation (Reynolds and Heeger, 2009; Carandini and Heeger, 2011), coordinate transformations (Salinas and Thier, 2000), and – when amplified by recurrent dynamics – a rich repertoire of neural trajectories (Stroud et al., 2018). Gain modulation has also been suggested as a means to establish invariant processing (Salinas and Abbott, 1997), as a biological implementation of dynamic routing (Olshausen et al., 1993). While the modulation in these models of invariance can be interpreted as an abstract form of feedback, the resulting effects on the population level were not studied.

An interesting question is by which mechanisms the appropriate gain modulation is computed. In previous work, gain factors were often learned individually for each context, for example, by gradient descent or Hebbian plasticity (Olshausen et al., 1993; Salinas and Abbott, 1997; Stroud et al., 2018), mechanisms that may be too slow to achieve invariance on a perceptual timescale (van Hemmen and Sejnowski, 2006). In our model, by contrast, the modulation is dynamically controlled by a recurrent network. Once it has been trained, such a recurrent modulatory system can rapidly infer the current context and provide an appropriate feedback signal on a timescale only limited by the modulatory mechanism.

Limitations and future work

In our model, we simplified many aspects of sensory processing. Using simplistic sensory stimuli – compositions of sines – allowed us to focus on the mechanisms at the population level, while avoiding the complexities of natural sensory stimuli and deep sensory hierarchies. Although we do not expect conceptual problems in generalising our results to more complex stimuli, such as speech or visual stimuli, the associated computational challenges are substantial. For example, the feedback in our model was provided by a recurrent network, whose parameters were trained by backpropagating errors through the network and through time. This training process can get very challenging for large networks and long temporal dependencies (Bengio et al., 1994; Pascanu et al., 2013).

In our simulations, we trained the whole model – the modulatory system, the sensory representation, and the readout. For the simplistic stimuli we used, we observed that the training process mostly concentrated on optimising the modulatory system and readout, while a random mapping of sensory stimuli to neural representations seemed largely sufficient to solve the task. For more demanding stimuli, we expect that the sensory representation the modulatory system acts upon may become more important. A well-suited representation could minimise the need for modulatory interventions (Finn et al., 2017), in a coordinated interaction of feedforward and feedback.

To understand the effects of feedback modulation on population representations, we included biological constraints in the feedforward network and the structure of the modulatory feedback. However, we did not strive to provide a biologically plausible implementation for the computation of the appropriate feedback signals and instead used an off-the-shelf recurrent neural network (Hochreiter and Schmidhuber, 1997). The question how these signals could be computed in a biologically plausible way remains for future studies. The same applies to the question how the appropriate feedback signals can be learned by local learning rules (Lillicrap et al., 2020) and how neural representations and modulatory systems learn to act in concert.

Materials and methods

To study how feedback-driven modulation can enable flexible sensory processing, we built models of feedforward networks that are modulated by feedback. The feedback was dynamically generated by a modulatory system, which we implemented as a recurrent network. The weights of the recurrent network were trained such that the feedback modulation allowed the feedforward network to solve a flexible invariant processing task.

The dynamic blind source separation task

As an instance of flexible sensory processing, we used a dynamic variant of blind source separation. In classical blind source separation, two or more unknown time-varying sources s(t) need to be recovered from a set of observations (i.e. sensory stimuli) x(t). The sensory stimuli are composed of an unknown linear mixture of the sources such that x(t)=As(t) with a fixed mixing matrix A. Recovering the sources requires finding weights W such that Wx(t)≈s(t). Ideally, W is equal to the pseudo-inverse of the unknown mixing matrix A, up to permutations.

In our dynamic blind source separation task, we model variations in the stimulus context by changing the linear mixture over time – albeit on a slower timescale than the time-varying signals. Thus, the sensory stimuli are constructed as

(1) x(t)=A(t)s(t)+σnξ(t),

where A(t) is a time-dependent mixing matrix and σn is the amplitude of additive white noise ξ(t). The time-dependent mixing matrix determines the current context and was varied in discrete time intervals nt, meaning that the mixing matrix A(t) (i.e. the context) was constant for nt samples before it changed. The goal of the dynamic blind source separation task is to recover the original signal sources s from the sensory stimuli x across varying contexts. Thus, the network model output needs to be invariant to the specific context of the sources. Note that while the context was varied, the sources themselves were the same throughout the task, unless stated otherwise. Furthermore, in the majority of experiments the number of source signals and sensory stimuli was ns=2. A list of default parameters for the dynamic blind source separation task can be found in Table 1.
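For concreteness, the stimulus construction in Equation 1 can be sketched in NumPy. This is an illustrative helper, not the authors' code; the function name `make_stimuli` and the seeding interface are our own, and the mixing matrices are row-normalised as described under 'Time-varying contexts':

```python
import numpy as np

def make_stimuli(sources, n_t, sigma_n=0.001, seed=None):
    """Sketch of Equation 1: x(t) = A(t) s(t) + sigma_n * xi(t).

    sources: array of shape (n_s, T); a new random, row-normalised
    mixing matrix A is drawn every n_t samples (one per context)."""
    rng = np.random.default_rng(seed)
    n_s, T = sources.shape
    x = np.empty_like(sources, dtype=float)
    for start in range(0, T, n_t):
        A = rng.uniform(0.0, 1.0, size=(n_s, n_s))
        A /= A.sum(axis=1, keepdims=True)  # row sums normalised to one
        block = sources[:, start:start + n_t]
        x[:, start:start + n_t] = A @ block + sigma_n * rng.standard_normal(block.shape)
    return x
```

Because the rows of each mixing matrix sum to one, a constant source passes through unchanged, which makes the row normalisation easy to check.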

Table 1
Default parameters of the dynamic blind source separation task.

Parameter                     Symbol  Value
Number of signals             ns      2
Number of samples in context  nt      1000
Additive noise                σn      0.001
Sampling frequency            fs      8 kHz

Source signals

As default source signals, we used two compositions of two sines each (‘chords’) with a sampling rate of fs=8000 Hz that can be written as

(2) s1(t)=sin(2πf11t/fs)+sin(2πf12t/fs)
(3) s2(t)=sin(2πf21t/fs)+sin(2πf22t/fs)

with frequencies f11=100 Hz, f12=125 Hz, f21=150 Hz, and f22=210 Hz. Note that in our model we measure time as the number of samples from the source signals, meaning that timescales are relative and could be arbitrarily rescaled.
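A minimal sketch of these source signals (the array layout and the helper name `chord` are our own):

```python
import numpy as np

fs = 8000             # sampling rate (Hz)
t = np.arange(16000)  # time measured in samples, as in the paper

def chord(f1, f2):
    """A 'chord' source: the sum of two sines (Equations 2 and 3)."""
    return np.sin(2 * np.pi * f1 * t / fs) + np.sin(2 * np.pi * f2 * t / fs)

s1, s2 = chord(100, 125), chord(150, 210)  # the two default sources
```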

In Figure 5, we used pure sine signals with frequency f for visualisation purposes: si = sin(2πft/fs). We also validated the model on signals that are not made of sine waves, namely a sawtooth and a square wave signal (Figure 1—figure supplement 4). Unless stated otherwise, the same signals were used for training and testing the model.

Time-varying contexts

We generated the mixing matrix A for each context by drawing random weights from a uniform distribution between 0 and 1, allowing only positive mixtures of the sources. Unless specified otherwise, we sampled new contexts for each training batch and for the test data, such that the training and test data followed the same distribution without necessarily being the same. The dimension of the mixing matrices was determined by the number of signals ns, such that A was of shape ns×ns. To keep the overall amplitude of the sensory stimuli in a similar range across different mixtures, we normalised the row sums of each mixing matrix to one. In the case of ns=2, this implies that the contexts (i.e. the mixing matrices) are drawn from a two-dimensional manifold (see Figure 8, bottom left). In addition, we only used randomly generated mixing matrices whose determinant was larger than a threshold value. We did this to ensure that each signal mixture was invertible and that the weights needed to invert the mixing matrix were not too extreme. A threshold value of 0.2 was chosen based on visual inspection of the weights of the inverted mixing matrices.
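The context-sampling procedure described above can be sketched as rejection sampling (`sample_context` is a hypothetical helper; the determinant threshold of 0.2 is the value from the text):

```python
import numpy as np

def sample_context(n_s=2, det_min=0.2, seed=None):
    """Rejection-sample a positive, row-normalised mixing matrix whose
    determinant magnitude exceeds det_min, so the mixture is safely invertible."""
    rng = np.random.default_rng(seed)
    while True:
        A = rng.uniform(0.0, 1.0, size=(n_s, n_s))
        A /= A.sum(axis=1, keepdims=True)  # row sums normalised to one
        if abs(np.linalg.det(A)) > det_min:
            return A
```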

Figure 8. Schematic of the dynamic blind source separation task, context space, and the modulated feedforward network.

Information flow is indicated by black arrows, and the flow of the error during training with backpropagation through time (BPTT) is shown in yellow.

Modulated feedforward network models

Throughout this work, we modelled feedforward networks of increasing complexity. Common to all networks was that they received the sensory stimuli x and should provide an output y that matches the source signals s. In the following, we first introduce the simplest model variant and how it is affected by feedback from the modulatory system, and subsequently describe the different model extensions.

Modulation of feedforward weights by a recurrent network

In the simplest feedforward network, the network output y(t) is simply a linear readout of the sensory stimuli x(t), with readout weights that are dynamically changed by the modulatory system:

(4) y(t) = (M(t) ⊙ W0) x(t)

where W0 are the baseline weights and M(t) the modulation provided by the modulatory system. M(t) is of the same shape as W0 and determines the element-wise multiplicative modulation of the baseline weights. Because the task requires the modulatory system to dynamically infer the context, we modelled it as a recurrent network – more specifically, a long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997) – with Nh=100 hidden units. In particular, we used LSTMs with forget gates (Gers et al., 2000) but no peephole connections (for an overview of LSTM variants, see Greff et al., 2017).

In this work, we treated the LSTM as a black-box modulatory system that receives the sensory stimuli and the feedforward network’s output and provides the feedback signal in return (Figure 1a). A linear readout of the LSTM’s output determines the modulation M(t) in Equation 4. In brief, this means that

(5) M(t)=LSTM(x(t),y(t)),

where LSTM() is a function that returns the LSTM readout. For two-dimensional sources and sensory stimuli, for instance, LSTM() receives a concatenation of the two-dimensional vectors x(t) and y(t) as input and returns a two-by-two feedback modulation matrix – one multiplicative factor for each weight in W0. The baseline weights W0 were randomly drawn from the Gaussian distribution N(1,0.001) and fixed throughout the task. The LSTM parameters and readout were learned during training of the model.
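A minimal sketch of the modulated readout in Equation 4, with an identity modulation standing in for the trained LSTM output (the helper name and the interpretation of N(1, 0.001) as mean 1 and variance 0.001 are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s = 2
# Baseline weights drawn from N(1, 0.001), interpreted as variance 0.001.
W0 = rng.normal(1.0, np.sqrt(0.001), size=(n_s, n_s))

def modulated_readout(x, M):
    """Equation 4: stimuli pass through element-wise modulated baseline weights."""
    return (M * W0) @ x

x = np.array([0.3, -0.1])
y = modulated_readout(x, np.ones((n_s, n_s)))  # M = 1: unmodulated readout
```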

Extension 1: Reducing the temporal specificity of feedback modulation

To probe our model’s sensitivity to the timescale of the modulatory feedback (Figure 2), we added a temporal filter to Equation 5. In that case, the modulation M(t) followed the dynamics

(6) τ dM(t)/dt = −M(t) + LSTM(x(t), y(t)),

with τ being the time constant of the modulation. For small τ, the feedback rapidly affects the feedforward network, whereas a large τ implies a slowly changing modulatory feedback signal. The unit of this timescale is the number of samples of the source signals. Note that the timescale of the modulation should be considered relative to the timescale of the context changes nt. As a default time constant, we used τ=100<nt (see Table 2).
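The temporal filter in Equation 6 can be integrated with forward Euler, matching the Δt=1 scheme used for training; the helper name is our own:

```python
import numpy as np

def filter_modulation(raw, tau, dt=1.0):
    """Forward-Euler integration of Equation 6: tau dM/dt = -M + raw(t).

    raw: array of shape (T, ...) holding the instantaneous LSTM output."""
    m = np.zeros_like(raw[0], dtype=float)
    out = np.empty(raw.shape, dtype=float)
    for i in range(len(raw)):
        m = m + (dt / tau) * (-m + raw[i])  # leaky integration of the feedback
        out[i] = m
    return out
```

The filter acts as a low-pass: the modulation converges to a constant input but responds slowly to changes, as controlled by τ.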

Table 2
Default parameters of the network models.

Parameter                                                 Symbol  Value
Number of hidden units in long short-term memory network  Nh      100
Number of units in middle layer z                         Nz      100
Number of distinct feedback signals                       NFB     4
Number of neurons in lower-level population               NL      40
Number of neurons in higher-level population              NH      100
Number of inhibitory neurons                              NI      20
Timescale of modulation                                   τ       100
Spatial spread of modulation                              σm2     0.2

Extension 2: Reducing the spatial specificity of feedback modulation

To allow for spatially diffuse feedback modulation (Figure 3), we added an intermediate layer between the sensory stimuli and the network output. This intermediate layer consisted of a population of Nz=100 units that were modulated by the feedback, where neighbouring units were modulated similarly. More specifically, the units were arranged on a ring to allow for a spatially constrained modulation without boundary effects. The population’s activity vector z(t) is described by

(7) z(t) = m(t) ⊙ (Wx x(t)),

with the sensory stimuli x(t), a weight matrix Wx of size Nz×ns, and the vector of unit-specific multiplicative modulations m(t). Note that the activity of the units was not constrained to be positive here. The output of the network was then determined by a linear readout of the population activity vector according to

(8) y(t)=Wroz(t)

with a fixed readout matrix Wro.

The modulation to a single unit i was given by

(9a) τ dmi(t)/dt = −mi(t) + ∑j=1^NFB Kij lj,
(9b) with lj = LSTM(x(t), y(t))j.

Here, τ is the modulation time constant, K is a kernel that determines the spatial specificity of the modulation, LSTM(·)j is the jth feedback signal from the LSTM, and NFB is the total number of feedback signals. As in the simple model, the NFB feedback signals were determined by a linear readout from the LSTM.

The modulation kernel K was defined as a set of von Mises functions:

(10) Kij = exp( (1/σm2) cos(ziloc − ljloc) ),

where ziloc = 2πi/Nz ∈ [0, 2π] represents the location of the modulated unit i on the ring and ljloc the ‘preferred location’ of modulatory unit j, that is, the location on the ring that it modulates most effectively. These ‘preferred locations’ ljloc of the feedback units were evenly distributed on the ring. The variance parameter σm2 determines the spatial spread of the modulatory effect of the feedback units, that is, the spatial specificity of the modulation. Overall, the spatial distribution of the modulation was therefore determined by the number of distinct feedback signals NFB and their spatial spread σm2 (see Table 2 for a list of network parameters).
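The kernel in Equation 10 can be written directly (`modulation_kernel` is a hypothetical helper using the default parameters from Table 2):

```python
import numpy as np

def modulation_kernel(n_z=100, n_fb=4, sigma_m2=0.2):
    """Equation 10: von Mises kernel on a ring of n_z units with n_fb
    evenly spaced feedback units."""
    z_loc = 2 * np.pi * np.arange(n_z) / n_z    # unit locations on the ring
    l_loc = 2 * np.pi * np.arange(n_fb) / n_fb  # preferred locations of feedback units
    return np.exp(np.cos(z_loc[:, None] - l_loc[None, :]) / sigma_m2)

K = modulation_kernel()  # shape (n_z, n_fb)
```

Each column of K peaks at the corresponding feedback unit's preferred location and falls off smoothly around the ring, with σm2 setting the width.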

Extension 3: Hierarchical rate-based network

We further extended the model with spatial modulation (Equations 7–10) to include a two-stage hierarchy, positive rates, and synaptic weights that obey Dale’s law. Furthermore, we implemented the feedback modulation as a gain modulation that scales neural rates but keeps them positive. To this end, we modelled the feedforward network as a hierarchy of a lower-level and a higher-level population. Only the higher-level population received feedback modulation. Splitting the neural populations in this way allowed us to model the connections between them with weights that follow Dale’s law. Furthermore, the unmodulated lower-level population could serve as a control for the emergence of context-invariant representations. The lower-level population consisted of NL=40 rate-based neurons and its population activity vector was given by

(11) zL(t) = [WLx x(t)]+,

where WLx is a fixed weight matrix, x(t) the sensory stimuli, and the rectification [·]+ = max(0, ·) ensures that rates are positive. The lower-level population thus provides a neural representation of the sensory stimuli. The higher-level population consisted of NH=100 rate-based neurons that received feedforward input from the lower-level population. The feedforward input consisted of direct excitatory projections as well as feedforward inhibition through a population of NI=20 local inhibitory neurons. The activity vector of the higher-level population zH(t) was thus given by

(12) zH(t) = [p(t) ⊙ (WHL zL(t) − WHI zI(t))]+,
(13) zI(t) = [WIL zL(t)]+.

Here, WHL, WHI, and WIL are positive weight matrices, zI(t) denotes the inhibitory neuron activities, and p(t) the neuron-specific gain modulation factors. As for the spatially modulated network of Extension 2, the network output y(t) was determined by a fixed linear readout Wro (see Equation 8). The distributions used to randomly initialise the weight matrices are provided in Table 3.

Table 3
Distributions used for randomly initialised weight parameters.

Weights                                     Distribution
W0                                          N(1, 0.001)
Wx                                          N(0, 0.5)
WLx                                         N(0, 0.5)
Wro                                         N(0, 0.5)
WHL                                         N(1, 0.5) · 20/NH
WIL                                         N(1, 0.5) / NI
WHI                                         N(1, 1) · 20/NH
Long short-term memory network parameters   U(−1/√Nh, 1/√Nh)
Long short-term memory network readout      U(−1/√NFB, 1/√NFB)

Again, the modulation was driven by feedback from the LSTM, but in this model variant we assumed inhibitory feedback, that is, stronger feedback signals monotonically decreased the gain. More specifically, we assumed that the feedback signal targets a population of modulation units m, which in turn modulate the gain in the higher-level population. The gain modulation of neuron i was constrained between 0 and 1 and determined by

(14) pi(t) = 1 / (1 + exp(mi(t)))

with mi(t) being the activity of a modulation unit i, which follows the same dynamics as in Equation 9a (see Figure 6a).
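Putting Equations 11–14 together, a single forward step of the hierarchy might look as follows. The weight shapes and scalings loosely follow Tables 2 and 3 (interpreting the second distribution parameter as the standard deviation), but this is a sketch of the computation, not the trained model:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def forward(x, m, W_Lx, W_HL, W_HI, W_IL):
    """One time step of the hierarchy (Equations 11-14): rectified rates,
    feedforward inhibition, and an inhibitory sigmoidal gain p in (0, 1)."""
    zL = relu(W_Lx @ x)                     # lower-level representation (Eq. 11)
    zI = relu(W_IL @ zL)                    # inhibitory population (Eq. 13)
    p = 1.0 / (1.0 + np.exp(m))             # gain: larger m means smaller gain (Eq. 14)
    zH = relu(p * (W_HL @ zL - W_HI @ zI))  # modulated higher-level population (Eq. 12)
    return zL, zH

rng = np.random.default_rng(0)
W_Lx = rng.normal(0.0, 0.5, (40, 2))
W_HL = np.abs(rng.normal(1.0, 0.5, (100, 40))) * 20 / 100
W_IL = np.abs(rng.normal(1.0, 0.5, (20, 40))) / 20
W_HI = np.abs(rng.normal(1.0, 1.0, (100, 20))) * 20 / 100
zL, zH = forward(rng.standard_normal(2), np.zeros(100), W_Lx, W_HL, W_HI, W_IL)
```

Because the feedback is inhibitory, strongly activating the modulation units drives the gain towards zero and silences the higher-level population.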

Training the model

We used gradient descent to find the model parameters that minimise the difference between the source signal s(t) and the feedforward network’s output y(t):

(15) ℒ = ∑t=1^nt dist(s(t), y(t))

with a distance measure dist(·). We used the machine learning framework PyTorch (Paszke et al., 2019) to simulate the network model, obtain the gradients of the objective by automatic differentiation, and update the parameters of the LSTM using the Adam optimiser (Kingma and Ba, 2014) with a learning rate of η = 10⁻³. As the distance measure in the objective, we used a smooth variant of the L1 norm (PyTorch’s smooth L1 loss) because it is less sensitive to outliers than the mean squared error (Huber, 1964).
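As a stand-alone sketch, the smooth L1 distance (Huber loss) can be written in NumPy; we assume PyTorch's default threshold beta = 1:

```python
import numpy as np

def smooth_l1(y, s, beta=1.0):
    """PyTorch-style smooth L1 (Huber) loss: quadratic for small errors,
    linear for large ones, hence less sensitive to outliers than MSE."""
    d = np.abs(y - s)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()
```

For an error of 0.5 the loss is quadratic (0.5 · 0.5² = 0.125), while for an error of 2 it grows only linearly (2 − 0.5 = 1.5).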

During training, we simulated the network dynamics over batches of 32 trials using forward Euler with a time step of Δt=1. Each trial consisted of nt time steps (i.e. samples) and the context (i.e. mixing matrix) differed between trials. Since the model contains feedback and recurrent connections, we trained it using backpropagation through time (Werbos, 1990). This means that for each trial we simulated the model and computed the loss for every time step. At the end of the trial, we propagated the error through the nt steps of the model to obtain the gradients and updated the parameters accordingly (Figure 8). Although the source signals were the same in every trial, we varied their phase independently across trials to prevent the LSTM from learning the exact signal sequence. To this end, we generated 16,000 samples of the source signals and in every batch randomly selected chunks of nt samples independently from each source. Model parameters were initialised according to the distributions listed in Table 3.

In all model variants, we optimised the parameters of the modulator (input, recurrent, and readout weights as well as the biases of the LSTM; see Equation 5 and Equation 9b). The parameters were initialised with the defaults from the corresponding PyTorch modules, as listed in Table 3. To facilitate the training in the hierarchical rate-based network despite its additional constraints, we also optimised the feedforward weights WHL, WHI, WIL, WLx, and Wro. In principle, this makes it possible to adapt the representation in the two intermediate layers such that the modulation is most effective. However, although we did not quantify it, we observed that optimising the network readout Wro facilitated the training the most, suggesting that a specific format of the sensory representations was not required for an effective modulation.

To prevent the gain modulation factors from saturating at 0 or 1, we added a regularisation term to the loss function (Equation 15) that keeps the LSTM’s output small:

(16) ℒreg = λout ∑t=1^nt ∑j=1^NFB |LSTM(x(t), y(t))j|

with λout = 10⁻⁵.

Gradient values were clipped between –1 and 1 before each update to avoid large updates. For weights that were constrained to be positive, we used their absolute value in the model. Each network was trained for 10,000–12,000 batches and for 5 random initialisations (Figure 1—figure supplement 2).

Testing and manipulating the model

Request a detailed protocol

We tested the network model performance on an independent random set of contexts (i.e. mixing matrices), but with the same source signals as during training. During testing, we also changed the context every nt steps, but the length of this interval was not crucial for performance (Figure 1—figure supplement 1d).

To manipulate the feedback modulation in the hierarchical rate-based network (Figure 4), we provided an additional input to the modulation units m in Equation 9a. We used an input of 3 or –3 depending on whether the modulation units were activated or inactivated, respectively. To freeze the feedback modulation (Figure 6), we discarded the feedback signal and held the local modulation p in Equation 14 at a constant value determined by the feedback before the manipulation. The dynamics of the LSTM were continued, but remained hidden to the feedforward network until the freezing was stopped.

Unmodulated feedforward network models

Linear regression

Request a detailed protocol

As a control, we trained feedforward networks with weights that were not changed by a modulatory system. First, we used the simplest possible network architecture, in which the sensory stimuli are linearly mapped to the outputs (Figure 1—figure supplement 1a):

(17) y(t)=Wx(t).

It is intuitive that a fixed set of weights W cannot invert two different contexts (i.e. two different mixing matrices A1 and A2). As an illustration, we trained this simple feedforward network on one context and tested it on others. To find the weights W, we used linear regression to minimise the mean squared error between the source signal s(t) and the network output y(t). The training data consisted of 1024 consecutive time steps of the sensory stimuli for a fixed context, and the test data consisted of a different 1024 time steps generated under a potentially different mixing. We repeated this procedure, training and testing a network for all combinations of 20 random contexts.
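The failure of a fixed linear readout across contexts can be reproduced in a few lines of numpy. The toy sources and seed below are ours, chosen only for illustration; the mixing follows the paper's convention of unit-norm rows:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1024)
# toy sources standing in for the paper's signals
s = np.stack([np.sin(0.05 * t), np.sign(np.sin(0.013 * t))])

def mix(sources, rng):
    """One 'context': a random mixing matrix with unit-norm rows."""
    A = rng.normal(size=(2, 2))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    return A @ sources

x_train, x_test = mix(s, rng), mix(s, rng)  # two different contexts

# Linear regression: find W such that W x(t) approximates s(t)
# on the training context
W = np.linalg.lstsq(x_train.T, s.T, rcond=None)[0].T

mse_same = np.mean((W @ x_train - s) ** 2)   # near zero: context inverted
mse_other = np.mean((W @ x_test - s) ** 2)   # substantially larger
```

On the training context, W simply inverts the mixing matrix; on a different context, the same W applies the wrong inverse and the error grows.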

Multilayer nonlinear network

Request a detailed protocol

Since solving the task was not possible with a single set of readout weights, we extended the feedforward model to include three hidden layers consisting of 32, 16, and 8 rectified linear units (Figure 1—figure supplement 1d). The input to this network was a single time point of the sensory stimuli, and the target output was the corresponding time point of the sources. We trained the multilayer network on 5000 batches of 32 contexts using Adam (learning rate 0.001) to minimise the mean squared error between the network output and the sources.
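A numpy sketch of the forward pass of this architecture (training used Adam in PyTorch; the He-style initialisation below is a generic choice and not necessarily the one used in the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_mlp(sizes, rng):
    """He-initialised weights and zero biases for the given layer sizes."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """ReLU hidden layers followed by a linear output layer."""
    for W, b in params[:-1]:
        x = relu(W @ x + b)
    W, b = params[-1]
    return W @ x + b

rng = np.random.default_rng(0)
# 2 stimulus channels -> hidden layers of 32, 16, 8 -> 2 outputs
params = init_mlp([2, 32, 16, 8, 2], rng)
y = mlp_forward(params, np.array([0.3, -0.7]))
```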

Multilayer network with sequences as input

Request a detailed protocol

Solving the task requires the network to map the same sensory stimulus to different outputs depending on the context. However, inferring the context takes more than one time point. To test if a feedforward network with access to multiple time points at once could in principle solve the task, we changed the architecture of the multilayer network such that it receives a sequence of the sensory stimuli (Figure 1—figure supplement 1g) and produces an output sequence of equal length. We again trained this network on 5000 batches of 32 contexts to minimise the error between the output sequence and the corresponding sequence of target sources. The sequence length was varied between 1 and 150.

Data analysis

Signal clarity

Request a detailed protocol

To determine task performance, we measured how clearly the source signals are represented in the network output. We first computed the correlation coefficient of each signal si with each output yj:

(18) $r_{ij} = \frac{\sum_t \left(s_i(t) - \bar{s}_i\right)\left(y_j(t) - \bar{y}_j\right)}{\sigma_{s,i}\,\sigma_{y,j}},$

where $\bar{s}_i$ and $\bar{y}_j$ are the respective temporal means and $\sigma_{s,i}$ and $\sigma_{y,j}$ the respective temporal standard deviations. The signal clarity of output yj is then given by the absolute difference between its absolute correlations with the two signals:

(19) $c_j = \left|\, |r_{1j}| - |r_{2j}| \,\right|.$

By averaging over outputs, we obtained the overall signal clarity of the network output. Note that the same measure can be computed at other processing stages of the feedforward network. For instance, we used the signal clarity of the sources in the sensory stimuli as a baseline control.
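Equations 18 and 19 can be implemented compactly. The sketch below (function name ours) normalises the cross-products by the number of samples so that r is a standard correlation coefficient; the printed formula leaves this normalisation implicit in the definition of the standard deviations:

```python
import numpy as np

def signal_clarity(sources, outputs):
    """Signal clarity averaged over outputs (Equations 18 and 19).

    sources: (2, n_t) source signals; outputs: (n_out, n_t) activities.
    r[i, j] is the correlation coefficient between source i and output
    j; the clarity of output j is | |r_1j| - |r_2j| |.
    """
    n_t = sources.shape[1]
    s_c = sources - sources.mean(axis=1, keepdims=True)
    y_c = outputs - outputs.mean(axis=1, keepdims=True)
    r = (s_c @ y_c.T) / (n_t * np.outer(sources.std(axis=1),
                                        outputs.std(axis=1)))
    return np.abs(np.abs(r[0]) - np.abs(r[1])).mean()
```

An output perfectly correlated with one source and uncorrelated with the other has clarity 1; an output correlated equally with both has clarity 0.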

Signal-to-noise ratio

Request a detailed protocol

The signal-to-noise ratio of the sensory stimuli was determined as the ratio of signal variance to noise variance. Since the mean of both the stimuli and the noise was zero, the signal-to-noise ratio could be computed as

$\mathrm{SNR} = \frac{\sigma_s^2}{\sigma_n^2},$

where $\sigma_n$ is the standard deviation of the additive white noise and $\sigma_s$ is the measured standard deviation of the noise-free sensory stimuli, which was around 0.32. We expressed the signal-to-noise ratio in decibels (dB), that is, $\mathrm{dB} = 10 \log_{10}(\mathrm{SNR})$.
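A direct transcription of these two formulas (function name ours):

```python
import numpy as np

def snr_db(stimuli_clean, noise_std):
    """Signal-to-noise ratio in decibels.

    stimuli_clean: noise-free sensory stimuli (any shape); since the
    mean is zero by construction, the signal power is their variance.
    """
    snr = stimuli_clean.std() ** 2 / noise_std ** 2
    return 10.0 * np.log10(snr)
```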

Linear decoding analysis

Signal decoding

Request a detailed protocol

We investigated the population-level invariance using a linear decoding approach. If there is an invariant population subspace, the source signals can be decoded by the same decoder across different contexts. We therefore performed linear regression between the activity of a particular population and the source signals. This linear decoder was trained on nc = 10 different contexts with nt = 1000 time points each, such that the total number of samples was 10,000. The linear decoding was then tested on 10 new contexts, and its performance was quantified using the R² measure.
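A least-squares version of this decoding analysis might look as follows (numpy sketch with intercept; sklearn's LinearRegression would serve equally well; the helper names are ours):

```python
import numpy as np

def fit_linear_decoder(activity, targets):
    """Ordinary least-squares decoder with intercept.

    activity: (n_samples, n_neurons); targets: (n_samples, n_targets).
    """
    X = np.column_stack([activity, np.ones(len(activity))])
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return W

def decode(activity, W):
    X = np.column_stack([activity, np.ones(len(activity))])
    return X @ W

def r_squared(y_true, y_pred):
    """Coefficient of determination, pooled over all targets."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```

If an invariant subspace exists, a decoder fitted on activity from some contexts should transfer to held-out contexts with high R².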

Context decoding

Request a detailed protocol

We took a similar approach to determine from which populations the context could be decoded. For the dynamic blind source separation task, the context is given by the source mixture, as determined by the mixing matrix. Since we normalised the rows of each mixing matrix, the context was determined by two context variables. We calculated the temporal average of the neuronal activities within each context and performed a linear regression of the context variables onto these averages. To exclude onset transients, we only considered the second half (500 samples) of every context. Contexts were sampled from the two-dimensional grid of potential contexts. More specifically, we sampled 20 points along each dimension and excluded contexts in which the sensory stimuli were too similar (analogously to the generation of mixing matrices), leaving 272 different contexts (see Figure 7c, right). The linear decoding performance was determined with fivefold cross-validation and measured using R². Since the modulatory feedback signals depend nonlinearly on the context (Figure 7c), we also tested two nonlinear versions of the decoding approach. First, we performed a quadratic expansion of the averaged population activity before linear decoding. Second, we tested a linear decoding of the inverse mixing matrix (four weights) instead of the two variables determining the context.
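Two ingredients of this analysis, the quadratic feature expansion and the fivefold split, can be sketched as follows (helper names ours; the exact cross-validation scheme used in the paper may differ in detail):

```python
import numpy as np

def quadratic_expansion(X):
    """Append all pairwise products x_i * x_j (i <= j) to the features.

    X: (n_samples, n_features). The expanded features allow a linear
    decoder to capture quadratic dependencies on the population activity.
    """
    n, d = X.shape
    quads = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([X] + quads)

def kfold_indices(n, k=5, rng=None):
    """Shuffled k-fold split of n sample indices (fivefold by default)."""
    idx = np.arange(n)
    if rng is not None:
        rng.shuffle(idx)
    return np.array_split(idx, k)
```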

Population subspace analysis

Request a detailed protocol

We visualised the invariant population subspaces by projecting the activity vector onto the two readout dimensions and the first principal component. To measure how the orientation of the subspaces changes when the context or feedback changes, we computed the angle between the planes spanned by the respective subspaces. These planes were fitted to the three-dimensional data described above using the least-squares method. Since we were only interested in the relative orientation of the subspaces, we used a circular measure of the angles, such that a rotation of 180° corresponded to 0°. Angles could therefore range between 0° and 90°.
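The plane fitting and the circular angle measure can be implemented via an SVD of the centred data. This is a sketch under the assumption that each plane passes through the data centroid (a total-least-squares fit; the paper's least-squares fit may differ slightly); the function names are ours:

```python
import numpy as np

def plane_normal(points):
    """Unit normal of the best-fitting plane through 3D points.

    Fits the plane by SVD on the centred coordinates; the normal is
    the right-singular vector with the smallest singular value.
    """
    centred = points - points.mean(axis=0)
    return np.linalg.svd(centred, full_matrices=False)[2][-1]

def subspace_angle_deg(points_a, points_b):
    """Angle between two fitted planes, folded into [0°, 90°].

    Taking the absolute value of the cosine makes the measure
    circular, so a 180° rotation counts as 0° (relative orientation
    only, as in the paper).
    """
    cos = abs(plane_normal(points_a) @ plane_normal(points_b))
    return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))
```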

Code availability

Request a detailed protocol

The code for models and data analysis is publicly available at https://github.com/sprekelerlab/feedback_modulation_Naumann22 (copy archived at swh:1:rev:05373b093803e464082ad5b9e8ab2dbbf43bb23e; Naumann, 2022).

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Modelling code is available at https://github.com/sprekelerlab/feedback_modulation_Naumann21 (copy archived at swh:1:rev:05373b093803e464082ad5b9e8ab2dbbf43bb23e).

References

  1. Conference
    1. Finn C
    2. Abbeel P
    3. Levine S
    (2017)
    Model-agnostic meta-learning for fast adaptation of deep networks
    In International Conference on Machine Learning. pp. 1126–1135.
  2. Conference
    1. Hochreiter S
    2. Younger AS
    3. Conwell PR
    (2001) Learning to learn using gradient descent
    In International Conference on Artificial Neural Networks. pp. 87–94.
    https://doi.org/10.1007/3-540-44668-0
  3. Journal article
    1. Miller RJ
    (1998) Presynaptic receptors
    Annual Review of Pharmacology and Toxicology 38:201–227.
    https://doi.org/10.1146/annurev.pharmtox.38.1.201
  4. Conference
    1. Paszke A
    2. Gross S
    3. Massa F
    4. Lerer A
    5. Bradbury J
    6. Chanan G
    7. Killeen T
    8. Lin Z
    9. Gimelshein N
    10. Antiga L
    (2019)
    PyTorch: an imperative style, high-performance deep learning library
    Advances in Neural Information Processing Systems. pp. 8026–8037.
  5. Book
    1. van Hemmen JL
    2. Sejnowski TJ
    (2006) How Does Our Visual System Achieve Shift and Size Invariance?
    In: van Hemmen JL, editors. Problems in Systems Neuroscience. Oxford University Press. pp. 322–340.
    https://doi.org/10.1093/acprof:oso/9780195148220.003.0016

Decision letter

  1. Srdjan Ostojic
    Reviewing Editor; Ecole Normale Superieure Paris, France
  2. Andrew J King
    Senior Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting the paper "Invariant neural subspaces maintained by feedback modulation" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by a Senior Editor. The reviewers have opted to remain anonymous.

We are sorry to say that, after consultation with the reviewers, we have decided that this work will not be considered further for publication by eLife at this time.

While all three reviewers appreciated the novelty of the proposed computational role for feedback connections, they estimated that substantial additional work would be needed to establish more firmly the mechanisms underlying context–invariant processing and its biological relevance. Given the extent of the criticisms, we have decided to reject the paper. Should further analyses allow you to fully address these criticisms we would be open to a resubmission.

Reviewer #1:

One of the key questions in sensory neuroscience is how cortical networks extract invariant percepts from variable sensory inputs. Much of the existing literature focuses on the role of feed–forward hierarchical processing for extracting such invariances. The present study proposes an alternative mechanism based on top–down feedback. Focusing on the so–called source–separation, or cocktail–party problem, the manuscript shows how sources mixed in a context–dependent manner can be separated independently of context, using feed–forward networks modulated by top–down context–dependent inputs.

The manuscript starts with a simplified, abstract network, and then progressively moves to more biologically plausible ones. By performing population analyses of network activity, the authors then argue for a mechanism based on context–invariant subspaces.

Strengths of the paper:

– novel proposal for an important class of cortical computations

– very elegant formulation of the problem

– the writing style is very clear and appealing

– network implementations at different levels of biophysical realism.

Weaknesses of the paper:

– the announced mechanism, based on invariant subspaces, is not clearly explained and needs to be supported by additional evidence.

– how the network detects contextual changes does not seem to be explained

– the analyses of network activity, their rationale and the resulting conclusions are difficult to follow.

While I very much appreciated the novelty and the elegance of the approach developed in this paper, ultimately, I was left wondering how the networks perform their task.

– The title and abstract announce a mechanism based on invariant neural subspaces. Clearly, since the readout is fixed, there must be an invariant subspace, but the key question is how it is generated and maintained across contexts. In the Results, this mechanism is explained only briefly at the very end of the results, in connection to Figure 6, which seems to be just an illustration. The authors would need to unpack what precisely the mechanism is (not clear right now) and give more evidence for it.

– An important complementary issue is how the network detects context changes. The manuscript states that "feedback–mediated invariance requires time to establish after contextual changes" (lines 245–246), but how this works does not seem to be explained. What type of error signal does the network use to change the gains?

On a related note, is the network trained on all the contexts it sees during testing, or is it able to deal with totally novel contexts?

– The logic of the sequence of analysis (optogenetic manipulations; correlation; changes in gain…) is a bit difficult to follow and needs more motivation. In particular, why is the non–linear encoding of context important?

– It is a bit surprising that the analyses focus on the most complex version of the network to examine mechanisms. Presumably the simplified networks could be leveraged to identify and explain the mechanisms in a more transparent manner.

Reviewer #2:

The authors aim to explore an understudied potential function of feedback connections: providing context–independent sensory processing. Invariant sensory processing is frequently assumed to be carried out by feedforward processing and much of the study of feedback focuses on how feedback could implement context–dependent processing. This makes this study promising and relatively novel.

The strengths of this paper are that it demonstrates convincingly and using a variety of network architectures and feedback mechanisms that feedback modulations can indeed help a network read out sensory input in a context–independent way.

The weaknesses are in the analysis and comparisons of the various networks. While the basic finding that this invariance does not result from invariant activity on the individual neuron level is interesting and of value, the explanation that it instead leads to invariant population activity is almost tautological given the network architecture. It is also unclear how the simpler models the authors present are meant to provide insight on either the more biologically detailed hierarchical model or on real neural processing, especially given that the mode of modulation in the simplest model (re–weighting of feedforward weights) differs from that of the later models (re–weighting of neural activation). In this way I don't feel that the authors fully achieved their goal of describing the mechanism of feedback modulation.

The methods appear technically sound, but I am confused by some of the choices. For example, the authors start with a single layer network where feedback modulates the weights between the input and output. This is a different mechanism than the normal neuronal gain usually attributed to feedback. The authors then add more details to push the model more in the biological direction, but multiple details are sometimes added at once and the logic behind these choices isn't always clear. I believe the authors switch to using neuronal gain when they want to explore spatially correlated modulation, but they don't talk about neuronal modulation until they introduce their full hierarchical model. The hierarchical model also adds Dale's law and a separate inhibitory population but it is not clear why these details were added or if/how they change the function of the model in a way relevant to understanding feedback modulation. Even the use of a multi–layer model is not very well motivated given that they show that this task can be completed with a very small one layer model. The simplicity of the task has implications for understanding some of these findings as well. For example, to show that modulatory signals can be spatially correlated, the authors create a model with many more neurons than is needed to solve the task and show that the modulatory signal can target nearby cells in this population similarly without sacrificing performance. But the low dimensional nature of the modulatory signal is only really an issue of interest in the context of a higher dimensional task. As a thought experiment: if the 2 neurons in the original model were simply replicated to 50 each and each population of 50 neurons was given the same modulation, this would be essentially equivalent to the original 2 cell model, but under the logic of what the authors have shown here, would supposedly demonstrate that modulatory signals still work if low dimensional. 
In this way, that analysis fell short.

I think that this work may spur more interest in studying the role of feedback for invariant sensory processing, which would be a very productive outcome. Furthermore, the demonstration that the context signals cannot be linearly readout from the cells performing the modulation is an important lesson for the analysis of neural data. I also think further reflection on the finding that the modulatory network needs direct sensory input (more so even than the input from later processing stages) will be very important for understanding how this modulation works and how it relates to biological structures. As the authors note, this may mean that their model is more akin to inputs from higher order thalamic areas, though even that mapping is imperfect due to the lack of recurrence.

I think it would help the readability of the paper if the authors included a few more brief descriptions of the methods in the Results. For example, a better description of how the signals are generated, the fact that the networks are trained with a single set of signals only, etc. Also, there were points where it wasn't clear if a network was tested under different conditions or actually retrained for them (for example, in figure 2d/e). Also, the fact that the modulation went from being on the weights to on the neurons themselves was not made clear in section "Invariance can be established by spatially diffuse feedback modulation". I also found the schematic in Figure 1a a bit confusing. I don't know why x is represented as a question mark when it is a sum of the two signals. I'd prefer a diagram that makes the dimensionality of x clearer (relatedly, why are there only 3 weights from x to y when I believe it is a 2x2 matrix).

"While we trained the modulatory system using supervised learning, the contextual inference is performed by its dynamics without access to the target sources and thus unsupervised" I feel this could be read as saying that an actual unsupervised objective was used, when in fact only supervised learning took place, so I would suggest re–wording.

I didn't understand the claim about matched EI inputs and how it depends on using gain modulation. This should probably be expanded and related to the main questions of the paper or possibly removed.

Figure 4i seems to be the main demonstration that individual neural activity itself is not invariant to context. I'd like to see a more in–depth exploration of this. Particularly, if the readout only relied on a small handful of neurons then finding that the rest of the neurons are not context–invariant wouldn't prove that individual neural invariance is not a relevant mechanism. Given that the readout from this network is known, it would be particularly easy to determine if the heavily weighted neurons in particular are or are not context invariant.

In general, I don't understand why the authors use a separately trained linear readout when trying to show that the population activity at the final layer is invariant. They eventually acknowledge that "Since this readout is obtained from the data, this procedure does not require knowledge of the readout in the network model. Note that the trained decoder and the network readout are not necessarily identical" but they don't explain why they are using this alternative readout or what new insights its use adds. Particularly, the performance of the network indicates the there is some sort of context invariant read out possible from this population, yet the authors use this other readout in a way that is seemingly supposed to add something to the explanation.

Be sure to say what errorbars are based on in all figures.

"In our model, the mechanism needs to satisfy a few key requirements: i) the modulation is not uniform across the population, ii) it operates on a timescale similar to that of changes in context, and iii) it is driven by feedback projections." I don't understand claim (iii). If anything, the results show the importance of the modulation being driven by feedforward sensory signals (figure 2d/e).

"In addition, feedback inputs from the sensory to the modulatory system allow a better control of the modulated network state." I don't see how the connections from a sensory system to a modulatory system are "feedback".

Reviewer #3:

I appreciate the didactic way in which the manuscript was written (and beautiful figures!), in particular the progression from a vanilla architecture towards the full fledged model with EI rectified neurons with spatially specific modulation. My main concerns (detailed below) are two–fold:

1. I felt that some extensions were not explicitly justified (e.g. why 2 layers instead of 2, etc)

2. I was expecting more 'reverse–engineering' of the mechanism through which the network accomplishes a context invariant projection. This is the main result of the paper, as reflected in the title, so I think it deserves more unpacking. Below I unpack these concerns, sometimes providing some suggestions to improve the motivation and clarity of the paper (without any particular order)

1. Overall, the architecture choices are a bit unjustified. In the extreme, wouldn't the LSTM alone solve the task? The addition of each feedforward layer should be better motivated (e.g. more biologically realistic? In what sense?). For example, why add an extra layer from extensions 2 and 3? If those are necessary, this should be explained. If they are not necessary, they should be removed.

2. 'Because the task requires a dynamic inference of the context, it cannot be solved by feedforward networks or standard blind source separation algorithms' I think the paper could be better motivated if this was shown explicitly with some examples.

3. A figure explicitly illustrating the training setup would help motivate what is trivially solved and what is actually challenging. For instance, in the main manuscript, it is not clear in which cases the network is trained and tested on the same contexts (ie A(t)) and which cases it is not. In the first case, the context can be easily inferred from x(t) but the latter is more challenging?

4. Although I understand that the paper is already too long, intra-/extrapolation results deserve more spotlight and unpacking in my opinion. In general, if there is a lack of space, I would merge Figure 1 and Figure 2 – and jump directly to extension 1 – and move most of Figure 2 to sup.

5. Most important concern to me: Figure 6, in which the mechanism is revealed, deserves more quantifications to explicitly pinpoint the mechanism. Three suggestions come to mind:

a. Plot the 3 PCs components (instead of just 1) and show the readout in this space. The key result is that the readout is invariant to context and this is not clearly illustrated at the moment. Instead, what is shown is that the representation changes, but that it changes in a way that preserves invariance on the readout is not clearly highlighted.

b. The authors highlight that the network is not just reversing the new mixing coefficients and projecting the activity back into the 2d low manifold. Instead, it is rotating everything out of this manifold. My suggestion would be to show this alternative explicitly. Is it actually possible? Relatedly, what happens if the context is changed back to context 1?

c. Finally, all the statements made about this figure should be quantified and not just illustrated for 1 trial.

https://doi.org/10.7554/eLife.76096.sa1

Author response

[Editors’ note: The authors appealed the original decision. What follows is the authors’ response to the first round of review.]

Reviewer #1:

One of the key questions in sensory neuroscience is how cortical networks extract invariant percepts from variable sensory inputs. Much of the existing literature focuses on the role of feed–forward hierarchical processing for extracting such invariances. The present study proposes an alternative mechanism based on top–down feedback. Focusing on the so–called source–separation, or cocktail–party problem, the manuscript shows how sources mixed in a context–dependent manner can be separated independently of context, using feed–forward networks modulated by top–down context–dependent inputs.

The manuscript starts with a simplified, abstract network, and then progressively moves to more biologically plausible ones. By performing population analyses of network activity, the authors then argue for a mechanism based on context–invariant subspaces.

Strengths of the paper:

– novel proposal for an important class of cortical computations

– very elegant formulation of the problem

– the writing style is very clear and appealing

– network implementations at different levels of biophysical realism.

Weaknesses of the paper:

– the announced mechanism, based on invariant subspaces, is not clearly explained and needs to be supported by additional evidence.

– how the network detects contextual changes does not seem to be explained

– the analyses of network activity, their rationale and the resulting conclusions are difficult to follow.

We thank the reviewer for their interest in our work. We hope to have addressed all of the weaknesses listed above with the revision of the manuscript. In brief:

– We provide a more in-depth analysis of the mechanism based on invariant subspaces (see Change 2 above).

– We show and discuss what the network needs to detect context changes (see Change 4 above).

– We rearranged the results in the manuscript and added additional explanations to make the sequence of analyses and main findings clearer (see Change 1 and 3 above).

While I very much appreciated the novelty and the elegance of the approach developed in this paper, ultimately, I was left wondering how the networks perform their task.

– The title and abstract announce a mechanism based on invariant neural subspaces. Clearly, since the readout is fixed, there must be an invariant subspace, but the key question is how it is generated and maintained across contexts. In the Results, this mechanism is explained only briefly at the very end of the results, in connection to Figure 6, which seems to be just an illustration. The authors would need to unpack what precisely the mechanism is (not clear right now) and give more evidence for it.

We agree with the reviewer that the mechanism based on invariant subspaces needed more unpacking to underscore the main claims of the paper. As outlined in Change 2 above, we have extended our analyses of the mechanism and provided a quantification of the findings illustrated in the previous Figure 6.

– An important complementary issue is how the network detects context changes. The manuscript states that "feedback–mediated invariance requires time to establish after contextual changes" (lines 245–246), but how this works does not seem to be explained. What type of error signal does the network use to change the gains?

On a related note, is the network trained on all the contexts it sees during testing, or is it able to deal with totally novel contexts?

We thank the reviewer for pointing out that this was not sufficiently explained. As described in Change 4, we have added several sentences and a supplementary figure to clarify how the modulator maps its input to the correct feedback modulation:

“[…] The modulator therefore had to use its recurrent dynamics to determine the appropriate modulatory feedback for the time-varying context, based on the sensory stimuli and the network output. Put differently, the modulator had to learn an internal model of the sensory data and the contexts, and use it to establish the desired context invariance in the output.“ (ll. 57-61)

“Because the network had to learn an internal model of the task, we expected a limited degree of generalisation to new situations. Indeed, the network was able to interpolate between source frequencies seen during training (Supp. Figure S5), but failed on sources and contexts that were qualitatively different (Supp. Figure S6 b-d). The specific computations performed by the modulator are therefore idiosyncratic to the problem at hand. Hence, we did not investigate the internal dynamics of the modulator in detail, but concentrated on its effect on the feedforward network.” (ll. 85-91)

These excerpts should also answer the question regarding new contexts. Note that contexts were randomly sampled from a continuous context space for training and testing. The network is therefore not tested on the exact same contexts, but on contexts from the same distribution (unless specified otherwise). We have also clarified this in the methods section:

“Unless specified otherwise, we sampled new contexts for each training batch and for the test data, such that the training and test data followed the same distribution without necessarily being the same.” (ll. 494-497)

Regarding the question of the error signal: The modulator does not use an error signal, but computes a new mapping from its input to the correct modulation in response to context changes. Since the inputs are time-dependent the modulator needs to see a sufficient number of time-points before it can provide the appropriate feedback signal.

In additional analyses we found that a feedforward network can also solve the task, if it is given a time window of the sensory signals rather than a single moment in time. Such a network requires about the same number of timesteps from the stimuli (compare with Supp. Figure S1i and Supp. Figure S6a) as the modular-based architecture. This shows that the time it takes the network to infer the context from its input is not particular to our model but to the task, i.e. the statistics of the sources and contexts.

– The logic of the sequence of analysis (optogenetic manipulations; correlation; changes in gain…) is a bit difficult to follow and needs more motivation. In particular, why is the non–linear encoding of context important?

As described in the general answer, we agree that the sequence of analyses was not intuitive to the reader and thus rearranged the results in the new version of the manuscript. In particular, we have removed the optogenetic manipulations and considerations of E/I balance, since they are not central to the message of the paper (see Change 1 and 3 above).

Regarding the non-linear encoding of the context: We now include a deeper discussion of the results (see Change 3) that aim to clarify the relevance of these findings:

“In summary, the higher-population provides a linear representation not only of the stimuli, but also of the context. In contrast, the modulatory units contained a nonlinear representation of the context, which could not be extracted by linear decoding approaches. We speculate that if contextual feedback modulation is mediated by interneurons in layer 1, they should represent the context in a nonlinear way.” (ll. 288-292).

– It is a bit surprising that the analyses focus on the most complex version of the network to examine mechanisms. Presumably the simplified networks could be leveraged to identify and explain the mechanisms in a more transparent manner.

We want to thank the reviewer for this suggestion as we think it has significantly helped us to improve the structure of the manuscript. We now perform the single cell and population analyses on the network with spatially diffuse modulation (from Figure 3), as this is the simplest model comprising a neural population. We then verify that the findings hold for the Dalean network (new Figure 6). Details on the new order of the results can be found in the list of major changes above (specifically point 1).

Reviewer #2:

The authors aim to explore an understudied potential function of feedback connections: providing context-independent sensory processing. Invariant sensory processing is frequently assumed to be carried out by feedforward processing, and much of the study of feedback focuses on how feedback could implement context-dependent processing. This makes this study promising and relatively novel.

The strengths of this paper are that it demonstrates convincingly, and using a variety of network architectures and feedback mechanisms, that feedback modulations can indeed help a network read out sensory input in a context-independent way.

The weaknesses are in the analysis and comparisons of the various networks. While the basic finding that this invariance does not result from invariant activity on the individual neuron level is interesting and of value, the explanation that it instead leads to invariant population activity is almost tautological given the network architecture. It is also unclear how the simpler models the authors present are meant to provide insight on either the more biologically detailed hierarchical model or on real neural processing, especially given that the mode of modulation in the simplest model (re-weighting of feedforward weights) differs from that of the later models (re-weighting of neural activation). In this way, I don't feel that the authors fully achieved their goal of describing the mechanism of feedback modulation.

The methods appear technically sound, but I am confused by some of the choices. For example, the authors start with a single layer network where feedback modulates the weights between the input and output. This is a different mechanism than the normal neuronal gain usually attributed to feedback. The authors then add more details to push the model more in the biological direction, but multiple details are sometimes added at once and the logic behind these choices isn't always clear. I believe the authors switch to using neuronal gain when they want to explore spatially correlated modulation, but they don't talk about neuronal modulation until they introduce their full hierarchical model. The hierarchical model also adds Dale's law and a separate inhibitory population but it is not clear why these details were added or if/how they change the function of the model in a way relevant to understanding feedback modulation. Even the use of a multi-layer model is not very well motivated given that they show that this task can be completed with a very small one layer model. The simplicity of the task has implications for understanding some of these findings as well. For example, to show that modulatory signals can be spatially correlated, the authors create a model with many more neurons than is needed to solve the task and show that the modulatory signal can target nearby cells in this population similarly without sacrificing performance. But the low dimensional nature of the modulatory signal is only really an issue of interest in the context of a higher dimensional task. As a thought experiment: if the 2 neurons in the original model were simply replicated to 50 each and each population of 50 neurons was given the same modulation, this would be essentially equivalent to the original 2 cell model, but under the logic of what the authors have shown here, would supposedly demonstrate that modulatory signals still work if low dimensional. In this way, that analysis fell short.

I think that this work may spur more interest in studying the role of feedback for invariant sensory processing, which would be a very productive outcome. Furthermore, the demonstration that the context signals cannot be linearly read out from the cells performing the modulation is an important lesson for the analysis of neural data. I also think further reflection on the finding that the modulatory network needs direct sensory input (more so even than the input from later processing stages) will be very important for understanding how this modulation works and how it relates to biological structures. As the authors note, this may mean that their model is more akin to inputs from higher order thalamic areas, though even that mapping is imperfect due to the lack of recurrence.

We thank the reviewer for their thorough assessment of our work and for raising important concerns about the sequence of analyses and line of argumentation. In the following we first address what we think are the main concerns raised. After that, we respond to the specific recommendations in more detail.

Main concerns and answers:

1. Concern: The emergence of subspaces at the population level is not a surprising finding.

It is indeed not surprising that the neural population contains a context-invariant subspace. We do not think the existence of an invariant subspace is the main finding of our work, but rather how this subspace can be dynamically maintained by feedback modulation. Nevertheless, we think it is worth noting that the invariant subspace can not only be extracted with the readout learned during training, but also with a separate decoder that was trained on only a few contexts. To make clearer why this could be of interest, we have edited the respective text:

“In fact, the population had to contain an invariant subspace, because the fixed linear readout of the population was able to extract the sources across contexts. However, the linear decoding approach shows that this subspace can be revealed from the population activity itself with only a few contexts and no knowledge of how the neural representation is used downstream. The same approach could therefore be used to reveal context-invariant subspaces in neural data from population recordings.” (ll. 179-185)

2. Concern: It is unclear how simpler models give insight into the more complex models. Relatedly, some model choices are not well motivated.

We acknowledge that the connection between the models was not made sufficiently clear. To address this, we made several changes to the manuscript. First, we demonstrate explicitly that the network with spatial modulation finds an equivalent solution to the initial linear readout network. In particular, we show that the effective weights from the stimuli to the network output in the network with spatial modulation also follow the inverse of the mixing (Supp. Figure S8d) and describe this in the text:

“The diffuse feedback modulation switched when the context changed, but was roughly constant within contexts (Figure 3c), as in the simple model. The effective weight from the stimuli to the network output also inverted the linear mixture of the sources (Supp. Figure S8d, compare with Figure 1c).” (ll. 143-146)
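The inverted-mixing solution in this quote can be written down directly (a linear-algebra sketch with an arbitrary 2×2 mixing matrix, not the trained network): for stimuli x(t) = A s(t), a context-invariant output y(t) = s(t) requires the effective weights to equal A⁻¹.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two sources, mixed by one context's 2x2 mixing matrix A
s = rng.standard_normal((2, 500))   # sources over time
A = rng.standard_normal((2, 2))     # contextual mixing
x = A @ s                           # sensory stimuli

# For y = W_eff @ x to reproduce the sources, the feedback modulation must
# shape the effective weights into the inverse of the mixing:
W_eff = np.linalg.inv(A)
y = W_eff @ x
print(np.allclose(y, s))  # True
```

When the context (A) changes, W_eff must change with it, which is exactly what the context-dependent modulation provides.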

Second, we perform the population level analyses on the simpler spatially modulated model and then verify that the same results hold for the Dalean network (see Major Change 1 and 3 above). This demonstrates that our findings are not a result of the specific architecture of the feedforward model. Third, we have added a few more sentences to motivate the distinction between the two neural populations in the Dalean model:

“We extended the feedforward model as follows (Figure 6a): First, all neurons had positive firing rates. Second, we split the neural population (z in the previous model) into a "lower-level" (zL) and "higher-level" population (zH). The lower-level population served as a neural representation of the sensory stimuli, whereas the higher-level population was modulated by feedback. This allowed a direct comparison between a modulated and an unmodulated neural population. It also allowed us to include Dalean weights between the two populations.” (ll. 236-242)

3. Concern: The feedback can only be low-dimensional, because the task is low-dimensional.

The reviewer is right that the low dimensionality of the task is why the feedback modulation can be low-dimensional. More precisely, it is the low dimensionality of the context that allows a low-dimensional feedback modulation. We acknowledge this in lines 148-150 of the manuscript:

“Moreover, the feedback could have a spatially broad effect on the modulated population without degrading the signal clarity (Figure 3e, Supp. Figure S6), consistent with the low dimensionality of the context.” (ll. 148-150)

The size of the neural population necessary to solve the task is more related to the dimensionality of the stimuli or the degree of nonlinearity in the input-output mapping. Therefore, a low-dimensional and diffuse modulation may still be functional for more high-dimensional or nonlinear tasks, as long as the context remains low-dimensional. We also think that for higher-dimensional inputs, feedforward mechanisms could play a role in preprocessing them, either directly towards invariance or into a form where the modulation can achieve it most effectively.

I think it would help the readability of the paper if the authors included a few more brief descriptions of the methods in the Results. For example, a better description of how the signals are generated, the fact that the networks are trained with a single set of signals only, etc. Also, there were points where it wasn't clear if a network was tested under different conditions or actually retrained for them (for example, in figure 2d/e). Also, the fact that the modulation went from being on the weights to on the neurons themselves was not made clear in section "Invariance can be established by spatially diffuse feedback modulation". I also found the schematic in Figure 1a a bit confusing. I don't know why x is represented as a question mark when it is a sum of the two signals. I'd prefer a diagram that makes the dimensionality of x clearer (relatedly, why are there only 3 weights from x to y when I believe it is a 2x2 matrix).

We thank the reviewer for providing specific recommendations on how to improve the readability of the manuscript. We have implemented this advice by adding the key equations of the setup to the results. We also added a methods figure that explicitly illustrates the task, the model and the training setup on a more technical level (see Figure 8). Furthermore, we adapted Figure 1a according to the reviewer’s recommendation.

Regarding the lack of clarity about whether the network was retrained or not, we hope that the additional analyses on the generalisation ability of the network help make this point clearer. To make explicit that networks were retrained for Figure 2, we modified the text to say:

“To investigate how the timescale of modulation affects the performance in the dynamic blind source separation task, we trained network models, in which the modulatory feedback had an intrinsic timescale that forced it to be slow.” (ll. 103-105)

Finally, it is correct that in Figure 3 the modulation went from modulating weights to modulating neurons. For the spatially diffuse modulation, we assumed that all weights to a neuron receive the same modulation. Since synaptic inputs are integrated linearly, this is equivalent to a neuronal gain modulation. We have added an explicit explanation to the results:

“We here assume that all synaptic weights to a neuron receive the same modulation, such that the feedback performs a gain modulation of neural activity (Ferguson and Cardin, 2020).” (ll. 137-139)
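The equivalence stated in this quote follows from the linear integration of synaptic inputs (a toy numerical check, not model code): scaling every incoming weight of a neuron by the same factor g is identical to scaling the neuron's summed input by g.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((10, 4))   # weights onto 10 neurons from 4 inputs
x = rng.standard_normal(4)         # presynaptic activity
g = rng.uniform(0.5, 2.0, 10)      # one gain factor per neuron

# Modulating all weights to a neuron by its gain g...
z_weights = (g[:, None] * W) @ x
# ...equals gain-modulating the neuron's summed input.
z_gain = g * (W @ x)
print(np.allclose(z_weights, z_gain))  # True
```

The identity g ⊙ (W x) = (diag(g) W) x holds for any W, x, and g, which is why a per-neuron gain modulation implements the same computation as a uniform modulation of all of a neuron's input weights.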

"While we trained the modulatory system using supervised learning, the contextual inference is performed by its dynamics without access to the target sources and thus unsupervised" I feel this could be read as saying that an actual unsupervised objective was used, when in fact only supervised learning took place, so I would suggest re-wording.

Good point. We have changed this sentence to make clear that the training of the network itself is supervised. It now reads:

“The modulator was trained using supervised learning. Afterwards, its weights were fixed and it no longer had access to the target sources (see Materials and methods, Figure 8). The modulator therefore had to use its recurrent dynamics to determine the appropriate modulatory feedback for the time-varying context, based on the sensory stimuli and the network output. Put differently, the modulator had to learn an internal model of the sensory data and the contexts, and use it to establish the desired context invariance in the output.” (ll. 55-61)

I didn't understand the claim about matched EI inputs and how it depends on using gain modulation. This should probably be expanded and related to the main questions of the paper or possibly removed.

Motivated by this recommendation and other comments we have removed these results from the paper in order to make more room for the central claims of the manuscript (see Change 3 above).

Figure 4i seems to be the main demonstration that individual neural activity itself is not invariant to context. I'd like to see a more in-depth exploration of this. Particularly, if the readout only relied on a small handful of neurons then finding that the rest of the neurons are not context-invariant wouldn't prove that individual neural invariance is not a relevant mechanism. Given that the readout from this network is known, it would be particularly easy to determine if the heavily weighted neurons in particular are or are not context-invariant.

We thank the reviewer for this suggestion on how to further explore the lack of invariance of single neurons. These analyses are now performed on the simpler network in a new Figure 4 (see Change 1 above). We have extended our analyses and the text according to the reviewer’s suggestions:

“However, a first inspection of neural activity indicated that single neurons are strongly modulated by context (Figure 4a). To quantify this, we determined the signal clarity for each neuron at each stage of the feedforward network, averaged across contexts (Figure 4b). As expected, the signal clarity was low for the sensory stimuli. Intriguingly, the same was true for all neurons of the modulated neural population, indicating no clean separation of the sources at the level of single neurons. Although most neurons had a high signal clarity in some of the contexts, there was no group of neurons that consistently represented one or the other source (Figure 4c). Furthermore, the average signal clarity of the neurons did not correlate with their contribution to the readout (Figure 4d).” (ll. 161-170)

In general, I don't understand why the authors use a separately trained linear readout when trying to show that the population activity at the final layer is invariant. They eventually acknowledge that "Since this readout is obtained from the data, this procedure does not require knowledge of the readout in the network model. Note that the trained decoder and the network readout are not necessarily identical" but they don't explain why they are using this alternative readout or what new insights its use adds. Particularly, the performance of the network indicates that there is some sort of context-invariant readout possible from this population, yet the authors use this other readout in a way that is seemingly supposed to add something to the explanation.

We agree with the reviewer that it was not sufficiently clear why we used the linear readout obtained from the data. The original idea was to highlight that these analyses can be done on neural data, because they do not require knowing the readout that is performed by downstream areas. We acknowledge that raising this point while explaining the feedback-freezing experiment is confusing. To address this, we now use the model’s readout (former Figure 6, now Figure 5). In addition, we explicitly highlight that an invariant readout could be obtained from neural data (see response to the public review above).

Be sure to say what error bars are based on in all figures.

Thank you for pointing this out. We have added the respective information to the captions.

"In our model, the mechanism needs to satisfy a few key requirements: i) the modulation is not uniform across the population, ii) it operates on a timescale similar to that of changes in context, and iii) it is driven by feedback projections." I don't understand claim (iii). If anything, the results show the importance of the modulation being driven by feedforward sensory signals (figure 2d/e).

Yes, that's a fair point. We have rephrased the respective sentence as follows:

“[…] iii) it is driven by a brain region that has access to the information needed to infer the context” (ll. 349-350)

"In addition, feedback inputs from the sensory to the modulatory system allow a better control of the modulated network state." I don't see how the connections from a sensory system to a modulatory system are "feedback".

We believe that our phrasing was unfortunate. We meant feedback from the feedforward network’s output. In the brain, this could correspond to higher-level sensory areas:

“In addition, feedback inputs from higher-level sensory areas to the modulatory system allow a better control of the modulated network state.” (ll. 375-377).

Reviewer #3:

I appreciate the didactic way in which the manuscript was written (and beautiful figures!), in particular the progression from a vanilla architecture towards the full-fledged model with EI rectified neurons with spatially specific modulation. My main concerns (detailed below) are two-fold:

1. I felt that some extensions were not explicitly justified (e.g. why 2 layers instead of 1, etc.)

2. I was expecting more 'reverse-engineering' of the mechanism through which the network accomplishes a context invariant projection. This is the main result of the paper, as reflected in the title, so I think it deserves more unpacking. Below I unpack these concerns, sometimes providing some suggestions to improve the motivation and clarity of the paper (without any particular order).

We thank the reviewer for their very constructive feedback and agree with both points made here. We hope to have addressed them with the revision of the manuscript, in particular Changes 1-3 (see above). Below we provide answers to the specific recommendations and questions in some more detail.

1. Overall, the architecture choices are a bit unjustified. In the extreme, wouldn't the LSTM alone solve the task? The addition of each feedforward layer should be better motivated (e.g. more biologically realistic? In what sense?). For example, why add an extra layer from extensions 2 and 3? If those are necessary, this should be explained. If they are not necessary, they should be removed.

We are glad that the reviewers have brought this to our attention. We have substantially reorganised the paper to clarify which architectural choices are relevant for which finding (Changes 1 and 3).

We modified the description of the Dalean network to make our model choices more transparent to the reader:

“We extended the feedforward model as follows (Figure 6a): First, all neurons had positive firing rates. Second, we split the neural population (z in the previous model) into a "lower-level" (zL) and "higher-level" population (zH). […] This allowed for a direct comparison between a modulated and an unmodulated neural population. It also allowed us to include Dalean weights between the two populations.” (ll. 236-242)

“[…] the higher-level population contained a context-invariant subspace (Figure 6f). This was not the case for the lower-level population, underscoring that invariant representations do not just arise from projecting the sensory stimuli into a higher dimensional space.” (ll. 262-266)

Furthermore, we decided to perform our single neuron and population analyses on the simpler network with spatially diffuse modulation and verify them on the Dalean network (see Change 1). We hope that the new manuscript structure and the additional analyses address the concerns raised above.

Regarding the question whether an LSTM alone can solve the task: yes, it can. We tested this during the course of the project. This is not surprising, because an LSTM could in principle implement the same architecture as we use here. Our architecture may even make the task more difficult to solve, since it contains more constraints. However, the focus of this project was not on whether a recurrent network can solve this task, but rather on what role feedback modulation could play in (invariant) sensory processing. We have therefore decided not to include any results from a purely recurrent network.

2. 'Because the task requires a dynamic inference of the context, it cannot be solved by feedforward networks or standard blind source separation algorithms' I think the paper could be better motivated if this was shown explicitly with some examples.

We agree that this would help to motivate the architecture of our model and have included a new supplementary figure, in which we explicitly demonstrate that a feedforward network cannot solve the task (see Supp. Figure S1).

3. A figure explicitly illustrating the training setup would help motivate what is trivially solved and what is actually challenging. For instance, in the main manuscript, it is not clear in which cases the network is trained and tested on the same contexts (i.e. A(t)) and which cases it is not. In the first case, the context can be easily inferred from x(t), but the latter is more challenging?

We thought this was an excellent idea and have created a methods figure that illustrates the task, the model and the training setup explicitly (see Figure 8). Generally, we always sample new contexts when testing the model, from the same distribution as during training (unless stated otherwise). We now also show that the network does not generalise to out-of-distribution contexts or sensory stimuli (Supp. Figure S6, see also Change 4 above).

4. Although I understand that the paper is already too long, the intra-/extrapolation results deserve more spotlight and unpacking in my opinion. In general, if there is a lack of space, I would merge Figure 1 and Figure 2 – and jump directly to extension 1 – and move most of Figure 2 to the supplement.

We thank the reviewer for this suggestion. We removed some results and reorganised the remaining results such that there is more focus on the invariant subspaces. Hence, we do not feel that the manuscript has become substantially longer. Furthermore, we would prefer to only show the proof of concept in the first figure, in order to not overload the reader. If the reviewers find that the paper is much too long, we could move figure 2 to the supplementary material.

5. Most important concern to me: Figure 6, in which the mechanism is revealed, deserves more quantifications to explicitly pinpoint the mechanism. Three suggestions come to mind:

Thanks, these are great suggestions.

a. Plot the 3 PCs components (instead of just 1) and show the readout in this space. The key result is that the readout is invariant to context and this is not clearly illustrated at the moment. Instead, what is shown is that the representation changes, but that it changes in a way that preserves invariance on the readout is not clearly highlighted.

Unfortunately, using the space of the first 3 PCs instead of the first PC and the readout axes does not illustrate the invariant subspace in an intuitive way (see Change 2). Therefore, we decided to stick with the more unconventional space but verify our findings for PC space in a new supplementary figure (Supp. Figure S9). To better illustrate that the readout is context-invariant we plotted the projection of the subspace onto the readout into the 3D figure (Figure 5b) and make explicit that this is the same projection as shown in Figure 5c.

b. The authors highlight that the network is not just reversing the new mixing coefficients and projecting the activity back into the 2d low manifold. Instead, it is rotating everything out of this manifold. My suggestion would be to show this alternatively explicitly. Is it actually possible? Relatedly, what happens if the context is changed back to context 1?

c. Finally, all the statements made about this figure should be quantified and not just illustrated for 1 trial.

Regarding b and c: As described in Change 2, we have implemented the reviewer’s advice by adding a quantification to the former Figure 6 (now Figure 5) in terms of the angles between the subspaces (Figure 5d). We also show that the magnitude of this change depends on the similarity of the old and the new context (Figure 5d and e), indicating that there is a consistent mapping between context and the low-dimensional population activity. Switching back to context 1 would therefore reinstate the original subspace, without hysteretic effects.
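The quantification in terms of angles between subspaces is standard linear algebra (principal angles). A self-contained sketch (the 10-dimensional space and 30° rotation below are hypothetical, chosen only to make the result easy to check): the principal angles between two subspaces are the arccosines of the singular values of Q₁ᵀQ₂, where Q₁ and Q₂ are orthonormal bases.

```python
import numpy as np

def principal_angles(U, V):
    """Principal angles (radians) between the column spans of U and V."""
    Qu, _ = np.linalg.qr(U)
    Qv, _ = np.linalg.qr(V)
    sv = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    return np.arccos(np.clip(sv, -1.0, 1.0))

# Hypothetical example: a 2D subspace of R^10 (context 1), and the same
# subspace with one axis rotated by 30 degrees towards an orthogonal
# direction (context 2).
U = np.eye(10)[:, :2]
theta = np.deg2rad(30)
V = U.copy()
V[:, 1] = np.cos(theta) * U[:, 1] + np.sin(theta) * np.eye(10)[:, 5]

print(np.rad2deg(principal_angles(U, V)))  # ≈ [0, 30] degrees
```

Applied to the population activity, a small angle between old and new subspaces for similar contexts and a larger angle for dissimilar ones would yield exactly the consistency described above.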

https://doi.org/10.7554/eLife.76096.sa2

Article and author information

Author details

  1. Laura B Naumann

    1. Modelling of Cognitive Processes, Technical University of Berlin, Berlin, Germany
    2. Bernstein Center for Computational Neuroscience, Berlin, Germany
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review and editing
    For correspondence
    laura-bella.naumann@bccn-berlin.de
    Competing interests
    No competing interests declared
ORCID: 0000-0002-7919-7349
  2. Joram Keijser

    Modelling of Cognitive Processes, Technical University of Berlin, Berlin, Germany
    Contribution
    Conceptualization, Methodology, Project administration, Supervision, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Henning Sprekeler

    1. Modelling of Cognitive Processes, Technical University of Berlin, Berlin, Germany
    2. Bernstein Center for Computational Neuroscience, Berlin, Germany
    Contribution
    Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0003-0690-3553

Funding

No external funding was received for this work.

Acknowledgements

We thank Owen Mackwood for providing a code framework that manages simulations on a compute cluster, Loreen Hertäg and Johannes Letzkus for feedback on the manuscript, and the members of the Sprekeler lab for valuable discussions.

Senior Editor

  1. Andrew J King, University of Oxford, United Kingdom

Reviewing Editor

  1. Srdjan Ostojic, Ecole Normale Superieure Paris, France

Publication history

  1. Preprint posted: November 1, 2021 (view preprint)
  2. Received: December 3, 2021
  3. Accepted: April 6, 2022
  4. Accepted Manuscript published: April 20, 2022 (version 1)
  5. Version of Record published: May 13, 2022 (version 2)

Copyright

© 2022, Naumann et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

  1. Laura B Naumann
  2. Joram Keijser
  3. Henning Sprekeler
(2022)
Invariant neural subspaces maintained by feedback modulation
eLife 11:e76096.
https://doi.org/10.7554/eLife.76096
