
Sensory cortex is optimized for prediction of future input

  1. Yosef Singer
  2. Yayoi Teramoto
  3. Ben DB Willmore
  4. Jan WH Schnupp
  5. Andrew J King
  6. Nicol S Harper (corresponding author)
  1. University of Oxford, United Kingdom
  2. City University of Hong Kong, Hong Kong
Research Article
Cite this article as: eLife 2018;7:e31557 doi: 10.7554/eLife.31557

Abstract

Neurons in sensory cortex are tuned to diverse features in natural scenes. But what determines which features neurons become selective to? Here we explore the idea that neuronal selectivity is optimized to represent features in the recent sensory past that best predict immediate future inputs. We tested this hypothesis using simple feedforward neural networks, which were trained to predict the next few moments of video or audio in clips of natural scenes. The networks developed receptive fields that closely matched those of real cortical neurons in different mammalian species, including the oriented spatial tuning of primary visual cortex, the frequency selectivity of primary auditory cortex and, most notably, their temporal tuning properties. Furthermore, the better a network predicted future inputs the more closely its receptive fields resembled those in the brain. This suggests that sensory processing is optimized to extract those features with the most capacity to predict future input.

https://doi.org/10.7554/eLife.31557.001

eLife digest

A large part of our brain is devoted to processing the sensory inputs that we receive from the world. This allows us to tell, for example, whether we are looking at a cat or a dog, and if we are hearing a bark or a meow. Neurons in the sensory cortex respond to these stimuli by generating spikes of activity. Within each sensory area, neurons respond best to stimuli with precise properties: those in the primary visual cortex prefer edge-like structures that move in a certain direction at a given speed, while neurons in the primary auditory cortex favour sounds that change in loudness over a particular range of frequencies.

Singer et al. sought to understand why neurons respond to the particular features of stimuli that they do. Why do visual neurons react more to moving edges than to, say, rotating hexagons? And why do auditory neurons respond more to certain changing sounds than to, say, constant tones? One leading idea is that the brain tries to use as few spikes as possible to represent real-world stimuli. Known as sparse coding, this principle can account for much of the behaviour of sensory neurons.

Another possibility is that sensory areas respond the way they do because it enables them to best predict future sensory input. To test this idea, Singer et al. used a computer to simulate a network of neurons and trained this network to predict the next few frames of video clips using the previous few frames. When the network had learned this task, Singer et al. examined the neurons’ preferred stimuli. Like neurons in primary visual cortex, the simulated neurons typically responded most to edges that moved over time.

The same network was also trained in a similar way, but this time using sound. As for neurons in primary auditory cortex, the simulated neurons preferred sounds that changed in loudness at particular frequencies. Notably, for both vision and audition, the simulated neurons favoured recent inputs over those further into the past. In this way and others, they were more similar to real neurons than simulated neurons that used sparse coding.

Both artificial networks trained to foretell sensory input and the brain therefore favour the same types of stimuli: the ones that are good at helping to grasp future information. This suggests that the brain represents the sensory world so as to be able to best predict the future.

Knowing how the brain handles information from our senses may help to understand disorders associated with sensory processing, such as dyslexia and tinnitus. It may also inspire approaches for training machines to process sensory inputs, improving artificial intelligence.

https://doi.org/10.7554/eLife.31557.002

Introduction

Sensory inputs guide actions, but such actions necessarily lag behind these inputs due to delays caused by sensory transduction, axonal conduction, synaptic transmission, and muscle activation. To strike a cricket ball, for example, one must estimate its future location, not where it is now (Nijhawan, 1994). Prediction has other fundamental theoretical advantages: a system that parsimoniously predicts future inputs from their past, and that generalizes well to new inputs, is likely to contain representations that reflect their underlying causes (Bialek et al., 2001). This is important because ultimately, we are interested in these causes (e.g. flying cricket balls), not the raw images or sound waves incident on the sensory receptors. Furthermore, much of sensory processing involves discarding irrelevant information, such as that which is not predictive of the future, to arrive at a representation of what is important in the environment for guiding action (Bialek et al., 2001).

Previous theoretical studies have suggested that many neural representations can be understood in terms of efficient coding of natural stimuli in a short time window at or just before the present (Attneave, 1954; Barlow, 1959; Olshausen and Field, 1996; Olshausen and Field, 1997). Such studies generally built a network model of the brain, which was trained to represent stimuli subject to some set of constraints. One such pioneering study trained a network to efficiently represent static natural images using a sparse, generative model (Olshausen and Field, 1996; Olshausen and Field, 1997). More recent studies have used related ideas to model the representation of moving (rather than static) images (van Hateren and Ruderman, 1998a; Berkes and Wiskott, 2005; Berkes et al., 2009) and other sensory stimuli (Klein et al., 2003; Carlson et al., 2012; Zhao and Zhaoping, 2011; Kozlov and Gentner, 2016; Cusack and Carlyon, 2004). In contrast, we built a network model that was optimized not for efficient representation of the recent past, but for efficient prediction of the immediate future of the stimulus, which we will refer to as the temporal prediction model. The timescale of prediction considered for our model is in the range of tens to hundreds of milliseconds. Conduction delays to cortex and very fast motor responses are on this timescale (Bixler et al., 1967; Yeomans and Frankland, 1995; Bizley et al., 2005).

The idea that prediction is an important component of perception dates at least as far back as Helmholtz (Helmholtz, 1962; Sutton and Barton, 1981), although what is meant by prediction and the purpose it serves is quite varied between models incorporating it (Chalk et al., 2018; Salisbury and Palmer, 2016). With regard to perception and prediction, two contrasting but interrelated frameworks have been distinguished (Chalk et al., 2018; Salisbury and Palmer, 2016). In the ‘predictive coding’ framework (Huang and Rao, 2011; Rao and Ballard, 1999; Friston, 2003), prediction is used to remove statistical redundancy in order to provide an efficient representation of the entire stimulus. Some models of this type use prediction as a term for estimation of the current or a static input (such as images) from latent variables (Rao and Ballard, 1999), whereas others have also considered the temporal dimension of the input (Rao and Ballard, 1997; Rao, 1999; Srinivasan et al., 1982). Sparse coding models (Olshausen and Field, 1996; Olshausen and Field, 1997) can be related to this framework (Huang and Rao, 2011). In contrast, the ‘predictive information’ framework (Bialek et al., 2001; Salisbury and Palmer, 2016; Palmer et al., 2015; Heeger, 2017), which our approach relates to more closely, involves selective encoding of those features of the stimulus that predict future input. A related idea to predictive information is the encoding of slowly varying features (Berkes and Wiskott, 2005; Creutzig and Sprekeler, 2008; Kayser et al., 2001; Hyvärinen et al., 2003), which are one kind of predictive feature. Hence, the predictive coding approach seeks to find a compressed representation of the entire input, whereas the predictive information approach selectively encodes only predictive features (Chalk et al., 2018; Salisbury and Palmer, 2016).
Our model relates to the predictive information approach in that it is optimized to predict the future from the past, but it has a combination of characteristics, such as a non-linear encoder and sparse weight regularization, which have not previously been explored for such an approach.

To evaluate the representations produced by these normative theoretical models, they can be optimized for natural stimuli, and the tuning properties of their units compared to the receptive fields of real neurons. A useful and commonly used definition of a neuron’s receptive field (RF) is the stimulus that maximally linearly drives the neuron (Adelson and Bergen, 1985; Aertsen et al., 1981; Aertsen and Johannesma, 1981; Reid et al., 1987; deCharms et al., 1998; Harper et al., 2016). In mammalian primary visual cortex (V1), neurons typically respond strongly to oriented edge-like structures moving over a particular retinal location (Hubel and Wiesel, 1959; Jones and Palmer, 1987; DeAngelis et al., 1993; Ringach, 2002). In mammalian primary auditory cortex (A1), most neurons respond strongly to changes in the amplitude of sounds within a certain frequency range (deCharms et al., 1998).

The temporal prediction model provides a principled approach to understanding the temporal aspects of RFs. Previous models, based on sparsity or slowness related principles, were successful in accounting for many spatial aspects of V1 RF structure (Olshausen and Field, 1996, Olshausen and Field, 1997; van Hateren and Ruderman, 1998a; Berkes and Wiskott, 2005; Berkes et al., 2009; van Hateren and van der Schaaf, 1998b), and had some success in accounting for spectral aspects of A1 RF structure (Klein et al., 2003; Carlson et al., 2012; Zhao and Zhaoping, 2011; Cusack and Carlyon, 2004). However, these models do not account well for the temporal structure of V1 or A1 RFs. Notably, for both vision (Ringach, 2002) and audition (deCharms et al., 1998), the envelopes of real neuronal RFs tend to be asymmetric in time, with greater sensitivity to very recent inputs compared to inputs further in the past. In contrast, the RFs predicted by previous models (van Hateren and Ruderman, 1998a; Klein et al., 2003; Carlson et al., 2012; Kozlov and Gentner, 2016; Cusack and Carlyon, 2004) typically show symmetrical temporal envelopes, with either approximately flat envelopes over time or a balanced falloff of the envelope over time either side of a peak. They also lack the greater sensitivity to very recent inputs.

Here we show using qualitative and quantitative comparisons that, for both V1 and A1 RFs, these shortcomings are largely overcome by the temporal prediction approach. This suggests that neural sensitivity at early levels of the cortical hierarchy may be organized to facilitate a rapid and efficient prediction of what the environment will look like in the next fraction of a second.

Results

The temporal prediction model

To determine what type of sensory RF structures would facilitate predictions of the imminent future, we built a feedforward network model with a single layer of nonlinear hidden units, mapping the inputs to the outputs through weighted connections (Figure 1). Each hidden unit’s output results from a linear mapping (by input weights) from the past input, followed by a monotonic nonlinearity, much like the classic linear-nonlinear model of sensory neurons (Klein et al., 2003; Carlson et al., 2012; Zhao and Zhaoping, 2011). The model then generates a prediction of the future from a linear mapping (by output weights) from the hidden units’ outputs. This is consistent with the observation that decoding from the neural response is often well approximated by a linear transformation (Eliasmith and Anderson, 2003).
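The mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `predict_future`, the bias terms `b_in` and `b_out`, and the toy layer sizes are all hypothetical, and the sigmoid is used as one example of a monotonic nonlinearity.

```python
import numpy as np

def sigmoid(x):
    """Monotonic nonlinearity applied to each hidden unit."""
    return 1.0 / (1.0 + np.exp(-x))

def predict_future(u, w_in, b_in, w_out, b_out):
    """Forward pass: past input u -> hidden outputs s -> predicted future v_hat.

    u     : (n_in,)             flattened recent past
    w_in  : (n_hidden, n_in)    input weights w_ji (row j is the RF of unit j)
    w_out : (n_out, n_hidden)   output weights w_kj
    b_in, b_out : bias vectors (an assumption of this sketch)
    """
    s = sigmoid(w_in @ u + b_in)   # linear map of the past, then nonlinearity
    v_hat = w_out @ s + b_out      # linear readout of the predicted future
    return v_hat, s

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 12, 5, 4
w_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
b_in = np.zeros(n_hidden)
w_out = rng.normal(scale=0.1, size=(n_out, n_hidden))
b_out = np.zeros(n_out)
u = rng.normal(size=n_in)

v_hat, s = predict_future(u, w_in, b_in, w_out, b_out)
print(v_hat.shape, s.shape)
```

In this sketch each row of `w_in` plays the role of one hidden unit's receptive field, which is why the RF analyses in the text operate on the input weights.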

Temporal prediction model implemented using a feedforward artificial neural network, with the same architecture in both visual and auditory domains.

(a), Network trained on cochleagram clips (spectral content over time) of natural sounds, aims to predict immediate future time steps of each clip from recent past time steps. (b), Network trained on movie clips of natural scenes, aims to predict the immediate future frame of each clip from recent past frames. ui, input – the past; wji, input weights; sj, hidden unit output; wkj, output weights; v̂k, output – the predicted future; vk, target output – the true future. A hidden unit’s RF is the set of input weights wji between the input and that unit j.

https://doi.org/10.7554/eLife.31557.003

We trained the temporal prediction model on extensive corpora, either of soundscapes or silent movies, modelling A1 (Figure 1a) or V1 (Figure 1b) neurons, respectively. In each case, the networks were trained by optimizing their synaptic weights to most accurately predict the immediate future of the stimulus from its very recent past. For vision, the inputs were patches of videos of animals moving in natural settings, and we trained the network to predict the pixel values for one movie frame (40 ms) into the future, based on the seven most recent frames (280 ms). For audition, we trained the network to predict the next three time steps (15 ms) of cochleagrams of natural sounds based on the 40 most recent time steps (200 ms). Cochleagrams resemble spectrograms but are adjusted to approximate the auditory nerve representation of sounds (see Materials and methods).
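To make the past/future windowing concrete, the following sketch cuts a clip into training pairs. It is illustrative only: `make_examples` is a hypothetical helper, the random array stands in for a real cochleagram, and the 32 frequency bands are taken from the RF dimensions reported in the Figure 6 legend. The 40-step past and 3-step future match the auditory setting described above; `n_past=7, n_future=1` would correspond to the visual setting.

```python
import numpy as np

def make_examples(clip, n_past, n_future):
    """Slice a clip of shape (T, D) into (past, future) training pairs.

    Returns U with shape (N, n_past * D) and V with shape (N, n_future * D),
    where N = T - n_past - n_future + 1.
    """
    n_steps = clip.shape[0]
    n_examples = n_steps - n_past - n_future + 1
    if n_examples <= 0:
        raise ValueError("clip is too short for the requested windows")
    U = np.stack([clip[t:t + n_past].ravel() for t in range(n_examples)])
    V = np.stack([clip[t + n_past:t + n_past + n_future].ravel()
                  for t in range(n_examples)])
    return U, V

# Stand-in for a cochleagram: 100 time steps by 32 frequency bands.
rng = np.random.default_rng(0)
clip = rng.normal(size=(100, 32))

U, V = make_examples(clip, n_past=40, n_future=3)
print(U.shape, V.shape)  # (58, 1280) (58, 96)
```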

During training we used sparse, L1 weight regularization (see Equation 3 in Materials and methods) to constrain the network to predict future stimuli in a parsimonious fashion, forcing the network to use as few weights as possible while maintaining an accurate prediction. This constraint can be viewed as an assumption about the sparse nature of causal dependencies underlying the sensory input, or alternatively as analogous to the energy and space restrictions of neural connectivity. It also prevents our network model from overfitting to its inputs. Note that this sparsity constraint differs from that used in sparse coding models, in that it is applied to the weights rather than the activity of the units, being more like a constraint on the wiring between neurons than a constraint on their firing rates.
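A rough sketch of such an objective is given below: mean squared prediction error plus an L1 penalty on the weights. The exact form of Equation 3 is given in Materials and methods and is not reproduced here, so the normalization, the choice to penalize both weight matrices, and the name `l1_regularized_loss` are assumptions of this example.

```python
import numpy as np

def l1_regularized_loss(v_hat, v, weights, lam):
    """Prediction error plus a sparsity penalty on the weights.

    v_hat, v : predicted and true future, same shape
    weights  : iterable of weight arrays to penalize (assumed: input and output weights)
    lam      : regularization strength (hypothetical parameter name)
    """
    mse = np.mean((v_hat - v) ** 2)
    l1 = sum(np.abs(w).sum() for w in weights)
    return mse + lam * l1

# Tiny worked example: MSE = (0 + 4) / 2 = 2, L1 = 3 + 0.5 = 3.5.
v_hat = np.array([1.0, 2.0])
v = np.array([1.0, 4.0])
w_in = np.array([[1.0, -2.0]])
w_out = np.array([[0.5]])
loss = l1_regularized_loss(v_hat, v, [w_in, w_out], lam=0.1)
print(loss)  # 2 + 0.1 * 3.5 = 2.35 (up to floating-point rounding)
```

Because the penalty acts on the weights rather than on the hidden units' activities, driving a unit's weights to zero removes it from the prediction entirely, which is consistent with the inactive units reported in the Figure 4 legend.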

Qualitative assessment of auditory receptive fields

To compare with the model, we recorded responses of 114 auditory neurons (including 76 single units) in A1 and the anterior auditory field (AAF) of 5 anesthetized ferrets (Willmore et al., 2016) and measured their spectrotemporal RFs (see Materials and methods). Ferrets are commonly used for auditory research, because they are readily trained in a range of sound detection, discrimination or localization tasks (Nodal and King, 2014), the frequency range of their hearing (approximately 40 Hz–40 kHz [Kavanagh and Kelly, 1988]) overlaps well with (and extends beyond) the human range, and most of their auditory cortex is not buried in a sulcus and hence easily accessible for electrophysiological or optical measurements.

The A1 RFs we recorded are diverse (Figure 2a); their frequency tuning can be narrowband or broadband, sometimes with flanking inhibition. Some may also be more complex in frequency tuning, lack clear order, or be selective for the direction of frequency modulation (Carlin and Elhilali, 2013).

Auditory spectrotemporal and visual spatiotemporal RFs of real neurons and temporal prediction model units.

(a), Example spectrotemporal RFs of real A1 neurons (Willmore et al., 2016). Red – excitation, blue – inhibition. Most recent two time steps (10 ms) were removed to account for conduction delay. (b), Example spectrotemporal RFs of model units when model is trained to predict the future of natural sound inputs. Note that the overall sign of a receptive field learned by the model is arbitrary. Hence, in all figures and analyses we multiplied each model receptive field by −1 where appropriate to obtain receptive fields which all have positive leading excitation (see Materials and methods). (c), Example spatiotemporal (I, space-time separable, and II, space-time inseparable) RFs of real V1 neurons (Ohzawa et al., 1996). Left, grayscale: 3D (space-space-time) spatiotemporal RFs showing the spatial RF at each of the most recent six time steps. Most recent time step (40 ms) was removed to account for conduction delay. White – excitation, black – inhibition. Right: corresponding 2D (space-time) spatiotemporal RFs obtained by summing along the unit’s axis of orientation for each time step. Red – excitation, blue – inhibition. (d), Example 3D and corresponding 2D spatiotemporal (I-III, space-time separable, and IV-VI, space-time inseparable) RFs of model units when model is trained to predict the future of natural visual inputs.

https://doi.org/10.7554/eLife.31557.004

In their temporal tuning, A1 RFs tend to weight recent inputs more heavily, with a temporally asymmetric power profile, involving excitation near the present followed by lagging inhibition of a longer duration (deCharms et al., 1998). The temporal prediction model RFs (Figure 2b) are similarly diverse, showing all of the RF types seen in vivo (including examples of localized, narrowband, broadband, complex, disordered and directional RFs) and are well matched in scale and form to those measured in A1. This includes having greater power (mean square) near the present, with brief excitation followed by longer lagging inhibition, producing an asymmetric power profile. This stands in contrast to previous attempts to model RFs based on efficient coding, sparse coding and slow feature hypotheses, which either did not capture the diversity of RFs (Zhao and Zhaoping, 2011), or lacked temporal asymmetry, punctate structure, or appropriate time scale (Klein et al., 2003; Carlson et al., 2012; Kozlov and Gentner, 2016; Cusack and Carlyon, 2004; Carlin and Elhilali, 2013; Brito and Gerstner, 2016).

Qualitative assessment of visual receptive fields

By eye, substantial similarities were also apparent when we compared the temporal prediction model’s RFs trained using visual inputs (Figure 1b) with the 3D (space-space-time) and 2D (space-time) spatiotemporal RFs of real V1 simple cells, which were obtained from Ohzawa et al (Ohzawa et al., 1996). Simple cells (Hubel and Wiesel, 1959) have stereotyped RFs containing parallel, spatially localized excitatory and inhibitory regions, with each cell having a particular preferred orientation and spatial frequency (Jones and Palmer, 1987; DeAngelis et al., 1993; Ringach, 2002) (Figure 2c). These features are also clearly apparent in the model RFs (Figure 2d).

Unlike previous models (van Hateren and Ruderman, 1998a; Hyvärinen et al., 2003; Olshausen, 2003), the temporal prediction model captures the temporal asymmetry of real RFs. The RF power is highest near the present and decays into the past (Figure 2d), as observed in real neurons (Ohzawa et al., 1996) (Figure 2c). Furthermore, simple cell RFs have two types of spatiotemporal structure: space-time separable RFs (Figure 2cI), whose optimal stimulus resembles a flashing or slowly ramping grating, and space-time inseparable RFs, whose optimal stimulus is a drifting grating (DeAngelis et al., 1993) (Figure 2cII). Our model captures this diversity (Figure 2dI–III separable, Figure 2dIV–VI inseparable).

We also examined linear aspects of the tuning of the output units for the visual temporal prediction model using a response-weighted average to white noise input, and found punctate non-oriented RFs that decay into the past.
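A response-weighted average can be sketched as follows. This is a generic illustration, not the authors' analysis code: the function name is hypothetical, and the simulated unit is a purely linear filter chosen so that the estimate can be checked against a known answer.

```python
import numpy as np

def response_weighted_average(stimuli, responses):
    """Average the stimuli, each weighted by the response it evoked.

    stimuli   : (N, D) white-noise inputs, one row per presentation
    responses : (N,)   response of the unit to each input
    """
    return responses @ stimuli / stimuli.shape[0]

# Hypothetical linear unit whose filter decays into the past.
rng = np.random.default_rng(0)
true_filter = np.linspace(1.0, 0.0, 10)
stimuli = rng.normal(size=(20000, 10))   # unit-variance white noise
responses = stimuli @ true_filter

estimate = response_weighted_average(stimuli, responses)
similarity = np.corrcoef(estimate, true_filter)[0, 1]
print(similarity)  # close to 1 for a linear unit and enough samples
```

For unit-variance white noise the estimate converges on the linear part of the unit's tuning, which is why the method recovers only "linear aspects" of a nonlinear unit.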

Qualitative comparison to other models

For comparison, we trained a sparse coding model (Olshausen and Field, 1996, Olshausen and Field, 1997; Carlson et al., 2012) (https://github.com/zayd/sparsenet) using our dataset. We would expect such a model to perform less well in the temporal domain, because unlike the temporal prediction model, the direction of time is not explicitly accounted for. The sparse coding model was chosen because it has set the standard for normative models of visual RFs (Olshausen and Field, 1996, Olshausen, 2003; Olshausen and Field, 1997), and the same model has also been applied for auditory RFs (Carlson et al., 2012; Brito and Gerstner, 2016; Młynarski and McDermott, 2017; Blättler et al., 2011). Past studies (Olshausen and Field, 1996, Olshausen and Field, 1997; Carlson et al., 2012) have largely analysed the basis functions produced by the sparse coding model and compared their properties to neuronal RFs. To be consistent with these studies we have done the same, and to have a common term, refer to the basis functions as RFs (although strictly, they are projective fields). We can visually compare the large set of RFs recorded from A1 neurons (Figure 3) to the full set of RFs obtained from the temporal prediction model when trained on auditory inputs (Figure 4) and those of the sparse coding model (Figure 5) when trained on the same auditory inputs.

Full dataset of real auditory RFs.

114 neuronal RFs recorded from A1 and AAF of 5 ferrets. Red – excitation, blue - inhibition. Inset shows axes.

https://doi.org/10.7554/eLife.31557.005
Figure 4 (with 7 supplements)
Full set of auditory RFs of the temporal prediction model units.

Units were obtained by training the model with 1600 hidden units on auditory inputs. The hidden unit number and L1 weight regularization strength (10^−3.5) were chosen because they result in the lowest MSE on the prediction task, as measured using a cross validation set. Many hidden units’ weight matrices decayed to near zero during training (due to the L1 regularization), leaving 167 active units. Inactive units were excluded from analysis and are not shown. Example units in Figure 2 come from this set. Red – excitation, blue – inhibition. Inset shows axes. Figure 4—figure supplement 1 shows the same RFs on a finer timescale. The full sets of visual spatial and corresponding spatiotemporal RFs for the temporal prediction model when it is trained on visual inputs are shown in Figure 4—figure supplements 2–3. Figure 4—figure supplement 4 shows the auditory RFs of the temporal prediction model when a linear activation function instead of a sigmoid nonlinearity was used. Figure 4—figure supplements 5–7 show the auditory spectrotemporal and visual spatial and 2D spatiotemporal RFs of the temporal prediction model when it was trained on inputs without added noise.

https://doi.org/10.7554/eLife.31557.006
Figure 5 (with 5 supplements)
Full set of auditory ‘RFs’ (basis functions) of sparse coding model used as a control.

Units were obtained by training the sparse coding model with 1600 units on the identical auditory inputs used to train the network shown in Figure 4. L1 regularization of strength 10^0.5 was applied to the units’ activities. This network configuration was selected as it produced unit RFs that most closely resembled those recorded in A1, as determined using the KS measure of similarity (Figure 8—figure supplement 1). Although the basis functions of the sparse coding model are not receptive fields, but projective fields, they tend to be similar in structure (Olshausen and Field, 1996; Olshausen and Field, 1997). In this manuscript, to have a common term between models and the data, we refer to sparse coding basis functions as RFs. Red – excitation, blue – inhibition. Inset shows axes. The full sets of visual spatial and corresponding spatiotemporal RFs for the sparse coding model when it is trained on visual inputs are shown in Figure 5—figure supplements 1–2. Figure 5—figure supplements 3–5 show the auditory spectrotemporal and visual spatial and 2D spatiotemporal RFs of the sparse coding model when it was trained on inputs without added noise.

https://doi.org/10.7554/eLife.31557.014

A range of RFs were produced by the sparse coding model, some of which show characteristics reminiscent of A1 RFs, particularly in the frequency domain. However, the temporal properties of A1 neurons are not well captured by these RFs. While some RFs display excitation followed by lagging inhibition, very few, if any, show distinct brief excitation followed by extended inhibition. Instead, RFs that show both excitation and inhibition tend to have a symmetric envelope and these features are randomly localized in time, and many RFs display temporally elongated structures that are not found in A1 neurons.

We also trained the sparse coding model on the dataset of visual inputs to serve as a control for the temporal prediction model trained on these same inputs. We compared the full population of spatial and 2D spatiotemporal visual RFs of the temporal prediction model (Figure 4—figure supplements 2–3) and the sparse coding model (Figure 5—figure supplements 1–2). As shown in previous studies (Olshausen and Field, 1996; Olshausen and Field, 1997; van Hateren and Ruderman, 1998a; van Hateren and van der Schaaf, 1998b), the sparse coding model produces RFs whose spatial structure resembles that of V1 simple cells (Figure 5—figure supplements 1–2), but does not capture the asymmetric nature of the temporal tuning of V1 neurons. Furthermore, while it does produce examples of both separable and inseparable spatiotemporal RFs, those that are separable tend to be completely stationary over time, resembling immobile rather than flashing gratings (Figure 5—figure supplement 2).

Quantitative analysis of auditory results

We compared the RFs generated by both models to the RFs of the population of real A1 neurons we recorded. We first compared the RFs in a non-parametric manner by measuring the Euclidean distances between the coefficient values of the RFs, and then used multi-dimensional scaling to embed these distances in a two-dimensional space (Figure 6a). The RFs of the sparse coding model span a much larger region than the real A1 and temporal prediction model RFs. Furthermore, the A1 and temporal prediction model RFs occupy a similar region of the space, indicating their greater similarity to each other relative to those of the sparse coding model. We then examined specific attributes of the RFs to determine points of similarity and difference between each of the models and the recorded data. We first considered the temporal properties of the RFs and found that for the data and the temporal prediction model, most of the power is contained in the most recent time-steps (Figures 2a–b, 3–4 and 6b, and Figure 4—figure supplement 1). Given that the direction of time is not explicitly accounted for in the sparse coding model, as expected, it does not show this feature (Figures 5 and 6b). Next, we examined the tuning widths of the RFs in each population for both time and frequency, looking at excitation and inhibition separately. In the time domain, the real data tend to show leading excitation followed by lagging inhibition of longer duration (Figures 2a, 3 and 6c–e). The temporal prediction model also shows many RFs with this temporal structure, with lagging inhibition of longer duration than the leading excitation (Figures 2b, 4 and 6c–e, and Figure 4—figure supplement 1). This is not the case with the sparse coding model, where units tend to show either excitation and inhibition having the same duration or an elongated temporal structure that does not show such stereotyped polarity changes (Figures 5 and 6c–e).
It is also the case that the absolute timescales of excitation and inhibition match the data more closely in the case of the temporal prediction model (Figure 6c–e), although a few units display inhibition of a longer duration than is seen in the data (Figure 6c). The sparse coding model shows a wide range of temporal spans of excitation and inhibition, in keeping with previous studies (Carlson et al., 2012; Carlin and Elhilali, 2013).
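The non-parametric comparison described above (Euclidean distances between RF coefficients, embedded in two dimensions) can be illustrated with classical, eigendecomposition-based multi-dimensional scaling. The text does not state which MDS variant was used, so classical MDS is an assumption here, as are the function names and the random arrays standing in for real RFs.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.maximum(d2, 0.0)  # clip tiny negatives from rounding

def classical_mds(X, n_components=2):
    """Embed the rows of X so that Euclidean distances are preserved as well as possible."""
    n = X.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ pairwise_sq_dists(X) @ centering
    evals, evecs = np.linalg.eigh(gram)               # ascending eigenvalues
    top = np.argsort(evals)[::-1][:n_components]      # largest first
    scale = np.sqrt(np.clip(evals[top], 0.0, None))
    return evecs[:, top] * scale

# Stand-in population: 30 flattened RFs of 32 frequencies x 38 time steps.
rng = np.random.default_rng(0)
rfs = rng.normal(size=(30, 32 * 38))
embedding = classical_mds(rfs, n_components=2)
print(embedding.shape)  # (30, 2)
```

Populations that occupy overlapping regions of such an embedding have similar RF coefficients, which is the sense in which the A1 and temporal prediction RFs are said to lie close together in Figure 6a.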

Figure 6 (with 1 supplement)
Population measures for real A1, temporal prediction model and sparse coding model auditory spectrotemporal RFs.

The population measures are taken from the RFs shown in Figures 3–5. (a), Each point represents a single RF (with 32 frequency and 38 time steps) which has been embedded in a 2-dimensional space using Multi-Dimensional Scaling (MDS). Red circles – real A1 neurons, black circles – temporal prediction model units, blue triangles – sparse coding model units. Colour scheme applies to all subsequent panels. (b), Proportion of power contained in each time step of the RF, taken as an average across the population of units. (c), Temporal span of excitatory subfields versus that of inhibitory subfields, for real neurons and temporal prediction and sparse coding model units. The area of each circle is proportional to the number of occurrences at that point. The inset plots, which zoom in on the distribution, use a smaller constant of proportionality for the circles to make the distributions clearer. (d), Distribution of temporal spans of excitatory subfields, taken by summing along the x-axis in (c). (e), Distribution of temporal spans of inhibitory subfields, taken by summing along the y-axis in (c). (f), Frequency span of excitatory subfields versus that of inhibitory subfields, for real neurons and temporal prediction and sparse coding model units. (g), Distribution of frequency spans of excitatory subfields, taken by summing along the x-axis in (f). (h), Distribution of frequency spans of inhibitory subfields, taken by summing along the y-axis in (f). Figure 6—figure supplement 1 shows the same analysis for the temporal prediction model and sparse coding model trained on auditory inputs without added noise.

https://doi.org/10.7554/eLife.31557.020

Regarding the spectral properties of real neuronal RFs, the spans of inhibition and excitation over sound frequency tend to be similar (Figure 6f–h). This is also seen in the temporal prediction model, albeit with slightly more variation (Figure 6f–h). The sparse coding model shows more extensive variation in frequency spans than either the data or our model (Figure 6f–h).

Quantitative analysis of visual results

We also compared the spatiotemporal RFs derived from the temporal prediction and sparse coding models with restricted published datasets summarizing RF characteristics of V1 neurons (Ringach, 2002) and a small number of full spatiotemporal visual RFs acquired from Ohzawa et al (Ohzawa et al., 1996). We assessed the orientation and spatial frequency tuning properties of the models’ RFs by fitting Gabor functions to them (see Materials and methods).

We compared temporal properties of the RFs from the neural data and the temporal prediction model. In both cases, most power (mean over space and neurons of squared values) is in the most recent time steps (Figure 7a). Previous normative models of spatiotemporal RFs (van Hateren and Ruderman, 1998a; Hyvärinen et al., 2003; Olshausen, 2003) (Figure 7—figure supplement 1c–d) do not show this property, being either invariant over time or localized, but with a symmetric profile that is not restricted to the recent past. We also measured the space-time separability of the RFs of the temporal prediction model (see Materials and methods); substantial numbers of both space-time separable and inseparable units were apparent (631 separable, 969 inseparable; Figure 4—figure supplement 3). In addition to this, we measured the tilt direction index (TDI) of the model units from their 2D spatiotemporal RFs. This index indicates spatiotemporal asymmetry in space-time RFs and correlates with direction selectivity (DeAngelis et al., 1993; Pack et al., 2006; Anzai et al., 2001; Baker, 2001; Livingstone and Conway, 2007). The mean TDI for the population was 0.34 (0.29 SD), comparable with the ranges in the neural data (mean 0.16; 0.12 SD in cat area 17/18 [Baker, 2001], mean 0.51; 0.30 SD in macaque V1 [Livingstone and Conway, 2007]). Finally, we observed an inverse correlation (r = −0.33, p < 10^−9, n = 1205) between temporal and spatial frequency tuning (see Materials and methods), which is also a property of real V1 RFs (DeAngelis et al., 1993) and is seen in a sparse-coding-related model (van Hateren and Ruderman, 1998a).
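Two of the measures in this paragraph can be illustrated with short NumPy functions. The power profile follows the definition given in the text (mean of squared values over space and units, per time step). The separability index, by contrast, is only one plausible choice: the criterion actually used is described in Materials and methods and is not reproduced here, so the singular-value-based index and both function names are assumptions of this sketch.

```python
import numpy as np

def power_per_time_step(rfs):
    """Proportion of RF power in each time step.

    rfs : (n_units, n_space, n_time) array of receptive fields
    """
    power = np.mean(rfs ** 2, axis=(0, 1))   # mean over units and space
    return power / power.sum()

def separability_index(rf_2d):
    """Share of squared singular values carried by the first component.

    rf_2d : (n_space, n_time) space-time RF. A value of 1 means the RF is an
    outer product of one spatial and one temporal profile (fully separable).
    """
    sv = np.linalg.svd(rf_2d, compute_uv=False)
    return sv[0] ** 2 / np.sum(sv ** 2)

# Idealized RFs sampled over one full period in space and time.
x = 2 * np.pi * np.arange(16) / 16
t = 2 * np.pi * np.arange(16) / 16
flashing = np.outer(np.cos(x), np.cos(t))    # space-time separable
drifting = np.cos(x[:, None] - t[None, :])   # space-time inseparable

print(separability_index(flashing))  # approximately 1.0
print(separability_index(drifting))  # approximately 0.5
profile = power_per_time_step(np.stack([flashing, drifting]))
```

The drifting grating equals cos(x)cos(t) + sin(x)sin(t), a sum of two equally weighted separable terms, which is why its index is one half in this idealized case.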

Figure 7 (with 3 supplements)
Population measures for real V1 and temporal prediction model visual spatial and spatiotemporal RFs.

Model units were obtained by training the model with 1600 hidden units on visual inputs. The hidden unit number and L1 weight regularization strength (10^−6.25) were chosen because they result in the lowest MSE on the prediction task, as measured using a cross validation set. Example units in Figure 2 come from this set. (a), Proportion of power (sum of squared weights over space and averaged across units) in each time step, for real (Ohzawa et al., 1996) and model populations. (b), Joint distribution of spatial frequency and orientation tuning for population of model unit RFs at their time step with greatest power. (c), Distribution of orientation tuning for population of model unit RFs at their time step with greatest power. (d), Distribution of RF shapes for real neurons (cat, Jones and Palmer, 1987, mouse, Niell and Stryker, 2008 and monkey, Ringach, 2002) and model units. nx and ny measure RF span parallel and orthogonal to orientation tuning, as a proportion of spatial oscillation period (Ringach, 2002). For (b–d), only units that could be well approximated by Gabor functions (n = 1205 units; see Materials and methods) were included in the analysis. Of these, only model units that were space-time separable (n = 473) are shown in (d) to be comparable with the neuronal data (Ringach, 2002). A further 4 units with 1.5 < ny < 3.1 are not shown in (d). Figure 7—figure supplements 1–3 show example visual RFs and the same population measures for the sparse coding model trained on visual inputs with added noise and for the temporal prediction and sparse coding models trained on visual inputs without added noise.

https://doi.org/10.7554/eLife.31557.022

The spatial tuning characteristics of the temporal prediction model’s RFs displayed a wide range of orientation and spatial frequency preferences, consistent with the neural data (DeAngelis et al., 1993; Kreile et al., 2011) (Figure 4—figure supplement 2). Both model and real RFs (Kreile et al., 2011) show a preference for spatial orientations along the horizontal and vertical axes, although this orientation bias is seen to a greater extent in the temporal prediction model than in the data. The orientation and frequency tuning characteristics are also well captured by sparse coding related models of spatiotemporal RFs (van Hateren and Ruderman, 1998a; Olshausen, 2003) (Figure 7—figure supplement 1e-f). Furthermore, the widths and lengths of the RFs of the temporal prediction model, relative to the period of their oscillation, also match the neural data well (Figure 7d). The distribution of units extends along a curve from blob-like RFs, which lie close to the origin in this plot, to stretched RFs with several subfields, which lie further from the origin. Although this property is again fairly well captured by previous models (Olshausen and Field, 1996, Olshausen and Field, 1997; Berkes et al., 2009; Ringach, 2002; van Hateren and van der Schaaf, 1998b) (Figure 7—figure supplement 1g), only the temporal prediction model seems to be able to capture the blob-like RFs that form a sizeable proportion of the neural data (Ringach, 2002) (Figure 7d where nx and ny < ~0.25, Figure 4—figure supplement 2). A small proportion of the population have RFs with several short subfields, forming a wing from the main curve in Figure 7d.

Optimizing predictive capacity

Under our hypothesis of temporal prediction, we would expect that the better the temporal prediction model network is at predicting the future, the more the RFs of the network should resemble those of real neurons. To examine this hypothesis, we plotted the prediction error of the network as a function of two hyperparameters: the regularization strength and the number of hidden units (Figure 8a). Then, we plotted the similarity between the auditory RFs of real A1 neurons and those of the temporal prediction model (Figure 8b), as measured by the mean KS distances of the temporal and frequency span distributions (Figure 6d–e,g–h, Materials and methods). The hyperparameter settings that give good predictions are also those where the temporal prediction model produces RFs that are most similar to those recorded in A1 (r² = 0.8, p < 10^−9, n = 55). This result argues that cortical neurons are indeed optimized for temporal prediction.
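As an illustration of the similarity measure, the following sketch computes a two-sample Kolmogorov-Smirnov distance (the largest gap between two empirical cumulative distributions) and averages it over several tuning measures. The dictionary keys and toy values are hypothetical; only the general idea of averaging KS distances over span distributions is taken from the text. A smaller mean distance indicates greater similarity.

```python
from bisect import bisect_right


def ks_distance(a, b):
    """Two-sample KS distance: max |ECDF_a(x) - ECDF_b(x)| over all samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def mean_ks_distance(real, model):
    """Average the KS distance over measures shared by both populations.

    real, model: dicts mapping a measure name to a list of values.
    """
    distances = [ks_distance(real[name], model[name]) for name in real]
    return sum(distances) / len(distances)


# Hypothetical measure names and made-up values, for illustration only.
real_spans = {"exc_temporal_span": [1, 2, 3, 4], "exc_frequency_span": [1, 2, 3]}
model_spans = {"exc_temporal_span": [3, 4, 5, 6], "exc_frequency_span": [1, 2, 3]}
toy_similarity = mean_ks_distance(real_spans, model_spans)
```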

Figure 8 (with 3 supplements)
Correspondence between the temporal prediction model’s ability to predict future auditory input and the similarity of its units’ responses to those of real A1 neurons.

Performance of model as a function of number of hidden units and L1 regularization strength on the weights as measured by (a), prediction error (mean squared error) on the validation set at the end of training and (b), similarity between model units and real A1 neurons. The similarity between the real and model units is measured by averaging the Kolmogorov-Smirnov distance between each of the real and model distributions for the span of temporal and frequency tuning of the excitatory and inhibitory RF subfields (e.g. the distributions in Figure 6d–e and Figure 6g–h). Figure 8—figure supplement 1 shows the same analysis, performed for the sparse coding model, which does not produce a similar correspondence.

https://doi.org/10.7554/eLife.31557.026

We also examined the similarity measure for the sparse coding model as a function of the same hyperparameters and compared it with that model’s stimulus reconstruction capacity. No monotonic relationship between stimulus reconstruction capacity and similarity to real RFs was found (Figure 8—figure supplement 1; r² = −0.05, p = 0.69, n = 50). In previous studies in which comparisons have been made between normative models and real data, the model hyperparameters have been selected to maximize the similarity between the real and model RFs. In contrast, the temporal prediction model provides an independent criterion, the prediction error, to perform hyperparameter selection. To our knowledge, no such effective, measurable, independent criterion for hyperparameter selection has been proposed for other normative models of RFs.
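Selecting hyperparameters by an independent criterion amounts to choosing the grid point with the lowest validation prediction error, without consulting the neural data. A minimal sketch, assuming the grid is stored as a dictionary keyed by (number of hidden units, log10 of the L1 strength); the function name, the storage format and all numbers are made up for illustration.

```python
def select_hyperparameters(validation_mse):
    """Return the (n_hidden, log10_l1) key with the lowest validation MSE."""
    return min(validation_mse, key=validation_mse.get)


# Made-up validation errors for illustration only; not results from the study.
toy_validation_mse = {
    (100, -7.0): 0.9,
    (200, -6.0): 0.7,
    (200, -5.0): 0.8,
}
best_setting = select_hyperparameters(toy_validation_mse)
```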

Variants of the temporal prediction model

The change in the qualitative structure of the RFs as a function of the number of hidden units and L1 regularization strength, for both the visual and auditory models, can be seen in the interactive supplementary figures (Figure 8—figure supplements 2–3; https://yossing.github.io/temporal_prediction_model/figures/interactive_supplementary_figures.html). The main effect of the regularization is to restrict the RFs in space for the visual case and in frequency and time for the auditory case. When the regularization is non-existent or substantially weaker than the optimum for prediction, the visual RFs become less localized in space with more elongated bars. The auditory RFs become more disordered, losing clear structure in most cases. When the regularization is made stronger than the optimum, the RFs become more punctate, for both the visual and auditory models. When the regularization strength is at the optimum for prediction, the auditory and visual model RFs qualitatively most closely resemble those of A1 neurons and V1 simple cells, respectively. This is consistent with what we found quantitatively in the previous section for the auditory model.

The temporal prediction model and the sparse coding model both produce oriented Gabor-like RFs when trained on visual inputs. This raises the possibility that optimization for prediction implicitly optimizes for a sparse response distribution, and hence leads to oriented RFs. To test for this, we measured the sparsity of the visual temporal prediction model’s hidden unit activities (by the Vinje-Gallant measure [Baker, 2001]) in response to the natural image validation set. Examining the relationship between predictive capacity and sparsity, over the range of L1 weight regularization strength and hidden units explored, we did not find a clear monotonic relationship. Indeed, in both the auditory and visual cases, the hidden unit and L1 regularization combination with the best prediction had intermediate sparsity. For the visual case, the best-predicting model had sparsity 0.25, and other models within the grid search had sparsity ranging from 0.16 to 0.57. For the auditory case, the best-predicting model had sparsity 0.58, and other models had sparsity ranging from 0.42 to 0.69.
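For reference, the Vinje-Gallant sparsity index for a set of n non-negative responses r is S = (1 − (Σr/n)² / (Σr²/n)) / (1 − 1/n), which is 0 for a uniform response and 1 when only a single response is non-zero. The sketch below implements this formula; whether the index is taken over stimuli for each unit or over units for each stimulus is not stated in this section, so the choice of input list is left to the caller.

```python
def vinje_gallant_sparsity(responses):
    """Sparsity index of non-negative responses (0 = dense, 1 = maximally sparse)."""
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses")
    mean = sum(responses) / n
    mean_square = sum(r * r for r in responses) / n
    if mean_square == 0.0:
        raise ValueError("sparsity is undefined for all-zero responses")
    return (1.0 - mean * mean / mean_square) / (1.0 - 1.0 / n)


dense = vinje_gallant_sparsity([1.0, 1.0, 1.0, 1.0])   # uniform activity
sparse = vinje_gallant_sparsity([1.0, 0.0, 0.0, 0.0])  # one active response
```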

We also varied other characteristics of the temporal prediction model to understand their influence. For both the auditory and visual models, when a different hidden unit nonlinearity (tanh or rectified linear) was used, the networks had similar predictive capacity and produced comparable RFs. However, when the temporal prediction model had linear hidden units, it no longer predicted as well and produced RFs that were less like real neurons in their structure. For the auditory model, the linear model RFs generally became more narrowband in frequency with temporally extended excitation, instead of extended lagging inhibition (Figure 4—figure supplement 4). For the visual model, the linear model RFs also showed substantially less similarity to the V1 data. At low regularization (the best predicting case), the RFs formed full-field grid-like structures. At higher regularization, they were more punctate, with some units having oriented RFs with short subfields. The RFs also did not change form or polarity over time, but simply decayed into the past.

The temporal prediction model and sparse coding model results shown in the main figures of this paper were trained on inputs with added Gaussian noise (6 dB SNR), mimicking inherent noise in the nervous system. To determine the effect of adding this noise, all models were also trained without noise, producing similar results (Figure 4—figure supplements 5–7; Figure 5—figure supplements 3–5; Figure 6—figure supplement 1; Figure 7—figure supplements 2–3). The results were also robust to changes in the duration of the temporal window being predicted. We trained the auditory model to predict a span of either 1, 3, 6, or 9 time steps into the future and the visual model to predict 1, 3 or 6 time steps into the future. For the auditory case, we found that increasing the number of time steps being predicted had little effect on the RF structure, both qualitatively and by the KS measure of similarity to the real data. In the visual case, Gabor-like units were present in all cases. Increasing the number of time steps made the RFs more restricted in space and increased the proportion of blob-like RFs.
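Adding Gaussian noise at a fixed signal-to-noise ratio can be sketched as follows. The sketch assumes that SNR is defined as the ratio of mean signal power to noise variance, expressed in decibels, and that the noise is independent across samples; the section does not specify these details, and the function names are illustrative.

```python
import math
import random


def mean_power(values):
    """Mean of squared values."""
    return sum(v * v for v in values) / len(values)


def add_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that signal power / noise variance
    equals 10 ** (snr_db / 10). Assumed definition of SNR."""
    noise_sd = math.sqrt(mean_power(signal) / 10 ** (snr_db / 10))
    return [x + rng.gauss(0.0, noise_sd) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB, recovered from a clean/noisy pair."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(noise))


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # stand-in for normalized input
noisy = add_gaussian_noise(clean, snr_db=6.0, rng=rng)
```

With a large number of samples, `measured_snr_db(clean, noisy)` should lie close to the requested 6 dB.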

Discussion

We hypothesized that finding features that can efficiently predict future input from its past is a principle that influences the structure of sensory RFs. We implemented an artificial neural network model that instantiates a restricted version of this hypothesis. When this model was trained using natural sounds, it produced RFs that are both qualitatively and quantitatively similar to those of A1 neurons. Similarly, when we trained the model using natural movies it produced RFs with many of the properties of V1 simple cells. This similarity is particularly notable in the temporal domain; the model RFs have asymmetric envelopes, with a preference for the very recent past, as is seen in A1 and V1. Finally, the more accurate a temporal prediction model is at prediction, the more its RFs tend to be like real neuronal RFs by the measures we use for comparison.

Relationship to other models

A number of principles, often acting together, have been proposed to explain the form and diversity of sensory RFs. These include efficient coding (Barlow, 1959; Olshausen and Field, 1996, Olshausen and Field, 1997; Carlson et al., 2012; Zhao and Zhaoping, 2011; Srinivasan et al., 1982; Brito and Gerstner, 2016; Olshausen, 2003; Attneave, 1954), sparseness (Olshausen and Field, 1996, Olshausen and Field, 1997; Carlson et al., 2012; Kozlov and Gentner, 2016; Brito and Gerstner, 2016; Olshausen, 2003), and slowness (Hyvärinen et al., 2003; Carlin and Elhilali, 2013). Efficient coding indicates that neurons should encode maximal information about sensory input given certain constraints, such as spike count or energy costs. Sparseness posits that only a small proportion of neurons in the population should be active for a given input. Finally, slowness means that neurons should be sensitive to features that change slowly over time. The temporal prediction principle we describe here provides another unsupervised objective of sensory coding. It has been described in a very general manner by the information bottleneck concept (Bialek et al., 2001; Salisbury and Palmer, 2016; Palmer et al., 2015). We have instantiated a specific version of this idea, with linear-nonlinear encoding of the input, followed by a linear transform from the encoding units’ output to the prediction.
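The structure just described (linear-nonlinear encoding followed by a linear readout) can be written as a short sketch. The logistic nonlinearity, the bias terms, the application of the L1 penalty to both weight matrices and all names are assumptions made for illustration; they should not be read as the exact configuration of the model in this paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(past, W, b, V, c):
    """Predict the future input from the flattened past input.

    W, b: hidden-layer weights (hidden x input) and biases.
    V, c: output weights (output x hidden) and biases.
    """
    # Linear-nonlinear encoding (assumed logistic nonlinearity).
    hidden = [sigmoid(sum(w * x for w, x in zip(row, past)) + bias)
              for row, bias in zip(W, b)]
    # Linear transform from hidden activity to the prediction.
    return [sum(v * h for v, h in zip(row, hidden)) + bias
            for row, bias in zip(V, c)]


def objective(prediction, future, W, V, l1_strength):
    """Mean squared prediction error plus an L1 penalty on the weights."""
    mse = sum((p - f) ** 2 for p, f in zip(prediction, future)) / len(future)
    l1 = sum(abs(w) for row in W for w in row) + sum(abs(v) for row in V for v in row)
    return mse + l1_strength * l1


# Tiny toy network: 2 inputs, 1 hidden unit, 2 outputs (made-up weights).
W, b = [[0.0, 0.0]], [0.0]
V, c = [[2.0], [-2.0]], [0.0, 1.0]
toy_prediction = predict([0.3, -0.7], W, b, V, c)
```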

In the following discussion, we describe previous normative models that infer RFs with temporal structure from auditory or movie input and relate them to spectrotemporal RFs in A1 or simple cell spatiotemporal RFs in V1, respectively. For focus, other normative models of less directly relevant areas, such as spatial receptive fields without a temporal component (Olshausen and Field, 1996, Olshausen and Field, 1997), complex cells (Berkes and Wiskott, 2005), retinal receptive fields (Huang and Rao, 2011; Srinivasan et al., 1982), or auditory nerve impulse responses (Smith and Lewicki, 2006), will not be examined.

Auditory normative models

A number of coding objectives have been explored in normative models of A1 spectrotemporal RFs. One approach (Zhao and Zhaoping, 2011) found analytically that the optimal typical spectrotemporal RF for efficient coding was spectrally localized with lagging and flanking inhibition, and showed an asymmetric temporal envelope. However, the resulting RF also showed substantially more flanking inhibition, more ringing over time and frequency, and operated over a much shorter timescale (~10 ms) than seen in A1 RFs (Figure 3). Moreover, this approach produced a single generic RF, rather than capturing the diversity of the population.

Other models have produced a diverse range of spectrotemporal RFs. In the sparse coding approach (Carlson et al., 2012; Brito and Gerstner, 2016; Młynarski and McDermott, 2017; Blättler et al., 2011), a spectrogram snippet is reconstructed from a sum of basis functions (a linear generative model), each weighted by its unit’s activity, with a constraint to have few active units. This approach is the same as the sparse coding model we used as a control (Figure 5). A challenge with many sparse generative models is that the activity of the units is found by a recurrent iterative process that needs to find a steady state; this is fine for static stimuli such as images, but for dynamic stimuli like sounds it is questionable whether the nervous system would have sufficient time to settle on appropriate activities before the stimulus had changed. Related work also used a sparsity objective, but rather than minimizing stimulus reconstruction error, forced high dispersal (Kozlov and Gentner, 2016) or decorrelation (Klein et al., 2003; Carlin and Elhilali, 2013) of neural responses. Although lacking some of the useful probabilistic interpretations of sparse generative models, this approach does not require a settling process for inference. An alternative to sparseness is temporal slowness, which can be measured by temporal coherence (Carlin and Elhilali, 2013). Here the linear transform from sequential spectrogram snippets to unit activity is optimized to maximize the correlation of each unit’s response over a certain time window, while maintaining decorrelation between the units’ activities.

Although the frequency tuning derived with these models can resemble that found in the midbrain or cortex (Klein et al., 2003; Carlson et al., 2012; Kozlov and Gentner, 2016; Carlin and Elhilali, 2013; Brito and Gerstner, 2016; Młynarski and McDermott, 2017; Blättler et al., 2011) (Figure 5), the resulting RFs lack the distinct asymmetric temporal profile and lagging inhibition seen in real midbrain or A1 RFs. Furthermore, they often have envelopes that are too elongated over time, often spanning the full temporal width of the spectrotemporal RF. This is related to the fact that the time window to be encoded by the model is set arbitrarily, and every time point within that window is given equal importance, that is, the direction of time is not accounted for. This is in contrast to the temporal prediction model, which naturally gives greater weighting to time-points near the present than to those in the past due to their greater predictive capacity.

Visual normative models

The earliest normative model of spatiotemporal RFs of simple cells used independent component analysis (ICA) (van Hateren and Ruderman, 1998a), which is practically equivalent for visual or auditory data to the critically complete case of the sparse coding model (Olshausen and Field, 1996, Olshausen and Field, 1997) we used as a control (Figure 5—figure supplements 1–2 and Figure 7—figure supplement 1). The RFs produced by this model and the control model reproduced fairly well the spatial aspects of simple cell RFs. However, in contrast to the temporal prediction model (Figure 7d), the subset of more ‘blob-like’ RFs seen in the data are not well captured by our control sparse coding model (Figure 7—figure supplement 1g). In the temporal domain, again unlike the temporal prediction model and real V1 simple cells, the RFs of the ICA and sparse coding models are not pressed up against the present with an asymmetrical temporal envelope, but instead show a symmetrical envelope or span the entire range of times examined. A related model (Olshausen, 2003) assumes that a longer sequence of frames is generated by convolving each basis function with a time-varying sparse coefficient and summing the result, so that each basis function is applied at each point in time. The resulting spatiotemporal RFs are similar to those produced by ICA (van Hateren and Ruderman, 1998a), or our control model (Figure 5—figure supplement 2 and Figure 7—figure supplement 1c). Although they tend not to span the entire range of times examined, they do show a symmetrical envelope, and require an iterative inference procedure, as described above for audition.

Temporal slowness constraints have also been used to model the spatiotemporal RFs of simple cells. The bubbles (Hyvärinen et al., 2003) approach combines sparse and temporal coherence constraints with reconstruction. The resulting RFs show similar spatial and temporal properties to those found using ICA. A related framework is slow feature analysis (SFA) (Berkes and Wiskott, 2005; Wiskott and Sejnowski, 2002), which enforces temporal smoothness by minimizing the derivative of unit responses over time, while maximizing decorrelation between units. SFA has been used to model complex cell spatiotemporal RFs (over only two time steps, Berkes and Wiskott, 2005), and a modified version has been used to model spatial (not spatiotemporal) RFs of simple cells (Berkes et al., 2009). These results are not directly comparable with our results or the spatiotemporal RFs of simple cells.

In the slowness framework, the features found are those that persist over time; the presence of such a feature in the recent past predicts that the same feature will be present in the near future. This is also the case for our predictive approach, which, additionally, can capture features in the past that predict features in the future that are subtly or radically different from themselves. The temporal prediction principle will also give different weighting to features, as it values predictive capacity rather than temporal slowness (Creutzig and Sprekeler, 2008). In addition, although slowness models can be extended to model RFs over more than one time step (Berkes and Wiskott, 2005; Hyvärinen et al., 2003; Carlin and Elhilali, 2013), capturing temporal structure, they do not inherently give more weighting to information in the most recent past and therefore do not give rise to asymmetric temporal profiles in RFs.

There is one study that has directly examined temporal prediction as an objective for visual RFs in a manner similar to ours (Palm, 2012). Here, as in our model, a single hidden layer feedforward neural network was used to predict the immediate future frame of a movie patch from its past frames. However, only two frames of the past were used in this study, so a detailed exploration of the temporal profile of the spatiotemporal RFs was not possible. Nevertheless, some similarities and differences in the spatial RFs between the two frames were noted, and some units had oriented RFs. In contrast to our model, however, many RFs were noisy and did not resemble those of simple cells. Potential reasons for this difference include the use of L2 rather than L1 regularization on the weights, an output nonlinearity not present in our model, the optimization algorithm used, network size, or the dataset. Another very recent related study (Chalk et al., 2018) also implemented a somewhat different form of temporal prediction, with a linear (rather than linear-nonlinear) encoder, and linear decoder. When applied to visual scenes, oriented receptive fields were produced, but they were spatio-temporally separable and hence not direction selective.

Strengths and limitations of the temporal prediction model

Temporal prediction has several strengths as an objective function for sensory processing. First, it can capture underlying features in the world (Bialek et al., 2001); this is also the case with sparseness (Olshausen and Field, 1996, Olshausen and Field, 1997) and slowness (Wiskott and Sejnowski, 2002), but temporal prediction will prioritize different features. Second, it can predict future inputs, which is very important for guiding action, especially given internal processing delays. Third, objectives such as efficient or sparse reconstruction retain everything about the stimulus, whereas an important part of neural information processing is the selective elimination of irrelevant information (Marzen and DeDeo, 2017). Prediction provides a good initial criterion for eliminating potentially unwanted information. Fourth, prediction provides a natural method to determine the hyperparameters of the model (such as regularization strength, number of hidden units, activation function and temporal window size). Other models select their hyperparameters depending on what best reproduces the neural data, whereas we have an independent criterion – the capacity of the network to predict the future. One notable hyperparameter is how many time-steps of past input to encode. As described above, this is naturally decided by our model because only time-steps that help predict the future have significant weighting. Fifth, the temporal prediction model computes neuronal activity without needing to settle to a steady state, unlike some other models (Olshausen and Field, 1996, Olshausen and Field, 1997; Carlson et al., 2012; Brito and Gerstner, 2016; Młynarski and McDermott, 2017). For dynamic stimuli, a model that requires settling may not reach equilibrium in time to be useful. Sixth, and most importantly, temporal prediction successfully models many aspects of the RFs of primary cortical neurons. In addition to accounting for spatial and spectral tuning in V1 and A1, respectively, at least as well as other normative models, it reproduces the temporal properties of RFs, particularly the asymmetry of the envelopes of RFs, something few previous models have attempted to explain.

Although the temporal prediction model’s ability to describe neuronal RFs is high, the match with real neurons is not perfect. For example, the span of frequency tuning of our modelled auditory RFs is narrower than in A1 (Figure 6g–h). We also found an overrepresentation of vertical and horizontal orientations compared to real V1 data (Figure 7b–c). Some of these differences could be a consequence of the data used to train the model. Although the statistics of natural stimuli are broadly conserved (Field, 1987), there is still variation (Torralba and Oliva, 2003), and the dataset used to train the network may not match the sensory world experienced by the animal during development and over the course of evolution. In future work, it would be valuable to explore the influence of natural datasets with different statistics, and also to match those datasets more precisely to the evolutionary context and individual experience of the animals examined. Furthermore, a comparison of the model with neural data from different species, at different ages, and reared in different environments would be useful.

Another cause of differences between the model and neural RFs may be the recording location of the RFs and how they are characterized. We used the primary sensory cortices as regions for comparison, because we performed transformations on the input data that are similar to the preprocessing that takes place in afferent subcortical structures. We spatially filtered the visual data in a similar way to the retina (Olshausen and Field, 1996, Olshausen and Field, 1997), and spectrally decomposed the auditory data as in the inner ear, and then used time bins (5 ms) which are coarser than, but close to, the maximum amplitude modulation period that can be tracked by auditory midbrain neurons (Rees and Møller, 1983). However, primary cortex is not a homogeneous structure, with neurons in different layers displaying certain differences in their response properties (Harris and Mrsic-Flogel, 2013). Furthermore, the methods by which neurons are sampled from the cortex may not provide a representative sample. For example, multi-electrode arrays tend to favour larger and more active neurons. In addition, the method and stimuli used to construct RFs from the data can bias their structure somewhat (Willmore et al., 2016).

The model presented here is based on a simple feedforward network with one layer of hidden units. This limits its ability to predict features of the future input, and to account for RFs with nonlinear tuning. More complex networks, with additional layers or recurrency may allow the model to account for more complex tuning properties, including those found beyond the primary sensory cortices. Careful, principled adjustment of the preprocessing, or different regularization methods (such as sparseness or slowness applied to the units’ activities), may also help. There is an open question as to whether the current model may eliminate some information that is useful for reconstruction of the past input or for prediction of higher order statistical properties of the future input, which might bring it into conflict with the principle of least commitment (Marr, 1976). It is an empirical question how much organisms preserve information that is not predictive of the future, although there are theoretical arguments against such preservation (Bialek et al., 2001). Such conflict might be remedied, and the model improved, by adding feedback from higher areas or by adding an objective to reconstruct the past or present (Barlow, 1959; Olshausen and Field, 1996, Olshausen and Field, 1997; Attneave, 1954) in addition to predicting the future.

To determine whether the model could help explain neuronal responses in higher areas, it would be useful to develop a hierarchical version of the temporal prediction model, applying the same model again to the activity of the hidden units rather than to the input. Another useful extension would be to see if the features learnt by the temporal prediction model could be used to accelerate learning of useful tasks such as speech or object recognition, by providing input or initialization for a supervised or reinforcement learning network. Indeed, temporal predictive principles have been shown to be useful for unsupervised training of networks used in visual object recognition (Srivastava et al., 2015; Ranzato, 2016; Lotter et al., 2016; Oh et al., 2015).

Finally, it is interesting to consider possible more explicit biological bases for our model. We envisage the input units of the model as thalamic input, and the hidden units as primary cortical neurons. Although the function of the output units could be seen as just a method to optimize the hidden units to find the most predictive code given sensory input statistics, they may also have a physiological analogue. Current evidence (Dahmen and King, 2007; Huberman et al., 2008; Kiorpes, 2015) suggests that while primary cortical RFs are to an extent hard-wired in form by natural selection, their tuning is also refined by individual sensory experience. This refinement process may require a predictive learning mechanism in the animal’s brain, at least at some stage of development and perhaps also into adulthood. Hence, one might expect to find a subpopulation of neurons that represent the prediction (analogous to the output units of the model) or the prediction error (analogous to the difference between the output unit activity and the target). Indeed, signals relating to sensory prediction error have been found in A1 (Rubin et al., 2016), though they may also be located in other regions of the brain. Finally, it is important to note that, although the biological plausibility of backpropagation has long been questioned, recent progress has been made in developing trainable networks that perform similarly to artificial neural networks trained with backpropagation, but with more biologically plausible characteristics (Bengio et al., 2015), for example, by having spikes or avoiding the weight transport problem (Lillicrap et al., 2016).

Conclusion

We have shown that a simple principle - predicting the imminent future of a sensory scene from its recent past - explains many features of the RFs of neurons in both primary visual and auditory cortex. This principle may also account for neural tuning in other sensory systems, and may prove useful for the study of higher sensory processing and aspects of neural development and learning. While the importance of temporal prediction is increasingly widely recognized, it is perhaps surprising nonetheless that many basic tuning properties of sensory neurons, which we have known about for decades, appear, in fact, to be a direct consequence of the brain’s need to efficiently predict what will happen next.

Materials and methods

Data used for model training and testing

Visual inputs

Videos (without sound, sampled at 25 fps) of wildlife in natural settings were used to create visual stimuli for training the artificial neural network. The videos were obtained from http://www.arkive.org/species, contributed by: BBC Natural History Unit, http://www.gettyimages.co.uk/footage/bbcmotiongallery; BBC Natural History Unit and Discovery Communications Inc., http://www.bbcmotiongallery.com; Granada Wild, http://www.itnsource.com; Mark Deeble and Victoria Stone, Flat Dog Productions Ltd., http://www.deeblestone.com; Getty Images, http://www.gettyimages.com; National Geographic Digital Motion, http://www.ngdigitalmotion.com. The longest dimension of each video frame was clipped to form a square image. Each frame was then band-pass filtered (Olshausen and Field, 1997) and downsampled (using bilinear interpolation) over space, to provide 180 × 180 pixel frames. Non-overlapping patches of 20 × 20 pixels were selected from a fixed region in the centre of the frames, where there tended to be visual motion. The video patches were cut into sequential overlapping clips, each 8 frames in duration. Thus, each training example (clip) was made up of a 20 × 20 pixel section of the video with a duration of 8 frames (320 ms), providing a training set of N ≈ 500,000 clips from around 5.5 hr of video, and a validation set of N ≈ 100,000 clips. Finally, the training and validation sets were normalized by subtracting the mean and dividing by the standard deviation (over all pixels, frames and clips in the training set). The goal of the neural network was to predict the final frame (the ‘future’) of each clip from the first seven frames (the ‘past’).
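The clip construction and normalization steps can be sketched as follows. The sketch assumes that consecutive clips are offset by one frame (the text says only that clips are sequential and overlapping) and represents each frame as a flat list of pixel values; the function names are illustrative.

```python
def make_clips(frames, clip_len=8, n_future=1):
    """Cut a patch sequence into overlapping clips and split past from future.

    Assumption: consecutive clips are offset by one frame.
    Returns a list of (past_frames, future_frames) pairs.
    """
    clips = []
    for start in range(len(frames) - clip_len + 1):
        clip = frames[start:start + clip_len]
        clips.append((clip[:-n_future], clip[-n_future:]))
    return clips


def training_stats(training_values):
    """Mean and standard deviation over all values in the training set."""
    mean = sum(training_values) / len(training_values)
    variance = sum((v - mean) ** 2 for v in training_values) / len(training_values)
    return mean, variance ** 0.5


def normalize(values, mean, sd):
    """Apply the training-set statistics to any set (training or validation)."""
    return [(v - mean) / sd for v in values]


# Ten toy frames of two pixels each (made-up values).
toy_frames = [[float(t), float(-t)] for t in range(10)]
toy_clips = make_clips(toy_frames)
clip_duration_ms = 8 * 1000 / 25  # 8 frames at 25 fps
```

Here `toy_clips` holds three clips, each with seven past frames and one future frame, and `clip_duration_ms` is 320.0, matching the clip duration given in the text.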

Auditory inputs

Auditory stimuli were compiled from databases of human speech (~60%), animal vocalizations (~20%) and sounds from inanimate objects found in natural settings (e.g. running water, rustling leaves; ~20%). Stimuli were recorded using a Zoom H4 or collected from online sources. Natural sounds were obtained from www.freesound.org, contributed by users sedi, higginsdj, jult, kvgarlic, xenognosis, zabuhailo, funnyman374, videog, j-zazvurek, samueljustice00, gfrog, ikbenraar, felix-blume, orbitalchiller, saint-sinner, carlvus, vflefevre, hitrison, willstepp, timbahrij, xdimebagx, r-nd0mm3m, the-yura, rsilveira-88, stomachache, foongaz, edufigg, yurkobb, sandermotions, darius-kedros, freesoundjon-01, dwightsabeast, borralbi, acclivity, J.Zazvurek, Zabuhailo, soundmary, Darius Kedros, Kyster, urupin, RSilveira and freelibras. Human speech sounds were obtained from http://databases.forensic-voice-comparison.net/ (Morrison et al., 2015, Morrison et al., 2012).

Each sound was sampled at (or resampled to) 44.1 kHz and converted into a simple ‘cochleagram’, to make it more analogous to the activity pattern that would be passed to the auditory pathway after processing by the cochlea. To calculate the cochleagram, a power spectrogram was computed using 10 ms Hamming windows, overlapping by 5 ms (giving time steps of 5 ms). The power across neighbouring Fourier frequency components was then aggregated into 32 frequency channels using triangular windows with a base width of 1/3 octave whose centre frequencies ranged from 500 to 17,827 Hz (1/6th octave spacing, using code adapted from melbank.m, http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html). The cochleagrams were then decomposed into sequential overlapping clips, each of 43 time steps (215 ms) in duration, providing a training set of ~1,000,000 clips (~1.3 hr of audio) and a validation set of ~200,000 clips. To approximately model the intensity compression seen in the auditory nerve (Sachs and Abbas, 1974), each frequency band in the stimulus set was divided by the median value in that frequency band over the training set, and passed through a Hill function, defined as h(x)=cx/(1+cx) with c = 0.02. Finally, the training and validation sets were normalized by subtracting the mean and dividing by the standard deviation over all time steps, frequency bands and clips in the training set. The first 40 time steps (200 ms) of each clip (the ‘past’) were used as inputs to the neural network, whose aim was to predict the content (the ‘future’) of the remaining three time steps (15 ms).
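The compression step and the split into ‘past’ and ‘future’ might look like the sketch below. The function names `hill` and `compress` and the gamma-distributed stand-in data are assumptions for illustration; only the formula h(x)=cx/(1+cx) with c = 0.02 and the 40/3 split come from the text.

```python
import numpy as np

def hill(x, c=0.02):
    """Compressive function h(x) = c*x / (1 + c*x), as defined in the text."""
    return c * x / (1.0 + c * x)

def compress(cochleagram, band_medians):
    """Divide each frequency band by its training-set median, then apply hill().

    cochleagram:  (n_freq, n_time) array of non-negative power values.
    band_medians: (n_freq,) medians computed over the training set.
    """
    return hill(cochleagram / band_medians[:, None])

rng = np.random.default_rng(0)
coch = rng.gamma(shape=2.0, scale=50.0, size=(32, 500))   # stand-in power values
compressed = compress(coch, np.median(coch, axis=1))

clip = compressed[:, :43]                    # one clip of 43 time steps
past, future = clip[:, :40], clip[:, 40:]    # 40 steps as input, 3 steps to predict
```

Because h maps non-negative inputs into the interval [0, 1), large power values are squashed while small ones remain roughly proportional to the input.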

Addition of Gaussian noise

To replicate the effect of noise found in the nervous system, Gaussian noise was added to both the auditory and visual inputs with a signal-to-noise ratio (SNR) of 6 dB. While the addition of noise did not make substantial differences to the RFs of units trained on visual inputs, this improved the similarity to the data when the model was trained on auditory inputs. The results from training the network on inputs without added noise are shown for auditory inputs in Figure 4—figure supplement 5 and Figure 6—figure supplement 1 and for visual inputs in Figure 4—figure supplements 6–7 and Figure 7—figure supplement 2. The results from the sparse coding model were similar in both cases for inputs with and without noise (Figures 5–6, Figure 5—figure supplements 1–5, Figure 6—figure supplement 1, Figure 7—figure supplements 1 and 3).
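One way to add noise at a fixed SNR is sketched below. The text does not say how signal power was measured, so taking it as the variance over the whole input array is an assumption, as is the helper name `add_noise`.

```python
import numpy as np

def add_noise(signal, snr_db=6.0, rng=None):
    """Add white Gaussian noise so that 10*log10(P_signal / P_noise) = snr_db.

    Signal power is taken as the variance over the whole array (an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise_var = signal.var() / 10.0 ** (snr_db / 10.0)
    return signal + rng.normal(scale=np.sqrt(noise_var), size=signal.shape)

rng = np.random.default_rng(0)
clean = rng.normal(size=(1000, 1280))          # stand-in for normalized inputs
noisy = add_noise(clean, snr_db=6.0, rng=rng)

# Check the achieved SNR from the actual noise that was added.
measured_snr_db = 10 * np.log10(clean.var() / (noisy - clean).var())
```

At 6 dB the noise power is roughly one quarter of the signal power.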

Temporal prediction model

The model and cost function

The temporal prediction model was implemented using a standard fully connected feed-forward neural network with one hidden layer. Each hidden unit in the network computed the linear weighted sum of inputs, and its output was determined by passing this sum through a monotonic nonlinearity. This nonlinearity s=h(a) was either a logistic function h(a)=1/(1+exp(−a)) or a similar nonlinear function (such as tanh). For results reported here, we used the logistic function, though we obtained similar results when we trained the model using h(a)=tanh(a). For comparison, we also trained the model replacing the nonlinearity with a linear function, where h(a)=a. In this case, we found that the RFs tended to be punctate in space or frequency and did not typically show the alternating excitation and inhibition over time that is characteristic of real neurons in A1 and V1.

Formally, for a network with $i = 1$ to $I$ input variables, $k = 1$ to $K$ output units and a single layer of $j = 1$ to $J$ hidden units, the output $s_{jn}$ of hidden unit $j$ for clip $n$ is given by:

(1) $s_{jn} = h\left(b_j + \sum_{i=1}^{I} w_{ji} u_{in}\right)$

The value $u_{in}$ of input variable $i$ for clip $n$ is simply the value for a particular pixel and time step (frame) of the ‘past’ in preprocessed visual clip $n$ (I = 20 pixels × 20 pixels × 7 time steps = 2800), or the value for a particular frequency band and time step of the ‘past’ of cochleagram clip $n$ (I = 32 frequencies × 40 time steps = 1280). Hence, the index $i$ spans over several frequencies or pixels and also over time steps into the past. The subscript $n$ has been dropped for clarity in the figures (Figure 1). The parameters in Equation 1 are the connective input weights $w_{ji}$ (between each input variable $i$ and hidden unit $j$), and the bias $b_j$ (of hidden unit $j$).

The activity $\hat{v}_{kn}$ of each output unit $k$, which is the estimate of the true future $v_{kn}$ given the past $u_{in}$, is given by:

(2) $\hat{v}_{kn} = b_k + \sum_{j=1}^{J} w_{kj} s_{jn}$

The parameters in Equation 2 are the connective output weights $w_{kj}$ (between each hidden unit $j$ and output unit $k$) and the bias $b_k$ (of output unit $k$). The activity $\hat{v}_{kn}$ of output unit $k$ for clip $n$ is the estimate for a particular pixel of the ‘future’ in the visual case (K = 20 pixels × 20 pixels × 1 time step = 400), or the value for a particular frequency band and time step of the ‘future’ in the auditory case (K = 32 frequencies × 3 time steps = 96).

The parameters $w_{ji}$, $w_{kj}$, $b_j$, and $b_k$ were optimized for the training set by minimizing the cost function given by:

(3) $E = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K}\left(\hat{v}_{kn} - v_{kn}\right)^2 + \lambda\left(\sum_{i=1}^{I}\sum_{j=1}^{J}|w_{ji}| + \sum_{j=1}^{J}\sum_{k=1}^{K}|w_{kj}|\right)$

Thus, $E$ is the mean squared error (the prediction error) between the prediction $\hat{v}_{kn}$ and the target $v_{kn}$ over all $N$ training examples and $K$ target variables, plus an L1 regularization term that is proportional to the sum of the absolute values of all weights in the network, with its strength set by the hyperparameter λ. This regularization tends to drive redundant weights to near zero and provides a parsimonious network.
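For readers who prefer code to index notation, Equations 1 to 3 can be written as a short NumPy sketch. This is an illustrative forward pass and cost evaluation only, not the Lasagne/Theano implementation used in the study; the array names, the small demo sizes, and the random initial weights are assumptions.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(U, W_in, b_hid, W_out, b_out):
    """U: (N, I) past inputs. Returns hidden activity S (N, J) and prediction V_hat (N, K)."""
    S = logistic(b_hid + U @ W_in.T)       # Equation 1
    V_hat = b_out + S @ W_out.T            # Equation 2
    return S, V_hat

def cost(U, V, W_in, b_hid, W_out, b_out, lam):
    """Mean squared prediction error plus an L1 penalty on all weights (Equation 3)."""
    _, V_hat = predict(U, W_in, b_hid, W_out, b_out)
    mse = np.mean((V_hat - V) ** 2)        # average over N examples and K targets
    l1 = np.abs(W_in).sum() + np.abs(W_out).sum()
    return mse + lam * l1

rng = np.random.default_rng(0)
N, I, J, K = 8, 1280, 16, 96               # J is kept small for the demo
U, V = rng.normal(size=(N, I)), rng.normal(size=(N, K))
W_in, W_out = 0.01 * rng.normal(size=(J, I)), 0.01 * rng.normal(size=(K, J))
b_hid, b_out = np.zeros(J), np.zeros(K)

E = cost(U, V, W_in, b_hid, W_out, b_out, lam=10 ** -3.5)
```

Note that the biases are not penalized: only the input and output weights enter the L1 term.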

Implementation details

The networks were implemented in Python (https://lasagne.readthedocs.io/en/latest/; http://deeplearning.net/software/theano/). The objective function was minimized using backpropagation as performed by the Adam optimization method (Kingma and Ba, 2014). An alternative implementation of the model was also made in MATLAB using the Sum-of-Functions Optimizer (Sohl-Dickstein et al., 2014) (https://github.com/Sohl-Dickstein/Sum-of-Functions-Optimizer) to train the network using backpropagation. Training examples were split into minibatches of approximately 200 training examples each.

During model network training, several hyperparameters were varied, including the regularization strength (λ), the number of units in the hidden layer and the nonlinearity used by each hidden unit. For each hyperparameter setting, the training algorithm was run for 1000 iterations. Running the network for longer (10000 iterations) showed negligible improvement to the prediction error (as measured on the validation set) or change in RF structure.

The effect of varying the number of hidden units and λ on the prediction error for the validation set is shown in Figure 8. In both the visual and auditory case, the results presented (Figures 2, 4, 6 and 7, and supplements) are the networks that predicted best on the validation set after 1000 iterations through the training data. For the auditory case, the settings that resulted in the best prediction were 1600 hidden units and λ = 10^−3.5, while in the visual case, the optimal settings were 1600 hidden units and λ = 10^−6.25.

Model receptive fields

In the model, the combination of linear weights and nonlinear activation function is similar to the basic linear non-linear (LN) model (Simoncelli et al., 2004; Dahmen et al., 2008; Atencio et al., 2008; Chichilnisky, 2001; Rabinowitz et al., 2011) commonly used to describe neural RFs. Hence, the input weights between the input layer and a hidden unit of the model network are taken directly to represent the unit’s RF, indicating the features of the input that are important to that unit.

Because of the symmetric nature of the sigmoid function, $h(-a) = 1 - h(a)$, after appropriate modification of the biases a hidden unit has the same influence on the prediction if its input and output matrices are both multiplied by −1. That is, for unit $j$, if we convert $w_{ji}$ to $-w_{ji}$, $w_{kj}$ to $-w_{kj}$, $b_j$ to $-b_j$, and $b_k$ to $b_k + w_{kj}$, this will have no effect on the prediction or the cost function. This can be done independently for each hidden unit. Hence, the sign of each unit’s RF could equally be positive or negative and have the same result on the predictions given by the network. However, we know that auditory units always have leading excitation (Figure 3). Hence, for both the predictive model and for the sparse coding model, we assume leading excitation for each unit. This was done for all auditory analyses.
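The sign symmetry can be checked numerically. The sketch below flips one hidden unit of a small random network with a logistic nonlinearity and compares predictions before and after; the helper `flip_unit` and the network sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict(U, W_in, b_hid, W_out, b_out):
    return b_out + logistic(b_hid + U @ W_in.T) @ W_out.T

def flip_unit(j, W_in, b_hid, W_out, b_out):
    """Return a copy of the parameters with hidden unit j sign-flipped."""
    W_in, b_hid, W_out, b_out = W_in.copy(), b_hid.copy(), W_out.copy(), b_out.copy()
    b_out += W_out[:, j]      # absorb the constant arising from h(-a) = 1 - h(a)
    W_in[j] *= -1             # input weights of unit j
    b_hid[j] *= -1            # bias of unit j
    W_out[:, j] *= -1         # output weights of unit j
    return W_in, b_hid, W_out, b_out

rng = np.random.default_rng(0)
I, J, K = 12, 5, 4
U = rng.normal(size=(7, I))
params = (rng.normal(size=(J, I)), rng.normal(size=J),
          rng.normal(size=(K, J)), rng.normal(size=K))

before = predict(U, *params)
after = predict(U, *flip_unit(2, *params))   # equal to `before` up to rounding
```

Since the flip changes only the signs of weights, the L1 term of the cost function is unchanged as well.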

As more units are added to the model network, the number of inactive units increases. To account for this, we measured the relative strength of all input connections to each hidden unit by summing the square of all input weights for that unit. Units for which the sum of square input weights was <1% of the maximum strength for the population were deemed to be inactive and excluded from all subsequent analyses. The difference in connection strength between active and inactive units was very distinct; a threshold <0.0001% only marginally increases the number of active units.
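The exclusion rule amounts to a threshold on each unit's summed squared input weights, as in this minimal sketch. The function name `active_units` and the toy weight matrix are assumptions.

```python
import numpy as np

def active_units(W_in, threshold=0.01):
    """Boolean mask of hidden units whose summed squared input weights are
    at least `threshold` times the maximum over the population."""
    strength = np.sum(W_in ** 2, axis=1)
    return strength >= threshold * strength.max()

rng = np.random.default_rng(0)
W_in = rng.normal(size=(6, 1280))   # 6 hidden units, 1280 inputs
W_in[[1, 4]] *= 1e-4                # two units with near-zero weights
mask = active_units(W_in)           # units 1 and 4 are flagged as inactive
```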

Sparse coding model

The sparse coding model was used as a control for both visual and auditory cases. The Python implementation of this model (https://github.com/zayd/sparsenet) was trained using the same visual and auditory inputs used to train the predictive model. The training data were divided into mini-batches which were shuffled and the model optimized for one full pass through the data. Inference was performed using the Fast Iterative Shrinkage and Thresholding (FISTA) algorithm. A sparse L1 prior with strength λ was applied to the unit activities, providing activity regularization. A range of λ-values and unit numbers were tried (Figure 8—figure supplement 1). The learning rate and batch size were also varied until reasonable values were found. As there was no independent criterion by which to determine the ‘best’ settings, we chose the network that produced basis functions whose receptive fields were most similar to those of real neurons. In the auditory case, this was determined using the mean KS measure of similarity (Figure 8—figure supplement 1). In the visual case, as a similarity measure was not performed, this was done by inspection. In both cases, the model configurations chosen were restricted to those trained in an overcomplete condition (having more units than the number of input variables) in order to remain consistent with previous instantiations of this model (Olshausen and Field, 1996; Olshausen and Field, 1997; Carlson et al., 2012). In this manner, we selected a sparse coding network with 1600 units, λ = 10^0.5, learning rate = 0.01 and 100 mini-batches in the auditory case (Figures 5–6). In the visual case, the network selected was trained with 3200 units, λ = 10^0.5, learning rate = 0.05 and 100 mini-batches (Figure 5—figure supplements 1–2 and Figure 7—figure supplement 1).
Although the sparse coding basis functions are projective fields, they tend to be similar in structure to receptive fields (Olshausen and Field, 1996; Olshausen and Field, 1997), and, for simplicity, are referred to as RFs.

Auditory receptive field analysis

In vivo A1 RF data

Auditory RFs of neurons were recorded in the primary auditory cortex (A1) and anterior auditory field (AAF) of 5 pigmented ferrets of both sexes (all >6 months of age) and used as a basis for comparison with the RFs of model units trained on auditory stimuli. Systematic differences in response properties of A1 and AAF neurons are minor and not relevant for this study, and for simplicity here, we refer to neurons from either primary field indiscriminately as ‘A1 neurons’. These recordings were performed under license from the UK Home Office and were approved by the University of Oxford Committee on Animal Care and Ethical Review. Full details of the recording methods are described in earlier studies (Willmore et al., 2016; Bizley et al., 2009). Briefly, we induced general anaesthesia with a single intramuscular dose of medetomidine (0.022 mg · kg⁻¹ · h⁻¹) and ketamine (5 mg · kg⁻¹ · h⁻¹), which was then maintained with a continuous intravenous infusion of medetomidine and ketamine in saline. Oxygen was supplemented with a ventilator, and we monitored vital signs (body temperature, end-tidal CO2, and the electrocardiogram) throughout the experiment. The temporal muscles were retracted, a head holder was secured to the skull surface, and a craniotomy and a durotomy were made over the auditory cortex. Extracellular recordings were made using silicon probe electrodes (Neuronexus Technologies) and acoustic stimuli were presented via Panasonic RPHV27 earphones, which were coupled to otoscope specula that were inserted into each ear canal, and driven by Tucker-Davis Technologies System III hardware (48 kHz sample rate).

The neuronal recordings used the ‘BigNat’ stimulus set (Willmore et al., 2016), which consists of natural sounds including animal vocalizations (e.g., ferrets and birds), environmental sounds (e.g., water and wind), and speech. To identify those neural units that were driven by the stimuli, we calculated a ‘noise ratio’ statistic (Rabinowitz et al., 2011; Sahani and Linden, 2003) for each unit and excluded from further analysis any units with a noise ratio >40. In total, driven spiking responses of 114 units (75 single unit, 39 multi-unit) were recorded to this stimulus set. Then, the auditory (spectrotemporal) RF of each unit was constructed using a previously described method (Willmore et al., 2016). Briefly, linear regression was performed in order to minimize the squared error between each neuron’s spiking response over time and the cochleagram of the stimuli that gave rise to that response. The method used was exactly the same as in our earlier study (Willmore et al., 2016), except that L1 rather than L2 regularization was used to constrain the regression. The spectrotemporal RFs of these neurons took the same form as the inputs to the model neural network (i.e., 32 frequencies and 40 time-steps over the same range of values) and were therefore comparable to the model units’ RFs. In order to account for the latency of auditory cortical responses, the most recent two time-steps (10 ms) of the neuronal RFs were removed, leaving 38 time-steps.

Multi-dimensional scaling (MDS)

To get a non-parametric indication of how similar the model units’ RFs were to those of real A1 neurons, each RF was embedded into a 2-dimensional space using MDS (Figure 6a and Figure 6—figure supplement 1a). First, 100 units each from the temporal prediction and sparse coding models and from the real population were chosen at random. To ensure that the model RFs were of the same dimensionality as the real RFs prior to embedding, the least recent two time steps of each model RF were removed.

Measuring temporal and frequency spans of RFs

We quantified the span, over time and frequency, of the excitatory and inhibitory subfields of each RF. To do this, each RF was first separated into excitatory and inhibitory subfields, where the excitatory subfield was the RF with negative values set to 0, and the inhibitory subfield the RF with positive values set to 0. In some cases, model units did not exhibit notable inhibitory subfields. To account for this, the power contained in each subfield was calculated (sum of the squares of the subfield). Inhibitory subfields with <5% of the power of that unit’s excitatory subfield were excluded from further analysis. According to this criterion, 44 of 167 active units in the temporal prediction model and 193 of 1600 units in the sparse model did not display inhibition.

Singular value decomposition (SVD) was performed on each subfield separately, and the first pair of singular vectors was taken, one of which is over time, the other over frequency. For the excitatory subfield, the temporal span was measured as the proportion of values in the temporal singular vector that exceeded 50% of the maximum value in the vector. The same analysis provided the temporal span for the inhibitory subfield. Similarly, we measured the frequency spans of the RFs by applying this measure to the frequency singular vectors of the excitatory and the inhibitory subfields.
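The span measurement could be implemented along the following lines. This is a sketch under stated assumptions: absolute values are used to sidestep the arbitrary sign of singular vectors returned by an SVD routine, and the function names and toy RF are illustrative.

```python
import numpy as np

def span(vector, frac=0.5):
    """Proportion of entries whose magnitude exceeds frac times the maximum magnitude."""
    v = np.abs(vector)
    return np.mean(v > frac * v.max())

def subfield_spans(rf, min_power_ratio=0.05):
    """rf: (n_freq, n_time). Returns {subfield: (temporal_span, frequency_span)}."""
    exc = np.clip(rf, 0, None)             # negative values set to 0
    inh = np.clip(rf, None, 0)             # positive values set to 0
    out = {}
    for name, sub in (("excitatory", exc), ("inhibitory", inh)):
        if name == "inhibitory" and np.sum(sub ** 2) < min_power_ratio * np.sum(exc ** 2):
            continue                        # negligible inhibition: skip this subfield
        u, s, vt = np.linalg.svd(sub, full_matrices=False)
        out[name] = (span(vt[0]), span(u[:, 0]))   # first temporal and frequency vectors
    return out

# Toy RF: 4 time steps of excitation preceded by 10 steps of weaker inhibition,
# both confined to a single frequency channel.
rf = np.zeros((32, 38))
rf[10, 30:34] = 1.0
rf[10, 20:30] = -0.5
spans = subfield_spans(rf)
```

For this toy RF the excitatory temporal span is 4/38 and the inhibitory temporal span is 10/38, with a frequency span of 1/32 for both subfields.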

We also examined, for both real and model RFs, the mean power for each of the 38 time steps in the RFs (Figure 6b), which was calculated as the mean of the squared RF values, over all frequencies and RFs, at each time step.

Mean KS measure

To compare each network’s units with those recorded in A1 (Figure 3), the two-sample Kolmogorov-Smirnov (KS) distance between the real and model distribution was measured for both the temporal and spectral span of the excitatory and inhibitory subfields (e.g. the distributions in Figure 6d–e and Figure 6g–h). These four KS measures were then averaged to give a single mean KS measure for each network, indicating how closely the temporal and frequency characteristics of real and model units matched on average for that network. The KS measure is low for similar distributions and high for distributions that diverge greatly. Thus networks whose units display temporal and frequency tuning characteristics that match those of real neurons more closely give rise to a lower mean KS measure.
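A plain NumPy version of the mean KS measure is sketched below. The dictionary keys and random stand-in data are assumptions; `scipy.stats.ks_2samp` computes the same two-sample statistic and could be used instead.

```python
import numpy as np

def ks_distance(x, y):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between empirical CDFs."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    cdf_x = np.searchsorted(x, grid, side="right") / len(x)
    cdf_y = np.searchsorted(y, grid, side="right") / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

def mean_ks(real, model):
    """real, model: dicts mapping the four span names to 1D arrays of span values."""
    return np.mean([ks_distance(real[k], model[k]) for k in real])

rng = np.random.default_rng(0)
keys = ["exc_time", "exc_freq", "inh_time", "inh_freq"]
real = {k: rng.uniform(size=100) for k in keys}
shifted = {k: v + 10.0 for k, v in real.items()}   # distributions with no overlap

identical_score = mean_ks(real, real)      # identical distributions
disjoint_score = mean_ks(real, shifted)    # completely separated distributions
```

The measure is bounded between 0 (identical distributions) and 1 (no overlap), which makes scores comparable across networks.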

Visual receptive field analysis

In vivo V1 RF data

Visual RFs measured using recordings from V1 simple cells were compared against the model (Figure 2c, and Figure 7a, cat, Ohzawa et al., 1996). The model was also compared to measures of simple cell RFs (Figure 7d and corresponding supplements, cat, Jones and Palmer, 1987, mouse, Niell and Stryker, 2008 and monkey, Ringach, 2002). The data were taken from the authors’ website (Ringach, 2002) or extracted from relevant papers (Jones and Palmer, 1987) or provided by the authors (Ohzawa et al., 1996; Niell and Stryker, 2008).

Fitting Gabors

In order to quantify tuning properties of the model’s visual RFs, 2D Gabors were fitted to the optimal time-step of each unit’s response (Jones and Palmer, 1987; Ringach, 2002). This allowed comparison to previous experimental studies which parameterized real RFs by the same method (Ringach, 2002). The optimal time-step was defined (Ringach, 2002) as the time-step of the unit’s response which contained the most power (mean square). The Gabor function has been shown to provide a good approximation for most spatial aspects of simple visual RFs (Jones and Palmer, 1987; Ringach, 2002). The 2D Gabor is given as:

(4) $G(x', y') = A \exp\left(-\left(\frac{x'}{\sqrt{2}\sigma_x}\right)^2 - \left(\frac{y'}{\sqrt{2}\sigma_y}\right)^2\right)\cos(2\pi f x' + \phi)$

where the spatial coordinates $(x', y')$ are acquired by translating the centre of the RF $(x_0, y_0)$ to the origin and rotating the RF by its spatial orientation $\theta$:

(5) $x' = (x - x_0)\cos\theta + (y - y_0)\sin\theta$
(6) $y' = -(x - x_0)\sin\theta + (y - y_0)\cos\theta$

$\sigma_x$ and $\sigma_y$ provide the width of the Gaussian envelope in the $x'$ and $y'$ directions, while $f$ and $\phi$ parameterize the spatial frequency and phase of the sinusoid along the $x'$ axis. $A$ parameterizes the height of the Gaussian envelope.
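As an illustration, the Gabor of Equations 4 to 6 can be evaluated on a 20 × 20 pixel grid as follows. The sketch assumes the standard form with √2·σ in the Gaussian envelope and a rotation by θ; the function name and the parameter values in the example are arbitrary choices, not fitted values from the study.

```python
import numpy as np

def gabor(x, y, A, x0, y0, sigma_x, sigma_y, theta, f, phi):
    """2D Gabor evaluated at pixel coordinates (x, y)."""
    xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)     # rotated x (Equation 5)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)    # rotated y (Equation 6)
    envelope = np.exp(-(xr / (np.sqrt(2) * sigma_x)) ** 2
                      - (yr / (np.sqrt(2) * sigma_y)) ** 2)
    return A * envelope * np.cos(2 * np.pi * f * xr + phi)       # Equation 4

y, x = np.mgrid[0:20, 0:20].astype(float)
rf = gabor(x, y, A=1.0, x0=9.5, y0=9.5, sigma_x=3.0, sigma_y=4.0,
           theta=np.pi / 4, f=0.1, phi=0.0)
```

A function of this shape can be passed directly to a least-squares optimizer to fit the parameters to an RF.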

For each RF, the parameters (x0, y0, σx, σy, θ, f, ϕ) of the Gabor were fitted by minimizing the mean squared error between the Gabor model and the RF using the minFunc minimization package (http://www.cs.ubc.ca/~schmidtm/Software/minFunc.html). In order to avoid local minima, the fitting was performed in two steps. First, the spatial RF was converted to the spectral domain using a 2D Fourier transform. Since the Fourier transform of a 2D Gabor is a 2D Gaussian (Jones and Palmer, 1987), which is easier to fit, an estimate of many of the parameters was obtained by first fitting a 2D Gaussian in the spectral magnitude domain. Using the parameters obtained from the spectral fitting as initial estimates, a 2D Gabor was then fitted to the original RF in the spatial domain. The fitted parameters provided a good estimate of the units’ responses, with residual errors between the spatial responses and the corresponding Gabor fits being small and lacking spatial structure, and the median pixel-wise correlation coefficient of the Gabor fits for the temporal prediction model units was 0.88. Units whose fitted Gabors had a poor fit (those with a correlation coefficient <0.7; 214 units) were excluded from further analysis. We also excluded units with a high correlation coefficient (>0.7) if the centre position of the Gabor was estimated to be outside the RF, and hence only the Gabor’s tail was being fitted to the response (39 units), and those for which the estimated standard deviation of the Gaussian envelope in either x or y was <0.5 pixels, which meant very few non-negligible pixel values were used to constrain the parameters (146 units). Together, these exclusion criteria (which sometimes overlapped) led to 395 of the 1600 responsive units being excluded for the temporal prediction model.

2D spatiotemporal receptive fields

In order to better view their temporal characteristics we collapsed the 3D spatiotemporal real and model RFs (space-space-time) along a single spatial direction to create 2D spatiotemporal (space-time) representations (DeAngelis et al., 1993). First, we determined the 3D RFs’ optimal time step (the time step with the largest sum of squared values). We then used the Gabor parameterization of each unit at its optimal time step to obtain the rotation and translation that centre the RF on zero and place the oriented bars parallel to the y-axis. We applied this fixed transformation to each time step and collapsed the RF by summing the activity along the newly defined y-axis. The resulting 2D (space-time) RFs provide intuitive visualization of the RF across time, while losing minimal information. For the RFs of real neurons (Ohzawa et al., 1996), the most recent time step (40 ms) of the 3D and 2D spatiotemporal RFs was removed to account for the latency of V1 neurons (Figures 2c and 7a).

Estimating space-time separability

The population of model units contained both space-time (ST) separable and inseparable units. First the two spatial dimensions of the 20 × 20 × 7 3D RF were collapsed to a single vector to yield a single 400 × 7 matrix. The SVD of this matrix was then taken and the singular values examined. If the ratio between the second and first singular value was ≥0.5, the unit was deemed to be inseparable. Otherwise, the unit was deemed to be separable. Examining the 20 × 7 2D spatiotemporal RFs (obtained as outlined in the preceding section; Figure 4—figure supplement 3) showed this to be an accurate way of separating space-time separable and inseparable units.
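The separability test reduces to a few lines of linear algebra, sketched here. The two synthetic RFs (a counterphase ‘flashing’ grating and a drifting grating) are illustrative stand-ins, not model output, and the function name is an assumption.

```python
import numpy as np

def is_inseparable(rf3d, ratio_threshold=0.5):
    """rf3d: (20, 20, 7) space-space-time RF. True if the second singular value
    is at least ratio_threshold times the first."""
    mat = rf3d.reshape(-1, rf3d.shape[-1])        # 400 x 7 space-by-time matrix
    s = np.linalg.svd(mat, compute_uv=False)      # singular values, largest first
    return s[1] / s[0] >= ratio_threshold

x = np.arange(20)[None, :, None]                  # horizontal pixel position
t = np.arange(7)[None, None, :]                   # time step
ones = np.ones((20, 1, 1))                        # no variation along y

# Product of a spatial and a temporal profile: rank 1, hence separable.
flashing = ones * np.cos(2 * np.pi * 0.1 * x) * np.cos(2 * np.pi * t / 7)
# Grating whose phase shifts over time: rank 2, hence inseparable.
drifting = ones * np.cos(2 * np.pi * (0.1 * x - t / 7))
```

Here `is_inseparable(flashing)` evaluates to false and `is_inseparable(drifting)` to true, because the drifting grating is a sum of two space-time products with equal weight.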

Spatial RF structure

For comparison with the real V1 RF and previous theoretical studies, the width and length of our model’s RFs were measured relative to their spatial frequency (Ringach, 2002). Here, $n_y = \sigma_y f$ gives a measure of the length of the bars in the RF, while $n_x = \sigma_x f$ gives a measure of the number of oscillations of its sinusoidal component. Thus, in the $(n_y, n_x)$ plane, blob-like RFs with few cycles lie close to the origin, while stretched RFs with many subfields lie away from the origin. RFs with high values along the $n_x$ axis have many bars, while those far along the $n_y$ axis have long bars. As in Ringach (2002), only space-time separable units were included in this analysis.

Temporal weighting profile of the population

The mean power for each of the seven time steps of the RFs was examined for both real and model populations (Figure 7a). The temporal weighting profile was calculated as the mean, over space and the population, of the squared values of the 2D spatiotemporal RFs at each time step.

Tilt direction index

The tilt direction index (TDI) (DeAngelis et al., 1993; Pack et al., 2006; Anzai et al., 2001; Baker, 2001; Livingstone and Conway, 2007) of an RF is given by $(R_p - R_q)/(R_p + R_q)$, where $R_p$ is the amplitude at the peak of the 2D Fourier transform of the 2D spatiotemporal RF, found at spatial frequency $F_{space}$ and temporal frequency $F_{time}$. $R_q$ is the amplitude at $(-F_{space}, F_{time})$ in the 2D Fourier transform. The mean and standard deviations of TDI for experimental data for the cat (Baker, 2001) and macaque (Livingstone and Conway, 2007) were measured from data extracted from figures in the relevant references (Figure 11A and the low-contrast axis of Figure 3A in these papers respectively).
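A possible computation of the TDI from a 2D space-time RF is sketched below. It takes Rq at the mirrored spatial frequency (−Fspace, Ftime), which is an assumption about the intended definition, and it does no special handling of a peak at zero frequency; the function name and test gratings are likewise illustrative.

```python
import numpy as np

def tilt_direction_index(rf2d):
    """rf2d: (n_space, n_time) spatiotemporal RF. Returns (Rp - Rq) / (Rp + Rq)."""
    amp = np.abs(np.fft.fft2(rf2d))
    p_space, p_time = np.unravel_index(np.argmax(amp), amp.shape)
    r_p = amp[p_space, p_time]                       # amplitude at the spectral peak
    r_q = amp[-p_space % amp.shape[0], p_time]       # same temporal, mirrored spatial frequency
    return (r_p - r_q) / (r_p + r_q)

x = np.arange(20)[:, None]
t = np.arange(7)[None, :]
drifting = np.cos(2 * np.pi * (0.1 * x - t / 7))                 # tilted in space-time
flashing = np.cos(2 * np.pi * 0.1 * x) * np.cos(2 * np.pi * t / 7)  # untilted

tdi_drifting = tilt_direction_index(drifting)   # close to 1
tdi_flashing = tilt_direction_index(flashing)   # close to 0
```

A TDI near 1 indicates energy concentrated in one direction of motion, while a TDI near 0 indicates equal energy in both directions.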

Peak temporal frequency

The 2D spatiotemporal RFs were also useful for calculating further temporal response properties of the model. The temporal frequency was calculated as the peak temporal frequency of each spatiotemporal RF as measured from its 2D Fourier transform.
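The peak temporal frequency can be read off the same 2D amplitude spectrum, as in this sketch. The frame rate of 25 Hz follows from the 25 fps videos described in the Materials and methods; the function name and the synthetic RF are assumptions.

```python
import numpy as np

def peak_temporal_frequency(rf2d, frame_rate_hz=25.0):
    """rf2d: (n_space, n_time). Returns the temporal frequency in Hz at the
    peak of the 2D amplitude spectrum."""
    amp = np.abs(np.fft.fft2(rf2d))
    _, p_time = np.unravel_index(np.argmax(amp), amp.shape)
    freqs = np.fft.fftfreq(rf2d.shape[1], d=1.0 / frame_rate_hz)
    return abs(freqs[p_time])

x = np.arange(20)[:, None]
t = np.arange(7)[None, :]
rf2d = np.cos(2 * np.pi * (0.1 * x - t / 7))    # one temporal cycle per 7 frames

peak_hz = peak_temporal_frequency(rf2d)          # 25/7 Hz, about 3.57 Hz
```

With only seven time steps the frequency resolution is coarse (steps of 25/7 Hz), so values obtained this way are quantized.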

Code and data availability

All custom code used in this study was implemented in MATLAB and Python. We have uploaded the code to a public Github repository (Singer, 2018; copy archived at https://github.com/elifesciences-publications/temporal_prediction_model). The raw auditory experimental data is available at https://osf.io/ayw2p/. The movies and sounds used for training the models are all publicly available at the websites detailed in the Materials and methods.

References

  1. 1
  2. 2
  3. 3
    Spectro-temporal receptive fields of auditory neurons in the grassfrog. III. Analysis of the stimulus-event relation for natural stimuli
    1. AM Aertsen
    2. JH Olders
    3. PI Johannesma
    (1981)
    Biological Cybernetics 39:195–209.
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
    Sensory mechanisms, the reduction of redundancy, and intelligence
    1. HB Barlow
    (1959)
    In: D. V Blake, A. M Uttley, editors. The Mechanisation of Thought Processes. London: Her Majesty's Stationery Office. pp. 535–539.
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
    Auditory perceptual organization inside and outside the laboratory
    1. R Cusack
    2. R Carlyon
    (2004)
    In: J. G Neuhoff, editor. Ecological Psychoacoustics. Elsevier. pp. 15–48.
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
    Neural Engineering : Computation, Representation, and Dynamics in Neurobiological Systems
    1. C Eliasmith
    2. CH Anderson
    (2003)
    MIT Press.
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
     Treatise on Physiological Optics
    1. H Helmholtz
    (1962)
    Concerning the perceptions in general,  Treatise on Physiological Optics, 3rd ed, New York, Dover Publications.
  35. 35
    Predictive coding
    1. Y Huang
    2. RPN Rao
    (2011)
    Wiley Interdisciplinary Reviews: Cognitive Science 2:580–593.
    https://doi.org/10.1002/wcs.142
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
    Extracting Slow Subspaces from Natural Videos Leads to Complex Cells
    1. C Kayser
    2. W Einhäuser
    3. O Dümmer
    4. P König
    5. K Körding
    (2001)
    In: G Dorffner, H Bischof, K Hornik, editors. Artificial Neural Networks — ICANN 2001, 2130. Springer. pp. 1075–1080.
    https://doi.org/10.1007/3-540-44668-0_149
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
    Early processing of visual information
    1. D Marr
    (1976)
    Philosophical Transactions of the Royal Society B: Biological Sciences 275:483–519.
    https://doi.org/10.1098/rstb.1976.0090
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
    Biology and Diseases of the Ferret
    1. FR Nodal
    2. AJ King
    (2014)
    685–710, Hearing and Auditory Function in Ferrets, Biology and Diseases of the Ferret, John Wiley & Sons, Inc.
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
    Prediction as a candidate for learning deep hierarchical models of data
    1. RB Palm
    (2012)
    Technical University of Denmark, (DTU) Informatics.
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
    How linear are auditory cortical responses?
    1. M Sahani
    2. J Linden
    (2003)
    Advances in Neural Information Processing Systems 15:109–116.
  77. 77
  78. 78
    Characterization of neural responses with stochastic stimuli
    1. E Simoncelli
    2. JW Pillow
    3. L Paninski
    4. O Schwartz
    (2004)
    In: M Gazzaniga, editors. The Cognitive Neurosciences, III. MIT Press. pp. 327–338.
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
    An adaptive network that constructs and uses an internal model of its world
    1. RS Sutton
    2. AG Barto
    (1981)
    Cognition and Brain Theory 4:217–246.
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91

Decision letter

  1. Jack L Gallant
    Reviewing Editor; University of California, Berkeley, United States
  2. Sabine Kastner
    Senior Editor; Princeton University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Sensory cortex is optimized for prediction of future input" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Reviewing Editor Jack Gallant and Senior Editor Sabine Kastner. The following individuals involved in review of your submission have agreed to reveal their identity: Rhodri Cusack (Reviewer #1); Laurenz Wiskott (Reviewer #2); Christoph Zetzsche (Reviewer #3).

The reviewers have discussed the reviews with one another, and they and the Reviewing Editor agree that your paper is potentially suitable for publication in eLife after appropriate revision. The Reviewing Editor has drafted this decision to help you prepare a revised submission.

The Reviewing Editor hopes that you will address all of the concerns of the reviewers, most of which are straightforward. But to help you in revision the major issues are listed here:

1) If you look at the reviews you will see that the list of suggestions is very long, but the vast majority of the comments only ask for clarification, they do not require additional data analysis or substantial rewriting. It would be good if you could address all questions in your reply to reviewers and revise the text appropriately where necessary to provide necessary information to the reader.

2) The proposed model is interesting, but it has many components that may differentially contribute to the result, such as the nonlinearity or L1 regularization. The reviewers felt that the paper would be stronger if the specific effects of these various model component choices were analyzed in a bit more detail, in order to try to pin down more precisely why these components are important. (See for example suggestions of reviewer 2.)

3) In some places the choices made during modelling seemed arbitrary (e.g., the choice of temporal windows and the regularization parameters). Either grid search over a training set should be used to choose these parameters, or hyperparameter modelling should be performed to show that the specific values chosen are not critical, or the choices should be justified. The first of these is obviously most desirable.

4) Some of the figures are difficult to interpret (c.f. Figure 7B and Figure 7—figure supplement 2B). Please try to improve the figures where necessary.

5) Please address the concerns of reviewer 3 regarding the introductory material on temporal prediction and the neurobiological plausibility of the approach.

Reviewer #1:

This manuscript compares two stimulus coding principles that could explain the form of receptive fields in sensory cortex: efficient sparse coding and temporal prediction. A simple temporal prediction model was found to create receptive fields that were similar in many ways to those seen in sensory cortex. It performed much better than a sparse coding model.

This manuscript makes an interesting and important contribution. The "sparse coding" hypothesis is popular, both in neuroscience and in artificial intelligence, where autoencoding in deep-neural networks implements efficient coding. The authors argue that for dynamic stimuli, the need to predict what will happen next may be more important than merely encoding efficiently what has happened recently. Their model is simple and elegant, and the results are convincing.

I felt there were some places where choices were made during modelling that seemed arbitrary – such as the choice of temporal windows and the regularization parameters. The manuscript would be stronger if these choices were either justified, or hyperparameter modelling done to show that the specific values chosen are not critical, to allay concerns readers may have of "p-hacking".

I found the manuscript to be well-structured, thorough and well written, clearly conveying a convincing message.

Reviewer #2:

The paper presents a two-layer network optimized for predicting immediate future sensory input (auditory or visual) from recent past sensory input. The resulting spatio- or spectrotemporal receptive fields, i.e. weight vectors of the first layer, are analyzed and compared with physiological receptive fields. For model comparison, results from a sparse coding network are used.

The results show that the predicting neural network captures receptive field properties fairly well, in particular temporal structure is reproduced much better than by the sparse coding network.

The topic is interesting, and the results are highly relevant to the field. I must add, however, that I have not followed the field recently, so I cannot really tell, whether some similar work has been published recently. But the authors seem to have done a careful literature research and discuss alternative approaches fairly.

The paper is well structured and has obviously been written very carefully. I have rarely reviewed a manuscript that feels so ready for publication. So, I am tempted to recommend the paper for publication as it is.

There is just one issue I would invite the authors to consider a bit further: The claim of the paper is that the objective of temporal prediction results in the receptive field properties found. But there are additional factors, such as the nonlinearity and the L1 regularization, that contribute to it. The authors have investigated this to some extent. For example, they find that receptive fields are seriously degraded if the nonlinearity is replaced by a linear activation function. My suggestion is to try to pin down, what objective is implicitly added by the nonlinearity and the L1 regularization. I suggest to perform a similar experiment as in Figure 8, but with a sparseness or independence measure rather than final validation loss. This could also be done on the hidden units.

I suggest this, because I believe that temporal prediction alone does not do the trick. I feel it must be combined with some sparseness or independence objective to yield the receptive fields. And I feel that this missing objective is implicitly added by the nonlinearity and the L1 regularization. Making this more transparent would be great and the suggested experiment should be very easy to do.

Reviewer #3:

The authors propose a new principle for the development of cortical receptive fields which combines the concepts of predictive coding and sparse coding. They train a three-layer network with one hidden layer in order to predict the future visual spatial input or the future auditory spectro-temporal input from the recent spatio-/spectro-temporal input, subject to a sparsity constraint.

They perform this training for two examples: for an auditory network, based on training data which contain human speech, animal vocalization and inanimate natural sounds, and for a visual network based on training data with movies of wildlife in natural settings.

They compare the resulting networks to real cortical neurons from A1 and V1 and to an alternative sparse coding approach intended to provide a sparse representation of the complete spatio/spectro-temporal input.

For their comparison they consider the spectrotemporal and spatiotemporal receptive fields and various population measures, e.g. the temporal decay of power in the receptive fields, the temporal span of excitation and inhibition, orientation and frequency tuning properties, and receptive field dimensions.

Except for orientation, for which the majority of visual units is restricted in their orientation preference to horizontal and vertical orientations, the proposed model can capture neural tuning properties as well as the established models. And in case of the asymmetric emphasis of the most recent past it can even provide a better description.

In my opinion this is a quite interesting paper. First, it presents a novel approach which unifies the principles of sparse coding and of temporal prediction. This combination enables the explanation of a large set of spatio-/spectro-temporal tuning properties within one single integrated framework. Second, the authors have an important point in stressing the asymmetry of the temporal response with its emphasis of the most recent past, as observed in typical cortical neurons. This is indeed a property that other learning schemes, like sparse coding, by the very nature of their objective functions, cannot produce.

There are some points that, in my view, need to be clarified or described in more detail. In the following I describe the modifications and additions which I assume to be helpful in a revision of the paper. Due to my background I will put more emphasis on the visual aspects.

1) The description of the history of the concept of temporal prediction is not clear enough, both in the introduction and in the discussion. I am aware of the pressure for novelty in current science but in my view, there is sufficient novelty in the suggested model to allow the authors to avoid such ambiguities. Currently, the paper might be misinterpreted by a swift non-specialist reader as if the concept of the "prediction of the immediate future" is a novel principle being introduced here (Introduction; Discussion section: "We hypothesized"). Only a few selected papers are cited directly in this context (only Bialek in the Introduction), and in the Discussion section they are characterized as unspecific: "The temporal prediction principle we describe.… has been described in a very general manner"(reference only to Bialek, Palmer). Other references exist but are spread out through the further text. But of course, the principle as such has a long history, there are numerous papers which describe the prediction of the future sensory input as an important goal of neural information processing. I am no specialist, and this is not comprehensive but early examples are corollary discharge theories, and already Sutton and Barto, (1981) and Srinivasan et al., (1982), for example, considered the temporal dimension of prediction. Motion extrapolation has also been interpreted as prediction computed in visual cortex (Nijhawan, 1994). A further, canonical example of a method for the optimal prediction of the future sensory input is the Kalman filter, as considered by, e.g., Rao, (1999). I suggest that the authors devote one paragraph to the history of the concept, with all the appropriate references included there, and then make precisely clear in which aspects their novel contribution extends beyond these earlier approaches.

2) Neurobiological plausibility of the approach: I do not think that the authors have to be as clear about the neural implementation of the suggested architecture as the other predictive coding approaches, but at least some rough or speculative ideas should be presented: What is the status of the second-order units? Where in the cortex are they (V2?) and what do they encode? Really the future INPUT itself? That is, they have no selectivity, no tuning properties? Have such units been observed? Where and how is the prediction error computed? Does this model not require a bypass line which brings the retinal spatial input directly to V1 or V2 to enable the comparison?.…

3) The sparse coding data appear quite unusual. Why are the units not more "localized" in the temporal dimension, in particular for the visual model? Furthermore, it seems as if the visual sparse model is not used in the usual overcomplete regime. I also would have expected a more concentrated distribution of the temporal span of excitation and inhibition for a typical sparse coding model. Please discuss this in the paper.

4) Subsection “Model receptive fields” by inspection: is there really no other possibility to determine the optimal hyperparameters of the sparse coding model? A fit to the neural data? You have Figure 8—figure supplement 1B anyway. Why have you not made use of it? One could use only a training subset, if this seems a critical issue. And one can include KS measures of other tuning properties.

5) Subsection “Addition of Gaussian noise”: Noise. For me the use of noise in this investigation is somewhat unclear. First, it seems to favor the prediction model over the sparse model, which is more susceptible to noise. Second, the noise level used appears unusually strong (is this dB?). This issue should be clearly motivated and discussed in the main text of the paper.

6) Subsection “Model receptive fields”: I am a bit skeptical with respect to the sign-flipping of excitation and inhibition. The argument that the signs could as well be flipped if this is done for the first-order and the second-order units alike appears only valid because any specification and relation to real neurons is omitted for the second-order units. In fact, this sign-flipping will inevitably imply a prediction of how excitation and inhibition operate in the second-order units.

Furthermore, if this argumentation would be accepted then one could arbitrarily flip signs to the desired result in any learning model, because one can always argue that appropriate sign flips at some subsequent processing stages could compensate for this. Used in this way, excitation and inhibition would lose any meaning.

It is perfectly ok for me if a model is agnostic with respect to the correct prediction of excitation and inhibition; a model does not have to be perfect in all aspects. But if this is the case this should be clearly visible for the reader in the presented receptive field plots. (This does not exclude the use of an appropriate sign-flip in population measures.)

7) Figure 4—figure supplement 2 and Figure 4—figure supplement 3: Two separate populations? Visual inspection of these figures suggests the possible existence of two distinct populations. Is this related to separability, or blob-like units, or both? (Is ordering according to separability?) I am not sure about the current state in the field, but I remember a discussion about the existence of two distinct populations as opposed to a continuous distribution for separability. This issue should be described and discussed.

8) ibid. The percentage of blob-like units appears quite high. Is this percentage comparable to the neural data? Or only if the mouse data, which are special in this respect, are being included?

9) Figure 4—figure supplement 6 The percentage of blob-like units seems to be substantially reduced in comparison to the noisy case (Figure 4—figure supplement 2). Is there a systematic relation between high noise levels and the emergence of blob-like units? Please discuss this issue in the paper.

10) ibid. It is difficult to understand the relation between Figure 4—figure supplement 2 and Figure 7C as opposed to Figure 4—figure supplement 6 and Figure 7—figure supplement 2F. Can you describe how properties of the receptive field plots relate to properties of the population distribution in these two cases?

11) Figure 7 and others Spatiotemporal population properties: Although the article is about spatiotemporal processing it provides only two population measures of purely static spatial properties and one of a purely temporal property for the vision case. Spatiotemporal measures of particular interest would be: DSI (directional selectivity index)/TDI (tilt direction index) (direction selectivity is considered to be a major spatiotemporal property of visual cortex); if possible: a scatter plot of temporal frequency vs. spatial frequency; population distribution of motion tuning. The necessary data for these plots should be already available.

12) Figure 7B and Figure 7—figure supplement 2B: The orientation scatter plot is visually difficult to interpret. The quantitative degree of concentration of the preferred orientations on the vertical and horizontal orientations as opposed to the oblique orientations remains unclear. Please provide either an orientation histogram or the percentage of units which fall into the 30-60, 120-150 deg range. In both cases the lowest spatial frequencies should be omitted for the analysis.

13) Figure 7—figure supplement 1: I am a bit surprised that only 289/400 sparse-coding units can be fitted by a Gabor function. Why is this? Usually most sparse coding units have a good Gabor fit. And with such a high percentage excluded I see the risk of a systematic bias regarding certain tuning parameters.

14) Figure 7C and Figure 7—figure supplement 2F seem to indicate that the model produces two distinct sub-populations with respect to receptive field parameters n_x and n_y. It should be discussed whether this is the case, and if yes, whether it is systematically related to other parameters (selectivities) of the units. Could this be related to the two apparent sub-populations regarding separability, blob-like shapes, cf. Figure 4—figure supplement 2? And has such a tendency also been observed in neural data? In contrast, Ringach, (2004) claimed clustering around a one-dimensional curve. Please describe and discuss in text.

15) Figure 8 and subsection “Implementation details”: Are we expected to see here that 1600 hidden units are a distinguished optimum? Does this figure not tell us that the prediction error does not substantially depend on the number? And when a biological system could achieve basically the same prediction quality with 100 neurons why should it then invest 1500 additional units for such a small advantage?

16) Figure 8 and Figure 8—figure supplement 1. The comparison between the models prediction capability and the similarity of the model units to real units should not only be presented for the auditory neurons but also for the visual neurons (according to the text the data seem to be already available).

17) Subsection “Optimising predictive capacity” It is not clear whether the similarity measure (only the span of temporal and frequency tuning is considered) is fair or biased for the comparison with the sparse coding model. What will happen, for example, if the distribution of orientation preference would be used instead (for the visual model)? Would then the sparse coding model appear more similar to the real neurons than the prediction model?

Please motivate and discuss.

18) Discussion section: I cannot follow the argumentation that the class of prediction models (or this specific model) should somehow be unique with respect to the ability to provide an independent criterion for the selection of hyperparameters. Should this somehow be a principle, or a logical conclusion? Or an empirical observation only for this special case? Is it logically impossible that we find a measure, for example something entropy-related or whatever, for sparse coding that could have the same status? Please clarify.

19) Subsection “Visual normative models”: does not refer to prediction of the future: Why not? Is not the temporal prediction made in these models that the future spatial pattern is the same as the previous spatial pattern?

20) Subsection “Visual normative models”: The model is selective, throws away information, as opposed to the information preserving properties of other models. But is this really a good strategy for *early* stages of a multi-stage information processing system? Does the principle of least commitment not suggest just the opposite strategy?

21) For the Discussion section it would be of interest to consider the contributions from the two components of the prediction model, i.e., which properties of the units are genuinely caused by prediction and which are more due to the sparse coding part?

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Sensory cortex is optimized for prediction of future input" for further consideration at eLife. Your revised article has been favorably evaluated by Sabine Kastner (Senior Editor/Reviewing Editor), and two reviewers.

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

The authors have addressed my comments in a careful manner. In particular, I appreciate that they made considerable and appropriate changes to text and figures instead of just providing arguments in the response letter. From my view, the manuscript is basically ready for publication. I have a few final minor suggestions.

1) … We examined linear aspects of the tuning of the output units for the visual temporal prediction model using a response-weighted average to white noise input and found punctate un-oriented RFs that decay into the past …

This is interesting. Can you mention this somewhere in the text?

2) I understand that the model, by its very nature would not care about the sign. But the fact remains that you have an output of a model and you post hoc manipulate this output to obtain a "better suited" presentation (e.g., to ease comparison). My only point is that it should be totally clear to even a superficial reader that such a post hoc change has been applied. So please just include an appropriate sentence that makes this clear, e.g.:

Note that the model does not care about the sign (excitation/inhibition) and thus provides no systematic prediction of it. We hence switched the signs of the respective receptive fields of the model output appropriately to obtain receptive fields which all have positive leading excitation.

3) Can you mention this alternative goal of least commitment somewhere in the discussion? And the empirical question.

https://doi.org/10.7554/eLife.31557.034

Author response

The reviewers have discussed the reviews with one another, and they and the Reviewing Editor agree that your paper is potentially suitable for publication in eLife after appropriate revision. The Reviewing Editor has drafted this decision to help you prepare a revised submission.

The Reviewing Editor hopes that you will address all of the concerns of the reviewers, most of which are straightforward. But to help you in revision the major issues are listed here:

1) If you look at the reviews you will see that the list of suggestions is very long, but the vast majority of the comments only ask for clarification, they do not require additional data analysis or substantial rewriting. It would be good if you could address all questions in your reply to reviewers and revise the text appropriately where necessary to provide necessary information to the reader.

We hope that our explanations in this document and clarifications throughout the main text of the paper address these points. We have addressed all of the comments in this document and made corresponding changes in the text.

2) The proposed model is interesting, but it has many components that may differentially contribute to the result, such as the nonlinearity or L1 regularization. The reviewers felt that the paper would be stronger if the specific effects of these various model component choices were analyzed in a bit more detail, in order to try to pin down more precisely why these components are important. (See for example suggestions of reviewer 2.)

In preparing the original manuscript, we performed a grid search over the L1 regularization parameter and number of hidden units. The effects of changing these parameters on the predictive capacity of the model are shown for the auditory model in Figure 8. We did not previously show the effects of these parameters on the receptive field structure. We have now produced interactive figures which illustrate this for visual and auditory RFs (Figure 8—figure supplement 2 and Figure 8—figure supplement 3; https://yossing.github.io/temporal_prediction_model/figures/interactive_supplementary_figures.html). The effects of these parameters are now described in the main text (see paragraph below).

We have also explored the effects of the activation function (nonlinearity). We implemented versions of the network with tanh and rectified linear activation functions and found that the choice of nonlinearity does not have a decisive effect on RF structure. However, if a linear activation is used, the RFs do not look like those seen in V1 or A1, as discussed in the main text (see paragraph below).

We have also explored other components of the model (see the response to point 3). However, two components of the model appear to be particularly important for prediction and for obtaining RFs that match the biology: having a non-linearity and having the correct amount of L1 weight regularization. We suspect that there are two reasons why appropriate L1 regularization is likely to be important: first, to avoid overfitting and hence find the most predictive code, and second, to mimic the efficiency constraints on connectivity of the nervous system due to space and energy limitations. We suspect that a nonlinearity is likely to be important because the future input depends non-linearly on the past input.

Reviewer #2 makes the interesting point that although the sparseness or independence of the hidden unit activities is not an explicit goal of the model, this may emerge implicitly in cases where the model’s RFs are most similar to the neural data (and where prediction error is lowest). To test for this, we measured the sparsity of the trained model’s hidden unit activities (by the measure of Vinje and Gallant, (2000)) in response to the natural input validation set. Examining the relationship between predictive capacity and sparsity, over a range of L1 weight regularization strength and hidden units, we do not find a clear monotonic relationship. Indeed, the hidden unit and L1 regularization combination with the best prediction was not the sparsest model, but of intermediate sparsity over the span we explored.
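For readers unfamiliar with the sparsity measure mentioned above, the following is a minimal sketch of the Vinje and Gallant (2000) sparseness index, S = (1 − (Σr/n)² / (Σr²/n)) / (1 − 1/n). The function name is hypothetical, and whether the index was computed across stimuli for each unit or across units for each stimulus is not stated here, so the sketch simply takes one list of non-negative responses.

```python
def vinje_gallant_sparseness(responses):
    """Sparseness index in [0, 1]; 0 = uniform responses, 1 = one active entry.

    `responses` is a list of non-negative activities (an assumption of this
    sketch; the index is normally applied to firing rates or unit outputs).
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses")
    mean = sum(responses) / n
    mean_sq = sum(r * r for r in responses) / n
    if mean_sq == 0.0:
        return 0.0  # convention chosen for this sketch when all responses are zero
    activity_ratio = (mean * mean) / mean_sq
    return (1.0 - activity_ratio) / (1.0 - 1.0 / n)


# Illustrative inputs only: a uniform response and a one-hot response.
dense = vinje_gallant_sparseness([0.5, 0.5, 0.5, 0.5])
sparse = vinje_gallant_sparseness([0.0, 0.0, 0.0, 1.0])
```

A uniform response vector gives an index of 0 and a response concentrated in a single entry gives 1, so intermediate values correspond to the "intermediate sparsity" described above.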

Addressing these points and others, we have now added a new section to the Results section titled “Variants of the temporal prediction model”. The relevant part at the start of this new section reads:

“The change in the qualitative structure of the RFs as a function of the number of hidden units and L1 regularization strength, for both the visual and auditory models, can be seen in the interactive supplementary figures (Figure 8—figure supplements 2-3; https://yossing.github.io/temporal_prediction_model/figures/interactive_supplementary_figures.html) [...] The RFs also did not change form or polarity over time, but simply decayed into the past.”.

3) In some places the choices made during modelling seemed arbitrary (e.g., the choice of temporal windows and the regularization parameters). Either grid search over a training set should be used to choose these parameters, or hyperparameter modelling should be performed to show that the specific values chosen are not critical, or the choices should be justified. The first of these is obviously most desirable.

We have explored the effects of all of the crucial parameters, and we explain these results in the new section “Variants of the temporal prediction model”.

Regularization parameter and hidden unit number: In all the model results shown in the original manuscript, we did not choose the values of these parameters, but used the parameters which provided the optimal prediction within a grid search (see response to point 2, above).
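The selection procedure described here (take the grid point with the best prediction on held-out data) can be sketched as follows. This is only an illustration: `select_hyperparameters`, `toy_validation_loss`, and the grid values are hypothetical stand-ins, and in practice the loss function would train a network and evaluate its prediction error on the validation set.

```python
import itertools
import math


def select_hyperparameters(l1_strengths, hidden_unit_counts, validation_loss):
    """Return (loss, l1, n_hidden) for the grid point with the lowest loss."""
    best = None
    for l1, n_hidden in itertools.product(l1_strengths, hidden_unit_counts):
        loss = validation_loss(l1, n_hidden)
        if best is None or loss < best[0]:
            best = (loss, l1, n_hidden)
    return best


def toy_validation_loss(l1, n_hidden):
    # Hypothetical smooth surface standing in for "train, then measure
    # prediction error on held-out data"; it is not taken from the paper.
    return (math.log10(l1) + 4.0) ** 2 + ((n_hidden - 400) / 400.0) ** 2


# Illustrative grid only; these are not the values used by the authors.
best_loss, best_l1, best_hidden = select_hyperparameters(
    [1e-6, 1e-5, 1e-4, 1e-3], [100, 200, 400, 800], toy_validation_loss
)
```

Because the criterion is prediction error on held-out data rather than similarity to neural recordings, the chosen grid point does not depend on the data the model is later compared against.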

Temporal window into the past: We have explored the effect of different temporal windows. As can be seen in Figure 6B and Figure 7A, most of the energy of the real and model RFs is within 100ms (20 time steps) in the auditory case and within 160ms (4 time steps) in the visual case. Hence, the temporal window into the past was chosen to be slightly larger than these values, and so long as the window into the past is sufficiently long, its length is not critical.

Temporal window into the future: We have explored the effect of different temporal windows into the future. We found that increasing the number of time steps being predicted had little effect on the RFs of the auditory model either qualitatively or by the KS similarity measure. In the visual case, it caused the RFs to be more restricted in space and increased the proportion of blob-like units.

Noise: In the Supplementary Material of the original manuscript we included figures showing that the effects of the input noise were subtle and not-critical to the form of RF that we see. We now discuss this in the main Results section as requested by reviewer #3.

Nonlinearity: Although a non-linearity is critical to achieve our results, with the RFs appearing quite different and less like those seen in V1 or A1 when no non-linearity is used, the exact choice of non-linearity (sigmoid, tanh, or rectified linear) was not critical, and the RFs seen were similar (see our response to point 2 above).
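To make the comparison of activation functions concrete, here is a minimal sketch of a one-hidden-layer predictor with a switchable nonlinearity and an L1 penalty on the weights. The linear output layer, the mean-squared-error term, and all names (`predict`, `objective`, `ACTIVATIONS`) are assumptions of this sketch and not a statement of the authors' implementation.

```python
import math

# Candidate hidden-unit nonlinearities; "linear" is the degenerate case.
ACTIVATIONS = {
    "sigmoid": lambda a: 1.0 / (1.0 + math.exp(-a)),
    "tanh": math.tanh,
    "relu": lambda a: max(0.0, a),
    "linear": lambda a: a,
}


def predict(past, w_in, b_hidden, w_out, b_out, activation="sigmoid"):
    """Map a vector of past input to a vector of predicted future input."""
    f = ACTIVATIONS[activation]
    hidden = [
        f(sum(w * x for w, x in zip(row, past)) + b)
        for row, b in zip(w_in, b_hidden)
    ]
    # Linear read-out of the hidden layer (assumed for this sketch).
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]


def objective(prediction, future, w_in, w_out, l1_strength):
    """Mean squared prediction error plus an L1 penalty on all weights."""
    mse = sum((p - t) ** 2 for p, t in zip(prediction, future)) / len(future)
    l1 = sum(abs(w) for row in w_in for w in row)
    l1 += sum(abs(w) for row in w_out for w in row)
    return mse + l1_strength * l1


# Tiny illustrative network: 2 inputs, 2 hidden units, 1 output.
w_in, b_hidden = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_out, b_out = [[1.0, 1.0]], [0.0]
linear_out = predict([1.0, -2.0], w_in, b_hidden, w_out, b_out, "linear")
relu_out = predict([1.0, -2.0], w_in, b_hidden, w_out, b_out, "relu")
```

With the "linear" option the whole network collapses to a single linear map of the past input, which is the case reported above as producing RFs unlike those in V1 or A1.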

In addition to the parts of the new section given in response to the Editor’s previous point, the relevant part of this new section (subsection “Visual normative models”) reads:

“The temporal prediction model and sparse coding model results shown in the main figures of this paper were trained on inputs with added Gaussian noise (6dB SNR), mimicking inherent noise in the nervous system. To determine the effect of adding this noise, all models were also trained without noise, producing similar results (Figure 4—figure supplements 5–7; Figure 5—figure supplements 3–5; Figure 6—figure supplement 1; Figure 7—figure supplements 2-3). The results were also robust to changes in the duration of the temporal window being predicted. We trained the auditory model to predict a span of either 1, 3, 6, or 9 time steps into the future and the visual model to predict 1, 3 or 6 time steps into the future. For the auditory case, we found that increasing the number of time steps being predicted had little effect on the RF structure, both qualitatively and by the KS measure of similarity to the real data. In the visual case, Gabor-like units were present in all cases. Increasing the number of time steps made the RFs more restricted in space and increased the proportion of blob-like RFs.”
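As an illustration of what a 6 dB signal-to-noise ratio means for the added noise, the sketch below scales zero-mean Gaussian noise so that the ratio of signal power to noise power is 10^(6/10), roughly 4. It assumes SNR is defined as a power ratio with signal power measured as the mean square of the input; the function name and the sine-wave test signal are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng):
    """Return `signal` plus zero-mean Gaussian noise at the requested SNR (dB)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Illustrative use on a synthetic signal (not the natural stimuli of the paper).
rng = random.Random(0)
clean = [math.sin(0.01 * i) for i in range(20000)]
noisy = add_gaussian_noise(clean, 6.0, rng)
```

At this level the noise standard deviation is about half the root-mean-square amplitude of the signal, which gives a sense of why the reviewer regarded the noise as strong.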

4) Some of the figures are difficult to interpret (c.f. Figure 7B and Figure 7—figure supplement 2B). Please try to improve the figures where necessary.

We have changed these figures, as suggested by the reviewers, in order to make them more interpretable. We have added an additional panel to Figure 7 and corresponding supplementary figures showing a histogram of orientation tuning preferences for the model units. We have also added insets to Figure 6 and corresponding supplementary figures.

5) Please address the concerns of reviewer 3 regarding the introductory material on temporal prediction and the neurobiological plausibility of the approach.

We have now added the following paragraph to the Introduction on predictive normative models of sensory processing.

“The idea that prediction is an important component of perception dates at least as far back as Helmholtz18,19, although what is meant by prediction and the purpose it serves is quite varied between models incorporating it20,21. […] Our model relates to the predictive information approach in that it is optimized to predict the future from the past, but it has a combination of characteristics, such as a non-linear encoder and sparse weight regularization, which have not previously been explored for such an approach.”

We have also extended the paragraph exploring the neurobiological plausibility of the model (subsection “Strengths and limitations of the temporal prediction model”). It now reads:

“Finally, it is interesting to consider possible more explicit biological bases for our model. We envisage the input units of the model as thalamic input, and the hidden units as primary cortical neurons. […] Finally, it is important to note that, although the biological plausibility of backpropagation has long been questioned, recent progress has been made in developing trainable networks that perform similarly to artificial neural networks trained with backpropagation, but with more biologically plausible characteristics77, for example, by having spikes or avoiding the weight transport problem78.”

Reviewer #1:

This manuscript compares two stimulus coding principles that could explain the form of receptive fields in sensory cortex: efficient sparse coding and temporal prediction. A simple temporal prediction model was found to create receptive fields that were similar in many ways to those seen in sensory cortex. It performed much better than a sparse coding model.

This manuscript makes an interesting and important contribution. The "sparse coding" hypothesis is popular, both in neuroscience and in artificial intelligence, where autoencoding in deep-neural networks implements efficient coding. The authors argue that for dynamic stimuli, the need to predict what will happen next may be more important than merely encoding efficiently what has happened recently. Their model is simple and elegant, and the results are convincing.

Thank you.

I felt there were some places where choices were made during modelling that seemed arbitrary – such as the choice of temporal windows and the regularization parameters. The manuscript would be stronger if these choices were either justified, or hyperparameter modelling done to show that the specific values chosen are not critical, to allay concerns readers may have of "p-hacking".

In order to address this point, we have performed a grid search over a large space of hyperparameter values. The results of this search are outlined above in response to the editor’s third point. From these plots, and the associated text, one can see that there is substantial robustness to the exact modelling choices.

The duration of the temporal window into the past is relatively unimportant because most of the energy is in the most recent few steps (Figure 6B and Figure 7A), and so long as it is long enough to capture this span it suffices. We have now explored the effect of the duration of the temporal window into the future, and as we mention in our reply to point 3 of the editor, this has little effect on the RFs. The regularization parameter was chosen by selecting the value that best predicts the future of a held-out validation set (see Figure 8 and the corresponding subsection “Optimizing Predictive Capacity”).

I found the manuscript to be well-structured, thorough and well written, clearly conveying a convincing message.

Thank you.

Reviewer #2:

The paper presents a two-layer network optimized for predicting immediate future sensory input (auditory or visual) from recent past sensory input. The resulting spatio- or spectrotemporal receptive fields, i.e. weight vectors of the first layer, are analyzed and compared with physiological receptive fields. For model comparison, results from a sparse coding network are used.

The results show that the predicting neural network captures receptive field properties fairly well, in particular temporal structure is reproduced much better than by the sparse coding network.

The topic is interesting, and the results are highly relevant to the field. I must add, however, that I have not followed the field recently, so I cannot really tell, whether some similar work has been published recently. But the authors seem to have done a careful literature research and discuss alternative approaches fairly.

The paper is well structured and has obviously been written very carefully. I have rarely reviewed a manuscript that feels so ready for publication. So, I am tempted to recommend the paper for publication as it is.

We appreciate the reviewer’s very positive comments.

There is just one issue I would invite the authors to consider a bit further: The claim of the paper is that the objective of temporal prediction results in the receptive field properties found. But there are additional factors, such as the nonlinearity and the L1 regularization, that contribute to it. The authors have investigated this to some extent. For example, they find that receptive fields are seriously degraded if the nonlinearity is replaced by a linear activation function. My suggestion is to try to pin down, what objective is implicitly added by the nonlinearity and the L1 regularization. I suggest to perform a similar experiment as in Figure 8, but with a sparseness or independence measure rather than final validation loss. This could also be done on the hidden units.

I suggest this, because I believe that temporal prediction alone does not do the trick. I feel it must be combined with some sparseness or independence objective to yield the receptive fields. And I feel that this missing objective is implicitly added by the nonlinearity and the L1 regularization. Making this more transparent would be great and the suggested experiment should be very easy to do.

We have now examined the effects of these choices in some detail -- see our response to the second point of the Editor.

Reviewer #3:

The authors propose a new principle for the development of cortical receptive fields which combines the concepts of predictive coding and sparse coding. They train a three-layer network with one hidden layer in order to predict the future visual spatial input or the future auditory spectro-temporal input from the recent spatio-/spectro-temporal input, subject to a sparsity constraint.

They perform this training for two examples: for an auditory network, based on training data which contain human speech, animal vocalization and inanimate natural sounds, and for a visual network based on training data with movies of wildlife in natural settings.

They compare the resulting networks to real cortical neurons from A1 and V1 and to an alternative sparse coding approach intended to provide a sparse representation of the complete spatio/spectro-temporal input.

For their comparison they consider the spectrotemporal and spatiotemporal receptive fields and various population measures, e.g. the temporal decay of power in the receptive fields, the temporal span of excitation and inhibition, orientation and frequency tuning properties, and receptive field dimensions.

Except for orientation, for which the majority of visual units is restricted in their orientation preference to horizontal and vertical orientations, the proposed model can capture neural tuning properties as well as the established models. And in case of the asymmetric emphasis of the most recent past it can even provide a better description.

In my opinion this is a quite interesting paper. First, it presents a novel approach which unifies the principles of sparse coding and of temporal prediction. This combination enables the explanation of a large set of spatio-/spectro-temporal tuning properties within one single integrated framework. Second, the authors have an important point in stressing the asymmetry of the temporal response with its emphasis of the most recent past, as observed in typical cortical neurons. This is indeed a property that other learning schemes, like sparse coding, by the very nature of their objective functions, cannot produce.

Thank you.

There are some points that, in my view, need to be clarified or described in more detail. In the following I describe the modifications and additions which I assume to be helpful in a revision of the paper. Due to my background I will put more emphasis on the visual aspects.

1) The description of the history of the concept of temporal prediction is not clear enough, both in the introduction and in the discussion. I am aware of the pressure for novelty in current science but in my view, there is sufficient novelty in the suggested model to allow the authors to avoid such ambiguities. Currently, the paper might be misinterpreted by a swift non-specialist reader as if the concept of the "prediction of the immediate future" is a novel principle being introduced here (Introduction; Discussion section: "We hypothesized"). Only a few selected papers are cited directly in this context (only Bialek in the Introduction), and in the Discussion section they are characterized as unspecific: "The temporal prediction principle we describe.… has been described in a very general manner"(reference only to Bialek, Palmer). Other references exist but are spread out through the further text. But of course, the principle as such has a long history, there are numerous papers which describe the prediction of the future sensory input as an important goal of neural information processing. I am no specialist, and this is not comprehensive but early examples are corollary discharge theories, and already Sutton and Barto, (1981) and Srinivasan et al., (1982), for example, considered the temporal dimension of prediction. Motion extrapolation has also been interpreted as prediction computed in visual cortex (Nijhawan, 1994). A further, canonical example of a method for the optimal prediction of the future sensory input is the Kalman filter, as considered by, e.g., Rao, (1999). I suggest that the authors devote one paragraph to the history of the concept, with all the appropriate references included there, and then make precisely clear in which aspects their novel contribution extends beyond these earlier approaches.

We have added a new paragraph to the Introduction to address this. This paragraph points out the novel aspects of our model’s structure. See our response to the fifth point of the Editor. We have also expanded the final paragraph of the Introduction to set out a central novel contribution of our model, which is that it successfully explains temporal properties of both V1 and A1 neurons, something previous models have not been able to do. The expanded part of the final paragraph (Introduction) reads: “Here we show using qualitative and quantitative comparisons that, for both V1 and A1 RFs, these shortcomings are largely overcome by the temporal prediction approach. This stands in contrast to previous models”

2) Neurobiological plausibility of the approach: I do not think that the authors have to be as clear about the neural implementation of the suggested architecture as the other predictive coding approaches, but at least some rough or speculative ideas should be presented: What is the status of the second-order units? Where in the cortex are they (V2?) and what do they encode? Really the future INPUT itself? That is, they have no selectivity, no tuning properties? Have such units been observed? Where and how is the prediction error computed? Does this model not require a bypass line which brings the retinal spatial input directly to V1 or V2 to enable the comparison?.…

We have now addressed this more fully in the paper -- see our response to the fifth point of the Editor. We now describe how the output units of our model would represent the prediction of the input or the prediction error and note that signals relating to prediction error have been found in A1 (Rubin et al., 2016). These properties of the output units are non-linear and wouldn’t be fully captured by linear RF methods. We examined linear aspects of the tuning of the output units for the visual temporal prediction model using a response-weighted average to white noise input and found punctate un-oriented RFs that decay into the past. We suspect that no bypass line would be required, just appropriate temporal asymmetries in input and/or synaptic plasticity mechanisms.

3) The sparse coding data appear quite unusual. Why are the units not more "localized" in the temporal dimension, in particular for the visual model?

For the auditory data, our results are not unusual, see Carlson et al., (2012) and Carlin and Elhilali, (2013), where RFs that fully span the temporal dimension are common.

For the visual data, our results are also not unusual. If you examine the figure of van Hateren and Ruderman, (1998), they only show four units. It is not clear whether the units they show are reflective of the tuning properties of the entire population or have been selected because they are temporally localised. In the sparse coding results we presented, although we also see some units which are temporally localised, the majority were not. We have now run the sparse coding model in the overcomplete regime as is commonly done in the literature, and while the majority of spatiotemporal RFs are still not temporally localized, there are more examples that are. Notably, in a later paper looking at ICA of spatiotemporal visual inputs by Hyvärinen et al. (2003), which does show the full set of RFs, there are very few examples of units whose RFs are temporally localized, while the vast majority are not.

Furthermore, it seems as if the visual sparse model is not used in the usual overcomplete regime.

We have now run the sparse coding model with more hidden units and restricted the results presented in both the visual and auditory cases to configurations where the sparse coding models were run in the overcomplete regime. We have added a sentence to clarify this in the Materials and methods section. It reads:

“In both cases, the model configurations chosen were restricted to those trained in an overcomplete condition (having more units than the number of input variables) in order to remain consistent with previous instantiations of this model [4, 5, 11]…we selected a sparse coding network with 1600 units…in the auditory case (Figure 5 and Figure 6). In the visual case, the network selected was trained with 3200 units, λ=100.5, learning rate = 0.05 and 100 mini-batches.”

I also would have expected a more concentrated distribution of the temporal span of excitation and inhibition for a typical sparse coding model. Please discuss this in the paper.

Our results are in keeping with previous studies in this regard (see Carlson et al., (2012) and Carlin and Elhilali, (2013)). We have added a sentence to the Results section highlighting this point. It reads: “The sparse coding model shows a wide range of temporal spans of excitation and inhibition, in keeping with previous studies [11, 48].”

4) Subsection “Model receptive fields” by inspection: is there really no other possibility to determine the optimal hyperparameters of the sparse coding model? A fit to the neural data? You have Figure 8—figure supplement 1B anyway. Why have you not made use of it? One could use only a training subset, if this seems a critical issue. And one can include KS measures of other tuning properties.

This is a good point. We simply chose the ones that, by eye, presented the sparse coding model in the best light. However, if we choose the model in which RFs lie at the minimum by the KS measure in the auditory case, they do not look much different from the ones we chose. We have now changed the set of sparse coding model units in the auditory case to those that are most similar to the real neurons according to the KS measure, while still being overcomplete. In the visual case, as we do not perform a KS measure due to the limited amount of data, particularly pertaining to temporal response properties (see response to point sixteen below), this model was chosen by inspection. We have changed the text in subsection “Sparse coding model” to reflect this. It now reads: “…we chose the network that produced basis functions whose receptive fields were most similar to those of real neurons. In the auditory case, this was determined using the mean KS measure of similarity (Figure 8—figure supplement 1). In the visual case, as a similarity measure was not performed, this was done by inspection.”
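To make the KS-based selection concrete, the sketch below computes a two-sample Kolmogorov-Smirnov statistic between a model distribution and a neural distribution, then averages it over several tuning properties. It is a minimal illustration assuming NumPy; the function names and the idea of passing the properties as dictionaries are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    grid = np.concatenate([a, b])  # evaluate both CDFs at every observed value
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def mean_ks(model_props, neural_props):
    """Average the KS statistic over every property present in both dicts.

    The dict layout (property name -> 1D array of values) is an assumption
    made for this sketch.
    """
    shared = sorted(set(model_props) & set(neural_props))
    return float(np.mean([ks_statistic(model_props[k], neural_props[k])
                          for k in shared]))
```

A lower mean value indicates that the model population's distributions lie closer to those of the recorded neurons, so the network with the smallest value would be the one selected.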

5) Subsection “Addition of Gaussian noise”: Noise. For me the use of noise in this investigation is somewhat unclear. First, it seems to favor the prediction model over the sparse model, which is more susceptible to noise. Second, the noise level used appears unusually strong (is this dB?). This issue should be clearly motivated and discussed in the main text of the paper.

The temporal prediction and sparse coding models were also run without noise, and the results in all cases were similar. We have now added a paragraph in the new Results section which motivates and discusses the influence of the noise in the model as suggested by the reviewer. The paragraph in subsection “Variants of the temporal prediction model” reads: “The temporal prediction model and sparse coding model results shown in the main figures of this paper were trained on inputs with added Gaussian noise (6dB SNR), mimicking inherent noise in the nervous system. To determine the effect of adding this noise, all models were also trained without noise, producing similar results (Figure 4—figure supplements 5-7; Figure 5—figure supplements 3–5; Figure 6—figure supplement 1; Figure 7—figure supplements 2–3).”

We have also added clarifying text to the caption of Figure 7—figure supplement 2 highlighting the main effect of the addition of noise to the quantitative results in the visual case. It reads: “The addition of noise only leads to subtle changes in the RFs; most apparently, there are more units with RFs comprising multiple short subfields (forming an increased number of points towards the lower right quadrant of g) than is seen in the case when noise is used.”

We have also added text to the caption of Figure 6—figure supplement 1 highlighting the main effects of the addition of noise to the model in the auditory case. It reads: “The addition of noise leads to subtle changes in the RFs. Without noise, the inhibition in the temporal prediction model tends to be slightly less extended and the RFs a little less smooth (see Figure 4, Figure 4—figure supplement 5 for qualitative comparison).”

The addition of noise does not seem to negatively impact the sparse coding model’s results. For the sparse coding model in the visual case, the results are much the same with noise as without (Figure 5—figure supplement 1, Figure 5—figure supplement 2, Figure 5—figure supplement 4, Figure 5—figure supplement 5 and Figure 7—figure supplement 2 and Figure 7—figure supplement 3). In the auditory case without added noise, the RFs tended to have somewhat smoother backgrounds, but were otherwise much the same in form as when noise was added (Figure 5, Figure 5—figure supplement 3). Quantitative comparison of the models trained on auditory inputs without added noise shows only subtle differences from the case with noise (Figure 6, Figure 6—figure supplement 1).

The signal-to-noise ratio (SNR) given was the variance of the signal divided by the variance of the noise, not in decibels. Apologies that this was not clearer. We have now given the SNR in decibels, a more conventional measure. In decibels, the SNR is 6dB. Hence the noise is weaker than may have been implied.
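As a worked illustration of this conversion, a variance ratio maps to decibels as 10·log10(ratio), so 6 dB corresponds to a signal variance roughly four times the noise variance. The sketch below assumes NumPy and a synthetic Gaussian signal; the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def snr_to_db(signal_var, noise_var):
    """Convert a variance (power) ratio to decibels."""
    return 10.0 * np.log10(signal_var / noise_var)

def add_noise_at_snr_db(x, snr_db, rng):
    """Add zero-mean Gaussian noise so that var(x) / var(noise) matches snr_db."""
    noise_var = np.var(x) / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_var), size=x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)            # synthetic stand-in for an input signal
noisy = add_noise_at_snr_db(x, 6.0, rng)
measured_db = snr_to_db(np.var(x), np.var(noisy - x))
```

Here `measured_db` should come out close to 6, which confirms that the noise scaling and the decibel conversion are inverses of each other.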

6) Subsection “Model receptive fields”: I am a bit skeptical with respect to the sign-flipping of excitation and inhibition. The argument that the signs could as well be flipped if this is done for the first-order and the second-order units alike appears only valid because any specification and relation to real neurons is omitted for the second-order units. In fact, this sign-flipping will inevitably imply a prediction of how excitation and inhibition operate in the second-order units.

Furthermore, if this argumentation would be accepted then one could arbitrarily flip signs to the desired result in any learning model, because one can always argue that appropriate sign flips at some subsequent processing stages could compensate for this. Used in this way, excitation and inhibition would lose any meaning.

It is perfectly ok for me if a model is agnostic with respect to the correct prediction of excitation and inhibition, a model does not have to be perfect in all aspects. But if this is the case this should be clearly visible for the reader in the presented receptive field plots. (This does not exclude the use of an appropriate sign-flip in population measures.)

We are agnostic as to what the biological analogue of the output units is (see reply to the Editor’s fifth point). Therefore, the only units for which we make a direct comparison with data are the hidden units.

The sign flipping makes no difference to the function of the network, as we describe in subsection “Model receptive fields”. This degeneracy arises because, in the network, the units’ activation functions are symmetric, whereas for real neurons, high firing rates are meaningfully different from low ones. We therefore need to choose between two equivalent descriptions of the receptive fields. We agree that flipping the signs of each pixel in each RF or of each RF independently to maximize similarity to the data would be arbitrary and unjustified. However, this is not what we do. What we do instead is make all the units have positive leading excitation, which reflects the structure of the majority of cortical units and allows the reader to make a visual comparison to the data.
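The degeneracy can be checked numerically. The sketch below assumes a hidden-layer nonlinearity that is odd-symmetric about zero (tanh is used purely for illustration and may differ from the nonlinearity in the paper), and the function names are hypothetical. Negating one hidden unit's input weights and bias together with its output weights leaves the network output unchanged.

```python
import numpy as np

def forward(x, w_in, b_hidden, w_out, b_out):
    """Two-layer network: input -> hidden (tanh, an assumption) -> linear output."""
    hidden = np.tanh(w_in @ x + b_hidden)
    return w_out @ hidden + b_out

def flip_unit(w_in, b_hidden, w_out, j):
    """Negate the input weights, bias and output weights of hidden unit j."""
    w_in, b_hidden, w_out = w_in.copy(), b_hidden.copy(), w_out.copy()
    w_in[j, :] *= -1.0
    b_hidden[j] *= -1.0
    w_out[:, j] *= -1.0
    return w_in, b_hidden, w_out

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 12, 5, 4
w_in = rng.normal(size=(n_hidden, n_in))
b_hidden = rng.normal(size=n_hidden)
w_out = rng.normal(size=(n_out, n_hidden))
b_out = rng.normal(size=n_out)
x = rng.normal(size=n_in)

y_original = forward(x, w_in, b_hidden, w_out, b_out)
w_in_f, b_hidden_f, w_out_f = flip_unit(w_in, b_hidden, w_out, j=2)
y_flipped = forward(x, w_in_f, b_hidden_f, w_out_f, b_out)
```

The two outputs agree to numerical precision even though the receptive field of unit 2 has the opposite sign, which is why a convention such as positive leading excitation is needed to display the RFs consistently.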

7) Figure 4—figure supplement 2 and Figure 4—figure supplement 3: Two separate populations? Visual inspection of these figures suggests the possible existence of two distinct populations. Is this related to separability, or blob-like units, or both? (Is ordering according to separability?) I am not sure about the current state in the field, but I remember a discussion about the existence of two distinct populations as opposed to a continuous distribution for separability. This issue should be described and discussed.

We discovered that some of the movie snippets we were using to train the model contained writing at a very high contrast. These examples have since been removed and the results updated with RFs obtained when the model was trained on inputs without any of these writing snippets included. In the updated results, the subpopulation of small blob-like units specified by the reviewer is no longer present (Figure 4—figure supplement 2 and Figure 4—figure supplement 3).

8) ibid. The percentage of blob-like units appears quite high. Is this percentage comparable to the neural data? Or only if the mouse data, which are special in this respect, are being included?

See above – this subpopulation of small, blob-like units is no longer present in the results presented (Figure 4—figure supplement 2 and Figure 4—figure supplement 3).

9) Figure 4—figure supplement 6: The percentage of blob-like units seems to be substantially reduced in comparison to the noisy case (Figure 4—figure supplement 2). Is there a systematic relation between high noise levels and the emergence of blob-like units? Please discuss this issue in the paper.

As mentioned in response to the previous two points, we discovered and removed visual training examples that contained writing. In the updated results, the subpopulation of small blob-like units that differentiated the results in the noise and non-noise cases is no longer present and the results are now more similar between the two cases.

10) ibid. It is difficult to understand the relation between Figure 4—figure supplement 2 and Figure 7C as opposed to Figure 4—figure supplement 6 and Figure 7—figure supplement 2F. Can you describe how properties of the receptive field plots relate to properties of the population distribution in these two cases?

We now discuss these distributions in more detail. In subsection “Quantitative analysis of visual results”, we write: “The distribution of units extends along a curve from blob-like RFs, which lie close to the origin in this plot, to stretched RFs with several subfields, which lie further from the origin.” In the caption of Figure 7—figure supplement 2, we write “The addition of noise in both cases only leads to subtle changes in the RFs; most apparently, there are more units with RFs comprising multiple short subfields (forming an increased number of points towards the lower right quadrant of g) than is seen in the case when noise is used.”

11) Figure 7 and others Spatiotemporal population properties: Although the article is about spatiotemporal processing it provides only two population measures of purely static spatial properties and one of a purely temporal property for the vision case. Spatiotemporal measures of particular interest would be: DSI (directional selectivity index)/TDI (tilt direction index) (direction selectivity is considered to be a major spatiotemporal property of visual cortex); if possible: a scatter plot of temporal frequency vs. spatial frequency; population distribution of motion tuning. The necessary data for these plots should be already available.

We have now measured the TDI of the model population and added a section specifying the result in the main text (subsection “Quantitative analysis of visual results”). It reads: “In addition to this, we measured the tilt direction index (TDI) of the model units from their 2D spatiotemporal RFs. This index indicates spatiotemporal asymmetry in space-time RFs and correlates with direction selectivity [41, 54–57]. The mean TDI for the population was 0.33 (0.29 SD), comparable with the ranges in the neural data (mean 0.16; 0.12 SD in cat area 17/18 [57], mean 0.51; 0.30 SD in macaque V1 [56]).”

An explanatory paragraph has also been added to Materials and methods section. It reads: “The tilt direction index (TDI) [41, 54–57] of an RF is given by (Rp − Rq)/(Rp + Rq), where Rp is the amplitude at the peak of the 2D Fourier transform of the 2D spatiotemporal RF, found at spatial frequency Fspace and temporal frequency Ftime. Rq is the amplitude at (Fspace, -Ftime) in the 2D Fourier transform. The mean and standard deviations of TDI for experimental data for the cat [57] and macaque [56] were measured from data extracted from figures in the relevant references (Figure 11A and the low-contrast axis of Figure 3A respectively).”
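The TDI definition above can be sketched in a few lines. This is a minimal illustration assuming NumPy and an RF stored as a 2D array with space along the first axis and time along the second; ignoring the zero-frequency (DC) component when locating the peak is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def tilt_direction_index(rf):
    """TDI = (Rp - Rq) / (Rp + Rq) for a 2D RF of shape (n_space, n_time)."""
    amp = np.abs(np.fft.fft2(rf))
    search = amp.copy()
    search[0, 0] = 0.0                       # assumption: skip the DC term
    i, j = np.unravel_index(np.argmax(search), search.shape)
    r_p = amp[i, j]                          # peak at (Fspace, Ftime)
    r_q = amp[i, (-j) % rf.shape[1]]         # amplitude at (Fspace, -Ftime)
    total = r_p + r_q
    return 0.0 if total == 0 else float((r_p - r_q) / total)

x = np.arange(16)[:, None]
t = np.arange(16)[None, :]
# A tilted (space-time inseparable) pattern and a separable one.
drifting = np.cos(2 * np.pi * (2 * x / 16 + 3 * t / 16))
separable = np.cos(2 * np.pi * 2 * x / 16) * np.cos(2 * np.pi * 3 * t / 16)
tdi_drifting = tilt_direction_index(drifting)
tdi_separable = tilt_direction_index(separable)
```

The tilted pattern gives a TDI near 1 and the separable pattern a TDI near 0, matching the interpretation of the index as a measure of spatiotemporal asymmetry.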

We also describe the relationship between the spatial and temporal frequency of the model units in the Results section.

12) Figure 7B and Figure 7—figure supplement 2B: The orientation scatter plot is visually difficult to interpret. The quantitative degree of concentration of the preferred orientations on the vertical and horizontal orientations as opposed to the oblique orientations remains unclear. Please provide either an orientation histogram or the percentage of units which fall into the 30-60, 120-150 deg range. In both cases the lowest spatial frequencies should be omitted for the analysis.

A histogram showing the distribution of orientation tuning preferences of the model units has now been added to Figure 7 and corresponding supplementary figures.

13) Figure 7—figure supplement 1: I am a bit surprised that only 289/400 sparse-coding units can be fitted by a Gabor function. Why is this? Usually most sparse coding units have a good Gabor fit. And with such a high percentage excluded I see the risk of a systematic bias regarding certain tuning parameters.

Sparse coding of images produces mostly Gabor-like filters; however, even in this case, some filters are at the edge of the pixel grid and cannot be reliably fitted. When training on movies, Hyvärinen et al., (2003) also appear to show a similar proportion of units that are not Gabor-like when the full dataset is inspected. In van Hateren and Ruderman, (1998) they only show four example units, so it is difficult to assess how many of the units in the full population are Gabor-like.

14) Figure 7C and Figure 7—figure supplement 2F seem to indicate that the model produces two distinct sub-populations with respect to receptive field parameters n_x and n_y. It should be discussed whether this is the case, and if yes, whether it is systematically related to other parameters (selectivities) of the units. Could this be related to the two apparent sub-populations regarding separability, blob-like shapes, cf. Figure 4—figure supplement 2? And has such a tendency also been observed in neural data? In contrast, Ringach, (2004) claimed clustering around a one-dimensional curve. Please describe and discuss in text.

Examination of density plots of these figures suggests that the vast majority of units are spread around a one-dimensional curve, as is the case in Ringach, (2002). However, there is a slight wing formed by a small number of units that extends rightwards from the main curve, but this does not form a distinct separate cluster. We have added a sentence describing this subpopulation to the main text (subsection “Quantitative analysis of visual results”). It reads: “The distribution of units extends along a curve from blob-like RFs, which lie close to the origin in this plot, to stretched RFs with several subfields, which lie further from the origin. A small proportion of the population have RFs with several short subfields, forming a wing from the main curve in Figure 7D.”

As mentioned in response to points 7-9 of the reviewer above, the apparent subpopulation of small blob-like units is no longer present in the new results. Hence, the wing in Figure 7D is not related to this.

15) Figure 8 and subsection “Implementation details”: Are we expected to see here that 1600 hidden units are a distinguished optimum? Does this figure not tell us that the prediction error does not substantially depend on the number? And when a biological system could achieve basically the same prediction quality with 100 neurons why should it then invest 1500 additional units for such a small advantage?

As the reviewer points out, in the auditory case, there is not a big change in the performance of the model or in the similarity of its RFs for an increase in the number of units, as can be seen from Figure 8. Nevertheless, a shallow minimum does exist when this parameter is varied and we took the network that gave an absolute minimum as measured by the mean squared error prediction on a held-out validation set so as to be unbiased in our selection process. Furthermore, in the visual case, the number of hidden units seems to play a bigger role both in the network’s performance on the prediction task and on the shapes of the RFs produced. This should now be clear from the additional interactive supplementary figure (Figure 8—figure supplement 3; https://yossing.github.io/temporal_prediction_model/figures/interactive_supplementary_figures.html).
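The selection rule described here amounts to taking the hyperparameter setting with the lowest held-out error, without reference to the neural data. A minimal sketch follows; the function name is hypothetical and the error values are invented placeholders for illustration only, not results from the paper.

```python
def select_by_validation_error(validation_mse):
    """Return the hyperparameter setting with the lowest held-out MSE."""
    return min(validation_mse, key=validation_mse.get)

# Keys are (number of hidden units, L1 regularization strength).
# All numbers below are made-up placeholders, NOT values from the paper.
validation_mse = {
    (100, 1e-3): 0.412,
    (400, 1e-3): 0.405,
    (1600, 1e-3): 0.401,
    (1600, 1e-2): 0.430,
}
best = select_by_validation_error(validation_mse)
```

Even when the minimum is shallow, as in this placeholder grid, the rule picks a single setting in a way that does not depend on how neural-like the resulting RFs look.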

16) Figure 8 and Figure 8—figure supplement 1. The comparison between the model's prediction capability and the similarity of the model units to real units should not only be presented for the auditory neurons but also for the visual neurons (according to the text the data seem to be already available).

Our lab is primarily focused on auditory processing, hence, we were able to easily obtain recordings from a large population of neurons in A1. However, it was harder to find a large population of spatiotemporal RFs of V1 neurons to compare to, despite requests to multiple labs who focus on visual cortex. We were kindly provided with the spatiotemporal RF data from 8 neurons recorded in V1 of cats by Ohzawa et al., but did not feel that a meaningful quantitative comparison of the kind shown in Figure 8 could be performed with so few neurons.

17) Subsection “Optimising predictive capacity”: It is not clear whether the similarity measure (only the span of temporal and frequency tuning is considered) is fair or biased for the comparison with the sparse coding model. What will happen, for example, if the distribution of orientation preference would be used instead (for the visual model)? Would then the sparse coding model appear more similar to the real neurons than the prediction model?

Please motivate and discuss.

In comparison to many other papers examining normative models of auditory RFs, we do far more in the way of quantitative comparisons both to other models and to the real data. We took the span of temporal and frequency tuning as reasonable measures of similarity although we concede that other measures could, of course, be chosen. It should be noted that in an attempt to be as fair as possible to the sparse coding model, we did not include measures of similarity (for instance of the proportion of power contained in each time step, as seen in Figure 6B) that seemed to obviously favour our model. Evidence for our hypothesis is also bolstered by the analysis of the effects of multidimensional scaling (Figure 6A), which is a nonparametric measure of similarity.

We have modified the Discussion section to highlight the point raised by the reviewer. The text now reads: “Finally, the more accurate the temporal prediction model is at prediction, the more its RFs tend to be like real neuronal RFs by the measures we use for comparison.”

For the visual data, we did not have sufficient data to compare with the models, particularly in the temporal domain, as we describe above. Quantitative comparison with measurements of data from different datasets and species, recorded using different methods, is no substitute for a single consolidated dataset, which we could not obtain from other labs. Hence, we decided it better to only do this for the auditory dataset and take a more descriptive approach for the visual model.

18) Discussion section: I cannot follow the argumentation that the class of prediction models (or this specific model) should somehow be unique with respect to the ability to provide an independent criterion for the selection of hyperparameters. Should this somehow be a principle, or a logical conclusion? Or an empirical observation only for this special case? Is it logically impossible that we find a measure, for example something entropy-related or whatever, for sparse coding that could have the same status? Please clarify.

It is conceivable that some independent reason for picking the hyperparameters for the sparse coding model could be found that also provides the most neural-like RFs. However, in the previous literature, no such measure is used or proposed. Instead, the hyperparameters, which can greatly affect the nature of the RF structures, are typically chosen in order to produce RFs that look most like the real neural data. Similarly, no obvious measure is apparent to us. To clarify our argument, the following sentence has been added to the end of the paragraph the reviewer refers to (subsection “Optimising predictive capacity”). It reads: “To our knowledge, no such effective, measurable, independent criterion for hyperparameter selection has been proposed for other normative models of RFs.”

19) Subsection “Visual normative models”: does not refer to prediction of the future: Why not? Is not the temporal prediction made in these models that the future spatial pattern is the same as the previous spatial pattern?

Because of the new paragraph in the Introduction that sets out what is meant by predictive coding and temporal prediction, we feel that this paragraph is now largely redundant. The paragraph and sentence to which the reviewer refers have therefore been removed.

20) Subsection “Visual normative models”: The model is selective, throws away information, as opposed to the information preserving properties of other models. But is this really a good strategy for *early* stages of a multi-stage information processing system? Does the principle of least commitment not suggest just the opposite strategy?

It could be true that the structure of sensory RFs is governed by the principle of least commitment, preserving all information. Conversely, it could be the case that temporal prediction is a more important principle governing the structure of sensory RFs. This is an empirical question. We hope that our results provide some good evidence in favour of the latter hypothesis.

21) For the Discussion section it would be of interest to consider the contributions from the two components of the prediction model, i.e., which properties of the units are genuinely caused by prediction and which are more due to the sparse coding part?

We have added a sentence to the Results section clarifying the nature of the sparsity in our model. It reads: “Note that this sparsity constraint differs from that used in sparse coding models, in that it is applied to the weights rather than the activity of the units, being more like a constraint on the wiring between neurons than a constraint on their firing rates.”

We have also examined the specific role of the regularization in the model, as shown in Figure 8—figure supplement 2 and Figure 8—figure supplement 3 (https://yossing.github.io/temporal_prediction_model/figures/interactive_supplementary_figures.html), and discussed this in detail in the responses to the Editor’s second and third points. We find that all three components of the model (prediction, L1 regularization of the weights and nonlinearity) are essential for producing V1- or A1-like RFs.
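The distinction between the two kinds of sparsity can be sketched as follows. This is a minimal illustration, not the authors' implementation: the single hidden layer without biases, the rectifying nonlinearity, and all names (`W_in`, `W_out`, `lam`) are assumptions made for the example.

```python
import numpy as np


def temporal_prediction_loss(W_in, W_out, past, future, lam):
    """Prediction error plus an L1 penalty on the weights.

    past:   (n_samples, n_past_features)   recent input
    future: (n_samples, n_future_features) input to be predicted
    The rectifier below is a stand-in for whatever nonlinearity the model uses.
    """
    hidden = np.maximum(0.0, past @ W_in)          # nonlinearity (assumed ReLU)
    predicted = hidden @ W_out                     # prediction of future input
    mse = np.mean((predicted - future) ** 2)       # prediction term
    l1_weights = lam * (np.abs(W_in).sum() + np.abs(W_out).sum())
    return mse + l1_weights                        # sparsity acts on the wiring


def activity_sparsity_penalty(W_in, past, lam):
    """For contrast: a sparse-coding-style penalty on unit activity."""
    hidden = np.maximum(0.0, past @ W_in)
    return lam * np.abs(hidden).mean()             # sparsity acts on firing
```

In the first function the penalty depends only on the weight matrices, so it constrains the connections regardless of the stimulus. In the second it depends on the responses to each input.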

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:

The authors have addressed my comments in a careful manner. In particular, I appreciate that they made considerable and appropriate changes to text and figures instead of just providing arguments in the response letter. From my view, the manuscript is basically ready for publication. I have a few final suggestions.

1) … We examined linear aspects of the tuning of the output units for the visual temporal prediction model using a response-weighted average to white noise input, and found punctate un-oriented RFs that decay into the past …

This is interesting. Can you mention this somewhere in the text?

We have now added the following sentence to the Results section.

“We examined linear aspects of the tuning of the output units for the visual temporal prediction model using a response-weighted average to white noise input, and found punctate non-oriented RFs that decay into the past.”
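A response-weighted average of the kind mentioned here can be sketched as below. This is an illustration under stated assumptions rather than the analysis code used for the paper: the mean subtraction, the array layout and the toy unit (`true_filter`, a rectified linear response) are hypothetical choices.

```python
import numpy as np


def response_weighted_average(stimuli, responses):
    """Average the stimuli, each weighted by the response it evoked.

    stimuli:   (n_trials, n_features) white-noise inputs, one row per trial
    responses: (n_trials,) response of one unit to each input
    """
    stimuli = np.asarray(stimuli, dtype=float)
    responses = np.asarray(responses, dtype=float)
    centred = responses - responses.mean()         # assumed: remove mean response
    return centred @ stimuli / len(responses)


# Toy check on a hypothetical unit with a known linear filter.
rng = np.random.default_rng(0)
noise = rng.standard_normal((20000, 6))            # Gaussian white noise
true_filter = np.array([0.0, 0.0, 1.0, 0.5, 0.0, 0.0])
toy_responses = np.maximum(0.0, noise @ true_filter)
rwa = response_weighted_average(noise, toy_responses)
```

For Gaussian white noise and a unit that applies a fixed nonlinearity to a linear projection, the resulting estimate is expected to be proportional to the linear filter, which is why it captures only the linear aspects of the tuning.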

2) I understand that the model, by its very nature would not care about the sign. But the fact remains that you have an output of a model and you post hoc manipulate this output to obtain a "better suited" presentation (e.g., to ease comparison). My only point is that it should be totally clear to even a superficial reader that such a post hoc change has been applied. So please just include an appropriate sentence that makes this clear, e.g.:

Note that the model does not care about the sign (excitation/inhibition) and thus provides no systematic prediction of it. We hence switched the signs of the respective receptive fields of the model output appropriately to obtain receptive fields which all have positive leading excitation.

We have added the following text to the legend of Figure 2D:

“Note that the overall sign of a receptive field learned by the model is arbitrary. Hence, in all figures and analyses we multiplied each model receptive field by -1 where appropriate to obtain receptive fields which all have positive leading excitation (see Materials and methods section).”
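The sign convention can be illustrated with a short sketch. The exact criterion is the one given in the Materials and methods section. The rule used below (flip when the largest-magnitude weight at the most recent time step is negative) and the assumption that time runs along the last axis are hypothetical stand-ins for it.

```python
import numpy as np


def flip_to_positive_leading(rf, time_axis=-1):
    """Multiply an RF by -1 if its leading (most recent) peak is negative.

    rf: array of weights with time along `time_axis`; the last index
        along that axis is assumed to be the most recent time step.
    """
    rf = np.asarray(rf, dtype=float)
    latest = np.ravel(np.take(rf, -1, axis=time_axis))   # most recent slice
    peak = latest[np.argmax(np.abs(latest))]             # largest-magnitude weight
    return -rf if peak < 0 else rf


# Toy RF (frequency x time) whose most recent peak is inhibitory.
toy_rf = np.array([[0.1, -0.9],
                   [0.0, 0.2]])
flipped = flip_to_positive_leading(toy_rf)
```

Because only the overall sign changes, the spatial or spectral structure and the time course of each RF are unaffected by this step.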

3) Can you mention this alternative goal of least commitment somewhere in the discussion? And the empirical question.

After further consideration, our thoughts on this point have become more nuanced. The principle of least commitment requires not doing something that may later have to be undone. Whether the temporal prediction hypothesis is in conflict with least commitment is unclear, and a detailed exploration of this is beyond the scope of this paper. According to the temporal prediction hypothesis, aspects of the past which never influence the future will never be of use to an animal, and thus it could be argued that not encoding those aspects will never need to be undone, and hence there is no conflict. However, specific models instantiating temporal prediction may have limited capacity to identify predictive information, and thus may discard some information that may be useful in the future, and hence may run into conflict with the principle of least commitment. It is also the case that, given limited brain capacities, commitment is required at some point in the brain, and the temporal prediction principle may provide a good mechanism to decide what to commit to representing and what to discard. Hence, it is a complicated empirical and theoretical question as to whether and when the principles are in conflict or congruence, and, if in conflict, under what conditions one is dominant. To reflect this more nuanced view we have now added the following text to the Discussion section:

“There is an open question as to whether the current model may eliminate some information that is useful for reconstruction of the past input or for prediction of higher order statistical properties of the future input, which might bring it into conflict with the principle of least commitment [69]. It is an empirical question how much organisms preserve information that is not predictive of the future, although there are theoretical arguments against such preservation [2]. Such conflict might be remedied, and the model improved, by adding feedback from higher areas or by adding an objective [4–6, 60] to reconstruct the past or present in addition to predicting the future.”

https://doi.org/10.7554/eLife.31557.035

Article and author information

Author details

  1. Yosef Singer

    Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-4480-0574
  2. Yayoi Teramoto

    Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Software, Formal analysis, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-3419-0351
  3. Ben DB Willmore

    Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Data curation, Software, Formal analysis, Supervision, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-2969-7572
  4. Jan WH Schnupp

    Department of Biomedical Sciences, City University of Hong Kong, Kowloon Tong, Hong Kong
    Contribution
    Resources, Supervision, Funding acquisition, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared
  5. Andrew J King

    Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Supervision, Funding acquisition, Writing—review and editing
    Competing interests
    Senior Editor, eLife
    ORCID: 0000-0001-5180-7179
  6. Nicol S Harper

    Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Validation, Investigation, Methodology, Writing—original draft, Writing—review and editing, Funding acquisition
    For correspondence
    nicol.harper@dpag.ox.ac.uk
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-7851-4840

Funding

Clarendon Fund

  • Yosef Singer
  • Yayoi Teramoto

Wellcome (WT10525/Z/14/Z)

  • Yayoi Teramoto

Wellcome (WT076508AIA)

  • Ben DB Willmore
  • Andrew J King
  • Nicol S Harper

Wellcome (WT108369/Z/2015/Z)

  • Ben DB Willmore
  • Andrew J King
  • Nicol S Harper

Wellcome (WT082692)

  • Nicol S Harper

University of Oxford

  • Nicol S Harper

Action on Hearing Loss (PA07)

  • Nicol S Harper

Biotechnology and Biological Sciences Research Council (BB/H008608/1)

  • Nicol S Harper

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

Nicol Harper was supported by a Sir Henry Wellcome Postdoctoral Fellowship (WT082692) and other Wellcome Trust funding (WT076508AIA, WT108369/Z/2015/Z), by the Department of Physiology, Anatomy and Genetics at the University of Oxford, by Action on Hearing Loss (PA07), and by the Biotechnology and Biological Sciences Research Council (BB/H008608/1). Yosef Singer and Yayoi Teramoto were supported by the Clarendon Fund. Yayoi Teramoto was supported by the Wellcome Trust (10525/Z/14/Z). Andrew King and Ben Willmore were supported by the Wellcome Trust (WT076508AIA, WT108369/Z/2015/Z). We thank Bruno Olshausen for discussions on his model.

Ethics

Animal experimentation: Auditory RFs of neurons were recorded in the primary auditory cortex (A1) and anterior auditory field (AAF) of 5 pigmented ferrets of both sexes (all > 6 months of age) and used as a basis for comparison with the RFs of model units trained on auditory stimuli. These recordings were performed under license from the UK Home Office and were approved by the University of Oxford Committee on Animal Care and Ethical Review. Full details of the recording methods are described in earlier studies (Willmore et al., 2016; Bizley et al., 2009). Briefly, we induced general anaesthesia with a single intramuscular dose of medetomidine (0.022 mg·kg⁻¹·h⁻¹) and ketamine (5 mg·kg⁻¹·h⁻¹), which was then maintained with a continuous intravenous infusion of medetomidine and ketamine in saline. Oxygen was supplemented with a ventilator, and we monitored vital signs (body temperature, end-tidal CO₂, and the electrocardiogram) throughout the experiment. The temporal muscles were retracted, a head holder was secured to the skull surface, and a craniotomy and a durotomy were made over the auditory cortex. Extracellular recordings were made using silicon probe electrodes (Neuronexus Technologies) and acoustic stimuli were presented via Panasonic RPHV27 earphones, which were coupled to otoscope specula that were inserted into each ear canal, and driven by Tucker-Davis Technologies System III hardware (48 kHz sample rate).

Senior Editor

  1. Sabine Kastner, Princeton University, United States

Reviewing Editor

  1. Jack L Gallant, University of California, Berkeley, United States

Publication history

  1. Received: August 25, 2017
  2. Accepted: June 16, 2018
  3. Accepted Manuscript published: June 18, 2018 (version 1)
  4. Accepted Manuscript updated: June 22, 2018 (version 2)
  5. Version of Record published: August 24, 2018 (version 3)
  6. Version of Record updated: September 18, 2018 (version 4)

Copyright

© 2018, Singer et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • Page views: 3,949
  • Downloads: 685
  • Citations: 4

Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.
