Population codes enable learning from few examples by shaping inductive bias


Abstract

Learning from a limited number of experiences requires suitable inductive biases. To identify how inductive biases are implemented in and shaped by neural codes, we analyze sample-efficient learning of arbitrary stimulus-response maps from arbitrary neural codes with biologically-plausible readouts. We develop an analytical theory that predicts the generalization error of the readout as a function of the number of observed examples. Our theory illustrates in a mathematically precise way how the structure of population codes shapes inductive bias, and how a match between the code and the task is crucial for sample-efficient learning. It elucidates a bias to explain observed data with simple stimulus-response maps. Using recordings from the mouse primary visual cortex, we demonstrate the existence of an efficiency bias towards low-frequency orientation discrimination tasks for grating stimuli and low spatial frequency reconstruction tasks for natural images. We reproduce the discrimination bias in a simple model of primary visual cortex, and further show how invariances in the code to certain stimulus variations alter learning performance. We extend our methods to time-dependent neural codes and predict the sample efficiency of readouts from recurrent networks. We observe that many different codes can support the same inductive bias. By analyzing recordings from the mouse primary visual cortex, we demonstrate that biological codes have lower total activity than other codes with identical bias. Finally, we discuss implications of our theory in the context of recent developments in neuroscience and artificial intelligence. Overall, our study provides a concrete method for elucidating inductive biases of the brain and promotes sample-efficient learning as a general normative coding principle.

Editor's evaluation

This important study presents a theory of generalization in neural population codes and proposes sample efficiency as a new normative principle. The theory can be used to identify the set of 'easily learnable' stimulus-response mappings from neural data and makes strong behavioral predictions that can be evaluated experimentally. Overall, the new method for elucidating inductive biases of the brain is highly compelling and will be of interest to theoretical and experimental neuroscientists working towards understanding how the cortex works.

https://doi.org/10.7554/eLife.78606.sa0

Introduction

The ability to learn quickly is crucial for survival in a complex and ever-changing environment, and the brain effectively supports this capability. Often, only a few experiences are sufficient to learn a task, whether acquiring a new word (Carey and Bartlett, 1978) or recognizing a new face (Peterson et al., 2009). Despite the importance and ubiquity of sample-efficient learning, our understanding of the brain’s information encoding strategies that support this faculty remains poor (Tenenbaum et al., 2011; Lake et al., 2017; Sinz et al., 2019).

In particular, when learning and generalizing from past experiences, and especially from few experiences, the brain relies on implicit assumptions it carries about the world, or its inductive biases (Wolpert, 1996; Sinz et al., 2019). Reliance on inductive bias is not a choice: inferring a general rule from finite observations is an ill-posed problem which requires prior assumptions since many hypotheses can explain the same observed experiences (Hume, 1998). Consider learning a rule that maps photoreceptor responses to a prediction of whether an observed object is a threat or is neutral. Given a limited number of visual experiences of objects and their threat status, many threat-detection rules are consistent with these experiences. By choosing one of these threat-detection rules, the nervous system reveals an inductive bias. Without the right biases that suit the task at hand, successful generalization is impossible (Wolpert, 1996; Sinz et al., 2019). In order to understand why we can quickly learn to perform certain tasks accurately but not others, we must understand the brain’s inductive biases (Tenenbaum et al., 2011; Lake et al., 2017; Sinz et al., 2019).

In this paper, we study sample-efficient learning and inductive biases in a general neural circuit model which comprises a population of sensory neurons and a readout neuron learning a stimulus-response map with a biologically-plausible learning rule (Figure 1A). For this circuit and learning rule, inductive bias arises from the nature of the neural code for sensory stimuli, specifically its similarity structure. While different population codes can encode the same stimulus variables and allow learning of the same output with perfect performance given infinitely many samples, learning performance can depend dramatically on the code when restricted to a small number of samples, where the reliance on and the effect of inductive bias are strong (Figure 1B, C and D). Given the same sensory examples and their associated response values, the readout neuron may make drastically different predictions depending on the inductive bias set by the nature of the code, leading to successful or failing generalizations (Figure 1C and D). We say that a code and a learning rule, together, have a good inductive bias for a task if the task can be learned from a small number of examples.

Figure 1


Learning tasks through linear readouts exploits representations of the population code to approximate a target response.

(A) The readout weights from the population to a downstream neuron, shown in blue, are updated to fit target values $y$, using the local, biologically plausible delta rule. (B) Examples of tuning curves for two different population codes: smooth tuning curves (Code 1) and rapidly varying tuning curves (Code 2). (C) (Left) A target function with low frequency content is approximated through the learning rule shown in A using these two codes. The readout from Code 1 (turquoise) fits the target function (black) almost perfectly with only $P = 12$ training examples, while the readout from Code 2 (purple) does not accurately approximate the target function. (Right) However, when the number of training examples is sufficiently large ($P = 120$), the target function is estimated perfectly by both codes, indicating that both codes are equally expressive. (D) The same experiment is performed on a task with higher frequency content. (Left) Code 1 fails to perform well with $P = 12$ samples, indicating that a mismatch between inductive bias and task can prevent sample-efficient learning, while Code 2 accurately fits the target. (Right) Again, provided enough data ($P = 120$), both models can accurately estimate the target function. Details of these simulations are given in Methods Generating example codes (Figure 1).

In order to understand how population codes shape inductive bias and allow fast learning of certain tasks over others with a biologically plausible learning rule, we develop an analytical theory of the readout neuron’s learning performance as a function of the number of sampled examples, or sample size. We find that the readout’s performance is completely determined by the code’s kernel, a function which takes in pairs of population response vectors and outputs a representational similarity defined by the inner product of these vectors. We demonstrate that the spectral properties of the kernel introduce an inductive bias toward explaining sampled data with simple stimulus-response maps and determine compatibility of the population code with the learning task, and hence the sample-efficiency of learning. We apply this theory to data from the mouse primary visual cortex (V1) (Stringer et al., 2021; Pachitariu et al., 2019; Stringer et al., 2018a; Stringer et al., 2018b), and show that mouse V1 responses support sample-efficient learning of low frequency orientation discrimination and low spatial frequency reconstruction tasks over high frequency ones. We demonstrate the discrimination bias in a simple model of V1 and show how response nonlinearity, sparsity, and relative proportion of simple and complex cells influence the code’s bias and performance on learning tasks, including ones that involve invariances. We extend our theory to temporal population codes, including codes generated by recurrent neural networks learning a delayed response task. We observe that many codes could support the same kernel function; however, by analyzing data from mouse V1 (Stringer et al., 2021; Pachitariu et al., 2019; Stringer et al., 2018a; Stringer et al., 2018b), we find that the biological code is metabolically more efficient than others.

Overall, our results demonstrate that for a fixed learning rule, the neural sensory representation imposes an inductive bias over the space of learning tasks, allowing some tasks to be learned by a downstream neuron more sample-efficiently than others. Our work provides a concrete method for elucidating inductive biases of populations of neurons and suggests sample-efficient learning as a novel functional role for population codes.

Results

Problem setup

We denote vectors with bold lower-case symbols $r$ and matrices $K$ with bold upper-case symbols. We denote an average of a function $g (θ)$ over random variable $θ$ as ${⟨ g (θ) ⟩}_{θ}$. Euclidean inner products between vectors are denoted either as $x \cdot y$ or $x^{⊤} y$, and real Euclidean $n$-space is denoted $ℝ^{n}$. Sets of variables are represented with $\{\cdot\}$.

We consider a population of $N$ neurons whose responses, $\{r_{1} (θ), r_{2} (θ), \dots, r_{N} (θ)\}$, vary with the input stimulus, which is parameterized by a vector variable $θ \in ℝ^{d}$, such as the orientation and the phase of a grating (Figure 1A). These responses define the population code. Throughout this work, we will mostly assume that this population code is deterministic: that identical stimuli generate identical neural responses.

From the population responses, a readout neuron learns its weights $w$ to approximate a stimulus-response map, or a target function $y (θ)$, such as one that classifies stimuli as appetitive ($y = 1$) or aversive ($y = - 1$), or a smoother one that attaches intermediate values of valence. We emphasize that in our model only the readout neuron performs learning, and the population code is assumed to be static through learning. Our theory is general in its assumptions about the structure of the population code and the stimulus-response map considered (Methods Theory of generalization), and can apply to many scenarios.

The readout neuron learns from $P$ stimulus-response examples with the goal of generalizing to previously unseen ones. Example stimuli $θ^{μ}$ ($μ = 1, \dots, P$) are sampled from a probability distribution describing stimulus statistics, $p (θ)$. This distribution can be natural or artificially created, for example, for a laboratory experiment (Appendix Discrete stimulus spaces: finding eigenfunctions with matrix eigendecomposition). From the set of learning examples, $D = \{θ^{μ}, y (θ^{μ})\}_{μ = 1}^{P}$, the readout weights are learned with the local, biologically-plausible delta-rule, $Δ w_{j} = η \sum_{μ} r_{j} (θ^{μ}) (y (θ^{μ}) - r (θ^{μ}) \cdot w)$, where $η$ is a learning rate (Figure 1A). Learning with weight decay, which privileges readouts with smaller norm, can also be accommodated in our theory, as we discuss in Appendix Weight decay and ridge regression. With or without weight decay, the learning rule converges to a unique set of weights $w^{*} (D)$ (Appendix Convergence of the delta-rule without weight decay). Generalization error with these weights is given by

\begin{aligned} E_{g} (D) = \int p (θ) (w^{*} (D) \cdot r (θ) - y (θ))^{2} d θ, \end{aligned}

which quantifies the expected error of the trained readout over the entire stimulus distribution $p (θ)$ . This quantity will depend on the population code $r (θ)$ , the target function $y (θ)$ and the set of training examples $D$ . Our theoretical analysis of this model provides insights into how populations of neurons encode information and allow sample-efficient learning.
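The setup above can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the bump-shaped tuning curves in `population_code`, the uniform stimulus distribution on the circle, the target $y (θ) = \cos θ$, the step size, and the number of update steps. The sketch runs the batch delta rule on $P$ sampled examples and approximates $E_{g} (D)$ by averaging the squared error over a dense stimulus grid.

```python
import numpy as np

def population_code(theta, n_neurons=50, width=0.5):
    """Hypothetical code: smooth bumps tiling the circle. Returns shape (len(theta), N)."""
    centers = np.linspace(0.0, 2.0 * np.pi, n_neurons, endpoint=False)
    return np.exp((np.cos(theta[:, None] - centers[None, :]) - 1.0) / width**2)

def train_delta_rule(R_train, y_train, eta=None, n_steps=20000):
    """Batch delta rule: w_j += eta * sum_mu r_j(theta^mu) * (y^mu - r(theta^mu) . w)."""
    n_neurons = R_train.shape[1]
    if eta is None:
        # A step size below 2 / (largest eigenvalue of R^T R) keeps the iteration stable.
        eta = 1.0 / np.linalg.norm(R_train, 2) ** 2
    w = np.zeros(n_neurons)
    for _ in range(n_steps):
        w += eta * R_train.T @ (y_train - R_train @ w)
    return w

def generalization_error(w, target, n_grid=2000):
    """Approximate E_g for uniform p(theta) on [0, 2*pi) with a dense grid average."""
    grid = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    return np.mean((population_code(grid) @ w - target(grid)) ** 2)

def target(theta):
    """Assumed smooth stimulus-response map."""
    return np.cos(theta)

rng = np.random.default_rng(0)
errors = {}
for P in (4, 40):
    theta_train = rng.uniform(0.0, 2.0 * np.pi, size=P)
    w = train_delta_rule(population_code(theta_train), target(theta_train))
    errors[P] = generalization_error(w, target)
    print(f"P = {P:3d}  E_g = {errors[P]:.5f}")
```

With a finite number of steps the weights only approximate the fixed point $w^{*} (D)$, so the printed errors should be read as illustrative rather than exact.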

Kernel structure of population codes controls learning performance

First, we note that the generalization performance of the learned readout on a given task depends entirely on the inner product kernel, defined by $K (θ, θ^{'}) = \frac{1}{N} \sum_{i = 1}^{N} r_{i} (θ) r_{i} (θ^{'})$, which quantifies the similarity of population responses to two different stimuli $θ$ and $θ^{'}$. The kernel, or similarity matrix, encodes the geometry of the neural responses. Concretely, distances (in neural space) between population vectors for stimuli $θ, θ^{'}$ can be computed from the kernel: $\frac{1}{N} {|| r (θ) - r (θ^{'}) ||}^{2} = K (θ, θ) + K (θ^{'}, θ^{'}) - 2 K (θ, θ^{'})$ (Edelman, 1998; Kriegeskorte et al., 2008; Laakso and Cottrell, 2000; Kornblith et al., 2019; Cadieu et al., 2014; Pehlevan et al., 2018). The fact that the solution to the learning problem only depends on the kernel is due to the convergence of the learning rule to a unique solution $w^{*} (D)$ for the training set $D$ (Neal, 1994; Girosi et al., 1995). The dataset-dependent fixed point $w^{*} (D)$ of the learning rule is a linear combination of the population vectors on the dataset, $w^{*} (D) = \frac{1}{N} \sum_{μ = 1}^{P} α^{μ} r (θ^{μ})$. Thus, the learned function computed by the readout neuron is

\begin{aligned} f (θ) = w^{*} (D) \cdot r (θ) = \sum_{μ = 1}^{P} α^{μ} (\frac{1}{N} r (θ^{μ}) \cdot r (θ)) = \sum_{μ = 1}^{P} α^{μ} K (θ^{μ}, θ), \end{aligned}

where the coefficient vector satisfies $α = K^{+} y$ (Appendix Convergence of the delta-rule without weight decay), the matrix $K$ has entries $K_{μ ν} = K (θ^{μ}, θ^{ν})$, and $y_{μ} = y (θ^{μ})$. The matrix $K^{+}$ is the pseudo-inverse of $K$. In these expressions the population code only appears through the kernel $K$, showing that the kernel alone controls the learned response pattern. This result applies also to nonlinear readouts (Appendix Convergence of Delta-rule for nonlinear readouts), showing that the kernel can control the learned solution in a variety of cases.
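The equivalence between the weight-space and kernel-space descriptions can be checked numerically. The sketch below is illustrative only: the cosine tuning curves, the square-wave target, and all sizes are assumptions, and the minimum-norm least-squares weights are used as a stand-in for the fixed point that the delta rule reaches when started from zero weights.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 12

# Hypothetical population code: cosine tuning curves with random frequency and phase.
freqs = rng.integers(0, 11, size=N)
phases = rng.uniform(0.0, 2.0 * np.pi, size=N)

def code(theta):
    """Return responses with shape (len(theta), N)."""
    return np.cos(freqs[None, :] * theta[:, None] + phases[None, :])

theta_train = rng.uniform(0.0, 2.0 * np.pi, size=P)
y_train = np.sign(np.cos(theta_train))          # assumed binary target
theta_test = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
R_train, R_test = code(theta_train), code(theta_test)

# Weight-space view: minimum-norm least-squares readout weights.
w_star = np.linalg.pinv(R_train) @ y_train
f_weights = R_test @ w_star

# Kernel view: alpha = K^+ y, then f(theta) = sum_mu alpha^mu K(theta^mu, theta).
K = R_train @ R_train.T / N
alpha = np.linalg.pinv(K) @ y_train
f_kernel = (R_test @ R_train.T / N) @ alpha

# The weights are a linear combination of training responses: w* = (1/N) sum_mu alpha^mu r(theta^mu).
w_from_alpha = R_train.T @ alpha / N

print("max |f_weights - f_kernel| =", np.max(np.abs(f_weights - f_kernel)))
```

The two predictors agree up to numerical precision, which is the sense in which the code enters the learned function only through $K$.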

Since predictions only depend on the kernel, a large set of codes achieves identical performance on learning tasks. This is because the kernel is invariant with respect to rotation of the population code. An orthogonal transformation $Q$ applied to a population code $r (θ)$ generates a new code $\tilde{r} (θ) = Qr (θ)$ with an identical kernel (Appendix Alternative neural codes with same kernel) since $\frac{1}{N} \tilde{r} (θ) \cdot \tilde{r} (θ^{'}) = \frac{1}{N} r {(θ)}^{⊤} Q^{⊤} Qr (θ^{'}) = \frac{1}{N} r (θ) \cdot r (θ^{'})$. Codes $r (θ)$ and $\tilde{r} (θ)$ will have identical readout performance on all possible learning tasks. We illustrate this degeneracy in Figure 2 using a publicly available dataset which consists of activity recorded from ∼20,000 neurons from the primary visual cortex of a mouse while shown static gratings (Stringer et al., 2021; Pachitariu et al., 2019). An original code $r (θ)$ is rotated to generate $\tilde{r} (θ)$ (Figure 2A); the two codes have the same kernel (Figure 2B) and the same performance on a learning task (Figure 2C).
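The rotation invariance of the kernel is easy to verify on synthetic data. In this sketch the tuning curves are hypothetical (not the recorded V1 responses), and the random orthogonal matrix is drawn by QR decomposition of a Gaussian matrix, which is one convenient choice among many.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 30, 60                                   # neurons, stimuli (assumed sizes)
theta = np.linspace(0.0, np.pi, M, endpoint=False)
centers = rng.uniform(0.0, np.pi, size=N)

# Hypothetical non-negative orientation tuning curves, stored as an N-by-M matrix.
R = np.exp(np.cos(2.0 * (theta[None, :] - centers[:, None])) - 1.0)

# Random orthogonal matrix Q from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
R_rot = Q @ R                                   # rotated code r~(theta) = Q r(theta)

K = R.T @ R / N                                 # K(theta, theta') for the original code
K_rot = R_rot.T @ R_rot / N                     # same quantity for the rotated code
max_diff = np.max(np.abs(K - K_rot))

print("max |K - K_rot| =", max_diff)
print("smallest response: original", R.min(), "rotated", R_rot.min())
```

Note that the entries of the rotated code need not stay non-negative, so the two codes can differ in single-neuron properties while sharing a kernel.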

Figure 2


The inner product kernel controls the generalization performance of readouts.

(A) Tuning curves $r (θ)$ for three example recorded mouse V1 neurons in response to static grating stimuli oriented at angle $θ$ (Stringer et al., 2021; Pachitariu et al., 2019) (Left) are compared with a randomly rotated version $\tilde{r} (θ)$ of the same population code (Middle). (Right) These two codes, original (Ori.) and rotated (Rot.), can be visualized as parametric trajectories in neural space. (B) The inner product kernel matrix has elements $K (θ_{1}, θ_{2})$. The original V1 code and its rotated counterpart have identical kernels. (C) In a learning task involving uniformly sampled angles, readouts from the two codes perform identically, resulting in identical approximations of the target function (shown on the left as blue and red curves) and consequently identical generalization performance as a function of training set size $P$ (shown on right with blue and red points). The theory curve will be described in the main text.

Code-task alignment governs generalization

We next examine how the population code affects generalization performance of the readout. We calculated analytical expressions of the average generalization error in a task defined by the target response $y (θ)$ after observing $P$ stimuli using methods from statistical physics (Methods Theory of generalization). Because the relevant quantity in learning performance is the kernel, we leveraged results from our previous work studying generalization in kernel regression (Bordelon et al., 2020; Canatar et al., 2021), and approximated the generalization error averaged over all possible realizations of the training dataset composed of $P$ stimuli, $E_{g} = {⟨ E_{g} (D) ⟩}_{D}$ . As $P$ increases, the variance in $E_{g}$ due to the composition of the dataset decreases, and our expressions become descriptive of the typical case. Our final analytical result is given in Equation (11) in Methods Theory of generalization. We provide details of our calculations in Methods Theory of generalization and Appendix Theory of generalization, and focus on their implications here.

One of our main observations is that given a population code $r (θ)$, the singular value decomposition of the code gives the appropriate basis to analyze the inductive biases of the readouts (Figure 3A). The tuning curves for individual neurons $r_{i} (θ)$ form an $N$-by-$M$ matrix $R$, where $M$, possibly infinite, is the number of all possible stimuli. We discuss the SVD for continuous stimulus spaces in Appendix Singular value decomposition of continuous population responses. The left-singular vectors (or principal axes) and singular values of this matrix have been used in neuroscience for describing lower dimensional structure in the neural activity and estimating its dimensionality, see e.g. (Stopfer et al., 2003; Kato et al., 2015; Bathellier et al., 2008; Gallego et al., 2017; Sadtler et al., 2014; Stringer et al., 2018b; Stringer et al., 2021; Litwin-Kumar et al., 2017; Gao et al., 2017; Gao and Ganguli, 2015). We found that the function approximation properties of the code are controlled by the singular values, or rather their squares $\{λ_{k}\}$, which give variances along principal axes, indexed in decreasing order, and the corresponding right singular vectors $\{ψ_{k} (θ)\}$, which are also the kernel eigenfunctions (Methods Theory of generalization and Appendix Singular value decomposition of continuous population responses). This follows from the fact that the learned response (Equation (2)) is only a function of the kernel $K$, and the eigenvalues $λ_{k}$ and orthonormal (uncorrelated) eigenfunctions $ψ_{k} (θ)$ collectively define the code’s inner-product kernel $K (θ, θ^{'})$ through an eigendecomposition $K (θ, θ^{'}) = \frac{1}{N} \sum_{i = 1}^{N} r_{i} (θ) r_{i} (θ^{'}) = \sum_{k} λ_{k} ψ_{k} (θ) ψ_{k} (θ^{'})$ (Mercer, 1909) (Methods Theory of generalization and Appendix Theory of generalization).
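For a finite stimulus set, the relation between the SVD of $R$ and the kernel eigendecomposition can be sketched in a few lines. The tuning curves, the sizes, and the normalization are assumptions of this example: stimuli are taken to be uniformly distributed over $M$ grid points, so eigenfunctions are normalized as $\frac{1}{M} \sum_{θ} ψ_{k} (θ) ψ_{ℓ} (θ) = δ_{k ℓ}$ and the eigenvalues are $λ_{k} = s_{k}^{2} / (N M)$, where $s_{k}$ are the singular values.

```python
import numpy as np

N, M = 40, 100                                   # neurons, stimuli (assumed sizes)
theta = np.linspace(0.0, np.pi, M, endpoint=False)
centers = np.linspace(0.0, np.pi, N, endpoint=False)

# Hypothetical smooth orientation tuning curves, N-by-M response matrix R.
R = np.exp(2.0 * (np.cos(2.0 * (theta[None, :] - centers[:, None])) - 1.0))

# Thin SVD: R = U diag(S) Vt, with singular values sorted in decreasing order.
U, S, Vt = np.linalg.svd(R, full_matrices=False)

lam = S**2 / (N * M)          # kernel eigenvalues under uniform p(theta) on the grid
psi = np.sqrt(M) * Vt         # row k holds the eigenfunction psi_k(theta)

# Mercer-style reconstruction: K(theta, theta') = sum_k lam_k psi_k(theta) psi_k(theta').
K = R.T @ R / N
K_rebuilt = (psi.T * lam) @ psi

print("top eigenvalues:", lam[:5])
print("max reconstruction error:", np.max(np.abs(K - K_rebuilt)))
```

The rows of `psi` are ordered from the most to the least strongly represented mode, which is the ordering used when discussing which target functions are learned first.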

Figure 3 (with 1 supplement)


The singular value decomposition (SVD) of the population code reveals the structure and inductive bias of the code.

(A) SVD of the response matrix $R$ gives left singular vectors $u_{k}$ (principal axes), kernel eigenvalues $λ_{k}$, and kernel eigenfunctions $ψ_{k} (θ)$. The ordering of eigenvalues provides an ordering of which modes $ψ_{k}$ can be learned by the code from few training examples. The eigenfunctions were offset by 0.5 for visibility. (B) (Left) Two different learning tasks $y (θ)$, a low frequency (blue) and high frequency (red) function, are shown. (Middle) The cumulative power distribution rises more rapidly for the low frequency task than for the high frequency task, indicating better alignment with top kernel eigenfunctions and consequently more sample-efficient learning as shown in the learning curves (right). Dashed lines show theoretical generalization error while dots and solid vertical lines are experimental average and standard deviation over 30 repeats. (C) The feature space representations of the low (left) and high (middle and right) frequency tasks. Each point represents the embedding of a stimulus response vector along the $k$-th principal axis $r (θ^{μ}) \cdot u_{k} = \sqrt{λ_{k}} ψ_{k} (θ^{μ})$. The binary target value $\{\pm 1\}$ is indicated with the color of the point. The easy (left), low frequency task is well separated along the top two dimensions, while the hard, high frequency task is not linearly separable in two (middle) or even with four feature dimensions (right). (D) On an image discrimination task (recognizing birds vs mice), V1 has an entangled representation which does not allow good performance of linear readouts. This is evidenced by the top principal components (middle) and the slowly rising $C (k)$ curve (right).

Our analysis shows the existence of a bias in the readout towards learning certain target responses faster than others. The target response $y (θ) = \sum_{k} v_{k} ψ_{k} (θ)$ and the learned readout response $f (θ) = \sum_{k} {\hat{v}}_{k} (D) ψ_{k} (θ)$ can be expressed in terms of these eigenfunctions $ψ_{k}$. Our theory shows that the readout’s generalization is better if the target function $y (θ)$ is aligned with the top eigenfunctions $ψ_{k}$, equivalent to $v_{k}^{2}$ decaying rapidly with $k$ (Appendix Spectral bias and code-task alignment). We formalize this notion with the following metric. Mathematically, the generalization error ${⟨ E_{g} ⟩}_{D}$ can be decomposed into normalized estimation errors $E_{k}$ for the coefficients of these eigenfunctions $ψ_{k}$, ${⟨ E_{g} ⟩}_{D} = \sum_{k} v_{k}^{2} E_{k}$, where $E_{k} = {⟨ ({\hat{v}}_{k} (D) - v_{k})^{2} ⟩}_{D} / v_{k}^{2}$. We found that the ordering of the eigenvalues $λ_{k}$ controls the rates at which these mode errors $E_{k}$ decrease as $P$ increases (Methods Theory of generalization; Appendix Spectral bias and code-task alignment; Bordelon et al., 2020): $λ_{k} > λ_{ℓ} ⟹ E_{k} < E_{ℓ}$. Hence, modes with larger eigenvalues have lower normalized mode errors $E_{k}$. We term this phenomenon the spectral bias of the readout. Based on this observation, we propose code-task alignment as a principle for good generalization. To quantify code-task alignment, we use a metric which was introduced in Canatar et al. (2021) to measure the compatibility of a kernel with a learning task. This is the cumulative power distribution $C (k)$, which measures the total power of the target function in the top $k$ eigenmodes, normalized by the total power (Canatar et al., 2021):

\begin{aligned} C (k) = \frac{\sum_{ℓ = 1}^{k} v_{ℓ}^{2}}{\sum_{ℓ = 1}^{\infty} v_{ℓ}^{2}} . \end{aligned}

Stimulus-response maps that have high alignment with the population code’s kernel will have quickly rising cumulative power distributions $C (k)$ , since a large proportion of power is placed in the top modes. Target responses with high $C (k)$ can be learned with fewer training samples than target responses with low $C (k)$ since the mode errors $E_{k}$ are ordered for all $P$ (Appendix Spectral bias and code-task alignment).
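As a rough illustration of how $C (k)$ separates easy from hard tasks, the sketch below computes it for two square-wave targets on a synthetic, translation-invariant orientation code. The tuning curves, the stimulus grid, and the two target frequencies are assumptions chosen for the example and are not the V1 data analyzed in the paper.

```python
import numpy as np

M = 100                                           # stimuli; one neuron per grid point (assumed)
theta = (np.arange(M) + 0.5) * np.pi / M          # orientations in [0, pi), offset off the zeros
R = np.exp(2.0 * (np.cos(2.0 * (theta[None, :] - theta[:, None])) - 1.0))
K = R.T @ R / R.shape[0]                          # inner-product kernel on the grid

# Kernel eigenfunctions under uniform p(theta), sorted by decreasing eigenvalue.
evals, evecs = np.linalg.eigh(K)
order = np.argsort(evals)[::-1]
psi = np.sqrt(M) * evecs[:, order]                # column k is psi_k(theta)

def cumulative_power(y):
    """C(k) = sum_{l<=k} v_l^2 / sum_l v_l^2 with v_l = <psi_l(theta) y(theta)>_theta."""
    v = psi.T @ y / M
    return np.cumsum(v**2) / np.sum(v**2)

y_low = np.where(np.cos(2.0 * theta) >= 0, 1.0, -1.0)     # low-frequency square wave
y_high = np.where(np.cos(10.0 * theta) >= 0, 1.0, -1.0)   # high-frequency square wave

C_low, C_high = cumulative_power(y_low), cumulative_power(y_high)
for k in (3, 11):
    print(f"k = {k:2d}  C_low = {C_low[k - 1]:.3f}  C_high = {C_high[k - 1]:.3f}")
```

In this toy setting the low-frequency task places most of its power in the top few modes, while the high-frequency task accumulates power only at larger $k$, so its $C (k)$ curve rises later.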

Probing learning biases in neural data

Our theory can be used to probe the learning biases of neural populations. Here, we provide various examples of this using publicly available calcium imaging recordings from mouse primary visual cortex (V1). Our examples illustrate how our theory can be used to analyze neural data.

We first analyzed population responses to static grating stimuli oriented at an angle $θ$ (Stringer et al., 2021; Pachitariu et al., 2019). We found that the kernel eigenfunctions have sinusoidal shape with differing frequency. The ordering of the eigenvalues and eigenfunctions in Figure 3A (and Figure 3—figure supplement 1) indicates a frequency bias: lower frequency functions of $θ$ are easier to estimate at small sample sizes.

We tested this idea by constructing two different orientation discrimination tasks shown in Figure 3B and C, where we assign static grating orientations to positive or negative valence with different frequency square-wave functions of $θ$. We trained the readout using a subset of the experimentally measured neural responses, and measured the readout’s generalization performance. We found that the cumulative power distribution for the low frequency task has a more rapidly rising $C (k)$ (Figure 3B). Using our theory of generalization, we predicted learning curves for these two tasks, which express the generalization error as a function of the number of sampled stimuli $P$. The error for the easy, low-frequency task is lower at all sample sizes than for the hard, high-frequency task. The theoretical predictions and numerical experiments show perfect agreement (Figure 3B). More intuition can be gained by visualizing the projection of the neural response along the top principal axes (Figure 3C). For the low-frequency task, the two target values are well separated along the top two axes. However, the high-frequency task is not well separated along even the top four axes (Figure 3C).

Using the same ideas, we can use our theory to gain insight into tasks which the V1 population code is ill-suited to learn. For the task of identifying mice and birds (Stringer et al., 2018b; Stringer et al., 2018a), the linear rise in cumulative power indicates that there is roughly equal power along all kernel eigenfunctions, so the representation is poorly aligned to this task (Figure 3D).

To illustrate how our approach can be used for different learning problems, we evaluate the ability of linear readouts to reconstruct natural images from neural responses to those images (Figure 4). The ability to reconstruct sensory stimuli from a neural code is an influential normative principle for primary visual cortex (Olshausen and Field, 1997). Here, we ask which aspects of the presented natural scene stimuli are easiest to learn to reconstruct. Since mouse V1 neurons tend to be selective towards low spatial frequency bands (Niell and Stryker, 2008; Bonin et al., 2011; Vreysen et al., 2012), we consider reconstruction of band-pass filtered images with spatial frequency wave-vector $k \in ℝ^{2}$ constrained to an annulus $| k | \in [\sqrt{\max (s_{\max}^{2} - r^{2}, 0)}, s_{\max}]$ for $r = 0.2$ (in units of $\mathrm{pixels}^{- 1}$) and plot the cumulative power $C (k)$ associated with each choice of the upper limit $s_{\max}$ (Figure 4C and D). The frequency cutoffs were chosen in this way to keep the volume in Fourier space fixed at $V_{k} = π r^{2}$ for $r < s_{\max}$, which quantifies the dimension of the function space. We see that the lower frequency band-limited images are easier to reconstruct, as evidenced by their cumulative power $C (k)$ and learning curves $E_{g}$ (Figure 4D and E). This reflects the fact that the population preferentially encodes low spatial frequency content in the image (Figure 4F). Experiments with additional values of $r$ are provided in Figure 4—figure supplement 1, with additional details found in the Appendix Visual scene reconstruction task.
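The annulus construction can be sketched with a discrete Fourier transform. The helper names `bandpass_mask` and `bandpass_filter`, the image size, and the random stand-in image are hypothetical; frequencies are measured in cycles per pixel via `np.fft.fftfreq`, which is an assumed reading of the units in the text. The sketch also checks numerically that the retained area in Fourier space stays close to $π r^{2}$ as $s_{\max}$ varies, since $π s_{\max}^{2} - π (s_{\max}^{2} - r^{2}) = π r^{2}$.

```python
import numpy as np

def bandpass_mask(shape, s_max, r=0.2):
    """Boolean mask keeping |k| in [sqrt(max(s_max^2 - r^2, 0)), s_max]."""
    ky = np.fft.fftfreq(shape[0])[:, None]       # vertical frequencies, cycles per pixel
    kx = np.fft.fftfreq(shape[1])[None, :]       # horizontal frequencies, cycles per pixel
    k_norm = np.sqrt(kx**2 + ky**2)
    s_min = np.sqrt(max(s_max**2 - r**2, 0.0))
    return (k_norm >= s_min) & (k_norm <= s_max)

def bandpass_filter(image, s_max, r=0.2):
    """Zero out all Fourier components outside the annulus and transform back."""
    mask = bandpass_mask(image.shape, s_max, r)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))

rng = np.random.default_rng(3)
image = rng.standard_normal((256, 256))          # stand-in for a natural image
r = 0.2
areas = {}
for s_max in (0.2, 0.3, 0.45):
    mask = bandpass_mask(image.shape, s_max, r)
    areas[s_max] = mask.sum() / mask.size        # each frequency bin has area 1 / (256 * 256)
    filtered = bandpass_filter(image, s_max, r)
    print(f"s_max = {s_max:.2f}  area = {areas[s_max]:.4f}  "
          f"target pi r^2 = {np.pi * r**2:.4f}  filtered std = {filtered.std():.3f}")
```

Keeping the area fixed means each band contains roughly the same number of Fourier components, so differences in reconstruction difficulty are not simply due to bands of different dimension.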

Figure 4 (with 1 supplement)


Reconstructing filtered natural images from V1 responses reveals preference for low spatial frequencies.

(A) Natural scene stimuli $θ$ were presented to mice and V1 cells were recorded. (B) The images weighted by the top eigenfunctions $v_{k} = {⟨ ψ_{k} (θ) θ ⟩}_{θ}$. These “eigenimages” collectively define the difficulty of reconstructing images through readout. (C) The kernel spectrum of the V1 code for natural images. (D) The cumulative power curves for reconstruction of band-pass filtered images. Filters preserve spatial frequencies in the range $| k | \in [\sqrt{\max (s_{\max}^{2} - {0.2}^{2}, 0)}, s_{\max}]$, chosen to preserve volume in Fourier space as $s_{\max}$ is varied. (E) The learning curves obey the ordering of the cumulative power curves. The images filtered with the lowest band-pass cutoff are easiest to reconstruct from the neural responses. (F) Examples of a band-pass filtered image with different preserved frequency bands.

Mechanisms of spectral bias and code-task alignment in a simple model of V1

How do population level inductive biases arise from properties of single neurons? To illustrate that a variety of mechanisms may be involved in a complex manner, we study a simple model of V1 to elucidate neural mechanisms that lead to the low frequency bias at the population level. In particular, we focus on neural nonlinearities and selectivity profiles.

We model responses of V1 neurons as photoreceptor inputs passed through Gabor filters and a subsequent experimentally motivated power-law nonlinearity (Adelson and Bergen, 1985; Olshausen and Field, 1997; Rumyantsev et al., 2020), modeling a population of orientation selective simple cells (Figure 5A) (see Appendix A simple feedforward model of V1). In this model, the kernel for static gratings with orientation $θ \in [0, π]$ is of the form $K (θ, θ^{'}) = κ (| θ - θ^{'} |)$, and, as a consequence, the eigenfunctions of the kernel in this setting are Fourier modes. The eigenvalues, and hence the strength of the spectral bias, are determined by the nonlinearity, as we discuss in Appendix Gabor model spectral bias and fit to V1 data. We numerically fit the parameters of the nonlinearity to the V1 responses and use these parameters in our investigations (Figure 5—figure supplement 1).

Figure 5 (with 3 supplements)


A model of V1 as a bank of Gabor filters recapitulates experimental inductive bias.

(A) Gabor filtered inputs are mapped through a nonlinearity. A grating stimulus (left) with orientation $θ$ and phase $ϕ$ is mapped through a circuit of simple and complex cells (middle). Some examples of randomly sampled Gabor filters (right) generate preferred orientation tuning of neurons in the population. (B) We plot the top 12 eigenfunctions $ψ_{k} (θ, ϕ)$ (modes) for a pure simple cell population, a pure complex cell population, and a mixture population with half simple and half complex cells. The pure complex cell population has all eigenfunctions independent of phase $ϕ$. The eigenfunctions of a pure simple cell population ($s = 1$) or of mixture codes ($0 < s < 1$) depend on both orientation and phase in a nontrivial way. (C) Three tasks are visualized, where color indicates the binary target value $\pm 1$. The left task only depends on the orientation stimulus variable $θ$, the middle only depends on phase $ϕ$, and the hybrid task (right) depends on both. (D) (top) Generalization error and cumulative power distributions for the three tasks as a function of the simple-complex cell mixture parameter $s$.

Next, to further illustrate the importance of code-task alignment, we study how invariances in the code to stimulus variations may affect the learning performance. We introduce complex cells in addition to simple cells in our model, with proportion $s \in [0, 1]$ of simple cells (Appendix Gabor model spectral bias and fit to V1 data; Figure 5A), and allow variations of the phase $ϕ$ in static gratings. We use the energy model (Adelson and Bergen, 1985; Simoncelli and Heeger, 1998) to capture the phase invariant complex cell responses (Appendix Phase variation, complex cells and invariance; Appendix Complex cell populations are phase invariant). We reason that in tasks that do not depend on phase information, complex cells should improve sample efficiency.

In this model, the kernel for the V1 population is a convex combination of the kernels for the simple and complex cell populations, $K_{V 1} (θ, θ^{'}, ϕ, ϕ^{'}) = s K_{s} (θ, θ^{'}, ϕ, ϕ^{'}) + (1 - s) K_{c} (θ, θ^{'})$ , where $K_{s}$ is the kernel for a pure simple cell population that depends on both orientation and phase, and $K_{c}$ is the kernel of a pure complex cell population that is invariant to phase (Appendix Complex cell populations are phase invariant). Figure 5B shows the top kernel eigenfunctions for various values of $s$ , elucidating the inductive bias of the readout.

Figure 5D shows generalization performance on tasks with varying levels of dependence on phase and orientation. On pure orientation discrimination tasks, increasing the proportion of complex cells by decreasing $s$ improves generalization. Increasing the sensitivity to the nuisance phase variable, $ϕ$ , only degrades performance. The cumulative power curve is also maximized at $s = 0$ . However, on a task which only depends on the phase, a pure complex cell population cannot generalize, since variation in the target function due to changes in phase cannot be explained by the code’s responses. In this setting, a pure simple cell population attains optimal performance. The cumulative power curve is maximized at $s = 1$ . Lastly, in a nontrivial hybrid task which requires utilization of both variables $θ, ϕ$ , an optimal mixture $s$ exists for each sample budget $P$ which minimizes the generalization error. The cumulative power curve is maximized at different $s$ values depending on $k$ , the component of the target function. This is consistent with an optimal heterogeneous mix, because components of the target are learned successively with increasing sample size. V1 must code for a variety of possible tasks and we can expect a nontrivial optimal simple cell fraction $s$ . We conclude that the degree of invariance required for the set of natural tasks and the number of samples together determine the optimal mix of simple and complex cells. We also considered a more realistic model where the relative selectivity of each visual cortex neuron to phase $ϕ$ , measured with the F1/F0 ratio, takes on a continuum of possible values, with some cells more invariant to phase and some less invariant. In Appendix Energy model with partially phase-selective cells (Figure 5—figure supplement 3), we discuss a simple adaptation of the energy model which can interpolate between a population of entirely simple cells and a population of entirely complex cells, giving diverse selectivity for the intermediate regime. We show that this model reproduces the inductive bias of Figure 5.
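The effect of phase invariance on learnability can be illustrated with a small numerical sketch. The code below does not use the Gabor or energy model of the paper. It assumes a hypothetical periodic similarity profile `kappa`, a toy simple-cell kernel that factorizes over orientation and phase, and a toy complex-cell kernel that ignores phase. Under these assumptions it forms the convex combination $s K_{s} + (1 - s) K_{c}$ and measures how much of a phase-only binary target lies in the span of the code.

```python
import numpy as np

# Toy grid over orientation theta in [0, pi) and phase phi in [0, 2*pi).
n_theta, n_phi = 12, 16
theta = np.pi * np.arange(n_theta) / n_theta
phi = 2 * np.pi * (np.arange(n_phi) + 0.5) / n_phi   # offset avoids cos(phi) == 0
TH, PH = [a.ravel() for a in np.meshgrid(theta, phi, indexing="ij")]

def kappa(delta, period, width=0.5):
    """Hypothetical periodic similarity profile (NOT the paper's Gabor kernel)."""
    return np.exp((np.cos(2 * np.pi * delta / period) - 1.0) / width**2)

d_theta = TH[:, None] - TH[None, :]
d_phi = PH[:, None] - PH[None, :]
K_c = kappa(d_theta, np.pi)                             # phase invariant (toy complex cells)
K_s = kappa(d_theta, np.pi) * kappa(d_phi, 2 * np.pi)   # depends on theta and phi

def mixture_kernel(s):
    """Convex combination s * K_s + (1 - s) * K_c."""
    return s * K_s + (1.0 - s) * K_c

def visible_fraction(K, y):
    """Fraction of target power in the span of eigenvectors with nonzero eigenvalue."""
    lam, psi = np.linalg.eigh(K)
    keep = lam > 1e-8 * lam.max()
    coeffs = psi[:, keep].T @ y
    return float(coeffs @ coeffs / (y @ y))

y_phase = np.where(np.cos(PH) > 0, 1.0, -1.0)   # binary task that depends on phase only

for s in (0.0, 0.5, 1.0):
    print(s, visible_fraction(mixture_kernel(s), y_phase))
```

In this toy setting the phase-only target is invisible to the pure complex-cell kernel and fully expressible once any simple-cell component is present, which mirrors the qualitative argument in the text.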

Small and large sample size behaviors of generalization

Recently, Stringer et al. (2018b) argued that the input-output differentiability of the code, governed by the asymptotic rate of spectral decay, may enable better generalization. Our results provide a more nuanced view of the relation between generalization and kernel spectra. First, generalization with low sample sizes crucially depends on the top eigenvalues and eigenfunctions of the code’s kernel, not the tail. Second, generalization requires alignment of the code with the task of interest. Non-differentiable codes can generalize well if there is such an alignment. To illustrate these points, we provide examples where asymptotic conditions on the kernel spectrum are insufficient to describe generalization performance for small sample sizes (Figure 6, Figure 6—figure supplement 1 and Appendix Asymptotic power law scaling of learning curves), and where non-differentiable kernels generalize better than differentiable kernels (Figure 6—figure supplement 2).

Figure 6 (with 2 supplements)

The top eigensystem of a code determines its low- $P$ generalization error.

(A) A periodic variable is coded by a population of neurons with tuning curves of different widths (top). Narrow, wide and optimal refer to the example in C. These codes are all smooth (infinitely differentiable) but have very different feature space representations of the stimulus variable $θ$ , as random projections reveal (below). (B) (left) The population codes in A induce von Mises kernels $K (θ) \propto e^{\cos (θ) / σ^{2}}$ with different bandwidths $σ$ . (right) Eigenvalues of the three kernels. (C) (left) As an example learning task, we consider estimating a ‘bump’ target function. The optimal kernel (red, chosen as optimal bandwidth for $P = 10$ ) achieves a better generalization error than either the wide (green) or narrow (blue) kernels. (middle) A contour plot shows generalization error for varying bandwidth $σ$ and sample size $P$ . (right) The large $P$ generalization error scales as a power law. Solid lines are theory, dots are simulations averaged over 15 repeats, and dashed lines are asymptotic power law scalings described in the main text. Same color code as B and C-left.

Our example demonstrates how a code allowing good generalization for large sample sizes can be disadvantageous for small sizes. In Figure 6A, we plot three different populations of neurons with smooth (infinitely differentiable) tuning curves that tile a periodic stimulus variable, such as the direction of a moving grating. The tuning width, $σ$ , of the tuning curves strongly influences the structure of these codes: narrower widths have more high frequency content, as we illustrate in a random 3D projection of the population code for $θ \in [0, 2 π]$ (Figure 6A). Visualizations of the corresponding (von Mises) kernels and their spectra are provided in Figure 6B. The width of the tuning curves controls the bandwidth of the kernel spectra (Figure 6B), with narrower curves having a later decay in the spectrum and larger high frequency eigenvalues. These codes can have dramatically different generalization performance, which we illustrate with a simple ‘bump’ target response (Figure 6C). In this example, for illustration purposes, we let the network learn with a delta rule with weight decay, leading to a regularized kernel regression solution (Appendix Weight decay and ridge regression). For a sample size of $P = 10$ , we observe that codes with too wide or too narrow tuning curves (and kernels) do not perform well, and there is a well-performing code with an optimal tuning curve width $σ$ , which is compatible with the width of the target bump, $σ_{T}$ . We found that the optimal $σ$ is different for each $P$ (Figure 6C). In the large- $P$ regime, the ordering of the performance of the three codes is reversed (Figure 6C). In this regime generalization error scales as a power law (Appendix Asymptotic power law scaling of learning curves) and the narrow code, which performed worst for $P \sim 10$ , performs the best. This example demonstrates that asymptotic conditions on the tail of the spectra are insufficient to understand generalization in the small sample size limit. The bulk of the kernel’s spectrum needs to match the spectral structure of the task to generalize efficiently in the low-sample size regime. However, for large sample sizes, the tail of the eigenvalue spectrum becomes important. We repeat the same exercise and draw the same conclusions for a non-differentiable kernel (Laplace) (Figure 6—figure supplement 1), showing that these results are not an artifact of the infinite differentiability of von Mises kernels. We further provide examples where non-differentiable kernels generalize better than differentiable kernels in Figure 6—figure supplement 2.
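The dependence of the spectrum on tuning width can be checked numerically. The sketch below assumes a uniform grid on the circle and uses the fact that a translation-invariant kernel is then a circulant matrix, so its eigenvalues are the discrete Fourier coefficients of one row. The function name `von_mises_spectrum`, the grid size and the two example bandwidths are illustrative choices, not values from the paper.

```python
import numpy as np

def von_mises_spectrum(sigma, n_grid=256):
    """Eigenvalues of K(theta) ∝ exp(cos(theta) / sigma^2) on a uniform grid.

    On a uniform grid the kernel matrix is circulant, so its eigenvalues are
    the discrete Fourier coefficients of a single row. The output is indexed
    by frequency and normalised so that the constant mode has eigenvalue 1.
    """
    theta = 2 * np.pi * np.arange(n_grid) / n_grid
    row = np.exp((np.cos(theta) - 1.0) / sigma**2)   # subtracting 1 only rescales K
    lam = np.fft.rfft(row).real / n_grid
    return lam / lam[0]

narrow = von_mises_spectrum(sigma=0.2)   # narrow tuning: slow spectral decay
wide = von_mises_spectrum(sigma=1.0)     # wide tuning: fast spectral decay

for k in (1, 5, 10):
    print(k, narrow[k], wide[k])
```

The narrow kernel retains far more relative power at high frequencies, which is the property that helps at large $P$ and hurts at small $P$ in the example above.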

Time-dependent neural codes

Our framework can directly be extended to learning of arbitrary time-varying functions of time-varying inputs from an arbitrary spatiotemporal population code (Methods RNN experiment, Appendix Time dependent neural codes). In this setting, the population code $r ({θ (t)}, t)$ is a function of an input stimulus sequence $θ (t)$ and possibly its entire history, and time $t$ . A downstream linear readout $f ({θ}, t) = w \cdot r ({θ}, t)$ learns a target sequence $y ({θ}, t)$ from a total of $P$ examples that can come at any time during any sequence.

As a concrete example, we focus on readout from a temporal population code generated by a recurrent neural network in a task motivated by a delayed reach task (Ames et al., 2019; Figure 7A and B). In this task, the network is briefly presented with an input cue sequence coding an angular variable, which is drawn randomly from a distribution (Figure 7C). The recurrent neural network must remember this angle and reproduce an output sequence: a simple step function whose height depends on the angle, which begins after a time delay from the cessation of the input stimulus and lasts for a short time (Figure 7D).

Figure 7

The performance of time-dependent codes when learning dynamical systems can be understood through spectral bias.

(A) We study the performance of time dependent codes on a delayed response task which requires memory retrieval. A cue (black dot) is presented at an angle $γ^{μ}$ . After a delay time $d$ , the cursor position (blue triangle) must be moved to the remembered cue position and then subsequently moved back to the origin after a short time. (B) The readout weights (blue) of a time dependent code can be learned through a modified delta rule. (C) Input is presented to the network as a time series which terminates at $t = 1$ . The sequences are generated by drawing an angle $γ^{μ} \sim Uniform [0, 2 π]$ and using two step functions as input time-series that code for the cosine and the sine of the angle (Methods RNN experiment, Appendix Time dependent neural codes). We show an example of one of the variables in an input sequence. (D) The target functions for the memory retrieval task are step functions delayed by a time $d$ . (E) The kernel $K_{t, t^{'}}^{μ, μ^{'}}$ compares the code for two sequences at two distinct time points. We show the time dependent kernel for identical sequences (left) and the stimulus dependent kernel for equal time points (middle left) as well as for non-equal stimuli (middle right) and non-equal time (right). (F) The kernel can be diagonalized, and the eigenvalues $λ_{k}$ determine the spectral bias of the reservoir computer (left). We see that higher gain $g$ networks have higher dimensional representations. The ‘eigensystems’ $ψ_{k} (θ^{μ}, t)$ are functions of time and cue angle. We plot only the $μ = 0$ components of the top systems $k = 1, 2, 3, 4$ (right). (G) The readout is trained to approximate a target function $y^{μ} (t)$ , which requires memory of the presented cue angle. (left) The theoretical (solid) and experimental (vertical errorbar, 100 trials) generalization error $E_{g}$ are plotted for the three delays $d$ against training sample size $P$ . (right) The ordering of $E_{g}$ matches the ordering of the $C (k)$ curves as expected.

The kernel induced by the spatiotemporal code is shown in Figure 7E. The high dimensional nature of the activity in the recurrent network introduces complex and rich spatiotemporal similarity structure. Figure 7F shows the kernel’s eigensystem, which consists of stimulus dependent time-series $ψ_{k} ({θ}; t)$ for each eigenvalue $λ_{k}$ . An interesting link can be made with this eigensystem and linear low-dimensional manifold dynamics observed in several cortical areas (Stopfer et al., 2003; Kato et al., 2015; Gallego et al., 2017; Cunningham and Yu, 2014; Sadtler et al., 2014; Gao and Ganguli, 2015; Gallego et al., 2018; Chapin and Nicolelis, 1999; Bathellier et al., 2008). The kernel eigenfunctions also define the latent variables obtained through a singular value decomposition of the neural activity $r ({θ}; t) = \sum_{k} \sqrt{λ_{k}} u_{k} ψ_{k} ({θ}; t)$ (Gallego et al., 2017). With enough samples, the readout neuron can learn to output the desired angle with high fidelity (Figure 7G). Unsurprisingly, tasks involving long time delays are more difficult and exhibit lower cumulative power curves. Consequently, the generalization error for small delay tasks drops much more quickly with increasing samples $P$ .
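The link between the kernel eigensystem and the singular value decomposition of the activity can be made concrete with a small simulation. This is a minimal sketch under stated assumptions: a random discrete-time tanh network stands in for the recurrent network of the paper, and the sizes, gain `g` and cue duration `cue_steps` are hypothetical. It also assumes the kernel normalization $K = \frac{1}{N} r \cdot r$ , so that the singular values equal $\sqrt{N λ_{k}}$ . Other normalization conventions for $u_{k}$ absorb the factor of $N$ .

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 200, 5, 30      # neurons, sequences, time points (illustrative sizes)
g, cue_steps = 1.5, 5     # recurrent gain and cue duration (hypothetical values)

W = g * rng.standard_normal((N, N)) / np.sqrt(N)   # random recurrent weights
W_in = rng.standard_normal((N, 2))                 # input weights for the cue

R = np.zeros((N, M, T))
for mu, gamma in enumerate(rng.uniform(0, 2 * np.pi, size=M)):
    x = np.zeros(N)
    for t in range(T):
        on = 1.0 if t < cue_steps else 0.0         # step-function input
        cue = on * np.array([np.cos(gamma), np.sin(gamma)])
        x = np.tanh(W @ x + W_in @ cue)            # discrete-time rate dynamics
        R[:, mu, t] = x

R_flat = R.reshape(N, M * T)          # columns index (sequence, time) pairs
K = R_flat.T @ R_flat / N             # kernel between all (mu, t) and (mu', t')

# SVD of the activity: R_flat = sum_k S[k] * U[:, k] * Vt[k, :]
U, S, Vt = np.linalg.svd(R_flat, full_matrices=False)
lam_from_svd = S**2 / N               # kernel eigenvalues implied by the SVD
lam_from_kernel = np.linalg.eigvalsh(K)[::-1]      # sorted largest first
```

The rows of `Vt` play the role of the eigenfunctions $ψ_{k} ({θ}; t)$ sampled on the simulated sequences, and reshaping row `k` to `(M, T)` gives one time series per cue.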

Biological codes are metabolically more efficient and more selective than other codes with identical kernels

Although the performance of linear readouts may be invariant to rotations that preserve kernels (Figure 2), metabolic efficiency may favor certain codes over others (Barlow, 1961; Atick and Redlich, 1992; Attneave, 1954; Olshausen and Field, 1997; Simoncelli and Olshausen, 2001), reducing degeneracy in the space of codes with identical kernels. To formalize this idea, we define $δ$ to be the vector of spontaneous firing rates of a population of neurons, and $s^{μ} = r (θ^{μ}) + δ$ to be the spiking rate vector in response to a stimulus $θ^{μ}$ . The vector $δ$ ensures that neural responses are non-negative. The modulation with respect to the spontaneous activity, $r (θ^{μ})$ , gives the population code and defines the kernel, $K (θ^{μ}, θ^{ν}) = \frac{1}{N} r (θ^{μ}) \cdot r (θ^{ν})$ . To avoid confusion with $r (θ^{μ})$ , we will refer to $s^{μ}$ as total spiking activity. We propose that population codes prefer smaller spiking activity subject to a fixed kernel. In other words, because the kernel is invariant to any change of the spontaneous firing rates and to left rotations of $r (θ)$ , the orientation and shift of the population code $r (θ)$ should be chosen such that the resulting total spike count $\sum_{i = 1}^{N} \sum_{μ = 1}^{P} s_{i}^{μ}$ is small.

We tested whether biological codes exhibit lower total spiking activity than others exhibiting the same kernel on mouse V1 recordings, using deconvolved calcium activity as a proxy for spiking events (Stringer et al., 2021; Pachitariu et al., 2019; Pachitariu et al., 2018) (Methods Data analysis; Figure 8). To compare the experimental total spiking activity to other codes with identical kernels, we computed random rotations of the neural responses around spontaneous activity, $\tilde{r} (θ^{μ}) = Qr (θ^{μ})$ , and added the $δ$ that minimizes total spiking activity and maintains its nonnegativity (Methods Generating RROS codes). We refer to such an operation as RROS (random rotation and optimal shift), and a code generated by an RROS operation as an RROS code. The matrix $Q$ is a randomly sampled orthogonal matrix (Anderson et al., 1987). In other words, we compare the true code to the most metabolically efficient realizations of its random rotations. This procedure may result in an increased or decreased total spike count in the code, and is illustrated in a synthetic dataset in Figure 8A. We conducted this procedure on subsets of various sizes of mouse V1 neuron populations, as our proposal should hold for any subset of neurons (Methods Generating RROS codes), and found that the true V1 code is much more metabolically efficient than randomly rotated versions of the code (Figure 8B and C). This finding holds for both responses to static gratings and to natural images as we show in Figure 8B and C respectively.
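The RROS operation can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the procedure of Methods Generating RROS codes. It samples the orthogonal matrix by QR decomposition of a Gaussian matrix with a sign correction, which is one standard way to draw from the uniform (Haar) distribution and may differ from the cited sampling method. It also assumes that the only constraint on the shift is non-negativity of the responses to the sampled stimuli, in which case the optimal shift for each neuron is minus its smallest rotated response. The function names and the toy data are hypothetical.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    A = rng.standard_normal((n, n))
    Q, Rm = np.linalg.qr(A)
    return Q * np.sign(np.diag(Rm))   # sign fix makes the distribution uniform

def rros(S, delta, rng):
    """Random rotation and optimal shift of a non-negative code S (N x P)."""
    R = S - delta[:, None]                      # modulation about spontaneous activity
    R_rot = random_orthogonal(S.shape[0], rng) @ R
    delta_new = -R_rot.min(axis=1)              # smallest shift keeping all rates >= 0
    return R_rot + delta_new[:, None], delta_new

rng = np.random.default_rng(0)
# Toy sparse non-negative code: 20 neurons, 50 stimuli (not real data).
S = rng.exponential(size=(20, 50)) * (rng.random((20, 50)) < 0.2)
delta = S.min(axis=1)                           # toy spontaneous rates
S_rros, delta_rros = rros(S, delta, rng)

print("mean activity, original:", S.mean())
print("mean activity, RROS:    ", S_rros.mean())
```

Because the rotation acts on the modulation $r (θ^{μ})$ rather than on the total activity, the Gram matrix of the modulation is unchanged, while the total activity generally is not.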

Figure 8 (with 2 supplements)

The biological code is more metabolically efficient than random codes with same inductive biases.

(A) We illustrate our procedure in a synthetic example. A non-negative population code (left) can be randomly rotated about its spontaneous firing rate (middle), illustrated as a purple dot, and optimally shifted to a new non-negative population code (right). If the kernel is measured about the spontaneous firing rate, these transformations leave the inductive bias of the code invariant but can change the total spiking activity of the neural responses. We refer to such an operation as random rotation + optimal shift (RROS). We also perform gradient descent over rotations and shifts, generating an optimized code (opt). (B) Performing RROS on $N$ neuron subsamples of experimental mouse V1 recordings (Stringer et al., 2021; Pachitariu et al., 2019) shows that the true code has much lower average cost $\frac{1}{N P} \sum_{i μ} s_{i}^{μ}$ compared to random rotations of the code. The set of possible RROS transformations (Methods Generating RROS codes, and Methods Comparing sparsity of population codes) generates a distribution over average cost, which has a higher mean than the true code. We also optimize metabolic cost over the space of RROS transformations, which resulted in the red dashed lines. We plot the distance (in units of standard deviations) between the cost of the true and optimal codes and the cost of randomly rotated codes for different neuron subsample sizes $N$ . (C) The same experiment performed on mouse V1 responses to ImageNet images from 10 relevant classes (Stringer et al., 2018a; Stringer et al., 2018b). (D) The *lifetime* (LS) and *population sparseness* (PS) levels (Methods Lifetime and population sparseness) are higher for the mouse V1 code than for an RROS code. The distance between the average LS and PS of the true code and RROS codes increases with $N$ .

To further explore metabolic efficiency, we posed an optimization problem which identifies the most efficient code with the same kernel as the biological V1 code. This problem searches over rotation matrices $Q$ and finds the matrix $Q$ and offset vector $δ$ which give the lowest cost $\sum_{i μ} s_{i}^{μ}$ (Methods Comparing sparsity of population codes) (Figure 8). Although the local optimum identified with the algorithm is lower in cost than the biological code, both the optimal and biological codes are significantly displaced from the distribution of random codes with the same kernel. Our findings do not change when data is preprocessed with an alternative strategy, an upper bound on neural responses is imposed on rotated codes, or subsets of stimuli are considered (Figure 8—figure supplement 1). We further verified these results on electrophysiological recordings of mouse visual cortex from the Allen Institute Brain Observatory (de Vries et al., 2020; Figure 8—figure supplement 2). Overall, the large disparity in total spiking activity between the true and randomly generated codes with identical kernels suggests that metabolic constraints may favor the biological code over others that realize the same kernel.

The disparity between the true biological code and the RROS code is not only manifested in terms of total activity level, but also in terms of single neuron and single stimulus sparseness measures, specifically lifetime and population sparseness distributions (Methods Lifetime and population sparseness) (Willmore and Tolhurst, 2001; Lehky et al., 2005; Treves and Rolls, 1991; Pehlevan and Sompolinsky, 2014). In Figure 8D, we compare the lifetime and population sparseness distributions of the true biological code with an RROS version of the same code, revealing that biological neurons have significantly higher lifetime sparseness. In Appendix Necessary conditions for optimally sparse codes, we provide analytical arguments which suggest that tuning curves of optimally sparse non-negative codes with full-rank kernels will have selective tuning.
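For readers who want to compute such measures on their own data, the sketch below uses one common definition of sparseness from the cited literature, based on the ratio of the squared mean response to the mean squared response and rescaled to lie between 0 (dense) and 1 (maximally sparse). This is an assumption: the exact definition and normalization used in Methods Lifetime and population sparseness may differ. The function name `sparseness` and the toy data are illustrative.

```python
import numpy as np

def sparseness(x, axis):
    """Sparseness along `axis` for non-negative responses.

    Uses 1 - (mean response)^2 / (mean squared response), rescaled so that a
    one-hot response pattern scores 1 and a constant pattern scores 0.
    Undefined (division by zero) for an all-zero response pattern.
    """
    n = x.shape[axis]
    activity_ratio = np.mean(x, axis=axis) ** 2 / np.mean(x**2, axis=axis)
    return (1.0 - activity_ratio) / (1.0 - 1.0 / n)

rng = np.random.default_rng(0)
S = rng.exponential(size=(20, 50))      # toy non-negative code, N neurons x P stimuli
lifetime = sparseness(S, axis=1)        # one value per neuron, across stimuli
population = sparseness(S, axis=0)      # one value per stimulus, across neurons
```

Lifetime sparseness summarizes how selectively a single neuron responds across stimuli, whereas population sparseness summarizes how few neurons respond to a single stimulus.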

Discussion

Elucidating inductive biases of the brain is fundamentally important for understanding natural intelligence (Tenenbaum et al., 2011; Lake et al., 2017; Sinz et al., 2019; Zador, 2019). These biases are coded into the brain by the dynamics of its neurons, the architecture of its networks, its representations and plasticity rules. Finding ways to extract the inductive biases from neuroscience datasets requires a deeper theoretical understanding of how all these factors shape the biases, and is an open problem. In this work, we attempted to take a step towards filling this gap by focusing on how the structure of static neural population codes shapes inductive biases for learning of a linear readout neuron under a biologically plausible learning rule. If the readout neuron’s output is correlated with behavior, and that correlation is known, then our theory could possibly be modified to predict what behavioral tasks can be learned faster.

Under the delta rule, the generalization performance of the readout is entirely dependent on the code’s inner product kernel; the kernel is a determinant of inductive bias. In its finite dimensional form, the kernel is an example of a representational similarity matrix and is a commonly used tool to study neural representations (Edelman, 1998; Kriegeskorte et al., 2008; Laakso and Cottrell, 2000; Kornblith et al., 2019; Cadieu et al., 2014; Pehlevan et al., 2018). Our work elucidates a concrete link between this experimentally measurable mathematical object, and sample-efficient learning.

We derived an analytical expression for the generalization error as a function of sample-size under very general conditions, for an arbitrary stimulus distribution, arbitrary population code and an arbitrary target stimulus-response map. We used our findings in both theoretical and experimental analysis of primary visual cortex, and temporal codes in a delayed reach task. This generality of our theory is a particular strength.

Our analysis elucidated two principles that define the inductive bias. The first one is spectral bias: kernel eigenfunctions with large eigenvalues can be estimated using a smaller number of samples. The second principle is the code-task alignment: target functions with most of their power in top kernel eigenfunctions can be estimated efficiently and are compatible with the code. The cumulative power distribution, $C (k)$ (Canatar et al., 2021), provides a measure of this alignment. These findings define a notion of ‘simplicity’ bias in learning from examples, and provide a solution to the question of what stimulus-response maps are easier to learn. A similar simplicity bias has also been observed in training deep neural networks (Rahaman et al., 2019; Xu et al., 2019; Kalimeris et al., 2019). Due to a correspondence between gradient-descent trained neural networks in the infinite-width limit and kernel machines (Jacot et al., 2018), results on the spectral bias of kernel machines may shed light onto these findings (Bordelon et al., 2020; Canatar et al., 2021). Though our present analysis focused on learning a single layer weight vector with the biologically plausible delta rule, future work could explore the learning curves of other learning rules for deep networks (Bordelon and Pehlevan, 2022a), such as feedback alignment (Lillicrap et al., 2016) or perturbation methods (Jabri and Flower, 1992). Such analysis could explore how inductive bias is also shaped by the choice of learning rule, as well as the structure of the initial population code.
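Code-task alignment can be estimated directly from a kernel matrix and a target sampled on the same stimuli. The sketch below assumes that $C (k)$ is the fraction of the target's squared norm carried by the top $k$ kernel eigenvectors under a uniform distribution over a stimulus grid, and the precise definition in Canatar et al., 2021 may weight terms differently. The kernel and the two targets are the ones listed in Methods Generating example codes, and the function name `cumulative_power` is illustrative.

```python
import numpy as np

def cumulative_power(K, y):
    """Fraction of target power captured by the top-k kernel eigenvectors."""
    lam, psi = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]        # largest eigenvalue first
    coeffs = psi[:, order].T @ y         # projections of y onto each eigenvector
    power = coeffs**2
    return np.cumsum(power) / power.sum()

theta = 2 * np.pi * np.arange(120) / 120
K1 = np.exp(0.25 * np.cos(theta[:, None] - theta[None, :]))

y_low = np.cos(theta) - 0.6 * np.cos(4 * theta)    # low-frequency target
y_high = np.cos(6 * theta) - np.cos(8 * theta)     # high-frequency target

C_low = cumulative_power(K1, y_low)
C_high = cumulative_power(K1, y_high)
```

For this kernel the eigenvectors are Fourier modes ordered by frequency, so `C_low` saturates after the first nine modes (frequencies 0 to 4) while `C_high` has no power there, which is the signature of a task that is poorly aligned with the code.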

We applied our findings in both theoretical and experimental analysis of mouse primary visual cortex. We demonstrated a bias of neural populations towards low frequency orientation discrimination and low spatial frequency reconstruction tasks. The latter finding is consistent with the finding that mouse visual cortex neurons are selective for low spatial frequency (Niell and Stryker, 2008; Vreysen et al., 2012). The toy model of the visual cortex as a mixture of simple and complex cells demonstrated how invariances in the population code, specifically the phase invariance of the complex cells, can facilitate learning some tasks involving phase invariant responses at the expense of performance on others. The role of invariances in learning with kernel methods and deep networks has recently been investigated in the machine learning literature, showing that invariant representations can improve capacity (Farrell et al., 2021) and sample efficiency for invariant learning problems (Mei et al., 2021; Li et al., 2019; Xiao and Pennington, 2022).

A recent proposal considered the possibility that the brain acts as an overparameterized interpolator (Hasson et al., 2020). Suitable inductive biases are crucial to prevent overfitting and generalize well in such a regime (Belkin et al., 2019). Our theory can explain these inductive biases since, when the kernel is full-rank, which typically is the case when there are more neurons in the population than the number of learning examples, the delta rule without weight decay converges to an interpolator of the learning examples. Modern deep learning architectures also operate in an overparameterized regime, but generalize well (Zhang et al., 2016; Belkin et al., 2019), and an inductive bias towards simple functions has been proposed as an explanation (Bordelon et al., 2020; Canatar et al., 2021; Kalimeris et al., 2019; Valle-Perez et al., 2018). However, we also showed that interpolation can be harmful to prediction accuracy when the target function has some variance unexplained by the neural code or if the neural responses are significantly noisy, motivating use of explicit regularization.
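The statement that the delta rule without weight decay converges to an interpolator can be verified on a toy problem. The sketch below assumes a batch form of the delta rule (gradient descent on the squared error summed over all examples) with zero initial weights, random Gaussian responses in place of a real population code, and arbitrary targets. Under these assumptions the weights converge to the minimum-norm solution that fits every training example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 10                       # more neurons than training examples
R = rng.standard_normal((N, P))     # columns are toy population responses r(theta^mu)
y = rng.standard_normal(P)          # arbitrary target values

w = np.zeros(N)                     # readout weights start at zero
eta = 1.0 / np.linalg.norm(R, 2) ** 2   # step size below 2 / lambda_max(R R^T)
for _ in range(5000):
    err = y - R.T @ w               # prediction errors on all P examples
    w += eta * R @ err              # batch delta rule, no weight decay

w_min_norm = np.linalg.pinv(R.T) @ y    # minimum-norm interpolating weights

print("max training error:", np.abs(R.T @ w - y).max())
```

Adding a weight decay term to the update would instead drive the weights to a ridge regression solution, which no longer fits the training examples exactly.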

Our work promotes sample efficiency as a general coding principle for neural populations, relating neural representations to the kinds of problems they are well suited to solve. These codes may be shaped through evolution or themselves be learned through prior experience (Zador, 2019). Prior related work in this area demonstrated the dependence of sample-efficient learning of a two-angle estimation task on the width of the individual neural tuning curves (Meier et al., 2020) and on additive function approximation properties of sparsely connected random networks (Harris, 2019).

A sample efficiency approach to population coding differs from the classical efficient coding theories (Attneave, 1954; Barlow, 1961; Atick and Redlich, 1992; Srinivasan et al., 1982; van Hateren, 1992; Rao and Ballard, 1999; Olshausen and Field, 1997; Chalk et al., 2018), which postulate that populations of neurons optimize information content of their code subject to metabolic constraints or noise. While these theories emphasize different aspects of the code’s information content (such as reduced redundancy, predictive power, or sparsity), they do not address sample efficiency demands on learning. Further, recent studies demonstrated hallmarks of redundancy and correlation in population responses (Chapin and Nicolelis, 1999; Bathellier et al., 2008; Pitkow and Meister, 2012; Gao and Ganguli, 2015; Abbasi-Asl et al., 2016; Gallego et al., 2018; Stringer et al., 2018b), violating a generic prediction of efficient coding theories that responses of different neurons should be uncorrelated across input stimuli in high signal-to-noise regimes to reduce redundancy in the code and maximize information content (Barlow, 1961; Atick and Redlich, 1992; Srinivasan et al., 1982; van Hateren, 1992; Haft and van Hemmen, 1998; Huang and Rao, 2011). In our theory, the structured correlations of neural responses correspond to the decay in the spectrum of the kernel, and play a key role in biasing learned readouts towards simple functions.

In recent related studies, the asymptotic decay rate of the kernel’s eigenspectrum was argued to be important for generalization (Stringer et al., 2018b) and robustness (Nassar et al., 2020). The spectral decay rate in the mouse V1 was found to be consistent with a high dimensional (power law) but smooth (differentiable) code, and smoothness was argued to be an enabler of generalization (Stringer et al., 2018b). While we also identify power law spectral decays, we show that sample-efficient learning requires more than smoothness conditions in the form of asymptotic decay rates on the kernel’s spectrum. The interplay between the stimulus distribution, target response and the code gives rise to sample efficient learning. Because of spectral bias, the top eigenvalues govern the small sample size behavior. The tail of the spectrum becomes important at large sample sizes.

Though the kernel is degenerate with respect to rotations of the code in the neural activity space, we demonstrated that the true V1 code has much lower average activity than random codes with the same kernel, suggesting that evolution and learning may be selecting neural codes with low average spike rates which preserve sample-efficiency demands for downstream learning tasks. We predict that metabolic efficiency may be a determinant in the orientation and placement of the ubiquitously observed low-dimensional coding manifolds in neural activity space in other parts of the brain (Gallego et al., 2018). The demand of metabolic efficiency is consistent with prior sparse coding theories (Niven and Laughlin, 2008; Olshausen and Field, 1997; Simoncelli and Olshausen, 2001; Hromádka et al., 2008); however, our theory emphasizes sample-efficient learning as the primary normative objective for the code. As a note of caution, while our analysis holds under the assumption that the neural code is deterministic, real neurons exhibit variability in their responses to repeated stimuli. Such noisy population codes do not generally achieve identical generalization performance under RROS transformations. For example, if each neuron is constrained to produce i.i.d. Poisson noise, then simple shifts of the baseline firing rate reduce the information content of the code. However, if the neural noise is Gaussian (even with stimulus dependent noise covariance), then the generalization error is conserved under RROS operations (Appendix Effect of noise on RROS symmetry). Further studies could focus on revealing the space of codes with equivalent inductive biases under realistic noise models.

Our work constitutes a first step towards understanding inductive biases in neuronal circuits. To achieve this, we focused on a linear, delta-rule readout of a static population code. More work is needed to study other factors that affect inductive bias. Importantly, sensory neuron tuning curves can adapt during perceptual learning tasks (Gilbert, 1994; Goltstein et al., 2021; Ghose et al., 2002; Schoups et al., 2001), with the strength of adaptation dependent on brain area (Yang and Maunsell, 2004; Adab et al., 2014; Op de Beeck et al., 2007). In many experiments, these changes to tuning in sensory areas are small (Schoups et al., 2001; Ghose et al., 2002), satisfying the assumptions of our theory. For example, monkeys trained on noisy visual motion detection exhibit changes in sensory-motor (LIP) but not sensory areas (MT), consistent with a model of readout from a static sensory population code (Law and Gold, 2008; Shadlen and Newsome, 2001). However, other perceptual learning tasks and other brain areas can exhibit significant changes in neural tuning (Recanzone et al., 1993; Pleger et al., 2003; Furmanski et al., 2004). This diversity of results motivates more general analysis of learning in multi-layer networks, where representations in each layer can adapt flexibly to task structure (Shan and Sompolinsky, 2021; Mastrogiuseppe et al., 2022; Bordelon and Pehlevan, 2022b; Ahissar and Hochstein, 2004). Alternatively, our current analysis of inductive bias can still be consistent with multi-layer learning if the network is sufficiently overparameterized and tuning curves change very little (Jacot et al., 2018; Lee et al., 2018; Shan and Sompolinsky, 2021). In this case, network training is equivalent to kernel learning with a kernel that depends on the learning rule and architecture (Bordelon and Pehlevan, 2022a). However, in the regime of neural network training where tuning curves change significantly, more sophisticated analytical tools are needed to predict generalization (Flesch et al., 2021; Yang and Hu, 2021; Bordelon and Pehlevan, 2022b). Although our work focused on linear readouts, arbitrary nonlinear readouts which generate convex learning objectives have recently been studied in the high dimensional limit, giving qualitatively similar learning curves which depend on kernel eigenvalues and task-model alignment (Loureiro et al., 2021b; Cui et al., 2022) (see Appendix Typical case analysis of nonlinear readouts).

Our work focused on how signal correlations influence inductive bias (Averbeck et al., 2006; Cohen and Kohn, 2011). However, since real neurons do exhibit variability in their responses to identical stimuli, one should consider the effect of neural noise and noise correlations in learning. We provide a preliminary analysis of learning with neural noise in Appendix Impact of neural noise and unlearnable targets on learning, where we show that neural noise can lead to irreducible asymptotic error which depends on the geometry of the signal and noise correlations. Further, if the target function is not fully expressible as linear combinations of neural responses, overfitting peaks in the learning curves are possible, but can be mitigated with regularization implemented by a weight decay in the learning rule (see Appendix 1—figure 1). Future work could extend our analysis to study how signal and noise correlations interact to shape inductive bias and generalization performance in the case where the noise correlation matrices are non-isotropic, including the role of differential correlations (Moreno-Bote et al., 2014). Overall, future work could build on the present analysis to incorporate a greater degree of realism in a theory of inductive bias.

Finally, we discuss possible applications of our work to experimental neuroscience. Our theory has potential implications for experimental studies of task learning. First, in cases where the population selective to stimuli can be measured directly, an experimenter could design easy or difficult tasks for an animal to learn from few examples, under a hypothesis that the behavioral output is a linear function of the observed neurons. Second, in cases where it is unclear which neural population contributes to learning, one could utilize our theory to solve the inverse problem of inferring the relevant kernel from observed learning curves on different tasks (Wilson et al., 2015). From these tasks, the experimenter could compare the inferred kernel to those of different recorded populations. For instance, one could compare the kernels from separate populations to the inferred kernel obtained from learning curves on certain visual learning tasks. This could provide new ways to test theories of perceptual learning (Gilbert, 1994). Lastly, extensions of our framework could quantify the role of neural variability on task learning and the limitation it imposes on accuracy and sample efficiency.

Methods

Generating example codes (Figure 1)

The two codes in Figure 1 were constructed to produce two different kernels for $θ \in S^{1}$ :

K_{1} (θ, θ^{'}) = \exp (0.25 \cos (θ - θ^{'})), K_{2} (θ, θ^{'}) = \sum_{k = 1}^{20} \cos (k (θ - θ^{'})) .

An infinite number of codes could generate either of these kernels. After diagonalizing the kernel into its eigenfunctions on a grid of 120 points, $K_{1} = Ψ_{1} Λ_{1} Ψ_{1}^{⊤}, K_{2} = Ψ_{2} Λ_{2} Ψ_{2}^{⊤}$ , we used a random rotation matrix $Q \in ℝ^{N \times N}$ (which satisfies ${QQ}^{⊤} = Q^{⊤} Q = I$ ) to generate a valid code

R_{1} = Q Λ_{1}^{1 / 2} Ψ_{1}^{⊤}, R_{2} = Q Λ_{2}^{1 / 2} Ψ_{2}^{⊤} .

This construction guarantees that $R_{1}^{⊤} R_{1} = K_{1}$ and $R_{2}^{⊤} R_{2} = K_{2}$ . We plot the tuning curves for the first three neurons. The target function in the first experiment is $y = \cos (θ) - 0.6 \cos (4 θ)$ , while the second experiment used $y = \cos (6 θ) - \cos (8 θ)$ .
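The construction above can be sketched numerically as follows. This is a minimal illustration, not the authors' code: it assumes $N = 120$ neurons (one per grid point), and the helper names `random_rotation` and `code_from_kernel` are hypothetical.

```python
import numpy as np

def random_rotation(n, rng):
    """Sample an orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))          # fix column signs so Q is Haar distributed

def code_from_kernel(K, rng):
    """Return a code R (neurons x stimuli) with R.T @ R equal to the PSD matrix K."""
    lam, psi = np.linalg.eigh(K)            # K = psi @ diag(lam) @ psi.T
    lam = np.clip(lam, 0.0, None)           # remove tiny negative round-off
    Q = random_rotation(K.shape[0], rng)
    return Q @ np.diag(np.sqrt(lam)) @ psi.T

rng = np.random.default_rng(0)
theta = np.linspace(0.0, 2 * np.pi, 120, endpoint=False)
diff = theta[:, None] - theta[None, :]
K1 = np.exp(0.25 * np.cos(diff))
K2 = sum(np.cos(k * diff) for k in range(1, 21))
R1, R2 = code_from_kernel(K1, rng), code_from_kernel(K2, rng)

# Targets used in the two experiments, evaluated on the same grid
y_first = np.cos(theta) - 0.6 * np.cos(4 * theta)
y_second = np.cos(6 * theta) - np.cos(8 * theta)
```

Each row of `R1` or `R2` is one tuning curve on the stimulus grid. Any other orthogonal `Q` would give a different set of tuning curves with the same kernel.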

Theory of generalization

Recent work has established analytic results that predict the average case generalization error for kernel regression

E_{g} = {⟨ E_{g} (D) ⟩}_{D} = {⟨ {[f (θ, D) - y (θ)]}^{2} ⟩}_{θ, D}

where $E_{g} (D) = {⟨ [f (θ, D) - y (θ)]^{2} ⟩}_{θ}$ is the generalization error for a certain sample $D$ of size $P$ and $f (θ, D) = w \cdot r (θ)$ is the kernel regression solution for $D$ (Appendix Convergence of the delta-rule without weight decay) (Bordelon et al., 2020; Canatar et al., 2021). The typical or average case error $E_{g}$ is obtained by averaging over all possible datasets of size $P$ . This average case generalization error is determined solely by the decomposition of the target function $y (θ)$ along the eigenbasis of the kernel and the eigenspectrum of the kernel. This continuous diagonalization again takes the form (Appendix Singular value decomposition of continuous population responses) (Rasmussen and Williams, 2005)

\int p (θ) K (θ, θ^{'}) ψ_{k} (θ) d θ = λ_{k} ψ_{k} (θ^{'}) .

Our theory is also applicable to discrete stimuli if $p (θ)$ is a Dirac measure as we describe in (Appendix Discrete stimulus spaces: finding eigenfunctions with matrix eigendecomposition). Since the eigenfunctions form a complete set of square integrable functions (Rasmussen and Williams, 2005), we expand both the target function $y (θ)$ and the learned function $f (θ, D)$ in this basis $y (θ) = \sum_{k} v_{k} ψ_{k} (θ), f (θ, D) = \sum_{k} {\hat{v}}_{k} ψ_{k} (θ)$ , where ${\hat{v}}_{k}$ are understood to be functions of the dataset $D$ . The eigenfunctions are orthonormal $\int d θ p (θ) ψ_{k} (θ) ψ_{ℓ} (θ) = δ_{k, ℓ}$ , which implies that the generalization error for any set of coefficients $\hat{v}$ is

E_{g} (D) = {⟨ (y (θ) - f (θ, D))^{2} ⟩}_{θ} = \sum_{k, ℓ} ({\hat{v}}_{k} - v_{k}) ({\hat{v}}_{ℓ} - v_{ℓ}) {⟨ ψ_{k} (θ) ψ_{ℓ} (θ) ⟩}_{θ} = | | \hat{v} - v | |^{2}

We now introduce the equivalent training error, or empirical loss, written directly in terms of eigenfunction coefficients $\hat{v}$ , which depends on the random dataset $D = {(θ^{μ}, y^{μ})}_{μ = 1}^{P}$

H (\hat{v}, D) = \sum_{μ} [(\hat{v} - v) \cdot ψ (θ^{μ})]^{2} + λ \sum_{k} \frac{{\hat{v}}_{k}^{2}}{λ_{k}}

This loss function is minimized by delta rule updates with weight decay constant $λ$ . It is straightforward to verify that the $H$ -minimizing coefficients are ${\hat{v}}^{*} = (Ψ Ψ^{⊤} + λ Λ^{- 1})^{- 1} Ψ Ψ^{⊤} v$ , giving the learned function $f (θ, D) = {\hat{v}}^{*} \cdot ψ (θ) = k (θ)^{⊤} (K + λ I)^{- 1} y$ where the vectors $k$ and $y$ have entries $[k (θ)]_{μ} = K (θ, θ^{μ})$ and $[y]_{μ} = y (θ^{μ})$ for each training stimulus $θ^{μ} \in D$ . The $P \times P$ kernel gram matrix $K$ has entries $[K]_{μ ν} = K (θ^{μ}, θ^{ν})$ . The $λ \to 0$ limit of the minimizer of $H$ coincides with kernel interpolation. This allows us to characterize generalization without reference to learned readout weights $w$ . The generalization error for this optimal function is

\begin{aligned} E_{g} (D) & = | | {\hat{v}}^{*} - v | |^{2} = v^{⊤} Λ^{- 1} G (D)^{2} Λ^{- 1} v \\ G (D) & = {(\frac{1}{λ} Ψ Ψ^{⊤} + Λ^{- 1})}^{- 1} . \end{aligned}
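As a sanity check on the algebra above, the following sketch verifies on a small random instance that the mode-space minimizer agrees with the kernel form of the predictor, and that the two expressions for $E_{g} (D)$ coincide. The sizes, the eigenvalue spectrum, and the random stand-ins for $ψ_{k} (θ^{μ})$ are assumptions chosen only for illustration; the identities are purely algebraic and hold for any such matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
M, P, ridge = 8, 5, 0.3                     # modes, training samples, weight decay
lam = 1.0 / np.arange(1, M + 1) ** 2        # hypothetical kernel eigenvalues
v = rng.standard_normal(M)                  # target coefficients v_k
Psi = rng.standard_normal((M, P))           # stand-in for psi_k(theta^mu)
psi_test = rng.standard_normal(M)           # stand-in for psi(theta) at a test stimulus

Lam_inv = np.diag(1.0 / lam)
v_hat = np.linalg.solve(Psi @ Psi.T + ridge * Lam_inv, Psi @ Psi.T @ v)

# Kernel form of the same predictor: f(theta) = k(theta)^T (K + ridge I)^{-1} y
K = Psi.T @ np.diag(lam) @ Psi
y = Psi.T @ v
k_test = Psi.T @ (lam * psi_test)
f_kernel = k_test @ np.linalg.solve(K + ridge * np.eye(P), y)
f_modes = v_hat @ psi_test

# Generalization error computed directly and through G(D)
G = np.linalg.inv(Psi @ Psi.T / ridge + Lam_inv)
E_direct = np.sum((v_hat - v) ** 2)
E_from_G = v @ Lam_inv @ G @ G @ Lam_inv @ v
```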

We note that the dependence on the randomly sampled dataset $D$ only appears through the matrix $G (D)$ . Thus to compute the typical generalization error we need to average this matrix over realizations of datasets, i.e. ${⟨ G (D) ⟩}_{D}$ . There are multiple strategies to perform such an average and we will study one here based on a partial differential equation which was introduced in Sollich, 1998; Sollich, 2002 and studied further in Bordelon et al., 2020. We describe in detail one method for performing such an average in Appendix Computation of learning curves. After this computation, we find that the generalization error can be approximated at large $P$ as

\begin{matrix} E_{g} = \frac{κ^{2}}{1 - γ} \sum_{k} \frac{v_{k}^{2}}{{(λ_{k} P + κ)}^{2}}, κ = λ + κ \sum_{k} \frac{λ_{k}}{λ_{k} P + κ}, \end{matrix}

where $γ = P \sum_{k} \frac{λ_{k}^{2}}{{(λ_{k} P + κ)}^{2}}$ , giving the desired result. We note that (11) defines the function $κ$ implicitly in terms of the sample size $P$ . Taking $λ \to 0$ gives the generalization error of the minimum norm interpolant, which describes the generalization error of the delta-rule solution without weight decay. This result was recently reproduced using the replica method from statistical mechanics in an asymptotic limit where the number of neurons and samples are large (Bordelon et al., 2020; Canatar et al., 2021). Other recent works have verified our theoretical expressions on a variety of kernels and datasets (Loureiro et al., 2021b; Simon et al., 2021).

Additional intuition for the spectral bias phenomenon can be gained from the expected learned function ${⟨ f (θ, D) ⟩}_{D} = \sum_{k} \frac{λ_{k} P}{λ_{k} P + κ} v_{k} ψ_{k} (θ)$ , which is the average readout prediction over possible datasets $D$ . The function $κ (P)$ is defined implicitly as $κ = λ + κ \sum_{k} \frac{λ_{k}}{λ_{k} P + κ}$ and decreases with $P$ from $κ (0) = λ + \sum_{k} λ_{k}$ to its asymptotic value ${lim}_{P \to \infty} κ (P) = λ$ . The coefficient for the $k$ -th eigenfunction $\frac{λ_{k} P}{λ_{k} P + κ} v_{k}$ approaches the true coefficient $v_{k}$ as $P \to \infty$ . The $k$ -th eigenfunction $ψ_{k}$ is effectively learned when $P ≫ \frac{κ}{λ_{k}}$ . For large eigenvalues $λ_{k}$ , fewer samples $P$ are needed to satisfy this condition, while small eigenvalue modes will require more samples.
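The learning curve formula and the implicit equation for $κ$ can be evaluated numerically. The sketch below is one possible approach, assuming a hypothetical power-law spectrum and a small weight decay; it solves for $κ$ by bisection, which is our choice here and not a method described in the text. It then compares a target aligned with a large-eigenvalue mode against one aligned with a small-eigenvalue mode.

```python
import numpy as np

def solve_kappa(P, eigs, ridge, iters=200):
    """Bisection for kappa = ridge + kappa * sum_k eigs_k / (eigs_k * P + kappa)."""
    lo, hi = 0.0, ridge + eigs.sum()        # the root lies between 0 and kappa(0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        residual = mid - ridge - mid * np.sum(eigs / (eigs * P + mid))
        if residual > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def generalization_error(P, eigs, coeffs, ridge):
    """Average-case E_g for target coefficients `coeffs` in the kernel eigenbasis."""
    kappa = solve_kappa(P, eigs, ridge)
    denom = eigs * P + kappa
    gamma = P * np.sum(eigs ** 2 / denom ** 2)
    return kappa ** 2 / (1.0 - gamma) * np.sum(coeffs ** 2 / denom ** 2)

eigs = 1.0 / np.arange(1, 201) ** 2         # assumed spectrum, lambda_k = k^-2
ridge = 1e-3
low_mode = np.zeros(200)
low_mode[0] = 1.0                           # target equal to the top eigenfunction
high_mode = np.zeros(200)
high_mode[9] = 1.0                          # target equal to the 10th eigenfunction
sample_sizes = (0, 5, 20, 80)
curve_low = [generalization_error(P, eigs, low_mode, ridge) for P in sample_sizes]
curve_high = [generalization_error(P, eigs, high_mode, ridge) for P in sample_sizes]
```

At every sample size the two targets share the same $κ$ and $γ$, so the difference between the curves comes entirely from the factor ${(λ_{k} P + κ)}^{- 2}$, which is smaller for the large-eigenvalue mode.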

RNN experiment

For the simulations in Figure 7, we integrated a rate-based recurrent network model with $N = 6000$ neurons, time constant $τ = 0.05$ and gain $g = 1.5$ . Each of the $P = 80$ randomly chosen angles $γ^{μ}$ generates a trajectory over $T = 100$ equally spaced points in $t \in [0, 3]$ . The two dimensional input sequence is simply $θ (t) = H (t) H (1 - t) {[\cos (γ^{μ}), \sin (γ^{μ})]}^{⊤} \in ℝ^{2}$ . The target function for a delay $d$ is $y (θ^{μ}, t) = H (1.5 + d - t) H (t - d - 1) {[\cos (γ^{μ}), \sin (γ^{μ})]}^{⊤}$ , which is nonzero for times $t \in [1 + d, 1.5 + d]$ . In each simulation, the activity in the network is initialized to $u (0) = 0$ . The kernel gram matrix $K \in ℝ^{P T \times P T}$ is computed by taking inner products of the time varying code for different inputs $γ^{μ}$ and at different times. Learning curves represent the generalization error obtained by randomly sampling $P$ time points from the $P T$ total time points generated in the simulation process and training readout weights $w$ to convergence with gradient descent.

Data analysis

Data source and processing

Mouse V1 neuron responses to orientation gratings were obtained from a publicly available dataset (Stringer et al., 2021; Pachitariu et al., 2019). Two-photon calcium microscopy fluorescence traces were deconvolved into spike trains and spikes were counted for each stimulus, as described in Stringer et al., 2021. The presented grating angles were distributed uniformly over $[0, 2 π]$ radians. Data pre-processing, which included z-scoring against the mean and standard deviation of null stimulus responses, utilized the provided code for this experiment, which is also publicly available at https://github.com/MouseLand/stringer-et-al-2019 (Stringer, 2019). This preprocessing technique was used in all figures in the paper. To reduce corruption of the estimated kernel from neural noise (trial-to-trial variability), we first trial-averaged the responses, binning the grating stimuli oriented at different angles $θ$ into a collection of 100 bins over the interval $[0, 2 π]$ and averaging over all of the available responses from each bin. Since grating angles were sampled uniformly, there is a roughly even distribution of about 45 responses in each bin. After trial averaging, SVD was performed on the response matrix $R$ , generating the eigenspectrum and kernel eigenfunctions as illustrated in Figure 3. Figures 2, 3 and 8 all used these data whenever responses to grating stimuli were mentioned.

In Figures 3D, 4 and 8C, the responses of mouse V1 neurons to ImageNet images (Deng et al., 2009) were obtained from a different publicly available dataset (Stringer et al., 2018a). The images were taken from 15 different classes from the ImageNet dataset with ethological relevance to mice (birds, cats, flowers, hamsters, holes, insects, mice, mushrooms, nests, pellets, snakes, wildcats, other animals, other natural, other man made). In the experiment in Figure 3D we use all images from the mice and birds categories for which responses were recorded. The preprocessing code and image category information were obtained from the publicly available code base at https://github.com/MouseLand/stringer-pachitariu-et-al-2018b (Stringer, 2018c). Again, spike counts were obtained from deconvolved and z-scored calcium fluorescence traces. In the reconstruction experiment shown in Figure 4 we use the entire set of images for which neural responses were recorded.

Generating RROS codes

In Figure 8, the randomly rotated codes are generated by sampling a matrix $Q$ from the Haar measure on the set of $N$ -by- $N$ orthogonal matrices (Anderson et al., 1987), and choosing a $δ$ by solving the following optimization problem:

\min_{δ \in ℝ^{N}} \sum_{i = 1}^{N} \sum_{μ = 1}^{P} s_{i}^{μ}, s.t. s^{μ} = Qr (θ^{μ}) + δ, s_{i}^{μ} \geq 0, i \in [N], μ \in [P],

which minimizes the total spike count subject to the kernel and nonnegativity of firing rates. The solution to this problem is given by $δ_{i}^{*} = - \min_{μ = 1, \dots, P} {[Qr (θ^{μ})]}_{i}$ .
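A minimal sketch of this procedure is given below. It assumes a response matrix `R` of shape neurons by stimuli and samples $Q$ through a sign-corrected QR factorization, a standard way to draw from the Haar measure; the function names are hypothetical and the data are random placeholders. Because the same offset $δ$ is added to every stimulus, differences between population responses are rotated but not stretched, which the sketch checks through the mean-subtracted Gram matrix.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the Haar measure via QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def rros_code(R, rng):
    """Rotate the code R (N x P) and add the smallest offset that keeps rates non-negative."""
    Q = haar_orthogonal(R.shape[0], rng)
    rotated = Q @ R
    delta = -rotated.min(axis=1, keepdims=True)   # delta_i = -min_mu [Q r(theta^mu)]_i
    return rotated + delta, Q, delta

def center(X):
    """Subtract each neuron's mean response across stimuli."""
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(0)
R = rng.standard_normal((6, 20))                  # placeholder for recorded responses
S, Q, delta = rros_code(R, rng)
```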

Comparing sparsity of population codes

To explore the metabolic cost among the set of codes with the same inductive biases, we estimate the distribution of average spike counts of codes with the same inner product kernel as the biological code. These codes are generated in the form $s^{μ} = {Qr}^{μ} + δ$ where $δ$ solves the optimization problem

\min_{δ \in ℝ^{N}} \sum_{i, μ} s_{i}^{μ}, s.t. s^{μ} = Q r^{μ} + δ, s_{i}^{μ} \geq 0

To quantify the distribution of such codes, we randomly sample $Q$ from the invariant (Haar) measure for $N \times N$ orthogonal matrices and compute the optimal $δ$ as described above. This generates the aqua colored distribution in Figure 8B and C.

We also attempt to characterize the most efficient code with the same inner product kernel

\min_{Q, δ} \sum_{i, μ} s_{i}^{μ}, s.t. s^{μ} = Q r^{μ} + δ, s_{i}^{μ} \geq 0.

Since this optimization problem is non-convex in $Q$ , there is no theoretical guarantee that minima are unique. Nonetheless, we attempt to optimize the code by starting $Q$ at the identity matrix and conducting gradient descent over orthogonal matrices (Plumbley, 2004). Such updates take the form

Q_{t + 1} = \exp (- η \nabla ℒ) Q_{t}, \nabla ℒ = \frac{\partial ℒ}{\partial Q} Q^{⊤} - Q {\frac{\partial ℒ}{\partial Q}}^{⊤}

where $\exp (\cdot)$ is the matrix exponential. To make the loss function differentiable, we incorporate the non-negativity constraint with a soft-minimum:

\begin{aligned} ℒ & = \sum_{i μ} (q_{i}^{⊤} r^{μ} - {softmin}_{ν} (q_{i}^{⊤} r^{ν}; β)) \\ softmin (a^{1}, a^{2}, \dots, a^{P}; β) & = \frac{1}{Z} \sum_{μ = 1}^{P} a^{μ} \exp (- β a^{μ}), \end{aligned}

where $Z = \sum_{ν} \exp (- β a^{ν})$ is a normalizing constant and $Q = [q_{1}, \dots q_{N}]$ . In the $β \to \infty$ limit, this cost function converges to the exact optimization problem with non-negativity constraint. Finite $β$ , however, allows learning with gradient descent. Gradients are computed with automatic differentiation in JAX (Bradbury et al., 2018). This optimization routine is run until convergence and the optimal value is plotted as dashed red lines labeled ‘opt’ in Figure 8.
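The behaviour of the soft-minimum can be illustrated with a short NumPy sketch. It is not the JAX implementation referred to above: it only evaluates the smooth cost and compares it with the exact cost for a fixed $Q$ . The function names are hypothetical, the data are random placeholders, and we assume here that $q_{i}$ denotes the $i$ -th row of $Q$ , so that $q_{i}^{⊤} r^{μ} = {[Q r^{μ}]}_{i}$ .

```python
import numpy as np

def softmin(a, beta):
    """Boltzmann-weighted mean of a: the plain mean at beta = 0, min(a) as beta grows."""
    weights = np.exp(-beta * (a - a.min()))   # shifting by min(a) leaves the ratio unchanged
    return np.sum(a * weights) / np.sum(weights)

def soft_spike_cost(Q, R, beta):
    """Smooth surrogate for the total spike count of the shifted code Q @ R."""
    projections = Q @ R                       # row i holds [Q r^mu]_i for every stimulus
    return sum(np.sum(row - softmin(row, beta)) for row in projections)

def hard_spike_cost(Q, R):
    """Exact total spike count after subtracting each neuron's minimum response."""
    projections = Q @ R
    return np.sum(projections - projections.min(axis=1, keepdims=True))

a = np.array([3.0, 1.0, 2.0])
rng = np.random.default_rng(0)
R = rng.standard_normal((5, 12))              # placeholder responses, 5 neurons x 12 stimuli
Q = np.eye(5)                                 # the initial condition used for optimization
```

Since the soft-minimum is never smaller than the true minimum, the smooth cost is a lower bound on the exact cost and approaches it as $β$ increases.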

We show that our result is robust to different pre-processing techniques and to imposing bounds on neural firing rates in Figure 8—figure supplement 1. To demonstrate that our result is not an artifact of z-scoring the deconvolved signals against the spontaneous baseline activity level, we also conduct the random rotation experiment on the raw deconvolved signals. In addition, we show that imposing realistic constraints on the upper bound of each neuron’s responses does not change our findings. We used a subset of $N = 100$ neurons and computed random rotations. However, we only accepted a code as valid if its maximum value was less than some upper bound $u_{b}$ . Subsets of $N = 100$ neurons in the biological code achieve maxima in the range between 3.2 and 4.7. We performed this experiment for $u_{b} \in \{3, 4, 5\}$ so that the artificial codes would have maxima that lie in the same range as the biological code.

Lifetime and population sparseness

We compute two more refined measures of sparseness in a population code. For each neuron $i$ we compute the lifetime sparseness $L S_{i}$ (also known as selectivity) and for each stimulus $θ$ we compute the population sparseness $P S_{θ}$ which are defined as the following two ratios (Willmore and Tolhurst, 2001; Lehky et al., 2005; Treves and Rolls, 1991; Pehlevan and Sompolinsky, 2014)

L S_{i} = \frac{1}{1 - \frac{1}{P}} \frac{{Var}_{θ} r_{i} (θ)}{{⟨ r_{i} (θ)^{2} ⟩}_{θ}}, P S_{θ} = \frac{1}{1 - \frac{1}{N}} \frac{{Var}_{i} r_{i} (θ)}{{⟨ r_{i} (θ)^{2} ⟩}_{i}}

The normalization factors ensure that these quantities lie in the interval $(0, 1)$ . Intuitively, lifetime sparseness quantifies the variability of each neuron’s responses over the full set of stimuli, whereas population sparseness quantifies the variability of responses in the code for a given stimulus $θ$ .
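Both measures are straightforward to compute from a response matrix. The sketch below assumes `R` has shape neurons by stimuli and uses the population variance (normalized by the number of entries), matching the averages in the definitions; the function names are hypothetical. A neuron or stimulus with all-zero responses would produce a division by zero and should be excluded beforehand.

```python
import numpy as np

def lifetime_sparseness(R):
    """LS_i for each neuron (row of R): variability across the P stimuli."""
    P = R.shape[1]
    return R.var(axis=1) / np.mean(R ** 2, axis=1) / (1.0 - 1.0 / P)

def population_sparseness(R):
    """PS_theta for each stimulus (column of R): variability across the N neurons."""
    N = R.shape[0]
    return R.var(axis=0) / np.mean(R ** 2, axis=0) / (1.0 - 1.0 / N)

one_hot = np.eye(4)           # each neuron responds to exactly one stimulus
dense = np.ones((4, 4))       # every neuron responds equally to every stimulus
```

For the one-hot code both measures equal 1, and for the constant code both equal 0, the two extremes of the range.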

Fitting a Gabor model to mouse V1 kernel

Under the assumption of translation symmetry in the kernel $K (θ, θ^{'})$ , we averaged the elements of the empirical mouse V1 kernel over rows (Pachitariu et al., 2019)

K (Δ) = \frac{1}{P} \sum_{μ = 1}^{P} K (θ^{μ}, θ^{μ} + Δ)

where angular addition is taken mod $π$ . This generates the black dots in Figure 5B. We aimed to fit a threshold-power law nonlinearity of the form $g_{q, a} (z) = \max \{0, z - a\}^{q}$ to the kernel. Based on the Gabor model discussed above, we parameterized tuning curves as

r_{σ^{2}, q, a} (θ, θ_{i}) = g_{q, a} (\frac{\cosh (σ^{- 2} \cos (θ - θ_{i}))}{\cosh (σ^{- 2})}),

where $θ_{i}$ is the preferred angle of the $i$ -th neuron’s tuning curve. Rather than attempting to perform a fit of $σ^{2}, a, q, \{θ_{i}\}_{i = 1}^{N}$ of this form to the responses of each of the $\sim 20$k neurons, we instead simply attempt to fit to the population kernel by optimizing over $(σ^{2}, a, q)$ under the assumption of uniform tiling of $θ_{i}$ . However, we noticed that two of these variables $σ^{2}, a$ are constrained by the sparsity level of the code. If each neuron, on average, fires for only a fraction $f$ of the uniformly sampled angles $θ$ , then the following relationship holds between $σ^{2}$ and $a$

a = \frac{\cosh (σ^{- 2} \cos (\frac{π}{2} f))}{\cosh (σ^{- 2})} .

Calculation of the coding level $f$ for the recorded responses allowed us to infer $a$ from $σ^{2}$ during optimization. This reduced the free parameter set to $(σ^{2}, q)$ . We then solve the following optimization problem

\begin{aligned} & \min_{σ^{2}, q} {⟨ {({\hat{K}}_{σ^{2}, q} (θ) - K (θ))}^{2} ⟩}_{θ}, \\ & {\hat{K}}_{σ^{2}, q} (θ) = {⟨ r_{σ^{2}, q} (θ, θ^{'}) r_{σ^{2}, q} (0, θ^{'}) ⟩}_{θ^{'}}, \end{aligned}

where integration over $θ_{i}$ is performed numerically. Using the Scipy Trust-Region constrained optimizer, we found $(q, σ^{- 2}, a) = (1.7, 5.0, 0.2)$ which we use as the fit parameters in Figure 5.

Lead contact

Requests for information should be directed to the lead contact, Cengiz Pehlevan (cpehlevan@seas.harvard.edu).

Data and code availability

Responses to ImageNet images and preprocessing code were obtained from another publicly available dataset, https://github.com/MouseLand/stringer-pachitariu-et-al-2018b (Stringer et al., 2018a).

The code generated by the authors for this paper is also available at https://github.com/Pehlevan-Group/sample_efficient_pop_codes (Pehlevan-Group, 2022).

Appendix 1

Singular value decomposition of continuous population responses

SVD of population responses is usually evaluated with respect to a discrete and finite set of stimuli. In the main paper, we implicitly assumed a generalization of SVD to a continuum of stimuli. In this section we provide an explicit construction of this generalized SVD using techniques from functional analysis. Our construction is an example of the quasimatrix SVD defined in Townsend and Trefethen, 2015 and justifies our use of SVD in the main text.

For our construction, we note that Mercer’s theorem guarantees the existence of an eigendecomposition of any inner product kernel $K (θ, θ^{'})$ in terms of a complete orthonormal set of functions $\{ψ_{k}\}_{k = 1}^{\infty}$ (Rasmussen and Williams, 2005). In particular, there exist non-negative (but possibly zero) summable eigenvalues $\{λ_{k}\}_{k = 1}^{\infty}$ and a corresponding set of orthonormal eigenfunctions such that

\begin{matrix} K (θ, θ^{'}) = \sum_{k = 1}^{\infty} λ_{k} ψ_{k} (θ) ψ_{k} (θ^{'}) . \end{matrix}

For a stimulus distribution $p (θ)$ , the set of functions $\{ψ_{k}\}_{k = 1}^{\infty}$ is orthonormal and forms a complete basis for square integrable functions $L_{2}$ , which means

\begin{aligned} {⟨ ψ_{k} (θ) ψ_{ℓ} (θ) ⟩}_{θ} & = \int p (θ) ψ_{k} (θ) ψ_{ℓ} (θ) d θ = δ_{k ℓ}, \\ f (θ) & = \sum_{k} {⟨ f (θ^{'}) ψ_{k} (θ^{'}) ⟩}_{θ^{'}} ψ_{k} (θ), \forall f \in L_{2} . \end{aligned}

Next, we use this basis to construct the SVD. Each of the tuning curves $r_{i} \in L_{2}$ (assumed to be square integrable) can be expressed in this basis with the top $N$ of the functions in the set

r_{i} (θ) = \sum_{k = 1}^{N} A_{i k} ψ_{k} (θ),

where we introduced a matrix $A \in ℝ^{N \times N}$ of expansion coefficients. Note that $rank (A) \leq N$ . We compute the singular value decomposition of the finite matrix $A$

A = \sqrt{N} \sum_{k = 1}^{rank (A)} \sqrt{λ_{k}} u_{k} v_{k}^{⊤} .

We note that the signal correlation matrix for this population code can be computed in closed form

Σ_{s} = \frac{1}{N} A {⟨ ψ (θ) ψ {(θ)}^{⊤} ⟩}_{θ} A^{⊤} = \frac{1}{N} {AA}^{⊤} = \sum_{k = 1}^{rank (A)} λ_{k} u_{k} u_{k}^{⊤},

due to the orthonormality of ${ψ_{k}}$ . Thus the principal axes $u_{k}$ of the neural correlations are the left singular vectors of $A$ . We may similarly express the inner product kernel in terms of the eigenfunctions

K (θ, θ^{'}) = \frac{1}{N} r (θ) \cdot r (θ^{'}) = \frac{1}{N} ψ {(θ)}^{⊤} A^{⊤} A ψ (θ^{'}) .

The kernel eigenvalue problem demands (Rasmussen and Williams, 2005)

\begin{array}{ll} \int p (θ) K (θ, θ^{'}) ψ (θ) d θ = \frac{1}{N} A^{⊤} A ψ (θ^{'}) = Λ ψ (θ^{'}) ⟹ \frac{1}{N} A^{⊤} A = Λ \\ ⟹ \sum_{k = 1}^{rank (A)} λ_{k} v_{k} v_{k}^{⊤} = \sum_{k = 1}^{rank (A)} λ_{k} e_{k} e_{k}^{⊤} . \end{array}

The $v_{k}$ vectors must be identical to $\pm e_{k}$ , the Cartesian unit vectors, if the eigenvalues are non-degenerate. From this exercise, we find that the SVD for $A$ has the form $A = \sqrt{N} \sum_{k = 1}^{rank (A)} \sqrt{λ_{k}} u_{k} e_{k}^{⊤}$ . With this choice, the population code admits a singular value decomposition

r (θ) = A ψ (θ) = \sqrt{N} \sum_{k = 1}^{rank (A)} \sqrt{λ_{k}} u_{k} ψ_{k} (θ) .

This singular value decomposition demonstrates the connection between neural manifold structure (principal axes $u_{k}$ ) and function approximation (kernel eigenfunctions $ψ_{k}$ ). This singular value decomposition can be verified by computing the inner product kernel and the correlation matrix, utilizing the orthonormality of ${u_{k}}$ and ${ψ_{k}}$ . This exercise has important consequences for the space of learnable functions, which is at most $rank (A)$ dimensional since linear readouts lie in $span {r_{i} (θ)}_{i = 1}^{N}$ .

Discrete stimulus spaces: finding eigenfunctions with matrix eigendecomposition

In our discussion so far, our notation suggested that $θ$ takes a continuum of values. Here we want to point out that our theory still applies if $θ$ takes a discrete set of values. In this case, we can think of a Dirac measure $p (θ) = \sum_{i = 1}^{\tilde{P}} p_{i} δ (θ - θ^{i})$ , where $i$ indexes all the $\tilde{P}$ values $θ$ can take. With this choice

\int p (θ) K (θ, θ^{'}) ψ_{k} (θ) d θ = \sum_{i = 1}^{\tilde{P}} p_{i} K (θ^{i}, θ^{'}) ψ_{k} (θ^{i}) = λ_{k} ψ_{k} (θ^{'}) .

Demanding this equality for $θ^{'} = θ^{i}, i = 1, . . ., \tilde{P}$ generates a matrix eigenvalue problem

K B Ψ = Ψ Λ,

where $B_{i j} = δ_{i j} p_{i}$ . The eigenfunctions over the stimuli are identified as the columns of $Ψ$ while the eigenvalues are the diagonal elements of $Λ_{k ℓ} = λ_{k} δ_{k ℓ}$ .
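The matrix $K B$ is not symmetric in general, so one convenient route (an implementation choice on our part, not a step taken from the text) is to diagonalize the symmetric matrix $B^{1 / 2} K B^{1 / 2}$ and map its eigenvectors back. The sketch below assumes strictly positive probabilities $p_{i}$ ; the function name and the random features are hypothetical.

```python
import numpy as np

def weighted_kernel_eig(K, p):
    """Solve K B Psi = Psi Lam with B = diag(p), p_i > 0, via a symmetric eigenproblem."""
    s = np.sqrt(p)
    lam, U = np.linalg.eigh(s[:, None] * K * s[None, :])   # B^{1/2} K B^{1/2}
    order = np.argsort(lam)[::-1]                          # largest eigenvalue first
    return lam[order], U[:, order] / s[:, None]            # Psi = B^{-1/2} U

rng = np.random.default_rng(0)
features = rng.standard_normal((12, 7))     # 12 hypothetical neurons, 7 discrete stimuli
K = features.T @ features / 12
p = rng.random(7) + 0.1
p /= p.sum()                                # stimulus probabilities p_i
lam, Psi = weighted_kernel_eig(K, p)
```

The columns of `Psi` returned this way are orthonormal with respect to $p (θ)$ , that is $Ψ^{⊤} B Ψ = I$ , which is the discrete counterpart of the orthonormality condition on the eigenfunctions.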

Experimental considerations

In an experimental setting, a finite number of stimuli are presented and the SVD is calculated over this finite set regardless of the support of $p (θ)$ . This raises the question of the interpretation of this SVD and its relation to the inductive bias theory we presented. Here we provide two interpretations.

In the first interpretation, we think of the empirical SVD as providing an estimate of the SVD over the full distribution $p (θ)$ . To formalize this notion, we can introduce a Monte-Carlo estimate of the integral eigenvalue problem

\int p (θ) K (θ, θ^{'}) ψ_{k} (θ) d θ \approx \frac{1}{\tilde{P}} \sum_{μ = 1}^{\tilde{P}} K (θ^{μ}, θ^{'}) ψ_{k} (θ^{μ}) = λ_{k} ψ_{k} (θ^{'}) .

For this interpretation to work, the experimenter must sample the stimuli from $p (θ)$ , which could be the natural stimulus distribution. Measuring responses to a larger number of stimuli gives a more accurate approximation of the integral above, which will provide a better estimate of generalization performance on the true distribution $p (θ)$ .

In the second interpretation, we construct an empirical measure on $\tilde{P}$ experimental stimulus values $\hat{p} (θ) = \frac{1}{\tilde{P}} \sum_{μ = 1}^{\tilde{P}} δ (θ - θ^{μ})$ , and consider learning and generalization over this distribution. This allows the application of our theory to an experimental setting where $\hat{p} (θ)$ is designed by an experimenter. For example, the experimenter could procure a complicated set of $\tilde{P}$ videos, to which an associated function $y (θ)$ must be learned. After showing these videos to the animal and measuring neural responses, the experimenter could compute, with our theory, generalization error for a uniform distribution over this full set of $\tilde{P}$ videos. Our theory would predict generalization over this distribution after providing supervisory feedback for only a strict subset of $P < \tilde{P}$ videos. Under this interpretation, the relationship between the integral eigenvalue problem and matrix eigenvalue problem is exact rather than approximate

\int \hat{p} (θ) K (θ, θ^{'}) ψ_{k} (θ) d θ = \frac{1}{\tilde{P}} \sum_{μ = 1}^{\tilde{P}} K (θ^{μ}, θ^{'}) ψ_{k} (θ^{μ}) = λ_{k} ψ_{k} (θ^{'}) .

Demanding either of the equalities (32) or (33) for $θ^{'} = θ^{ν}, ν = 1, \dots, \tilde{P}$ generates a matrix eigenvalue problem

K Ψ = \tilde{P} Ψ Λ .

The eigenfunctions restricted to ${θ^{μ}}$ are identified as the columns of $Ψ$ while the eigenvalues are the diagonal elements of $Λ_{k ℓ} = λ_{k} δ_{k ℓ}$ . For the case where $N$ and $P$ are finite, the spectrum obtained through eigendecomposition of the kernel $K$ is the same as would be obtained through the finite $N$ signal correlation matrix $Σ_{s}$ , since they are inner and outer products of trial averaged population response matrices $R$ .
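The agreement between the two spectra can be checked directly. In the sketch below the response matrix is random and serves only as a placeholder; it has fewer neurons than stimuli, so the kernel matrix has additional zero eigenvalues beyond those shared with the correlation matrix. The normalizations follow the definitions of $K$ and $Σ_{s}$ given earlier in this appendix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_stimuli = 30, 50
R = rng.standard_normal((n_neurons, n_stimuli))     # placeholder trial-averaged responses

K = R.T @ R / n_neurons                             # stimulus-by-stimulus kernel matrix
Sigma_s = R @ R.T / (n_neurons * n_stimuli)         # neuron-by-neuron signal correlation

# Kernel eigenvalues are eigenvalues of K divided by the number of stimuli
lam_kernel = np.sort(np.linalg.eigvalsh(K) / n_stimuli)[::-1]
lam_corr = np.sort(np.linalg.eigvalsh(Sigma_s))[::-1]
```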

Translation invariant kernels

For the special case where the data distribution $p (θ) = \frac{1}{V}$ is uniform over volume $V$ and the kernel is translation invariant $K (θ, θ^{'}) = κ (θ - θ^{'})$ , the kernel can be diagonalized in the basis of plane waves

\int p (θ) K (θ, θ^{'}) ψ_{k} (θ) d θ = \frac{1}{V} \int κ (θ - θ^{'}) e^{i k \cdot θ} d θ = \frac{1}{V} \hat{κ} (k) e^{i k \cdot θ^{'}}

The eigenvalues are the Fourier components of the kernel $λ_{k} = \frac{1}{V} \hat{κ} (k) = \frac{1}{V} \int d θ e^{i k \cdot θ} κ (θ)$ while the eigenfunctions are plane waves $ψ_{k} (θ) = e^{i k \cdot θ}$ . The set of admissible momenta $S_{k} = \{k_{0}, \pm k_{1}, \pm k_{2}, \dots\}$ is determined by the boundary conditions. The diagonalized representation of the kernel is therefore

K (θ, θ^{'}) = \sum_{k \in S_{k}} λ_{k} e^{i k \cdot (θ - θ^{'})}

For example, if the space is the torus $T^{n} = S^{1} \times S^{1} \times \dots \times S^{1}$ , then the space of admissible momenta is the set of points on the integer lattice $S_{k} = ℤ^{n} = \{k \in ℝ^{n} | k_{i} \in ℤ \ \forall i = 1, \dots, n\}$ . Reality and symmetry of the kernel demand that $Im (λ_{k}) = 0$ and $λ_{- k} = λ_{k} \geq 0$ . Most of the models in this paper consider $θ \sim Unif (S^{1})$ , where the kernel has the following Fourier/Mercer decomposition

\begin{aligned} K (θ - θ^{'}) & = \sum_{k = - \infty}^{\infty} λ_{k} e^{i k (θ - θ^{'})} = λ_{0} + 2 \sum_{k = 1}^{\infty} λ_{k} \cos (k (θ - θ^{'})) \\ & = λ_{0} + \sum_{k = 1}^{\infty} λ_{k} [\sqrt{2} \cos (k θ) \sqrt{2} \cos (k θ^{'}) + \sqrt{2} \sin (k θ) \sqrt{2} \sin (k θ^{'})] \end{aligned}

where we used $λ_{- k} = λ_{k}$ and invoked the simple trigonometric identity $\cos (a - b) = \cos (a) \cos (b) + \sin (a) \sin (b)$ . By recognizing that the constant function together with $\{\sqrt{2} \cos (k θ), \sqrt{2} \sin (k θ)\}_{k = 1}^{\infty}$ forms a complete orthonormal set of functions with respect to $Unif (S^{1})$ , we have identified this as the collection of kernel eigenfunctions.
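On a uniform grid over $S^{1}$ this decomposition can be checked with a discrete Fourier transform, because the kernel matrix of a translation invariant kernel is circulant. The sketch below uses the kernel $\exp (0.25 \cos (θ - θ^{'}))$ from the Methods as an example; the grid size is an arbitrary choice for illustration.

```python
import numpy as np

P = 128
theta = 2 * np.pi * np.arange(P) / P                 # uniform grid on the circle
kernel_profile = np.exp(0.25 * np.cos(theta))        # kappa(Delta) sampled on the grid

# Eigenvalues from the Fourier transform of the kernel profile
lam_fft = np.real(np.fft.fft(kernel_profile)) / P

# Eigenvalues from direct diagonalization of the kernel matrix
K = np.exp(0.25 * np.cos(theta[:, None] - theta[None, :]))
lam_eig = np.linalg.eigvalsh(K / P)
```

In `lam_fft`, entry `k` and entry `P - k` hold $λ_{k}$ and $λ_{- k}$ , which coincide, so every nonzero frequency contributes a degenerate pair of eigenfunctions (a cosine and a sine).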

Invariant kernels possess invariant eigenfunctions

Suppose the kernel $K (θ, θ^{'})$ is invariant to some set of transformations $t \in T$ , by which we mean that

\begin{aligned} K (t θ, θ^{'}) = K (θ, t θ^{'}) = K (θ, θ^{'}), \forall t \in T \end{aligned}

We will now show that any eigenfunction of such a kernel with nonzero eigenvalue must be an invariant function. Let $ψ_{k} (θ)$ be an eigenfunction with eigenvalue $λ_{k} > 0$ , then

\begin{aligned} ψ_{k} (t θ) = \frac{1}{λ_{k}} \int p (θ^{'}) K (θ^{'}, t θ) ψ_{k} (θ^{'}) d θ^{'} = \frac{1}{λ_{k}} \int p (θ^{'}) K (θ^{'}, θ) ψ_{k} (θ^{'}) d θ^{'} = ψ_{k} (θ) \end{aligned}

This establishes that all functions which depend on $T$ transformations must necessarily lie in the null-space of $K$ .

Theory of generalization

Convergence of the delta-rule without weight decay

In this section, we discuss the delta-rule convergence when the weight decay parameter is set to $λ = 0$ . The next section considers the simpler case where $λ > 0$ . Gradient descent training of readout weights $w$ on a finite sample of size $P$ converges to the kernel regression solution (Bartlett et al., 2020; Hastie et al., 2020). Let $D = \{θ^{μ}, y^{μ}\}_{μ = 1}^{P}$ be the dataset with samples $θ^{μ}$ and target values $y^{μ}$ . We introduce a shorthand $r^{μ} = r (θ^{μ})$ for convenience. The empirical loss we aim to minimize is a sum of the squared losses of each data point in the training set

ℒ (w) = \frac{1}{2} \sum_{μ = 1}^{P} {(r^{μ} \cdot w - y^{μ})}^{2} .

Performing gradient descent updates

\begin{matrix} w_{t + 1} = w_{t} - η \frac{\partial ℒ}{\partial w_{t}} = w_{t} - η \sum_{μ = 1}^{P} r^{μ} (r^{μ} \cdot w_{t} - y^{μ}), \end{matrix}

recovers the delta rule that we discussed in the main text (Widrow and Hoff, 1960; Hertz et al., 1991). Letting the empirical response matrix $R = [r^{1}, \dots, r^{P}] \in ℝ^{N \times P}$ have a SVD $R = \sum_{k} \sqrt{{\hat{λ}}_{k}} {\hat{u}}_{k} {\hat{ψ}}_{k}^{⊤}$ , and expanding the weights $w_{t} = \sum_{k} w_{t, k} {\hat{u}}_{k}$ and labels $y = \sum_{k} {\hat{v}}_{k} {\hat{ψ}}_{k}$ in their respective SVD bases, we find

\begin{matrix} w_{t + 1, k} = w_{t, k} - η {\hat{λ}}_{k} w_{t, k} + η \sqrt{{\hat{λ}}_{k}} {\hat{v}}_{k} \end{matrix}

For all directions with ${\hat{λ}}_{k} > 0$ , the dynamics converge to the unique fixed point $w_{k}^{*} = \frac{{\hat{v}}_{k}}{\sqrt{{\hat{λ}}_{k}}}$ (provided the learning rate $η$ is small enough), while for all modes with ${\hat{λ}}_{k} = 0$ , the weights are never updated and remain at their initial value, which we take to be $w_{k}^{*} = 0$ . Thus

\begin{aligned} w^{*} & = [\sum_{k : {\hat{λ}}_{k} > 0} \frac{{\hat{u}}_{k} {\hat{ψ}}_{k}^{⊤}}{\sqrt{{\hat{λ}}_{k}}}] y = R [\sum_{k : {\hat{λ}}_{k} > 0} \frac{{\hat{ψ}}_{k} {\hat{ψ}}_{k}^{⊤}}{{\hat{λ}}_{k}}] y = R K^{+} y \end{aligned}

where $K^{+}$ is the Moore-Penrose inverse of the kernel matrix $K_{μ, ν} = K (θ^{μ}, θ^{ν})$ . The predictions of the learned function are given by $f = w^{*} \cdot r (θ)$ which can be expressed as

\begin{matrix} f (θ) = k {(θ)}^{⊤} K^{+} y \end{matrix}

The fact that the solution can be written in terms of a linear combination of ${K (θ, θ^{μ})}_{μ = 1}^{P}$ is known as the representer theorem (Schölkopf et al., 2001; Rasmussen and Williams, 2005). A similar analysis for nonlinear readouts where $f (θ) = g (w \cdot r (θ))$ is provided in Appendix Convergence of delta-rule for nonlinear readouts.
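The convergence argument can be reproduced in a few lines. The sketch below runs the delta rule from zero initial weights on a random placeholder code and compares the result with $R K^{+} y$ , taking $K = R^{⊤} R$ for this illustration. The dimensions, the number of iterations, and the learning rate (set to the inverse of the largest eigenvalue of $R R^{⊤}$ ) are assumptions chosen so that the iteration is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 40, 10                                   # neurons, training stimuli
R = rng.standard_normal((N, P)) / np.sqrt(N)    # columns are responses r^mu
y = rng.standard_normal(P)                      # target values y^mu

eta = 1.0 / np.linalg.norm(R, 2) ** 2           # inverse of the top eigenvalue of R R^T
w = np.zeros(N)                                 # zero initial weights
for _ in range(5000):
    w = w - eta * R @ (R.T @ w - y)             # delta-rule update summed over the dataset

K = R.T @ R                                     # P x P kernel gram matrix
w_star = R @ np.linalg.pinv(K) @ y              # closed-form fixed point R K^+ y
```

With more neurons than training stimuli, as here, the learned readout interpolates the training targets, while directions of weight space orthogonal to the observed responses stay at zero.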

Weight decay and ridge regression

We can introduce a regularization term in our learning problem which penalizes the size of the readout weights. This leads to a modified learning objective of the form

L (w) = \sum_{μ} (r^{μ} \cdot w - y^{μ})^{2} + λ | | w | |^{2} .

Inclusion of this regularization alters the learning rule through weight decay $w_{t + 1} = (1 - η λ) w_{t} - η \sum_{μ} r^{μ} (r^{μ} \cdot w_{t} - y^{μ})$ , which multiplies the existing weight value by a factor of $1 - η λ$ before adding the data dependent update. The fixed point of these dynamics is $w = {({RR}^{⊤} + λ I)}^{- 1} Ry$ . This learning problem and gradient descent dynamics have a closed form solution

f (θ) = r (θ) \cdot w^{*} = \sum_{μ = 1}^{P} α^{μ} K (θ, θ^{μ}), α = {(K + λ I)}^{- 1} y .

The generalization benefit of explicit regularization through weight decay is known to be related to the noise statistics in the learning problem (Canatar et al., 2021). This is visible in Appendix 1—figure 1, where unlearnable target functions demand nonzero optimal regularization. We simulate weight decay only in Figure 6C, where we use $λ = 0.01 \sum_{k} λ_{k}$ to improve numerical stability at large $P$ .
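The equivalence between the weight-space fixed point and the kernel form of the predictor can be verified numerically. The sketch below uses a random placeholder code, takes $K = R^{⊤} R$ , and also iterates the weight decay rule to confirm that it reaches the same fixed point; the sizes, the value of $λ$ (named `ridge`), the learning rate, and the iteration count are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, ridge = 40, 15, 0.1
R = rng.standard_normal((N, P))                 # columns are responses r^mu
y = rng.standard_normal(P)

# Fixed point in weight space: w = (R R^T + ridge I)^{-1} R y
w_primal = np.linalg.solve(R @ R.T + ridge * np.eye(N), R @ y)

# Kernel form: alpha = (K + ridge I)^{-1} y with K = R^T R
alpha = np.linalg.solve(R.T @ R + ridge * np.eye(P), y)

r_test = rng.standard_normal(N)                 # response to a held-out stimulus
f_primal = r_test @ w_primal
f_kernel = (r_test @ R) @ alpha                 # sum_mu alpha^mu K(theta, theta^mu)

# Delta rule with weight decay, started from zero weights
eta = 1.0 / (np.linalg.norm(R, 2) ** 2 + ridge)
w = np.zeros(N)
for _ in range(20000):
    w = (1.0 - eta * ridge) * w - eta * R @ (R.T @ w - y)
```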

Computation of learning curves

Recent work has established analytic results that predict the average case generalization error for kernel regression

E_{g} = {⟨ E_{g} (D) ⟩}_{D} = {⟨ (f (θ, D) - y (θ))^{2} ⟩}_{θ, D}

where $E_{g} (D) = {⟨ (f (θ, D) - y (θ))^{2} ⟩}_{θ}$ is the generalization error for a certain sample $D$ of size $P$ and $f (θ, D)$ is the kernel regression solution for $D$ (Bordelon et al., 2020; Canatar et al., 2021). The typical or average case error $E_{g}$ is obtained by averaging over all possible datasets of size $P$ . This average case generalization error is determined solely by the decomposition of the target function $y (θ)$ along the eigenbasis of the kernel and the eigenspectrum of the kernel. This diagonalization takes the form

\int p (θ) K (θ, θ^{'}) ψ_{k} (θ) d θ = λ_{k} ψ_{k} (θ^{'})

Since the eigenfunctions form a complete set of square integrable functions, we expand both the target function $y (θ)$ and the learned function $f (θ)$ in this basis

y (θ) = \sum_{k} v_{k} ψ_{k} (θ), f (θ) = \sum_{k} {\hat{v}}_{k} ψ_{k} (θ)

Due to the orthonormality of the kernel eigenfunctions $\{ψ_{k}\}$ , the generalization error for any set of coefficients $\hat{v}$ is

E_{g} (w) = {⟨ {(y (θ) - f (θ))}^{2} ⟩}_{θ} = \sum_{k} {({\hat{v}}_{k} - v_{k})}^{2} = {|| \hat{v} - v ||}^{2}

We now introduce the training error, or empirical loss, which depends on the disorder in the dataset $D = \{(θ^{μ}, y^{μ})\}_{μ = 1}^{P}$

H (\hat{v}, D) = \sum_{μ} (\hat{v} \cdot ψ (θ^{μ}) - v \cdot ψ (θ^{μ}))^{2} + λ \sum_{k} \frac{{\hat{v}}_{k}^{2}}{λ_{k}}

It is straightforward to verify that the optimal ${\hat{v}}^{*}$ which minimizes $H (\hat{v}, D)$ is the kernel regression solution for a kernel with eigenvalues $\{λ_{k}\}$ when $λ \to 0$ . The optimal weights ${\hat{v}}^{*}$ can be identified through the first order condition $\nabla H (\hat{v}, D) = 0$ which gives

{\hat{v}}^{*} = (Ψ Ψ^{⊤} + λ Λ^{- 1})^{- 1} Ψ Ψ^{⊤} v = v - λ (Ψ Ψ^{⊤} + λ Λ^{- 1})^{- 1} Λ^{- 1} v

where $Ψ_{k, μ} = ψ_{k} (θ^{μ})$ are the eigenfunctions evaluated on the training data and $Λ_{k, ℓ} = δ_{k, ℓ} λ_{k}$ is a diagonal matrix containing the kernel eigenvalues. The generalization error for this optimal solution is

E_{g} (D) = | | {\hat{v}}^{*} - v | |^{2} = v^{⊤} Λ^{- 1} G (D)^{2} Λ^{- 1} v, G (D) = {(\frac{1}{λ} Ψ Ψ^{⊤} + Λ^{- 1})}^{- 1}

We note that the dependence on the randomly sampled dataset $D$ only appears through the matrix $G (D)$ . Thus to compute the typical generalization error we need to average over this matrix ${⟨ G (D) ⟩}_{D}$ . There are multiple strategies to perform such an average and we will study one here based on a partial differential equation which was introduced in Sollich, 1998; Sollich, 2002 and studied further in Bordelon et al., 2020; Canatar et al., 2021. In this setting, we denote the average matrix $G (P) = {⟨ G (D) ⟩}_{| D | = P}$ for a dataset of size $P$ . We will first derive a recursion relation using the Sherman-Morrison formula for a rank-1 update to an inverse matrix. We imagine adding a new sampled feature vector $ψ$ to a dataset $Ψ$ of size $P$ . The average matrix $G (P + 1)$ at $P + 1$ samples can be related to $G (P)$ through the Sherman-Morrison rule

\begin{aligned} G (P + 1) & = {⟨ {(\frac{1}{λ} Ψ Ψ^{⊤} + \frac{1}{λ} ψ ψ^{⊤} + Λ^{- 1})}^{- 1} ⟩}_{ψ, D} = G (P) - {⟨ \frac{G (D) ψ ψ^{⊤} G (D)}{λ + ψ^{⊤} G (D) ψ} ⟩}_{ψ, D} \\ \approx G (P) - \frac{{⟨ G (D) {⟨ ψ ψ^{⊤} ⟩}_{ψ} G (D) ⟩}_{D}}{λ + {⟨ ψ^{⊤} G (D) ψ ⟩}_{ψ, D}} \end{aligned}

where in the last step we approximated the average of the ratio with the ratio of averages. This operation is, of course, not rigorously justified, but it has been shown to produce accurate learning curves (Sollich, 2002; Bordelon et al., 2020). Since the chosen basis of kernel eigenfunctions is orthonormal, the average over the new sample is trivial: ${⟨ ψ ψ^{⊤} ⟩}_{ψ} = I$ . We thus arrive at the following recursion relation for $G$

G (P + 1) = G (P) - \frac{{⟨ G (D)^{2} ⟩}_{D}}{λ + Tr G (P)}

By introducing an additional source $J$ so that $G (D, J)^{- 1} = \frac{1}{λ} Ψ Ψ^{⊤} + Λ^{- 1} + J I$ , we can relate $G (D, J)$ ’s first and second moments through differentiation

\frac{\partial}{\partial J} G (P, J) = \frac{\partial}{\partial J} {⟨ {(\frac{1}{λ} Ψ Ψ^{⊤} + J I + Λ^{- 1})}^{- 1} ⟩}_{D} = - {⟨ G (D, J)^{2} ⟩}_{D} .

Thus the recursion relation simplifies to

G (P + 1, J) - G (P, J) \approx \frac{\partial}{\partial P} G (P, J) = \frac{1}{λ + Tr G (P, J)} \frac{\partial}{\partial J} G (P, J),

where we approximated the finite difference in $P$ as a derivative, treating $P$ as a continuous variable. Taking the trace of both sides and defining $κ (P, J) = λ + Tr G (P, J)$ we arrive at the following quasilinear PDE

\frac{\partial}{\partial P} κ (P, J) = \frac{1}{κ (P, J)} \frac{\partial}{\partial J} κ (P, J)

with the initial condition $κ (0, J) = λ + Tr (Λ^{- 1} + J I)^{- 1}$ . Using the method of characteristics, we arrive at the solution $κ (P, J) = λ + Tr {(Λ^{- 1} + (J + \frac{P}{κ (P, J)}) I)}^{- 1}$ . Using this solution to $κ$ , we can identify the solution to $G$

G {(P, J)}_{k, ℓ} = {(\frac{P}{κ} + J + λ_{k}^{- 1})}^{- 1} δ_{k, ℓ} = \frac{κ λ_{k}}{λ_{k} P + κ + J κ λ_{k}} δ_{k, ℓ} .

The generalization error can therefore be written as

\begin{aligned} E_{g} = v^{⊤} Λ^{- 1} {⟨ G (D)^{2} ⟩}_{D} Λ^{- 1} v = - \frac{\partial}{\partial J} v^{⊤} Λ^{- 1} G (P, J) Λ^{- 1} v |_{J = 0} \end{aligned}

\begin{matrix} = - \sum_{k} \frac{v_{k}^{2}}{λ_{k}^{2}} \frac{\partial}{\partial J} {(\frac{P}{κ} + J + λ_{k}^{- 1})}^{- 1} = \frac{κ^{2}}{1 - γ} \sum_{k} \frac{v_{k}^{2}}{{(λ_{k} P + κ)}^{2}}, \end{matrix}

where $γ = P \sum_{k} \frac{λ_{k}^{2}}{{(λ_{k} P + κ)}^{2}}$ , giving the desired result. Note that $κ$ depends on $J$ implicitly, which is the source of the $\frac{1}{1 - γ}$ factor. This result was recently reproduced using techniques from statistical mechanics (Bordelon et al., 2020; Canatar et al., 2021).
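The learning curve formula can be evaluated numerically once the self-consistent equation for $κ$ at $J = 0$ is solved. The sketch below is one possible implementation, assuming a strictly positive ridge `lam` and using bisection for the root; the function names and the power law spectrum in the example are hypothetical choices for illustration.

```python
import numpy as np

def solve_kappa(P, eigs, lam, n_bisect=200):
    """Root of kappa = lam + sum_k kappa*eigs_k / (eigs_k*P + kappa), assuming lam > 0."""
    def residual(kappa):
        return kappa - lam - np.sum(kappa * eigs / (eigs * P + kappa))
    lo, hi = lam, lam + eigs.sum()   # residual(lo) <= 0 <= residual(hi)
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def generalization_error(P, eigs, v2, lam):
    """E_g = kappa^2 / (1 - gamma) * sum_k v_k^2 / (eigs_k * P + kappa)^2."""
    kappa = solve_kappa(P, eigs, lam)
    gamma = P * np.sum(eigs**2 / (eigs * P + kappa) ** 2)
    return kappa**2 / (1.0 - gamma) * np.sum(v2 / (eigs * P + kappa) ** 2)

# Example with an assumed power law spectrum and task (illustrative only)
k = np.arange(1, 5001)
eigs = k ** -2.0                     # kernel eigenvalues lambda_k
v2 = k ** -2.0
v2 /= v2.sum()                       # target power v_k^2, normalized to sum to 1
lam = 1e-3
Ps = [0, 10, 100]
curve = [generalization_error(P, eigs, v2, lam) for P in Ps]
```

At $P = 0$ the expression reduces to the total target power $\sum_{k} v_{k}^{2}$ , which provides a simple sanity check on any implementation.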

Spectral bias and code-task alignment

Through implicit differentiation it is straightforward to verify that the ordering of the mode errors $E_{k} = \frac{κ^{2}}{1 - γ} {(λ_{k} P + κ)}^{- 2}$ matches the ordering of the eigenvalues (Canatar et al., 2021). Let $λ_{k} > λ_{ℓ}$ , then we have

\begin{matrix} \frac{d}{d P} \log (\frac{E_{k}}{E_{ℓ}}) = & 2 [\frac{λ_{ℓ}}{λ_{ℓ} P + κ} - \frac{λ_{k}}{λ_{k} P + κ}] + 2 κ^{'} (P) [\frac{1}{λ_{ℓ} P + κ} - \frac{1}{λ_{k} P + κ}] . \end{matrix}

Since $λ_{ℓ} < λ_{k}$ , the first bracket must be negative and the second bracket must be positive. Further, differentiating the self-consistent equation $κ = λ + \sum_{k} \frac{κ λ_{k}}{λ_{k} P + κ}$ with respect to $P$ gives $κ^{'} (P) (1 - γ) = - \frac{κ γ}{P}$ , so that $κ^{'} (P) = - \frac{κ γ}{P (1 - γ)} < 0$ since $0 < γ < 1$ for $P > 0$ . Therefore $λ_{k} > λ_{ℓ}$ implies $\frac{d}{d P} \log (\frac{E_{k}}{E_{ℓ}}) < 0$ for all $P$ . Since $\log (\frac{E_{k}}{E_{ℓ}}) = 0$ at $P = 0$ we therefore have that $\log (E_{k} / E_{ℓ}) < 0$ for all $P > 0$ and consequently $E_{k} < E_{ℓ}$ . Modes with larger eigenvalues $λ_{k}$ have lower normalized mode errors $E_{k}$ . This observation can be used to prove that target functions acting on the same data distribution with higher cumulative power distributions $C (k)$ for all $k$ will have lower generalization error normalized by total target power, $E_{g} (P) / E_{g} (0)$ , for all $P$ . A proof can be found in Canatar et al., 2021.

Asymptotic power law scaling of learning curves

Exponential spectral decays

First, we will study the setting relevant to the von-Mises kernel where $λ_{k} \sim β^{k}$ and $v_{k}^{2} \sim α^{k}$ where $α, β < 1$ . This exponential behavior accounts for differences in bandwidth between kernels which modulates the base $β$ of the exponential scaling of $λ_{k}$ with $k$ . We will approximate the sum over all mode errors with an integral

E_{g} = \frac{κ^{2}}{1 - γ} \sum_{k = 0}^{\infty} \frac{v_{k}^{2}}{{(λ_{k} P + κ)}^{2}} \sim κ^{2} \int_{0}^{\infty} \frac{α^{k}}{{(β^{k} P + κ)}^{2}} d k .

If we include a regularization parameter $λ$ , then $κ \sim λ$ as $P \to \infty$ . Making the change of variables $u = P β^{k} / λ$ , we transform the above integral into

E_{g} \sim \frac{1}{\ln (1 / β)} {(\frac{λ}{P})}^{\ln (1 / α) / \ln (1 / β)} \int_{0}^{P / λ} \frac{u^{\frac{\ln (1 / α)}{\ln (1 / β)} - 1}}{(1 + u)^{2}} d u

If $\frac{\ln (1 / α)}{\ln (1 / β)} < 2$ , the remaining integral over $u$ converges as $P \to \infty$ and behaves as a $P$ -independent constant. Otherwise it is dominated near $u \approx P / λ$ , in which case the integral scales as $\sim P^{\frac{\ln (1 / α)}{\ln (1 / β)} - 2}$ . Multiplying these resulting functions with the prefactor, we find the following scaling laws for generalization.

E_{g} \sim \begin{cases} P^{- \frac{\ln (1 / α)}{\ln (1 / β)}} & \frac{\ln (1 / α)}{\ln (1 / β)} < 2 \\ P^{- 2} & \frac{\ln (1 / α)}{\ln (1 / β)} > 2 \end{cases}

Thus, we obtain a power law scaling of the learning curve $E_{g}$ which is dominated at large $P$ by $E_{g} \sim P^{- \min (2, \frac{\ln (1 / α)}{\ln (1 / β)})}$ . For the von-Mises kernel we can approximate the spectra with $λ_{k} \sim σ^{- 2 k}$ and $v_{k}^{2} \sim σ_{T}^{- 2 k}$ , giving rise to the generalization scaling $E_{g} \sim P^{- \min (2, \frac{\ln σ_{T}}{\ln σ})}$ .
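The two scaling regimes can be checked directly on the sum over mode errors. The sketch below makes the same simplification as the argument above, namely that $κ$ is replaced by a fixed regularization $λ$ at large $P$ ; the values of `alpha`, `beta`, `lam` and the two sample sizes are hypothetical.

```python
import numpy as np

def eg_fixed_kappa(P, alpha, beta, lam, k_max=400):
    """lam^2 * sum_k alpha^k / (beta^k * P + lam)^2, i.e. kappa replaced by lam."""
    k = np.arange(k_max)
    return lam**2 * np.sum(alpha**k / (beta**k * P + lam) ** 2)

def loglog_slope(alpha, beta, lam=1.0, P1=1e4):
    """Slope of log E_g versus log P between P1 and P1 / beta^10."""
    P2 = P1 / beta**10               # spacing by a power of 1/beta avoids log-periodic wiggles
    e1 = eg_fixed_kappa(P1, alpha, beta, lam)
    e2 = eg_fixed_kappa(P2, alpha, beta, lam)
    return np.log(e2 / e1) / np.log(P2 / P1)

# Regime 1: ln(1/alpha)/ln(1/beta) < 2, predicted exponent is that ratio
ratio_slow = np.log(1 / 0.6) / np.log(1 / 0.5)
slope_slow = loglog_slope(alpha=0.6, beta=0.5)

# Regime 2: ln(1/alpha)/ln(1/beta) > 2, predicted exponent saturates at 2
slope_fast = loglog_slope(alpha=0.1, beta=0.5)
```

With these settings `slope_slow` should be close to `-ratio_slow` and `slope_fast` should be close to `-2`.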

Power law spectral decays

The same arguments can be applied for power law kernels $λ_{k} \sim k^{- b}$ and power law targets $v_{k}^{2} \sim k^{- a}$ , which is of interest due to its connection to nonlinear rectified neural populations. In this setting, the generalization error is

\begin{matrix} E_{g} \approx κ^{2} \int_{1}^{\infty} \frac{k^{- a}}{{(k^{- b} P + κ)}^{2}} d k \approx \frac{κ^{2}}{P^{2}} \int_{1}^{P^{1 / b}} k^{- a + 2 b} d k + \int_{P^{1 / b}}^{\infty} k^{- a} d k \\ \sim \frac{1}{P^{2} (1 - a + 2 b)} [P^{(1 - a) / b + 2} - 1] + \frac{1}{a - 1} P^{(1 - a) / b} . \end{matrix}

We see that there are two possible power law scalings for $E_{g}$ with the exponents $(a - 1) / b$ and 2. At large $P$ this formula will be dominated by the term with minimum exponent so $E_{g} \sim P^{- \min (a - 1, 2 b) / b}$ .

Laplace kernel generalization

We calculate similar learning curves as we did for the von-Mises kernel but with Laplace kernels to show that our result is not an artifact of the infinite differentiability of the von-Mises kernel. Each of these Laplace kernels has the same asymptotic power law spectrum $λ_{k} \sim k^{- 2}$ , exhibiting a discontinuous first derivative (Figure 6A). Despite having the same spectral scaling at large $k$ , these kernels can give dramatically different performance in learning tasks, again indicating the influence of the top eigenvalues on generalization at small $P$ (Figure 6). Again, the trend for which kernels perform best at low $P$ can be reversed at large $P$ . In this case, all generalization errors scale with $E_{g} \sim P^{- 2}$ (Figure 6B). More generally, our theory shows that if the task power spectrum and kernel eigenspectrum are both falling as power laws with exponents $a$ and $b$ respectively, then the generalization error asymptotically falls with a power law, $E_{g} \sim P^{- \min (a - 1, 2 b) / b}$ (Methods) (Bordelon et al., 2020). This decay is fastest when $b \leq \frac{a - 1}{2}$ , for which $E_{g} \sim P^{- 2}$ . Therefore, the tail of the kernel’s eigenvalue spectrum determines the large sample size behavior of the generalization error for power law kernels. The small sample size limit is still governed by the bulk of the spectrum.

Learning with multiple output channels

Our theory is not limited to scalar target functions but rather can be easily extended to multiple output functions $y_{1}, \dots, y_{C}$ from the same data, if for example the task requires computing class membership for $C$ categories. In this setting, each data point has the form $(θ^{μ}, y^{μ})$ with $y^{μ} \in ℝ^{C}$ . For these $C$ classes, the generalization error takes the form

E_{g} = ⟨ | | f (θ) - y (θ) | |^{2} ⟩ = \sum_{c = 1}^{C} ⟨ {(f_{c} (θ) - y_{c} (θ))}^{2} ⟩ = \sum_{k} [\sum_{c} {⟨ y_{c} (θ) ψ_{k} (θ) ⟩}^{2}] E_{k} .

We therefore find that the generalization error in the multi-class setting is the same as the $E_{g}$ obtained for a single scalar target function with power spectrum $v_{k}^{2} = \sum_{c} {⟨ y_{c} (θ) ψ_{k} (θ) ⟩}^{2}$ (Bordelon et al., 2020; Canatar et al., 2021). The relevant cumulative power distribution measures the fraction of total output variance captured by the first $k$ eigenfunctions of the population code

C (k) = \frac{\sum_{c} \sum_{ℓ = 1}^{k} {⟨ y_{c} (θ) ψ_{ℓ} (θ) ⟩}^{2}}{\sum_{c} \sum_{ℓ = 1}^{\infty} {⟨ y_{c} (θ) ψ_{ℓ} (θ) ⟩}^{2}} .
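For a finite set of stimuli, the multi-channel cumulative power can be estimated from the eigendecomposition of the kernel Gram matrix. The sketch below assumes a uniform stimulus distribution over the sampled points; the function name `cumulative_power`, the von-Mises-type kernel and the two cosine targets in the example are hypothetical.

```python
import numpy as np

def cumulative_power(K, Y):
    """K: (M, M) kernel Gram matrix on M stimuli, Y: (M, C) targets. Returns (eigvals, C(k))."""
    M = K.shape[0]
    eigvals, eigvecs = np.linalg.eigh(K / M)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]               # sort modes by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    psi = np.sqrt(M) * eigvecs                      # empirical eigenfunctions, <psi_k psi_l> = delta_kl
    coeffs = psi.T @ Y / M                          # overlaps <y_c psi_k>, shape (M, C)
    power = np.sum(coeffs**2, axis=1)               # v_k^2 summed over output channels
    return eigvals, np.cumsum(power) / power.sum()

# Example: two output channels with different orientation frequencies
M = 64
theta = 2 * np.pi * np.arange(M) / M
K = np.exp((np.cos(theta[:, None] - theta[None, :]) - 1.0) / 0.5)
Y = np.stack([np.cos(theta), np.cos(6 * theta)], axis=1)
eigvals, C = cumulative_power(K, Y)
```

In this example the low frequency channel is captured by the top three modes, so `C` should reach one half at the third mode and one at the thirteenth mode.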

Convergence of Delta-rule for nonlinear readouts

In this section, we consider gradient descent dynamics of an arbitrary convex loss function. For instance, we can consider a binary classification problem where $y \in \{\pm 1\}$ by outputting a prediction of $\hat{y} = \operatorname{sign} (w \cdot r)$ . We could, for example, train a model using the hinge loss $ℓ (w \cdot r, y) = \max (0, 1 - w \cdot r y)$ so that the classifier will converge to a kernel support vector machine (SVM) (Schölkopf et al., 2002). The generalization error of the classifier would be the error rate of $\hat{y} (θ) = \operatorname{sign} (w \cdot r (θ))$ compared to the ground truth $y (θ)$ .

Let $D = \{(θ^{μ}, y^{μ})\}_{μ = 1}^{P}$ be the dataset with samples $θ^{μ}$ and target values $y^{μ}$ . We introduce a shorthand $r^{μ} = r (θ^{μ})$ for convenience. The loss we aim to minimize is the sum of the losses of each data point in the training set with an additional weight decay parameter

H (w, D) = \sum_{μ = 1}^{P} ℓ (w \cdot r^{μ}, y^{μ}) + λ | w |^{2} .

For convex $ℓ$ and nonzero $λ$ , the above objective is strongly convex, indicating the existence of a unique minimizer which can be found from simple first order learning rules. For $λ > 0$ the initial condition for $w$ does not influence the final result.

We will now show that the dynamics will converge to a function which only depends on the code $r (θ)$ through the kernel $K (θ, θ^{'})$ . To simplify the argument, we consider starting from an initial condition of $w_{t = 0} = 0$ and performing gradient descent updates. Under such an assumption, the weights $w_{t}$ will always be in the span of the population vectors on the training set $\{r^{μ}\}_{μ = 1}^{P}$ since

\begin{aligned} w_{t + 1} = w_{t} - η \frac{\partial H}{\partial w} |_{w_{t}} = (1 - η λ) w_{t} - η \sum_{μ = 1}^{P} r^{μ} ℓ^{'} (w_{t} \cdot r^{μ}, y^{μ}) . \end{aligned}

The derivative in the final term is taken with respect to the first argument, $ℓ^{'} (f, y) = \frac{\partial ℓ (f, y)}{\partial f}$ . This update is still local and recovers the delta rule that we discussed in the main text for $ℓ (w \cdot r, y) = \frac{1}{2} {(w \cdot r - y)}^{2}$ (Widrow and Hoff, 1960; Hertz et al., 1991). We can express $w_{t}$ in terms of the population vectors $w_{t} = \sum_{μ = 1}^{P} α_{t}^{μ} r^{μ} = R α_{t}$ so that $α_{t} \in ℝ^{P}$ defines the linear weighting of each sample. The dynamics of these coefficients are

\begin{matrix} R α_{t + 1} = (1 - η λ) R α_{t} - η R ℓ^{'} (K α_{t}, y), \end{matrix}

where $K = R^{⊤} R \in ℝ^{P \times P}$ is the kernel Gram matrix evaluated on the training points. We multiply both sides of this equation by $R^{⊤}$ , and define $β_{t} = K α_{t}$ , which satisfy the following simplified dynamics

\begin{matrix} β_{t + 1} = (1 - η λ) β_{t} - η K ℓ^{'} (β_{t}, y), \quad w_{t} = R K^{+} β_{t} , \end{matrix}

where $K^{+}$ is the pseudo-inverse of $K$ . The nonlinear fixed point condition is $β = - \frac{1}{λ} K ℓ^{'} (β, y)$ , which transparently only depends on the kernel $K$ rather than the full code $R$ . The above equation recovers the correct linear equation $β = K {(K + λ I)}^{- 1} y$ for square loss. For an arbitrary test point $θ$ , the model makes the prediction $f (θ) = r (θ) \cdot w = r (θ) \cdot [R K^{+} β] = k (θ) \cdot α$ , which also only depends on the kernel $[k (θ)]_{μ} = K (θ, θ^{μ})$ between the test point $θ$ and the training points $θ^{μ}$ , as well as the kernel Gram matrix $[K]_{μ ν} = K (θ^{μ}, θ^{ν})$ .
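The reduction from weight dynamics to kernel dynamics can be illustrated with a loss other than the square loss. The sketch below uses the logistic loss $ℓ (f, y) = \log (1 + e^{- y f})$ as an assumed example and takes $ℓ^{'}$ to be the derivative of the loss with respect to its first argument; the code `R`, the labels, the step size and the iteration count are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, lam, eta = 40, 15, 1.0, 0.1              # hypothetical sizes, weight decay, step size
R = rng.standard_normal((N, P)) / np.sqrt(N)   # columns are population responses r^mu
y = rng.choice([-1.0, 1.0], size=P)            # binary labels
K = R.T @ R                                    # kernel Gram matrix on the training set

def dloss(f, y):
    """Derivative of the logistic loss log(1 + exp(-y f)) with respect to f."""
    return -y / (1.0 + np.exp(y * f))

w = np.zeros(N)                                # weight-space dynamics
beta = np.zeros(P)                             # kernel-space dynamics, beta = K alpha
for _ in range(5000):
    w = (1 - eta * lam) * w - eta * R @ dloss(R.T @ w, y)
    beta = (1 - eta * lam) * beta - eta * K @ dloss(beta, y)

# Residual of the fixed point condition beta = -(1/lam) K l'(beta, y)
fixed_point_residual = beta + K @ dloss(beta, y) / lam
```

The training-set outputs `R.T @ w` of the weight-space run should match `beta`, and `fixed_point_residual` should vanish at convergence.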

To illustrate a specific case with a square error and nonlinear readout, consider output neurons which produce activity $g (w \cdot r (θ))$ for invertible nonlinear function $g$ with non-vanishing gradient, and gradient based learning on $L = \sum_{μ} {(g (w \cdot r^{μ}) - y (θ^{μ}))}^{2}$ . This gives $Δ w \propto \sum_{μ} r^{μ} g^{'} (w \cdot r^{μ}) (y^{μ} - g (w \cdot r^{μ})) \in span \{r^{μ}\}_{μ = 1}^{P}$ , which is still a local learning rule. Thus the weights at convergence can be written as $w = \sum_{μ} α^{μ} r^{μ}$ and the learned function can be written as $f (θ) = g (\sum_{μ} α^{μ} K (θ, θ^{μ}))$ , where $α$ is given by $α = K^{+} g^{- 1} (y)$ . To see this, first note that $w_{t} \in span \{r^{μ}\}_{μ = 1}^{P}$ for all $t$ so that $w^{*} = R α^{*}$ . The interpolation condition can be expressed as $g (R^{⊤} w^{*}) = g (K α^{*}) = y$ , giving the desired result $α^{*} = K^{+} g^{- 1} (y)$ . The predictions of the model on a test stimulus $θ$ are given by $f (θ) = g (\sum_{μ = 1}^{P} α^{* μ} K (θ, θ^{μ}))$ . We see that this solution only depends on the kernel (directly and indirectly through $α^{*}$ ), rather than the full code.

Typical case analysis of nonlinear readouts

The analysis of typical case generalization can be extended to nonlinear predictors and loss functions described by (68) which depend on the scalar prediction variable $w \cdot r (θ)$ (Loureiro et al., 2021a). Further, if $r (θ)$ is well approximated as a Gaussian process, then the generalization performance can still be characterized using statistical mechanics methods (Loureiro et al., 2021a). Many qualitative features of our results continue to hold, including that the kernel’s diagonalization governs training and generalization and that improvements in code-task alignment lead to improvements in generalization.

In a later work by Cui et al., 2022, SVM and ridge classifiers trained on codes and tasks with power law spectra were analyzed asymptotically, showing power law generalization error decay rates $E_{g} \sim P^{- β}$ . These classification learning curves for power law spectra were shown to follow power laws with exponents $β$ which are qualitatively similar to the exponents obtained with the square loss which we describe in our section titled Small and Large Sample Size Behaviors of Generalization. Just as in our theory, decay rate exponents $β$ are larger for codes which are well aligned to the task and are smaller for codes which are non-aligned.

Visual scene reconstruction task

Reconstruction of natural scenes from neural responses

Using the mouse V1 responses to natural scenes, we attempt to reconstruct original images from the neural codes using different numbers of images. The presented natural scenes are taken from ten classes of ImageNet and can be downloaded from https://github.com/MouseLand/stringer-pachitariu-et-al-2018b. Let $θ^{μ} \in ℝ^{D}$ be a $D$ -dimensional flattened vector containing the pixel values of the μ-th image and let $r^{μ} \in ℝ^{N}$ represent the neural response to the μ-th image. The goal in the problem is to learn a collection of weights $W \in ℝ^{D \times N}$ which map neural responses $r^{μ}$ to images $θ^{μ}$

θ^{μ} \approx W r^{μ} .

The generalization error $E_{g}$ again measures the average error on all points, averaged over all possible datasets $D = \{(θ^{μ}, r^{μ})\}_{μ = 1}^{P}$ of size $P$ . If the optimal weights for dataset $D$ are $W (D)$ then the generalization error is

E_{g} = {⟨ | | W (D) r (θ) - θ | |^{2} ⟩}_{θ, D} .

After identifying eigenfunctions $ψ_{k} (θ)$ , we expand the images in this basis $θ = \sum_{k} v_{k} ψ_{k} (θ)$ where $v_{k} \in ℝ^{D}$ . The generalization error is therefore $E_{g} = \sum_{k} {| v_{k} |}^{2} E_{k} (P)$ and the cumulative power is $C (k) = \frac{\sum_{ℓ \leq k} | v_{ℓ} |^{2}}{\sum_{ℓ = 1}^{\infty} | v_{ℓ} |^{2}}$ . We perform this reconstruction task on many filtered versions of the natural scenes. To construct a filter, we first compute the Fourier transform of the image. Let $M (θ) \in ℝ^{\sqrt{D} \times \sqrt{D}}$ represent the non-flattened image and let $\hat{M} (θ) \in ℂ^{\sqrt{D} \times \sqrt{D}}$ represent the Fourier transform of the image, computed explicitly as

{\hat{M}}_{k, k^{'}} (θ) = D^{- 1 / 4} \sum_{n, m} M_{n, m} (θ) \exp (2 π i (n k + m k^{'}) / \sqrt{D})

To develop the band-pass filter, we calculate $| k | = \sqrt{k^{2} + {(k^{'})}^{2}}$ for each of the indices in the matrix. For a band-pass filter with parameters $s_{\max}, r$ we simply zero out the entries in $\hat{M}$ which correspond to frequencies outside the appropriate band: for any $k, k^{'}$ with $| k | \notin [\sqrt{s_{\max}^{2} - r^{2}}, s_{\max}^{2}]$ we set ${\hat{M}}_{k, k^{'}} \to 0$ . We then perform the inverse Fourier transform on $\hat{M}$ to obtain a filtered version of the original image.
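The filtering procedure can be sketched with a discrete Fourier transform. The function below is an illustrative implementation, not the exact code used for our figures: the name `band_pass` and the explicit band edges `k_lo` and `k_hi` are hypothetical, and NumPy's FFT uses the opposite sign convention in the exponent, which does not affect a mask that depends only on $| k |$ .

```python
import numpy as np

def band_pass(image, k_lo, k_hi):
    """Zero all 2D Fourier modes whose radial frequency |k| lies outside [k_lo, k_hi]."""
    n_rows, n_cols = image.shape
    k_row = np.fft.fftfreq(n_rows) * n_rows      # signed integer frequencies along rows
    k_col = np.fft.fftfreq(n_cols) * n_cols      # signed integer frequencies along columns
    k_abs = np.sqrt(k_row[:, None] ** 2 + k_col[None, :] ** 2)
    M_hat = np.fft.fft2(image)
    M_hat[(k_abs < k_lo) | (k_abs > k_hi)] = 0.0 # remove modes outside the band
    return np.fft.ifft2(M_hat).real              # the mask is symmetric, so the result is real

# Example: a low frequency grating plus a high frequency grating on a 32 x 32 grid
n = 32
rows, cols = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
low = np.cos(2 * np.pi * 3 * cols / n)           # |k| = 3
high = np.cos(2 * np.pi * 10 * rows / n)         # |k| = 10
filtered = band_pass(low + high, k_lo=2.0, k_hi=4.0)
```

In this example only the $| k | = 3$ grating lies inside the band, so `filtered` should equal `low` up to numerical error.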

A simple feedforward model of V1

Linear neurons

We consider a simplified but instructive model of the V1 population code as a linear-nonlinear map from photoreceptor responses through Gabor filters and then a nonlinearity (Adelson and Bergen, 1985; Olshausen and Field, 1997; Rumyantsev et al., 2020). Let $x \in ℝ^{2}$ represent the two-dimensional retinotopic position of photoreceptors. The firing rate of the photoreceptor at position $x$ in response to a static grating stimulus oriented at angle $θ$ is

h (x, θ) = \cos (k (θ) \cdot x), k = [\begin{matrix} \cos (θ) \\ \sin (θ) \end{matrix}] \in ℝ^{2}, θ \in [0, 2 π] .

We model each V1 neuron’s receptive field as a Gabor filter of the receptor responses $h (x, θ)$ . The $i$ -th V1 neuron has preferred wavevector $k_{i}$ , generating the following set of weights between photoreceptors and the $i$ -th V1 neuron

ℱ (x, θ_{i}) = \frac{σ^{2}}{2 π} e^{- \frac{σ^{2}}{2} {| x |}^{2}} \cos (k (θ_{i}) \cdot x) .

The V1 population code is obtained by filtering the photoreceptor responses. By approximating the resulting sum over all retinal photoreceptors with an integral, we find the response of neuron $i$ to a grating stimulus with wavevector $k$ is

h (θ) \cdot ℱ (θ_{i}) = \int ℱ (x, θ_{i}) h (x, θ) d x = \frac{1}{2} e^{- \frac{1}{2 σ^{2}} {| k + k_{i} |}^{2}} + \frac{1}{2} e^{- \frac{1}{2 σ^{2}} {| k - k_{i} |}^{2}} .

The response of neuron $i$ is computed through nonlinear rectification of this input current $r_{i} (θ) = g (ℱ (θ_{i}) \cdot h (θ))$ . For a linear neuron $g (z) = z$ , the kernel has the following form

K (θ, θ^{'}) = \frac{\cosh (\cos (θ - θ^{'}) / σ^{2})}{\cosh (σ^{- 2})},

where the kernel is normalized to have maximum value of 1. Note that this normalization of the kernel is completely legitimate since it merely rescales each eigenvalue by a constant and does not change the learning curves.

Since the kernel only depends on the difference between angles $θ - θ^{'}$ , it is said to possess translation invariance. Such translation invariant kernels admit a Mercer decomposition in terms of Fourier modes $K (θ) = \sum_{n} λ_{n} \cos (n θ)$ since the Fourier modes diagonalize shift invariant integral operators on $S^{1}$ . For the linear neuron, the kernel eigenvalues scale like $λ_{n} \sim \frac{β^{n}}{2^{n} n!}$ , indicating infinite differentiability of the tuning curves. Since $λ_{n}$ decays rapidly with $n$ , we find that this Gabor code has an inductive bias that favors low frequency functions of orientation $θ$ .
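The rapid spectral decay of the linear-neuron kernel can be inspected numerically by computing its cosine-series coefficients on a grid of angle differences. This is a minimal sketch with an assumed bandwidth `sigma2` and grid size `M`. Because $\cosh (\cos (θ) / σ^{2})$ is unchanged under $θ \to θ + π$ , the odd harmonics vanish in this example.

```python
import numpy as np

sigma2 = 1.0                          # hypothetical bandwidth parameter sigma^2
M = 256                               # number of grid points on the circle
delta = 2 * np.pi * np.arange(M) / M  # angle differences theta - theta'

# Normalized linear-neuron kernel as a function of the angle difference
kernel = np.cosh(np.cos(delta) / sigma2) / np.cosh(1.0 / sigma2)

# Cosine-series coefficients: K(delta) = c_0 + sum_{n >= 1} c_n cos(n delta)
F = np.fft.rfft(kernel).real / M
coeffs = 2.0 * F
coeffs[0] = F[0]

even = coeffs[0:10:2]                 # harmonics n = 0, 2, 4, 6, 8
odd = coeffs[1:10:2]                  # harmonics n = 1, 3, 5, 7, 9
```

The entries of `even` should be positive and fall off faster than geometrically, while the entries of `odd` should be zero to numerical precision.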

Nonlinear simple cells

Introducing nonlinear functions $g (z)$ that map the input currents $z$ of the V1 population to firing rates, we can obtain a nonlinear kernel $K_{g} (θ)$ which has the following definition

\begin{matrix} K_{g} (k, k^{'}) = \int p (k_{i}) g (ℱ (k_{i}) \cdot h (k)) g (ℱ (k_{i}) \cdot h (k^{'})) d k_{i} . \end{matrix}

In this setting, it is convenient to restrict $k_{i}, k, k^{'} \in S^{1}$ and assume that the preferred wavevectors $k_{i}$ are uniformly distributed over the circle. In this case, it suffices to identify a decomposition of the composed function $g (ℱ (θ_{i}) \cdot h (θ))$ in the basis of Chebyshev polynomials $T_{n} (z)$ which satisfy $T_{n} (\cos (θ)) = \cos (n θ)$

\begin{aligned} a_{n} & = \frac{1}{2 π} \int_{0}^{2 π} g (e^{- \frac{1}{σ^{2}}} \cosh (\frac{1}{σ^{2}} \cos (θ))) \cos (n θ) d θ \\ & = \frac{1}{π} \int_{- 1}^{1} \frac{1}{\sqrt{1 - z^{2}}} g (e^{- σ^{- 2}} \cosh (z σ^{- 2})) T_{n} (z) d z, \end{aligned}

which can be computed efficiently with an appropriate quadrature scheme. Once the coefficients $a_{n}$ are determined, we can compute the kernel by first letting $θ_{i}$ be the angle between $k$ and $k_{i}$ and letting $θ$ be the angle between $k$ and $k^{'}$

\begin{matrix} K_{g} (θ) & = \int_{0}^{2 π} \frac{d θ_{i}}{2 π} \sum_{n, n^{'}} a_{n} a_{n^{'}} T_{n} (\cos (θ_{i})) T_{n^{'}} (\cos (θ_{i} + θ)) = \frac{1}{2} \sum_{n} a_{n}^{2} \cos (n θ) . \end{matrix}

Thus the kernel eigenvalues are $λ_{n} = \frac{1}{2} a_{n}^{2}$ .
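The relation between the Fourier coefficients of the tuning curve and those of the kernel can be verified on a grid of preferred orientations. The sketch below assumes a rectified power-law nonlinearity with hypothetical parameters `a`, `q` and `sigma2`, and uses standard cosine-series coefficients `c` of the tuning curve; constant factors, in particular for the $n = 0$ mode, depend on this normalization convention.

```python
import numpy as np

def rectified_power(z, a=0.0, q=1.0):
    """g(z) = max(0, z - a)^q."""
    return np.maximum(0.0, z - a) ** q

sigma2, a, q = 0.5, 0.3, 1.0          # hypothetical bandwidth, threshold and power
M = 2049                              # odd grid size, so there is no unpaired Nyquist mode
theta = 2 * np.pi * np.arange(M) / M

# Tuning curve of a neuron with preferred orientation 0
current = np.exp(-1.0 / sigma2) * np.cosh(np.cos(theta) / sigma2)
tuning = rectified_power(current, a, q)

# Cosine-series coefficients: tuning(theta) = c_0 + sum_{n >= 1} c_n cos(n theta)
F = np.fft.rfft(tuning) / M
c = 2.0 * F.real
c[0] = F[0].real
n = np.arange(c.size)

# Kernel coefficients: K(theta) = c_0^2 + (1/2) sum_{n >= 1} c_n^2 cos(n theta)
kernel_coeffs = 0.5 * c**2
kernel_coeffs[0] = c[0] ** 2
K_series = np.cos(np.outer(theta, n)) @ kernel_coeffs

# Direct average over uniformly distributed preferred orientations
K_direct = np.array([np.mean(tuning * np.roll(tuning, -j)) for j in range(M)])
```

Here `K_series` and `K_direct` should agree to numerical precision, which confirms that the kernel spectrum is determined by the squared Fourier coefficients of the tuning curve.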

Asymptotic scaling of spectra

Activation functions that encourage sparsity have slower eigenvalue decays. If the nonlinear activation function has the form $g_{q, a} (z) = \max \{0, z - a\}^{q}$ , then the spectrum decays like $λ_{n} \sim n^{- 2 q - 2}$ . A simple argument justifies this scaling: if the function $g (e^{- σ^{- 2}} \cosh (σ^{- 2} z))$ is only $q - 1$ times differentiable then $a_{n} n^{q} \sim n^{- 1}$ since $\sum_{n} a_{n} n^{q}$ must diverge. Therefore $λ_{n} \propto a_{n}^{2} \sim n^{- 2 q - 2}$ . Note that this scaling is independent of the threshold. Examples of these scalings can be found in Figure 5—figure supplements 1 and 2.

Phase variation, complex cells and invariance

We can consider a slightly more complicated model where Gabors and stimuli have phase shifts

h (x, θ, ϕ) = \cos (k (θ) \cdot x - ϕ), ℱ (x, θ_{i}, ϕ_{i}) = \frac{σ^{2}}{2 π} e^{- \frac{σ^{2}}{2} {| x |}^{2}} \cos (k_{i} \cdot x - ϕ_{i}) .

The simple cell responses are generated by a nonlinearity

r_{i} (θ, ϕ) = g (ℱ (θ_{i}, ϕ_{i}) \cdot h (θ, ϕ)) .

The input currents into the simple V1 cells can be computed exactly

\begin{aligned} h (θ, ϕ) \cdot ℱ (θ_{i}, ϕ_{i}) & = {⟨ \cos (k_{i} \cdot x - ϕ_{i}) \cos (k \cdot x - ϕ) ⟩}_{x \sim N (0, σ^{- 2} I)} \\ & = \frac{1}{2} \cos (ϕ + ϕ_{i}) e^{- \frac{1}{2 σ^{2}} | k + k_{i} |^{2}} + \frac{1}{2} \cos (ϕ - ϕ_{i}) e^{- \frac{1}{2 σ^{2}} | k - k_{i} |^{2}} . \end{aligned}

When $| k | = | k_{i} | = 1$ , the simple cell tuning curves $r_{i} = g (ℱ (θ_{i}, ϕ_{i}) \cdot h (θ, ϕ))$ only depend on $\cos (θ - θ_{i})$ and the phases $ϕ, ϕ_{i}$ , allowing a Fourier decomposition

r_{i} (θ, ϕ) = \sum_{n} a_{n} (ϕ, ϕ_{i}) \cos (n (θ - θ_{i})) .

The simple cell kernel $K_{s}$ therefore decomposes into Fourier modes over $θ$

K_{s} (θ, θ^{'}, ϕ, ϕ^{'}) = \sum_{n} b_{n} (ϕ, ϕ^{'}) \cos (n (θ - θ^{'})),

where $b_{n} (ϕ, ϕ^{'}) = {⟨ a_{n} (ϕ, ϕ_{i}) a_{n} (ϕ^{'}, ϕ_{i}) ⟩}_{ϕ_{i}}$ . It therefore suffices to solve the infinite sequence of integral eigenvalue problems over $ϕ$

\begin{matrix} \frac{1}{2 π} \int_{0}^{2 π} b_{n} (ϕ, ϕ^{'}) v_{n, k} (ϕ) d ϕ & = λ_{n, k} v_{n, k} (ϕ^{'}) \\ ⟹ K_{s} (θ, θ^{'}, ϕ, ϕ^{'}) & = \sum_{n, k} λ_{n, k} \cos (n (θ - θ^{'})) v_{n, k} (ϕ) v_{n, k} (ϕ^{'}) . \end{matrix}

With this choice it is straightforward to verify that the kernel eigenfunctions are $v_{n, k} (θ, ϕ) = e^{i n θ} v_{n, k} (ϕ)$ with corresponding eigenvalue $λ_{n, k}$ . Since $b_{n}$ is not translation invariant in $ϕ - ϕ^{'}$ , the eigenfunctions $v_{n, k}$ are not necessarily Fourier modes. These eigenvalue problems for $b_{n}$ must be solved numerically when using an arbitrary nonlinearity $g$ . The top eigenfunctions of the simple cell kernel depend heavily on the phase $ϕ$ of the grating stimuli. Thus, a pure orientation discrimination task which is independent of phase requires a large number of samples to learn with the simple cell population.

Complex cell populations are phase invariant

V1 also contains complex cells which possess invariance to the phase $ϕ$ of the stimulus. Again using Gabor filters

ℱ (x, θ_{i}, ϕ_{i}) = \frac{σ^{2}}{2 π} e^{- \frac{σ^{2}}{2} {| x |}^{2}} \cos (k (θ_{i}) \cdot x - ϕ_{i}),

we model the complex cell responses with a quadratic nonlinearity and sum over two squared filters which are phase shifted by $π / 2$

\begin{aligned} r_{i} (θ, ϕ) & = (ℱ (θ_{i}, ϕ_{i}) \cdot h (θ, ϕ))^{2} + (ℱ (θ_{i}, ϕ_{i} - π / 2) \cdot h (θ, ϕ))^{2} \\ & = \frac{1}{4} e^{- \frac{1}{σ^{2}} | k + k_{i} |^{2}} + \frac{1}{4} e^{- \frac{1}{σ^{2}} | k - k_{i} |^{2}} + \frac{1}{2} e^{- σ^{- 2}} \cos (2 ϕ_{i}), \end{aligned}

which we see is independent of the phase $ϕ$ of the grating stimulus. Integrating over the set of possible Gabor filters $(k_{i}, ϕ_{i})$ with $| k | = 1$ again gives the following kernel for the complex cells

K_{c} (θ) = \frac{1}{\cosh (2 β)} \cosh (2 β \cos (θ)) .

Remarkably, this kernel is independent of the phase $ϕ$ of the grating stimulus. Thus, complex cell populations possess good inductive bias for vision tasks where the target function only depends on the orientation of the stimulus rather than its phase. In reality, V1 is a mixture of simple and complex cells. Let $s \in [0, 1]$ represent the relative proportion of neurons which are simple cells and $(1 - s)$ the relative proportion of complex cells. The kernel for the mixed V1 population is given by a simple convex combination of the simple and complex cell kernels

\begin{aligned} K_{V 1} (θ, θ^{'}, ϕ, ϕ^{'}) & = \frac{1}{N} \sum_{i = 1}^{N} r_{i} (θ, ϕ) r_{i} (θ^{'}, ϕ^{'}) \to {⟨ r (θ, ϕ, n) r (θ^{'}, ϕ^{'}, n) ⟩}_{n \sim p_{V 1} (n)} \\ & = s {⟨ r (θ, ϕ, n) r (θ^{'}, ϕ^{'}, n) ⟩}_{n \sim p_{s} (n)} + (1 - s) {⟨ r (θ, ϕ, n) r (θ^{'}, ϕ^{'}, n) ⟩}_{n \sim p_{c} (n)} \\ & = s K_{s} (θ, θ^{'}, ϕ, ϕ^{'}) + (1 - s) K_{c} (θ, θ^{'}), \end{aligned}

where $n$ denotes neuron type (simple vs complex, tuning etc) and $p_{V 1} (n), p_{s} (n), p_{c} (n)$ are probability distributions over the V1 neuron identities, the simple cell identities and the complex cell identities respectively. Increasing $s$ increases the phase dependence of the code by giving greater weight to the simple cell population. Decreasing $s$ gives weight to the complex cell population, encouraging phase invariance of readouts.

Visualization of feedforward Gabor V1 model and induced kernels

Examples of the induced kernels for the Gabor-bank V1 model are provided in Figure 5. We show how choice of rectifying nonlinearity $g (z)$ and sparsifying threshold $a$ influence the kernel and their spectra. Learning curves for simple orientation tasks are provided.

Gabor model spectral bias and fit to V1 data

Motivated by findings in the primary visual cortex (Hansel and van Vreeswijk, 2002; Miller and Troyer, 2002; Priebe et al., 2004; Priebe and Ferster, 2008), we studied the spectral bias induced by rectified power-law nonlinearities of the form $g (z) = \max \{0, z - a\}^{q}$ . From theory, such a power-law activation function arises in a spiking neuron when firing is driven by input fluctuations (Hansel and van Vreeswijk, 2002; Miller and Troyer, 2002). Further, this activation is observed in intracellular recordings over the dynamic range of neurons in primary visual cortex (Priebe and Ferster, 2008). For example, in cats, the power, $q$ , ranges from 2.7 to 3.9 (Priebe et al., 2004). We fit parameters of our model to the mouse V1 kernel and compared to other parameter sets in Figure 5—figure supplement 1. Our best fit value of $q = 1.7$ is lower but on par with the estimates from the cat and reproduces the observed kernel. Computation of the kernel and its eigenvalues (Appendix Nonlinear simple cells) indicates a low frequency bias: the eigenvalues for low frequency modes are higher than those for high frequency modes, indicating a strong inductive bias to learn functions of low frequency in the orientation. Decreasing sparsity (lower $a$ ) leads to a faster decay in the spectrum (but similar asymptotic scaling at the tail, see Figure 5—figure supplements 1 and 2) and a stronger bias towards lower frequency functions (Figure 5). The effect of the power of nonlinearity $q$ is more nuanced: increasing power may increase spectra at lower frequencies, but may also lead to a faster decay at the tail (Figure 5—figure supplements 1 and 2). In general, an exponent $q$ implies a power-law asymptotic spectral decay $λ_{k} \sim k^{- 2 q - 2}$ as $k \to \infty$ (Appendix Nonlinear simple cells). The behavior at low frequencies may have significant impact for learning with few samples.
Overall, our findings show that the spectral bias of a population code can be determined in non-trivial ways by its biophysical parameters, including neural thresholds and nonlinearities.
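The dependence of the kernel spectrum on the threshold $a$ and exponent $q$ can be explored numerically. The following is a minimal sketch and not the Gabor-bank model itself: it assumes a toy ring population in which neuron $i$ receives the current $z_{i} (θ) = \cos (θ - θ_{i})$ , with preferred angles on the same uniform grid as the stimuli. The function name `kernel_spectrum` and the grid size are illustrative choices.

```python
import numpy as np

def kernel_spectrum(q, a, n=256):
    """Kernel eigenvalues for rectified power-law tuning on a ring.

    Toy assumption: z_i(theta) = cos(theta - theta_i), with preferred
    angles theta_i placed on the same uniform grid as the stimuli.
    """
    theta = 2 * np.pi * np.arange(n) / n
    z = np.cos(theta[None, :] - theta[:, None])   # (neurons, stimuli)
    r = np.maximum(0.0, z - a) ** q               # g(z) = max(0, z - a)^q
    K = r.T @ r / n                               # K(theta, theta')
    # K is circulant, so under a uniform stimulus distribution its
    # eigenvalues are the Fourier coefficients of its first row.
    lam = np.fft.fft(K[0]).real / n
    return lam[: n // 2]                          # index k = frequency k

lam = kernel_spectrum(q=1.7, a=0.0)
for k in (1, 2, 4, 8, 16):
    print(f"k={k:2d}  eigenvalue={lam[k]:.3e}")
```

Calling `kernel_spectrum` with different values of `a` and `q` gives a quick way to compare how fast the spectrum decays in this toy setting.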

Energy model with partially phase-selective cells

The model of the V1 population as a mixture of purely simple and purely complex cells is an idealization which fails to capture the variability in phase selectivity of cells observed in experiment. In this section, we describe a simple model which can interpolate between an invariant code and a code which has high alignment with phase-dependent eigenfunctions. Further, a single scalar parameter $α$ will control how strongly the population is biased towards invariance. We define $r_{i} (θ, ϕ) = g (z_{i} (θ, ϕ))$ for nonlinear function $g$ and scalar $z$ which is constructed as follows

\begin{aligned} z_{i} (θ, ϕ) = \; & β_{1} {[ℱ (θ_{i}, ϕ_{i}) \cdot h (θ, ϕ)]}_{+}^{2} + β_{2} {[ℱ (θ_{i}, ϕ_{i} + π / 2) \cdot h (θ, ϕ)]}_{+}^{2} \\ & + β_{3} {[ℱ (θ_{i}, ϕ_{i} + π) \cdot h (θ, ϕ)]}_{+}^{2} + β_{4} {[ℱ (θ_{i}, ϕ_{i} + 3 π / 2) \cdot h (θ, ϕ)]}_{+}^{2} . \end{aligned}

This linear combination is inspired by the construction of simple cells in Dayan & Abbott, Chapter 2 (Dayan and Abbott, 2001). If all $β$ are equal, then this tuning curve is invariant to phase $ϕ$ . To generate variability in selectivity to phase $ϕ$ , we draw $β$ from a Dirichlet distribution on the simplex with concentration parameter $α 1$ , so that $p (β) \propto \prod_{j = 1}^{4} β_{j}^{α - 1}$ with $\sum_{j = 1}^{4} β_{j} = 1$ . In the $α \to \infty$ limit, the probability density concentrates on $\frac{1}{4} 1$ , leading to a code composed entirely of complex cells which are invariant to phase $ϕ$ . In the $α \to 0$ limit, the density is concentrated around the vertices of the simplex, such as $(1, 0, 0, 0)$ and $(0, 1, 0, 0)$ , where only one preferred phase is present per neuron. For intermediate values, neurons are partially selective to phase. As before, the selectivity or invariance to phase is manifested in the kernel decomposition and leads to similar learning curves for the three tasks of the main paper (Orientation, Phase, Hybrid). We provide an illustration of tuning curves, F1/F0 distributions, eigenfunctions, and learning curves in Figure 5—figure supplement 3.
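The two limits of the concentration parameter can be checked by sampling. This sketch only draws the mixing weights $β$ and does not build the Gabor filters. The helper `sample_phase_weights` and the number of sampled neurons are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_phase_weights(alpha, n_neurons=2000):
    """Draw beta ~ Dirichlet(alpha * 1) on the 4-simplex, one row per neuron."""
    return rng.dirichlet(alpha * np.ones(4), size=n_neurons)

for alpha in (0.01, 1.0, 1000.0):
    beta = sample_phase_weights(alpha)
    # The largest weight is near 1 for phase-selective (simple-like) cells
    # and near 1/4 for phase-invariant (complex-like) cells.
    print(f"alpha={alpha:8.2f}  mean largest weight = {beta.max(axis=1).mean():.3f}")
```

A mean largest weight close to 1 corresponds to neurons dominated by a single preferred phase, while a value close to 1/4 corresponds to equal weights and a phase-invariant response.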

Time dependent neural codes

RNN model and decomposition

In this setting, the population code $r (\{θ (t)\}, t)$ is a function of an input stimulus sequence $θ (t)$ and time $t$ . In general, the neural code $r$ at time $t$ can depend on the entire history of the stimulus input $θ (t^{'})$ for $t^{'} \leq t$ , as is the case for recurrent neural networks. We denote dependence of a function $f$ on $θ (t)$ in this causal manner with the notation $f (\{θ\}, t)$ . In a learning task, a set of readout weights $w$ is chosen so that a downstream linear readout $f (\{θ\}, t) = w \cdot r (\{θ\}, t)$ approximates a target sequence $y (\{θ\}, t)$ which maps input stimulus sequences to output scalar sequences. The quantity of interest is the generalization error $E_{g}$ , which in this case is an average over both input sequences and time, $E_{g} = {⟨ {(y (\{θ\}, t) - f (\{θ\}, t))}^{2} ⟩}_{θ (t), t}$ . The average is computed over a distribution of input stimulus sequences $p (θ (t))$ . To train the readout, $w$ , the network is given a sample of $P$ stimulus sequences $θ^{μ} (t), μ = 1, \dots, P$ . For the μ-th training input sequence, the target system $y$ is evaluated at a set of discrete time points $T_{μ} = \{t_{1}, t_{2}, \dots, t_{| T_{μ} |}\}$ , giving a collection of target values ${\{y_{t}^{μ}\}}_{t \in T_{μ}}$ and a total dataset of size $\mathcal{P} = \sum_{μ = 1}^{P} | T_{μ} |$ . The average case generalization error computes a further average of $E_{g}$ over randomly sampled datasets of size $\mathcal{P}$ .

Learning is again achieved through iterated weight updates of delta-rule form, which now have contributions from both the sequence index and time, $Δ w = η \sum_{μ} \sum_{t \in T_{μ}} r_{t}^{μ} (y_{t}^{μ} - f_{t}^{μ})$ . As before, optimization of the readout weights is equivalent to kernel regression with a kernel that computes inner products of neural population vectors at different times $t, t^{'}$ for different input sequences $\{θ\}, \{θ^{'}\}$ . This kernel depends on details of the time-varying population code, including its recurrent intrinsic dynamics as well as its encoding of the time-varying input stimuli. The optimization problem and delta rule described above converge to the kernel regression solution for the kernel gram matrix $K_{t, t^{'}}^{μ, μ^{'}} = \frac{1}{N} r_{t}^{μ} \cdot r_{t^{'}}^{μ^{'}}$ (Dong et al., 2020; Yang, 2019; Yang, 2020). The learned function has the form $f (\{θ\}, t) = \sum_{μ, t^{'} \in T_{μ}} α_{t^{'}}^{μ} K (\{θ\}, \{θ^{μ}\}, t, t^{'})$ , where $α = K^{+} y$ for the kernel gram matrix $K \in ℝ^{\mathcal{P} \times \mathcal{P}}$ computed over all $\mathcal{P} = \sum_{μ} | T_{μ} |$ training time points, and $y \in ℝ^{\mathcal{P}}$ is the vector containing the desired target outputs at those time points. Assuming a probability distribution over sequences $θ (t)$ , the kernel can be diagonalized with orthonormal eigenfunctions $ψ_{k} (\{θ\}, t)$ . Our theory carries over from the static case: kernels whose top eigenfunctions have high alignment with the target dynamical system $y (\{θ\}, t)$ will achieve the best average case generalization performance.
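The equivalence between the delta rule and kernel regression can be illustrated on a small simulated network. The sketch below is a toy under stated assumptions, not the network used in the paper: a random tanh RNN with hypothetical gain and sizes, scalar Gaussian input sequences, an arbitrary target $y_{t} = θ_{t} θ_{t - 1}$ , and readout times restricted to the steps after a short burn-in.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_train_seq, T, burn_in = 400, 5, 8, 3     # hypothetical sizes

W = 1.5 * rng.standard_normal((N, N)) / np.sqrt(N)   # recurrent weights
u = rng.standard_normal(N)                            # input weights

def run_rnn(theta):
    """States r_t (T x N) of a tanh RNN driven by a scalar sequence theta."""
    r = np.zeros(N)
    states = []
    for th in theta:
        r = np.tanh(W @ r + u * th)
        states.append(r)
    return np.array(states)

def sample_dataset(n_seq):
    """Stack states and targets over sequences mu and readout times t in T_mu."""
    R, y = [], []
    for _ in range(n_seq):
        theta = rng.standard_normal(T)
        states = run_rnn(theta)
        for t in range(burn_in, T):
            R.append(states[t])
            y.append(theta[t] * theta[t - 1])         # toy target (assumption)
    return np.array(R), np.array(y)

R_train, y_train = sample_dataset(n_train_seq)        # 25 time points in total
R_test, _ = sample_dataset(3)

# Delta rule from zero initial weights, summed over sequences and times.
eta = 1.0 / np.linalg.eigvalsh(R_train @ R_train.T)[-1]
w = np.zeros(N)
for _ in range(20000):
    w += eta * R_train.T @ (y_train - R_train @ w)

# Kernel regression with gram matrix K = R R^T / N and alpha = K^+ y.
K = R_train @ R_train.T / N
alpha = np.linalg.pinv(K) @ y_train
f_kernel = (R_test @ R_train.T / N) @ alpha
f_delta = R_test @ w

print("max |f_delta - f_kernel| on held-out sequences:",
      np.abs(f_delta - f_kernel).max())
```

Starting the weights at zero matters here: it keeps $w$ in the span of the training responses, which is what makes the delta-rule solution coincide with the pseudo-inverse solution.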

Alternative neural codes with same kernel

Orthogonal transformations are sufficient for linear kernel-preserving transformations

We will now show that for any linear transformation $\tilde{r} = Ar$ which preserves the inner product kernel $K (θ, θ^{'})$ , there exists an orthogonal matrix $Q$ such that $\tilde{r} = Qr$ .

Proof.

Let $\tilde{r} (θ) = Ar (θ)$ for all stimuli $θ$ . To preserve the kernel, we must have

\begin{matrix} K (θ, θ^{'}) = \tilde{r} (θ) \cdot \tilde{r} (θ^{'}) = r (θ) \cdot r (θ^{'}) ⟹ r {(θ)}^{⊤} A^{⊤} A r (θ^{'}) = r (θ) \cdot r (θ^{'}) . \end{matrix}

Taking projections against each of the orthonormal eigenfunctions $ψ_{k} (θ)$ (see Appendix Singular value decomposition of continuous population responses), we define vectors $u_{k}$ as $\sqrt{λ_{k}} u_{k} = {⟨ r (θ) ψ_{k} (θ) ⟩}_{θ}$ , allowing us to express the SVD of the population code $r (θ) = \sum_{k} \sqrt{λ_{k}} u_{k} ψ_{k} (θ)$ . These vectors $\{u_{k}\}$ are orthonormal, $u_{k} \cdot u_{ℓ} = δ_{k ℓ}$ , since, by the definition of the kernel eigenfunctions $ψ_{k}$ ,

\begin{aligned} \sqrt{λ_{k} λ_{ℓ}} u_{k} \cdot u_{ℓ} & = {⟨ r (θ) \cdot r (θ^{'}) ψ_{k} (θ) ψ_{ℓ} (θ^{'}) ⟩}_{θ, θ^{'}} = {⟨ ψ_{k} (θ) {⟨ K (θ, θ^{'}) ψ_{ℓ} (θ^{'}) ⟩}_{θ^{'}} ⟩}_{θ} \\ & = λ_{ℓ} {⟨ ψ_{k} (θ) ψ_{ℓ} (θ) ⟩}_{θ} = λ_{k} δ_{k, ℓ} . \end{aligned}

Since $r (θ)$ and $\tilde{r} (θ)$ have the same inner product kernel, they must possess the same kernel eigenfunctions $ψ_{k}$ and kernel eigenvalues $λ_{k}$ , which are identified through the eigenvalue problem

\begin{matrix} \int p (θ) K (θ, θ^{'}) ψ_{k} (θ) d θ = λ_{k} ψ_{k} (θ^{'}) . \end{matrix}

We therefore have the following two singular value decompositions for $r$ and $\tilde{r}$

\begin{matrix} r (θ) = \sum_{k = 1}^{N} \sqrt{λ_{k}} u_{k} ψ_{k} (θ), \quad \tilde{r} (θ) = \sum_{k = 1}^{N} \sqrt{λ_{k}} {\tilde{u}}_{k} ψ_{k} (θ), \end{matrix}

where ${\{u_{k}\}}_{k = 1}^{N}$ and ${\{{\tilde{u}}_{k}\}}_{k = 1}^{N}$ are both complete sets of orthonormal vectors (the sums above run over possible zero eigenvalues). Taking the equation $\tilde{r} (θ) = Ar (θ)$ , we multiply both sides by $ψ_{k} (θ)$ and average over $θ$ , giving

\begin{matrix} {⟨ \tilde{r} (θ) ψ_{k} (θ) ⟩}_{θ} = \sqrt{λ_{k}} {\tilde{u}}_{k} = A {⟨ r (θ) ψ_{k} (θ) ⟩}_{θ} = \sqrt{λ_{k}} A u_{k} . \end{matrix}

For an eigenmode $k$ with positive eigenvalue $λ_{k} > 0$ , this implies ${\tilde{u}}_{k} = A u_{k}$ , while there is no corresponding constraint for the null modes with $λ_{k} = 0$ . However, the action of $A$ on the nullspace of the code has no influence on $\tilde{r}$ , so there is no loss of generality in restricting consideration to transformations $A$ which satisfy ${\tilde{u}}_{k} = A u_{k}$ for all $k \in [N]$ (rather than just the $λ_{k} > 0$ modes). This choice gives $A = \sum_{k = 1}^{N} {\tilde{u}}_{k} u_{k}^{⊤}$ , which is orthogonal because it maps one orthonormal basis onto another. Thus, the set of codes $\tilde{r} (θ)$ that are generated through linear transformations and share the kernel $r (θ) \cdot r (θ^{'})$ coincides with the set of all orthogonal transformations of the original code, $\{Q r (θ) : Q Q^{⊤} = Q^{⊤} Q = I\}$ . ∎
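The statement of the proof can be checked on a finite stimulus set. This is a minimal numerical sketch with a hypothetical random code of $N = 5$ neurons and $P = 50$ stimuli. It verifies that an orthogonal $Q$ preserves the kernel, that a generic linear map does not, and that the construction $A = \sum_{k} {\tilde{u}}_{k} u_{k}^{⊤}$ recovers $Q$ when the code has full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 5, 50                                   # neurons, stimuli (toy sizes)
R = rng.standard_normal((N, P))                # column mu is r(theta_mu)

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
R_rot = Q @ R

K = R.T @ R
print("kernel preserved by orthogonal Q:", np.allclose(K, R_rot.T @ R_rot))

# A generic (non-orthogonal) linear map changes the kernel.
A_generic = rng.standard_normal((N, N))
R_gen = A_generic @ R
print("kernel preserved by generic A:  ", np.allclose(K, R_gen.T @ R_gen))

# Rebuild A = sum_k u~_k u_k^T using the same eigenfunctions psi_k for both codes.
U, s, Vt = np.linalg.svd(R, full_matrices=False)   # r = sum_k s_k u_k psi_k
U_rot = R_rot @ Vt.T / s                           # u~_k from projecting r~ on psi_k
A = U_rot @ U.T
print("constructed A equals Q:         ", np.allclose(A, Q))
```

The vectors ${\tilde{u}}_{k}$ are obtained by projecting the transformed code onto the eigenfunctions of the original code, rather than from an independent SVD. This avoids sign ambiguities between the two decompositions.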

Effect of noise on RROS symmetry

The random rotation and optimal shift (RROS) operations introduced in the main text preserve generalization performance under the assumption of a deterministic neural code. However, for noisy codes, the presence of RROS symmetry is dependent on the noise distribution. Below we discuss two commonly analyzed distributions: the Gaussian distribution and the Poisson distribution. For Gaussian noise, the RROS operations preserve the generalization performance and the local Fisher information. However, if noise is constrained to be Poisson then RROS operations do not preserve generalization or Fisher information.

First, we will analyze stimulus-dependent Gaussian noise, where generalization performance is preserved under rotations and baseline shifts. Note that if the code at $θ$ obeys $r (θ) \sim \mathcal{N} (\bar{r} (θ), Σ_{n} (θ))$ , then the rotated and shifted code follows $Q r (θ) + δ \sim \mathcal{N} (Q \bar{r} (θ) + δ, Q Σ_{n} (θ) Q^{⊤})$ . This rotated and shifted code $Qr (θ) + δ$ , when centered, will exhibit the same generalization performance as the original code. This is true for learning from either a trial-averaged or a non-trial-averaged code. In the case of Gaussian noise on a centered code, the dataset transforms under a rotation as $D = \{r_{μ}, y_{μ}\} \to D^{'} = \{Q r_{μ}, y_{μ}\}$ . The optimal weights for a linear model similarly transform as $w (D) \to Q w (D)$ . Under these transformations the predictor on test point $θ$ is unchanged since

\begin{aligned} f (θ) = w \cdot r (θ) \to w^{⊤} Q^{⊤} Q r (θ) = w \cdot r (θ) \end{aligned}

Further, the local Fisher information matrix $I (θ) = \frac{\partial \bar{r} {(θ)}^{⊤}}{\partial θ} Σ_{n}^{- 1} (θ) \frac{\partial \bar{r} (θ)}{\partial θ^{⊤}} + \frac{1}{2} Tr [Σ_{n}^{- 1} (θ) \frac{\partial Σ_{n} (θ)}{\partial θ} Σ_{n}^{- 1} (θ) \frac{\partial Σ_{n} (θ)}{\partial θ^{⊤}}]$ is unchanged under the transformation $r \to Qr + δ$ . Under this transformation, the covariance simply transforms linearly, $Σ_{n} (θ) \to Q Σ_{n} (θ) Q^{⊤}$ , and the $Q$ matrices cancel in the quadratic form and under the trace. This shows that, for some noise models, our assumption that rotations and baseline shifts preserve generalization performance will be valid.
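The invariance of the Gaussian Fisher information can be verified numerically for a scalar stimulus. The sketch below uses randomly generated quantities as stand-ins for $\partial \bar{r} / \partial θ$ , $Σ_{n}$ and $\partial Σ_{n} / \partial θ$ at a single stimulus value; these are illustrative inputs, not fits to data. The baseline shift $δ$ does not appear because it changes neither the derivative of the mean nor the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6

def fisher_gaussian(dr, Sigma, dSigma):
    """Fisher information at a scalar stimulus for r ~ N(rbar(theta), Sigma(theta))."""
    Sinv = np.linalg.inv(Sigma)
    return dr @ Sinv @ dr + 0.5 * np.trace(Sinv @ dSigma @ Sinv @ dSigma)

# Illustrative local quantities at one stimulus value (assumptions).
dr = rng.standard_normal(N)                    # d rbar / d theta
M = rng.standard_normal((N, N))
Sigma = M @ M.T + N * np.eye(N)                # positive definite noise covariance
B = rng.standard_normal((N, N))
dSigma = B + B.T                               # symmetric d Sigma / d theta

# Rotation r -> Q r + delta: mean derivative and covariances transform with Q.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
I_orig = fisher_gaussian(dr, Sigma, dSigma)
I_rot = fisher_gaussian(Q @ dr, Q @ Sigma @ Q.T, Q @ dSigma @ Q.T)

print("Fisher information before and after rotation:", I_orig, I_rot)
```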

However, for Poisson noise, where the variance is tied to the mean firing rate, the RROS operations will not preserve noise structure or information content. The Fisher information at scalar stimulus $θ$ for a Poisson neuron is $I (θ) = \frac{{\bar{r}}^{'} {(θ)}^{2}}{\bar{r} (θ)}$ . A baseline shift $r \to r + δ$ to the tuning curve will not change the numerator, since the derivative of the tuning curve is invariant to this transformation, but a positive shift $δ > 0$ will increase the denominator and therefore reduce the Fisher information wherever ${\bar{r}}^{'} (θ) \neq 0$ .
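The effect of a baseline shift on a Poisson neuron can be seen with a single tuning curve. The sketch assumes a hypothetical von Mises-shaped tuning curve with a small positive floor; the amplitude, width and shift values are arbitrary illustrative choices.

```python
import numpy as np

theta = np.linspace(-np.pi, np.pi, 721)

def tuning(theta, baseline=0.0):
    """Hypothetical von Mises-like mean firing rate with an optional baseline shift."""
    return baseline + 0.5 + 5.0 * np.exp(2.0 * (np.cos(theta) - 1.0))

def poisson_fisher(theta, baseline=0.0):
    """I(theta) = rbar'(theta)^2 / rbar(theta) for a Poisson neuron."""
    r = tuning(theta, baseline)
    dr = np.gradient(r, theta)        # numerical derivative, unaffected by the shift
    return dr ** 2 / r

I0 = poisson_fisher(theta)
I_shift = poisson_fisher(theta, baseline=2.0)

print("summed Fisher information without shift:", I0.sum())
print("summed Fisher information with shift:   ", I_shift.sum())
```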

Necessary conditions for optimally sparse codes

Next we argue why optimally sparse codes should be lifetime and population selective. We consider the following optimization problem: find non-negative neural responses $S \in ℝ^{N \times P}$ and a baseline vector $δ \in ℝ^{N}$ so that the baseline-subtracted responses $R = S - δ 1^{⊤}$ realize a desired inner product kernel $K \in ℝ^{P \times P}$ and have minimal total firing. This is equivalent to finding the most metabolically efficient code among the space of codes with equivalent inductive bias. Mathematically, we formulate this problem as

\begin{matrix} \min_{S \in ℝ^{N \times P}, δ \in ℝ^{N}} \sum_{i μ} S_{i μ} \quad \text{s.t.} \quad {(S - δ 1^{⊤})}^{⊤} (S - δ 1^{⊤}) = K, \quad S_{i μ} \geq 0 \;\; \forall i \in [N], μ \in [P] . \end{matrix}

To enforce the constraints for the definition of the kernel and the non-negativity of the responses, we introduce the following Lagrangian

\begin{matrix} ℒ (S, δ, A, V) = 1^{⊤} S1 - Tr ([{(S - δ 1^{⊤})}^{⊤} (S - δ 1^{⊤}) - K] A) - Tr V^{⊤} S \end{matrix}

where 1 is the vector containing all ones, the Lagrange multiplier matrix $A$ enforces the definition of the kernel and the KKT multiplier matrix $V$ enforces the non-negativity constraints for each element of $S$ . The KKT conditions require that any local optimum of the objective would have to satisfy the following equations (Kuhn and Tucker, 2014)

\begin{matrix} \frac{\partial ℒ}{\partial S} & = 11^{⊤} - (S - δ 1^{⊤}) A - V = 0 \\ \frac{\partial ℒ}{\partial δ} & = - (S - δ 1^{⊤}) A1 = 0 \\ \frac{\partial ℒ}{\partial A} & = {(S - δ 1^{⊤})}^{⊤} (S - δ 1^{⊤}) - K = 0 \\ V ⊙ S & = 0, \end{matrix}

where $⊙$ denotes the element-wise Hadamard product. Using the complementary slackness condition $S ⊙ V = 0$ , and the first optimality condition $\frac{\partial ℒ}{\partial S} = 0$ , we have

\begin{matrix} S = S ⊙ [(S - δ 1^{⊤}) A] . \end{matrix}

Therefore, for any neuron-stimulus pair $(i, μ)$ , either $S_{i μ} = 0$ or $\sum_{ν \in [P]} (S_{i ν} - δ_{i}) A_{ν μ} = 1$ . Further, under the condition that $K$ is full rank, we conclude from the equation $\frac{\partial ℒ}{\partial δ} = 0$ that $\sum_{ν \in [P]} A_{μ ν} = 0$ for any stimulus μ. Let $ℐ_{i} = \{μ \in [P] : S_{i μ} > 0\}$ represent the set of stimuli for which neuron $i$ fires. We will call this the receptive field set for neuron $i$ . Let $B_{(i)} \in ℝ^{P \times P}$ have entries

\begin{matrix} {[B_{(i)}]}_{μ ν} = \begin{cases} {[A_{(i)}^{+}]}_{μ ν} & μ, ν \in ℐ_{i} \\ 0 & μ \notin ℐ_{i} \text{ or } ν \notin ℐ_{i} \end{cases} \end{matrix}

where the matrix $A_{(i)}$ is the $| ℐ_{i} | \times | ℐ_{i} |$ minor of $A$ obtained by taking all rows and columns with indices $μ, ν \in ℐ_{i}$ , and $A^{+}$ denotes the pseudo-inverse of $A$ . Then the $i$ -th neuron’s tuning curve, which we write as $s (ℐ_{i}, δ_{i}, A)$ , is a function of the index set $ℐ_{i}$ , the baseline $δ_{i}$ , and the neuron-independent $P \times P$ matrix $A$ . The non-negativity constraint for neuron $i$ ’s tuning curve implies that $S_{i μ} = \sum_{ν \in ℐ_{i}} B_{(i), μ ν} [δ_{i} \sum_{γ \in [P]} A_{ν γ} + 1] > 0$ for all $μ \in ℐ_{i}$ . To satisfy the definition of the kernel, we have the following constraint on the matrix $A$ , the index sets $ℐ_{i}$ , and the baselines $δ_{i}$

\begin{matrix} K & = \sum_{i = 1}^{N} (s (ℐ_{i}, δ_{i}, A) - δ_{i} 1) {(s (ℐ_{i}, δ_{i}, A) - δ_{i} 1)}^{⊤} \end{matrix}

This equation implicitly defines the index sets $ℐ_{i}$ , the baselines $δ_{i}$ , and the KKT matrix $A$ . We see that, in order to fit an arbitrary kernel, the receptive field sets $\{ℐ_{i}\}$ and baselines $δ_{i}$ for each neuron must be sufficiently diverse, since otherwise only a low rank kernel matrix can be achieved from the optimally sparse code. As a concrete example, suppose that $ℐ_{i} = ℐ$ so that $V_{(i)} = V$ and $δ_{i} = δ$ for all $i$ . For example, this could occur if each neuron fired for every possible stimulus. In this case, the kernel would be rank one: $K = N (s (ℐ, δ, A) - δ 1) {(s (ℐ, δ, A) - δ 1)}^{⊤}$ . In order to achieve a higher rank code there must be sufficient diversity of the receptive fields $ℐ_{i}$ . Thus the only way for optimally sparse codes to realize high rank kernels $K$ is for neurons to have different receptive field sets $ℐ_{i}$ . The necessary optimality conditions thus reveal a preference for sparse neural tuning curves to have high lifetime sparseness; to achieve diverse index sets $ℐ_{i}$ , any given neuron will fire only for a unique subset of the possible stimuli.
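The feasible set of the optimization problem above can be explored with a small simulation. The sketch starts from a hypothetical sparse, non-negative code, applies a random rotation followed by the smallest baseline shift that restores non-negativity, and checks that the result satisfies both constraints while having a different total firing. The code sizes, the 10% response probability and the exponential response amplitudes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 100                                  # neurons, stimuli (toy sizes)

# Hypothetical sparse code: each neuron responds to roughly 10% of stimuli.
S = rng.exponential(1.0, size=(N, P)) * (rng.random((N, P)) < 0.1)
K = S.T @ S                                     # target kernel (baseline delta = 0)

# Random rotation, then the smallest per-neuron shift giving non-negative rates.
Q, _ = np.linalg.qr(rng.standard_normal((N, N)))
R_rot = Q @ S
delta = -R_rot.min(axis=1)
S_rot = R_rot + delta[:, None]

# Both constraints of the problem hold for (S_rot, delta).
R_check = S_rot - delta[:, None]                # baseline-subtracted responses
print("non-negative responses:", bool((S_rot >= 0).all()))
print("kernel constraint holds:", np.allclose(K, R_check.T @ R_check))
print("total firing, sparse code: ", S.sum())
print("total firing, rotated code:", S_rot.sum())
```

Both codes are feasible for the same kernel $K$ , so comparing their total firing shows how much the objective can vary within the feasible set.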

Impact of neural noise and unlearnable targets on learning

While our analysis so far has focused on deterministic population codes, our theory can be extended to neural populations which exhibit variability in responses to identical stimuli. For each stimulus $θ$ , we let the population response $r (θ)$ be a random vector with mean $\bar{r} (θ) = {⟨ r (θ) ⟩}_{r | θ}$ and covariance $Σ_{n} (θ) = {⟨ (r (θ) - \bar{r} (θ)) {(r (θ) - \bar{r} (θ))}^{⊤} ⟩}_{r | θ}$ .

The (deterministic) target function can be decomposed in terms of the mean response as $y (θ) = w^{*} \cdot \bar{r} (θ)$ (the usual decomposition $y = w^{*} \cdot r (θ)$ gives an unphysical target function which fluctuates with the variability in neural responses). For a given configuration of weights $w$ , the generalization error (which is an average over the joint distribution of $r, θ$ ) is determined only by the signal $Σ_{s} = {⟨ \bar{r} (θ) \bar{r} {(θ)}^{⊤} ⟩}_{θ}$ and noise $Σ_{n} = {⟨ Σ_{n} (θ) ⟩}_{θ}$ correlation matrices:

\begin{aligned} E_{g} (w) & = {⟨ {(r (θ) \cdot w - \bar{r} (θ) \cdot w^{*})}^{2} ⟩}_{r, θ} = ⟨ {[(w - w^{*}) \cdot \bar{r} (θ) + w \cdot (r (θ) - \bar{r} (θ))]}^{2} ⟩ \\ & = {(w - w^{*})}^{⊤} ⟨ \bar{r} (θ) \bar{r} {(θ)}^{⊤} ⟩ (w - w^{*}) + w^{⊤} ⟨ (r (θ) - \bar{r} (θ)) {(r (θ) - \bar{r} (θ))}^{⊤} ⟩ w \\ & = {(w - w^{*})}^{⊤} Σ_{s} (w - w^{*}) + w^{⊤} Σ_{n} w \end{aligned}

where we utilized the fact that ${⟨ r (θ) - \bar{r} (θ) ⟩}_{r | θ} = 0$ to eliminate the cross-term. The two terms in the final expression can be thought of as a bias-variance decomposition over the noise in neural responses. The minimum achievable loss can be obtained by differentiation of the generalization error expression with respect to $w$ , which gives the minimizer $w = {(Σ_{s} + Σ_{n})}^{- 1} Σ_{s} w^{*}$ and the minimal error $E_{g}^{*} = w^{* ⊤} Σ_{n} {(Σ_{s} + Σ_{n})}^{- 1} Σ_{s} w^{*}$ . We note that any noise correlation matrix with noise orthogonal to the coding direction, $Σ_{n} w^{*} = 0$ , will give the minimal (zero) asymptotic error. Alignment of the noise $Σ_{n}$ with $w^{*}$ gives higher asymptotic error.
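The expression for the minimal error can be checked numerically. The sketch uses random positive definite matrices as stand-ins for $Σ_{s}$ and $Σ_{n}$ (illustrative inputs only). It compares the closed form with the error evaluated at the minimizer, and then projects the noise onto the subspace orthogonal to $w^{*}$ to confirm that the asymptotic error vanishes in that case.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8

def random_pd(n):
    """Well-conditioned random positive definite matrix (illustrative input)."""
    M = rng.standard_normal((n, 3 * n))
    return M @ M.T

Sigma_s = random_pd(N)               # signal correlation matrix
Sigma_n = random_pd(N)               # noise correlation matrix
w_star = rng.standard_normal(N)      # target readout weights

def E_g(w, Sigma_s, Sigma_n):
    d = w - w_star
    return d @ Sigma_s @ d + w @ Sigma_n @ w

def optimum(Sigma_s, Sigma_n):
    """Minimizer of E_g and the closed-form minimal error."""
    w_opt = np.linalg.solve(Sigma_s + Sigma_n, Sigma_s @ w_star)
    E_min = w_star @ Sigma_n @ w_opt  # w*^T Sigma_n (Sigma_s + Sigma_n)^-1 Sigma_s w*
    return w_opt, E_min

w_opt, E_min = optimum(Sigma_s, Sigma_n)
print("E_g at minimizer:", E_g(w_opt, Sigma_s, Sigma_n), " closed form:", E_min)

# Noise restricted to directions orthogonal to the coding direction w*.
P_perp = np.eye(N) - np.outer(w_star, w_star) / (w_star @ w_star)
Sigma_n_perp = P_perp @ Sigma_n @ P_perp
w_opt_perp, E_min_perp = optimum(Sigma_s, Sigma_n_perp)
print("minimal error with orthogonal noise:", E_min_perp)
```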

In addition to the irreducible error, the presence of neural noise can alter the learning curve at finite $P$ . An analytical study of this effect is difficult, and we leave it for future work. We numerically study the effect of neural variability on generalization performance in the orientation discrimination tasks for the non-trial-averaged mouse V1 code in Appendix 1—figure 1. We note that the generalization error is worse at each finite value of $P$ when compared to the trial-averaged (noise-free) learning curves. We varied the regularization parameter and did not find an obvious non-zero optimal weight decay $λ$ , consistent with small noise levels.

Neural noise is not the only phenomenon that can degrade task learning. Codes which are incapable of expressing the target function through linear readouts are also susceptible to overfitting. As explained in Canatar et al., 2021, the components of the target function that are inexpressible act as a source of noise on the learning process, and the readout can overfit this noise. Such a scenario can occur, for example, when the readout neuron only gets input from a sparse subset of the coding neural population (Seeman et al., 2018). We show in Appendix 1—figure 1C-D that using subsampled populations of size $N$ can indeed lead to a regime where more data can hurt performance, leading to an overfitting error peak, a subsequent non-vanishing asymptotic error, and an optimal weight decay parameter $λ$ . This phenomenon is known as double descent in the machine learning literature (Belkin et al., 2019; Mei and Montanari, 2020; Canatar et al., 2021). At small $N$ , these codes are not sufficiently expressive to learn the target function through linear readout. The overfitting peak occurs near the interpolation threshold, the largest value of $P$ where all training sets could be perfectly fit in the $λ \to 0$ limit (Canatar et al., 2021). At infinite $P$ , generalization error asymptotes to the amount of unexplained variance in the target function.
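The overfitting peak for subsampled codes can be reproduced in a synthetic setting. The sketch below does not use the V1 recordings: it assumes a whitened Gaussian code of $D = 200$ neurons, a target that depends on all of them, and a ridgeless ( $λ \to 0$ ) readout that sees only $N = 40$ neurons. All sizes and the number of trials are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 200, 40                                  # full code size, neurons seen by readout
w_star = rng.standard_normal(D) / np.sqrt(D)    # target weights over the full code

def mean_test_error(P, n_trials=50):
    """Average generalization error of a min-norm readout from N of D neurons."""
    errs = []
    for _ in range(n_trials):
        X = rng.standard_normal((P, D))         # whitened Gaussian code (assumption)
        y = X @ w_star
        # Minimum-norm least squares on the observed neurons (lambda -> 0 limit).
        w_hat = np.linalg.lstsq(X[:, :N], y, rcond=None)[0]
        # With identity code covariance the test error has a closed form:
        # squared weight error on seen neurons plus unexplained target variance.
        errs.append(np.sum((w_hat - w_star[:N]) ** 2) + np.sum(w_star[N:] ** 2))
    return np.mean(errs)

P_values = [10, 20, 40, 80, 160, 320]
curve = {P: mean_test_error(P) for P in P_values}
for P, err in curve.items():
    print(f"P={P:4d}  E_g={err:.3f}")
```

In this toy setting the part of the target carried by the unobserved neurons plays the role of noise, so the error is largest near $P \approx N$ and then decreases toward the unexplained variance as $P$ grows.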

Appendix 1—figure 1


Neural noise and subsampled neural codes can lead to overfitting.

(A) The learning curves without trial averaging (solid) and with trial averaging (dashed) for the high and low frequency orientation discrimination tasks. In principle, neural noise could limit asymptotic performance and lead to the existence of an optimal weight decay parameter $λ$ . (B) Performance at $P = 500$ vs ridge $λ$ shows that there is no optimal weight decay parameter. (C) Generalization of readouts trained on subsets of $N$ V1 neurons exhibits non-monotonic learning curves with an overfitting peak around $P \approx N$ . (D) The performance of subsamples of $N$ neurons as a function of the weight decay parameter $λ$ at $P = 500$ samples shows that, for sufficiently small $N$ , there is a non-zero optimal $λ$ .

Data availability

Mouse V1 neuron responses to orientation gratings and preprocessing code were obtained from a publicly available dataset: https://github.com/MouseLand/stringer-et-al-2019. Responses to ImageNet images and preprocessing code were obtained from another publicly available dataset: https://github.com/MouseLand/stringer-pachitariu-et-al-2018b. The code generated by the authors for this paper is also available at https://github.com/Pehlevan-Group/sample_efficient_pop_codes (copy archived at swh:1:rev:6cd4f0fe7043ae214dd682a9dc035a497ffa2d61).

The following previously published data sets were used

1. Carsen S
2. Marius P
3. Nicholas S
4. Matteo C
5. Kenneth DH
(2018) Figshare
Recordings of ten thousand neurons in visual cortex in response to 2,800 natural images.

https://doi.org/10.25378/janelia.6845348.v4
(2019) Figshare
Recordings of ~20,000 neurons from V1 in response to oriented stimuli.

https://doi.org/10.25378/janelia.8279387.v3

References

Conference
(2016) Do retinal ganglion cells project natural scenes to their principal subspace and whiten them?
2016 50th Asilomar Conference on Signals, Systems and Computers.

https://doi.org/10.1109/ACSSC.2016.7869658
(2014) Perceptual learning of simple stimuli modifies stimulus representations in posterior inferior temporal cortex
Journal of Cognitive Neuroscience 26:2187–2200.

https://doi.org/10.1162/jocn_a_00641
1. Adelson EH
2. Bergen JR
(1985) Spatiotemporal energy models for the perception of motion
Journal of the Optical Society of America. A, Optics and Image Science 2:284–299.

https://doi.org/10.1364/josaa.2.000284
1. Ahissar M
2. Hochstein S
(2004) The reverse hierarchy theory of visual perceptual learning
Trends in Cognitive Sciences 8:457–464.

https://doi.org/10.1016/j.tics.2004.08.011
1. Ames KC
2. Ryu SI
3. Shenoy KV
(2019) Simultaneous motor preparation and execution in a last-moment reach correction task
Nature Communications 10:1–13.

https://doi.org/10.1038/s41467-019-10772-2
(1987) Generation of random orthogonal matrices
SIAM Journal on Scientific and Statistical Computing 8:625–629.

https://doi.org/10.1137/0908055
1. Atick JJ
2. Redlich AN
(1992) What does the retina know about natural scenes?
Neural Computation 4:196–210.

https://doi.org/10.1162/neco.1992.4.2.196
1. Attneave F
(1954) Some informational aspects of visual perception
Psychological Review 61:183–193.

https://doi.org/10.1037/h0054663
(2006) Neural correlations, population coding and computation
Nature Reviews. Neuroscience 7:358–366.

https://doi.org/10.1038/nrn1888
Book
1. Barlow H
(1961)
Possible Principles Underlying the Transformation of Sensory Messages

Cambridge University.
(2020) Benign overfitting in linear regression
PNAS 117:30063–30070.

https://doi.org/10.1073/pnas.1907378117
(2008) Dynamic ensemble odor coding in the mammalian olfactory bulb: sensory information at different timescales
Neuron 57:586–598.

https://doi.org/10.1016/j.neuron.2008.02.011
1. Belkin M
2. Hsu D
3. Ma S
4. Mandal S
(2019) Reconciling modern machine-learning practice and the classical bias-variance trade-off
PNAS 116:15849–15854.

https://doi.org/10.1073/pnas.1903070116
1. Bonin V
2. Histed MH
3. Yurgenson S
4. Reid RC
(2011) Local diversity and fine-scale organization of receptive fields in mouse visual cortex
The Journal of Neuroscience 31:18506–18521.

https://doi.org/10.1523/JNEUROSCI.2974-11.2011
Conference
(2020)
Spectrum dependent learning curves in kernel regression and wide neural networks

Proceedings of the 37th International Conference on Machine Learning of Proceedings of Machine Learning Research. pp. 1024–1034.
Preprint
1. Bordelon B
2. Pehlevan C
(2022a) The Influence of Learning Rule on Representation Dynamics in Wide Neural Networks
arXiv.

https://arxiv.org/abs/2210.02157
Conference
1. Bordelon B
2. Pehlevan C
(2022b)
Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Advances In Neural Information Processing Systems.
Software
1. Bradbury J
2. Frostig R
3. Hawkins P
4. Johnson MJ
5. Leary C
6. Maclaurin D
7. Necula G
8. Paszke A
9. VanderPlas J
10. Wanderman-Milne S
11. Zhang Q
(2018) JAX: composable transformations of python+numpy programs
Github.

https://github.com/google/jax
1. Cadieu CF
2. Hong H
3. Yamins DLK
4. Pinto N
5. Ardila D
6. Solomon EA
7. Majaj NJ
8. DiCarlo JJ
(2014) Deep neural networks rival the representation of primate it cortex for core visual object recognition
PLOS Computational Biology 10:e1003963.

https://doi.org/10.1371/journal.pcbi.1003963
(2021) Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks
Nature Communications 12:2914.

https://doi.org/10.1038/s41467-021-23103-1
Book
1. Carey S
2. Bartlett E
(1978)
Acquiring a Single New Word

Elsevier.
(2018) Toward a unified theory of efficient, predictive, and sparse coding
PNAS 115:186–191.

https://doi.org/10.1073/pnas.1711114115
1. Chapin JK
2. Nicolelis MA
(1999) Principal component analysis of neuronal ensemble activity reveals multidimensional somatosensory representations
Journal of Neuroscience Methods 94:121–140.

https://doi.org/10.1016/s0165-0270(99)00130-2
1. Cohen MR
2. Kohn A
(2011) Measuring and interpreting neuronal correlations
Nature Neuroscience 14:811–819.

https://doi.org/10.1038/nn.2842
Preprint
(2022) Error Rates for Kernel Classification under Source and Capacity Conditions
arXiv.

https://arxiv.org/abs/2201.12655
1. Cunningham JP
2. Yu BM
(2014) Dimensionality reduction for large-scale neural recordings
Nature Neuroscience 17:1500–1509.

https://doi.org/10.1038/nn.3776
Book
1. Dayan P
2. Abbott LF
(2001)
Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems

The MIT Press.
Conference
1. Deng J
2. Dong W
3. Socher R
4. Li LJ
5. Li K
6. Fei-Fei L
(2009)
Imagenet: A large-scale hierarchical image database

In 2009 IEEE conference on computer vision and pattern recognition.
1. de Vries SEJ
2. Lecoq JA
3. Buice MA
4. Groblewski PA
5. Ocker GK
6. Oliver M
7. Feng D
8. Cain N
9. Ledochowitsch P
10. Millman D
11. Roll K
12. Garrett M
13. Keenan T
14. Kuan L
15. Mihalas S
16. Olsen S
17. Thompson C
18. Wakeman W
19. Waters J
20. Williams D
21. Barber C
22. Berbesque N
23. Blanchard B
24. Bowles N
25. Caldejon SD
26. Casal L
27. Cho A
28. Cross S
29. Dang C
30. Dolbeare T
31. Edwards M
32. Galbraith J
33. Gaudreault N
34. Gilbert TL
35. Griffin F
36. Hargrave P
37. Howard R
38. Huang L
39. Jewell S
40. Keller N
41. Knoblich U
42. Larkin JD
43. Larsen R
44. Lau C
45. Lee E
46. Lee F
47. Leon A
48. Li L
49. Long F
50. Luviano J
51. Mace K
52. Nguyen T
53. Perkins J
54. Robertson M
55. Seid S
56. Shea-Brown E
57. Shi J
58. Sjoquist N
59. Slaughterbeck C
60. Sullivan D
61. Valenza R
62. White C
63. Williford A
64. Witten DM
65. Zhuang J
66. Zeng H
67. Farrell C
68. Ng L
69. Bernard A
70. Phillips JW
71. Reid RC
72. Koch C
(2020) A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex
Nature Neuroscience 23:138–151.

https://doi.org/10.1038/s41593-019-0550-9
Conference
(2020)
Reservoir computing meets recurrent kernels and structured transforms

NeurIPS Proceedings.
1. Edelman S
(1998) Representation is representation of similarities
The Behavioral and Brain Sciences 21:449–467.

https://doi.org/10.1017/s0140525x98001253
Preprint
(2021) Capacity of Group-Invariant Linear Readouts from Equivariant Representations: How Many Objects Can Be Linearly Classified under All Possible Views?
arXiv.

https://doi.org/10.48550/arXiv.2110.07472
Preprint
(2021) Rich and Lazy Learning of Task Representations in Brains and Neural Networks
bioRxiv.

https://doi.org/10.1101/2021.04.23.441128
(2004) Learning strengthens the response of primary visual cortex to simple patterns
Current Biology 14:573–578.

https://doi.org/10.1016/j.cub.2004.03.032
(2017) Neural manifolds for the control of movement
Neuron 94:978–984.

https://doi.org/10.1016/j.neuron.2017.05.025
1. Gallego JA
2. Perich MG
3. Naufel SN
4. Ethier C
5. Solla SA
6. Miller LE
(2018) Cortical population activity within a preserved neural manifold underlies multiple motor behaviors
Nature Communications 9:4233.

https://doi.org/10.1038/s41467-018-06560-z
1. Gao P
2. Ganguli S
(2015) On simplicity and complexity in the brave new world of large-scale neuroscience
Current Opinion in Neurobiology 32:148–155.

https://doi.org/10.1016/j.conb.2015.04.003
Preprint
1. Gao P
2. Trautmann E
3. Yu B
4. Santhanam G
5. Ryu S
6. Shenoy K
7. Ganguli S
(2017) A Theory of Multineuronal Dimensionality, Dynamics and Measurement
bioRxiv.

https://doi.org/10.1101/214262
(2002) Physiological correlates of perceptual learning in monkey V1 and V2
Journal of Neurophysiology 87:1867–1888.

https://doi.org/10.1152/jn.00690.2001
1. Gilbert CD
(1994) Early perceptual learning
PNAS 91:1195–1197.

https://doi.org/10.1073/pnas.91.4.1195
(1995) Regularization theory and neural networks architectures
Neural Computation 7:219–269.

https://doi.org/10.1162/neco.1995.7.2.219
(2021) Mouse visual cortex areas represent perceptual and semantic features of learned visual categories
Nature Neuroscience 24:1441–1451.

https://doi.org/10.1038/s41593-021-00914-5
1. Haft M
2. van Hemmen JL
(1998) Theory and implementation of infomax filters for the retina
Network 9:39–71.

https://doi.org/10.1088/0954-898X_9_1_003
1. Hansel D
2. van Vreeswijk C
(2002) How noise contributes to contrast invariance of orientation tuning in cat visual cortex
The Journal of Neuroscience 22:5118–5128.

https://doi.org/10.1523/JNEUROSCI.22-12-05118.2002
Preprint
1. Harris KD
(2019) Additive Function Approximation in the Brain
arXiv.

https://doi.org/10.48550/arXiv.1909.02603
(2020) Direct fit to nature: an evolutionary perspective on biological and artificial neural networks
Neuron 105:416–434.

https://doi.org/10.1016/j.neuron.2019.12.002
Preprint
(2020) Surprises in High-Dimensional Ridgeless Least Squares Interpolation
arXiv.

https://doi.org/10.48550/arXiv.1903.08560
1. Hertz J
2. Krogh A
3. Palmer RG
4. Horner H
(1991) Introduction to the theory of neural computation
Physics Today 44:70.

https://doi.org/10.1063/1.2810360
(2008) Sparse representation of sounds in the unanesthetized auditory cortex
PLOS Biology 6:e16.

https://doi.org/10.1371/journal.pbio.0060016
1. Huang Y
2. Rao RPN
(2011) Predictive coding
Wiley Interdisciplinary Reviews. Cognitive Science 2:580–593.

https://doi.org/10.1002/wcs.142
- PubMed
- Google Scholar
Book
1. Hume D
(1998)
An Enquiry Concerning Human Understanding

Oxford University Press.
- Google Scholar
1. Jabri M
2. Flower B
(1992) Weight perturbation: an optimal architecture and learning technique for analog vlsi feedforward and recurrent multilayer networks
IEEE Transactions on Neural Networks 3:154–157.

https://doi.org/10.1109/72.105429
- PubMed
- Google Scholar
(2018)
Advances in Neural Information Processing Systems

Neural tangent kernel: convergence and generalization in neural networks, Advances in Neural Information Processing Systems, Curran Associates, Inc.
- Google Scholar
Conference
1. Kalimeris D
2. Kaplun G
3. Nakkiran P
4. Edelman BL
5. Yang T
6. Barak B
7. Zhang H
(2019)
SGD on neural networks learns functions of increasing complexity

In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019.
- Google Scholar
1. Kato S
2. Kaplan HS
3. Schrödel T
4. Skora S
5. Lindsay TH
6. Yemini E
7. Lockery S
8. Zimmer M
(2015) Global brain dynamics embed the motor command sequence of Caenorhabditis elegans
Cell 163:656–669.

https://doi.org/10.1016/j.cell.2015.09.034
- PubMed
- Google Scholar
Preprint
1. Kornblith S
2. Norouzi M
3. Lee H
4. Hinton G
(2019) Similarity of Neural Network Representations Revisited
arXiv.

https://doi.org/10.48550/arXiv.1905.00414
- Google Scholar
(2008) Representational similarity analysis-connecting the branches of systems neuroscience
Frontiers in Systems Neuroscience 2:4.

https://doi.org/10.3389/neuro.06.004.2008
- PubMed
- Google Scholar
Book
1. Kuhn HW
2. Tucker AW
(2014) Nonlinear programming
In: Kuhn HW, editors. Traces and Emergence of Nonlinear Programming. Springer. pp. 1–4.

https://doi.org/10.1007/978-3-0348-0439-4
- Google Scholar
1. Laakso A
2. Cottrell G
(2000) Content and cluster analysis: assessing representational similarity in neural systems
Philosophical Psychology 13:47–76.

https://doi.org/10.1080/09515080050002726
- Google Scholar
(2017) Building machines that learn and think like people
The Behavioral and Brain Sciences 40:e253.

https://doi.org/10.1017/S0140525X16001837
- PubMed
- Google Scholar
1. Law CT
2. Gold JI
(2008) Neural correlates of perceptual learning in a sensory-motor, but not a sensory, cortical area
Nature Neuroscience 11:505–513.

https://doi.org/10.1038/nn2070
- PubMed
- Google Scholar
Conference
(2018)
Deep neural networks as gaussian processes

In International Conference on Learning Representations.
- Google Scholar
(2005) Selectivity and sparseness in the responses of striate complex cells
Vision Research 45:57–73.

https://doi.org/10.1016/j.visres.2004.07.021
- PubMed
- Google Scholar
Preprint
1. Li Z
2. Wang R
3. Yu D
4. Du SS
5. Hu W
6. Salakhutdinov R
7. Arora S
(2019) Enhanced Convolutional Neural Tangent Kernels
arXiv.

https://doi.org/10.48550/arXiv.1911.00809
- Google Scholar
(2016) Random synaptic feedback weights support error backpropagation for deep learning
Nature Communications 7:13276.

https://doi.org/10.1038/ncomms13276
- PubMed
- Google Scholar
(2017) Optimal degrees of synaptic connectivity
Neuron 93:1153–1164.

https://doi.org/10.1016/j.neuron.2017.01.030
- PubMed
- Google Scholar
1. Loureiro B
2. Gerbelot C
3. Cui H
4. Goldt S
5. Krzakala F
6. Mezard M
7. Zdeborová L
(2021a) Learning curves of generic features maps for realistic datasets with a teacher-student model
Advances in Neural Information Processing Systems 34:18137–18151.

https://doi.org/10.1088/1742-5468/ac9825
- Google Scholar
Preprint
1. Loureiro B
2. Gerbelot C
3. Cui H
4. Goldt S
5. Krzakala F
6. Mézard M
7. Zdeborová L
(2021b) Capturing the Learning Curves of Generic Features Maps for Realistic Data Sets with a Teacher-Student Model
arXiv.

https://doi.org/10.48550/arXiv.2102.08127
- Google Scholar
Preprint
(2022) Evolution of Neural Activity in Circuits Bridging Sensory and Abstract Knowledge
bioRxiv.

https://doi.org/10.1101/2022.01.29.478317
- Google Scholar
1. Mei S
2. Montanari A
(2020) The generalization error of random features regression: precise asymptotics and the double descent curve
Communications on Pure and Applied Mathematics 75:667–766.

https://doi.org/10.1002/cpa.22008
- Google Scholar
Preprint
(2021) Learning with Invariances in Random Features and Kernel Models
arXiv.

https://doi.org/10.48550/arXiv.2102.13219
- Google Scholar
(2020) Adaptive tuning curve widths improve sample efficient learning
Frontiers in Computational Neuroscience 14:12.

https://doi.org/10.3389/fncom.2020.00012
- PubMed
- Google Scholar
1. Mercer J
(1909) XVI. Functions of positive and negative type, and their connection the theory of integral equations
Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of A Mathematical or Physical Character 209:415–446.

https://doi.org/10.1098/rsta.1909.0016
- Google Scholar
1. Miller KD
2. Troyer TW
(2002) Neural noise can explain expansive, power-law nonlinearities in neural response functions
Journal of Neurophysiology 87:653–659.

https://doi.org/10.1152/jn.00425.2001
- PubMed
- Google Scholar
(2014) Information-limiting correlations
Nature Neuroscience 17:1410–1417.

https://doi.org/10.1038/nn.3807
- PubMed
- Google Scholar
Conference
1. Nassar J
2. Sokol P
3. Chang S
4. Harris K
(2020)
On 1/n neural representation and robustness

Advances in Neural Information Processing Systems 33.
- Google Scholar
Book
1. Neal MR
(1994)
Bayesian Learning for Neural Networks

Springer.
- Google Scholar
1. Niell CM
2. Stryker MP
(2008) Highly selective receptive fields in mouse visual cortex
The Journal of Neuroscience 28:7520–7536.

https://doi.org/10.1523/JNEUROSCI.0623-08.2008
- PubMed
- Google Scholar
1. Niven JE
2. Laughlin SB
(2008) Energy limitation as a selective pressure on the evolution of sensory systems
The Journal of Experimental Biology 211:1792–1804.

https://doi.org/10.1242/jeb.017574
- PubMed
- Google Scholar
1. Olshausen BA
2. Field DJ
(1997) Sparse coding with an overcomplete basis set: a strategy employed by V1?
Vision Research 37:3311–3325.

https://doi.org/10.1016/s0042-6989(97)00169-7
- PubMed
- Google Scholar
(2007) Effects of perceptual learning in visual backward masking on the responses of macaque inferior temporal neurons
Neuroscience 145:775–789.

https://doi.org/10.1016/j.neuroscience.2006.12.058
- PubMed
- Google Scholar
(2018) Robustness of spike deconvolution for neuronal calcium imaging
The Journal of Neuroscience 38:7976–7985.

https://doi.org/10.1523/JNEUROSCI.3339-17.2018
- PubMed
- Google Scholar
Book
(2019)
Recordings of 20,000 Neurons from V1 in Response to Oriented Stimuli

American Physiological Society.
- Google Scholar
1. Pehlevan C
2. Sompolinsky H
(2014) Selectivity and sparseness in randomly connected balanced networks
PLOS ONE 9:e89992.

https://doi.org/10.1371/journal.pone.0089992
- PubMed
- Google Scholar
(2018) Why do similarity matching objectives lead to hebbian/anti-hebbian networks?
Neural Computation 30:84–124.

https://doi.org/10.1162/neco_a_01018
- PubMed
- Google Scholar
Software
1. Pehlevan-Group
(2022) Sample_efficient_pop_codes, version 6cd4f0f
GitHub.

https://github.com/Pehlevan-Group/sample_efficient_pop_codes
(2009) The surprisingly high human efficiency at learning to recognize faces
Vision Research 49:301–314.

https://doi.org/10.1016/j.visres.2008.10.014
- PubMed
- Google Scholar
1. Pitkow X
2. Meister M
(2012) Decorrelation and efficient coding by retinal ganglion cells
Nature Neuroscience 15:628–635.

https://doi.org/10.1038/nn.3064
- PubMed
- Google Scholar
1. Pleger B
2. Foerster AF
3. Ragert P
4. Dinse HR
5. Schwenkreis P
6. Malin JP
7. Nicolas V
8. Tegenthoff M
(2003) Functional imaging of perceptual learning in human primary and secondary somatosensory cortex
Neuron 40:643–653.

https://doi.org/10.1016/s0896-6273(03)00677-9
- PubMed
- Google Scholar
Conference
1. Plumbley MD
(2004)
Lie group methods for optimization with orthogonality constraints

In International Conference on Independent Component Analysis and Signal Separation.
- Google Scholar
(2004) The contribution of spike threshold to the dichotomy of cortical simple and complex cells
Nature Neuroscience 7:1113–1122.

https://doi.org/10.1038/nn1310
- PubMed
- Google Scholar
1. Priebe NJ
2. Ferster D
(2008) Inhibition, spike threshold, and stimulus selectivity in primary visual cortex
Neuron 57:482–497.

https://doi.org/10.1016/j.neuron.2008.02.005
- PubMed
- Google Scholar
Conference
1. Rahaman N
2. Baratin A
3. Arpit D
4. Draxler F
5. Lin M
6. Hamprecht F
7. Bengio Y
8. Courville A
(2019)
On the spectral bias of neural networks

In International Conference on Machine Learning.
- Google Scholar
1. Rao RP
2. Ballard DH
(1999) Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects
Nature Neuroscience 2:79–87.

https://doi.org/10.1038/4580
- PubMed
- Google Scholar
Book
1. Rasmussen CE
2. Williams CKI
(2005) Gaussian Processes for Machine Learning
The MIT Press.

https://doi.org/10.7551/mitpress/3206.001.0001
- Google Scholar
(1993) Plasticity in the frequency representation of primary auditory cortex following discrimination training in adult owl monkeys
The Journal of Neuroscience 13:87–103.

https://doi.org/10.1523/JNEUROSCI.13-01-00087.1993
- PubMed
- Google Scholar
1. Rumyantsev OI
2. Lecoq JA
3. Hernandez O
4. Zhang Y
5. Savall J
6. Chrapkiewicz R
7. Li J
8. Zeng H
9. Ganguli S
10. Schnitzer MJ
(2020) Fundamental bounds on the fidelity of sensory cortical coding
Nature 580:100–105.

https://doi.org/10.1038/s41586-020-2130-2
- PubMed
- Google Scholar
1. Sadtler PT
2. Quick KM
3. Golub MD
4. Chase SM
5. Ryu SI
6. Tyler-Kabara EC
7. Yu BM
8. Batista AP
(2014) Neural constraints on learning
Nature 512:423–426.

https://doi.org/10.1038/nature13665
- PubMed
- Google Scholar
Conference
(2001)
A generalized representer theorem

In Proceedings of the 14th Annual Conference on Computational Learning Theory and 5th European Conference on Computational Learning Theory, COLT ’01/EuroCOLT ’01.
- Google Scholar
Book
(2002)
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond

MIT press.
- Google Scholar
1. Schoups A
2. Vogels R
3. Qian N
4. Orban G
(2001) Practising orientation identification improves orientation coding in v1 neurons
Nature 412:549–553.

https://doi.org/10.1038/35087601
- PubMed
- Google Scholar
1. Seeman SC
2. Campagnola L
3. Davoudian PA
4. Hoggarth A
5. Hage TA
6. Bosma-Moody A
7. Baker CA
8. Lee JH
9. Mihalas S
10. Teeter C
11. Ko AL
12. Ojemann JG
13. Gwinn RP
14. Silbergeld DL
15. Cobbs C
16. Phillips J
17. Lein E
18. Murphy G
19. Koch C
20. Zeng H
21. Jarsky T
(2018) Sparse recurrent excitatory connectivity in the microcircuit of the adult mouse and human cortex
eLife 7:e37349.

https://doi.org/10.7554/eLife.37349
- Google Scholar
1. Shadlen MN
2. Newsome WT
(2001) Neural basis of a perceptual decision in the parietal cortex (area lip) of the rhesus monkey
Journal of Neurophysiology 86:1916–1936.

https://doi.org/10.1152/jn.2001.86.4.1916
- PubMed
- Google Scholar
Preprint
1. Shan H
2. Sompolinsky H
(2021) A Minimum Perturbation Theory of Deep Perceptual Learning
bioRxiv.

https://doi.org/10.1101/2021.10.05.463260
- Google Scholar
Conference
(2021) Neural tangent kernel eigenvalues accurately predict generalization
ICLR 2022 Conference.

https://openreview.net/forum?id=lycl1GD7fVP
- Google Scholar
1. Simoncelli EP
2. Heeger DJ
(1998) A model of neuronal responses in visual area MT
Vision Research 38:743–761.

https://doi.org/10.1016/s0042-6989(97)00183-1
- PubMed
- Google Scholar
1. Simoncelli EP
2. Olshausen BA
(2001) Natural image statistics and neural representation
Annual Review of Neuroscience 24:1193–1216.

https://doi.org/10.1146/annurev.neuro.24.1.1193
- PubMed
- Google Scholar
1. Sinz FH
2. Pitkow X
3. Reimer J
4. Bethge M
5. Tolias AS
(2019) Engineering a less artificial intelligence
Neuron 103:967–979.

https://doi.org/10.1016/j.neuron.2019.08.034
- PubMed
- Google Scholar
Conference
1. Sollich P
(1998) Approximate learning curves for Gaussian processes
9th International Conference on Artificial Neural Networks.

https://doi.org/10.1049/cp:19991148
- Google Scholar
Book
1. Sollich P
(2002)
Gaussian process regression with mismatched models

In: Dietterich T, Becker S, Ghahramani Z, editors. Advances in Neural Information Processing Systems. MIT Press. pp. 1–2.
- Google Scholar
(1982) Predictive coding: a fresh view of inhibition in the retina
Proceedings of the Royal Society of London. Series B, Biological Sciences 216:427–459.

https://doi.org/10.1098/rspb.1982.0085
- PubMed
- Google Scholar
(2003) Intensity versus identity coding in an olfactory system
Neuron 39:991–1004.

https://doi.org/10.1016/j.neuron.2003.08.011
- PubMed
- Google Scholar
Data
(authors) (2018a) Recordings of 10,000 neurons in visual cortex in response to 2,800 natural images
Figshare.

https://doi.org/10.25378/janelia.6845348
Preprint
(2018b) High-Dimensional Geometry of Population Responses in Visual Cortex
bioRxiv.

https://doi.org/10.1101/374090
- Google Scholar
Software
1. Stringer C
(2018c) MouseLand / stringer-pachitariu-et-al-2018b, version 79850db
GitHub.

https://github.com/MouseLand/stringer-pachitariu-et-al-2018b
Software
1. Stringer C
(2019) MouseLand / stringer-et-al-2019, version bd294c4
GitHub.

https://github.com/MouseLand/stringer-et-al-2019
(2021) High-Precision coding in visual cortex
Cell 184:2767–2778.

https://doi.org/10.1016/j.cell.2021.03.042
- Google Scholar
(2011) How to grow a mind: statistics, structure, and abstraction
Science 331:1279–1285.

https://doi.org/10.1126/science.1192788
- PubMed
- Google Scholar
1. Townsend A
2. Trefethen LN
(2015) Continuous analogues of matrix factorizations
Proceedings of the Royal Society A 471:20140585.

https://doi.org/10.1098/rspa.2014.0585
- Google Scholar
1. Treves A
2. Rolls ET
(1991) What determines the capacity of autoassociative memories in the brain?
Network 2:371–397.

https://doi.org/10.1088/0954-898X_2_4_004
- Google Scholar
Conference
(2018)
Deep learning generalizes because the parameter-function map is biased towards simple functions

In International Conference on Learning Representations.
- Google Scholar
1. van Hateren JH
(1992) A theory of maximizing sensory information
Biol Cybern 68:23–29.

https://doi.org/10.1007/BF00203134
- PubMed
- Google Scholar
(2012) Dynamics of spatial frequency tuning in mouse visual cortex
Journal of Neurophysiology 107:2937–2949.

https://doi.org/10.1152/jn.00022.2012
- PubMed
- Google Scholar
Conference
1. Widrow B
2. Hoff ME
(1960)
Adaptive switching circuits

In 1960 IRE WESCON Convention Record.
- Google Scholar
1. Willmore B
2. Tolhurst DJ
(2001) Characterizing the sparseness of neural codes
Network: Computation in Neural Systems 12:255.

https://doi.org/10.1080/net.12.3.255.270
- Google Scholar
Conference
1. Wilson AG
2. Dann C
3. Lucas C
4. Xing EP
(2015)
The human kernel

Advances in Neural Information Processing Systems.
- Google Scholar
1. Wolpert DH
(1996) The lack of a priori distinctions between learning algorithms
Neural Computation 8:1341–1390.

https://doi.org/10.1162/neco.1996.8.7.1341
- Google Scholar
Preprint
1. Xiao L
2. Pennington J
(2022) Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm
arXiv.

https://doi.org/10.48550/arXiv.2207.04612
- Google Scholar
Preprint
1. Xu ZQJ
2. Zhang Y
3. Luo T
4. Xiao Y
5. Ma Z
(2019) Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks
arXiv.

https://doi.org/10.48550/arXiv.1901.06523
- Google Scholar
1. Yang T
2. Maunsell JHR
(2004) The effect of perceptual learning on neuronal responses in monkey visual area v4
The Journal of Neuroscience 24:1617–1626.

https://doi.org/10.1523/JNEUROSCI.4442-03.2004
- PubMed
- Google Scholar
Preprint
1. Yang G
(2019) Tensor Programs i: Wide Feedforward or Recurrent Neural Networks of Any Architecture Are Gaussian Processes
arXiv.

https://doi.org/10.48550/arXiv.1910.12478
- Google Scholar
Preprint
1. Yang G
(2020) Tensor Programs Ii: Neural Tangent Kernel for Any Architecture
arXiv.

https://doi.org/10.48550/arXiv.2006.14548
- Google Scholar
Conference
1. Yang G
2. Hu EJ
(2021)
Tensor programs iv: Feature learning in infinite-width neural networks

In International Conference on Machine Learning.
- Google Scholar
1. Zador AM
(2019) A critique of pure learning and what artificial neural networks can learn from animal brains
Nature Communications 10:1–7.

https://doi.org/10.1038/s41467-019-11786-6
- Google Scholar
Conference
1. Zhang C
2. Bengio S
3. Hardt M
4. Recht B
5. Vinyals O
(2016)
Understanding deep learning requires rethinking generalization

In 5th Int. Conf. on Learning Representations (ICLR 2017).
- Google Scholar

Article and author information

Author details

Blake Bordelon
1. John A Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, United States
2. Center for Brain Science, Harvard University, Cambridge, United States
Contribution
Conceptualization, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing

Competing interests
No competing interests declared

ORCID iD: 0000-0003-0455-9445

Cengiz Pehlevan
1. John A Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, United States
2. Center for Brain Science, Harvard University, Cambridge, United States
Contribution
Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing - original draft, Project administration, Writing - review and editing

For correspondence
cpehlevan@seas.harvard.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0001-9767-6063

Funding

National Science Foundation (DMS-2134157)

Blake Bordelon
Cengiz Pehlevan

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Jacob Zavatone-Veth and Abdulkadir Canatar for useful comments and discussions about this manuscript. BB acknowledges the support of the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard (award #1764269) and the Harvard Q-Bio Initiative. CP and BB were also supported by NSF grant DMS-2134157.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

2,374 views
376 downloads
9 citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Citations by DOI

9 citations for umbrella DOI https://doi.org/10.7554/eLife.78606

Cite this article

Blake Bordelon, Cengiz Pehlevan (2022) Population codes enable learning from few examples by shaping inductive bias. eLife 11:e78606. https://doi.org/10.7554/eLife.78606

Categories and tags

Research organism

Mouse

Learning tasks through linear readouts exploit representations of the population code to approximate a target response.

The inner product kernel controls the generalization performance of readouts.

The singular value decomposition (SVD) of the population code reveals the structure and inductive bias of the code.

Reconstructing filtered natural images from V1 responses reveals preference for low spatial frequencies.

A model of V1 as a bank of Gabor filters recapitulates experimental inductive bias.

The top eigensystem of a code determines its low-P generalization error.

The performance of time-dependent codes when learning dynamical systems can be understood through spectral bias.

The biological code is more metabolically efficient than random codes with the same inductive biases.

Neural noise and subsampled neural codes can lead to overfitting.
