Task-dependent optimal representations for cerebellar learning

  1. Marjorie Xie
  2. Samuel P Muscinelli
  3. Kameron Decker Harris
  4. Ashok Litwin-Kumar  Is a corresponding author
  1. Zuckerman Mind Brain Behavior Institute, Columbia University, United States
  2. Department of Computer Science, Western Washington University, United States

Abstract

The cerebellar granule cell layer has inspired numerous theoretical models of neural representations that support learned behaviors, beginning with the work of Marr and Albus. In these models, granule cells form a sparse, combinatorial encoding of diverse sensorimotor inputs. Such sparse representations are optimal for learning to discriminate random stimuli. However, recent observations of dense, low-dimensional activity across granule cells have called into question the role of sparse coding in these neurons. Here, we generalize theories of cerebellar learning to determine the optimal granule cell representation for tasks beyond random stimulus discrimination, including continuous input-output transformations as required for smooth motor control. We show that for such tasks, the optimal granule cell representation is substantially denser than predicted by classical theories. Our results provide a general theory of learning in cerebellum-like systems and suggest that optimal cerebellar representations are task-dependent.

Editor's evaluation

Models of cerebellar function and the coding of inputs in the cerebellum often assume that random stimuli are a reasonable stand-in for real stimuli. However, the important contribution of this paper is that conclusions about optimality and sparseness in these models do not generalize to potentially more realistic sets of stimuli, for example, those drawn from a low-dimensional manifold. The mathematical analyses in the paper are convincing and possible limitations, including the abstraction from biological details, are well discussed.

https://doi.org/10.7554/eLife.82914.sa0

Introduction

A striking property of cerebellar anatomy is the vast expansion in number of granule cells compared to the mossy fibers that innervate them (Eccles et al., 1967). This anatomical feature has led to the proposal that the function of the granule cell layer is to produce a high-dimensional representation of lower-dimensional mossy fiber activity (Marr, 1969; Albus, 1971; Cayco-Gajic and Silver, 2019). In such theories, granule cells integrate information from multiple mossy fibers and respond in a nonlinear manner to different input combinations. Detailed theoretical analysis has argued that anatomical parameters such as the ratio of granule cells to mossy fibers (Babadi and Sompolinsky, 2014), the number of inputs received by individual granule cells (Litwin-Kumar et al., 2017; Cayco-Gajic et al., 2017), and the distribution of granule-cell-to-Purkinje-cell synaptic weights Brunel et al., 2004 have quantitative values that maximize the dimension of the granule cell representation and learning capacity. Sparse activity, which increases dimension, is also cited as a key property of this representation (Marr, 1969; Albus, 1971; Babadi and Sompolinsky, 2014; but see Spanne and Jörntell, 2015). Sparsity affects both learning speed (Cayco-Gajic et al., 2017) and generalization, the ability to predict correct labels for previously unseen inputs (Barak et al., 2013; Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017).

Theories that study the effects of dimension on learning typically focus on the ability of a system to perform categorization tasks with random, high-dimensional inputs (Barak et al., 2013; Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017; Cayco-Gajic et al., 2017). In this case, increasing the dimension of the granule cell representation increases the number of inputs that can be discriminated. However, cerebellar circuits participate in diverse behaviors, including dexterous movements, inter-limb coordination, the formation of internal models, and cognitive behaviors (Ito and Itō, 1984; Wolpert et al., 1998; Strick et al., 2009). Cerebellum-like circuits, such as the insect mushroom body and the electrosensory system of electric fish, support other functions such as associative learning (Modi et al., 2020) and the cancellation of self-generated sensory signals (Kennedy et al., 2014), respectively. This diversity raises the question of whether learning high-dimensional categorization tasks is a sufficient framework for probing the function of granule cells and their analogs.

Several recent studies have reported dense activity in cerebellar granule cells in response to sensory stimulation or during motor control tasks (Jörntell and Ekerot, 2006; Knogler et al., 2017; Wagner et al., 2017; Giovannucci et al., 2017; Badura and De Zeeuw, 2017; Wagner et al., 2019), at odds with classical theories (Marr, 1969; Albus, 1971). Moreover, there is evidence that granule cell firing rates differ across cerebellar regions (Heath et al., 2014; Witter and De Zeeuw, 2015). In contrast to this reported dense activity in cerebellar granule cells, odor responses in Kenyon cells, the analogs of granule cells in the Drosophila mushroom body, are sparse, with 5–10% of neurons responding to odor stimulation (Turner et al., 2008; Honegger et al., 2011; Lin et al., 2014).

We propose that these differences can be explained by the capacity of representations with different levels of sparsity to support learning of different tasks. We show that the optimal level of sparsity depends on the structure of the input-output relationship of a task. When learning input-output mappings for motor control tasks, the optimal granule cell representation is much denser than predicted by previous analyses. To explain this result, we develop an analytic theory that predicts the performance of cerebellum-like circuits for arbitrary learning tasks. The theory describes how properties of cerebellar architecture and activity control these networks’ inductive bias: the tendency of a network toward learning particular types of input-output mappings (Sollich, 1998; Jacot et al., 2018; Bordelon et al., 2020; Canatar et al., 2021b; Simon et al., 2021). The theory shows that inductive bias, rather than the dimension of the representation alone, is necessary to explain learning performance across tasks. It also suggests that cerebellar regions specialized for different functions may adjust the sparsity of their granule cell representations depending on the task.

Results

In our model, a granule cell layer of M neurons receives connections from a random subset of N mossy fiber inputs. Because MN in the cerebellar cortex and cerebellum-like structures (approximately M=200,000 and N=7,000 for the neurons presynaptic to a single Purkinje cell in the cat brain; Eccles et al., 1967), we refer to the granule cell layer as the expansion layer and the mossy fiber layer as the input layer (Figure 1A and B).

Schematic of cerebellar cortex model.

(A) Mossy fiber inputs (blue) project to granule cells (green), which send parallel fibers that contact a Purkinje cell (black). (B) Diagram of neural network model. D task variables are embedded, via a linear transformation A, in the activity of N input layer neurons. Connections from the input layer to the expansion layer are described by a synaptic weight matrix J. (C) Illustration of task subspace. Points x in a D-dimensional space of task variables are embedded in a D-dimensional subspace of the N-dimensional input layer activity n (D=2, N=3 illustrated).

A typical assumption in computational theories of the cerebellar cortex is that inputs are randomly distributed in a high-dimensional space (Marr, 1969; Albus, 1971; Brunel et al., 2004; Babadi and Sompolinsky, 2014; Billings et al., 2014; Litwin-Kumar et al., 2017). While this may be a reasonable simplification in some cases, many tasks, including cerebellum-dependent tasks, are likely best-described as being encoded by a low-dimensional set of variables. For example, the cerebellum is often hypothesized to learn a forward model for motor control (Wolpert et al., 1998), which uses sensory input and motor efference to predict an effector’s future state. Mossy fiber activity recorded in monkeys correlates with position and velocity during natural movement (van Kan et al., 1993). Sources of motor efference copies include motor cortex, whose population activity lies on a low-dimensional manifold (Wagner et al., 2019; Huang et al., 2013; Churchland et al., 2010; Yu et al., 2009). We begin by modeling the low dimensionality of inputs and later consider more specific tasks.

We therefore assume that the inputs to our model lie on a D-dimensional subspace embedded in the N-dimensional input space, where D is typically much smaller than N (Figure 1B). We refer to this subspace as the ‘task subspace’ (Figure 1C). A location in this subspace is described by a D dimensional vector x, while the corresponding input layer activity is given by n=Ax, with A an N×D matrix describing the embedding of the task variables in the input layer. An M×D effective weight matrix Jeff=JAJeff, which describes the selectivity of expansion layer neurons to task variables, is determined by A and the M×N input-to-expansion-layer synaptic weight matrix J. The activity of neurons in the expansion layer is given by:

(1) h=ϕ(Jeffxθ),

where ϕ is a rectified linear activation function ϕ(u)=max(u,0) applied element-wise. Our results also hold for other threshold-polynomial activation functions. The scalar threshold θ is shared across neurons and controls the coding level, which we denote by f, defined as the average fraction of neurons in the expansion layer that are active. We show results for f<0.5, since extremely dense codes are rarely observed in experiments (Olshausen and Field, 2004; see Discussion). For analytical tractability, we begin with the case where the entries of Jeff are independent Gaussian random variables, as in previous theories (Rigotti et al., 2013; Barak et al., 2013; Babadi and Sompolinsky, 2014). This holds when the columns of A are orthonormal (ensuring that the embedding of the task variables in the input layer preserves their geometry) and the entries of J are independent and Gaussian. Later, we will show that networks with more realistic connectivity behave similarly to this case.

Optimal coding level is task-dependent

In our model, a learning task is defined by a mapping from task variables x to an output f(x), representing a target change in activity of a readout neuron, for example a Purkinje cell. The limited scope of this definition implies our results should not strongly depend on the influence of the readout neuron on downstream circuits. The readout adjusts its incoming synaptic weights from the expansion layer to better approximate this target output. For example, for an associative learning task in which sensory stimuli are classified into categories such as appetitive or aversive, the task may be represented as a mapping from inputs to two discrete firing rates corresponding to the readout’s response to stimuli of each category (Figure 2A). In contrast, for a forward model, in which the consequences of motor commands are computed using a model of movement dynamics, an input encoding the current sensorimotor state is mapped to a continuous output representing the readout neuron’s tuning to a predicted sensory variable (Figure 2B).

Figure 2 with 4 supplements see all
Optimal coding level depends on task.

(A) A random categorization task in which inputs are mapped to one of two categories (+1 or –1). Gray plane denotes the decision boundary of a linear classifier separating the two categories. (B) A motor control task in which inputs are the sensorimotor states x(t) of an effector which change continuously along a trajectory (gray) and outputs are components of predicted future states x(t+δ). (C) Schematic of random categorization tasks with P input-category associations. The value of the target function f(x) (color) is a function of two task variables x1 and x2. (D) Schematic of tasks involving learning a continuously varying Gaussian process target parameterized by a length scale γ. (E) Error rate as a function of coding level for networks trained to perform random categorization tasks similar to (C). Arrows mark estimated locations of minima. (F) Error as a function of coding level for networks trained to fit target functions sampled from Gaussian processes. Curves represent different values of the length scale parameter γ. Standard error of the mean is computed across 20 realizations of network weights and sampled target functions in (E) and 200 in (F).

To examine how properties of the expansion layer representation influence learning performance across tasks, we designed two families of tasks: one modeling categorization of random stimuli, which is often used to study the performance of expanded neural representations (Rigotti et al., 2013; Barak et al., 2013; Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017; Cayco-Gajic et al., 2017), and the other modeling learning of a continuously varying output. The former we refer to as a ‘random categorization task’ and is parameterized by the number of input pattern-to-category associations P learned during training (Figure 2C). During the training phase, the network learns to associate random input patterns xμRD for μ=1,,P with random binary categories yμ=±1. The elements of xμ are drawn i.i.d. from a normal distribution with mean 0 and variance 1/D. We refer to xμ as ‘training patterns’. To assess the network’s generalization performance, it is presented with ‘test patterns’ generated by adding noise (parameterized by a noise magnitude ϵ; see Methods) to the training patterns. Tasks with continuous outputs (Figure 2D) are parameterized by a length scale that determines how quickly the output changes as a function of the input (specifically, input-output functions are drawn from a Gaussian process with length scale γ for variations in f(x) as a function of x; see Methods). In this case, both training and test patterns are drawn uniformly on the unit sphere. Later, we will also consider tasks implemented by specific cerebellum-like systems. See Table 1 for a summary of parameters throughout this study.

Table 1
Summary of simulation parameters.

M: number of expansion layer neurons. N: number of input layer neurons. K: number of connections from input layer to a single expansion layer neuron. S: total number of connections from input to expansion layer. f: expansion layer coding level. D: number of task variables. P: number of training patterns. γ: Gaussian process length scale. ϵ: magnitude of noise for random categorization tasks. We do not report N and K for simulations in which Jeff contains Gaussian i.i.d. elements as results do not depend on these parameters in this case.

Figure panelNetwork parametersTask parameters
Figure 2EM=10,000D=50,P=1,000,ϵ=0.1
Figures 2F, 4G and 5B (full)M=200,000D=3,P=30
Figure 5B and EM=200,000,N=7,000,K=4D=3,P=30
Figure 6AS=MK=10,000,N=100,f=0.3D=3,P=200
Figure 6BN=700,K=4,f=0.3D=3,P=200
Figure 6CM=5,000,f=0.3D=3,P=100,γ=1
Figure 6DM=1,000D=3,P=50
Figure 7AM=20,000D=6,P=100; see Methods
Figure 7BM=10,000,N=50,K=7D=50,P=100,ϵ=0.1
Figure 7CM=20,000,N=206,1K3see Methods
Figure 7DM=20,000,N=K=24D=1,P=30; see Methods
Figure 2—figure supplement 1M=10,000See Figure
Figure 2—figure supplement 2M=20,000D=3,P=30
Figure 2—figure supplement 3M=20,000D=3,P=30
Figure 2—figure supplement 4M=20,000D=3
Figure 7—figure supplement 1M=20,000D=3,P=200
Figure 7—figure supplement 2M=10,000,f=0.3D=3,P=30,γ=1

We trained the readout to approximate the target output for training patterns and generalize to unseen test patterns. The network’s prediction is f^(x)=wh(x) for tasks with continuous outputs, or f^(x)=sign(wh(x)) for categorization tasks, where w are the synaptic weights of the readout from the expansion layer. These weights were set using least squares regression. Performance was measured as the fraction of incorrect predictions for categorization tasks, or relative mean squared error for tasks with continuous targets: Error=E[(f(x)f^(x))2]E[f(x)2], where the expectation is across test patterns.

We began by examining the dependence of learning performance on the coding level of the expansion layer. For random categorization tasks, performance is maximized at low coding levels (Figure 2E), consistent with previous results (Barak et al., 2013; Babadi and Sompolinsky, 2014). The optimal coding level remains below 0.1 in the model, regardless of the number of associations P, the level of input noise, and the dimension D (Figure 2—figure supplement 1). For continuously varying outputs, the dependence is qualitatively different (Figure 2F). The optimal coding level depends strongly on the length scale, with learning performance for slowly varying functions optimized at much higher coding levels than quickly varying functions. This dependence holds for different choices of threshold-nonlinear functions (Figure 2—figure supplement 2) or input dimension (Figure 2—figure supplement 3) and is most pronounced when the number of training patterns is limited (Figure 2—figure supplement 4). Our observations are at odds with previous theories of the role of sparse granule cell representations (Marr, 1969; Albus, 1971; Babadi and Sompolinsky, 2014; Billings et al., 2014) and show that sparse activity does not always optimize performance for this broader set of tasks.

Geometry of the expansion layer representation

To determine how the optimal coding level depends on the task, we begin by quantifying how the expansion layer transforms the geometry of the task subspace. Later we will address how this transformation affects the ability of the network to learn a target. For ease of analysis, we will assume for now that inputs are normalized, x=1, so that they lie on the surface of a sphere in D dimensions. The set of neurons in the expansion layer activated by an input x are those neurons i for which the alignment of their effective weights with the input, Jieffx, exceeds the activation threshold θ (Equation 1; Figure 3A). Increasing θ reduces the size of this set of neurons and hence reduces the coding level.

Effect of coding level on the expansion layer representation.

(A) Effect of activation threshold on coding level. A point on the surface of the sphere represents a neuron with effective weights Jieff. Blue region represents the set of neurons activated by x, i.e., neurons whose input exceeds the activation threshold θ (inset). Darker regions denote higher activation. (B) Effect of coding level on the overlap between population responses to different inputs. Blue and red regions represent the neurons activated by x and x, respectively. Overlap (purple) represents the set of neurons activated by both stimuli. High coding level leads to more active neurons and greater overlap. (C) Kernel K(x,x) for networks with rectified linear activation functions (Equation 1), normalized so that fully overlapping representations have an overlap of 1, plotted as a function of overlap in the space of task variables. The vertical axis corresponds to the ratio of the area of the purple region to the area of the red or blue regions in (B). Each curve corresponds to the kernel of an infinite-width network with a different coding level f. (D) Dimension of the expansion layer representation as a function of coding level for a network with M=10,000 and D=3.

Different inputs activate different sets of neurons, and more similar inputs activate sets with greater overlap. As the coding level is reduced, this overlap is also reduced (Figure 3B). In fact, this reduction in overlap is greater than the reduction in number of neurons that respond to either of the individual inputs, reflecting the fact that representations with low coding levels perform ‘pattern separation’ (Figure 3B, compare purple and red or blue regions).

This effect is summarized by the ‘kernel’ of the network (Schölkopf and Smola, 2002; Rahimi and Recht, 2007), which measures overlap of representations in the expansion layer as a function of the task variables:

(2) K(x,x)=1Mh(x)h(x).

Equations 1 and 2 show that the threshold θ, which determines the coding level, influences the kernel through its effect on the expansion layer activity h(x). When inputs are normalized and the effective weights are Gaussian, we compute a semi-analytic expression for the kernel of the expansion layer in the limit of a large expansion (M; see Appendix). In this case, the kernel depends only on the overlap of the task variables, K(x,x)=K(xx). Plotting the kernel for different choices of coding level demonstrates that representations with lower coding levels exhibit greater pattern separation (Figure 3C; Babadi and Sompolinsky, 2014). This is consistent with the observation that decreasing the coding level increases the dimension of the representation (Figure 3D).

Frequency decomposition of kernel and task explains optimal coding level

We now relate the geometry of the expansion layer representation to performance across the tasks we have considered. Previous studies focused on high-dimensional, random categorization tasks in which inputs belong to a small number of well-separated clusters whose centers are random uncorrelated patterns. Generalization is assessed by adding noise to previously observed training patterns (Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017; Figure 4A). In this case, performance depends only on overlaps at two spatial scales: the overlap between training patterns belonging to different clusters, which is small, and the overlap between training and test patterns belonging to the same cluster, which is large (Figure 4B). For such tasks, the kernel evaluated near these two values—specifically, the behavior of K(xx) near xx=0 and xx=1Δ, where Δ is a measure of within-cluster noise—fully determines generalization performance (Figure 4C; see Appendix). Sparse expansion layer representations reduce the overlap of patterns belonging to different clusters, increasing dimension and generalization performance (Figure 3D, Figure 2E).

Figure 4 with 2 supplements see all
Frequency decomposition of network and target function.

(A) Geometry of high-dimensional categorization tasks where input patterns are drawn from random, noisy clusters (light regions). Performance depends on overlaps between training patterns from different clusters (green) and on overlaps between training and test patterns from the same cluster (orange). (B) Distribution of overlaps of training and test patterns in the space of task variables for a high-dimensional task (D=200) with random, clustered inputs as in (A) and a low-dimensional task (D=5) with inputs drawn uniformly on a sphere. (C) Overlaps in (A) mapped onto the kernel function. Overlaps between training patterns from different clusters are small (green). Overlaps between training and test patterns from the same cluster are large (orange). (D) Schematic illustration of basis function decomposition, for eigenfunctions on a square domain. (E) Kernel eigenvalues (normalized by the sum of eigenvalues across modes) as a function of frequency for networks with different coding levels. (F) Power cα2 as a function of frequency for Gaussian process target functions. Curves represent different values of γ, the length scale of the Gaussian process. Power is averaged over 20 realizations of target functions. (G) Generalization error predicted using kernel eigenvalues (E) and target function decomposition (F) for the three target function classes shown in (F). Standard error of the mean is computed across 100 realizations of network weights and target functions.

We study tasks where training patterns used for learning and test patterns used to assess generalization are both drawn according to a distribution over a low-dimensional space of task variables. While the mean overlap between pairs of random patterns remains zero regardless of dimension, fluctuations around the mean increase when the space is low dimensional, leading to a broader distribution of overlaps (Figure 4B). In this case, generalization performance depends on values of the kernel function evaluated across this entire range of overlaps. Methods from the theory of kernel regression (Sollich, 1998; Jacot et al., 2018; Bordelon et al., 2020; Canatar et al., 2021b; Simon et al., 2021) capture these effects by quantifying a network’s performance on a learning task through a decomposition of the target function into a set of basis functions (Figure 4D). Performance is assessed by summing the contribution of each mode in this decomposition to generalization error.

The decomposition expresses the kernel as a sum of eigenfunctions weighted by eigenvalues, K(x,x)=αλαψα(x)ψα(x). The eigenfunctions are determined by the network architecture and the distribution of inputs. As we show below, the eigenvalues λα determine the ease with which each corresponding eigenfunction ψα(x)—one element of the basis function decomposition—is learned by the network. Under our present assumptions of Gaussian effective weights and uniformly distributed, normalized input patterns, the eigenfunctions are the spherical harmonic functions. These functions are ordered by increasing frequency, with higher frequencies corresponding to functions that vary more quickly as a function of the task variables. Spherical harmonics are defined for any input dimension; for example, in two dimensions they are the Fourier modes. We find that coding level substantially changes the frequency dependence of the eigenvalues associated with these eigenfunctions (Figure 4E). Higher coding levels increase the relative magnitude of the low frequency eigenvalues compared to high-frequency eigenvalues. As we will show, this results in a different inductive bias for networks with different coding levels.

To calculate learning performance for an arbitrary task, we decompose the target function in the same basis as that of the kernel:

(3) f(x)=αcαψα(x)

The coefficient cα quantifies the weight of mode α in the decomposition. For the Gaussian process targets, we have considered, increasing length scale corresponds to a greater relative contribution of low versus high frequency modes (Figure 4F). Using these coefficients and the eigenvalues (Figure 4E), we obtain an analytical prediction of the mean-squared generalization error (‘Error’) for learning any given task (Figure 4G; see Methods):

(4) Error=C1α(cαC2+λα)2,

where C1 and C2 do not depend on α (Canatar et al., 2021b; Simon et al., 2021; see Methods). Equation 4 illustrates that for equal values of cα, modes with greater λα contribute less to the generalization error.

Our theory reveals that the optima observed in Figure 2F are a consequence of the difference in eigenvalues of networks with different coding levels. This reflects an inductive bias (Sollich, 1998; Jacot et al., 2018; Bordelon et al., 2020; Canatar et al., 2021b; Simon et al., 2021) of networks with low and high coding levels toward the learning of high and low frequency functions, respectively (Figure 4E, Figure 4—figure supplement 1). Thus, the coding level’s effect on a network’s inductive bias, rather than dimension alone, determines learning performance. Previous studies that focused only on random categorization tasks did not observe this dependence, since errors in such tasks are dominated by the learning of high frequency components, for which sparse activity is optimal (Figure 4—figure supplement 2).

Performance of sparsely connected expansions

To simplify our analysis, we have so far assumed full connectivity between input and expansion layers without a constraint on excitatory or inhibitory synaptic weights. In particular, we have assumed that the effective weight matrix Jeff contains independent Gaussian entries (Figure 5A, top). However, synaptic connections between mossy fibers and granule cells are sparse and excitatory (Sargent et al., 2005), with a typical in-degree of K=4 mossy fibers per granule cell (Figure 5A, bottom). We therefore analyzed the performance of model networks with more realistic connectivity. Surprisingly, when J is sparse and nonnegative, both overall generalization performance and the task-dependence of optimal coding level remain unchanged (Figure 5B).

Performance of networks with sparse connectivity.

(A) Top: Fully connected network. Bottom: Sparsely connected network with in-degree K<N and excitatory weights with global inhibition onto expansion layer neurons. (B) Error as a function of coding level for fully connected Gaussian weights (gray curves) and sparse excitatory weights (blue curves). Target functions are drawn from Gaussian processes with different values of length scale γ as in Figure 2. (C) Distributions of synaptic weight correlations Corr(Jieff,Jjeff), where Jieff is the ith row of Jeff, for pairs of expansion layer neurons in networks with different numbers of input layer neurons N (colors) when K=4 and D=3. Black distribution corresponds to fully connected networks with Gaussian weights. We note that when D=3, the distribution of correlations for random Gaussian weight vectors is uniform on [-1,1] as shown (for higher dimensions the distribution has a peak at 0). (D) Schematic of the selectivity of input layer neurons to task variables in distributed and clustered representations. (E) Error as a function of coding level for networks with distributed (black, same as in B) and clustered (orange) representations. (F) Distributions of Corr(Jieff,Jjeff) for pairs of expansion layer neurons in networks with distributed and clustered input representations when K=4, D=3, and N=1,000. Standard error of the mean was computed across 200 realizations in (B) and 100 in (E), orange curve.

To understand this result, we examined how J and A shape the statistics of the effective weights onto the expansion layer neurons Jeff. A desirable property of the expansion layer representation is that these effective weights sample the space of task variables uniformly (Figure 3A), increasing the heterogeneity of tuning of expansion layer neurons (Litwin-Kumar et al., 2017). This occurs when Jeff is a matrix of independent random Gaussian entries. If the columns of A are orthornormal and J is fully-connected with independent Gaussian entries, Jeff has this uniform sampling property.

However, when J is sparse and nonnegative, expansion layer neurons that share connections from the same input layer neurons receive correlated input currents. When N is small and A is random, fluctuations in A lead to biases in the input layer’s sampling of task variables which are inherited by the expansion layer. We quantify this by computing the distribution of correlations between the effective weights for pairs of expansion layer neurons, Corr(Jieff,Jjeff). This distribution indeed deviates from uniform sampling when N is small (Figure 5C). However, even when N is moderately large (but much less than M), only small deviations from uniform sampling of task variables occur for low dimensional tasks as long as D<KN (see Appendix). In contrast, for high-dimensional tasks (DN), KD is sufficient, in agreement with previous findings (Litwin-Kumar et al., 2017). For realistic cerebellar parameters (N=7,000 and K=4), the distribution is almost indistinguishable from that corresponding to uniform sampling (Figure 5C), consistent with the similar learning performance of these two cases (Figure 5B).

In the above analysis, an important assumption is that A is dense and random, so that the input layer forms a distributed representation in which each input layer neuron responds to a random combination of task variables (Figure 5D, top). If, on the other hand, the input layer forms a clustered representation containing groups of neurons that each encode a single task variable (Figure 5D, bottom), we may expect different results. Indeed, with a clustered representation, sparse connectivity dramatically reduces performance (Figure 5E). This is because the distribution of Corr(Jieff,Jjeff) deviates substantially from that corresponding to uniform sampling (Figure 5F), even as N (see Appendix). Specifically, increasing N does not reduce the probability of two expansion layer neurons being connected to input layer neurons that encode the same task variables and therefore receiving highly correlated currents. As a result, expansion layer neurons do not sample task variables uniformly and performance is dramatically reduced.

Our results show that networks with small K, moderately large N, and a distributed input layer representation approach the performance of networks that sample task variables uniformly. This equivalence validates the applicability of our theory to these more realistic networks. It also argues for the importance of distributed sensorimotor representations in the cortico-cerebellar pathway, consistent with the distributed nature of representations in motor cortex (Shenoy et al., 2013; Muscinelli et al., 2023).

Optimal cerebellar architecture is consistent across tasks

A history of theoretical modeling has shown a remarkable correspondence between anatomical properties of the cerebellar cortex and model parameters optimal for learning. These include the in-degree K of granule cells (Marr, 1969; Litwin-Kumar et al., 2017; Cayco-Gajic et al., 2017), the expansion ratio of the granule cells to the mossy fibers M/N(Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017), and the distribution of synaptic weights from granule cells to Purkinje cells (Brunel et al., 2004; Clopath et al., 2012; Clopath and Brunel, 2013). In these studies, model performance was assessed using random categorization tasks. We have shown that optimal coding level is dependent on the task being learned, raising the question of whether optimal values of these architectural parameters are also task-dependent.

Sparse connectivity (K=4, consistent with the typical in-degree of cerebellar granule cells) has been shown to optimize learning performance in cerebellar cortex models (Litwin-Kumar et al., 2017; Cayco-Gajic et al., 2017). We examined the performance of networks with different granule cell in-degrees learning Gaussian process targets. The optimal in-degree is small for all the tasks we consider, suggesting that sparse connectivity is sufficient for high performance across a range of tasks (Figure 6A). This is consistent with the previous observation that the performance of a sparsely connected network approaches that of a fully connected network (Figure 5B).

Task-independence of optimal anatomical parameters.

(A) Error as a function of in-degree K for networks learning Gaussian process targets. Curves represent different values of γ, the length scale of the Gaussian process. The total number of synaptic connections S=MK is held constant. This constraint introduces a trade-off between having many neurons with small synaptic degree and having fewer neurons with large synaptic degree (Litwin-Kumar et al., 2017). S=104, D=3, f=0.3. (B) Error as a function of expansion ratio M/N for networks learning Gaussian process targets. D=3, N=700, f=0.3. (C) Distribution of granule-cell-to-Purkinje cell weights w for a network trained on nonnegative Gaussian process targets with f=0.3, D=3, γ=1. Granule-cell-to-Purkinje cell weights are constrained to be nonnegative (Brunel et al., 2004). (D) Fraction of granule-cell-to-Purkinje cell weights that are silent in networks learning nonnegative Gaussian process targets (blue) and random categorization tasks (gray).

Previous studies also showed that the expansion ratio from mossy fibers to granule cells M/N controls the dimension of the granule cell representation (Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017). The dimension increases with expansion ratio but saturates as expansion ratio approaches the anatomical value (M/N30 when f0.1 for the inputs presynaptic to an individual Purkinje cell). These studies assumed that mossy fiber activity is uncorrelated (D=N) rather than low-dimensional (D<N). This raises the question of whether a large expansion is beneficial when D is small. We find that when the number of training patterns P is sufficiently large, performance still improves as M/N approaches its anatomical value, showing that Purkinje cells can exploit their large number of presynaptic inputs even in the case of low-dimensional activity (Figure 6B).

Brunel et al., 2004 showed that the distribution of granule-cell-to-Purkinje cell synaptic weights is consistent with the distribution that maximizes the number of random binary input-output mappings stored. This distribution exhibits a substantial fraction of silent synapses, consistent with experiments. These results also hold for analog inputs and outputs (Clopath and Brunel, 2013) and for certain forms of correlations among binary inputs and outputs (Clopath et al., 2012). However, the case we consider, where targets are a smoothly varying function of task variables, has not been explored. We observe a similar weight distribution for these tasks (Figure 6C), with the fraction of silent synapses remaining high across coding levels (Figure 6D). The fraction of silent synapses is lower for networks learning Gaussian process targets than those learning random categorization tasks, consistent with the capacity of a given network for learning such targets being larger (Clopath et al., 2012).

Although optimal coding level is task-dependent, these analyses suggest that optimal architectural parameters are largely task-independent. Whereas coding level tunes the inductive bias of the network to favor the learning of specific tasks, these architectural parameters control properties of the representation that improve performance across tasks. In particular, sparse connectivity and a large expansion support uniform sampling of low-dimensional task variables (consistent with Figure 5C), while a large fraction of silent synapses is a consequence of a readout that maximizes learning performance (Brunel et al., 2004).

Modeling specific behaviors dependent on cerebellum-like structures

So far, we have considered analytically tractable families of tasks with parameterized input-output functions. Next, we extend our results to more realistic tasks constrained by the functions of specific cerebellum-like systems, which include both highly structured, continuous input-output mappings and random categorization tasks.

To model the cerebellum’s role in predicting the consequences of motor commands (Wolpert et al., 1998), we examined the optimal coding level for learning the dynamics of a two-joint arm (Fagg et al., 1997). Given an initial state, the network predicts the change in the future position of the arm (Figure 7A). Performance is optimized at substantially higher coding levels than for random categorization tasks, consistent with our previous results for continuous input-output mappings (Figure 2E and F).

Figure 7 with 2 supplements see all
Optimal coding level across tasks and neural systems.

(A) Left: Schematic of two-joint arm. Center: Cerebellar cortex model in which sensorimotor task variables at time t are used to predict hand position at time t+δ. Right: Error as a function of coding level. Black arrow indicates location of optimum. Dashed line indicates performance of a readout of the input layer. (B) Left: Odor categorization task. Center: Drosophila mushroom body model in which odors activate olfactory projection neurons and are associated with a binary category (appetitive or aversive). Right: Error rate, similar to (A), right. (C) Left: Schematic of electrosensory system of the mormyrid electric fish, which learns a negative image to cancel the self-generated feedback from electric organ discharges sensed by electroreceptors. Center: Electrosensory lateral line lobe (ELL) model in which MG cells learn a negative image. Right: Error as a function of coding level. Gray arrow indicates location of coding level estimated from biophysical parameters (Kennedy et al., 2014). (D) Left: Schematic of the vestibulo-cular reflex (VOR). Head rotations with velocity H trigger eye motion in the opposite direction with velocity E. During VOR adaptation, organisms adapt to different gains (E/H). Center: Cerebellar cortex model in which the target function is the Purkinje cell’s firing rate as a function of head velocity. Right: Error, similar to (A), right.

The mushroom body, a cerebellum-like structure in insects, is required for learning of associations between odors and appetitive or aversive valence (Modi et al., 2020). This behavior can be represented as a mapping from random representations of odors in the input layer to binary category labels (Figure 7B). The optimal coding level in a model with parameters consistent with the Drosophila mushroom body is less than 0.1, consistent with our previous results for random categorization tasks (Figure 2E) and the sparse odor-evoked responses in Drosophila Kenyon cells (Turner et al., 2008; Honegger et al., 2011; Lin et al., 2014).

The prediction and cancellation of self-generated sensory feedback has been studied extensively in mormyrid weakly electric fish and depends on the electrosensory lateral line lobe (ELL), a cerebellum-like structure (Bell et al., 2008). Granule cells in the ELL provide a temporal basis for generating negative images that are used to cancel self-generated feedback (Figure 7C). We extended a detailed model of granule cells and their inputs (Kennedy et al., 2014) to study the influence of coding level on the effectiveness of this basis. The performance of this model saturated at relatively high coding levels, and notably the coding level corresponding to biophysical parameters estimated from data coincided with the value at which further increases in performance were modest. This observation suggests that coding level is also optimized for task performance in this system.

A canonical function of the mammalian cerebellum is the adjustment of the vestibulo-ocular reflex (VOR), in which motion of the head is detected and triggers compensatory ocular motion in the opposite direction. During VOR learning, Purkinje cells are tuned to head velocity, and their tuning curves are described as piecewise linear functions (Lisberger et al., 1994; Figure 7D). Although in vivo population recordings of granule cells during VOR adaptation are not, to our knowledge, available for comparison, our model predicts that performance for learning such tuning curves is high across a range of coding levels and shows that sparse codes are sufficient (although not necessary) for such tasks (Figure 7D).

These results predict diverse coding levels across different behaviors dependent on cerebellum-like structures. The odor categorization and VOR tasks both have input-output mappings that exhibit sharp nonlinearities and can be efficiently learned using sparse representations. In contrast, the forward modeling and feedback cancellation tasks have smooth input-output mappings and exhibit denser optima. These observations are consistent with our previous finding that more structured tasks favor denser coding levels than do random categorization tasks (Figure 2E and F).

Discussion

We have shown that the optimal granule cell coding level depends on the task being learned. While sparse representations are suitable for learning to categorize inputs into random categories, as predicted by classic theories, tasks involving structured input-output mappings benefit from denser representations (Figure 2). This reconciles such theories with the observation of dense granule cell activation during movement (Knogler et al., 2017; Wagner et al., 2017; Giovannucci et al., 2017; Badura and De Zeeuw, 2017; Wagner et al., 2019). We also show that, in contrast to the task-dependence of optimal coding level, optimal anatomical values of granule cell and Purkinje cell connectivity are largely task-independent (Figure 6). This distinction suggests that a stereotyped cerebellar architecture may support diverse representations optimized for a variety of learning tasks.

Relationship to previous theories

Previous studies assessed the learning performance of cerebellum-like systems with a model Purkinje cell that associates random patterns of mossy fiber activity with one of two randomly assigned categories (Marr, 1969; Albus, 1971; Brunel et al., 2004; Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017; Cayco-Gajic et al., 2017), a common benchmark for artificial learning systems (Gerace et al., 2022). In this case, a low coding level increases the dimension of the granule cell representation, permitting more associations to be stored and improving generalization to previously unseen inputs. The optimal coding level is low but not arbitrarily low, as extremely sparse representations introduce noise that hinders generalization (Barak et al., 2013; Babadi and Sompolinsky, 2014).

To examine a broader family of tasks, our learning problems extend previous studies in several ways. First, we consider inputs that may be constrained to a low-dimensional task subspace. Second, we consider input-output mappings beyond random categorization tasks. Finally, we assess generalization error for arbitrary locations on the task subspace, rather than only for noisy instances of previously presented inputs. As we have shown, these considerations require a complete analysis of the inductive bias of cerebellum-like networks (Figure 4). Our analysis generalizes previous approaches (Barak et al., 2013; Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017) that focused on dimension and noise alone. In particular, both dimension and noise for random patterns can be directly calculated from the kernel function (Figure 3C; see Appendix).

Our theory builds upon techniques that been developed for understanding properties of kernel regression (Sollich, 1998; Jacot et al., 2018; Bordelon et al., 2020; Canatar et al., 2021b; Simon et al., 2021). Kernel approximations of wide neural networks are a major area of current research providing analytically tractable theories (Rahimi and Recht, 2007; Jacot et al., 2018; Chizat et al., 2018). Prior studies have analyzed kernels corresponding to networks with zero (Cho and Saul, 2010) or mean-zero Gaussian thresholds (Basri et al., 2019; Jacot et al., 2018), which in both cases produce networks with a coding level of 0.5. Ours is the first kernel study of the effects of nonzero average thresholds. Our full characterization of the eigenvalue spectra and their decay rates as a function of the threshold extends previous work (Bach, 2017; Bietti and Bach, 2021). Furthermore, artificial neural network studies typically assume either fully-connected or convolutional layers, yet pruning connections after training barely degrades performance (Han et al., 2015; Zhang et al., 2018). Our results support the idea that sparsely connected networks may behave like dense ones if the representation is distributed (Figure 5), providing insight into the regimes in which pruning preserves performance.

Other studies have considered tasks with smooth input-output mappings and low-dimensional inputs, finding that heterogeneous Golgi cell inhibition can improve performance by diversifying individual granule cell thresholds (Spanne and Jörntell, 2013). Extending our model to include heterogeneous thresholds is an interesting direction for future work. Another proposal states that dense coding may improve generalization (Spanne and Jörntell, 2015). Our theory reveals that whether or not dense coding is beneficial depends on the task.

Assumptions and extensions

We have made several assumptions in our model for the sake of analytical tractability. When comparing the inductive biases of networks with different coding levels, our theory assumes that inputs are normalized and distributed uniformly in a linear subspace of the input layer activity. This allows us to decompose the target function into a basis in which we can directly compare eigenvalues, and hence learning performance, for different coding levels (Figure 4E–G). A similar analysis can be performed when inputs are not uniformly distributed, but in this case the basis is determined by an interplay between this distribution and the nonlinearity of expansion layer neurons, making the analysis more complex (see Appendix). We have also assumed that generalization is assessed for inputs drawn from the same distribution as used for learning. Recent and ongoing work on out-of-distribution generalization may permit relaxations of this assumption (Shen et al., 2021; Canatar et al., 2021a).

When analyzing properties of the granule cell layer, our theory also assumes an infinitely wide expansion. When P is small enough that performance is limited by number of samples, this assumption is appropriate, but finite-size corrections to our theory are an interesting direction for future work. We also have not explicitly modeled inhibitory input provided by Golgi cells, instead assuming such input can be modeled as a change in effective threshold, as in previous studies (Billings et al., 2014; Cayco-Gajic et al., 2017; Litwin-Kumar et al., 2017). This is appropriate when considering the dimension of the granule cell representation (Litwin-Kumar et al., 2017), but more work is needed to extend our model to the case of heterogeneous inhibition.

Another key assumption concerning the granule cells is that they sample mossy fiber inputs randomly, as is typically assumed in Marr-Albus models (Marr, 1969; Albus, 1971; Litwin-Kumar et al., 2017; Cayco-Gajic et al., 2017). Other studies instead argue that granule cells sample from mossy fibers with highly similar receptive fields (Garwicz et al., 1998; Brown and Bower, 2001; Jörntell and Ekerot, 2006) defined by the tuning of mossy fiber and climbing fiber inputs to cerebellar microzones (Apps et al., 2018). This has led to an alternative hypothesis that granule cells serve to relay similarly tuned mossy fiber inputs and enhance their signal-to-noise ratio (Jörntell and Ekerot, 2006; Gilbert and Chris Miall, 2022) rather than to re-encode inputs. Another hypothesis is that granule cells enable Purkinje cells to learn piece-wise linear approximations of nonlinear functions (Spanne and Jörntell, 2013). However, several recent studies support the existence of heterogeneous connectivity and selectivity of granule cells to multiple distinct inputs at the local scale (Huang et al., 2013; Ishikawa et al., 2015). Furthermore, the deviation of the predicted dimension in models constrained by electron-microscopy data as compared to randomly wired models is modest (Nguyen et al., 2023). Thus, topographically organized connectivity at the macroscopic scale may coexist with disordered connectivity at the local scale, allowing granule cells presynaptic to an individual Purkinje cell to sample heterogeneous combinations of the subset of sensorimotor signals relevant to the tasks that Purkinje cell participates in. Finally, we note that the optimality of dense codes for learning slowly varying tasks in our theory suggests that observations of a lack of mixing (Jörntell and Ekerot, 2002) for such tasks are compatible with Marr-Albus models, as in this case nonlinear mixing is not required.

We have quantified coding level by the fraction of neurons that are above firing threshold. We focused on coding levels f<0.5, as extremely dense codes are rarely found in experiments (Olshausen and Field, 2004), but our theory applies for f>0.5 as well. In general, representations with coding levels of f and 1-f perform similarly in our model due to the symmetry of most of their associated eigenvalues (Figure 4—figure supplement 1 and Appendix). Under the assumption that the energetic costs associated with neural activity are minimized, the f<0.5 region is likely the biologically plausible one. We also note that coding level is most easily defined when neurons are modeled as rate, rather than spiking units. To investigate the consistency of our results under a spiking code, we implemented a model in which granule cell spiking exhibits Poisson variability and quantify coding level as the fraction of neurons that have nonzero spike counts (Figure 7—figure supplement 1; Figure 7C). In general, increased spike count leads to improved performance as noise associated with spiking variability is reduced. Granule cells have been shown to exhibit reliable burst responses to mossy fiber stimulation (Chadderton et al., 2004), motivating models using deterministic responses or sub-Poisson spiking variability. However, further work is needed to quantitatively compare variability in model and experiment and to account for more complex biophysical properties of granule cells (Saarinen et al., 2008).

For the Purkinje cells, our model assumes that their responses to granule cell input can be modeled as an optimal linear readout. Our model therefore provides an upper bound to linear readout performance, a standard benchmark for the quality of a neural representation that does not require assumptions on the nature of climbing fiber-mediated plasticity, which is still debated. Electrophysiological studies have argued in favor of a linear approximation (Brunel et al., 2004). To improve the biological applicability of our model, we implemented an online climbing fiber-mediated learning rule and found that optimal coding levels are still task-dependent (Figure 7—figure supplement 2). We also note that although we model several timing-dependent tasks (Figure 7), our learning rule does not exploit temporal information, and we assume that temporal dynamics of granule cell responses are largely inherited from mossy fibers. Integrating temporal information into our model is an interesting direction for future investigation.

Implications for cerebellar representations

Our results predict that qualitative differences in the coding levels of cerebellum-like systems, across brain regions or across species, reflect an optimization to distinct tasks (Figure 7). However, it is also possible that differences in coding level arise from other physiological differences between systems. In the Drosophila mushroom body, which is required for associative learning of odor categories, random and sparse subsets of Kenyon cells are activated in response to odor stimulation, consistent with our model (Figure 7B; Turner et al., 2008; Honegger et al., 2011; Lin et al., 2014). In a model of the electrosensory system of the electric fish, the inferred coding level of a model constrained by the properties of granule cells is similar to that which optimizes task performance (Figure 7C). Within the cerebellar cortex, heterogeneity in granule cell firing has been observed across cerebellar lobules, associated with both differences in intrinsic properties (Heath et al., 2014) and mossy fiber input (Witter and De Zeeuw, 2015). It would be interesting to correlate such physiological heterogeneity with heterogeneity in function across the cerebellum. Our model predicts that regions involved in behaviors with substantial low-dimensional structure, for example smooth motor control tasks, may exhibit higher coding levels than regions involved in categorization or discrimination of high-dimensional stimuli.

Our model also raises the possibility that individual brain regions may exhibit different coding levels at different moments in time, depending on immediate behavioral or task demands. Multiple mechanisms could support the dynamic adjustment of coding level, including changes in mossy fiber input (Ozden et al., 2012), Golgi cell inhibition (Eccles et al., 1966; Palay and Chan-Palay, 1974), retrograde signaling from Purkinje cells (Kreitzer and Regehr, 2001), or unsupervised plasticity of mossy fiber-to-granule cell synapses (Schweighofer et al., 2001). The predictions of our model are not dependent on which of these mechanisms are active. A recent study demonstrated that local synaptic inhibition by Golgi cells controls the spiking threshold and hence the population coding level of cerebellar granule cells in mice (Fleming et al., 2022). Further, the authors observed that granule cell responses to sensory stimuli are sparse when movement-related selectivity is controlled for. This suggests that dense movement-related activity and sparse sensory-evoked activity are not incompatible.

While our analysis makes clear qualitative predictions concerning comparisons between the optimal coding levels for different tasks, in some cases it is also possible to make quantitative predictions about the location of the optimum for a single task. Doing so requires determining the appropriate time interval over which to measure coding level, which depends on the integration time constant of the readout neuron. It also requires estimates of the firing rates and biophysical properties of the expansion layer neurons. In the electrosensory system, for which a well-calibrated model exists and the learning objective is well-characterized (Kennedy et al., 2014), we found that the coding level estimated based on the data is similar to that which optimizes performance (Figure 7C).

If coding level is task-optimized, our model predicts that manipulating coding level artificially will diminish performance. In the Drosophila mushroom body, disrupting feedback inhibition from the GABAergic anterior paired lateral neuron onto Kenyon cells increases coding level and impairs odor discrimination (Lin et al., 2014). A recent study demonstrated that blocking inhibition from Golgi cells onto granule cells results in denser granule cell population activity and impairs performance on an eye-blink conditioning task (Fleming et al., 2022). These examples demonstrate that increasing coding level during sensory discrimination tasks, for which sparse activity is optimal, impairs performance. Our theory predicts that decreasing coding level during a task for which dense activity is optimal, such as smooth motor control, would also impair performance.

While dense activity has been taken as evidence against theories of combinatorial coding in cerebellar granule cells (Knogler et al., 2017; Wagner et al., 2019), our theory suggests that the two are not incompatible. Instead, the coding level of cerebellum-like regions may be determined by behavioral demands and the nature of the input to granule-like layers (Muscinelli et al., 2023). Sparse coding has also been cited as a key property of sensory representations in the cerebral cortex (Olshausen and Field, 1996). However, recent population recordings show that such regions exhibit dense movement-related activity (Musall et al., 2019), much like in cerebellum. While the theory presented in this study does not account for the highly structured recurrent interactions that characterize cerebrocortical regions, it is possible that these areas also operate using inductive biases that are shaped by coding level in a similar manner to our model.

Methods

Network model

The expansion layer activity is given by h=ϕ(Jeffxθ), where Jeff=JA describes the selectivity of expansion layer neurons to task variables. For most simulations, A is an N×D matrix sampled with random, orthonormal columns and J is an M×N matrix with i.i.d. unit Gaussian entries. The nonlinearity ϕ is a rectified linear activation function ϕ(u)=max(u,0) applied element-wise. The input layer activity n is given by n=Ax.

Sparsely connected networks

To model sparse excitatory connectivity, we generated a sparse matrix JE, where each row contains precisely K nonzero elements at random locations. The nonzero elements are either identical and equal to 1 (homogeneous excitatory weights) or sampled from a unit truncated normal distribution (heterogeneous excitatory weights). To model global feedforward inhibition that balances excitation, J=JEJI, where JI is a dense matrix with every element equal to 1MNijJijE.

For Figure 5B, Figure 6A and B, Figure 7B, sparsely connected networks were generated with homogeneous excitatory weights and global inhibition. For Figure 5E, the network with clustered representations was generated with homogeneous excitatory weights without global inhibition. For Figure 5C and F, networks were generated with heterogeneous excitatory weights and global inhibition.

Clustered representations

For clustered input-layer representations, each input layer neuron encodes one task variable (that is, A is a block matrix, with nonoverlapping blocks of N/D elements equal to 1 for each task variable). In this case, in order to obtain good performance, we found it necessary to fix the coding level for each input pattern, corresponding to winner-take-all inhibition across the expansion layer.

Dimension

The dimension of the expansion layer representation (Figure 3D) is given by Abbott et al., 2011; Litwin-Kumar et al., 2017:

(5) d=(iλi)2(iλi2),

where λi are the eigenvalues of the covariance matrix Cijh=Cov(hi,hj) of expansion layer responses (not to be confused with λα, the eigenvalues of the kernel operator). The covariance is computed by averaging over inputs x.

Learning tasks

Random categorization task

In a random categorization task (Figure 2E, Figure 7B), the network learns to associate a random input pattern xμRD for μ=1,,P with a random binary category yμ=±1. The elements of xμ are drawn i.i.d. from a normal distribution with mean 0 and variance 1/D. Test patterns x^μ are generated by adding noise to the training patterns:

(6) x^μ=1ϵ2xμ+ϵη,

where ηN(0,1DI). For Figure 2E, Figure 7B, and Figure 4—figure supplement 2, we set ϵ=0.1.

Gaussian process tasks

To generate a family of tasks with continuously varying outputs (Figure 2D and F, Figure 4F and G, Figure 5B, and Figure 6), we sampled target functions from a Gaussian process (Rasmussen and Williams, 2006), f(x)GP(0,C), with covariance

(7) C(xμ,xν)=exp(12γ2xμxν2),

where γ determines the spatial scale of variations in f(x). Training and test patterns are drawn uniformly on the unit sphere.

Learning of readout weights

With the exception of the ELL task and Figure 7—figure supplement 2, we performed unregularized least squares regression to determine the readout weights w. For the ELL sensory cancellation task (Figure 7C), we used 2 regularization, a.k.a. ridge regression:

(8) w=argminwμ=1Pf(xμ)wh(xμ)2+Mαridgew22,

where αridge is the regularization parameter. Solutions were found using Python’s scikit-learn package (Pedregosa, 2011).

In Figure 7—figure supplement 2, we implement a model of an online climbing fiber-mediated plasticity rule. The climbing fiber activity c is assumed to encode the error between the target and the network prediction c=f(x)f^(x). During each of Nepochs training epochs, the P training patterns are shuffled randomly and each pattern is presented one at a time. For each pattern µ, the weights are updated according to Δwμ=ηch(xμ). Parameter values were P=30,η=0.7/M,M=10,000,Nepochs=20,000.

Performance metrics

For tasks with continuous targets, the prediction of the network is given by f^(x)=wh(x), where w are the synaptic weights of the readout from the expansion layer. Error is measured as relative mean squared error (an expectation across patterns x in the test set): Error=E[(f(x)f^(x))2]x[f(x)2]. In practice we use a large test set to estimate this error over x drawn from the distribution of test patterns. For categorization tasks, the network’s prediction is given by f^(x)=sign(wh(x)). Performance is measured as the fraction of incorrect predictions. Error bars represent standard error of the mean across realizations of network weights and tasks.

Optimal granule–Purkinje cell weight distribution

We adapted our model to allow for comparisons with Brunel et al., 2004 by constraining readout weights w to be nonnegative and adding a bias, f(x)=wh(x)+b. To guarantee that the target function is nonnegative, we set f(x){0,1} for the random categorization task and f(x)|f(x)| for the Gaussian process tasks. The weights and bias were determined with the Python convex optimization package cvxopt (Andersen et al., 2011).

Model of two-joint arm

We implemented a biophysical model of a planar two-joint arm (Fagg et al., 1997). The state of the arm is specified by six variables: joint angles θ1 and θ2, angular velocities θ˙1 and θ˙2, and torques u1 and u2. The upper and lower segments of the arm have lengths l1 and l2 and masses m1 and m2, respectively. The arm has the following dynamics:

(9) M(θ)θ¨+C(θ,θ˙)θ˙=u,

where M(θ) is the inertia matrix and C(θ,θ˙) is the matrix of centrifugal, Coriolis, and friction forces:

(10) M(θ)=(I1+I2+m2l12+2m2l1l¯2cos(θ2)I2+m2l1l¯2cos(θ2)I2+m2l1l2cos(θ2)I2),
(11) C(θ,θ˙)=m2l1l2sin(θ2)(2θ2˙θ2˙θ1˙0)+(D100D2),

where l¯2 is the center of mass of the lower arm, I1 and I2 are moments of inertia and D1 and D2 are friction terms of the upper and lower arm respectively. These parameters were m1=3 kg, m2=2.5 kg, l1=0.3 m, l2=0.35 m, l¯2=0.21 m, I1=0.1 kg m2, I2=0.12 kg m2, D1=0.05 kg m2/s and D2=0.01 kg m2/s.

The task is to predict the position of the hand based on the forward dynamics of the two-joint arm system, given the arm initial condition and the applied torques. More precisely, the P network inputs xμ were generated by sampling 6-dimensional Gaussian vectors with covariance matrix C=diag(σθ,σθ,σθ˙,σθ˙,σu,σu), to account for the fact that angles, angular velocities and torques might vary on different scales across simulations. For our results, we used σθ=σθ˙=0.1 and σu=1. Each sample xμ was then normalized and used to generate initial conditions of the arm, by setting θ1μ=π4+x1μ, θ2μ=π4+x2μ, θ˙1μ=x3μ, and θ˙2μ=x4μ. Torques were generated by setting u1μ=x5μ and u2μ=x6μ. The target was constructed by running the dynamics of the arm forward in time for a time δ=0.2 s, and by computing the difference in position of the “hand” (i.e. the end of the lower segment) in Cartesian coordinates. As a result, the target in this task is two-dimensional, with each target dimension corresponding the one of the two Cartesian coordinates of the hand. The overall performance is assessed by computing the error on each task separately and then averaging the errors.

Model of electrosensory lateral line lobe (ELL)

We simulated 20,000 granule cells using the biophysical model of Kennedy et al., 2014. We varied the granule cell layer coding level by adjusting the spiking threshold parameter in the model. For each choice of threshold, we generated 30 different trials of spike rasters. Each trial is 160ms long with a 1ms time bin and consists of a time-locked response to an electric organ discharge command. Trial-to-trial variability in the model granule cell responses arises from noise in the mossy fiber responses. To generate training and testing data, we sampled 4 trials (P=640 patterns) from the 30 total trials for training and 10 trials for testing (1600 patterns). Coding level is measured as the fraction of granule cells that spike at least once in the training data. We repeated this sampling process 30 times.

The targets were smoothed broad-spike responses of 15 MG cells time-locked to an electric organ discharge command measured during experiments (Muller et al., 2019). The original data set consisted of 55 MG cells, each with a 300ms long spike raster with a 1ms time bin. The spike rasters were trial-averaged and then smoothed with a Gaussian-weighted moving average with a 10ms time window. Only MG cells whose maximum spiking probability across all time bins exceeded 0.01 after smoothing were included in the task. The same MG cell responses were used for both training and testing. To match the length of the granule cell data, we discarded MG cell data beyond 160ms and then concatenated 4 copies of the 160ms long responses for training and 10 copies for testing. We measured the ability of the model to construct MG cell targets out of granule cell activity, generalizing across noise in granule cell responses. Errors for each MG cell target were averaged across the 30 repetitions of sampling of training and testing data, and then averaged across targets. Standard error of the mean was computed across the 30 repetitions.

Model of vestibulo-ocular reflex (VOR)

Recordings of Purkinje cell activity in monkeys suggest that these neurons exhibit piecewise-linear tuning to head velocity (Lisberger et al., 1994). Thus, we designed piecewise-linear target functions representing Purkinje cell firing rate as a function of head velocity v, a one-dimensional input:

(12) f(v)={m1(vc)+bx<cm2(vc)+bxc.

Inputs v were sampled uniformly from [-1,1] 100 times. We generated 25 total target functions using all combinations of slopes m1 and m2 sampled from 5 equally spaced points on the interval [-2,2]. We set b=0.1 and c=-0.2.

Mossy fiber responses to head velocity input were modeled as exponential tuning curves:

(13) nj(v)=gjexp(vrj)+bj,

where gj is a gain term, rj±1 determines a mossy fiber preference for positive or negative velocities, and bj is the baseline firing rate. We generated 24 different tuning curves from all combinations of the following parameter values: The gain gj was sampled from 6 equally spaced points on the interval [0.1,1], rj was set to either –1 or 1, and bj was set to either 0 or 1. Qualitative results did not depend strongly on this parameterization. Mossy fiber to granule cell weights were random zero-mean Gaussians. Errors were averaged across targets.

Appendix 1

1 Connection between kernel and previous theories

Previous theories (Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017) studied generalization performance for random clusters of inputs associated with binary targets, where test patterns are formed by adding noise to training patterns (Figure 4A). The readout is trained using a supervised Hebbian rule with mean-subtracted expansion layer responses, w=μyμ(hμh¯), with h¯=1Pμ=1Phμ. The net input to a readout in response to a test pattern h^μ from cluster µ is gμ=w(h^μh¯). The statistics of gμ determine generalization performance. For a Hebbian readout, the error rate is expressed in terms of the signal-to-noise ratio (SNR) (Babadi and Sompolinsky, 2014):

(A1) P(Error)=12erfc(SNR/2).

SNR is given in terms of the mean and variance of gμ:

(A2) SNR=(Eμ[yμgμ])2Var(gμ).

The numerator of SNR is proportional to the average overlap of the expansion layer representations of training and test patterns belonging to the same cluster, which can be expressed in terms of the kernel function K:

(A3) Eμ[yμgμ]=Eμ[(h^μh¯)(hμh¯)]=MEμ[K(x^μ,xμ)]h¯h¯.

For large networks with Gaussian i.i.d. expansion weights, K(x^μ,xμ)=K(t), where t=x^μxμ, and the above equation reduces to MK(ttrain/test)h¯h¯, where ttrain/test is the typical overlap of training and test patterns belonging to the same cluster. When xμxμ=1, ttrain/test can be written as ttrain/test=1Δ, where Δ is a measure of within-cluster noise (Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017).

Babadi and Sompolinsky, 2014 demonstrated that, for random categorization tasks and when M and D are large, Var(gμ)=C(1M+Q21D) where C is a constant and Q[0,1] is given by

(A4) Q2=1ZhEμν[((hμh¯)(hνh¯))2]1ZxEμν[(xμxν)2],

assuming the entries of x are zero-mean. Za=Eμ[aμa¯2] for a{h,x} normalizes the overlaps to the typical overlap of a pattern with itself. The quantity Q2 is the ratio of the variance of overlaps between patterns belonging to different clusters in the expansion layer to that of the input layer. This describes the extent to which the geometry of the input layer representation is preserved in the expansion layer. When overlaps in the input layer are small, as they are for random clusters, 1Zh(hμh¯)(hνh¯)QZx(xμxν) as M. This relation illustrates that, for random clusters and M, Q is equal to the slope of the normalized kernel function K(t) evaluated at t=0. Litwin-Kumar et al., 2017 also showed that the dimension of the expansion layer representation is equal to C(1M+Q21D), where C is a constant.

Thus, for the random categorization task studied in Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017, dimension and readout SNR can be calculated by evaluating K(ttrain/test) and the slope of K(t) at t=0.

2 Dot-product kernels with arbitrary threshold

As M, the normalized dot product between features (Equation 2) converges pointwise to

(A5) K(x,x)=EJ1[ϕ(J1Txθ)ϕ(J1Txθ)],

where J1 is a row of the weight matrix J (without loss of generality, the first row) with entries drawn i.i.d. from a Gaussian distribution N(0,1). Our goal is to compute Equation A5 for a given θ and inputs drawn on the unit sphere x,xSD1.

Because the Gaussian weight distribution is spherically symmetric, Equation A5 restricted to the unit sphere for any nonlinearity is only a function of the dot-product t:=xTx, making the kernel a dot-product kernel K(x,x)=K(t).

Denote by Ji the entries of J1. Let I1=i=1DJixi and I2=i=1DJixi be the pre-activations for each input. Then (I1,I2) are jointly Gaussian with mean 0, variance 1, and covariance E[I1I2]=t. If t>0, we can re-parameterize these pre-activations as the sum of an independent and shared component Ii=yi1t+zt, where yiN(0,1) for i=1,2 and zN(0,1). In these coordinates, Equation A5 becomes

(A6) K(t)=Ey1,y2,z[ϕ(y11t+ztθ)ϕ(y21t+ztθ)]=Ez[Ey1[ϕ(y11t+ztθ)|z]Ey2[ϕ(y21t+ztθ)|z]]=Ez[Ey1[ϕ(y11t+ztθ)|z]2],

where the second line follows from the conditional independence of h1|z and h2|z and the third from the fact that they are identically distributed. Similarly, if t<0, we can write I1=y11-t+zt, I2=y21-t-zt.

We will use Equation A6 to solve for the kernel assuming ϕ is a ReLU nonlinearity. Let

(A7) g1(t,z)=E[ϕ(y11t+ztθ)|z].

Using the fact that ϕ is nonzero only when y11t+ztθ>0, i.e. for y1>T=θzt1t, we obtain

(A8) g1(t,z)=(2π)1/2T(y11t+ztθ)ey12/2dy1=(1t2π)1/2eT2/2+(ztθ2)erfc(T/2).

Performing a similar calculation for t<0 and collecting the results leads to:

(A9) K(t)={Ez[g1(t,z)2]t>0Ez[g1(|t|,z)g2(|t|,z)]t<0,
(A10) g1(t,z)=(1t2π)1/2eT2/2+(ztθ2)erfc(T1/2)
(A11) g2(t,z)=(1t2π)1/2eT2/2+(ztθ2)erfc(T2/2)
(A12) T1=θzt1t,andT2=θ+zt1t.

3 Spherical harmonic decompositions

Our theory of generalization requires us to work in function spaces which are natural to the problem. The spherical harmonics are the natural basis for working with dot-product kernels on the sphere. For a thorough treatment of spherical harmonics, see Atkinson and Han, 2012, whose notation we generally follow. Both our kernel and Gaussian process (GP) tasks are defined over the sphere in D dimensions

(A13) SD1={xRD:x2=1}.

A spherical harmonic Ykm()—where k indexes frequency and m indexes modes of the same frequency—is a harmonic homogeneous polynomial of degree k restricted to the sphere SD1. For each frequency k, there are N(D,k) linearly independent polynomials, where

(A14) N(D,k)=2k+D2k(k+D3k1).

3.1 Decomposition of the kernel and target function

We remind the reader here of the setting for our theory:

  1. Ridge regression using random features with a dot-product limiting kernel.

  2. Data drawn uniformly from the unit sphere.

Let σ be the Lebesgue measure on SD1. We will denote the surface area of the sphere as

(A15) |SD1|=SD1dσ=2πD/2Γ(D/2).

On the other hand, the uniform probability measure on the sphere, denoted by σ¯, must integrate to 1, so σ¯=σ/|SD1|. Finally, we define the space of real-valued square integrable functions L2(σ) as the Hilbert space with inner product

(A16) f,gL2(σ)=SD1f(x)g(x)dσ(x)

and fL2(σ)=f,fL2(σ)1/2. The space L2(σ¯) is defined analogously.

Eigendecompositions describe the action of linear operators, not functions, thus we must associate a linear operator with our kernel for its eigenvalues to make sense. The kernel eigenvalues λα that we will use to compute the error are the eigenvalues of the integral operator TK:L2(σ¯)L2(σ¯) defined as

(A17) (TKf)(x)=K(x,),f()L2(σ¯)=SD1K(x,x)f(x)dσ¯(x).

This is because σ¯ is the data distribution, and these eigenvalues are approximated by the eigenvalues of the kernel matrix evaluated on a large but finite dataset (Koltchinskii et al., 2000). Similarly, we define the analogous operator UK:L2(σ)L2(σ) under the measure σ with eigenvalues ξα. Since TK=UK/|SD1|, the eigenvalues are related by

(A18) λα=ξα|SD1|,

and they share the same eigenfunctions, up to normalization. For the rest of this section we will study eigendecompositions of operator UK, which may be translated into statements about TK via (18) (These differences are liable to cause some confusion and pain when reading the literature).

Under mild technical conditions that our kernels satisfy, Mercer’s theorem states that positive semidefinite kernels can be expanded as a series in the orthonormal basis of eigenfunctions ψα weighted by nonnegative eigenvalues ξα:

(A19) K(x,x)=αξαψα(x)ψα(x).

Again, (λα,ψα) are eigenpairs for the operator UK and form an orthonormal set under the L2(σ) inner product.

As stated earlier, the kernel (Equation A5) is spherically symmetric and thus a dot-product kernel. Because of this, we can take the eigenfunctions ψα to be the spherical harmonics Ykm. The index α is a multi-index into mode m of frequency k. Writing the Mercer decomposition in the spherical harmonic basis gives:

(A20) K(x,x)=k=0ξkm=1N(D,k)Ykm(x)Ykm(x).

Because our kernel is rotation invariant, all N(D,k) harmonics of frequency k share eigenvalue ξk.

Any function in L2(σ) can be expanded in the spherical harmonic basis as follows:

(A21) f(x)=k=0m=1N(D,k)ckmYkm(x), with ckm=f,YkmL2(σ).

The expansion is analogous to that of the Fourier series. In fact when D=2, the spherical harmonics are sines and cosines on the unit circle.

3.2 Ultraspherical polynomials

Adding together all harmonics of a given frequency relates them to a polynomial in t by the addition formula

(A22) m=1N(D,k)Ykm(x)Ykm(x)=N(D,k)|SD1|Pk,D(xTx).

The polynomial Pk,D(t) is the k th ultraspherical polynomial. These are also called Legendre or Gegenbauer polynomials, although these usually have different normalizations and can be defined more generally.

The ultraspherical polynomials {Pk,D} form an orthogonal basis for

L2([-1,1],(1-t2)(D-3)/2dt).

As special cases, Pk,2(t) and Pk,3(t) are the classical Chebyshev and Legendre polynomials, respectively. For any D, the first two of these polynomials are P0(t)=1 and P1(t)=t. We use the Rodrigues formula (Atkinson and Han, 2012), which holds for k0 and D2, to generate these polynomials:

(A23) Pk,D(t)=(1/2)kΓ((D1)/2)Γ(k+(D1)/2)(1t2)(3D)/2(ddt)k(1t2)k+(D3)/2.

Combining Equation A20 with the addition formula (Equation A22), we can express the kernel in terms of ultraspherical polynomials evaluated at the dot-product of the inputs:

(A24) K(t)=k=0ξkN(D,k)|SD1|Pk,D(t).

3.3 Computing kernel eigenvalues

The Funk-Hecke theorem states that

(A25) xSD1K(xTx)Yk(x)dσ(x)=|SD2|Yk(x)11K(t)Pk,D(t)(1t2)(D3)/2dt.

Equation A25 implies that the eigenvalues of UK are given as

(A26) ξk=|SD2|11K(t)Pk,D(t)(1t2)(D3)/2dt.

For our kernels, the kernel eigenvalues can be conveniently computed using polar coordinates. When the entries of J1 are i.i.d. unit Gaussian,

K(t)=RDϕ(J1Txθ)ϕ(J1Txθ)(2π)D/2eJ12/2d(J1)1d(J1)D=(2π)D/20er2/2rD1SD1ϕ(rJ^Txθ)ϕ(rJ^Txθ)dσ(J^)dr,

where x^=J1/r and r=J1. The ReLU nonlinearity is positively homogeneous, so ϕ(rJ^Txθ)=rϕ(J^Txθ/r). We can write

(A27) K(t)=(2π)D/20er2/2rD+1SD1(J^Txθ/r)+(J^Txθ/r)+dσ(J^):=Kshell(t;θ/r)dr=(2π)D/20er2/2rD+1Kshell(t;θ/r)dr,

where we have introduced a new kernel Kshell(t;θ) which is |SD1| times the dot-product kernel that arises when the weights are distributed uniformly on the sphere (σ is not the probability measure). The above equation shows that the network restricted to inputs x,xSD1 has different kernels depending on whether the weights are sampled according to a Gaussian distribution or uniformly on the sphere. Without the threshold, this difference disappears due to the positive homogeneity of the ReLU (Churchland et al., 2010).

Next we expand the nonlinearity in the spherical harmonic basis (following Bietti and Bach, 2021; Bach, 2017)

(A28) ϕ(J^Txθ)=(J^Txθ)+=k=0ak(θ)j=1N(D,k)Ykj(J^)Ykj(x),

where the k th coefficient is given by the Funk-Hecke formula (Equation A25) as

(A29) ak(θ)=|SD2|11(tθ)+Pk(t)(1t2)(D3)/2dt,

and we explicitly note the dependence on θ. Using the representation Equation A28, we can recover the eigendecomposition:

(A30) Kshell(t;θ)=SD1(J^Txθ)+(J^Txθ)+dσ(J^)=k,kak(θ)ak(θ)j,jYkj(x)Ykj(x)SD1Ykj(J^)Ykj(J^)dσ(J^)δkkδjj=kak(θ)2jYkj(x)Ykj(x)=kak(θ)2N(k,D)|SD1|Pk(t),

which follows from orthonormality and the addition formula (Equation A22). We have that ak(θ)2 is the k th eigenvalue of Kshell(t;θ).

Using Equation A30 in Equation A27 leads to

(A31) K(t)=kN(k,D)|SD1|Pk(t)(2π)D/20er2/2rD+1ak(θ/r)2dr,

i.e. the eigenvalues satisfy

(A32) ξk=(2π)D/20er2/2rD+1ak(θ/r)2dr.
3.3.1 Eigenvalues of Kshell

It is possible to compute ak(θ) analytically (Bietti and Bach, 2021; Bach, 2017). Letting

(A33) Iα,k(θ)=θ1tαPk(t)(1t2)(D3)/2dt,

we have that Equation A29 reduces to ak(θ)=|SD2|(I1,k(θ)θI0,k(θ)). Equation A33 requires θ[-1,1], but θ/r± in Equation A32 as r0. So we take θ*=min(max(θ,-1),1), which still assures that Equation A29 is satisfied. For the rest of this section, assume wlog that θ[-1,1).

Using Rodrigues’ formula (Equation A23) in Equation A33 gives

Iα,k(θ)=(1/2)kΓ((D1)/2)Γ(k+(D1)/2):=Cθ1tα(ddt)k(1t2)k+(D3)/2dt=Cθ1tα(ddt)k(1t2)k+(D3)/2dt

which may be integrated by parts. We will treat α=0 and 1 separately.

In the case of α=0, since tα=1 we have the integral of a derivative, so for k1

I0,k(θ)=Cθ1(ddt)k(1t2)k+(D3)/2dt=C(ddt)k1(1t2)k+(D3)/2|θ1=C(ddt)k1(1t2)k+(D3)/2|t=θ(k1)

When k=0 we find that

I0,0(θ)=θ1(1t2)(D3)/2dt=t2F1(1/2,(3D)/2;3/2;t2)|θ1=πΓ((D1)/2)2Γ(D/2)θ2F1(1/2,(3D)/2;3/2;θ2).

For α=1, we integrate by parts once and find that for k2,

I1,k(θ)=Cθ1t(ddt)k(1t2)k+(D3)/2dt=C[t(ddt)k1(1t2)k+(D3)/2|θ1θ1(ddt)k1(1t2)k+(D3)/2dt]=C[t(ddt)k1(1t2)k+(D3)/2|θ1(ddt)k2(1t2)k+(D3)/2|θ1]=C[(ddt)k2(1t2)k+(D3)/2t(ddt)k1(1t2)k+(D3)/2]|t=θ(k2)

When α=0, we have a straightforward integral

I1,0(θ)=θ1t(1t2)(D3)/2dt=(1θ2)(D1)/2(D1)=I0,1(θ).

Finally, for k=1, we obtain

I1,1(θ)=θ1t2(1t2)(D3)/2dt=(t3/3)2F1(3/2,(3D)/2;5/2;t2)|θ1=πΓ((D1)/2)4Γ((D+2)/2)(θ3/3)2F1(3/2,(3D)/2;5/2;θ2)
3.3.2 Properties of the eigenvalues of Kshell

The above show that for k2

(A34) ak=|SD2|(I1,k(θ)θI0,k(θ))(ddt)k2(1t2)k+(D3)/2|t=θ.

Taking θ=-1 leads to ak=0, since fewer derivatives than k+(D-3)/2 appear in Equation A34, which reflects the fact that higher degree ultraspherical polynomials are orthogonal to a linear function. Furthermore, since 1-t2 is an even function, the parity of ak as a function of θ matches the parity of k. However, ak appears squared in Equation A32, so ξk will always be an even function of θ. This explains the parity symmetry of the eigenvalues with coding level for k2. Also, Equation A34 for θ=0 gives ak=0 when k is odd, as was shown by Bach, 2017; Basri et al., 2019. This is because

(ddt)p(1t2)p+|t=0=(ddt)p(1t)p+l(1+t)p+l|t=0=j=0p(pj)((ddt)j(1t)p+l)((ddt)pj(1+t)p+l)|t=0=j=0p(pj)(1)j((ddt)j(1+t)p+l)((ddt)pj(1+t)p+l)|t=0=0 if p is odd,

because the j and p-j terms have opposite parity and cancel.

We may also compute the tail asymptotics of these eigenvalues for large k. Let p=k-2 and =(D+1)/2, so we want to evaluate

(ddt)p(1t2)p+=p!2πi(1z2)p+(zt)p+1dz=p!2πie(p+o(p))F(z)dz

for large p at t=θ(-1,1). The first line follows from Cauchy’s intergral formula for a counterclockwise contour encircling t, and the second comes from defining

F(z):=log(1z2)log(zt)(1+/p)log(1z2)(1+1/p)log(zt),

when p is large and is constant. We will use the saddle point method (Butler, 2007) to evaluate the contour integral asymptotically, ignoring the o(p) term in the exponent. Note that the only singularity in the original integrand occurs at z=t.

The function F has derivatives

F(z)=2z1z21zt,
F(z)=1(zt)24z2(1z2)221z2.

We find the saddle points by setting F(z)=0. This leads to a quadratic equation with two roots: z±=t±t2-1=sgn(t)(|t|±i1-t2). Since these are evaluated at t=θ with |θ|<1, both roots are complex, |z±|=1, and F′′(z±)0. Also, the saddle points avoid the singularity in the original integrand, so we can deform our contour to pass through these points and apply the standard approximation.

Applying the saddle point approximation, we obtain

(ddt)p(1t2)p+p!2πiepF(z)dzp!2πiz0{z+,z}epF(z0)ei(πargF(z0))/2(2πp|F(z0)|)1/2cp!z0{z+,z}epF(z0)p1/2=cp!p1/2((1z+2z+t)p+(1z2zt)p)=cp!p1/2((2z+)p+(2z)p)2cp!p1/2(2)p

for some c which is constant in p and depends on D. In the last step, we use that z+p+zp2 since z± are conjugate pairs with magnitude 1.

Now recall the full equation for the coefficients:

ak=|SD2|(1/2)kΓ((D1)/2)Γ(k+(D1)/2)(ddt)p(1t2)p+.

Plugging in the result from the saddle point approximation, substituting p=k-2, and dropping all terms that are constant in k, we find that

akC(1/2)k(k2)!k1/2(2)kΓ(k+(D1)/2)=Ck1/2Γ(k1)Γ(k+(D1)/2)=CkD/21,

where C is a new constant. The rate of k-D/2-1 is the same decay rate found by Bach, 2017; Bietti and Bach, 2021 using a different mathematical technique for θ=0. These decay rates are important for obtaining general worst-case bounds for kernel learning of general targets; (Bach, 2012) is an example.

3.4 Gaussian process targets

Taking our target function to be a GP on the unit sphere f(x)GP(0,C) with some covariance function C:SD1×SD1R, we can represent our target function by performing an eigendecomposition of the covariance operator UC. When C itself is spherically symmetric and positive definite, this becomes

(A35) C(xμ,xν)=k=0ρkm=1N(D,k)Ykm(xμ)Ykm(xν),

where ρk>0 are the eigenvalues. Then a sample from the GP with this covariance function is a random series

(A36) f(x)=k=0ρkm=1N(D,k)gkmYkm(x),

where gkmN(0,1) by the Kosambi-Karhunen–Loève theorem (Kosambi, 1943). In other words, the coefficient of Ykm in the series expansion of f(x) is ckm=ρkgkm.

We take the squared exponential covariance on the sphere

(A37) C(xμ,xν)=exp(xμxν22γ2)=exp(t1γ2),

for t=xμxν and length scale γ.

3.5 Numerical details

All of our spherical harmonic expansions are truncated at frequency Nk. This is typically Nk=50 for experiments in D=3 dimensions. In higher dimensions, N(D,k) grows very quickly in k, requiring truncation at a lower frequency.

To compute the kernel eigenvalues λk, we can either numerically integrate the Funk-Hecke formula (Equation A25) or compute the coefficients ak(θ/r) semi-analytically, following Equation A34, then integrate Equation A32 with numerical quadrature and rescale by Equation A18.

We use the Funk-Hecke formula (Equation A25) and numerical quadrature to find ρk. To compute the expected error using Equation A38, we use E[cα2]=ρk. After generating a sample from the GP, we normalize the functions by dividing the labels and coefficients by their standard deviation. This ensures that the relative mean squared error is equivalent to the mean squared error computed in the next section.

4 Calculation of generalization error

The generalization error of kernel ridge regression is derived in Canatar et al., 2021a; Simon et al., 2021; Gerace et al., 2021, which show that the mean squared error, in the absence of noise in the target, can be written as

(A38) Ex (f(x)f^(x))2=αβαcα2,

where βα depend on P and the kernel but not on the target, and cα are the coefficients (Equation A21) of the target function in the basis L2(σ). The exact form of this expression differs from that given in Canatar et al., 2021b due to differences in the conventions we take for our basis expansions. Specifically,

(A39) βα=(11χ)(κλαP+κ)2,

where α indexes the kernel eigenfunctions and

(A40) χ=αλα2P(λαP+κ)2,
(A41) κ=αridge+αλακλαP+κ,

with αridge the ridge parameter. Note that Equation A41 is an implicit equation for κ, which we solve by numerical root-finding.

Thus,

(A42) Ex(f(x)f^(x))2=C1α(cαλα+C2)2,

with C1=(11-χ)κ2P2 and C2=κP.

5 Dense-sparse networks

To compare with more realistic networks, we break the simplifying assumption that Jeff is densely connected and instead consider sparse connections between the input and expansion layer. Consider a random matrix Jeff=JA, where ARN×D and JRM×N, with N>D and M>N. The entries of A are i.i.d. Gaussian, i.e. AijN(0,1/D). In contrast, J is a sparse matrix with exactly K nonzero entries per row, and nonzero entries equal to 1/K. With these scaling choices, the elements of Jeff are of order 1/D, which is appropriate when the input features xi are order 1. This is in contrast to the rest of this paper, where we considered features of order 1/D and therefore assumed order 1 weights. The current scaling allows us to study the properties of Jeff for different values of D, N and K.

First, we examine properties of Jeff=JA under these assumptions. Recall that the rows of Jeff are the weights of each hidden layer neuron. Since Jeff is Gaussian, any given row JieffRD is marginally Gaussian and distributed identically to any other row. But the rows are not independent, since they are all linear combinations of the rows of A. Thus, the kernel limit of an infinitely large dense-sparse network is equal to that of a fully dense network, but convergence to that kernel behaves differently and requires taking a limit of both N,M. In this section, we study how finite N introduces extra correlations among the rows of Jeff compared to dense networks.

The distribution of Jeff is spherically symmetric in the sense that Jeff and JeffQ have the same distribution for any rotation matrix QRD×D. In contrast, a densely connected network with weights G drawn i.i.d. as GijN(0,1/D) will of course have independent rows and also be spherically symmetric. The spherical Gaussian is the only vector random variable which is spherically symmetric with independent entries (Nash and Klamkin, 1976). Furthermore, each row of G may be rotated by a different orthogonal matrix and the resulting random variable would still have the same distribution.

With these symmetry considerations in mind, the statistics of the rows of Jeff can be described by their multi-point correlations. The simplest of these is the two-point correlation, which in the case of spherical symmetry is captured by the overlaps:

(A43) νij:=k=1D(Jeff)ik(Jeff)jk=k=1Dm,n=1NJinAnkJjmAmk.

The overlap νij is doubly stochastic: one source of stochasticity are the elements of A, and the second one is the random sampling of nonzero elements of J. Ideally, we are interested in studying the statistics of νij when varying i and j, i.e. when J varies (since the rows of J are sampled independently from each other). However, this will leave us with the quenched disorder given by the specific realization of A. To obtain a more general and interpretable result, we want to compute the probability distribution

(A44) PA,J(νij)=EA[EJ[δ(νijk=1Dm,n=1NJinJjmAnkAmk)]].

Notice that the order in which we perform the averaging is irrelevant.

5.1Computation of the moment-generating function

Instead of computing directly the probability distribution in Equation A44, we compute the moment-generating function

(A45) Z(μ):=EA[EJ[exp(μνij)]],

which fully characterizes the probability distribution of νij. We indicate the set of indices in which the i-th row of J takes nonzero values by Si={S1i,S2i,,SKi} such that JiSli0, l=1,,K, and analogously for the j-th row. We also indicate the intersection Sij=SiSj, i.e. the set of indices in which both the i-th and the j-th rows are nonzero. Sij has size 0|Sij|K. Notice that setting i=j causes |Sij|=K deterministically. With this definitions, the overlap can be written as

(A46) νij=k=1DmSinSjJinJjmAnkAmk=1Kk=1DmSinSjAnkAmk

We start by perform swapping the averaging order in Equation A45 and averaging over A.

Z(μ)=EJ[(m=1Nl=1DDAml)exp(μKk=1DmSinSjAnkAmk)]=EJ[(mSiSjl=1DDAml)exp(μKk=1DmSinSjAnkAmk)]=EJ[k=1D(mSiSjDAmk)exp(μKmSinSjAnkAmk)],

where in the first equality we marginalized over all the elements of A which do not enter the definition of νij, i.e. we went from having to integrate over N×D variables to only |SiSj|×D=(2K-|Sij|)×D variables. In the second equality we factorized the columns of A.

We now explicitly compute integral for a fixed value of k, by reducing it to a Gaussian integral:

(mSiSjDAmk)exp(μKmSinSjAnkAmk)=(mSiSjdAmk)(2πD)|SiSj|2exp(μKmSinSjAnkAmkD2rSiSjArk2)=(mSiSjdAmk)(2πD)|SiSj|2exp(D2rSiSjArkPrsAsk)=det(P)1/2,

where PR|SiSj|×|SiSj|, which has a 3-by-3 block structure and can be written as

(A47) P=(IK|Sij|μKD1|Sij|μKD1K|Sij|μKD1|Sij|×(K|Sij|)I|Sij|2μKD1|Sij|μKD1|Sij|×(K|Sij|)μKD1K|Sij|μKD1|Sij|×(K|Sij|)IK|Sij|),

where In is the n-by-n identity matrix and Jeff is the n-by-m matrix of all ones (if m is omitted, then it is n-by-n). Due to the block structure, the determinant of the matrix above is identical to the determinant of a 3-by-3 matrix

(A48) det(P)=det(1μKD|Sij|μKD(K|Sij|)μKD(K|Sij|)12μKD|Sij|μKD(K|Sij|)μKD(K|Sij|)μKD|Sij|1)=K2D2K2μ2+|Sij|2μ22DK|Sij|μK2D2.

By plugging this result into the expression for the moment-generating function, we have that

(A49) Z(μ)=EJ[(K2D2K2μ2+|Sij|2μ22DK|Sij|μK2D2)D/2].

This expression is our core result, and needs to be averaged over J. This average can be written explicitly by noticing that, when ij, |Sij| is a random variable that follows a hypergeometric distribution in which the number of draws is equal to number of success state and is equal to K. By using the explicit expression of the probability mass function of a hypergeometric distribution, we have that

(A50) Z(μ)=s=0K(Ks)(NKKs)(NK)(K2D2K2μ2+s2μ22DKsμK2D2)D/2.

Notice that the term s=0 yields the same moment-generating function (up to a factor) as for a fully-connected Jeff with Gaussian i.i.d. entries with variance 1/D. In contrast, when i=j we obtain

(A51) Zi=j(μ)=(12μD)D/2.

5.2 Computation of the moments of νij

In this section, we assume that ij and use the moment-generating function to compute the moments of νij. The non-central moments of the overlap are easily obtained from the moment-generating function as

(A52) EJ[νijq]=dqdμqZ(μ)|μ=0,

which can be computed in a symbolic manipulation tool.

We now explicitly compute the first two moments of νij.

(A53) EJ[νij]=ddμZ(μ)|μ=0=1KEJ[|Sij|]=KN,

where we used the fact that the mean of sHypergeom(N,K,K) is given by E[s]=K2N. For the second moment, we have

E[νij2]=EJ[K2+|Sij|2(1+D)K2D]=1D+D+1D((NK)2N2(N1)+K2N2),

while to compute the variance we use the law of total variance

Var(νij)=Es[Var(νij|s)]+Var(E[νij|s])=1D+D+1K2DE[s2]1K2E[s2]+Var(1Ks)=1D+D+1D((NK)2N2(N1)+K2N2)K2N2=1D+D+1D((NK)2+(N1)K2(N1)K2DD+1N2(N1))=1D+D+1D1N1+1DK22(D+1)KN(N1)+K2N2(N1).

As N we have that Var(νij)1D, which is the same variance of the overlap for a fully-connected Jeff with Gaussian i.i.d. entries. This is expected since when N is large, the probability of i and j having common afferents goes to zero.

5.3 Comparison to clustered embedding

Instead of distributed embedding, i.e. A being a Gaussian matrix, here we consider a clustered embedding by setting

(A54) A=ID1N/D.

i.e. the Kronecker product of the D-dimensional identity matrix and a vector of all ones and length N/D. This means that we can separate the input layer of N neurons in D non overlapping subsets Bn={ND(n-1)+1,ND(n-1)+2,,NDn}, each of size N/D, and we can write

(A55) Amn={1if mBn0otherwise.

In this case the overlap is given by

(A56) νij=1Kl=1DmSinSj1[mBl]1[nBl],

where 1[] is the indicator function, i.e. it is one if the argument is true and zero if the argument is false. We indicate by Kli the number of elements of Si which belongs to group l, i.e. Kli=mSi1[mBl]. The overlap can then be written as

(A57) νij=1Kl=1DKliKlj.

The vector Ki=(K1i,,KDi) follows a multivariate hypergeometric distribution with D classes, K draws, a population of size N, and number of successes for each class equal to N/D. Notice that Ki and Kj are independent from each other since each neuron samples its pre-synaptic partners independently. We can now compute explicitly the mean of νij using the fact that E[Kli]=KD

(A58) E[νij]=1Kl=1DE[KLi]2=KD.

Similarly, we can write the second moment of νij as

(A59) E[νij2]=1K2(l=1DE[(Kli)2]2+llE[KliKli]2).

Once again, we can use known result for variance and covariance of multivariate hypergeometric variables to simplify the above expression. Indeed, we can write

(A60) E[(Kli)2]=KNKN1D1D2+K2D2
(A61) E[KliKli]=KNKN1(1D)2+K2D2

from which we obtain the final expression for the second moment

(A62) E[νij2]=K2D2+1D(11D)(NKN1)2.

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Code implementing the model is available on github: https://github.com/marjoriexie/cerebellar-task-dependent (copy archived at Xie, 2023).

References

  1. Book
    1. Abbott LF
    2. Rajan K
    3. Sompolinsky H
    (2011) Interactions between intrinsic and stimulus-evoked activity in recurrent neural networks
    In: Abbott LF, editors. The Dynamic Brain: An Exploration of Neuronal Variability and Its Functional Significance. Oxford University Press. pp. 65–82.
    https://doi.org/10.1093/acprof:oso/9780195393798.001.0001
    1. Bach F
    (2017)
    Breaking the curse of dimensionality with convex neural networks
    Journal of Machine Learning Research 18:1–53.
  2. Conference
    1. Basri R
    2. Jacobs D
    3. Kasten Y
    4. Kritchman S
    (2019)
    The convergence rate of neural networks for learned functions of different frequencies
    Advances in Neural Information Processing Systems. pp. 4761–4771.
  3. Conference
    1. Bordelon B
    2. Canatar A
    3. Pehlevan C
    (2020)
    Spectrum dependent learning curves in kernel regression and wide neural networks
    International Conference on Machine Learning. pp. 1024–1034.
    1. Canatar A
    2. Bordelon B
    3. Pehlevan C
    (2021a)
    Out-of-distribution generalization in kernel regression
    Advances in Neural Information Processing Systems 34:12600–12612.
  4. Conference
    1. Fagg AH
    2. Sitkoff N
    3. Barto AG
    4. Houk JC
    (1997)
    Cerebellar learning for control of a two-link arm in muscle space
    Proceedings of International Conference on Robotics and Automation. pp. 2638–2644.
  5. Book
    1. Ito M
    2. Itō M
    (1984)
    The Cerebellum and Neural Control
    Raven Press.
  6. Conference
    1. Jacot A
    2. Gabriel F
    3. Hongler C
    (2018)
    Neural tangent kernel: Convergence and generalization in neural networks
    Advances in Neural Information Processing Systems.
    1. Kosambi DD
    (1943)
    Statistics in function space
    Journal of the Indian Mathematical Society 7:76–88.
    1. Pedregosa F
    (2011)
    Scikit-learn: machine learning in python
    Journal of Machine Learning Research 12:2825–2830.
  7. Conference
    1. Rahimi A
    2. Recht B
    (2007)
    Random features for large-scale kernel machines
    Advances in Neural Information Processing Systems.
  8. Book
    1. Schölkopf B
    2. Smola AJ
    (2002)
    Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond
    MIT press.
  9. Conference
    1. Sollich P
    (1998) Approximate learning curves for Gaussian processes
    9th International Conference on Artificial Neural Networks.
    https://doi.org/10.1049/cp:19991148

Decision letter

  1. Jörn Diedrichsen
    Reviewing Editor; Western University, Canada
  2. Michael J Frank
    Senior Editor; Brown University, United States
  3. Jörn Diedrichsen
    Reviewer; Western University, Canada
  4. Henrik Jörntell
    Reviewer; Lund University, Sweden

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Task-dependent optimal representations for cerebellar learning" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Jörn Diedrichsen as Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Michael Frank as the Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Henrik Jörntell (Reviewer #3).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission. You will see that the recommended revisions most concern clarity of presentation, as well as tempering some of the claims.

Essential revisions:

1) All reviewers had questions about specific details of the study (see below comments). I hope you can use these questions as a guideline to improve the clarity and accessibility of the paper. As suggested by reviewer #2, this may involve moving some details from the supplementary materials to the main manuscript.

2) The current eLife assessment (see below) emphasizes the main limitations of the paper – namely the question of to what degree the conclusions of the paper would change if more specific (and biologically more plausible) details about the cerebellar circuit would be taken into account. I hope you will take the opportunity to respond in detail to these points in your response to the reviewer document, which will be posted alongside the revised version of the manuscript and cover the main issues in the Discussion section of the main document.Reviewer #1 (Recommendations for the authors):

This paper provides compelling and clear analyses that show that the coding level (sparsity) of the granule-cell layer of a cerebellar-like network does not only change the dimensionality of the representation, but also the inductive bias of the network: Higher sparsity biases the network to learning more high-frequency representations. Depending on the dominant frequencies of the target function, different coding levels are therefore optimal. The results are important/fundamental to theories of cerebellar learning and speak to a relevant ongoing debate in cerebellar learning, but will be of interest to readers outside of cerebellar neurophysiology.

I had two problems in understanding the paper that the authors hopefully can clarify in the revision:

Page 8: Third paragraph: At this point in the text it is a bit unclear why the K(x * x') changes shape as shown in Figure 3c. I assume the shape of the kernel K depends on the activation function in the hidden layer. It may be useful for the reader if you could give an example of the Kernel for the specific case you are showing in Figure 3c.

Page 11, 4th paragraph: Why is the distribution corr(Jeffi,Jeffj) uniform on -1 to 1? Since both are high-dimensional random vectors, should the correlation not be centered around 0? The target of uniform distribution needs to be better explained to make this analysis accessible.Reviewer #2 (Recommendations for the authors):

Here, a simple model of cerebellar computation is used to study the dependence of task performance on input type: it is demonstrated that task performance and optimal representations are highly dependent on task and stimulus type. This challenges many standard models which use simple random stimuli and concludes that the granular layer is required to provide a sparse representation. This is a useful contribution to our understanding of cerebellar circuits, though, in common with many models of this type, the neural dynamics and circuit architecture are not very specific to the cerebellum, the model includes the feed-forward structure and the high dimension of the granule layer, but little else. This paper has the virtue of including tasks that are more realistic, but by the paper's own admission, the same model can be applied to the electrosensory lateral line lobe and it could, though it is not mentioned in the paper, be applied to the dentate gyrus and large pyramidal cells of CA3. The discussion does not include specific elements related to, for example, the dynamics of the Purkinje cells or the role of Golgi cells, and, in a way, the demonstration that the model can encompass different tasks and stimuli types is an indication of how abstract the model is. Nonetheless, it is useful and interesting to see a generalization of what has become a standard paradigm for discussing cerebellar function.

I was impressed by the clarity of this manuscript. My only comment is that I found too much was deferred to the appendix, I thought the bits and pieces in the appendix were very clarifying, and given the appendix contains short pieces of extra information explaining the exact nature of the models and tasks, the manuscript would have been easier to follow and think about if these had just been integrated into the text. Often you read papers and wish some details had been shifted into an appendix since they distract from the flow of the description, but the opposite is true here, integrating the details into the text would've made it more concrete in a useful way and the tables of parameters, as tables, would not have interrupted the flow while making it much easier to see the scale of the models and their architecture.

Reviewer #3 (Recommendations for the authors):

The paper by Xie et al. is a modelling study of the mossy fiber-to-granule cell-to-Purkinje cell network, reporting that the optimal type of representations in the cerebellar granule cell layer depends on the type task. The paper stresses that the findings indicate a higher overall bias towards dense representations than stated in the literature, but it appears the authors have missed parts of the literature that already reported on this. While the modelling and analysis appear mathematically solid, the model is lacking many known constraints of the cerebellar circuitry, which makes the applicability of the findings to the biological counterpart somewhat limited.

I have some concerns with the novelty of the main conclusion, here from the abstract:

'Here, we generalize theories of cerebellar learning to determine the optimal granule cell representation for tasks beyond random stimulus discrimination, including continuous input-output transformations as required for smooth motor control. We show that for such tasks, the optimal granule cell representation is substantially denser than predicted by classic theories.'

Stated like this, this has in principle already been shown, i.e. for example:

Spanne and Jorntell (2013) Processing of multi-dimensional sensorimotor information in the spinal and cerebellar neuronal circuitry: a new hypothesis. PLoS Comput Biol. 9(3):e1002979.

Indeed, even the 2 DoF arm movement control that is used in the present paper as an application, was used in this previous paper, with similar conclusions with respect to the advantage of continuous input-output transformations and dense coding. Thus, already from the beginning of this paper, the novelty aspect of this paper is questionable. Even the conclusion in the last paragraph of the Introduction: 'We show that, when learning input-output mappings for motor control tasks, the optimal granule cell representation is much denser than predicted by previous analyses.' was in principle already shown by this previous paper.

However, the present paper does add several more specific investigations/characterizations that were not previously explored. Many of the main figures report interesting new model results. However, the model is implemented in a highly generic fashion. Consequently, the model relates better to general neural network theory than to specific interpretations of the function of the cerebellar neuronal circuitry. One good example is the findings reported in Figure 2. These represent an interesting extension to the main conclusion, but they are also partly based on arbitrariness as the type of mossy fiber input described in the random categorization task has not been observed in the mammalian cerebellum under behavior in vivo, whereas in contrast, the type of input for the motor control task does resemble mossy fiber input recorded under behavior (van Kan et al. 1993).

The overall conclusion states:

'Our results…suggest that optimal cerebellar representations are task-dependent.'

This is not a particularly strong or specific conclusion. One could interpret this statement as simply saying: ' if I construct an arbitrary neural network, with arbitrary intrinsic properties in neurons and synapses, I can get outputs that depend on the intensity of the input that I provide to that network.'

Further, the last sentence of the Introduction states: 'More broadly, we show that the sparsity of a neural code has a task-dependent influence on learning…' This is very general and unspecific, and would likely not come as a surprise to anyone interested in the analysis of neural networks. It doesn't pinpoint any specific biological problem but just says that if I change the density of the input to a [generic] network, then the learning will be impacted in one way or another.

The interpretation of the distribution of the mossy fiber inputs to the granule cells, which would have a crucial impact on the results of a study like this, is likely incorrect. First, unlike the papers that the authors cite, there are many studies indicating that there is a topographic organization in the mossy fiber termination, such that mossy fibers from the same inputs, representing similar types of information, are regionally co-localized in the granule cell layer. Hence, there is no support for the model assumption that there is a predominantly random termination of mossy fibers of different origins. This risks invalidating the comparisons that the authors are making, i.e. such as in Figure 3. This is a list of example papers, there are more:

van Kan, Gibson and Houk (1993) Movement-related inputs to intermediate cerebellum of the monkey. Journal of Neurophysiology.

Garwicz et al. (1998) Cutaneous receptive fields and topography of mossy fibres and climbing fibres projecting to cat cerebellar C3 zone. The Journal of Physiology.

Brown and Bower (2001) Congruence of mossy fiber and climbing fiber tactile projections in the lateral hemispheres of the rat cerebellum. The Journal of Comparative Neurology.

Na, Sugihara, Shinoda (2019) The entire trajectories of single pontocerebellar axons and their lobular and longitudinal terminal distribution patterns in multiple aldolase C-positive compartments of the rat cerebellar cortex. The Journal of Comparative Neurology.

The nature of the mossy fiber-granule cell recording is also reviewed here:

Gilbert and Miall (2022) How and Why the Cerebellum Recodes Input Signals: An Alternative to Machine Learning. The Neuroscientist

Further, considering the recoding idea, the following paper shows that detailed information, as it is provided by mossy fibers, is transmitted through the granule cells without any evidence of recoding: Jorntell and Ekerot (2006) Journal of Neuroscience; and this paper shows that these granule inputs are powerfully transmitted to the molecular layer even in a decerebrated animal (i.e. where only the ascending sensory pathways remains) Jorntell and Ekerot 2002, Neuron.

I could not find any description of the neuron model used in this paper, so I assume that the neurons are just modelled as linear summators with a threshold (in fact, Figure 5 mentions inhibition, but this appears to be just one big lump inhibition, which basically is an incorrect implementation). In reality, granule cells of course do have specific properties that can impact the input-output transformation, PARTICULARLY with respect to the comparison of sparse versus dense coding, because the low-pass filtering of input that occurs in granule cells (and other neurons) as well as their spike firing stochasticity (Saarinen et al. (2008). Stochastic differential equation model for cerebellar granule cell excitability. PLoS Comput. Biol. 4:e1000004) will profoundly complicate these comparisons and make them less straight forward than what is portrayed in this paper. There are also several other factors that would be present in the biological setting but are lacking here, which makes it doubtful how much information in relation to the biological performance that this modelling study provides:

What are the types of activity patterns of the inputs? What are the learning rules? What is the topography? What is the impact of Purkinje cell outputs downstream, as the Purkinje cell output does not have any direct action, it acts on the deep cerebellar nuclear neurons, which in turn act on a complex sensorimotor circuitry to exert their effect, hence predictive coding could only become interpretable after the PC output has been added to the activity in those circuits. Where is the differentiated Golgi cell inhibition?

The problem of these, in my impression, generic, arbitrary settings of the neurons and the network in the model becomes obvious here: 'In contrast to the dense activity in cerebellar granule cells, odor responses in Kenyon cells, the analogs of granule cells in the Drosophila mushroom body, are sparse…' How can this system be interpreted as an analogy to granule cells in the mammalian cerebellum when the model does not address the specifics lined up above? I.e. the 'inductive bias' that the authors speak of, defined as 'the tendency of a network toward learning particular types of input-output mappings', would be highly dependent on the specifics of the network model.

More detailed comments:

Abstract:

'In these models [Marr-Albus], granule cells form a sparse, combinatorial encoding of diverse sensorimotor inputs. Such sparse representations are optimal for learning to discriminate random stimuli.' Yes, I would agree with the first part, but I contest the second part of this statement. I think what is true for sparse coding is that the learning of random stimuli will be faster, as in a perceptron, but not necessarily better. As the sparsification essentially removes information, it could be argued that the quality of the learning is poorer. So from that perspective, it is not optimal. The authors need to specify from what perspective they consider sparse representations optimal for learning.

Introduction:

'Indeed, several recent studies have reported dense activity in cerebellar granule cells in response to sensory stimulation or during motor control tasks (Knogler et al., 2017; Wagner et al., 2017; Giovannucci et al., 2017; Badura and De Zeeuw, 2017; Wagner et al., 2019), at odds with classic theories (Marr, 1969; Albus, 1971).' In fact, this was precisely the issue that was addressed already by Jorntell and Ekerot (2006) Journal of Neuroscience. The conclusion was that these actual recordings of granule cells in vivo provided essentially no support for the assumptions in the Marr-Albus theories.

Results:

First para: There is no information about how the granule cells are modelled.

Second para: 'A typical assumption in computational theories of the cerebellar cortex is that inputs are randomly distributed in a high-dimensional space.' Yes, I agree, and this is in fact in conflict with the known topographical organization in the cerebellar cortex (see broader comment above). Mossy fiber inputs coding for closely related inputs are co-localized in the cerebellar cortex. I think for this model to be of interest from the point of view of the mammalian cerebellar cortex, it would need to pay more attention to this organizational feature.

https://doi.org/10.7554/eLife.82914.sa1

Author response

Essential revisions:

1) All reviewers had questions about specific details of the study (see below comments). I hope you can use these questions as a guideline to improve the clarity and accessibility of the paper. As suggested by reviewer #2, this may involve moving some details from the supplementary materials to the main manuscript.

As detailed in our response to Reviewer 2, we have moved a number of equations that clarify our model to the main text, as well as including Table 1 which provides details about model parameters. Furthermore, we have substantially extended our Discussion section on Assumptions and Limitations, which addresses comments on biological plausibility.

2) The current eLife assessment (see below) emphasizes the main limitations of the paper – namely the question of to what degree the conclusions of the paper would change if more specific (and biologically more plausible) details about the cerebellar circuit would be taken into account. I hope you will take the opportunity to respond in detail to these points in your response to the reviewer document, which will be posted alongside the revised version of the manuscript and cover the main issues in the Discussion section of the main document.

As mentioned above, the Assumptions and Limitations section section now includes an extended discussion of several issues relating to biological plausibility, including randomness in connectivity onto granule cells, topographic organization of the cerebellar cortex, influence of Golgi cell inhibition, and several other topics. We have additionally added a figure supplement (Figure 7—figure supplement 2) showing that our qualitative results hold when learning is mediated by an online climbing fiber-dependent learning rule, under the assumption that climbing fibers encode an error signal.

Reviewer #1 (Recommendations for the authors):

This paper provides compelling and clear analyses that show that the coding level (sparsity) of the granule-cell layer of a cerebellar-like network does not only change the dimensionality of the representation, but also the inductive bias of the network: Higher sparsity biases the network to learning more high-frequency representations. Depending on the dominant frequencies of the target function, different coding levels are therefore optimal. The results are important/fundamental to theories of cerebellar learning and speak to a relevant ongoing debate in cerebellar learning, but will be of interest to readers outside of cerebellar neurophysiology.

We appreciate the Reviewer’s positive comments.

I had two problems in understanding the paper that the authors hopefully can clarify in the revision:

Page 8: Third paragraph: At this point in the text it is a bit unclear why the K(x * x') changes shape as shown in Figure 3c. I assume the shape of the kernel K depends on the activation function in the hidden layer. It may be useful for the reader if you could give an example of the Kernel for the specific case you are showing in Figure 3c.

We agree this point needed more clarity. Figure 3C shows the kernel for a rectified linear (ReLU) activation function. The key qualitative effect is the dependence of the kernel on the coding level f, and this effect is present across different choices of activation function (see Figure 2—figure supplement 2). We show the influence of coding level on the kernel visually in Figures 3a and 3b and in Equations 1 and 2. Equation 2 shows that the kernel depends on the hidden layer activation h(x), and Equation 1 shows h(x) depends on the threshold θ.

To communicate this idea more clearly, we added the following sentences (see bolded text) to Figure 3C’s caption:

(“C) Kernel K(x,x0) for networks with rectified linear activation functions (Equation 1), normalized so that fully overlapping representations have an overlap of 1, plotted as a function of overlap in the space of task variables. The vertical axis corresponds to the ratio of the area of the purple region to the area of the red or blue regions in (B). Each curve corresponds to the kernel of an infinite-width network with a different coding level f.”

In the main text, in the last paragraph on page 8, we also added the following text:

Equations 1 and 2 show that the threshold θ, which determines the coding level, influences the kernel through its effect on the expansion layer activity h(x).

Page 11, 4th paragraph: Why is the distribution corr(Jeffi,Jeffj) uniform on -1 to 1? Since both are high-dimensional random vectors, should the correlation not be centered around 0? The target of uniform distribution needs to be better explained to make this analysis accessible.

We agree that the uniform distribution might be unexpected. This is in fact a consequence of our assumption in this figure that the points are drawn from a task subspace of dimension D = 3. For a unit sphere in D = 3 dimensions, the dot product between randomly chosen pairs of points is uniformly distributed on [−1,1]. For higher dimensions, the distribution, as the Reviewer expects, is centered at zero but decays away from zero (it is a type of β distribution). We have added a sentence to the figure caption stating this:

(“C) Distributions of synaptic weight correlations Corr  (jieff,jjeff), where Jeffi is the ith row of Jeff, for pairs of expansion layer neurons in networks with different numbers of input layer neurons N (colors) when K = 4 and D = 3. Black distribution corresponds to fully connected networks with Gaussian weights. We note that when D = 3, the distribution of correlations for random Gaussian weight vectors is uniform on [−1,1] as shown (for higher dimensions the distribution has a peak at 0).”Reviewer #2 (Recommendations for the authors):

Here, a simple model of cerebellar computation is used to study the dependence of task performance on input type: it is demonstrated that task performance and optimal representations are highly dependent on task and stimulus type. This challenges many standard models which use simple random stimuli and concludes that the granular layer is required to provide a sparse representation. This is a useful contribution to our understanding of cerebellar circuits, though, in common with many models of this type, the neural dynamics and circuit architecture are not very specific to the cerebellum, the model includes the feed-forward structure and the high dimension of the granule layer, but little else. This paper has the virtue of including tasks that are more realistic, but by the paper's own admission, the same model can be applied to the electrosensory lateral line lobe and it could, though it is not mentioned in the paper, be applied to the dentate gyrus and large pyramidal cells of CA3. The discussion does not include specific elements related to, for example, the dynamics of the Purkinje cells or the role of Golgi cells, and, in a way, the demonstration that the model can encompass different tasks and stimuli types is an indication of how abstract the model is. Nonetheless, it is useful and interesting to see a generalization of what has become a standard paradigm for discussing cerebellar function.

We appreciate the Reviewer’s positive comments. Regarding the simplifications of our model, we agree that we have taken a modeling approach that abstracts away certain details to permit comparisons across systems. We now include an in-depth discussion of our simplifying assumptions (Assumptions and Extensions section in the Discussion) and have further noted the possibility that other biophysical mechanisms we have not accounted for may also underlie differences across systems.

Our results predict that qualitative differences in the coding levels of cerebellum-like systems, across brain regions or across species, reflect an optimization to distinct tasks (Figure 7). However, it is also possible that differences in coding level arise from other physiological differences between systems.

I was impressed by the clarity of this manuscript. My only comment is that I found too much was deferred to the appendix, I thought the bits and pieces in the appendix were very clarifying, and given the appendix contains short pieces of extra information explaining the exact nature of the models and tasks, the manuscript would have been easier to follow and think about if these had just been integrated into the text. Often you read papers and wish some details had been shifted into an appendix since they distract from the flow of the description, but the opposite is true here, integrating the details into the text would've made it more concrete in a useful way and the tables of parameters, as tables, would not have interrupted the flow while making it much easier to see the scale of the models and their architecture.

1) When we introduce the model in the Results section, we clarify that we use ReLU activation functions throughout the study.

The activity of neurons in the expansion layer is given by:

h = φ(Jeffx − θ),

where φ is a rectified linear activation function φ(u) = max(u,0) applied element-wise. Our results also hold for other threshold-polynomial activation functions.

2) We provide more specifics about the random categorization task for Figure 2:

The former we refer to as a “random categorization task” and is parameterized by the number of input pattern-to-category associations P learned during training (Figure 2C). During the training phase, the network learns to associate random input patterns xµ ∈ RD for µ = 1,…,P with random binary categories yµ = ±1. The elements of xµ are drawn i.i.d. from a normal distribution with mean 0 and variance 1/D. We refer to xµ as “training patterns.” To assess the network’s generalization performance, it is presented with “test patterns” generated by adding noise (parameterized by a noise magnitude; see Methods) to the training patterns. Tasks with continuous outputs (Figure 2D) are parameterized by a length scale that determines how quickly the output changes as a function of the input (specifically, input-output functions are drawn from a Gaussian process with length scale γ for variations in f(x) as a function of x; see Methods). In this case, both training and test patterns are drawn uniformly on the unit sphere. Later, we will also consider tasks implemented by specific cerebellum-like systems. See Table 1 for a summary of parameters throughout this study.

3) We specify the readouts and the performance metrics we use:

We trained the readout to approximate the target output for training patterns and generalize to unseen test patterns. The network’s prediction is fˆ(x) = w·h(x) for tasks with continuous outputs, or fˆ(x) = sign(w·h(x)) for categorization tasks, where w are the synaptic weights of the readout from the expansion layer. These weights were set using least squares regression. Performance was measured as the fraction of incorrect predictions for categorization tasks, or relative mean squared error for tasks with continuous targets: Error = E[(f(x)f(x))2]E[f(x)2], where the expectation is across test patterns.

4) To clarify differences in model and task parameters used across the different figures, we have moved the table of parameters to the main text. The main text references this table after introducing the model and the tasks used in Figure 2.

Reviewer #3 (Recommendations for the authors):

The paper by Xie et al. is a modelling study of the mossy fiber-to-granule cell-to-Purkinje cell network, reporting that the optimal type of representations in the cerebellar granule cell layer depends on the type task. The paper stresses that the findings indicate a higher overall bias towards dense representations than stated in the literature, but it appears the authors have missed parts of the literature that already reported on this. While the modelling and analysis appear mathematically solid, the model is lacking many known constraints of the cerebellar circuitry, which makes the applicability of the findings to the biological counterpart somewhat limited.

We thank the Reviewer for suggesting additional references to include in our manuscript, and for encouraging us to extend our model toward greater biological plausibility and more critically discuss simplifying assumptions we have made. We respond to both the comment about previous literature and about applicability to cerebellar circuitry in detail below.

I have some concerns with the novelty of the main conclusion, here from the abstract:

'Here, we generalize theories of cerebellar learning to determine the optimal granule cell representation for tasks beyond random stimulus discrimination, including continuous input-output transformations as required for smooth motor control. We show that for such tasks, the optimal granule cell representation is substantially denser than predicted by classic theories.'

Stated like this, this has in principle already been shown, i.e. for example:

Spanne and Jorntell (2013) Processing of multi-dimensional sensorimotor information in the spinal and cerebellar neuronal circuitry: a new hypothesis. PLoS Comput Biol. 9(3):e1002979.

Indeed, even the 2 DoF arm movement control that is used in the present paper as an application, was used in this previous paper, with similar conclusions with respect to the advantage of continuous input-output transformations and dense coding. Thus, already from the beginning of this paper, the novelty aspect of this paper is questionable. Even the conclusion in the last paragraph of the Introduction: 'We show that, when learning input-output mappings for motor control tasks, the optimal granule cell representation is much denser than predicted by previous analyses.' was in principle already shown by this previous paper.

We thank the Reviewer for drawing our attention to Spanne and Jo¨rntell (2013). Our study shares certain similarities with this work, including the consideration of tasks with smooth input-output mappings, such as learning the dynamics of a two-joint arm. However, our study differs substantially, most notably the fact that we focus our study on parametrically varying the degree of sparsity in the granule cell layer to determine the circumstances under which dense versus sparse coding is optimal. To the best of our ability, we can find no result in Spanne and J¨orntell (2013) that indicates the performance of a network as a function of average coding level. Instead, Spanne and Jo¨rntell (2013) propose that inhibition from Golgi cells produces heterogeneity in coding level which can improve performance, which is an interesting but complementary finding to ours. We therefore do not believe that the quantitative computations of optimal coding level that we present are redundant with the results of this previous study. We also note that a key contribution of our study is mathemetical analysis of the inductive bias of networks with different coding levels which supports our conclusions.

We have included a discussion of Spanne and Jo¨rntell (2013) and (2015) in the revised version of our manuscript:

“Other studies have considered tasks with smooth input-output mappings and low-dimensional inputs, finding that heterogeneous Golgi cell inhibition can improve performance by diversifying individual granule cell thresholds (Spanne and J¨orntell, 2013). Extending our model to include heterogeneous thresholds is an interesting direction for future work. Another proposal states that dense coding may improve generalization (Spanne and Jo¨rntell, 2015). Our theory reveals that whether or not dense coding is beneficial depends on the task.”

However, the present paper does add several more specific investigations/characterizations that were not previously explored. Many of the main figures report interesting new model results. However, the model is implemented in a highly generic fashion. Consequently, the model relates better to general neural network theory than to specific interpretations of the function of the cerebellar neuronal circuitry. One good example is the findings reported in Figure 2. These represent an interesting extension to the main conclusion, but they are also partly based on arbitrariness as the type of mossy fiber input described in the random categorization task has not been observed in the mammalian cerebellum under behavior in vivo, whereas in contrast, the type of input for the motor control task does resemble mossy fiber input recorded under behavior (van Kan et al. 1993).

We agree that the tasks we consider in Figure 2 are simplified compared to those that we consider elsewhere in the paper. The choice of random mossy fiber input was made to provide a comparison to previous modeling studies that also use random input as a benchmark (Marr 1969, Albus 1971, Brunel 2004, Babadi and Sompolinsky 2014, Billings 2014, LitwinKumar et al., 2017). This baseline permits us to specifically evaluate the effects of lowdimensional inputs (Figure 2) and richer input-output mappings (Figure 2, Figure 7). We agree with the Reviewer that the random and uncorrelated mossy fiber activity that has been extensively used in previous studies is almost certainly an unrealistic idealization of in vivo neural activity—this is a motivating factor for our study, which relaxes this assumption and examines the consequences. To provide additional context, we have updated the following paragraph in the main text Results section:

“A typical assumption in computational theories of the cerebellar cortex is that inputs are randomly distributed in a high-dimensional space (Marr, 1969; Albus, 1971; Brunel et al., 2004; Babadi and Sompolinsky, 2014; Billings et al., 2014; Litwin-Kumar et al., 2017). While this may be a reasonable simplification in some cases, many tasks, including cerebellumdependent tasks, are likely best-described as being encoded by a low-dimensional set of variables. For example, the cerebellum is often hypothesized to learn a forward model for motor control (Wolpert et al., 1998), which uses sensory input and motor efference to predict an effector’s future state. Mossy fiber activity recorded in monkeys correlates with position and velocity during natural movement (van Kan et al., 1993). Sources of motor efference copies include motor cortex, whose population activity lies on a lowdimensional manifold (Wagner et al., 2019; Huang et al., 2013; Churchland et al., 2010; Yu et al., 2009). We begin by modeling the low dimensionality of inputs and later consider more specific tasks.”

The overall conclusion states:

'Our results….suggest that optimal cerebellar representations are task-dependent.'

This is not a particularly strong or specific conclusion. One could interpret this statement as simply saying: ' if I construct an arbitrary neural network, with arbitrary intrinsic properties in neurons and synapses, I can get outputs that depend on the intensity of the input that I provide to that network.'

Further, the last sentence of the Introduction states: 'More broadly, we show that the sparsity of a neural code has a task-dependent influence on learning…' This is very general and unspecific, and would likely not come as a surprise to anyone interested in the analysis of neural networks. It doesn't pinpoint any specific biological problem but just says that if I change the density of the input to a [generic] network, then the learning will be impacted in one way or another.

We agree with the Reviewer that our conclusions are quite general, and we have removed the final sentence as we agree it was unspecific. However, we disagree with the Reviewer’s paraphrasing of our results.

First, we do not select arbitrary intrinsic properties of neurons and synapses. Rather, we construct a simplified model with a key quantity, the neuronal threshold, that we vary parametrically in order to assess the effect of the resulting changes in the representation on performance. Second, we do not vary the intensity/density of inputs provided to the network – this is fixed throughout our study for all key comparisons we perform. Instead, we vary the density (coding level) of the expansion layer representation and quantify its effect on inductive bias and generalization. Finally, our study’s key contribution is an explanation of the heterogeneity in average coding level observed across behaviors and cerebellum-like systems. We go beyond the empirical statement that there is a dependence of performance on the parameter that we vary by developing an analytical theory. Our theory describes the performance of the class of networks that we study and the properties of learning tasks that determine the optimal expansion layer representation.

To clarify our main contributions, we have updated the final paragraph of the Introduction. We have also removed the sentence that the Reviewer objects to, as it was less specific than the other points we make here.

We propose that these differences can be explained by the capacity of representations with different levels of sparsity to support learning of different tasks. We show that the optimal level of sparsity depends on the structure of the input-output relationship of a task. When learning input-output mappings for motor control tasks, the optimal granule cell representation is much denser than predicted by previous analyses. To explain this result, we develop an analytic theory that predicts the performance of cerebellum-like circuits for arbitrary learning tasks. The theory describes how properties of cerebellar architecture and activity control these networks’ inductive bias: the tendency of a network toward learning particular types of input-output mappings (Sollich, 1998; Jacot et al., 2018; Bordelon et al., 2020; Canatar et al., 2021; Simon et al., 2021). The theory shows that inductive bias, rather than the dimension of the representation alone, is necessary to explain learning performance across tasks. It also suggests that cerebellar regions specialized for different functions may adjust the sparsity of their granule cell representations depending on the task.

The interpretation of the distribution of the mossy fiber inputs to the granule cells, which would have a crucial impact on the results of a study like this, is likely incorrect. First, unlike the papers that the authors cite, there are many studies indicating that there is a topographic organization in the mossy fiber termination, such that mossy fibers from the same inputs, representing similar types of information, are regionally co-localized in the granule cell layer. Hence, there is no support for the model assumption that there is a predominantly random termination of mossy fibers of different origins. This risks invalidating the comparisons that the authors are making, i.e. such as in Figure 3. This is a list of example papers, there are more:

van Kan, Gibson and Houk (1993) Movement-related inputs to intermediate cerebellum of the monkey. Journal of Neurophysiology.

Garwicz et al. (1998) Cutaneous receptive fields and topography of mossy fibres and climbing fibres projecting to cat cerebellar C3 zone. The Journal of Physiology.

Brown and Bower (2001) Congruence of mossy fiber and climbing fiber tactile projections in the lateral hemispheres of the rat cerebellum. The Journal of Comparative Neurology.

Na, Sugihara, Shinoda (2019) The entire trajectories of single pontocerebellar axons and their lobular and longitudinal terminal distribution patterns in multiple aldolase C-positive compartments of the rat cerebellar cortex. The Journal of Comparative Neurology.

The nature of the mossy fiber-granule cell recording is also reviewed here:

Gilbert and Miall (2022) How and Why the Cerebellum Recodes Input Signals: An Alternative to Machine Learning. The Neuroscientist

Further, considering the recoding idea, the following paper shows that detailed information, as it is provided by mossy fibers, is transmitted through the granule cells without any evidence of recoding: Jorntell and Ekerot (2006) Journal of Neuroscience; and this paper shows that these granule inputs are powerfully transmitted to the molecular layer even in a decerebrated animal (i.e. where only the ascending sensory pathways remains) Jorntell and Ekerot 2002, Neuron.

We agree that there is strong evidence for a topographic organization in mossy fiber to granule cell connectivity at the microzonal level. We thank the Reviewer for pointing us to specific examples. We acknowledge that our simplified model does not capture the structure of connectivity observed in these studies.

However, the focus of our model is on cerebellar neurons presynaptic to a single Purkinje cell. Random or disordered distribution of inputs at this local scale is compatible with topographic organization at the microzonal scale. Furthermore, while there is evidence of structured connections at the local scale, models with random connectivity are able to reproduce the dimensionality of granule cell activity within a small margin of error (Nguyen et al., 2022). Finally, our finding that dense codes are optimal for learning slowly varying tasks is consistent with evidence for the lack of re-coding – for such tasks, re-coding may absent because it is not required.

We have dedicated a section on this issue in the Assumptions and Extensions portion of our Discussion:

“Another key assumption concerning the granule cells is that they sample mossy fiber inputs randomly, as is typically assumed in Marr-Albus models (Marr, 1969; Albus, 1971; LitwinKumar et al., 2017; Cayco-Gajic et al., 2017). Other studies instead argue that granule cells sample from mossy fibers with highly similar receptive fields (Garwicz et al., 1998; Brown and Bower, 2001; J¨orntell and Ekerot, 2006) defined by the tuning of mossy fiber and climbing fiber inputs to cerebellar microzones (Apps et al., 2018). This has led to an alternative hypothesis that granule cells serve to relay similarly tuned mossy fiber inputs and enhance their signal-to-noise ratio (Jo¨rntell and Ekerot, 2006; Gilbert and Chris Miall, 2022) rather than to re-encode inputs. Another hypothesis is that granule cells enable Purkinje cells to learn piece-wise linear approximations of nonlinear functions (Spanne and J¨orntell, 2013). However, several recent studies support the existence of heterogeneous connectivity and selectivity of granule cells to multiple distinct inputs at the local scale (Huang et al., 2013; Ishikawa et al., 2015). Furthermore, the deviation of the predicted dimension in models constrained by electron-microscopy data as compared to randomly wired models is modest (Nguyen et al., 2022). Thus, topographically organized connectivity at the macroscopic scale may coexist with disordered connectivity at the local scale, allowing granule cells presynaptic to an individual Purkinje cell to sample heterogeneous combinations of the subset of sensorimotor signals relevant to the tasks that Purkinje cell participates in. Finally, we note that the optimality of dense codes for learning slowly varying tasks in our theory suggests that observations of a lack of mixing (J¨orntell and Ekerot, 2002) for such tasks are compatible with Marr-Albus models, as in this case nonlinear mixing is not required.”

I could not find any description of the neuron model used in this paper, so I assume that the neurons are just modelled as linear summators with a threshold (in fact, Figure 5 mentions inhibition, but this appears to be just one big lump inhibition, which basically is an incorrect implementation). In reality, granule cells of course do have specific properties that can impact the input-output transformation, PARTICULARLY with respect to the comparison of sparse versus dense coding, because the low-pass filtering of input that occurs in granule cells (and other neurons) as well as their spike firing stochasticity (Saarinen et al. (2008). Stochastic differential equation model for cerebellar granule cell excitability. PLoS Comput. Biol. 4:e1000004) will profoundly complicate these comparisons and make them less straight forward than what is portrayed in this paper. There are also several other factors that would be present in the biological setting but are lacking here, which makes it doubtful how much information in relation to the biological performance that this modelling study provides:

What are the types of activity patterns of the inputs? What are the learning rules? What is the topography? What is the impact of Purkinje cell outputs downstream, as the Purkinje cell output does not have any direct action, it acts on the deep cerebellar nuclear neurons, which in turn act on a complex sensorimotor circuitry to exert their effect, hence predictive coding could only become interpretable after the PC output has been added to the activity in those circuits. Where is the differentiated Golgi cell inhibition?

Thank you for these critiques. We have made numerous edits to improve the presentation of the details of our model in the main text of the manuscript. Indeed, granule cells in the main text are modeled as linear sums of mossy fiber inputs with a threshold-linear activation function. A more detailed description of the model for granule cells can now be found in Equation 1 in the Results section:

“The activity of neurons in the expansion layer is given by:

h = φ(Jeffx − θ), (1)

where φ is a rectified linear activation function φ(u) = max(u,0) applied element-wise. Our results also hold for other threshold-polynomial activation functions. The scalar threshold θ is shared across neurons and controls the coding level, which we denote by f, defined as the average fraction of neurons in the expansion layer that are active.”

Most of our analyses use the firing rate model we describe above, but several Supplemental Figures show extensions to this model. As we mention in the Discussion, our results do not depend on the specific choice of nonlinearity (Figure 2—figure supplement 2). We have also considered the possibility that the stochastic nature of granule cell spikes could impact our measures of coding level. In Figure 7—figure supplement 1 we test the robustness of our main conclusion using a spiking model where we model granule cell spikes with Poisson statistics. When measuring coding level in a population of spiking neurons, a key question is at what time window the Purkinje cell integrates spikes. For several choices of integration time windows, we show that dense coding remains optimal for learning smooth tasks. However, we agree with the Reviewer that there are other biological details our model does not address. For example, our spiking model does not capture some of the properties the Saarinen et al. (2008) model captures, including random sub-threshold oscillations and clusters of spikes. Modeling biophysical phenomena at this scale is beyond the scope of our study. We have added this reference to the relevant section of the Discussion:

“We also note that coding level is most easily defined when neurons are modeled as rate, rather than spiking units. To investigate the consistency of our results under a spiking code, we implemented a model in which granule cell spiking exhibits Poisson variability and quantify coding level as the fraction of neurons that have nonzero spike counts (Figure 7—figure supplement 1; Figure 7C). In general, increased spike count leads to improved performance as noise associated with spiking variability is reduced. Granule cells have been shown to exhibit reliable burst responses to mossy fiber stimulation (Chadderton et al., 2004), motivating models using deterministic responses or sub-Poisson spiking variability. However, further work is needed to quantitatively compare variability in model and experiment and to account for more complex biophysical properties of granule cells (Saarinen et al., 2008).”

A second concern the Reviewer raises is our implementation of Golgi cell inhibition as a homogeneous rather than heterogeneous input onto granule cells. In simplified models, adding heterogeneous inhibition does not dramatically change the qualitative properties of the expansion layer representation, in particular the dimensionality of the representation (Billings et al., 2014, Cayco-Gajic et al., 2017, Litwin-Kumar et al., 2017). We have added a section about inhibition to our Discussion:

“We also have not explicitly modeled inhibitory input provided by Golgi cells, instead assuming such input can be modeled as a change in effective threshold, as in previous studies (Billings et al., 2014; Cayco-Gajic et al., 2017; Litwin-Kumar et al., 2017). This is appropriate when considering the dimension of the granule cell representation (Litwin-Kumar et al., 2017), but more work is needed to extend our model to the case of heterogeneous inhibition.”

Regarding the mossy fiber inputs, as we state in response to paragraph 3, we agree with the Reviewer that the random and uncorrelated mossy fiber activity that has been used in previous studies is an unrealistic idealization of in vivo neural activity. One of the motivations for our model was to relax this assumption and examine the consequences: we introduce correlations in the mossy fiber activity by projecting low-dimensional patterns into the mossy fiber layer (Figure 1B):

“A typical assumption in computational theories of the cerebellar cortex is that inputs are randomly distributed in a high-dimensional space (Marr, 1969; Albus, 1971; Brunel et al., 2004; Babadi and Sompolinsky, 2014; Billings et al., 2014; Litwin-Kumar et al., 2017). While this may be a reasonable simplification in some cases, many tasks, including cerebellumdependent tasks, are likely best-described as being encoded by a low-dimensional set of variables. For example, the cerebellum is often hypothesized to learn a forward model for motor control (Wolpert et al., 1998), which uses sensory input and motor efference to predict an effector’s future state. Mossy fiber activity recorded in monkeys correlates with position and velocity during natural movement (van Kan et al., 1993). Sources of motor efference copies include motor cortex, whose population activity lies on a low-dimensional manifold (Wagner et al., 2019; Huang et al., 2013; Churchland et al., 2010; Yu et al., 2009). We begin by modeling the low dimensionality of inputs and later consider more specific tasks.

We therefore assume that the inputs to our model lie on a D-dimensional subspace embedded in the N-dimensional input space, where D is typically much smaller than N (Figure 1B). We refer to this subspace as the “task subspace” (Figure 1C).”

The Reviewer also mentions the learning rule at granule cell to Purkinje cell synapses. We agree that considering online, climbing-fiber-dependent learning is an important generalization. We therefore added a new supplemental figure investigating whether we would still see a difference in optimal coding levels across tasks if online learning were used instead of the least squares solution (Figure 7—figure supplement 2). Indeed, we observed a similar task dependence as we saw in Figure 2F. We have added a new paragraph in the Discussion under Assumptions and Extensions describing our rationale and approach in detail:

“For the Purkinje cells, our model assumes that their responses to granule cell input can be modeled as an optimal linear readout. Our model therefore provides an upper bound to linear readout performance, a standard benchmark for the quality of a neural representation that does not require assumptions on the nature of climbing fiber-mediated plasticity, which is still debated. Electrophysiological studies have argued in favor of a linear approximation (Brunel et al., 2004). To improve the biological applicability of our model, we implemented an online climbing fiber-mediated learning rule and found that optimal coding levels are still task-dependent (Figure 7—figure supplement 2). We also note that although we model several timing-dependent tasks (Figure 7), our learning rule does not exploit temporal information, and we assume that temporal dynamics of granule cell responses are largely inherited from mossy fibers. Integrating temporal information into our model is an interesting direction for future investigation.

During each epoch of training, the network is presented with all patterns in a randomized order, and the learned weights are updated with each pattern (see Methods). Networks were presented with 30 patterns and trained for 20,000 epochs, with a learning rate of η = 0.7/M. Other parameters: D = 3,M = 10,000.

A) Performance of an example network during online learning, measured as relative mean squared error across training epochs. Parameters: f = 0.3, γ = 1.

B) Generalization error as a function of coding level for networks trained with online learning (solid lines) or unregularized least squares (dashed lines) for Gaussian process tasks with different length scales (colors). Standard error of the mean was computed across 20 realizations.”

Finally, regarding the function of the Purkinje cell, our model defines a learning task as a mapping from inputs to target activity in the Purkinje cell and is thus agnostic to the cell’s downstream effects. We clarify this point when introducing the definition of a learning task:

“In our model, a learning task is defined by a mapping from task variables x to an output f(x), representing a target change in activity of a readout neuron, for example a Purkinje cell. The limited scope of this definition implies our results should not strongly depend on the influence of the readout neuron on downstream circuits.”

The problem of these, in my impression, generic, arbitrary settings of the neurons and the network in the model becomes obvious here: 'In contrast to the dense activity in cerebellar granule cells, odor responses in Kenyon cells, the analogs of granule cells in the Drosophila mushroom body, are sparse…' How can this system be interpreted as an analogy to granule cells in the mammalian cerebellum when the model does not address the specifics lined up above? I.e. the 'inductive bias' that the authors speak of, defined as 'the tendency of a network toward learning particular types of input-output mappings', would be highly dependent on the specifics of the network model.

We agree with the Reviewer that our model makes several simplifying assumptions for mathematical tractability. However, we note that our study is not the first to draw analogies between cerebellum-like systems, including the mushroom body (Bell et al., 2008; Farris, 2011). All the systems we study feature a sparsely connected, expanded granule-like layer that sends parallel fiber axons onto densely connected downstream neurons known to exhibit powerful synaptic plasticity, thus motivating the key architectural assumptions of our model. We have constrained anatomical parameters of the model using data as available (Table 1). However, we agree with the Reviewer that when making comparisons across species there is always a possibility that differences are due to physiological mechanisms we have not fully understood or captured with a model. As such, we can only present a hypothesis for these differences. We have modified our Discussion section on this topic to clearly state this.

“Our results predict that qualitative differences in the coding levels of cerebellum-like systems, across brain regions or across species, reflect an optimization to distinct tasks (Figure 7). However, it is also possible that differences in coding level arise from other physiological differences between systems.”

More detailed comments:

Abstract:

'In these models [Marr-Albus], granule cells form a sparse, combinatorial encoding of diverse sensorimotor inputs. Such sparse representations are optimal for learning to discriminate random stimuli.' Yes, I would agree with the first part, but I contest the second part of this statement. I think what is true for sparse coding is that the learning of random stimuli will be faster, as in a perceptron, but not necessarily better. As the sparsification essentially removes information, it could be argued that the quality of the learning is poorer. So from that perspective, it is not optimal. The authors need to specify from what perspective they consider sparse representations optimal for learning.

This is an important point that we would like to clarify. It is not the case that sparse coding simply speeds up learning. In our study and many related works (Barak et al. 2013; Babadi and Sompolinsky 2014; Litwin-Kumar et al. 2017), learning performance is measured based on the generalization ability of the network – the ability to predict correct labels for previously unseen inputs. As our study and previous studies show, sparse codes are optimal in the sense that they minimize generalization error, independent of any effect on learning speed. To communicate this more effectively, we have added the following sentence to the first paragraph of the Introduction:

“Sparsity affects both learning speed (Cayco-Gajic et al., 2017), and generalization, the ability to predict correct labels for previously unseen inputs (Barak et al., 2013; Babadi and Sompolinsky, 2014; Litwin-Kumar et al., 2017).”

Introduction:

'Indeed, several recent studies have reported dense activity in cerebellar granule cells in response to sensory stimulation or during motor control tasks (Knogler et al., 2017; Wagner et al., 2017; Giovannucci et al., 2017; Badura and De Zeeuw, 2017; Wagner et al., 2019), at odds with classic theories (Marr, 1969; Albus, 1971).' In fact, this was precisely the issue that was addressed already by Jorntell and Ekerot (2006) Journal of Neuroscience. The conclusion was that these actual recordings of granule cells in vivo provided essentially no support for the assumptions in the Marr-Albus theories.

In our reading, the main finding of J¨orntell and Ekerot (2006) is that individual granule cells are activated by mossy fibers with overlapping receptive fields driven by a single type of somatosensory input. However, there is also evidence of nonlinear mixed selectivity in granule cells in support of the re-coding hypothesis (Huang et al., 2013; Ishikawa et al., 2015). Jo¨rntell and Ekerot (2006) also suggest that the granule cell layer shares similar topographic organization as mossy fibers, organized into microzones. The existence of topographic organization does not invalidate Marr-Albus theories. As we have suggested earlier, a local combinatorial expansion can coexist with a global topographic organization.

We have described these considerations in the Assumptions and Extensions portion of the Discussion:

“Another key assumption concerning the granule cells is that they sample mossy fiber inputs randomly, as is typically assumed in Marr-Albus models (Marr, 1969; Albus, 1971; LitwinKumar et al., 2017; Cayco-Gajic et al., 2017). Other studies instead argue that granule cells sample from mossy fibers with highly similar receptive fields (Garwicz et al., 1998; Brown and Bower, 2001; J¨orntell and Ekerot, 2006) defined by the tuning of mossy fiber and climbing fiber inputs to cerebellar microzones (Apps et al., 2018). This has led to an alternative hypothesis that granule cells serve to relay similarly tuned mossy fiber inputs and enhance their signal-to-noise ratio (Jo¨rntell and Ekerot, 2006; Gilbert and Chris Miall, 2022) rather than to re-encode inputs. Another hypothesis is that granule cells enable Purkinje cells to learn piece-wise linear approximations of nonlinear functions (Spanne and J¨orntell, 2013). However, several recent studies support the existence of heterogeneous connectivity and selectivity of granule cells to multiple distinct inputs at the local scale (Huang et al., 2013; Ishikawa et al., 2015). Furthermore, the deviation of the predicted dimension in models constrained by electron-microscopy data as compared to randomly wired models is modest (Nguyen et al., 2022). Thus, topographically organized connectivity at the macroscopic scale may coexist with disordered connectivity at the local scale, allowing granule cells presynaptic to an individual Purkinje cell to sample heterogeneous combinations of the subset of sensorimotor signals relevant to the tasks that Purkinje cell participates in. Finally, we note that the optimality of dense codes for learning slowly varying tasks in our theory suggests that observations of a lack of mixing (J¨orntell and Ekerot, 2002) for such tasks are compatible with Marr-Albus models, as in this case nonlinear mixing is not required.”

We have also included the Jo¨rntell and Ekerot (2006) study as a citation in the Introduction:

“Indeed, several recent studies have reported dense activity in cerebellar granule cells in response to sensory stimulation or during motor control tasks (Jo¨rntell and Ekerot, 2006; Knogler et al., 2017; Wagner et al., 2017; Giovannucci et al., 2017; Badura and De Zeeuw, 2017; Wagner et al., 2019), at odds with classic theories (Marr, 1969; Albus, 1971).”

Results:

First para: There is no information about how the granule cells are modelled.

We agree that this should information should have been more readily available. We now more completely describe the model in the main text. Our model for granule cells can be found in Equation 1 in the Results section and also the Methods (Network Model):

“The activity of neurons in the expansion layer is given by:

h = φ(Jeffx − θ), (2)

where φ is a rectified linear activation function φ(u) = max(u,0) applied element-wise. Our results also hold for other threshold-polynomial activation functions. The scalar threshold θ is shared across neurons and controls the coding level, which we denote by f, defined as the average fraction of neurons in the expansion layer that are active.”

Second para: 'A typical assumption in computational theories of the cerebellar cortex is that inputs are randomly distributed in a high-dimensional space.' Yes, I agree, and this is in fact in conflict with the known topographical organization in the cerebellar cortex (see broader comment above). Mossy fiber inputs coding for closely related inputs are co-localized in the cerebellar cortex. I think for this model to be of interest from the point of view of the mammalian cerebellar cortex, it would need to pay more attention to this organizational feature.

As we discuss in our response to paragraphs 5 and 6, we see the random distribution assumption at the local scale (inputs presynaptic to a single Purkinje cell) as being compatible with topographic organization occurring at the microzone scale. Furthermore, as discussed earlier, we specifically model low-dimensional input as opposed to the random and high-dimensional inputs typically studied in prior models.

“A typical assumption in computational theories of the cerebellar cortex is that inputs are randomly distributed in a high-dimensional space (Marr, 1969; Albus, 1971; Brunel et al., 2004; Babadi and Sompolinsky, 2014; Billings et al., 2014; Litwin-Kumar et al., 2017). While this may be a reasonable simplification in some cases, many tasks, including cerebellum dependent tasks, are likely best-described as being encoded by a low-dimensional set of variables. For example, the cerebellum is often hypothesized to learn a forward model for motor control (Wolpert et al., 1998), which uses sensory input and motor efference to predict an effector’s future state. Mossy fiber activity recorded in monkeys correlates with position and velocity during natural movement (van Kan et al., 1993). Sources of motor efference copies include motor cortex, whose population activity lies on a low-dimensional manifold (Wagner et al., 2019; Huang et al., 2013; Churchland et al., 2010; Yu et al., 2009). We begin by modeling the low dimensionality of inputs and later consider more specific tasks.

We therefore assume that the inputs to our model lie on a D-dimensional subspace embedded in the N-dimensional input space, where D is typically much smaller than N (Figure 1B).

We refer to this subspace as the “task subspace” (Figure 1C).”

References

Albus, J.S. (1971). A theory of cerebellar function. Mathematical Biosciences 10, 25–61.

Apps, R., et al. (2018). Cerebellar Modules and Their Role as Operational Cerebellar Processing Units. Cerebellum 17, 654–682.

Babadi, B. and Sompolinsky, H. (2014). Sparseness and expansion in sensory representations. Neuron 83, 1213–1226.

Badura, A. and De Zeeuw, C.I. (2017). Cerebellar granule cells: dense, rich and evolving representations. Current Biology 27, R415–R418.

Barak, O., Rigotti, M., and Fusi, S. (2013). The sparseness of mixed selectivity neurons controls the generalization–discrimination trade-off. Journal of Neuroscience 33, 3844– 3856.

Bell, C.C., Han, V., and Sawtell, N.B. (2008). Cerebellum-like structures and their implications for cerebellar function. Annual Review of Neuroscience 31, 1–24.

Billings, G., Piasini, E., Lo˝rincz, A., Nusser, Z., and Silver, R.A. (2014). Network structure within the cerebellar input layer enables lossless sparse encoding. Neuron 83, 960–974.

Bordelon, B., Canatar, A., and Pehlevan, C. (2020). Spectrum dependent learning curves in kernel regression and wide neural networks. International Conference on Machine Learning 1024–1034.

Brown, I.E. and Bower, J.M. (2001). Congruence of mossy fiber and climbing fiber tactile projections in the lateral hemispheres of the rat cerebellum. Journal of Comparative Neurology 429, 59–70.

Brunel, N., Hakim, V., Isope, P., Nadal, J.P., and Barbour, B. (2004). Optimal information storage and the distribution of synaptic weights: perceptron versus Purkinje cell. Neuron 43, 745–757.

Canatar, A., Bordelon, B., and Pehlevan, C. (2021). Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications 12, 1–12.

Cayco-Gajic, N.A., Clopath, C., and Silver, R.A. (2017). Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature Communications 8, 1–11.

Chadderton, P., Margrie, T.W., and Ha¨usser, M. (2004). Integration of quanta in cerebellar granule cells during sensory processing. Nature 428, 856–860.

Churchland, M.M., et al. (2010). Stimulus onset quenches neural variability: a widespread cortical phenomenon. Nature Neuroscience 13, 369–378.

Farris, S.M. (2011). Are mushroom bodies cerebellum-like structures? Arthropod structure and development 40, 368–379.

Garwicz, M., Jorntell, H., and Ekerot, C.F. (1998). Cutaneous receptive fields and topography of mossy fibres and climbing fibres projecting to cat cerebellar C3 zone. The Journal of Physiology 512 ( Pt 1), 277–293.

Gilbert, M. and Chris Miall, R. (2022). How and Why the Cerebellum Recodes Input Signals: An Alternative to Machine Learning. The Neuroscientist 28, 206–221.

Giovannucci, A., et al. (2017). Cerebellar granule cells acquire a widespread predictive feedback signal during motor learning. Nature Neuroscience 20, 727–734.

Huang, C.C., et al. (2013). Convergence of pontine and proprioceptive streams onto multimodal cerebellar granule cells. eLife 2, e00400.

Ishikawa, T., Shimuta, M., and Ha¨usser, M. (2015). Multimodal sensory integration in single cerebellar granule cells in vivo. eLife 4, e12916.

Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems 31.

Jo¨rntell, H. and Ekerot, C.F. (2002). Reciprocal Bidirectional Plasticity of Parallel Fiber Receptive Fields in Cerebellar Purkinje Cells and Their Afferent Interneurons. Neuron 34, 797–806.

Jo¨rntell, H. and Ekerot, C.F. (2006). Properties of Somatosensory Synaptic Integration in Cerebellar Granule Cells in vivo. Journal of Neuroscience 26, 11786–11797.

Knogler, L.D., Markov, D.A., Dragomir, E.I., Stih, V., and Portugues, R. (2017). Senso-ˇ rimotor representations in cerebellar granule cells in larval zebrafish are dense, spatially organized, and non-temporally patterned. Current Biology 27, 1288–1302.

Litwin-Kumar, A., Harris, K.D., Axel, R., Sompolinsky, H., and Abbott, L.F. (2017). Optimal degrees of synaptic connectivity. Neuron 93, 1153–1164.

Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology 202, 437–470.

Nguyen, T.M., et al. (2022). Structured cerebellar connectivity supports resilient pattern separation. Nature 1–7.

Saarinen, A., Linne, M.L., and Yli-Harja, O. (2008). Stochastic Differential Equation Model for Cerebellar Granule Cell Excitability. PLOS Computational Biology 4, e1000004.

Simon, J.B., Dickens, M., and DeWeese, M.R. (2021). A theory of the inductive bias and generalization of kernel regression and wide neural networks. arXiv: 2110.03922.

Sollich, P. (1998). Learning curves for Gaussian processes. Advances in Neural Information Processing Systems 11.

Spanne, A. and Jo¨rntell, H. (2013). Processing of Multi-dimensional Sensorimotor Information in the Spinal and Cerebellar Neuronal Circuitry: A New Hypothesis. PLOS Computational Biology 9, e1002979.

Spanne, A. and Jo¨rntell, H. (2015). Questioning the role of sparse coding in the brain. Trends in Neurosciences 38, 417–427.

van Kan, P.L., Gibson, A.R., and Houk, J.C. (1993). Movement-related inputs to intermediate cerebellum of the monkey. Journal of Neurophysiology 69, 74–94.

Wagner, M.J., Kim, T.H., Savall, J., Schnitzer, M.J., and Luo, L. (2017). Cerebellar granule cells encode the expectation of reward. Nature 544, 96–100.

Wagner, M.J., et al. (2019). Shared cortex-cerebellum dynamics in the execution and learning of a motor task. Cell 177, 669–682.e24.

Wolpert, D.M., Miall, R.C., and Kawato, M. (1998). Internal models in the cerebellum. Trends in Cognitive Sciences 2, 338–347.

Yu, B.M., et al. (2009). Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology 102, 614–635.

https://doi.org/10.7554/eLife.82914.sa2

Article and author information

Author details

  1. Marjorie Xie

    Zuckerman Mind Brain Behavior Institute, Columbia University, New York, United States
    Contribution
    Conceptualization, Investigation, Writing - original draft, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1456-4811
  2. Samuel P Muscinelli

    Zuckerman Mind Brain Behavior Institute, Columbia University, New York, United States
    Contribution
    Conceptualization, Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5256-2289
  3. Kameron Decker Harris

    Department of Computer Science, Western Washington University, Bellingham, United States
    Contribution
    Conceptualization, Investigation, Writing – review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-3716-6173
  4. Ashok Litwin-Kumar

    Zuckerman Mind Brain Behavior Institute, Columbia University, New York, United States
    Contribution
    Conceptualization, Investigation, Writing – review and editing
    For correspondence
    a.litwin-kumar@columbia.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2422-6576

Funding

National Institutes of Health (T32-NS06492)

  • Marjorie Xie

Simons Foundation

  • Samuel P Muscinelli

Swartz Foundation

  • Samuel P Muscinelli

Washington Research Foundation

  • Kameron Decker Harris

Burroughs Wellcome Fund

  • Ashok Litwin-Kumar

Gatsby Charitable Foundation (GAT3708)

  • Marjorie Xie

National Science Foundation (DBI-1707398)

  • Marjorie Xie

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors thank L F Abbott, N A Cayco-Gajic, N Sawtell, and S Fusi for helpful discussions and comments on the manuscript. The authors also thank S Muller for contributing data, code, and helpful discussions for analysis of ELL data. M X was supported by NIH grant T32-NS064929. S M was supported by the Simons and Swartz Foundations. K D H was supported by a grant from the Washington Research Foundation. A L-K was supported by the Simons and Burroughs Wellcome Foundations. M X, S M, and A L-K were also supported by the Gatsby Charitable Foundation and NSF NeuroNex award DBI-1707398.

Senior Editor

  1. Michael J Frank, Brown University, United States

Reviewing Editor

  1. Jörn Diedrichsen, Western University, Canada

Reviewers

  1. Jörn Diedrichsen, Western University, Canada
  2. Henrik Jörntell, Lund University, Sweden

Version history

  1. Preprint posted: August 15, 2022 (view preprint)
  2. Received: August 22, 2022
  3. Accepted: September 5, 2023
  4. Accepted Manuscript published: September 6, 2023 (version 1)
  5. Version of Record published: September 29, 2023 (version 2)

Copyright

© 2023, Xie et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 784
    Page views
  • 153
    Downloads
  • 2
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Marjorie Xie
  2. Samuel P Muscinelli
  3. Kameron Decker Harris
  4. Ashok Litwin-Kumar
(2023)
Task-dependent optimal representations for cerebellar learning
eLife 12:e82914.
https://doi.org/10.7554/eLife.82914

Share this article

https://doi.org/10.7554/eLife.82914

Further reading

    1. Computational and Systems Biology
    James D Brunner, Nicholas Chia
    Research Article

    The microbial community composition in the human gut has a profound effect on human health. This observation has lead to extensive use of microbiome therapies, including over-the-counter 'probiotic' treatments intended to alter the composition of the microbiome. Despite so much promise and commercial interest, the factors that contribute to the success or failure of microbiome-targeted treatments remain unclear. We investigate the biotic interactions that lead to successful engraftment of a novel bacterial strain introduced to the microbiome as in probiotic treatments. We use pairwise genome-scale metabolic modeling with a generalized resource allocation constraint to build a network of interactions between taxa that appear in an experimental engraftment study. We create induced sub-graphs using the taxa present in individual samples and assess the likelihood of invader engraftment based on network structure. To do so, we use a generalized Lotka-Volterra model, which we show has strong ability to predict if a particular invader or probiotic will successfully engraft into an individual's microbiome. Furthermore, we show that the mechanistic nature of the model is useful for revealing which microbe-microbe interactions potentially drive engraftment.

    1. Computational and Systems Biology
    2. Physics of Living Systems
    Nicholas M Boffi, Yipei Guo ... Ariel Amir
    Research Article

    The adaptive dynamics of evolving microbial populations takes place on a complex fitness landscape generated by epistatic interactions. The population generically consists of multiple competing strains, a phenomenon known as clonal interference. Microscopic epistasis and clonal interference are central aspects of evolution in microbes, but their combined effects on the functional form of the population’s mean fitness are poorly understood. Here, we develop a computational method that resolves the full microscopic complexity of a simulated evolving population subject to a standard serial dilution protocol. Through extensive numerical experimentation, we find that stronger microscopic epistasis gives rise to fitness trajectories with slower growth independent of the number of competing strains, which we quantify with power-law fits and understand mechanistically via a random walk model that neglects dynamical correlations between genes. We show that increasing the level of clonal interference leads to fitness trajectories with faster growth (in functional form) without microscopic epistasis, but leaves the rate of growth invariant when epistasis is sufficiently strong, indicating that the role of clonal interference depends intimately on the underlying fitness landscape. The simulation package for this work may be found at https://github.com/nmboffi/spin_glass_evodyn.