
Towards deep learning with segregated dendrites

  1. Jordan Guerguiev
  2. Timothy P Lillicrap
  3. Blake A Richards (corresponding author)
  1. University of Toronto Scarborough, Canada
  2. University of Toronto, Canada
  3. DeepMind, United Kingdom
  4. Canadian Institute for Advanced Research, Canada
Research Article
Cite as: eLife 2017;6:e22901 doi: 10.7554/eLife.22901

Abstract

Deep learning has led to significant advances in artificial intelligence, in part, by adopting strategies motivated by neurophysiology. However, it is unclear whether deep learning could occur in the real brain. Here, we show that a deep learning algorithm that utilizes multi-compartment neurons might help us to understand how the neocortex optimizes cost functions. Like neocortical pyramidal neurons, neurons in our model receive sensory information and higher-order feedback in electrotonically segregated compartments. Thanks to this segregation, neurons in different layers of the network can coordinate synaptic weight updates. As a result, the network learns to categorize images better than a single layer network. Furthermore, we show that our algorithm takes advantage of multilayer architectures to identify useful higher-order representations—the hallmark of deep learning. This work demonstrates that deep learning can be achieved using segregated dendritic compartments, which may help to explain the morphology of neocortical pyramidal neurons.

https://doi.org/10.7554/eLife.22901.001

eLife digest

Artificial intelligence has made major progress in recent years thanks to a technique known as deep learning, which works by mimicking the human brain. When computers employ deep learning, they learn by using networks made up of many layers of simulated neurons. Deep learning has opened the door to computers with human – or even super-human – levels of skill in recognizing images, processing speech and controlling vehicles. But many neuroscientists are skeptical about whether the brain itself performs deep learning.

The patterns of activity that occur in computer networks during deep learning resemble those seen in human brains. But some features of deep learning seem incompatible with how the brain works. Moreover, neurons in artificial networks are much simpler than our own neurons. For instance, in the region of the brain responsible for thinking and planning, most neurons have complex tree-like shapes. Each cell has ‘roots’ deep inside the brain and ‘branches’ close to the surface. By contrast, simulated neurons have a uniform structure.

To find out whether networks made up of more realistic simulated neurons could be used to make deep learning more biologically realistic, Guerguiev et al. designed artificial neurons with two compartments, similar to the ‘roots’ and ‘branches’. The network learned to recognize hand-written digits more easily when it had many layers than when it had only a few. This shows that artificial neurons more like those in the brain can enable deep learning. It even suggests that our own neurons may have evolved their shape to support this process.

If confirmed, the link between neuronal shape and deep learning could help us develop better brain-computer interfaces. These allow people to use their brain activity to control devices such as artificial limbs. Despite advances in computing, we are still superior to computers when it comes to learning. Understanding how our own brains show deep learning could thus help us develop better, more human-like artificial intelligence in the future.

https://doi.org/10.7554/eLife.22901.002

Introduction

Deep learning refers to an approach in artificial intelligence (AI) that utilizes neural networks with multiple layers of processing units. Importantly, deep learning algorithms are designed to take advantage of these multi-layer network architectures in order to generate hierarchical representations wherein each successive layer identifies increasingly abstract, relevant variables for a given task (Bengio and LeCun, 2007; LeCun et al., 2015). In recent years, deep learning has revolutionized machine learning, opening the door to AI applications that can rival human capabilities in pattern recognition and control (Mnih et al., 2015; Silver et al., 2016; He et al., 2015). Interestingly, the representations that deep learning generates resemble those observed in the neocortex (Kubilius et al., 2016; Khaligh-Razavi and Kriegeskorte, 2014; Cadieu et al., 2014), suggesting that something akin to deep learning is occurring in the mammalian brain (Yamins and DiCarlo, 2016; Marblestone et al., 2016).

Yet, a large gap exists between deep learning in AI and our current understanding of learning and memory in neuroscience. In particular, unlike deep learning researchers, neuroscientists do not yet have a solution to the ‘credit assignment problem’ (Rumelhart et al., 1986; Lillicrap et al., 2016; Bengio et al., 2015). Learning to optimize some behavioral or cognitive function requires a method for assigning ‘credit’ (or ‘blame’) to neurons for their contribution to the final behavioral output (LeCun et al., 2015; Bengio et al., 2015). The credit assignment problem refers to the fact that assigning credit in multi-layer networks is difficult, since the behavioral impact of neurons in early layers of a network depends on the downstream synaptic connections. For example, consider the behavioral effects of synaptic changes, that is long-term potentiation/depression (LTP/LTD), occurring between different sensory circuits of the brain. Exactly how these synaptic changes will impact behavior and cognition depends on the downstream connections between the sensory circuits and motor or associative circuits (Figure 1A). If a learning algorithm can solve the credit assignment problem then it can take advantage of multi-layer architectures to develop complex behaviors that are applicable to real-world problems (Bengio and LeCun, 2007). Despite its importance for real-world learning, the credit assignment problem, at the synaptic level, has received little attention in neuroscience.

The credit assignment problem in multi-layer neural networks.

(A) Illustration of the credit assignment problem. In order to take full advantage of the multi-circuit architecture of the neocortex when learning, synapses in earlier processing stages (blue connections) must somehow receive ‘credit’ for their impact on behavior or cognition. However, the credit due to any given synapse early in a processing pathway depends on the downstream synaptic connections that link the early pathway to later computations (red connections). (B) Illustration of weight transport in backpropagation. To solve the credit assignment problem, the backpropagation of error algorithm explicitly calculates the credit due to each synapse in the hidden layer by using the downstream synaptic weights when calculating the hidden layer weight changes. This solution works well in AI applications, but is unlikely to occur in the real brain.

https://doi.org/10.7554/eLife.22901.003

The lack of attention to credit assignment in neuroscience is, arguably, a function of the history of biological studies of synaptic plasticity. Due to the well-established dependence of LTP and LTD on presynaptic and postsynaptic activity, current theories of learning in neuroscience tend to emphasize Hebbian learning algorithms (Dan and Poo, 2004; Martin et al., 2000), that is, learning algorithms where synaptic changes depend solely on presynaptic and postsynaptic activity. Hebbian learning models can produce representations that resemble the representations in the real brain (Zylberberg et al., 2011; Leibo et al., 2017) and they are backed up by decades of experimental findings (Malenka and Bear, 2004; Dan and Poo, 2004; Martin et al., 2000). But, current Hebbian learning algorithms do not solve the credit assignment problem, nor do global neuromodulatory signals used in reinforcement learning (Lillicrap et al., 2016). As a result, deep learning algorithms from AI that can perform multi-layer credit assignment outperform existing Hebbian models of sensory learning on a variety of tasks (Yamins and DiCarlo, 2016; Khaligh-Razavi and Kriegeskorte, 2014). This suggests that a critical, missing component in our current models of the neurobiology of learning and memory is an explanation of how the brain solves the credit assignment problem.

However, the most common solution to the credit assignment problem in AI is to use the backpropagation of error algorithm (Rumelhart et al., 1986). Backpropagation assigns credit by explicitly using current downstream synaptic connections to calculate synaptic weight updates in earlier layers, commonly termed ‘hidden layers’ (LeCun et al., 2015) (Figure 1B). This technique, which is sometimes referred to as ‘weight transport’, involves non-local transmission of synaptic weight information between layers of the network (Lillicrap et al., 2016; Grossberg, 1987). Weight transport is clearly unrealistic from a biological perspective (Bengio et al., 2015; Crick, 1989). It would require early sensory processing areas (e.g. V1, V2, V4) to have precise information about billions of synaptic connections in downstream circuits (MT, IT, M2, EC, etc.). According to our current understanding, there is no physiological mechanism that could communicate this information in the brain. Some deep learning algorithms utilize purely Hebbian rules (Scellier and Bengio, 2016; Hinton et al., 2006). But, they depend on feedback synapses that are symmetric to feedforward synapses (Scellier and Bengio, 2016; Hinton et al., 2006), which is essentially a version of weight transport. Altogether, these artificial aspects of current deep learning solutions to credit assignment have rendered many scientists skeptical of the proposal that deep learning occurs in the real brain (Crick, 1989; Grossberg, 1987; Harris, 2008; Urbanczik and Senn, 2009).

Recent findings have shown that these problems may be surmountable, though. Lillicrap et al. (2016), Lee et al. (2015) and Liao et al. (2015) have demonstrated that it is possible to solve the credit assignment problem even while avoiding weight transport or symmetric feedback weights. The key to these learning algorithms is the use of feedback signals that convey enough information about credit to calculate local error signals in hidden layers (Lee et al., 2015; Lillicrap et al., 2016; Liao et al., 2015). With this approach it is possible to take advantage of multi-layer architectures, leading to performance that rivals backpropagation (Lee et al., 2015; Lillicrap et al., 2016; Liao et al., 2015). Hence, this work has provided a significant breakthrough in our understanding of how the real brain might do credit assignment.

Nonetheless, the models of Lillicrap et al. (2016), Lee et al. (2015) and Liao et al. (2015) involve some problematic assumptions. Specifically, although it is not directly stated in all of the papers, there is an implicit assumption that there is a separate feedback pathway for transmitting the information that determines the local error signals (Figure 2A). Such a pathway is required in these models because the error signal in the hidden layers depends on the difference between feedback that is generated in response to a purely feedforward propagation of sensory information, and feedback that is guided by a teaching signal (Lillicrap et al., 2016; Lee et al., 2015; Liao et al., 2015). In order to calculate this difference, sensory information must be transmitted separately from the feedback signals that are used to drive learning. In single compartment neurons, keeping feedforward sensory information separate from feedback signals is impossible without a separate pathway. At face value, such a pathway is possible. But closer inspection uncovers a couple of difficulties with such a proposal.

Potential solutions to credit assignment using top-down feedback.

(A) Illustration of the implicit feedback pathway used in previous models of deep learning. In order to assign credit, feedforward information must be integrated separately from any feedback signals used to calculate error for synaptic updates (the error is indicated here with δ). (B) Illustration of the segregated dendrites proposal. Rather than using a separate pathway to calculate error based on feedback, segregated dendritic compartments could receive feedback and calculate the error signals locally.

https://doi.org/10.7554/eLife.22901.004

First, the error signals that solve the credit assignment problem are not global error signals (like neuromodulatory signals used in reinforcement learning). Rather, they are cell-by-cell error signals. This would mean that the feedback pathway would require some degree of pairing, wherein each neuron in the hidden layer is paired with a feedback neuron (or circuit). That is not impossible, but there is no evidence to date of such an architecture in the neocortex. Second, the error signal in the hidden layer is signed (i.e. it can be positive or negative), and the sign determines whether LTP or LTD occur in the hidden layer neurons (Lee et al., 2015; Lillicrap et al., 2016; Liao et al., 2015). Communicating signed signals with a spiking neuron can theoretically be done by using a baseline firing rate that the neuron can go above (for positive signals) or below (for negative signals). But, in practice, such systems are difficult to operate with a single neuron, because as the error gets closer to zero any noise in the spiking of the neuron can switch the sign of the signal, which switches LTP to LTD, or vice versa. This means that as learning progresses the neuron’s ability to communicate error signs gets worse. It would be possible to overcome this by using many neurons to communicate an error signal, but this would then require many error neurons for each hidden layer neuron, which would lead to a very inefficient means of communicating errors. Therefore, the real brain’s specific solution to the credit assignment problem is unlikely to involve a separate feedback pathway for cell-by-cell, signed signals to instruct plasticity.
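The argument about sign noise can be illustrated with a small simulation. The sketch below is not from the paper: it assumes a hypothetical error neuron that encodes a signed error as a deviation from a baseline firing rate, and a reader that decodes the sign by comparing the spike count in a fixed window with the count expected at baseline. The baseline rate (50 Hz), the window (100 ms) and the function name are all illustrative assumptions.

```python
import random

def decoded_sign_error_rate(error_hz, baseline_hz=50.0, window_ms=100,
                            trials=2000, seed=0):
    """Fraction of trials in which the sign of a positive error is misread.

    The neuron fires at (baseline_hz + error_hz); spikes are drawn per 1 ms
    bin (a Bernoulli approximation to a Poisson process). The decoder reads
    the error as positive only if the spike count exceeds the count expected
    at baseline. All parameter values are assumptions for illustration.
    """
    rng = random.Random(seed)
    expected_at_baseline = baseline_hz * window_ms / 1000.0
    p_spike = (baseline_hz + error_hz) / 1000.0  # spike probability per 1 ms bin
    wrong = 0
    for _ in range(trials):
        count = sum(1 for _ in range(window_ms) if rng.random() < p_spike)
        decoded_positive = count > expected_at_baseline
        if decoded_positive != (error_hz > 0):
            wrong += 1
    return wrong / trials

# A large positive error is usually decoded with the right sign,
# whereas an error close to zero is misread on many trials.
far_from_zero = decoded_sign_error_rate(40.0)
near_zero = decoded_sign_error_rate(2.0)
```

Under these assumptions the misread fraction grows as the encoded error shrinks, which is the sense in which a single spiking neuron becomes a worse carrier of error sign as learning progresses.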

However, segregating the integration of feedforward and feedback signals does not require a separate pathway if neurons have more complicated morphologies than the point neurons typically used in artificial neural networks. Taking inspiration from biology, we note that real neurons are much more complex than single-compartments, and different signals can be integrated at distinct dendritic locations. Indeed, in the primary sensory areas of the neocortex, feedback from higher-order areas arrives in the distal apical dendrites of pyramidal neurons (Manita et al., 2015; Budd, 1998; Spratling, 2002), which are electrotonically very distant from the basal dendrites where feedforward sensory information is received (Larkum et al., 1999; 2007; 2009). Thus, as has been noted by previous authors (Körding and König, 2001; Spratling, 2002; Spratling and Johnson, 2006), the anatomy of pyramidal neurons may actually provide the segregation of feedforward and feedback information required to calculate local error signals and perform credit assignment in biological neural networks.

Here, we show how deep learning can be implemented if neurons in hidden layers contain segregated ‘basal’ and ‘apical’ dendritic compartments for integrating feedforward and feedback signals separately (Figure 2B). Our model builds on previous neural networks research (Lee et al., 2015; Lillicrap et al., 2016) as well as computational studies of supervised learning in multi-compartment neurons (Urbanczik and Senn, 2014; Körding and König, 2001; Spratling and Johnson, 2006). Importantly, we use the distinct basal and apical compartments in our neurons to integrate feedback signals separately from feedforward signals. With this, we build a local error signal for each hidden layer that ensures appropriate credit assignment. We demonstrate that even with random synaptic weights for feedback into the apical compartment, our algorithm can coordinate learning to achieve classification of the MNIST database of hand-written digits that is better than that which can be achieved with a single layer network. Furthermore, we show that our algorithm allows the network to take advantage of multi-layer structures to build hierarchical, abstract representations, one of the hallmarks of deep learning (LeCun et al., 2015). Our results demonstrate that deep learning can be implemented in a biologically feasible manner if feedforward and feedback signals are received at electrotonically segregated dendrites, as is the case in the mammalian neocortex.

Results

A network architecture with segregated dendritic compartments

Deep supervised learning with local weight updates requires that each neuron receive signals that can be used to determine its ‘credit’ for the final behavioral output. We explored the idea that the cortico-cortical feedback signals to pyramidal cells could provide the required information for credit assignment. In particular, we were inspired by four observations from both machine learning and biology:

  1. Current solutions to credit assignment without weight transport require segregated feedforward and feedback signals (Lee et al., 2015; Lillicrap et al., 2016).

  2. In the neocortex, feedforward sensory information and higher-order cortico-cortical feedback are largely received by distinct dendritic compartments, namely the basal dendrites and distal apical dendrites, respectively (Spratling, 2002; Budd, 1998).

  3. The distal apical dendrites of pyramidal neurons are electrotonically distant from the soma, and apical communication to the soma depends on active propagation through the apical dendritic shaft, which is predominantly driven by voltage-gated calcium channels. Due to the dynamics of voltage-gated calcium channels these non-linear, active events in the apical shaft generate prolonged upswings in the membrane potential, known as ‘plateau potentials’, which can drive burst firing at the soma (Larkum et al., 1999; 2009).

  4. Plateau potentials driven by apical activity can guide plasticity in pyramidal neurons in vivo (Bittner et al., 2015; Bittner et al., 2017).

With these considerations in mind, we hypothesized that the computations required for credit assignment could be achieved without separate pathways for feedback signals. Instead, they could be achieved by having two distinct dendritic compartments in each hidden layer neuron: a ‘basal’ compartment, strongly coupled to the soma for integrating bottom-up sensory information, and an ‘apical’ compartment for integrating top-down feedback in order to calculate credit assignment and drive synaptic plasticity via ‘plateau potentials’ (Bittner et al., 2015; Bittner et al., 2017) (Figure 3A).

Illustration of a multi-compartment neural network model for deep learning.

(A) Left: Reconstruction of a real pyramidal neuron from layer five mouse primary visual cortex. Right: Illustration of our simplified pyramidal neuron model. The model consists of a somatic compartment, plus two distinct dendritic compartments (apical and basal). As in real pyramidal neurons, top-down inputs project to the apical compartment while bottom-up inputs project to the basal compartment. (B) Diagram of network architecture. An image is used to drive spiking input units which project to the hidden layer basal compartments through weights W0. Hidden layer somata project to the output layer dendritic compartment through weights W1. Feedback from the output layer somata is sent back to the hidden layer apical compartments through weights Y. The variables for the voltages in each of the compartments are shown. The number of neurons used in each layer is shown in gray. (C) Illustration of transmit vs. plateau computations. Left: In the transmit computation, the network dynamics are updated at each time-step, and the apical dendrite is segregated by a low value for ga, making the network effectively feed-forward. Here, the voltages of each of the compartments are shown for one run of the network. The spiking output of the soma is also shown. Note that the somatic voltage and spiking track the basal voltage, and ignore the apical voltage. However, the apical dendrite does receive feedback, and this is used to drive its voltage. After a period of Δts to allow for settling of the dynamics, the average apical voltage is calculated (shown here as a blue line). Right: The average apical voltage is then used to calculate an apical plateau potential, which is equal to the nonlinearity σ() applied to the average apical voltage.

https://doi.org/10.7554/eLife.22901.005

As an initial test of this concept we built a network with a single hidden layer. Although this network is not very ‘deep’, even a single hidden layer can improve performance over a one-layer architecture if the learning algorithm solves the credit assignment problem (Bengio and LeCun, 2007; Lillicrap et al., 2016). Hence, we wanted to initially determine whether our network could take advantage of a hidden layer to reduce error at the output layer.

The network architecture is illustrated in Figure 3B. An image from the MNIST data set is used to set the spike rates of $\ell = 784$ Poisson point-process neurons in the input layer (one neuron per image pixel, rates-of-fire determined by pixel intensity). These project to a hidden layer with $m = 500$ neurons. The neurons in the hidden layer (which we index with a ‘0’) are composed of three distinct compartments with their own voltages: the apical compartments (with voltages described by the vector $\mathbf{V}^{0a}(t) = [V_1^{0a}(t), ..., V_m^{0a}(t)]$), the basal compartments (with voltages $\mathbf{V}^{0b}(t) = [V_1^{0b}(t), ..., V_m^{0b}(t)]$), and the somatic compartments (with voltages $\mathbf{V}^{0}(t) = [V_1^{0}(t), ..., V_m^{0}(t)]$). (Note: for notational clarity, all vectors and matrices in the paper are in boldface.) The voltage of the $i$th neuron in the hidden layer is updated according to:

(1) $\tau \frac{dV_i^0(t)}{dt} = -V_i^0(t) + \frac{g_b}{g_l}\left(V_i^{0b}(t) - V_i^0(t)\right) + \frac{g_a}{g_l}\left(V_i^{0a}(t) - V_i^0(t)\right)$

where $g_l$, $g_b$ and $g_a$ represent the leak conductance, the conductance from the basal dendrites, and the conductance from the apical dendrites, respectively, and $\tau = C_m/g_l$ where $C_m$ is the membrane capacitance (see Materials and methods, Equation (16)). For mathematical simplicity we assume in our simulations a resting membrane potential of 0 mV (this value does not affect the results). We implement electrotonic segregation in the model by altering the $g_a$ value: low values for $g_a$ lead to electrotonically segregated apical dendrites. In the initial set of simulations we set $g_a = 0$, which effectively makes it a feed-forward network, but we relax this condition in later simulations.
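As a rough illustration of Equation (1), the sketch below integrates the somatic voltage of a single hidden layer neuron with a forward-Euler step. The integration scheme, the step size, the function name and the parameter values are assumptions made for this example; the values actually used in the simulations are given in Materials and methods.

```python
def soma_step(v, v_basal, v_apical, dt, tau, g_l, g_b, g_a):
    """One forward-Euler step of Equation (1) for one hidden layer soma."""
    dv_dt = (-v
             + (g_b / g_l) * (v_basal - v)
             + (g_a / g_l) * (v_apical - v)) / tau
    return v + dt * dv_dt

# Illustrative (assumed) values: tau in ms, conductances in arbitrary units.
tau, g_l, g_b = 10.0, 0.1, 0.6
v_basal, v_apical = 1.0, 5.0

v = 0.0  # resting potential of 0 mV, as in the text
for _ in range(2000):  # 2000 steps of 0.1 ms = 200 ms
    v = soma_step(v, v_basal, v_apical, dt=0.1, tau=tau,
                  g_l=g_l, g_b=g_b, g_a=0.0)

# With g_a = 0, setting dV/dt = 0 in Equation (1) gives the steady state
# V = k * V_basal / (1 + k) with k = g_b / g_l, independent of V_apical.
k = g_b / g_l
steady_state = k * v_basal / (1.0 + k)
```

With $g_a = 0$ the apical voltage has no influence on the soma, which is the sense in which the apical compartment is segregated and the network is effectively feed-forward.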

We treat the voltages in the dendritic compartments simply as weighted sums of the incoming spike trains. Hence, for the ith hidden layer neuron:

(2) $V_i^{0b}(t) = \sum_{j=1}^{\ell} W_{ij}^0 s_j^{\mathrm{input}}(t) + b_i^0$
    $V_i^{0a}(t) = \sum_{j=1}^{n} Y_{ij} s_j^1(t)$

where $W_{ij}^0$ and $Y_{ij}$ are synaptic weights from the input layer and the output layer, respectively, $b_i^0$ is a bias term, and $\mathbf{s}^{\mathrm{input}}$ and $\mathbf{s}^1$ are the filtered spike trains of the input layer and output layer neurons, respectively. (Note: the spike trains are convolved with an exponential kernel to mimic postsynaptic potentials, see Materials and methods Equation (11).)
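The weighted sums in Equation (2) can be sketched as follows. The single-exponential kernel and its time constant `tau_s` are stand-ins chosen for this example; the kernel actually used is defined in Materials and methods, Equation (11). The function names and the toy weights are likewise hypothetical.

```python
import math

def filtered_spike_train(spike_times_ms, t_ms, tau_s=3.0):
    """Filtered spike train at time t: a sum of exponential kernels,
    one per past spike (assumed kernel shape and time constant)."""
    return sum(math.exp(-(t_ms - ts) / tau_s)
               for ts in spike_times_ms if ts <= t_ms)

def basal_voltage(w0_row, b0, s_input):
    """V_i^{0b}(t): weighted sum of filtered input spike trains plus bias."""
    return sum(w * s for w, s in zip(w0_row, s_input)) + b0

def apical_voltage(y_row, s_output):
    """V_i^{0a}(t): weighted sum of filtered output-layer spike trains."""
    return sum(y * s for y, s in zip(y_row, s_output))

# Toy example: three input units and two output units, evaluated at t = 3 ms.
s_input = [filtered_spike_train([0.0], 3.0),   # one spike 3 ms ago
           filtered_spike_train([], 3.0),      # no spikes
           filtered_spike_train([3.0], 3.0)]   # one spike right now
s_output = [filtered_spike_train([3.0], 3.0), 0.0]

v_basal = basal_voltage([0.5, 2.0, -1.0], 0.1, s_input)
v_apical = apical_voltage([0.8, -0.3], s_output)
```

Note that the apical sum has no bias term, matching Equation (2), and that the feedback weights in `y_row` play the role of one row of Y.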

The somatic compartments generate spikes using Poisson processes. The instantaneous rates of these processes are described by the vector $\boldsymbol{\phi}^0(t) = [\phi_1^0(t), ..., \phi_m^0(t)]$, which is in units of spikes/s or Hz. These rates-of-fire are determined by a non-linear sigmoid function, $\sigma(\cdot)$, applied to the somatic voltages, that is, for the $i$th hidden layer neuron:

(3) $\phi_i^0(t) = \phi_{\max}\,\sigma(V_i^0(t)) = \phi_{\max}\frac{1}{1 + e^{-V_i^0(t)}}$

where ϕmax is the maximum rate-of-fire for the neurons.
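A minimal sketch of Equation (3) and of Poisson spike generation is given below. It approximates the Poisson process by drawing at most one spike per small time bin, with probability equal to rate times bin width. The value of `PHI_MAX`, the bin width and the function names are assumptions for this example.

```python
import math
import random

PHI_MAX = 200.0  # Hz; assumed maximum rate-of-fire for illustration

def firing_rate(v, phi_max=PHI_MAX):
    """Equation (3): sigmoid of the somatic voltage scaled by phi_max."""
    return phi_max / (1.0 + math.exp(-v))

def poisson_spike_train(v, duration_ms, dt_ms, rng, phi_max=PHI_MAX):
    """Spike train (0/1 per bin) for a constant somatic voltage v.

    Bernoulli approximation to a Poisson process: the probability of a
    spike in a bin is rate * bin width, which requires small bins.
    """
    p_spike = firing_rate(v, phi_max) * dt_ms / 1000.0
    n_steps = int(round(duration_ms / dt_ms))
    return [1 if rng.random() < p_spike else 0 for _ in range(n_steps)]

rng = random.Random(0)
spikes = poisson_spike_train(v=0.0, duration_ms=10000.0, dt_ms=1.0, rng=rng)
empirical_rate = sum(spikes) / 10.0  # spikes per second over 10 s
```

At a somatic voltage of 0 the sigmoid equals one half, so the expected rate is half of `PHI_MAX`, and the empirical rate of the sampled train should fall close to that value.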

The output layer (which we index here with a ‘1’) contains $n = 10$ two-compartment neurons (one for each image category), similar to those used in a previous model of dendritic prediction learning (Urbanczik and Senn, 2014). The output layer dendritic voltages ($\mathbf{V}^{1b}(t) = [V_1^{1b}(t), ..., V_n^{1b}(t)]$) and somatic voltages ($\mathbf{V}^{1}(t) = [V_1^{1}(t), ..., V_n^{1}(t)]$) are updated in a similar manner to the hidden layer basal compartment and soma:

(4) $\tau \frac{dV_i^1(t)}{dt} = -V_i^1(t) + \frac{g_d}{g_l}\left(V_i^{1b}(t) - V_i^1(t)\right) + I_i(t)$
    $V_i^{1b}(t) = \sum_{j=1}^{m} W_{ij}^1 s_j^0(t) + b_i^1$

where Wij1 are synaptic weights from the hidden layer, s0 are the filtered spike trains of the hidden layer neurons (see Equation (11)), gl is the leak conductance, gd is the conductance from the dendrites, and τ is given by Equation (16). In addition to the absence of an apical compartment, the other salient difference between the output layer neurons and the hidden layer neurons is the presence of the term Ii(t), which is a teaching signal that can be used to force the output layer to the correct answer. Whether any such teaching signals exist in the real brain is unknown, though there is evidence that animals can represent desired behavioral outputs with internal goal representations (Gadagkar et al., 2016). (See below, and Materials and methods, Equations (19) and (20) for more details on the teaching signal).

In our model, there are two different types of computation that occur in the hidden layer neurons: ‘transmit’ and ‘plateau’. The transmit computations are standard numerical integration of the simulation, with voltages evolving according to Equation (1), and with the apical compartment electrotonically segregated from the soma (depending on ga) (Figure 3C, left). In contrast, the plateau computations do not involve numerical integration with Equation (1). Instead, the apical voltage is averaged over the most recent 20–30 ms period and the sigmoid non-linearity is applied to it, giving us ‘plateau potentials’ in the hidden layer neurons (we indicate plateau potentials with α, see Equation (5) below, and Figure 3C, right). The intention behind this design was to mimic the non-linear transmission from the apical dendrites to the soma that occurs during a plateau potential driven by calcium spikes in the apical dendritic shaft (Larkum et al., 1999), but in the simplest, most abstract formulation possible.

Importantly, plateau potentials in our simulations are single numeric values (one per hidden layer neuron) that can be used for credit assignment. We do not use them to alter the network dynamics. When they occur, they are calculated, transmitted to the basal dendrite instantaneously, and then stored temporarily (0–60 ms) for calculating synaptic weight updates.

Calculating credit assignment signals with feedback driven plateau potentials

To train the network we alternate between two phases. First, during the ‘forward’ phase we present an image to the input layer without any teaching current at the output layer ($I_i(t) = 0, \forall i$). The forward phase occurs between times $t_0$ and $t_1$. At $t_1$ a plateau potential is calculated in all the hidden layer neurons ($\boldsymbol{\alpha}^f = [\alpha_1^f, ..., \alpha_m^f]$) and the ‘target’ phase begins. During this phase, which lasts until $t_2$, the image continues to drive the input layer, but now the output layer also receives teaching current. The teaching current forces the correct output neuron to its max firing rate and all the others to silence. For example, if an image of a ‘9’ is presented, then over the time period $t_1$ to $t_2$ the ‘9’ neuron in the output layer fires at max, while the other neurons are silent (Figure 4A). At $t_2$ another set of plateau potentials ($\boldsymbol{\alpha}^t = [\alpha_1^t, ..., \alpha_m^t]$) is calculated in the hidden layer neurons. The result is that we have plateau potentials in the hidden layer neurons for both the end of the forward phase ($\boldsymbol{\alpha}^f$) and the end of the target phase ($\boldsymbol{\alpha}^t$), which are calculated as:

Illustration of network phases for learning.

(A) Illustration of the sequence of network phases that occur for each training example. The network undergoes a forward phase where $I_i(t) = 0, \forall i$ and a target phase where $I_i(t)$ causes any given neuron $i$ to fire at max-rate or be silent, depending on whether it is the correct category of the current input image. In this illustration, an image of a ‘9’ is being presented, so the ‘9’ unit at the output layer is activated and the other output neurons are inhibited and silent. At the end of the forward phase the set of plateau potentials $\boldsymbol{\alpha}^f$ are calculated, and at the end of the target phase the set of plateau potentials $\boldsymbol{\alpha}^t$ are calculated. (B) Illustration of phase length sampling. Each phase length is sampled stochastically. In other words, for each training image, the lengths of forward and target phases (shown as blue bar pairs, where bar length represents phase length) are randomly drawn from a shifted inverse Gaussian distribution with a minimum of 50 ms.

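https://doi.org/10.7554/eLife.22901.006

(5) $\alpha_i^f = \sigma\left(\frac{1}{\Delta t_1}\int_{t_1 - \Delta t_1}^{t_1} V_i^{0a}(t)\,dt\right)$
    $\alpha_i^t = \sigma\left(\frac{1}{\Delta t_2}\int_{t_2 - \Delta t_2}^{t_2} V_i^{0a}(t)\,dt\right)$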

where $\Delta t_s$ is a time delay used to allow the network dynamics to settle before integrating the plateau, and $\Delta t_i = t_i - (t_{i-1} + \Delta t_s)$ (see Materials and methods, Equation (22) and Figure 4A).
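Equation (5) can be sketched for a single hidden layer neuron as an average over a discretely sampled apical voltage trace, followed by the sigmoid. The sampling step, the settling delay of 30 ms and the toy trace below are assumptions for this example, and the integral is approximated by a plain mean over the samples in the window.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def plateau_potential(apical_trace, dt_ms, t_start_ms, t_end_ms, delta_ts_ms):
    """Equation (5) for one neuron and one phase.

    apical_trace[k] holds the apical voltage at time k * dt_ms. The voltage
    is averaged over the window [t_start + delta_ts, t_end), which has
    length Delta t_i = t_end - (t_start + delta_ts), then passed through
    the sigmoid.
    """
    first = int(round((t_start_ms + delta_ts_ms) / dt_ms))
    last = int(round(t_end_ms / dt_ms))
    window = apical_trace[first:last]
    return sigmoid(sum(window) / len(window))

# Toy trace sampled every 1 ms: a 30 ms transient followed by a settled
# value, once for the forward phase (0-50 ms) and once for the target
# phase (50-100 ms).
trace = [5.0] * 30 + [0.0] * 20 + [5.0] * 30 + [2.0] * 20

alpha_f = plateau_potential(trace, 1.0, t_start_ms=0.0, t_end_ms=50.0, delta_ts_ms=30.0)
alpha_t = plateau_potential(trace, 1.0, t_start_ms=50.0, t_end_ms=100.0, delta_ts_ms=30.0)
```

Because the first `delta_ts_ms` of each phase is excluded, the transient at the start of each phase does not contribute to either plateau potential.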

Similar to how targets are used in deep supervised learning (LeCun et al., 2015), the goal of learning in our network is to make the network dynamics during the forward phase converge to the same output activity pattern as exists in the target phase. Put another way, in the absence of the teaching signal, we want the activity at the output layer to be the same as that which would exist with the teaching signal, so that the network can give appropriate outputs without any guidance. To do this, we initialize all the weight matrices with random weights, then we train the weight matrices W0 and W1 using stochastic gradient descent on local loss functions for the hidden and output layers, respectively (see below). These weight updates occur at the end of every target phase, that is the synapses are not updated during transmission. Like Lillicrap et al. (2016), we leave the weight matrix Y fixed in its initial random configuration. When we update the synapses in the network we use the plateau potential values αf and αt to determine appropriate credit assignment (see below).

The network is simulated in near continuous-time (except that each plateau is considered to be instantaneous), and the temporal intervals between plateaus are randomly sampled from an inverse Gaussian distribution (Figure 4B, top). As such, the specific amount of time that the network is presented with each image and teaching signal is stochastic, though usually somewhere between 50–60 ms of simulated time (Figure 4B, bottom). This stochasticity was not necessary, but it demonstrates that although the system operates in phases, the specific length of the phases is not important as long as they are sufficiently long to permit integration (see Lemma 1). In the data presented in this paper, all 60,000 images in the MNIST training set were presented to the network one at a time, and each exposure to the full set of images was considered an ‘epoch’ of training. At the end of each epoch, the network’s classification error rate on a separate set of 10,000 test images was assessed with a single forward phase for each image (see Materials and methods). The network’s classification was judged by which output neuron had the highest average firing rate during these test image forward phases.
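The phase length sampling can be sketched as a draw from an inverse Gaussian distribution shifted by the 50 ms minimum mentioned in Figure 4B. The sampler below uses the standard transformation method of Michael, Schucany and Haas; the mean `mu` and shape `lam` are assumed values chosen so that most phases fall between 50 and 60 ms, and are not taken from the paper.

```python
import math
import random

def sample_inverse_gaussian(mu, lam, rng):
    """Draw one sample from an inverse Gaussian with mean mu and shape lam
    (transformation method with multiple roots)."""
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    x = (mu + (mu * mu * y) / (2.0 * lam)
         - (mu / (2.0 * lam)) * math.sqrt(4.0 * mu * lam * y + mu * mu * y * y))
    if rng.random() <= mu / (mu + x):
        return x
    return mu * mu / x

def sample_phase_length_ms(rng, min_ms=50.0, mu=2.0, lam=1.0):
    """Phase length: the minimum duration plus an inverse Gaussian draw.
    mu and lam are assumed values for illustration."""
    return min_ms + sample_inverse_gaussian(mu, lam, rng)

rng = random.Random(0)
phase_lengths = [sample_phase_length_ms(rng) for _ in range(20000)]
mean_length = sum(phase_lengths) / len(phase_lengths)
```

Each training image would draw two such lengths, one for the forward phase and one for the target phase.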

It is important to note that there are many aspects of this design that are not physiologically accurate. Most notably, stochastic generation of plateau potentials across a population is not an accurate reflection of how real pyramidal neurons operate, since apical calcium spikes are determined by a number of concrete physiological factors in individual cells, including back-propagating action potentials, spike-timing and inhibitory inputs (Larkum et al., 1999, 2007, 2009). However, we note that calcium spikes in the apical dendrites can be prevented from occurring via the activity of distal dendrite targeting inhibitory interneurons (Murayama et al., 2009), which can synchronize pyramidal activity (Hilscher et al., 2017). Furthermore, distal dendrite targeting interneurons can themselves be rapidly inhibited in response to temporally precise neuromodulatory inputs (Pi et al., 2013; Pfeffer et al., 2013; Karnani et al., 2016; Hangya et al., 2015; Brombas et al., 2014). Therefore, it is entirely plausible that neocortical micro-circuits would generate synchronized plateaus/bursts at punctuated periods of time in response to disinhibition of the apical dendrites governed by neuromodulatory signals that determine ‘phases’ of processing. Alternatively, oscillations in population activity could provide a mechanism for promoting alternating phases of processing and synaptic plasticity (Buzsáki and Draguhn, 2004). But complete synchrony of plateaus in our hidden layer neurons is not actually critical to our algorithm; only the temporal relationship between the plateaus and the teaching signal is critical. This relationship itself is arguably plausible given the role of neuromodulatory inputs in dis-inhibiting the distal dendrites of pyramidal neurons (Karnani et al., 2016; Brombas et al., 2014). Of course, we are engaged in a great deal of speculation here. But the point is that our model utilizes anatomical and functional motifs that are loosely analogous to what is observed in the neocortex. Importantly for the present study, the key issue is the use of segregated dendrites which permit an effective feed-forward dynamic, punctuated by feedback driven plateau potentials to solve the credit assignment problem.

Co-ordinating optimization across layers with feedback to apical dendrites

To solve the credit assignment problem without using weight transport, we had to define local error signals, or ‘loss functions’, for the hidden layer and output layer that somehow took into account the impact that each hidden layer neuron has on the output of the network. In other words, we only want to update a hidden layer synapse in a manner that will help us make the forward phase activity at the output layer more similar to the target phase activity. To begin, we define the target firing rates for the output neurons, $\phi^{1*} = [\phi^{1*}_1, \ldots, \phi^{1*}_n]$, to be their average firing rates during the target phase:

$$\phi^{1*}_i = \overline{\phi^1_i}^{\,t} = \frac{1}{\Delta t_2}\int_{t_1+\Delta t_s}^{t_2} \phi^1_i(t)\,dt \tag{6}$$

(Throughout the paper, we use $\phi^*$ to denote a target firing rate and $\overline{\phi}$ to denote a firing rate averaged over time.) We then define a loss function at the output layer using this target, by taking the difference between the average forward phase activity and the target:

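$$L^1 \approx \left\|\phi^{1*} - \overline{\phi^1}^{\,f}\right\|_2^2 = \left\|\overline{\phi^1}^{\,t} - \overline{\phi^1}^{\,f}\right\|_2^2 = \left\|\frac{1}{\Delta t_2}\int_{t_1+\Delta t_s}^{t_2} \phi^1(t)\,dt - \frac{1}{\Delta t_1}\int_{t_0+\Delta t_s}^{t_1} \phi^1(t)\,dt\right\|_2^2 \tag{7}$$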

(Note: the true loss function we use is slightly more complex than the one formulated here, hence the ≈ symbol in Equation (7), but this formulation is roughly correct and easier to interpret. See Materials and methods, Equation (23) for the exact formulation.) This loss function is zero only when the average firing rates of the output neurons during the forward phase equal their targets, that is, the average firing rates during the target phase. Thus, the closer $L^1$ is to zero, the more the network’s output for an image matches the output activity pattern imposed by the teaching signal, $I(t)$.
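As a rough numerical illustration of Equations (6) and (7), the following sketch computes the approximate output layer loss from a sampled firing-rate trace. It assumes a uniform time grid, so that the time integrals divided by the window length can be approximated by sample means; the function names and arguments are hypothetical, and the exact loss used in the simulations is the one in Materials and methods, Equation (23).

```python
import numpy as np

def mean_rate(rates, t, t_start, t_end):
    """Average firing rate of each unit over the window t_start <= t <= t_end.

    rates has shape (n_steps, n_units); t has shape (n_steps,) and is assumed
    to be uniformly spaced, so a sample mean approximates the time average.
    """
    mask = (t >= t_start) & (t <= t_end)
    return rates[mask].mean(axis=0)

def output_loss(rates, t, t0, t1, t2, dt_s):
    """Approximate output loss: squared distance between the target phase
    average and the forward phase average of the output firing rates."""
    phi_forward = mean_rate(rates, t, t0 + dt_s, t1)   # forward phase average
    phi_target = mean_rate(rates, t, t1 + dt_s, t2)    # target phase average
    return float(np.sum((phi_target - phi_forward) ** 2))
```

For example, if two output units fire at a constant rate of 1 during the forward phase and 3 during the target phase, the sketch returns 2 × (3 − 1)² = 8, and it returns 0 when the two phase averages coincide.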

Effective credit assignment is achieved when changing the hidden layer synapses is guaranteed to reduce $L^1$. To obtain this guarantee, we defined a set of target firing rates for the hidden layer neurons that uses the information contained in the plateau potentials. Specifically, in a similar manner to Lee et al. (2015), we define the target firing rates for the hidden layer neurons, $\phi^{0*} = [\phi^{0*}_1, \ldots, \phi^{0*}_m]$, to be:

$$\phi^{0*}_i = \overline{\phi^0_i}^{\,f} + \alpha^t_i - \alpha^f_i \tag{8}$$

where $\alpha^t_i$ and $\alpha^f_i$ are the plateaus defined in Equation (5). As with the output layer, we define the loss function for the hidden layer to be the difference between the target firing rate and the average firing rate during the forward phase:

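$$L^0 \approx \left\|\phi^{0*} - \overline{\phi^0}^{\,f}\right\|_2^2 = \left\|\overline{\phi^0}^{\,f} + \alpha^t - \alpha^f - \overline{\phi^0}^{\,f}\right\|_2^2 = \left\|\alpha^t - \alpha^f\right\|_2^2 \tag{9}$$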

(Again, note the use of the ≈ symbol; see Equation (30) for the exact formulation.) This loss function is zero only when the plateau at the end of the forward phase equals the plateau at the end of the target phase. Since the plateau potentials integrate the top-down feedback (see Equation (5)), we know that the hidden layer loss function, $L^0$, is zero if the output layer loss function, $L^1$, is zero. Moreover, we can show that these loss functions provide a broader guarantee that, under certain conditions, if $L^0$ is reduced, then on average, $L^1$ will also be reduced (see Theorem 1). This provides our assurance of credit assignment: we know that the ultimate goal of learning (reducing $L^1$) can be achieved by updating the synaptic weights at the hidden layer to reduce the local loss function $L^0$ (Figure 5A). We do this using stochastic gradient descent at the end of every target phase:

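$$\Delta W^1 = -\eta_0 \frac{\partial L^1}{\partial W^1}, \qquad \Delta W^0 = -\eta_1 \frac{\partial L^0}{\partial W^0} \tag{10}$$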
Figure 5 (with 1 supplement). Co-ordinated errors between the output and hidden layers.

(A) Illustration of output loss function ($L^1$) and local hidden loss function ($L^0$). For a given test example shown to the network in a forward phase, the output layer loss is defined as the squared norm of the difference between the target firing rates $\phi^{1*}$ and the average firing rate during the forward phases of the output units. Hidden layer loss is defined similarly, except the target is $\phi^{0*}$ (as defined in the text). (B) Plot of $L^1$ vs. $L^0$ for all of the ‘2’ images after one epoch of training. There is a strong correlation between hidden layer loss and output layer loss (real data, black), as opposed to when output and hidden loss values were randomly paired (shuffled data, gray). (C) Plot of correlation between hidden layer loss and output layer loss across training for each category of images (each dot represents one category). The correlation is significantly higher in the real data than the shuffled data throughout training. Note also that the correlation is much lower on the first epoch of training (red oval), suggesting that the conditions for credit assignment are still developing during the first epoch.

https://doi.org/10.7554/eLife.22901.007

where $\eta_i$ and $\Delta W^i$ refer to the learning rate and update term for weight matrix $W^i$ (see Materials and methods, Equations (28), (29), (33) and (35) for details of the weight update procedures). Performing gradient descent on $L^1$ results in a relatively straightforward delta rule update for $W^1$ (see Equation (29)). The weight update for the hidden layer weights, $W^0$, is similar, except for the presence of the difference between the two plateau potentials, $\alpha^t - \alpha^f$ (see Equation (35)). Importantly, given the way in which we defined the loss functions, as the hidden layer reduces $L^0$ by updating $W^0$, $L^1$ should also be reduced; that is, hidden layer learning should imply output layer learning, thereby utilizing the multi-layer architecture.

To test that we were successful in credit assignment with this design, and to provide empirical support for the proof of Theorem 1, we compared the loss function at the hidden layer, $L^0$, to the output layer loss function, $L^1$, across all of the image presentations to the network. We observed that, generally, whenever the hidden layer loss was low, the output layer loss was also low. For example, when we consider the loss for the set of ‘2’ images presented to the network during the second epoch, there was a Pearson correlation coefficient between $L^0$ and $L^1$ of $r = 0.61$, which was much higher than what was observed for shuffled data, wherein output and hidden activities were randomly paired (Figure 5B). Furthermore, these correlations were observed across all epochs of training, with most correlation coefficients for the hidden and output loss functions falling between $r = 0.2$ and $r = 0.6$, which was, again, much higher than the correlations observed for shuffled data (Figure 5C).
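The comparison between real and shuffled pairings of the two losses can be sketched as below. This is only an illustration of the analysis logic; the function name is hypothetical, and any data passed to it in an example would be synthetic rather than the values reported in Figure 5.

```python
import numpy as np

def loss_correlation(hidden_losses, output_losses, shuffle=False, rng=None):
    """Pearson correlation between per-image hidden and output losses.

    With shuffle=True the output losses are randomly re-paired with the hidden
    losses, which gives the chance-level baseline.
    """
    h = np.asarray(hidden_losses, dtype=float)
    o = np.asarray(output_losses, dtype=float)
    if shuffle:
        if rng is None:
            rng = np.random.default_rng()
        o = rng.permutation(o)
    return float(np.corrcoef(h, o)[0, 1])
```

If the hidden loss carries information about the output loss, the correlation for the real pairing should clearly exceed the one obtained after shuffling, which should be close to zero.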

Interestingly, the correlations between $L^0$ and $L^1$ were smaller on the first epoch of training (see data in red oval, Figure 5C). This suggests that the guarantee of coordination between $L^0$ and $L^1$ only comes into full effect once the network has engaged in some learning. Therefore, we inspected whether the conditions on the synaptic matrices that are assumed in the proof of Theorem 1 were, in fact, being met. More precisely, the proof assumes that the feedforward and feedback synaptic matrices ($W^1$ and $Y$, respectively) produce forward and backward transformations between the output and hidden layer whose Jacobians are approximate inverses of each other (see Proof of Theorem 1). Since we begin learning with random matrices, this condition is almost certainly not met at the start of training. But, we found that the network learned to meet this condition. Inspection of $W^1$ and $Y$ showed that during the first epoch, the Jacobians of the forward and backward functions became approximate inverses of each other (Figure 5—figure supplement 1). Since $Y$ is frozen, this means that during the first few image presentations $W^1$ was being updated to have its Jacobian come closer to the inverse of $Y$'s Jacobian. Put another way, the network was learning to do credit assignment. We have yet to resolve exactly why this happens, though the result is very similar to the findings of Lillicrap et al. (2016), where a proof is provided for the linear case. Intuitively, though, the reason is likely the interaction between $W^1$ and $W^0$: as $W^0$ gets updated, the hidden layer learns to group stimuli based on the feedback sent through $Y$. So, for $W^1$ to transform the hidden layer activity into the correct output layer activity, $W^1$ must become more like the inverse of $Y$, which would also make the Jacobian of $W^1$ more like the inverse of $Y$’s Jacobian (due to the inverse function theorem). 
However, a complete, formal explanation for this phenomenon is still missing, and the issue of weight alignment deserves additional investigation (Lillicrap et al., 2016). From a biological perspective, it also suggests that very early development may involve a period of learning how to assign credit appropriately. Altogether, our model demonstrates that deep learning using random feedback weights is a general phenomenon, and one which can be implemented using segregated dendrites to keep forward information separate from feedback signals used for credit assignment.
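To make the notion of "approximate inverses" concrete, the following sketch measures how far a feedforward matrix is from inverting a feedback matrix. It is a deliberately simplified, purely linear illustration: it ignores the neuronal non-linearities that enter the actual Jacobians, and the function name and the choice of a normalized Frobenius norm are assumptions made here, not the measure used in Figure 5—figure supplement 1.

```python
import numpy as np

def inverse_gap(W1, Y):
    """Distance of W1 @ Y from the identity, normalized by the norm of the identity.

    W1 has shape (n_out, n_hidden) and Y has shape (n_hidden, n_out). A value
    near 0 means the forward map approximately inverts the feedback map.
    ASSUMPTION: linear maps only; non-linearities are ignored.
    """
    n_out = W1.shape[0]
    return float(np.linalg.norm(np.eye(n_out) - W1 @ Y) / np.sqrt(n_out))
```

In this linear picture, a random $W^1$ gives a large gap, whereas a $W^1$ equal to the pseudo-inverse of a fixed, full-rank $Y$ gives a gap of essentially zero, which is the direction in which the trained network is described as moving.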

Deep learning with segregated dendrites

Given our finding that the network was successfully assigning credit for the output error to the hidden layer neurons, we had reason to believe that our network with local weight-updates would exhibit deep learning, that is an ability to take advantage of a multi-layer structure (Bengio and LeCun, 2007). To test this, we examined the effects of including hidden layers. If deep learning is indeed operational in the network, then the inclusion of hidden layers should improve the ability of the network to classify images.

We built three different versions of the network (Figure 6A). The first was a network that had no hidden layer, that is the input neurons projected directly to the output neurons. The second was the network illustrated in Figure 3B, with a single hidden layer. The third contained two hidden layers, with the output layer projecting directly back to both hidden layers. This direct projection allowed us to build our local targets for each hidden layer using the plateaus driven by the output layer, thereby avoiding a ‘backward pass’ through the entire network as has been used in other models (Lillicrap et al., 2016; Lee et al., 2015; Liao et al., 2015). We trained each network on the 60,000 MNIST training images for 60 epochs, and recorded the percentage of images in the 10,000 image test set that were incorrectly classified. The network with no hidden layers rapidly learned to classify the images, but it also rapidly hit an asymptote at an average error rate of 8.3% (Figure 6B, gray line). In contrast, the network with one hidden layer did not exhibit a rapid convergence to an asymptote in its error rate. Instead, it continued to improve throughout all 60 epochs, achieving an average error rate of 4.1% by the 60th epoch (Figure 6B, blue line). Similar results were obtained when we loosened the synchrony constraints and instead allowed each hidden layer neuron to engage in plateau potentials at different times (Figure 6—figure supplement 1). This demonstrates that strict synchrony in the plateau potentials is not required. But, our target definitions do require two different plateau potentials separated by the teaching signal input, which mandates some temporal control of plateau potentials in the system.

Figure 6 (with 1 supplement). Improvement of learning with hidden layers.

(A) Illustration of the three networks used in the simulations. Top: a shallow network with only an input layer and an output layer. Middle: a network with one hidden layer. Bottom: a network with two hidden layers. Both hidden layers receive feedback from the output layer, but through separate synaptic connections with random weights $Y^0$ and $Y^1$. (B) Plot of test error (measured on 10,000 MNIST images not used for training) across 60 epochs of training, for all three networks described in A. The networks with hidden layers exhibit deep learning, because hidden layers decrease the test error. Right: Spreads (min – max) of the results of repeated weight tests ($n = 20$) after 60 epochs for each of the networks. Percentages indicate means (two-tailed t-test, 1-layer vs. 2-layer: $t_{38} = 197.11$, $p = 2.5 \times 10^{-58}$; 1-layer vs. 3-layer: $t_{38} = 238.26$, $p = 1.9 \times 10^{-61}$; 2-layer vs. 3-layer: $t_{38} = 42.99$, $p = 2.3 \times 10^{-33}$, Bonferroni correction for multiple comparisons). (C) Results of t-SNE dimensionality reduction applied to the activity patterns of the first three layers of a two hidden layer network (after 60 epochs of training). Each data point corresponds to a test image shown to the network. Points are color-coded according to the digit they represent. Moving up through the network, images from identical categories are clustered closer together and separated from images of different categories. Thus the hidden layers learn increasingly abstract representations of digit categories.

https://doi.org/10.7554/eLife.22901.010

Interestingly, we found that the addition of a second hidden layer further improved learning. The network with two hidden layers learned more rapidly than the network with one hidden layer and achieved an average error rate of 3.2% on the test images by the 60th epoch, also without hitting a clear asymptote in learning (Figure 6B, red line). However, it should be noted that additional hidden layers beyond two did not significantly improve the error rate (data not shown), which suggests that our particular algorithm could not be used to construct very deep networks as is. Nonetheless, our network was clearly able to take advantage of multi-layer architectures to improve its learning, which is the key feature of deep learning (Bengio and LeCun, 2007; LeCun et al., 2015).

Another key feature of deep learning is the ability to generate representations in the higher layers of a network that capture task-relevant information while discarding sensory details (LeCun et al., 2015; Mnih et al., 2015). To examine whether our network exhibited this type of abstraction, we used the t-Distributed Stochastic Neighbor Embedding algorithm (t-SNE). The t-SNE algorithm reduces the dimensionality of data while preserving local structure and non-linear manifolds that exist in high-dimensional space, thereby allowing accurate visualization of the structure of high-dimensional data (Maaten and Hinton, 2008). We applied t-SNE to the activity patterns at each layer of the two hidden layer network for all of the images in the test set after 60 epochs of training. At the input level, there was already some clustering of images based on their categories. However, the clusters were quite messy, with different categories showing outliers, several clusters, or merged clusters (Figure 6C, bottom). For example, the ‘2’ digits in the input layer exhibited two distinct clusters separated by a cluster of ‘7’s: one cluster contained ‘2’s with a loop and one contained ‘2’s without a loop. Similarly, there were two distinct clusters of ‘4’s and ‘9’s that were very close to each other, with one pair for digits on a pronounced slant and one for straight digits (Figure 6C, bottom, example images). Thus, although there is built-in structure to the categories of the MNIST dataset, there are a number of low-level features that do not respect category boundaries. In contrast, at the first hidden layer, the activity patterns were much cleaner, with far fewer outliers and split/merged clusters (Figure 6C, middle). For example, the two separate ‘2’ digit clusters were much closer to each other and were now only separated by a very small cluster of ‘7’s. Likewise, the ‘9’ and ‘4’ clusters were now distinct and no longer split based on the slant of the digit. 
Interestingly, when we examined the activity patterns at the second hidden layer, the categories were even better segregated, with only a little bit of splitting or merging of category clusters (Figure 6C, top). Therefore, the network had learned to develop representations in the hidden layers wherein the categories were very distinct and low-level features unrelated to the categories were largely ignored. This abstract representation is likely to be key to the improved error rate in the two hidden layer network. Altogether, our data demonstrate that our network with segregated dendritic compartments can engage in deep learning.

Coordinated local learning mimics backpropagation of error

The backpropagation of error algorithm (Rumelhart et al., 1986) is still the primary learning algorithm used for deep supervised learning in artificial neural networks (LeCun et al., 2015). Previous work has shown that learning with random feedback weights can actually match the synaptic weight updates specified by the backpropagation algorithm after a few epochs of training (Lillicrap et al., 2016). This fascinating observation suggests that deep learning with random feedback weights is not completely distinct from backpropagation of error, but rather, networks with random feedback connections learn to approximate credit assignment as it is done in backpropagation (Lillicrap et al., 2016). Hence, we were curious as to whether or not our network was, in fact, learning to approximate the synaptic weight updates prescribed by backpropagation. To test this, we trained our one hidden layer network as before, but now, in addition to calculating the vector of hidden layer synaptic weight updates specified by our local learning rule ($\Delta W^0$ in Equation (10)), we also calculated the vector of hidden layer synaptic weight updates that would be specified by non-locally backpropagating the error from the output layer ($\Delta W^0_{BP}$). We then calculated the angle between these two alternative weight updates. In a very high-dimensional space, any two independent vectors will be roughly orthogonal to each other (i.e. $\Delta W^0 \angle \Delta W^0_{BP} \approx 90^\circ$). If the two synaptic weight update vectors are not orthogonal to each other (i.e. $\Delta W^0 \angle \Delta W^0_{BP} < 90^\circ$), then it suggests that the two algorithms are specifying similar weight updates.
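The angle measure described above can be sketched in a few lines of NumPy. The function name is hypothetical, and the matrix shape used in the example (500 hidden units by 784 inputs) is an assumption chosen to resemble a one hidden layer MNIST network; the two random matrices stand in for independent weight updates and are not taken from the simulations.

```python
import numpy as np

def update_angle_deg(dW_local, dW_bp):
    """Angle in degrees between two weight updates, treated as flat vectors."""
    a = np.ravel(dW_local)
    b = np.ravel(dW_bp)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

rng = np.random.default_rng(0)
dW_a = rng.standard_normal((500, 784))   # stand-in for one update
dW_b = rng.standard_normal((500, 784))   # independent stand-in for the other
angle_independent = update_angle_deg(dW_a, dW_b)
```

For independent random updates of this size the angle is very close to 90°, which is why angles clearly below 90° are informative about agreement between the two learning rules.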

As in previous work (Lillicrap et al., 2016), we found that the initial weight updates for our network were orthogonal to the updates specified by backpropagation. But, as the network learned, the angle dropped to approximately 65°, before rising again slightly to roughly 70° (Figure 7A, blue line). This suggests that our network was learning to develop local weight updates in the hidden layer that were in rough agreement with the updates that explicit backpropagation would produce. However, this drop in orthogonality was still much less than that observed in non-spiking artificial neural networks learning with random feedback weights, which show a drop to below 45° (Lillicrap et al., 2016). We suspected that the higher angle between the weight updates that we observed may have been because we were using spikes to communicate the feedback from the upper layer, which could introduce both noise and bias in the estimates of the output layer activity. To test this, we also examined the weight updates that our algorithm would produce if we propagated the spike rates of the output layer neurons, $\phi^1(t)$, back directly through the random feedback weights, $Y$. In this scenario, we observed a much sharper drop in the $\Delta W^0 \angle \Delta W^0_{BP}$ angle, which reduced to roughly 35° before rising again to 40° (Figure 7A, red line). These results show that, in principle, our algorithm is learning to approximate the backpropagation algorithm, though with some drop in accuracy introduced by the use of spikes to propagate output layer activities to the hidden layer.

Figure 7. Approximation of backpropagation with local learning rules.

(A) Plot of the angle between weight updates prescribed by our local update learning algorithm compared to those prescribed by backpropagation of error, for a one hidden layer network over 10 epochs of training (each point on the horizontal axis corresponds to one image presentation). Data was time-averaged using a sliding window of 100 image presentations. When training the network using the local update learning algorithm, feedback was sent to the hidden layer either using spiking activity from the output layer units (blue) or by directly sending the spike rates of output units (red). The angle between the local update $\Delta W^0$ and backpropagation weight updates $\Delta W^0_{BP}$ remains under 90° during training, indicating that both algorithms point weight updates in a similar direction. (B) Examples of hidden layer receptive fields (synaptic weights) obtained by training the network in A using our local update learning rule (left) and backpropagation of error (right) for 60 epochs. (C) Plot of correlation between local update receptive fields and backpropagation receptive fields. For each of the receptive fields produced by local update, we plot the maximum Pearson correlation coefficient between it and all 500 receptive fields learned using backpropagation (Regular). Overall, the maximum correlation coefficients are greater than those obtained after shuffling all of the values of the local update receptive fields (Shuffled).

https://doi.org/10.7554/eLife.22901.013

To further examine how our local learning algorithm compared to backpropagation, we compared the low-level features that the two algorithms learned. To do this, we trained the one hidden layer network with both our algorithm and backpropagation. We then examined the receptive fields (i.e. the synaptic weights) produced by both algorithms in the hidden layer synapses ($W^0$) after 60 epochs of training. The two algorithms produced qualitatively similar receptive fields (Figure 7B). Both produced receptive fields with clear, high-contrast features for detecting particular strokes or shapes. To quantify the similarity, we conducted pair-wise correlation calculations for the receptive fields produced by the two algorithms and identified the maximum correlation pairs for each. Compared to shuffled versions of the receptive fields, there was a very high level of maximum correlation (Figure 7C), showing that the receptive fields were indeed quite similar. Thus, the data demonstrate that our learning algorithm using random feedback weights into segregated dendrites can in fact come to approximate the backpropagation of error algorithm.
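The receptive field comparison described above can be sketched as follows. This is an illustrative reconstruction of the analysis logic rather than the code used for Figure 7C; the function names are hypothetical, and each receptive field is assumed to be one row of a weight matrix with shape (number of hidden units, number of input pixels).

```python
import numpy as np

def max_correlations(rf_local, rf_bp):
    """For each local-rule receptive field, the maximum Pearson correlation
    with any of the backpropagation receptive fields."""
    a = rf_local - rf_local.mean(axis=1, keepdims=True)
    b = rf_bp - rf_bp.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Rows are centered and unit-norm, so a @ b.T holds all pairwise Pearson r.
    return (a @ b.T).max(axis=1)

def shuffled_max_correlations(rf_local, rf_bp, rng):
    """Baseline: shuffle all values of the local receptive fields first."""
    shuffled = rng.permutation(rf_local.ravel()).reshape(rf_local.shape)
    return max_correlations(shuffled, rf_bp)
```

Taking the maximum over all candidate partners avoids having to match hidden units one-to-one across the two networks, since there is no reason for unit $i$ in one network to learn the same feature as unit $i$ in the other.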

Conditions on feedback weights

Once we had convinced ourselves that our learning algorithm was, in fact, providing a solution to the credit assignment problem, we wanted to examine some of the constraints on learning. First, we wanted to explore the structure of the feedback weights. In our initial simulations we used non-sparse, random (i.e. normally distributed) feedback weights. We were interested in whether learning could still work with sparse weights, given that neocortical connectivity is sparse. As well, we wondered whether symmetric weights would improve learning, which would be expected given previous findings (Lillicrap et al., 2016; Lee et al., 2015; Liao et al., 2015). To explore these questions, we trained our one hidden layer network using both sparse feedback weights (only 20% non-zero values) and symmetric weights ($Y = (W^1)^T$) (Figure 8A,C). We found that learning actually improved slightly with sparse weights (Figure 8B, red line), achieving an average error rate of 3.7% by the 60th epoch, compared to the average 4.1% error rate achieved with fully random weights. But, this result appeared to depend on the magnitude of the sparse weights. To compensate for the loss of 80% of the weights we initially increased the sparse synaptic weight magnitudes by a factor of 5. However, when we did not re-scale the sparse weights learning was actually worse (Figure 8—figure supplement 1), though this could likely be dealt with by a careful resetting of learning rates. Altogether, our results suggest that sparse feedback provides a signal that is sufficient for credit assignment.
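The feedback weight manipulations described above can be sketched as follows. The function names, the use of a standard normal distribution for the initial weights, and the noise parameter are assumptions made for illustration; the initialization actually used in the simulations is described in Materials and methods.

```python
import numpy as np

def sparse_feedback(n_hidden, n_out, keep_frac=0.2, rescale=5.0, rng=None):
    """Random feedback matrix Y (n_hidden x n_out) with about keep_frac of the
    entries non-zero, re-scaled to compensate for the removed weights."""
    if rng is None:
        rng = np.random.default_rng()
    Y = rng.standard_normal((n_hidden, n_out))
    mask = rng.random((n_hidden, n_out)) < keep_frac
    return Y * mask * rescale

def symmetric_feedback(W1, noise_sd=0.0, rng=None):
    """Feedback matrix equal to the transpose of the feedforward weights W1
    (n_out x n_hidden), optionally with added Gaussian noise."""
    Y = W1.T.copy()
    if noise_sd > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        Y = Y + rng.normal(0.0, noise_sd, size=Y.shape)
    return Y
```

In the symmetric case, the feedback matrix would need to be recomputed from $W^1$ after every feedforward weight update, so that the feedback continues to track the changing feedforward weights.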

Figure 8 (with 1 supplement). Conditions on feedback synapses for effective learning.

(A) Diagram of a one hidden layer network trained in B, with 80% of feedback weights set to zero. The remaining feedback weights $Y$ were multiplied by five in order to maintain a similar overall magnitude of feedback signals. (B) Plot of test error across 60 epochs for our standard one hidden layer network (gray) and a network with sparse feedback weights (red). Sparse feedback weights resulted in improved learning performance compared to fully connected feedback weights. Right: Spreads (min – max) of the results of repeated weight tests ($n = 20$) after 60 epochs for each of the networks. Percentages indicate mean final test errors for each network (two-tailed t-test, regular vs. sparse: $t_{38} = 16.43$, $p = 7.4 \times 10^{-19}$). (C) Diagram of a one hidden layer network trained in D, with feedback weights that are symmetric to feedforward weights $W^1$, and symmetric but with added noise. Noise added to feedback weights is drawn from a normal distribution with variance $\sigma = 0.05$. (D) Plot of test error across 60 epochs of our standard one hidden layer network (gray), a network with symmetric weights (red), and a network with symmetric weights with added noise (blue). Symmetric weights result in improved learning performance compared to random feedback weights, but adding noise to symmetric weights results in impaired learning. Right: Spreads (min – max) of the results of repeated weight tests ($n = 20$) after 60 epochs for each of the networks. Percentages indicate means (two-tailed t-test, random vs. symmetric: $t_{38} = 18.46$, $p = 4.3 \times 10^{-20}$; random vs. symmetric with noise: $t_{38} = -71.54$, $p = 1.2 \times 10^{-41}$; symmetric vs. symmetric with noise: $t_{38} = -80.35$, $p = 1.5 \times 10^{-43}$, Bonferroni correction for multiple comparisons).

https://doi.org/10.7554/eLife.22901.015

Similar to sparse feedback weights, symmetric feedback weights also improved learning, leading to a rapid decrease in the test error and an error rate of 3.6% by the 60th epoch (Figure 8D, red line). This is interesting, given that backpropagation assumes symmetric feedback weights (Lillicrap et al., 2016; Bengio et al., 2015), though our proof of Theorem 1 does not. However, when we added noise to the symmetric weights any advantage was eliminated and learning was, in fact, slightly impaired (Figure 8D, blue line). At first, this was a very surprising result: given that learning works with random feedback weights, why would it not work with symmetric weights with noise? However, when we considered our previous finding that during the first epoch the feedforward weights, $W^1$, learn to have the feedforward Jacobian match the inverse of the feedback Jacobian (Figure 5—figure supplement 1), a possible answer emerges. In the case of symmetric feedback weights the synaptic matrix $Y$ is changing as $W^1$ changes. This works fine when $Y$ is set to $(W^1)^T$, since that artificially forces something akin to backpropagation. But, if the feedback weights are set to $(W^1)^T$ plus noise, then the system can never align the Jacobians appropriately, since $Y$ is now a moving target. This would imply that any implementation of feedback learning must either be very effective (to achieve the right feedback) or very slow (to allow the feedforward weights to adapt).

Learning with partial apical attenuation

Another constraint that we wished to examine was whether total segregation of the apical inputs was necessary, given that real pyramidal neurons only show an attenuation of distal apical inputs to the soma (Larkum et al., 1999). Total segregation ($g_a = 0$) renders the network effectively feed-forward in its dynamics, which made it easier to construct the loss functions to ensure that reducing $L^0$ also reduces $L^1$ (see Figure 5 and Theorem 1). But, we wondered whether some degree of apical conductance to the soma would be sufficiently innocuous so as to not disrupt deep learning. To examine this, we re-ran our two hidden layer network, but now, we allowed the apical dendritic voltage to influence the somatic voltage by setting $g_a = 0.05$. This value gave us twelve times more attenuation than the attenuation from the basal compartments, since $g_b = 0.6$ (Figure 9A). When we compared the learning in this scenario to the scenario with total apical segregation, we observed very little difference in the error rates on the test set (Figure 9B, gray and red lines). Importantly, though, we found that if we increased the apical conductance to the same level as the basal ($g_a = g_b = 0.6$) then the learning was significantly impaired (Figure 9B, blue line). This demonstrates that although total apical attenuation is not necessary, partial segregation of the apical compartment from the soma is necessary. That result makes sense given that our local targets for the hidden layer neurons incorporate a term that is supposed to reflect the response of the output neurons to the feedforward sensory information ($\alpha^f$). Without some sort of separation of feedforward and feedback information, as is assumed in other models of deep learning (Lillicrap et al., 2016; Lee et al., 2015), this feedback signal would get corrupted by recurrent dynamics in the network. 
Our data show that electrotonically segregated dendrites are one potential way to achieve the separation between feedforward and feedback information that is required for deep learning.

Figure 9. Importance of dendritic segregation for deep learning.

(A) Left: Diagram of a hidden layer neuron. $g_a$ represents the strength of the coupling between the apical dendrite and soma. Right: Example traces of the apical voltage in a single neuron, $V^{0a}_i$, and the somatic voltage, $V^0_i$, in response to spikes arriving at apical synapses. Here $g_a = 0.05$, so the apical activity is strongly attenuated at the soma. (B) Plot of test error across 60 epochs of training on MNIST of a two hidden layer network, with total apical segregation (gray), strong apical attenuation (red) and weak apical attenuation (blue). Apical input to the soma did not prevent learning if it was strongly attenuated, but weak apical attenuation impaired deep learning. Right: Spreads (min – max) of the results of repeated weight tests ($n = 20$) after 60 epochs for each of the networks. Percentages indicate means (two-tailed t-test, total segregation vs. strong attenuation: $t_{38} = -4.00$, $p = 8.4 \times 10^{-4}$; total segregation vs. weak attenuation: $t_{38} = -95.24$, $p = 2.4 \times 10^{-46}$; strong attenuation vs. weak attenuation: $t_{38} = -92.51$, $p = 7.1 \times 10^{-46}$, Bonferroni correction for multiple comparisons).

https://doi.org/10.7554/eLife.22901.018

Discussion

Deep learning has radically altered the field of AI, demonstrating that parallel distributed processing across multiple layers can produce human/animal-level capabilities in image classification, pattern recognition and reinforcement learning (Hinton et al., 2006; LeCun et al., 2015; Mnih et al., 2015; Silver et al., 2016; Krizhevsky et al., 2012; He et al., 2015). Deep learning was motivated by analogies to the real brain (LeCun et al., 2015; Cox and Dean, 2014), so it is tantalizing that recent studies have shown that deep neural networks develop representations that strongly resemble the representations observed in the mammalian neocortex (Khaligh-Razavi and Kriegeskorte, 2014; Yamins and DiCarlo, 2016; Cadieu et al., 2014; Kubilius et al., 2016). In fact, deep learning models can match cortical representations better than some models that explicitly attempt to mimic the real brain (Khaligh-Razavi and Kriegeskorte, 2014). Hence, at a phenomenological level, it appears that deep learning, defined as multilayer cost function reduction with appropriate credit assignment, may be key to the remarkable computational prowess of the mammalian brain (Marblestone et al., 2016). However, the lack of biologically feasible mechanisms for credit assignment in deep learning algorithms, most notably backpropagation of error (Rumelhart et al., 1986), has left neuroscientists with a mystery. Given that the brain cannot use backpropagation, how does it solve the credit assignment problem (Figure 1)? Here, we expanded on an idea that previous authors have explored (Körding and König, 2001; Spratling, 2002; Spratling and Johnson, 2006) and demonstrated that segregating the feedback and feedforward inputs to neurons, much as the real neocortex does (Larkum et al., 1999; 2007; 2009), can enable the construction of local targets to assign credit appropriately to hidden layer neurons (Figure 2). 
With this formulation, we showed that we could use segregated dendritic compartments to coordinate learning across layers (Figure 3, Figure 4 and Figure 5). This enabled our network to take advantage of multiple layers to develop representations of hand-written digits in hidden layers that enabled better levels of classification accuracy on the MNIST dataset than could be achieved with a single layer (Figure 6). Furthermore, we found that our algorithm actually approximated the weight updates that would be prescribed by backpropagation, and produced similar low-level feature detectors (Figure 7). As well, we showed that our basic framework works with sparse feedback connections (Figure 8) and more realistic, partial apical attenuation (Figure 9). Therefore, our work demonstrates that deep learning is possible in a biologically feasible framework, provided that feedforward and feedback signals are sufficiently segregated in different dendrites.

In this work we adopted a similar strategy to the one taken by Lee et al., 2015 in their difference target propagation algorithm, wherein the feedback from higher layers is used to construct local firing-rate targets at the hidden layers. One of the reasons that we adopted this strategy is that it is appealing to think that feedback from upper layers may not simply be providing a signal for plasticity, but also a predictive and/or modulatory signal to push the hidden layer neurons towards a ‘better’ activity pattern in real-time. This sort of top-down control could be used by the brain to improve sensory processing in different contexts and engage in inference (Bengio et al., 2015). Indeed, framing cortico-cortical feedback as a mechanism to predict or modulate incoming sensory activity is a more common way of viewing feedback signals in the neocortex (Larkum, 2013; Gilbert and Li, 2013; Zhang et al., 2014; Fiser et al., 2016; Leinweber et al., 2017). In light of this, it is interesting to note that distal apical inputs in sensory cortical areas can predict upcoming stimuli (Leinweber et al., 2017; Fiser et al., 2016), and help animals perform sensory discrimination tasks (Takahashi et al., 2016; Manita et al., 2015). However, in our model, we did not actually implement a system that altered the hidden layer activity to make sensory computations; we simply used the feedback signals to drive learning. In line with this view of top-down feedback, two recent papers have found evidence that cortical feedback can indeed guide feedforward sensory plasticity (Thompson et al., 2016; Yamada et al., 2017), and in the hippocampus, there is evidence that plateau potentials generated by apical inputs are key determinants of plasticity (Bittner et al., 2015; Bittner et al., 2017). But, ultimately, there is no reason that feedback signals cannot provide both top-down prediction/modulation and a signal for learning (Spratling, 2002). 
In this respect, a potential future advance on our model would be to implement a system wherein the feedback makes predictions and ‘nudges’ the hidden layers towards appropriate activity patterns in order to guide learning and shape perception simultaneously. This proposal is reminiscent of the approach taken in previous computational models (Urbanczik and Senn, 2014; Spratling and Johnson, 2006; Körding and König, 2001). Future research could study how top-down control of activity and a signal for credit assignment can be combined.

In a number of ways, the model that we presented here is more biologically feasible than other deep learning models. We utilized leaky integrator neurons that communicate with spikes, we simulated in near continuous-time, and we used spatially local synaptic plasticity rules. Yet, there are still clearly unresolved issues of biological feasibility in our model. Most notably, the model updates synaptic weights using the difference between two plateau potentials that occur following two different phases. There are three issues with this method from a biological standpoint. First, it necessitates two distinct global phases of processing (the ‘forward’ and ‘target’ phases). Second, the plateau potentials occur in the apical compartment, but they are used to update the basal synapses, meaning that this information from the apical dendrites must somehow be communicated to the rest of the neuron. Third, the two plateau potentials occur with a temporal gap of tens of milliseconds, meaning that this difference must somehow be computed over time.

These issues could, theoretically, be resolved in a biologically realistic manner. The two different phases could be a result of a global signal indicating whether the teaching signal was present. This could be accomplished with neuromodulatory systems (Pi et al., 2013), or alternatively, with oscillations that the teaching signal and apical dendrites are phase locked to (Veit et al., 2017). Communicating plateau potentials to the basal dendrites is also possible using known biological principles. Plateau potentials induce bursts of action potentials in pyramidal neurons (Larkum et al., 1999), and the rate-of-fire of the bursts would be a function of the level of the plateau potential. Given that action potentials would propagate back through the basal dendrites (Kampa and Stuart, 2006), any cellular mechanism in the basal dendrites that is sensitive to rate-of-fire of bursts could be used to detect the level of the plateau potentials in the apical dendrite. Finally, taking the difference between two events that occur tens of milliseconds apart is possible if such a hypothetical cellular signal that is sensitive to bursts had a slow decay time constant, and reacted differently depending on whether the global phase signal was active. A simple mathematical formulation for such a cellular signal is given in the methods (see Equations (36) and (37)). It is worth noting that incorporation of bursting into somatic dynamics would be unlikely to affect the learning results we presented here. This is because we calculate weight updates by averaging the activity of the neurons for a period after the network is near steady-state (i.e. the period marked with the blue line in Figure 3C, see also Equation (5)). Even if bursts of activity temporarily altered the dynamics of the network, they would not significantly alter the steady-state activity. 
Future work could expand on the model presented here and explore whether bursting activity might beneficially alter somatic dynamics (e.g. for on-line inference), as well as driving learning.

These possible implementations are clearly speculative, and only partially in line with experimental evidence. As the adage goes, all models are wrong, but some models are useful. Our model aims to inspire new ways to think about how the credit assignment problem could be solved by known circuits in the brain. Our study demonstrates that some of the machinery that is known to exist in the neocortex, namely electrotonically segregated apical dendrites receiving top-down inputs, may be well-suited to credit assignment computations. What we are proposing is that the neocortex could use the segregation of top-down inputs to the apical dendrites in order to solve the credit assignment problem, without using a separate feedback pathway as is implicit in most deep learning models used in machine learning. We consider this to be the core insight of our model, and an important step in making deep learning more biologically plausible. Indeed, our model makes both a generic, and a specific, prediction about the role of synaptic inputs to apical dendrites during learning. The generic prediction is that the sign of synaptic plasticity, that is, whether LTP or LTD occurs, in the basal dendrites will be modulated by different patterns of inputs to the apical dendrites. The more specific prediction that our model makes is that the timing of apical inputs relative to basal inputs should be what determines the sign of plasticity for synapses in the basal dendrites. For example, if apical and basal inputs arrive at the same time, but the apical inputs disappear before the basal inputs do, then presumably plateau potentials will be stronger early in the stimulus presentation (i.e. $\alpha^f > \alpha^t$), and so the basal synapses should engage in LTD. In contrast, if the apical inputs only arrive after the basal inputs have been active for some period of time, then plateau potentials will be stronger towards the end of stimulus presentation (i.e. $\alpha^f < \alpha^t$), and so the basal synapses should engage in LTP. 
Both the generic and specific predictions should be experimentally testable using modern optical techniques to separate the inputs to the basal and apical dendrites (Figure 10).

An experiment to test the central prediction of the model.

(A) Illustration of the basic experimental set-up required to test the predictions (generic or specific) of the deep learning with segregated dendrites model. To test the predictions of the model, patch clamp recordings could be performed in neocortical pyramidal neurons (e.g. layer 5 neurons, shown in black), while the top-down inputs to the apical dendrites and bottom-up inputs to the basal dendrites are controlled separately. This could be accomplished optically, for example by infecting layer 4 cells with channelrhodopsin (blue cell), and a higher-order cortical region with a red-shifted opsin (red axon projections), such that the two inputs could be controlled by different colors of light. (B) Illustration of the specific experimental prediction of the model. With separate control of top-down and bottom-up inputs, a synaptic plasticity experiment could be conducted to test the central prediction of the model, namely that the timing of apical inputs relative to basal inputs should determine the sign of plasticity at basal dendrites. After recording baseline postsynaptic responses (black lines) to the basal inputs (blue lines), a plasticity induction protocol could either have the apical inputs (red lines) arrive early during basal inputs (left) or late during basal inputs (right). The prediction of our model would be that the former would induce LTD in the basal synapses, while the latter would induce LTP.

https://doi.org/10.7554/eLife.22901.020

Another direction for future research should be to consider how to use the machinery of neocortical microcircuits to communicate credit assignment signals without relying on differences across phases, as we did here. For example, somatostatin positive interneurons, which possess short-term facilitating synapses (Silberberg and Markram, 2007), are particularly sensitive to bursts of spikes, and could be part of a mechanism to calculate differences in the top-down signals being received by pyramidal neuron dendrites. If a calculation of this difference spanned the time before and after a teaching signal arrived, it could, theoretically, provide the computation that our system implements with a difference between plateau potentials. Indeed, we would argue that credit assignment may be one of the major functions of the canonical neocortical microcircuit motif. If this is correct, then the inhibitory interneurons that target apical dendrites may be used by the neocortex to control learning (Murayama et al., 2009). Although this is speculative, it is worth noting that current evidence supports the idea that neuromodulatory inputs carrying temporally precise salience information (Hangya et al., 2015) can shut off interneurons to disinhibit the distal apical dendrites (Pi et al., 2013; Karnani et al., 2016; Pfeffer et al., 2013; Brombas et al., 2014), and presumably, promote apical communication to the soma. Recent work suggests that the specific patterns of interneuron inhibition on the apical dendrites are spatially precise and differentially timed to motor behaviours (Muñoz et al., 2017), which suggests that there may well be coordinated physiological mechanisms for determining when and how cortico-cortical feedback is transmitted to the soma and basal dendrites. 
Future research should examine whether these inhibitory and neuromodulatory mechanisms do, in fact, control plasticity in the basal dendrites of pyramidal neurons, as our model, and some recent experimental work (Bittner et al., 2015; Bittner et al., 2017), would predict.

A non-biological issue that should be recognized is that the error rates which our network achieved were by no means as low as can be achieved with artificial neural networks, nor at human levels of performance (Lecun et al., 1998; Li et al., 2016). As well, our algorithm was not able to take advantage of very deep structures (beyond two hidden layers, the error rate did not improve). In contrast, increasing the depth of networks trained with backpropagation can lead to performance improvements (Li et al., 2016). But, these observations do not mean that our network was not engaged in deep learning. First, it is interesting to note that although the backpropagation algorithm is several decades old (Rumelhart et al., 1986), it was long considered to be useless for training networks with more than one or two hidden layers (Bengio and LeCun, 2007). Indeed, it was only the use of layer-by-layer training that initially led to the realization that deeper networks can achieve excellent performance (Hinton et al., 2006). Since then, both the use of very large datasets (with millions of examples), and additional modifications to the backpropagation algorithm, have been key to making backpropagation work well on deeper networks (Sutskever et al., 2013; LeCun et al., 2015). Future studies could examine how our algorithm could incorporate current techniques used in machine learning to work better on deeper architectures. Second, we stress that our network was not designed to match the state-of-the-art in machine learning, nor human capabilities. 
To test our basic hypothesis (and to run our leaky-integration and spiking simulations in a reasonable amount of time) we kept the network small, we stopped training before it reached its asymptote, and we did not implement any add-ons to the learning to improve the error rates, such as convolution and pooling layers, initialization tricks, mini-batch training, drop-out, momentum or RMSProp (Sutskever et al., 2013; Tieleman and Hinton, 2012; Srivastava et al., 2014). Indeed, it would be quite surprising if a relatively vanilla, small network like ours could come close to matching current performance benchmarks in machine learning. Third, although our network was able to take advantage of multiple layers to improve the error rate, there may be a variety of reasons that ever increasing depth didn’t improve performance significantly. For example, our use of direct connections from the output layer to the hidden layers may have impaired the network’s ability to coordinate synaptic updates between hidden layers. As well, given our finding that the use of spikes produced weight updates that were less well-aligned to backpropagation (Figure 7A) it is possible that deeper architectures require mechanisms to overcome the inherent noisiness of spikes.

One aspect of our model that we did not develop was the potential for learning at the feedback synapses. Although we used random synaptic weights for feedback, we also demonstrated that our model actually learns to meet the mathematical conditions required for credit assignment (Figure 5—figure supplement 1). This suggests that it would be beneficial to develop a synaptic weight update rule for the feedback synapses that made this aspect of the learning better. Indeed, Lee et al., 2015 implemented an ‘inverse loss function’ for their feedback synapses which promoted the development of feedforward and feedback functions that were roughly inverses of each other, leading to the emergence of auto-encoder functions in their network. In light of this, it is interesting to note that there is evidence for unique, ‘reverse’ spike-timing-dependent synaptic plasticity rules in the distal apical dendrites of pyramidal neurons (Sjöström and Häusser, 2006; Letzkus et al., 2006), which have been shown to produce symmetric feedback weights and auto-encoder functions in artificial spiking networks (Burbank and Kreiman, 2012; Burbank, 2015). Thus, it is possible that early in development the neocortex actually learns cortico-cortical feedback connections that help it to assign credit for later learning. Our work suggests that any experimental evidence showing that feedback connections learn to approximate the inverse of feedforward connections could be considered as evidence for deep learning in the neocortex.

A final consideration, which is related to learning at feedback synapses, is the likely importance of unsupervised learning for the real brain, that is, learning without a teaching signal. In this paper, we focused on a supervised learning task with a teaching signal. Supervised learning certainly could occur in the brain, especially for goal-directed sensorimotor tasks where animals have access to examples that they could use to generate internal teaching signals (Teşileanu et al., 2017). But, unsupervised learning is likely critical for understanding the development of cognition (Marblestone et al., 2016). Importantly, unsupervised learning in multilayer networks still requires a solution to the credit assignment problem (Bengio et al., 2015), so our work here is not completely inapplicable. Nonetheless, future research should examine how the credit assignment problem can be addressed in the specific case of unsupervised learning.

In summary, deep learning has had a huge impact on AI, but, to date, its impact on neuroscience has been limited. Nonetheless, given a number of findings in neurophysiology and modeling (Yamins and DiCarlo, 2016), there is growing interest in understanding how deep learning may actually be achieved by the real brain (Marblestone et al., 2016). Our results show that by moving away from point neurons, and shifting towards multi-compartment neurons that segregate feedforward and feedback signals, the credit assignment problem can be solved and deep learning can be achieved. Perhaps the dendritic anatomy of neocortical pyramidal neurons is important for nature’s own deep learning algorithm.

Materials and methods

Code for the model can be obtained from a GitHub repository (https://github.com/jordan-g/Segregated-Dendrite-Deep-Learning) (Guerguiev, 2017), with a copy archived at https://github.com/elifesciences-publications/Segregated-Dendrite-Deep-Learning. For notational simplicity, we describe our model in the case of a network with only one hidden layer. We describe how this is extended to a network with multiple layers at the end of this section. As well, at the end of this section in Table 1 we provide a table listing the parameter values we used for all of the simulations presented in this paper.

Table 1
List of parameter values used in our simulations.
https://doi.org/10.7554/eLife.22901.021
| Parameter | Units | Value | Description |
| --- | --- | --- | --- |
| $dt$ | ms | 1 | Time step resolution |
| $\phi_{\max}$ | Hz | 200 | Maximum spike rate |
| $\tau_s$ | ms | 3 | Short synaptic time constant |
| $\tau_L$ | ms | 10 | Long synaptic time constant |
| $\Delta t_s$ | ms | 30 | Settle duration for calculation of average voltages |
| $g_b$ | S | 0.6 | Hidden layer conductance from basal dendrites to the soma |
| $g_a$ | S | 0, 0.05, 0.6 | Hidden layer conductance from apical dendrites to the soma |
| $g_d$ | S | 0.6 | Output layer conductance from dendrites to the soma |
| $g_l$ | S | 0.1 | Leak conductance |
| $V_R$ | mV | 0 | Resting membrane potential |
| $C_m$ | F | 1 | Membrane capacitance |
| $P_0$ | - | $20/\phi_{\max}$ | Hidden layer error signal scaling factor |
| $P_1$ | - | $20/\phi_{\max}^2$ | Output layer error signal scaling factor |

Neuronal dynamics

The network described here consists of an input layer with $\ell$ neurons, a hidden layer with $m$ neurons, and an output layer with $n$ neurons. Neurons in the input layer are simple Poisson spiking neurons whose rate-of-fire is determined by the intensity of image pixels (ranging from 0 to $\phi_{\max}$). Neurons in the hidden layer are modeled using three functional compartments: basal dendrites with voltages $V^{0b}(t) = [V_1^{0b}(t), V_2^{0b}(t), ..., V_m^{0b}(t)]$, apical dendrites with voltages $V^{0a}(t) = [V_1^{0a}(t), V_2^{0a}(t), ..., V_m^{0a}(t)]$, and somata with voltages $V^0(t) = [V_1^0(t), V_2^0(t), ..., V_m^0(t)]$. Feedforward inputs from the input layer and feedback inputs from the output layer arrive at basal and apical synapses, respectively. At basal synapses, presynaptic spikes from input layer neurons are translated into filtered spike trains $s^{\mathrm{input}}(t) = [s_1^{\mathrm{input}}(t), s_2^{\mathrm{input}}(t), ..., s_\ell^{\mathrm{input}}(t)]$ given by:

(11) $s_j^{\mathrm{input}}(t) = \sum_k \kappa(t - t_{jk}^{\mathrm{input}})$

where $t_{jk}^{\mathrm{input}}$ is the $k$th spike time of input neuron $j$, and $\kappa(t)$ is the response kernel given by:

(12) $\kappa(t) = \left(e^{-t/\tau_L} - e^{-t/\tau_s}\right)\Theta(t) / (\tau_L - \tau_s)$

where $\tau_s$ and $\tau_L$ are short and long time constants, and $\Theta$ is the Heaviside step function. Since the network is fully-connected, each neuron in the hidden layer will receive the same set of filtered spike trains from input layer neurons. The filtered spike trains at apical synapses, $s^1(t) = [s_1^1(t), s_2^1(t), ..., s_n^1(t)]$, are modeled in the same manner. The basal and apical dendritic potentials for neuron $i$ are then given by weighted sums of the filtered spike trains at either its basal or apical synapses:

(13) $V_i^{0b}(t) = \sum_{j=1}^{\ell} W_{ij}^0 s_j^{\mathrm{input}}(t) + b_i^0, \quad V_i^{0a}(t) = \sum_{j=1}^{n} Y_{ij} s_j^1(t)$

where $b^0 = [b_1^0, b_2^0, ..., b_m^0]$ are bias terms, $W^0$ is the $m \times \ell$ matrix of feedforward weights for neurons in the hidden layer, and $Y$ is the $m \times n$ matrix of their feedback weights. The somatic voltage for neuron $i$ evolves with leak as:

(14) $\tau \frac{dV_i^0(t)}{dt} = (V_R - V_i^0(t)) + \frac{g_b}{g_l}\left(V_i^{0b}(t) - V_i^0(t)\right) + \frac{g_a}{g_l}\left(V_i^{0a}(t) - V_i^0(t)\right)$
(15) $= (V_R - V_i^0(t)) + \frac{g_b}{g_l}\left(\sum_{j=1}^{\ell} W_{ij}^0 s_j^{\mathrm{input}}(t) + b_i^0 - V_i^0(t)\right) + \frac{g_a}{g_l}\left(\sum_{j=1}^{n} Y_{ij} s_j^1(t) - V_i^0(t)\right)$

where $V_R$ is the resting potential, $g_l$ is the leak conductance, $g_b$ is the conductance from the basal dendrite to the soma, $g_a$ is the conductance from the apical dendrite to the soma, and $\tau$ is a function of $g_l$ and the membrane capacitance $C_m$:

(16) $\tau = \frac{C_m}{g_l}$

Note that for simplicity’s sake we are assuming a resting potential of 0 mV and a membrane capacitance of 1 F, but these values are not important for the results. Equations (13) and (14) are identical to Equation (1) in results.
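As a rough illustration of Equations (11), (12) and (14) to (16), the following minimal sketch evaluates the response kernel and integrates the somatic voltage of a single hidden layer neuron. The function names, the forward-Euler scheme, the treatment of the time constant as being in the same units as the time step, and the constant dendritic voltages in the usage lines are all assumptions made for illustration and are not taken from the published code; parameter values follow Table 1.

```python
import math

# Parameter values from Table 1.
TAU_S, TAU_L = 3.0, 10.0        # short and long synaptic time constants (ms)
G_B, G_A, G_L = 0.6, 0.05, 0.1  # basal, apical and leak conductances
V_R, C_M = 0.0, 1.0             # resting potential and membrane capacitance
DT = 1.0                        # time step (ms)


def kappa(t):
    """Response kernel of Equation (12); zero for t < 0 (Heaviside step)."""
    if t < 0:
        return 0.0
    return (math.exp(-t / TAU_L) - math.exp(-t / TAU_S)) / (TAU_L - TAU_S)


def filtered_spike_train(t, spike_times):
    """Equation (11): sum of kernels centred on each presynaptic spike."""
    return sum(kappa(t - t_k) for t_k in spike_times)


def soma_euler_step(v, v_basal, v_apical, dt=DT):
    """One forward-Euler step of Equation (14), with tau = C_m / g_l.

    Assumption: tau and dt are expressed in the same time units.
    """
    tau = C_M / G_L
    dv = ((V_R - v)
          + (G_B / G_L) * (v_basal - v)
          + (G_A / G_L) * (v_apical - v))
    return v + dt * dv / tau


def steady_state(v_basal, v_apical):
    """Fixed point of Equation (14) for constant dendritic voltages."""
    return (G_L * V_R + G_B * v_basal + G_A * v_apical) / (G_L + G_B + G_A)


# Usage: hold the dendritic voltages constant and let the soma settle.
v = V_R
for _ in range(200):
    v = soma_euler_step(v, v_basal=1.0, v_apical=-0.5)
```

With a resting potential of zero and no apical input, the fixed point reduces to the basal voltage scaled by $g_b/(g_l + g_b + g_a)$, which is the factor $k_b$ that appears later in Equation (34).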

The instantaneous firing rates of neurons in the hidden layer are given by $\phi^0(t) = [\phi_1^0(t), \phi_2^0(t), ..., \phi_m^0(t)]$, where $\phi_i^0(t)$ is the result of applying a nonlinearity, $\sigma(\cdot)$, to the somatic potential $V_i^0(t)$. We chose $\sigma(\cdot)$ to be a simple sigmoidal function, such that:

(17) $\phi_i^0(t) = \phi_{\max}\,\sigma(V_i^0(t)) = \phi_{\max}\frac{1}{1 + e^{-V_i^0(t)}}$

Here, ϕmax is the maximum possible rate-of-fire for the neurons, which we set to 200 Hz. Note that Equation (17) is identical to Equation (3) in results. Spikes are then generated using Poisson processes with these firing rates. We note that although the maximum rate was 200 Hz, the neurons rarely achieved anything close to this rate, and the average rate of fire in the neurons during our simulations was 24 Hz.
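The rate function and spike generation described above can be sketched as follows. This is only an illustration: the per-step Bernoulli approximation of the Poisson process, the function names, and the fixed somatic voltage in the usage lines are assumptions, and the approximation becomes coarse when the rate approaches 200 Hz at a 1 ms time step.

```python
import math
import random

PHI_MAX = 200.0   # maximum spike rate (Hz), Table 1
DT_MS = 1.0       # time step (ms), Table 1


def firing_rate(v_soma):
    """Equation (17): sigmoid of the somatic voltage scaled by phi_max (Hz)."""
    return PHI_MAX / (1.0 + math.exp(-v_soma))


def poisson_spike(rate_hz, rng, dt_ms=DT_MS):
    """Return True if a spike occurs in this time step.

    Assumption: the Poisson process is approximated by a Bernoulli draw
    with probability rate * dt, which is reasonable when rate * dt << 1.
    """
    p_spike = rate_hz * dt_ms / 1000.0
    return rng.random() < p_spike


# Usage: count spikes over one simulated second at a fixed somatic voltage.
rng = random.Random(0)
rate = firing_rate(-2.0)
spike_count = sum(poisson_spike(rate, rng) for _ in range(1000))
```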

Units in the output layer are modeled using only two compartments: dendrites with voltages $V^{1b}(t) = [V_1^{1b}(t), V_2^{1b}(t), ..., V_n^{1b}(t)]$ and somata with voltages $V^1(t) = [V_1^1(t), V_2^1(t), ..., V_n^1(t)]$. The dendritic voltage $V_i^{1b}(t)$ is given by:

(18) $V_i^{1b}(t) = \sum_{j=1}^{m} W_{ij}^1 s_j^0(t) + b_i^1$

where $s^0(t) = [s_1^0(t), s_2^0(t), ..., s_m^0(t)]$ are the filtered presynaptic spike trains at synapses that receive feedforward input from the hidden layer, and are calculated in the manner described by Equation (11). $V_i^1(t)$ evolves as:

(19) $\tau \frac{dV_i^1(t)}{dt} = (V_R - V_i^1(t)) + \frac{g_d}{g_l}\left(V_i^{1b}(t) - V_i^1(t)\right) + I_i(t)$

where $g_l$ is the leak conductance, $g_d$ is the conductance from the dendrite to the soma, and $I(t) = [I_1(t), I_2(t), ..., I_n(t)]$ are somatic currents that can drive output neurons toward a desired somatic voltage. For neuron $i$, $I_i$ is given by:

(20) $I_i(t) = g_{E_i}(t)\left(E_E - V_i^1(t)\right) + g_{I_i}(t)\left(E_I - V_i^1(t)\right)$

where $g_E(t) = [g_{E_1}(t), g_{E_2}(t), ..., g_{E_n}(t)]$ and $g_I(t) = [g_{I_1}(t), g_{I_2}(t), ..., g_{I_n}(t)]$ are time-varying excitatory and inhibitory nudging conductances, and $E_E$ and $E_I$ are the excitatory and inhibitory reversal potentials. In our simulations, we set $E_E = 8$ V and $E_I = -8$ V. During the target phase only, we set $g_{I_i} = 1$ and $g_{E_i} = 0$ for all units $i$ whose output should be minimal, and $g_{E_i} = 1$ and $g_{I_i} = 0$ for the unit whose output should be maximal. In this way, all units other than the ‘target’ unit are silenced, while the ‘target’ unit receives a strong excitatory drive. In the forward phase, $I(t)$ is set to 0. The Poisson spike rates $\phi^1(t) = [\phi_1^1(t), \phi_2^1(t), ..., \phi_n^1(t)]$ are calculated as in Equation (17).
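To make the nudging scheme concrete, the sketch below computes the conductances and the resulting somatic currents of Equation (20) for both phases. The function names and the list-based representation are illustrative assumptions; the reversal potentials and the 0/1 conductance settings are those stated above.

```python
E_E, E_I = 8.0, -8.0   # excitatory and inhibitory reversal potentials


def nudging_conductances(n_units, target_index, target_phase):
    """Return (g_E, g_I) lists for the output layer.

    Forward phase: both are zero, so I(t) = 0.
    Target phase: the target unit gets g_E = 1, all others get g_I = 1.
    """
    if not target_phase:
        return [0.0] * n_units, [0.0] * n_units
    g_e = [1.0 if i == target_index else 0.0 for i in range(n_units)]
    g_i = [0.0 if i == target_index else 1.0 for i in range(n_units)]
    return g_e, g_i


def nudging_current(v_soma, g_e, g_i):
    """Equation (20), applied element-wise to the output-layer somata."""
    return [ge * (E_E - v) + gi * (E_I - v)
            for v, ge, gi in zip(v_soma, g_e, g_i)]


# Usage: ten output units at rest, with digit 3 as the target class.
g_e, g_i = nudging_conductances(10, target_index=3, target_phase=True)
currents = nudging_current([0.0] * 10, g_e, g_i)
```

In this example the target unit receives a positive current while every other unit receives a negative one, which corresponds to the silencing of non-target units described above.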

Plateau potentials

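At the end of the forward and target phases, we calculate plateau potentials $\alpha^f = [\alpha_1^f, \alpha_2^f, ..., \alpha_m^f]$ and $\alpha^t = [\alpha_1^t, \alpha_2^t, ..., \alpha_m^t]$ for apical dendrites of hidden layer neurons, where $\alpha_i^f$ and $\alpha_i^t$ are given by:

(21) $\alpha_i^f = \sigma\left(\frac{1}{\Delta t_1}\int_{t_1 - \Delta t_1}^{t_1} V_i^{0a}(t)\,dt\right), \quad \alpha_i^t = \sigma\left(\frac{1}{\Delta t_2}\int_{t_2 - \Delta t_2}^{t_2} V_i^{0a}(t)\,dt\right)$

where $t_1$ and $t_2$ are the end times of the forward and target phases, respectively, $\Delta t_s = 30$ ms is the settling time for the voltages, and $\Delta t_1$ and $\Delta t_2$ are given by:

(22) $\Delta t_1 = t_1 - (t_0 + \Delta t_s), \quad \Delta t_2 = t_2 - (t_1 + \Delta t_s)$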

Note that Equation (21) is identical to Equation (5) in results. These plateau potentials are used by hidden layer neurons to update their basal weights.
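In discrete time, the plateau potential calculation amounts to discarding the settle period, averaging the remaining apical voltage samples, and applying the sigmoid. The following is a minimal sketch under that reading; the function names, the rectangular approximation of the integral, and the example trace are assumptions for illustration.

```python
import math

DT_MS = 1.0        # time step (ms), Table 1
SETTLE_MS = 30.0   # settle duration Delta t_s (ms), Table 1


def sigmoid(x):
    """Logistic function used as the nonlinearity sigma."""
    return 1.0 / (1.0 + math.exp(-x))


def plateau_potential(apical_trace, dt_ms=DT_MS, settle_ms=SETTLE_MS):
    """Discrete-time version of Equation (21) for one neuron and one phase.

    apical_trace holds the apical voltage at each time step of the phase.
    Samples from the first settle_ms are discarded, as in Equation (22),
    and the mean of the rest is passed through the sigmoid.
    """
    n_skip = int(round(settle_ms / dt_ms))
    window = apical_trace[n_skip:]
    if not window:
        raise ValueError("phase is shorter than the settle duration")
    return sigmoid(sum(window) / len(window))


# Usage: a hypothetical 50 ms phase whose apical voltage settles at 0.4.
trace = [2.0] * 30 + [0.4] * 20
alpha = plateau_potential(trace)
```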

Weight updates

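All feedforward synaptic weights are updated at the end of each target phase. Output layer units update their synaptic weights $W^1$ in order to minimize the loss function

(23) $L^1 = \left\|\phi^{1*} - \phi_{\max}\,\sigma(\overline{V^1}^f)\right\|_2^2$

where $\phi^{1*} = \overline{\phi^1}^t$ as in Equation (6). Note that, as long as neuronal units calculate averages after the network has reached a steady state, and the firing-rates of the neurons are in the linear region of the sigmoid function, then for layer $x$,

(24) $\phi_{\max}\,\sigma(\overline{V^x}^f) \approx \overline{\phi_{\max}\,\sigma(V^x)}^f = \overline{\phi^x}^f$

Thus,

(25) $L^1 \approx \left\|\overline{\phi^1}^t - \overline{\phi^1}^f\right\|_2^2$

as in Equation (7).

All average voltages are calculated after a delay $\Delta t_s$ from the start of a phase, which allows for the network to reach a steady state before averaging begins. In practice this means that the average somatic voltage for output layer neuron $i$ in the forward phase, $\overline{V_i^1}^f$, has the property

(26) $\overline{V_i^1}^f \approx k_d \overline{V_i^{1b}}^f = k_d\left(\sum_{j=1}^{m} W_{ij}^1 \overline{s_j^0}^f + b_i^1\right)$

where $k_d$ is given by:

(27) $k_d = \frac{g_d}{g_l + g_d}$

Thus,

(28) $\frac{\partial L^1}{\partial W^1} \approx -k_d\,\phi_{\max}\left(\phi^{1*} - \phi_{\max}\,\sigma(\overline{V^1}^f)\right)\sigma'(\overline{V^1}^f) \circ \overline{s^0}^f, \quad \frac{\partial L^1}{\partial b^1} \approx -k_d\,\phi_{\max}\left(\phi^{1*} - \phi_{\max}\,\sigma(\overline{V^1}^f)\right)\sigma'(\overline{V^1}^f)$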

Note that these partial derivatives assume that the activity during the target phase is fixed. We do this because the goal of learning is to have the network behave as it does during the target phase, even when the teaching signal is not present. Thus, we do not update synapses in order to alter the target phase activity. As a result, there are no terms in the equation related to the partial derivatives of the voltages or firing-rates during the target phase.

The dendrites in the output layer use this approximation of the gradient in order to update their weights using gradient descent:

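(29) $W^1 \rightarrow W^1 - \eta_1 P_1 \frac{\partial L^1}{\partial W^1}, \quad b^1 \rightarrow b^1 - \eta_1 P_1 \frac{\partial L^1}{\partial b^1}$

where $\eta_1$ is a learning rate constant, and $P_1$ is a scaling factor used to normalize the scale of the rate-of-fire function.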
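One way to read Equations (27) to (29) is as a per-unit error term multiplied by each presynaptic average, that is, an outer product between output-unit errors and hidden-layer filtered spike trains. The sketch below implements that reading with nested lists. The function names, the outer-product interpretation, and the example values are assumptions for illustration rather than the published implementation; constants follow Table 1.

```python
import math

PHI_MAX = 200.0
G_D, G_L = 0.6, 0.1
K_D = G_D / (G_L + G_D)       # Equation (27)
P_1 = 20.0 / PHI_MAX ** 2     # output layer scaling factor, Table 1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def output_layer_update(W1, b1, v1_fwd, phi1_target, s0_fwd, eta1):
    """Apply Equations (28) and (29) in place and return (W1, b1).

    W1: n x m nested list of weights; b1: list of n biases.
    v1_fwd: average forward-phase somatic voltages of the n output units.
    phi1_target: average target-phase rates of the n output units.
    s0_fwd: average forward-phase filtered spike trains of the m hidden units.
    """
    for i in range(len(W1)):
        error = phi1_target[i] - PHI_MAX * sigmoid(v1_fwd[i])
        # Approximate gradient with respect to the dendritic input of unit i.
        delta = -K_D * PHI_MAX * error * sigmoid_prime(v1_fwd[i])
        for j in range(len(s0_fwd)):
            W1[i][j] -= eta1 * P_1 * delta * s0_fwd[j]
        b1[i] -= eta1 * P_1 * delta
    return W1, b1


# Usage: one output unit firing at 100 Hz whose target rate is 150 Hz.
W1, b1 = output_layer_update([[0.0, 0.0]], [0.0], v1_fwd=[0.0],
                             phi1_target=[150.0], s0_fwd=[1.0, 0.0], eta1=0.1)
```

Because the target rate exceeds the forward-phase rate in this example, the weight from the active hidden unit and the bias both increase, while the weight from the silent hidden unit is unchanged.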

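In the hidden layer, basal dendrites update their synaptic weights $W^0$ by minimizing the loss function

(30) $L^0 = \left\|\phi^{0*} - \phi_{\max}\,\sigma(\overline{V^0}^f)\right\|_2^2$

We define the target rates-of-fire $\phi^{0*} = [\phi_1^{0*}, \phi_2^{0*}, ..., \phi_m^{0*}]$ such that

(31) $\phi_i^{0*} = \overline{\phi_i^0}^f + \alpha_i^t - \alpha_i^f$

where $\alpha^f = [\alpha_1^f, \alpha_2^f, ..., \alpha_m^f]$ and $\alpha^t = [\alpha_1^t, \alpha_2^t, ..., \alpha_m^t]$ are forward and target phase plateau potentials given in Equation (21). Note that Equation (31) is identical to Equation (8) in results. These hidden layer target firing rates are similar to the targets used in difference target propagation (Lee et al., 2015).

Using Equation (24), we can show that

(32) $L^0 \approx \left\|\alpha^t - \alpha^f\right\|_2^2$

as in Equation (9). Hence:

(33) $\frac{\partial L^0}{\partial W^0} \approx -k_b\left(\alpha^t - \alpha^f\right)\phi_{\max}\,\sigma'(\overline{V^0}^f) \circ \overline{s^{\mathrm{input}}}^f, \quad \frac{\partial L^0}{\partial b^0} \approx -k_b\left(\alpha^t - \alpha^f\right)\phi_{\max}\,\sigma'(\overline{V^0}^f)$

where $k_b$ is given by:

(34) $k_b = \frac{g_b}{g_l + g_b + g_a}$

Note that although $\phi^{0*}$ is a function of $W^0$ and $b^0$, we do not differentiate this term with respect to the weights and biases. Instead, we treat $\phi^{0*}$ as a fixed state for the hidden layer neurons to learn to reproduce. Basal weights are updated in order to descend this approximation of the gradient:

(35) $W^0 \rightarrow W^0 - \eta_0 P_0 \frac{\partial L^0}{\partial W^0}, \quad b^0 \rightarrow b^0 - \eta_0 P_0 \frac{\partial L^0}{\partial b^0}$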

Again, we assume that the activity during the target phase is fixed, so no derivatives are taken with respect to voltages or firing-rates during the target phase.

Importantly, this update rule is spatially local for the hidden layer neurons. It consists essentially of three terms, (1) the difference in the plateau potentials for the target and forward phases ($\alpha^t - \alpha^f$), (2) the derivative of the spike rate function ($\phi_{\max}\,\sigma'(\overline{V^0}^f)$), and (3) the filtered presynaptic spike trains ($\overline{s^{\mathrm{input}}}^f$). All three of these terms are values that a real neuron could theoretically calculate using some combination of molecular synaptic tags, calcium currents, and back-propagating action potentials.
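The three local terms can be seen directly in a small sketch of the hidden layer update of Equations (33) to (35). As before, the function names, the nested-list representation, the choice of $g_a = 0.05$ from the values in Table 1, and the example numbers are assumptions made for illustration.

```python
import math

PHI_MAX = 200.0
G_B, G_A, G_L = 0.6, 0.05, 0.1
K_B = G_B / (G_L + G_B + G_A)   # Equation (34)
P_0 = 20.0 / PHI_MAX            # hidden layer scaling factor, Table 1


def sigmoid_prime(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def hidden_layer_update(W0, b0, v0_fwd, alpha_fwd, alpha_tgt, s_input_fwd, eta0):
    """Apply Equations (33) and (35) in place and return (W0, b0).

    Every quantity used for neuron i is available at that neuron:
    its own plateau potentials, its own somatic voltage, and the
    filtered spike trains arriving at its basal synapses.
    """
    for i in range(len(W0)):
        plateau_diff = alpha_tgt[i] - alpha_fwd[i]        # term (1)
        rate_slope = PHI_MAX * sigmoid_prime(v0_fwd[i])   # term (2)
        delta = -K_B * plateau_diff * rate_slope
        for j in range(len(s_input_fwd)):
            # term (3): filtered presynaptic spike train at synapse j
            W0[i][j] -= eta0 * P_0 * delta * s_input_fwd[j]
        b0[i] -= eta0 * P_0 * delta
    return W0, b0


# Usage: the target-phase plateau is larger, so the basal weight increases.
W0, b0 = hidden_layer_update([[0.0]], [0.0], v0_fwd=[0.0],
                             alpha_fwd=[0.4], alpha_tgt=[0.6],
                             s_input_fwd=[0.5], eta0=0.1)
```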

One aspect of this update rule that is biologically questionable, though, is the use of the term ($\alpha^t - \alpha^f$). This requires a difference between plateau potentials that are separated by tens of milliseconds. How could such a signal be used by basal dendrite synapses to guide their updates? Plateau potentials can drive bursts of spikes (Larkum et al., 1999), which can propagate to basal dendrites (Kampa and Stuart, 2006). Since the plateau potentials are similar to rate variables (i.e. a sigmoid applied to the voltage), the number of spikes during the bursts, $N^f = [N_1^f, N_2^f, ..., N_m^f]$ and $N^t = [N_1^t, N_2^t, ..., N_m^t]$, for the forward and target plateaus, respectively, could be sampled from a Poisson distribution with rate parameter equal to the plateau potential level:

(36) $N^f \sim \mathrm{Poisson}(\alpha^f), \quad N^t \sim \mathrm{Poisson}(\alpha^t)$

If the distinct phases (forward and target) were marked by some global signal, $\phi(t)$, that was communicated to all of the neurons, for example a neuromodulatory signal, the phase of a global oscillation, or some blanket inhibition signal, then we can imagine an internal cellular memory mechanism in the basal dendrites of the $i$th neuron, $M_i$ (e.g. a molecular signal like the activity of an enzyme, the phosphorylation level of some protein, or the amount of calcium released from intracellular stores), which could be differentially sensitive to the inter-spike interval of bursts, depending on $\phi$. So, for example, we could define:

(37) $\phi(t) = \begin{cases} 1, & \text{if in the forward phase, i.e. } x = f \\ -1, & \text{if in the target phase, i.e. } x = t \end{cases}, \quad \frac{dM_i(t)}{dt} \propto -\phi(t)\,N_i^x$

where $x$ indicates the forward or target phase. Then the change in $M_i$ from before the bursts occur to afterwards would be, on average, proportional to the difference ($\alpha^t - \alpha^f$), and could be used to calculate the weight updates.

However, this is highly speculative, and it is not clear that such a mechanism would be present in real neurons. We have outlined the mathematics here to make the reality of implementing the current model explicit, but we would predict that the brain would have some alternative method for calculating differences between top-down inputs at different times, for example by using somatostatin positive interneurons that are preferentially sensitive to bursts and which target the apical dendrite (Silberberg and Markram, 2007). We are ultimately agnostic as to this mechanism, and so, it was not included in the current model.
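The claim that such a memory trace would track the plateau difference on average can be checked numerically. The sketch below draws burst spike counts as in Equation (36) and accumulates them with opposite signs in the two phases, as in Equation (37). It is purely illustrative: the function names, the plateau values, the number of trials and the use of Knuth's sampling method are assumptions, and the phase signal is simply assumed to enter with the sign that makes forward-phase spikes decrement the trace and target-phase spikes increment it.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def memory_trace_change(alpha_fwd, alpha_tgt, rng):
    """Net change in the hypothetical trace M_i over one pair of phases."""
    n_fwd = poisson_sample(alpha_fwd, rng)   # burst spikes, forward phase
    n_tgt = poisson_sample(alpha_tgt, rng)   # burst spikes, target phase
    return n_tgt - n_fwd


# Usage: average the change over many stimulus presentations.
rng = random.Random(1)
alpha_f, alpha_t = 0.3, 0.7
n_trials = 20000
mean_change = sum(memory_trace_change(alpha_f, alpha_t, rng)
                  for _ in range(n_trials)) / n_trials
```

The average change approaches the plateau difference only over many presentations; on any single presentation the integer spike counts make the estimate noisy.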

Multiple hidden layers

In order to extend our algorithm to deeper networks with multiple hidden layers, our model incorporates direct synaptic connections from the output layer to each hidden layer. Thus, each hidden layer receives feedback from the output layer through its own separate set of fixed, random weights. For example, in a network with two hidden layers, both layers receive the feedback from the output layer at their apical dendrites through backward weights $Y^0$ and $Y^1$. The local targets at each layer are then given by:

(38) $\phi^{2*} = \overline{\phi^2}^t$
(39) $\phi^{1*} = \overline{\phi^1}^f + \alpha^{1t} - \alpha^{1f}$
(40) $\phi^{0*} = \overline{\phi^0}^f + \alpha^{0t} - \alpha^{0f}$

where the superscripts 0 and 1 denote the first and second hidden layers, respectively, and the superscript 2 denotes the output layer. As in Equation (31), the hidden layer targets are the forward phase rates shifted by the difference in plateau potentials.

The local loss functions at each layer are:

(41) $L^2 = \left\|\phi^{2*} - \phi_{\max}\,\sigma(\overline{V^2}^f)\right\|_2^2, \quad L^1 = \left\|\phi^{1*} - \phi_{\max}\,\sigma(\overline{V^1}^f)\right\|_2^2, \quad L^0 = \left\|\phi^{0*} - \phi_{\max}\,\sigma(\overline{V^0}^f)\right\|_2^2$

where $L^2$ is the loss at the output layer. The learning rules used by the hidden layers in this scenario are the same as in the case with one hidden layer.

Learning rate optimization

For each of the three network sizes that we present in this paper, a grid search was performed in order to find good learning rates. We set the learning rate for each layer by stepping through the range [0.1, 0.3] with a step size of 0.02. For each combination of learning rates, a neural network was trained for one epoch on the 60,000 training examples, after which the network was tested on 10,000 test images. The learning rates that gave the best performance on the test set after an epoch of training were used as a basis for a second grid search around these learning rates that used a smaller step size of 0.01. From this, the learning rates that gave the best test performance after 20 epochs were chosen as our learning rates for that network size.

In all of our simulations, we used a learning rate of 0.19 for a network with no hidden layers, learning rates of 0.21 (output and hidden) for a network with one hidden layer, and learning rates of 0.23 (hidden layers) and 0.12 (output layer) for a network with two hidden layers. All networks with one hidden layer had 500 hidden layer neurons, and all networks with two hidden layers had 500 neurons in the first hidden layer and 100 neurons in the second hidden layer.
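The two-stage search described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' script: `evaluate` is a hypothetical callable that trains a network with one learning rate per layer for the given number of epochs and returns its test error, and the half-width of the second grid (`halfwidth=0.02`) is an assumption, since the text only says the second search was "around" the best coarse values.

```python
import itertools
import numpy as np


def grid(lo, hi, step):
    """Evenly spaced values from lo to hi inclusive, rounded to avoid float drift."""
    n = int(round((hi - lo) / step)) + 1
    return np.round(lo + step * np.arange(n), 4)


def two_stage_search(evaluate, n_layers, halfwidth=0.02):
    """Coarse search over [0.1, 0.3], then a finer search around the best point."""
    coarse = grid(0.1, 0.3, 0.02)
    # Stage 1: every combination of per-layer rates, scored after one epoch.
    best = min(itertools.product(coarse, repeat=n_layers),
               key=lambda rates: evaluate(rates, epochs=1))
    # Stage 2: step size 0.01 around the stage-1 winner, scored after 20 epochs.
    fine_axes = [grid(r - halfwidth, r + halfwidth, 0.01) for r in best]
    return min(itertools.product(*fine_axes),
               key=lambda rates: evaluate(rates, epochs=20))
```

The number of combinations grows as 11 to the power of the number of layers in the first stage, so the search stays tractable only for the small networks considered here.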

Training paradigm

For all simulations described in this paper, the neural networks were trained on classifying handwritten digits using the MNIST database of 28 pixel × 28 pixel images. Initial feedforward and feedback weights were chosen randomly from a uniform distribution over a range that was calculated to produce voltages in the dendrites between -6 and 12 V.

Prior to training, we tested a network’s initial performance on a set of 10,000 test examples. This set of images was shuffled at the beginning of testing, and each example was shown to the network in sequence. Each input image was encoded into Poisson spiking activity of the 784 input neurons representing each pixel of the image. The firing rate of an input neuron was proportional to the brightness of the pixel that it represents (with spike rates in the range $[0, \phi_{\max}]$). The spiking activity of each of the 784 input neurons was received by the neurons in the first hidden layer. For each test image, the network underwent only a forward phase. At the end of this phase, the network’s classification of the input image was given by the neuron in the output layer with the greatest somatic potential (and therefore the greatest spike rate). The network’s classification was compared to the target classification. After classifying all 10,000 testing examples, the network’s classification error was given by the percentage of examples that it did not classify correctly.
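To make the input encoding and the error measure concrete, here is a minimal sketch under stated assumptions. The maximum rate of 200 Hz is an assumed value, pixel brightness is assumed to be scaled to [0, 1], and each 1 ms bin is drawn as a Bernoulli trial with probability rate × dt, which approximates a Poisson process when that probability is small. The function names are illustrative.

```python
import numpy as np


def poisson_encode(image, duration_ms, phi_max_hz=200.0, dt_ms=1.0, rng=None):
    """Return a (steps, n_pixels) boolean spike raster for one image.

    The firing rate of each input neuron is proportional to pixel brightness,
    ranging from 0 to phi_max_hz (an assumed value).
    """
    rng = np.random.default_rng() if rng is None else rng
    rates_hz = phi_max_hz * np.asarray(image, dtype=float).ravel()
    p_spike = rates_hz * dt_ms / 1000.0           # spike probability per bin
    steps = int(round(duration_ms / dt_ms))
    return rng.random((steps, p_spike.size)) < p_spike


def classification_error(predicted, targets):
    """Percentage of examples whose predicted label differs from the target."""
    predicted, targets = np.asarray(predicted), np.asarray(targets)
    return 100.0 * np.mean(predicted != targets)
```

A 28 × 28 image flattens to 784 columns, one per input neuron, so the raster for a 500 ms test phase has shape (500, 784).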

Following the initial test, training of the neural network was done in an on-line fashion. All 60,000 training images were randomly shuffled at the start of each training epoch. The network was then shown each training image in sequence, undergoing a forward phase ending with a plateau potential, and a target phase ending with another plateau potential. All feedforward weights were then updated at the end of the target phase. At the end of the epoch (after all 60,000 images were shown to the network), the network was again tested on the 10,000 test examples. The network was trained for up to 60 epochs.

Simulation details

For each training example, a minimum length of 50 ms was used for each of the forward and target phases. The lengths of the forward and target training phases were determined by adding their minimum length to an extra length term, which was chosen randomly from a Wald distribution with a mean of 2 ms and scale factor of 1. During testing, a fixed length of 500 ms was used for the forward transmit phase. Average forward and target phase voltages were calculated after a settle duration of $\Delta t_s = 30$ ms from the start of the phase.
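A minimal sketch of the phase-length sampling and the averaging window is given below. It uses NumPy's `Generator.wald`, whose parameters are the mean and the scale of the distribution. How a fractional phase length is mapped onto 1 ms time steps is not specified in the text, so the flooring used here is an assumption, as are the function names.

```python
import numpy as np

MIN_PHASE_MS = 50.0   # minimum forward/target phase length
SETTLE_MS = 30.0      # settle duration before averaging begins


def sample_phase_length(rng, mean_ms=2.0, scale=1.0):
    """Minimum phase length plus a Wald-distributed extra term (in ms)."""
    return MIN_PHASE_MS + rng.wald(mean_ms, scale)


def averaging_window(phase_length_ms, dt_ms=1.0):
    """Indices of the time steps that contribute to the phase average.

    Assumption: the phase length is floored to a whole number of time steps.
    """
    start = int(round(SETTLE_MS / dt_ms))
    stop = int(phase_length_ms // dt_ms)
    return np.arange(start, stop)
```

With the minimum phase length of 50 ms and a 30 ms settle period, at least 20 time steps contribute to each average.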

For simulations with randomly sampled plateau potential times (Figure 5—figure supplement 1), the time at which each neuron’s plateau potential occurred was randomly sampled from a folded normal distribution ($\mu = 0$, $\sigma^2 = 3$) that was truncated (max = 5) such that plateau potentials occurred between 0 ms and 5 ms before the start of the next phase. In this scenario, the apical voltage was averaged over the last 30 ms in the calculation of the plateau potential for a particular neuron.
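One way to draw such plateau times is sketched below. The text does not say how the truncation was implemented, so the rejection sampling used here is an assumption, and the function name is illustrative.

```python
import numpy as np


def sample_plateau_offsets(n_neurons, rng, variance=3.0, max_ms=5.0):
    """Per-neuron offsets (ms before the next phase) from a truncated folded normal.

    A folded normal is the absolute value of a normal sample. Values above
    max_ms are redrawn (rejection sampling, an assumed truncation method).
    """
    sigma = np.sqrt(variance)
    offsets = np.abs(rng.normal(0.0, sigma, n_neurons))
    while np.any(offsets > max_ms):
        redo = offsets > max_ms
        offsets[redo] = np.abs(rng.normal(0.0, sigma, redo.sum()))
    return offsets
```

Each neuron receives its own offset, so plateau potentials within a layer are no longer synchronized.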

The time-step used for simulations was dt=1 ms. At each time-step, the network’s state was updated bottom-to-top beginning with the first hidden layer and ending with the output layer. For each layer, dendritic potentials were updated, followed by somatic potentials, and finally their spiking activity. Table 1 lists the simulation parameters and the values that were used in the figures presented.

All code was written using the Python programming language version 2.7 (RRID: SCR_008394) with the NumPy (RRID: SCR_008633) and SciPy (RRID: SCR_008058) libraries. The code is open source and is freely available at https://github.com/jordan-g/Segregated-Dendrite-Deep-Learning (Guerguiev, 2017). The data used to train the network was from the Mixed National Institute of Standards and Technology (MNIST) database, which is a modification of the original database from the National Institute of Standards and Technology (RRID: SCR_006440) (Lecun et al., 1998). The MNIST database can be found at http://yann.lecun.com/exdb/mnist/. Some of the simulations were run on the SciNet High-Performance Computing platform (Loken et al., 2010).

Proofs

Theorem for loss function coordination

The targets that we selected for the hidden layer (see Equation (8)) were based on the targets used in Lee et al., 2015. The authors of that paper provided a proof showing that their hidden layer targets guaranteed that learning in one layer helped reduce the error in the next layer. However, there were a number of differences between our network and theirs, such as the use of spiking neurons, voltages, different compartments, etc. Here, we modify the original Lee et al., 2015 proof slightly to prove Theorem 1.

One important thing to note is that the theorem given here utilizes a target for the hidden layer that is slightly different than the one defined in Equation (8). However, the target defined in Equation (8) is a numerical approximation of the target given in Theorem 1. After the proof, we describe exactly how these approximations relate to the targets given here.

Theorem 1

Consider a neural network with one hidden layer and an output layer. Let $\tilde{\phi}^{0} = \overline{\phi^{0}}^{f} + \sigma(Y\overline{\phi^{1}}^{t}) - \sigma(Y\phi_{\max}\sigma(E[\overline{V^{1}}^{f}]))$ be the target firing rates for neurons in the hidden layer, where $\sigma(\cdot)$ is a differentiable function. Assume that $\overline{V^{1}}^{f} \approx k_d\overline{V^{1b}}^{f}$. Let $\phi^{1} = \overline{\phi^{1}}^{t}$ be the target firing rates for the output layer. Also, for notational simplicity, let $\beta(x) \equiv \phi_{\max}\sigma(k_d W^{1}x)$ and $\gamma(x) \equiv \sigma(Yx)$. Theorem 1 states that if $\phi^{1} - \phi_{\max}\sigma(E[\overline{V^{1}}^{f}])$ is sufficiently small, and the Jacobian matrices $J_\beta$ and $J_\gamma$ satisfy the condition that the largest eigenvalue of $(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)$ is less than 1, then

$$\|\phi^{1} - \phi_{\max}\sigma(k_d W^{1}\tilde{\phi}^{0})\|_2^2 < \|\phi^{1} - \phi_{\max}\sigma(E[\overline{V^{1}}^{f}])\|_2^2$$

We note again that the proof for this theorem is essentially a modification of the proof provided in Lee et al., 2015 that incorporates our Lemma 1 to take into account the expected value of $\overline{s^{0}}^{f}$, given that spikes in the network are generated with non-stationary Poisson processes.

Proof.

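$$\phi^{1} - \phi_{\max}\sigma(k_d W^{1}\tilde{\phi}^{0}) \equiv \phi^{1} - \beta(\tilde{\phi}^{0}) = \phi^{1} - \beta\left(\overline{\phi^{0}}^{f} + \gamma(\overline{\phi^{1}}^{t}) - \gamma(\phi_{\max}\sigma(E[\overline{V^{1}}^{f}]))\right)$$

Lemma 1 shows that $\phi_{\max}\sigma(E[\overline{V^{1}}^{f}]) = \phi_{\max}\sigma(E[k_d W^{1}\overline{s^{0}}^{f}]) \approx \phi_{\max}\sigma(k_d W^{1}\overline{\phi^{0}}^{f})$ given a sufficiently large averaging time window. Assume that $\phi_{\max}\sigma(E[\overline{V^{1}}^{f}]) = \phi_{\max}\sigma(k_d W^{1}\overline{\phi^{0}}^{f}) \equiv \beta(\overline{\phi^{0}}^{f})$. Then,

$$\phi^{1} - \beta(\tilde{\phi}^{0}) = \phi^{1} - \beta\left(\overline{\phi^{0}}^{f} + \gamma(\overline{\phi^{1}}^{t}) - \gamma(\beta(\overline{\phi^{0}}^{f}))\right)$$

Let $e = \overline{\phi^{1}}^{t} - \beta(\overline{\phi^{0}}^{f})$. Applying Taylor’s theorem,

$$\phi^{1} - \beta(\tilde{\phi}^{0}) = \phi^{1} - \beta\left(\overline{\phi^{0}}^{f} + J_\gamma e + o(\|e\|_2)\right)$$

where $o(\|e\|_2)$ is the remainder term that satisfies $\lim_{e \to 0} o(\|e\|_2)/\|e\|_2 = 0$. Applying Taylor’s theorem again, and using $\phi^{1} = \overline{\phi^{1}}^{t}$ so that $\phi^{1} - \beta(\overline{\phi^{0}}^{f}) = e$,

$$\begin{aligned}
\phi^{1} - \beta(\tilde{\phi}^{0}) &= \phi^{1} - \beta(\overline{\phi^{0}}^{f}) - J_\beta\left(J_\gamma e + o(\|e\|_2)\right) - o\left(\|J_\gamma e + o(\|e\|_2)\|_2\right) \\
&= \phi^{1} - \beta(\overline{\phi^{0}}^{f}) - J_\beta J_\gamma e - o(\|e\|_2) \\
&= (I - J_\beta J_\gamma)e - o(\|e\|_2)
\end{aligned}$$

Then,

$$\begin{aligned}
\|\phi^{1} - \beta(\tilde{\phi}^{0})\|_2^2 &= \left((I - J_\beta J_\gamma)e - o(\|e\|_2)\right)^T\left((I - J_\beta J_\gamma)e - o(\|e\|_2)\right) \\
&= e^T(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)e - o(\|e\|_2)^T(I - J_\beta J_\gamma)e - e^T(I - J_\beta J_\gamma)^T o(\|e\|_2) + o(\|e\|_2)^T o(\|e\|_2) \\
&= e^T(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)e + o(\|e\|_2^2) \\
&\leq \mu\|e\|_2^2 + |o(\|e\|_2^2)|
\end{aligned}$$

where $\mu$ is the largest eigenvalue of $(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)$. If $e$ is sufficiently small so that $|o(\|e\|_2^2)| < (1 - \mu)\|e\|_2^2$, then

$$\|\phi^{1} - \phi_{\max}\sigma(k_d W^{1}\tilde{\phi}^{0})\|_2^2 \leq \|e\|_2^2 = \|\phi^{1} - \phi_{\max}\sigma(E[\overline{V^{1}}^{f}])\|_2^2$$

Note that the last step requires that $\mu$, the largest eigenvalue of $(I - J_\beta J_\gamma)^T(I - J_\beta J_\gamma)$, is below 1. Clearly, we do not actually have any guarantee of meeting this condition. However, our results show that even though the feedback weights are random and fixed, the feedforward weights actually learn to meet this condition during the first epoch of training (Figure 5—figure supplement 1).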
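The eigenvalue condition can be checked numerically for a given pair of weight matrices. The following NumPy sketch assumes a logistic sigmoid for the nonlinearity and uses illustrative function names; it is not taken from the published code. The Jacobian of the forward map is evaluated at a hidden-layer rate vector `x`, and the Jacobian of the feedback map at an output-layer rate vector `y`.

```python
import numpy as np


def sigmoid(x):
    """Logistic sigmoid (assumed form of the nonlinearity)."""
    return 1.0 / (1.0 + np.exp(-x))


def jacobian_beta(W1, x, k_d, phi_max):
    """Jacobian of beta(x) = phi_max * sigmoid(k_d * W1 @ x) with respect to x."""
    s = sigmoid(k_d * W1 @ x)
    return (phi_max * k_d * s * (1.0 - s))[:, None] * W1


def jacobian_gamma(Y, y):
    """Jacobian of gamma(y) = sigmoid(Y @ y) with respect to y."""
    s = sigmoid(Y @ y)
    return (s * (1.0 - s))[:, None] * Y


def largest_eigenvalue(J_beta, J_gamma):
    """Largest eigenvalue of (I - J_beta J_gamma)^T (I - J_beta J_gamma)."""
    M = np.eye(J_beta.shape[0]) - J_beta @ J_gamma
    # M.T @ M is symmetric, so eigvalsh returns real eigenvalues.
    return float(np.linalg.eigvalsh(M.T @ M).max())
```

A returned value below 1 means the condition required by the last step of the proof holds at the chosen operating point.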

Hidden layer targets

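Theorem 1 shows that if we use a target $\tilde{\phi}^{0} = \overline{\phi^{0}}^{f} + \sigma(Y\overline{\phi^{1}}^{t}) - \sigma(Y\phi_{\max}\sigma(k_d W^{1}\overline{\phi^{0}}^{f}))$ for the hidden layer, there is a guarantee that the hidden layer approaching this target will also push the upper layer closer to its target $\phi^{1}$, if certain other conditions are met. Our specific choice of $\phi^{0}$ defined in the results (Equation (8)) approximates this target rate vector using variables that are accessible to the hidden layer units.

If neuronal units calculate averages after the network has reached a steady state and the firing rates of neurons are in the linear region of the sigmoid function, $\phi_{\max}\sigma(\overline{V^{1}}^{f}) \approx \overline{\phi^{1}}^{f}$. Using Lemma 1, $E[\overline{V^{1}}^{f}] \approx k_d W^{1}\overline{\phi^{0}}^{f}$ and $E[\overline{V^{0a}}^{f}] \approx Y\overline{\phi^{1}}^{f}$. If we assume that $\overline{V^{1}}^{f} \approx E[\overline{V^{1}}^{f}]$ and $\overline{V^{0a}}^{f} \approx E[\overline{V^{0a}}^{f}]$, which is true on average, then:

$$\alpha^{f} = \sigma(\overline{V^{0a}}^{f}) \approx \sigma(Y\overline{\phi^{1}}^{f}) \approx \sigma(Y\phi_{\max}\sigma(\overline{V^{1}}^{f})) \approx \sigma(Y\phi_{\max}\sigma(k_d W^{1}\overline{\phi^{0}}^{f})) \tag{42}$$

and:

$$\alpha^{t} = \sigma(\overline{V^{0a}}^{t}) \approx \sigma(Y\overline{\phi^{1}}^{t}) \tag{43}$$

Therefore, $\phi^{0} \approx \tilde{\phi}^{0}$.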

Thus, our hidden layer targets ensure that our model employs a learning rule similar to difference target propagation that approximates the necessary conditions to guarantee error convergence.

Lemma for firing rates

Theorem 1 had to rely on the equivalence between the average spike rates of the neurons and their filtered spike trains. Here, we prove a lemma showing that this equivalence does indeed hold as long as the integration time is long enough relative to the synaptic time constants $\tau_s$ and $\tau_L$.

Lemma 1

Let $X$ be a set of presynaptic spike times during the time interval $\Delta t = t_1 - t_0$, distributed according to an inhomogeneous Poisson process. Let $N = |X|$ denote the number of presynaptic spikes during this time window, and let $x_k \in X$ denote the $k$th presynaptic spike time, where $0 < k \leq N$. Finally, let $\phi(t)$ denote the time-varying presynaptic firing rate (i.e. the time-varying mean of the Poisson process), and $s(t)$ be the filtered presynaptic spike train at time $t$ given by Equation (11). Then, during the time window $\Delta t$, as long as $\Delta t \gg 2\tau_L^2\tau_s^2\overline{\phi^2}/\left((\tau_L - \tau_s)^2(\tau_L + \tau_s)\right)$,

$$E[\overline{s(t)}] \approx \overline{\phi}$$

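Proof. The average of $s(t)$ over the time window $\Delta t$ is

$$\overline{s} = \frac{1}{\Delta t}\int_{t_0}^{t_1} s(t)\,dt = \frac{1}{\Delta t}\sum_k \int_{t_0}^{t_1} \frac{e^{-(t - x_k)/\tau_L} - e^{-(t - x_k)/\tau_s}}{\tau_L - \tau_s}\,\Theta(t - x_k)\,dt$$

Since $\Theta(t - x_k) = 0$ for all $t < x_k$,

$$\overline{s} = \frac{1}{\Delta t}\sum_k \int_{x_k}^{t_1} \frac{e^{-(t - x_k)/\tau_L} - e^{-(t - x_k)/\tau_s}}{\tau_L - \tau_s}\,dt = \frac{1}{\Delta t}\left(N - \sum_k \frac{\tau_L e^{-(t_1 - x_k)/\tau_L} - \tau_s e^{-(t_1 - x_k)/\tau_s}}{\tau_L - \tau_s}\right)$$

The expected value of $\overline{s}$ with respect to $X$ is given by

$$E_X[\overline{s}] = E_X\left[\frac{1}{\Delta t}\left(N - \sum_k \frac{\tau_L e^{-(t_1 - x_k)/\tau_L} - \tau_s e^{-(t_1 - x_k)/\tau_s}}{\tau_L - \tau_s}\right)\right] = \frac{E_X[N]}{\Delta t} - \frac{1}{\Delta t}E_X\left[\sum_{k=1}^{N} \frac{\tau_L e^{-(t_1 - x_k)/\tau_L} - \tau_s e^{-(t_1 - x_k)/\tau_s}}{\tau_L - \tau_s}\right]$$

Since the presynaptic spikes are an inhomogeneous Poisson process with a rate $\phi$, $E_X[N] = \int_{t_0}^{t_1}\phi\,dt$. Thus,

$$E_X[\overline{s}] = \frac{1}{\Delta t}\int_{t_0}^{t_1}\phi\,dt - \frac{1}{\Delta t}E_X\left[\sum_{k=1}^{N} g(x_k)\right] = \overline{\phi} - \frac{1}{\Delta t}E_X\left[\sum_{k=1}^{N} g(x_k)\right]$$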

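where we let $g(x_k) \equiv \left(\tau_L e^{-(t_1 - x_k)/\tau_L} - \tau_s e^{-(t_1 - x_k)/\tau_s}\right)/(\tau_L - \tau_s)$. Then, the law of total expectation gives

$$E_X\left[\sum_{k=1}^{N} g(x_k)\right] = E_N\left[E_X\left[\sum_{k=1}^{N} g(x_k)\,\Big|\,N\right]\right] = \sum_{n=0}^{\infty}\left(E_X\left[\sum_{k=1}^{N} g(x_k)\,\Big|\,N = n\right]\cdot P(N = n)\right)$$

Letting $f_{x_k}(t)$ denote $P(x_k = t)$, we have that

$$E_X\left[\sum_{k=1}^{N} g(x_k)\,\Big|\,N = n\right] = \sum_{k=1}^{n} E_X[g(x_k)] = \sum_{k=1}^{n}\int_{t_0}^{t_1} g(t)\,f_{x_k}(t)\,dt$$

For an inhomogeneous Poisson process:

$$f_{x_k}(t) = \frac{\phi(t)}{\int_{t_0}^{t_1}\phi(u)\,du} = \frac{\phi(t)}{\overline{\phi}\,\Delta t}$$

for all $t \in [t_0, t_1]$. Since Poisson spike times are independent, this is true for all $k$. Thus,

$$E_X\left[\sum_{k=1}^{N} g(x_k)\,\Big|\,N = n\right] = \frac{1}{\overline{\phi}\,\Delta t}\sum_{k=1}^{n}\int_{t_0}^{t_1} g(t)\phi(t)\,dt = \frac{n}{\overline{\phi}\,\Delta t}\int_{t_0}^{t_1} g(t)\phi(t)\,dt$$

Then,

$$E_X\left[\sum_{k=1}^{N} g(x_k)\right] = \sum_{n=0}^{\infty}\left(\frac{n}{\overline{\phi}\,\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)\cdot P(N = n)\right) = \frac{1}{\overline{\phi}\,\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)\left(\sum_{n=0}^{\infty} n\cdot P(N = n)\right)$$

Now, for an inhomogeneous Poisson process with time-varying rate $\phi(t)$,

$$P(N = n) = \frac{\left[\int_{t_0}^{t_1}\phi(t)\,dt\right]^n e^{-\int_{t_0}^{t_1}\phi(t)\,dt}}{n!} = \frac{[\overline{\phi}\,\Delta t]^n e^{-\overline{\phi}\,\Delta t}}{n!}$$

Thus,

$$\begin{aligned}
E_X\left[\sum_{k=1}^{N} g(x_k)\right] &= \frac{e^{-\overline{\phi}\,\Delta t}}{\overline{\phi}\,\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)\left(\sum_{n=0}^{\infty} n\,\frac{[\overline{\phi}\,\Delta t]^n}{n!}\right) \\
&= \frac{e^{-\overline{\phi}\,\Delta t}}{\overline{\phi}\,\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)(\overline{\phi}\,\Delta t)\,e^{\overline{\phi}\,\Delta t} \\
&= \int_{t_0}^{t_1} g(t)\phi(t)\,dt
\end{aligned}$$

Then,

$$E_X[\overline{s}] = \overline{\phi} - \frac{1}{\Delta t}\left(\int_{t_0}^{t_1} g(t)\phi(t)\,dt\right)$$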

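The second term of this equation is always greater than or equal to 0, since $g(t) \geq 0$ and $\phi(t) \geq 0$ for all $t$. Thus, $E_X[\overline{s}] \leq \overline{\phi}$. As well, the Cauchy-Schwarz inequality states that

$$\int_{t_0}^{t_1} g(t)\phi(t)\,dt \leq \sqrt{\int_{t_0}^{t_1} g(t)^2\,dt}\,\sqrt{\int_{t_0}^{t_1}\phi(t)^2\,dt} = \sqrt{\int_{t_0}^{t_1} g(t)^2\,dt}\,\sqrt{\overline{\phi^2}\,\Delta t}$$

where

$$\int_{t_0}^{t_1} g(t)^2\,dt = \int_{t_0}^{t_1}\left(\frac{\tau_L e^{-(t_1 - t)/\tau_L} - \tau_s e^{-(t_1 - t)/\tau_s}}{\tau_L - \tau_s}\right)^2 dt \leq \frac{1}{2(\tau_L - \tau_s)^2}\left(\frac{4\tau_L^2\tau_s^2}{\tau_L + \tau_s}\right) = \frac{2\tau_L^2\tau_s^2}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}$$

Thus,

$$\int_{t_0}^{t_1} g(t)\phi(t)\,dt \leq \sqrt{\frac{2\tau_L^2\tau_s^2}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}}\,\sqrt{\overline{\phi^2}\,\Delta t} = \sqrt{\Delta t\,\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}}$$

Therefore,

$$E_X[\overline{s}] \geq \overline{\phi} - \frac{1}{\Delta t}\sqrt{\Delta t\,\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)}} = \overline{\phi} - \sqrt{\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{\Delta t\,(\tau_L - \tau_s)^2(\tau_L + \tau_s)}}$$

Then,

$$\overline{\phi} - \sqrt{\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{\Delta t\,(\tau_L - \tau_s)^2(\tau_L + \tau_s)}} \leq E_X[\overline{s}] \leq \overline{\phi}$$

Thus, as long as $\Delta t \gg 2\tau_L^2\tau_s^2\overline{\phi^2}/\left((\tau_L - \tau_s)^2(\tau_L + \tau_s)\right)$, $E_X[\overline{s}] \approx \overline{\phi}$.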

What this lemma says, effectively, is that the expected value of $\overline{s}$ is going to be roughly the average presynaptic firing rate as long as the time over which the average is taken is sufficiently long in comparison to the postsynaptic time constants and the average firing rate is sufficiently small. In our simulations, $\Delta t$ is always greater than or equal to 50 ms, the average firing rate is approximately 20 Hz, and our time constants $\tau_L$ and $\tau_s$ are 10 ms and 3 ms, respectively. Hence, in general:

$$\frac{2\tau_L^2\tau_s^2\,\overline{\phi^2}}{(\tau_L - \tau_s)^2(\tau_L + \tau_s)} = \frac{2(10)^2(3)^2(0.02)^2}{(10 - 3)^2(10 + 3)} \approx 0.001 \ll 50$$

Thus, in the proof of Theorem 1, we assume $E_X[\overline{s}] = \overline{\phi}$.
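The quantities in Lemma 1 can be explored numerically. The sketch below is an illustration under stated assumptions, not part of the published code: it takes a constant presynaptic rate (in spikes per ms, so 20 Hz is 0.02), evaluates the closed-form expectation derived in the proof, compares it with a Monte Carlo estimate over simulated Poisson spike trains, and computes the threshold that the window length must greatly exceed. All function names are illustrative.

```python
import numpy as np

TAU_L, TAU_S = 10.0, 3.0  # synaptic time constants in ms, as given in the text


def g(x, t1):
    """Correction kernel g at spike times x, for an averaging window ending at t1."""
    return (TAU_L * np.exp(-(t1 - x) / TAU_L)
            - TAU_S * np.exp(-(t1 - x) / TAU_S)) / (TAU_L - TAU_S)


def expected_s_bar(phi, window):
    """Closed form of phi - (1/window) * integral of g * phi for a constant rate phi."""
    integral_g = (TAU_L**2 * (1 - np.exp(-window / TAU_L))
                  - TAU_S**2 * (1 - np.exp(-window / TAU_S))) / (TAU_L - TAU_S)
    return phi - phi * integral_g / window


def simulated_s_bar(phi, window, n_trials, rng):
    """Monte Carlo mean of s_bar = (N - sum_k g(x_k)) / window."""
    total = 0.0
    for _ in range(n_trials):
        n = rng.poisson(phi * window)        # spike count in the window
        x = rng.uniform(0.0, window, n)      # spike times for a constant rate
        total += (n - g(x, window).sum()) / window
    return total / n_trials


def window_threshold(mean_sq_rate):
    """The quantity that the window length must greatly exceed in Lemma 1."""
    return (2 * TAU_L**2 * TAU_S**2 * mean_sq_rate
            / ((TAU_L - TAU_S)**2 * (TAU_L + TAU_S)))
```

Evaluating `expected_s_bar` at several window lengths shows how the gap between the expected filtered average and the presynaptic rate shrinks as the window grows.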

References

  1. Y Bengio, Y LeCun (2007). Scaling learning algorithms towards AI. Large-Scale Kernel Machines 34:1–41.
  21. K He, X Zhang, S Ren, J Sun (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
  27. A Krizhevsky, I Sutskever, GE Hinton (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, pp. 1097–1105.
  36. D-H Lee, S Zhang, A Fischer, Y Bengio (2015). Difference target propagation. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp. 498–515.
  40. Y Li, H Li, Y Xu, J Wang, Y Zhang (2016). Very deep neural network for handwritten digit recognition. International Conference on Intelligent Data Engineering and Automated Learning, Springer, pp. 174–182.
  44. L Maaten, G Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579–2605.
  61. N Srivastava, G Hinton, A Krizhevsky, I Sutskever, R Salakhutdinov (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15:1929–1958.
  62. I Sutskever, J Martens, GE Dahl, GE Hinton (2013). On the importance of initialization and momentum in deep learning. ICML 28:1139–1147.
  66. T Tieleman, G Hinton (2012). Lecture 6.5-Rmsprop: Divide the Gradient by a Running Average of Its Recent Magnitude. COURSERA: Neural Networks for Machine Learning 4:26–31.

Decision letter

  1. Peter Latham
    Reviewing Editor; University College London, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: this article was originally rejected after discussions between the reviewers, but the authors were invited to resubmit after an appeal against the decision.]

Thank you for submitting your work entitled "Deep learning with segregated dendrites" for consideration by eLife. Your article has been evaluated by a Senior Editor and three reviewers, one of whom is a member of our Board of Reviewing Editors. The reviewers have opted to remain anonymous.

Our decision has been reached after consultation among the reviewers. Based on these discussions, which are summarized below together with the individual reviews, we regret to inform you that your work will not be considered further for publication in eLife.

Reviewing Editor's summary:

This was a tough one: reviewers 2 and 3 were very positive, and even reviewer 1, who was negative about the clarity of the paper, was very positive about its content and importance. The problem was the writing: the reviewers felt that in its present form, the paper would be understandable only by deep learning experts. I'm sure it would be possible to fix this, but the reviewers also felt that this would be a major undertaking, and might even take a couple of rounds. It is eLife's policy to reject papers in that situation.

I'm very sorry; I would love to see work like this in eLife. Unfortunately, I'm not sure how useful the reviews will be – the most negative reviewer was #1, but there were only a small number of concrete suggestions, mainly because s/he was very lost. Perhaps it would be helpful to find a theoretical neuroscientist who is not an expert in deep networks – presumably your target audience – and see where s/he has trouble understanding the paper.

Reviewer #1:

This paper touches on a very important topic: biologically plausible deep learning. However, this particular version is not suitable for eLife. In fact, it's not clear it's suitable at all: after several hours staring at the paper, I remained thoroughly confused. Please don't get me wrong; I'm guessing the paper is correct; I think the problem is mainly the exposition relative to my level of knowledge.

A few examples:

1) It was never clear from the notation (and often the text) whether they were referring to scalars, vectors or matrices – something that does not help when one is trying to make sense of the math.

2) Above Equation (1) the authors talk about target firing rates. But, except for the output units, it's not clear at all what those are.

3) In Equation (1), I don't know what the target burst, α^t, is. I thought the apical dendrite (presumably what α is referring to) is cut off in the target phase.

4) Why should Equation (1) be the target rate?

5) L^0 is actually |α^t-α^f|^2. Why not say so?

At this point I turned to Materials and methods, in the hopes that the equations would clarify things. They didn't.

6) Equations (5)-(8) are standard, but are written in a very complicated form. There may be a reason for that, but it's confusing for your run of the mill computational neuroscientist.

7) As far as I could tell, neither Equations (7) nor (8) include the feedback from the apical dendrite. And I couldn't figure out from anywhere in Materials and methods how that was implemented.

8) Equation (17) seems inconsistent with Equations (19) and (20).

And at that point I gave up…

Reviewer #2:

This paper takes on the valiant task of making artificial deep neural networks more biologically relevant by implementing multi-compartmental neurons. In particular, the segregation of feed-forward from feedback information processing streams within the single cell is a welcome addition for biologists to see in computational models. The authors use details about the anatomical and physiological properties of cortical pyramidal neurons to implement backprop training. They establish that these biologically-inspired features can be accommodated without significant loss of performance. We believe this paper would be a welcome early step in the direction of bringing deep artificial network and neurophysiology thinking together, but requires conceptual explanation in a few key areas, especially for biologists not familiar with the details of deep networks.

What is the conceptual reason that feedforward and feedback streams need to be separated? Is it because the error signal is computed as the difference between the forward phase and the "correct" answer imposed by the teaching signal on the output neurons in the target phase? Conceptually, it seems that the separation of the signals allows for an error to be computed, and therefore for the appropriate change in weights to be arrived at. This is in contrast to how some often think about the relationship between feedforward and feedback in the brain where the main function of the feedforward/feedback integration is to actively and directly create downstream activity (as opposed to here where it is to change the weights of synapses).

What is the purpose of the random sampling of bursts? Why not just a fixed time? Would asynchronous bursting still be effective? Is the synchronous nature of the bursting in order to coordinate with the feedback from the teaching signal?

Would all of this be mathematically equivalent to a separate set of neurons that deal primarily with teaching signals in feedback pathways, and whose interaction with the "normal" feedforward network be regulated through some disinhibitory mechanism? To say this another way, is there anything special about the single cells and the nonlinearity α used, or could a similar setup be created by separating the different compartments into single neurons and connecting them with normal synapses?

What is the explanation for why weak apical attenuation disrupts learning? Is it because it forces an underestimation of the error by having the difference in activity between forward and target phases become eroded?

Local here means local in space. However, in order to compute weight updates, differences in activity still need to be taken over time. More specifically, the activity in the bursts between forward and target phases (equation 2). What is the biologically plausible mechanism for such non-temporally aligned computation?

Is there any explanation for why sparse feedback weights improve the network?

In general it would be useful to have conceptual explanations for many of the issues discussed above.

Reviewer #3:

I think this is a very valuable manuscript that makes a link between deep learning and a possible biological implementation. As this link is a topic of high scientific relevance and of broad interest, I consider the manuscript suited for a good journal such as eLife, even if there is still a large gap between the performance of deep learning for artificial neuronal networks and the suggested biological implementation (that only considers 2 layers with relatively humble performance). But the authors recognize this well, and the manuscript represents a first step towards future research in this important field.

There is one main issue that should be addressed more thoroughly.

1) The proof of Theorem 1 assumes that the matrix product (J_β) (J_γ) is close to the identity mapping in the readout space. In the cited work by Lee et al. (Difference Propagation, 2015) this is the case because the forward and backward weights are adapted such that they get aligned. In the present case the alignment only becomes indirectly apparent by simulations showing that the error vector in the hidden layer eventually falls within 90 degrees of the true backpropagation error.

As I understand, the top-down weight matrix Y is fixed (e.g. randomly chosen). From a theoretical perspective, one may choose Y to be the pseudo-inverse of the forward weight matrix W^1. In fact, in that case a much simpler proof for Theorem 1 exists (a few lines only). But if Y is random, then the whole idea boils down to the random feedback idea (Lillicrap et al., Nature Communications 2016) and this link should be emphasized more. While in the Supplementary Information of that paper a proof is outlined for linear transfer functions, it remains unclear how this alignment is achieved for nonlinear transfer functions.

If J is chosen to be the transpose of W^1, as is the case in backprop (and in part of the simulations), then nothing has to be proven. But if Y is random, then the big issue is to prove that the mapping γ(y) is approximately an inversion of the mapping β(x). If this were proven, Theorem 1 in the manuscript could be cited as Theorem 2 in Lee et al. (Diff prop, 2015). But in the current form, Theorem 1 replicates the idea of Lee et al. (as it is also stated by the authors) without proving the basic assumption shown to be true in the case of Lee et al. Of course, for the reader's convenience the proof of the Diff-Prop Theorem can still be reproduced.

In my view the core idea for the theory in the paper is (1) with random top-down connections the forward weights align as shown by Lillicrap et al. (2) Given the alignment, the idea of difference propagation with the proof given in Lee et al. can be applied. Once this theoretical fundament is introduced in this form (and simply referred to these papers), the idea of using segregated dendrites to implement the random feedback idea can be stressed.

A bit less fundamental, but still more than minor:

2) In view of the rather deep mathematical issues related to the feedback alignment, I would suggest to defer Lemma 1 to some Supplementary Information. The approximation of PSP signaling by instantaneous Poisson rates when the rate is small as compared to the PSP duration is standard in theoretical neuroscience. But the 3-page proof is still nicely done and may be helpful for a non-specialist who wishes to go into the details.

3) At the end of the subsection “A network architecture with segregated dendritic compartments” (Results) some critical issues are raised about the biological plausibility. In this context it should also be stressed that the alternation between two phases, each of which again subdivided into two further phases (Figure 1C), is not so easy to match to the biology. The phases need a memory that is tagged with the phase information and plasticity that is only turned on in a specific phase, checking out the memory from a previous phase.

Besides mentioning this in the Results, it should also be taken up in a further paragraph in the Discussion. One should mention that synaptic eligibility traces could help out here and that this helps to bridge information across the phases. Moreover, the phases could be implemented by exploiting global (I guess γ) oscillations that are shown to be present in various cognitive states. Discussing the link of learning and γ oscillations may be of general interest in this context.

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for resubmitting your work entitled "Towards deep learning with segregated dendrites" for further consideration at eLife. Your article has been favorably evaluated by Andrew King (Senior Editor) and three reviewers, one of whom is a member of our Board of Reviewing Editors.

This paper is much improved. However, it still has a way to go before it's ready for a neuroscience audience. Given that this has been reviewed several times now and remains in an unacceptable form, we are prepared to offer only one more opportunity to provide an acceptable version of the manuscript.

The easy thing to fix is notation and writing: we believe that, even in its improved form, it would be very hard for a neuroscientist, even a computational one who is used to thinking about circuits, to read, and the main ideas would be difficult to extract. More on that below.

The potentially harder thing to fix is biological plausibility. If we understand things correctly, the neuron must estimate the average PSPs during the feedforward sweep of activity, when only the input is active, estimate them again during the training phase, when the correct output is active as well, and then subtract the two. These signals are separated in time, which means the synapses have to store one signal, wait a few tens of ms, store another, and take the difference. In addition, the difference is computed in the apical dendrite, but it must be transferred to the proximal dendrites. And finally, a global signal is required to tell a synapse in which phase it is so that the estimate can be endowed with the correct sign. All seem nontrivial for neurons and synapses to compute.

Lack of biological plausibility should not rule out a theory – synapses and neurons are, after all, complicated. However, two things are needed. First, you need to provide a mechanism for implementing the learning rule that's not inconsistent with what's known about neurons and synapses. Second, you need to suggest experiments to test these predictions. Of these, the first is probably harder.

Now for the exposition. It may seem like we're micromanaging (and we are), but if this paper is to have an impact on the neuroscience community – a prerequisite for publication in eLife – it has to be cast into familiar notation.

1) The model was much simpler, and more standard, than first impressions would imply. It can be written in the very familiar form

dV^m_i/dt = -g_L V^m_i

+ sum_n g_n (b^n_i + sum_j W^mn_ij s^n_j(t) – V^m_i)

+ g^m_iE(t) (E_E – V_i^m) + g^m_iI(t) (E_I – V_i^m)

where the s^n_j(t) are filtered spike trains,

s^n_j(t) = sum_k kappa(t-t^n_jk),

t^n_jk is the k^th spike on neuron j of type n, and spikes were generated via a Poisson process based on the voltage. (Please note: errors are possible, but the equations look something like what we wrote.)

In this form it is immediately clear to a neuroscientist what kind of network this is. If nothing else, that will save a huge amount of time for the reader – it took hours of going back and forth over the equations in the paper before it became clear that the model was very standard, something that most readers would not have the patience for.

In addition, as written above, it makes it clear exactly how the dendrites are implemented: by varying the g_n. In real dendrites they vary with voltage on the dendrite; in this model they simply vary with time.

And finally, the notation with the A's and B's used in the paper is not helpful to neuroscientists, who are very used to seeing V, or maybe U, for voltage.

2) Along the same lines, a better figure showing the circuit needs to be included. The circuit with multiple hidden layers needs a similar drawing, as we were not able to figure out exactly what it looked like. (We're guessing there was sufficient information in the paper, but the amount of work it would take to extract it seemed high.)

3) The cost function, L^1, seemed somewhat arbitrary. According to Equation 7,

L^1 \propto \sum_i (<σ(U_i)>^t – σ(<U_i>^f))^2

where the angle brackets represent a time average and the superscripts t and f refer to the target and feedforward phases, respectively (basically, the overline was replaced with angle brackets, mainly because we're using plain text). Why was the average taken outside the sigmoid in the target phase and inside the sigmoid in the feedforward phase?

4) A similar question applies to L^0, which is written

L^0 = sum_i (λ_max(<σ(C_i)>^f – σ(<C_i>^f) + α_i^t – α_i^f)^2

As far as we can tell, the first two terms are included to make the update rules work out, and they are eventually set equal to each other. But is there any reason to think that L^0 should be minimized? It seemed unmotivated.

5) In Equations 19 and 22, why are there no terms involving the derivatives of the sigmoid in the target phase?

https://doi.org/10.7554/eLife.22901.026

Author response

[Editors’ note: the author responses to the first round of peer review follow.]

Reviewing Editor's summary:

This was a tough one: reviewers 2 and 3 were very positive, and even reviewer 1, who was negative about the clarity of the paper, was very positive about its content and importance. The problem was the writing: the reviewers felt that in its present form, the paper would be understandable only by deep learning experts. I'm sure it would be possible to fix this, but the reviewers also felt that this would be a major undertaking, and might even take a couple of rounds. It is eLife's policy to reject papers in that situation.

I'm very sorry; I would love to see work like this in eLife. Unfortunately, I'm not sure how useful the reviews will be – the most negative reviewer was #1, but there were only a small number of concrete suggestions, mainly because s/he was very lost. Perhaps it would be helpful to find a theoretical neuroscientist who is not an expert in deep networks – presumably your target audience – and see where s/he has trouble understanding the paper.

Reviewer #1:

This paper touches on a very important topic: biologically plausible deep learning. However, this particular version is not suitable for eLife. In fact, it's not clear it's suitable at all: after several hours staring at the paper, I remained thoroughly confused. Please don't get me wrong; I'm guessing the paper is correct; I think the problem is mainly the exposition relative to my level of knowledge.

We would like to thank reviewer #1 for their comments. We are very pleased that the reviewer recognizes that this is a “…very important topic”. Importantly, we also agree with reviewer #1 that our manuscript, as written, was not pitched at the appropriate level. This was a critical realization for us, and it has helped us to make the paper far more suitable for the general readership of eLife. With some advice from non-specialist readers, we have done a major re-write of the manuscript, especially the early parts where we introduce the central issues and describe our model. As the reviewer will see, we have completely re-written the Introduction and the first half of the Results. As well, we have included two new introductory figures (see Figures 1 and 2) that lay out the issues and describe our approach to solving them in a manner that we believe will be much easier for a general audience to understand. In particular, we now do the following:

1) We define the “credit assignment problem”, and we explain why it is important for neuroscientists to consider. This was, arguably, a major missing piece of explanation in our original submission. Readers who are unfamiliar with deep learning may not have considered the fact that effective synaptic plasticity rules in a multi-layer/multi-circuit network will require some way for neurons to know something about their contribution to the final output. The Introduction and Figure 1A now describe this issue in a manner that is generally accessible. As well, in the Introduction and Figure 1B, we also describe how the backpropagation of error algorithm solves the credit assignment problem with “weight transport”.

2) We provide a concrete explanation of how we are proposing to solve the credit assignment problem. In particular, in the Introduction and Figure 2 we now clarify that: (1) one key to assigning credit in current deep learning models is keeping separate feedforward and feedback calculations, and (2) the main goal of this paper is to accomplish this separation of feedforward and feedback signals in a biologically feasible manner that does not involve a separate feedback pathway, as is implicitly assumed in previous models (such as Lillicrap et al., 2016 and Lee et al., 2015).

3) We provide a more comprehensible description of our model in the first section of the Results. In particular, we now clarify which variables refer to vectors, and which variables refer to scalar values (in fact, we have now adopted a notation where vectors and matrices are always in boldface). Furthermore, when we describe the dynamics of the neurons we do so using the equations for single neurons, and we use a more commonplace notation for differential equations. Finally, we are also careful to fully define all of the values that appear in our target and loss function equations in the Results, as these are key to understanding how the algorithm works.

With the new figures, new notation, and re-written manuscript, we believe that the paper is now much easier for all readers to understand. We hope that reviewer #1 agrees. Below, we address some of reviewer #1's specific comments.

A few examples:

1) It was never clear from the notation (and often the text) whether they were referring to scalars, vectors or matrices – something that does not help when one is trying to make sense of the math.

We thank reviewer #1 for drawing our attention to this point. Other readers have also told us that it was very easy for those who are unfamiliar with deep learning algorithms to get lost in our original use of scalars, versus vectors or matrices, especially when it was not explicitly stated. To help with this we have done three things. First, we have adopted a notation where vectors and matrices are always in boldface, and scalars are not. Second, we have been careful to always define our vectors to make clear which scalar values they contain. For example, when we define the somatic voltage vector now, we refer to “C(t) = [C_1(t), …, C_m(t)]” (see e.g. subsection “A network architecture with segregated dendritic compartments”, fourth paragraph). Third, to make the model easier to understand, we have attempted to define the dynamics of the model in terms of the individual scalar variables rather than the vectors whenever possible (see e.g. Equation (1)). We believe that these changes have significantly improved the readability of the paper.

2) Above Equation (1) the authors talk about target firing rates. But, except for the output units, it's not clear at all what those are.

We agree with reviewer #1 that this was unclear previously. We now spend much more time explicitly defining the target firing rates. For example, we now state in the Results:

“…we defined local targets for the output and the hidden layer, i.e. desired firing rates for both the output layer neurons and the hidden layer neurons. Learning is then a process of changing the synaptic connections to achieve these target firing rates across the network.”

For both the output and the hidden layer neurons we provide more explanation of the target rates that we define. In the Results, using Equations (4), (5) and (6), we are now careful to define the target firing rates on both a mathematical and conceptual level. For example, for the hidden layer targets, we state:

“For the hidden layer we define the target rates-of-fire… using the average rates-of-fire during the forward phase and the difference between the plateau potentials from the forward and target phase… The goal of learning in the hidden layer is to change the synapses W^0 to achieve these targets in response to the given inputs.”

We also provide the reasoning motivating these hidden targets when we discuss the loss functions (see the responses to points 4 and 5 below). Also, we are now careful to define all of the components of our equations before we use them (see Equation (4)). We think that this is a major improvement on the original manuscript, and key to making the paper more enjoyable to read. We hope that reviewer #1 agrees.

3) In Equation (1), I don't know what the target burst, α^t, is. I thought the apical dendrite (presumably what α is referring to) is cut off in the target phase.

This is a perfect example of the lack of clarity in our original manuscript. Again, we thank the reviewer for drawing our attention to this. We have attempted to address this in two ways. First, in order to be more transparent about what these α values actually are, we have renamed them “plateau potentials”. The reason we do that is that they are actually just non-linear versions of the apical dendrite voltages, rather than actual bursts of spikes. Second, we now define the forward and target “plateau potentials” (α^t and α^f) explicitly (see subsection “A network architecture with segregated dendritic compartments”, eighth paragraph and Equation (3)).

4) Why should Equation (1) be the target rate?

We now try to provide a more intuitive explanation for this in the subsection “Credit assignment with segregated dendrites”. Please see the next point for more description.

5) L^0 is actually |α^t-α^f|^2. Why not say so?

Indeed, the reviewer is correct, L^0 will, on average, reduce to ||α^t – α^f||^2, and we should have said so. We now state this in the text (subsection “Credit assignment with segregated dendrites”, sixth paragraph). Furthermore, we also point out that this reduction helps to explain why our target, as defined, helps with credit assignment. Specifically, with an error function equal to ||α^t – α^f||^2, we ensure that when the output layer is sending the same feedback to the hidden layer during both the forward and target phases, then the hidden layer neurons know that they have converged to appropriate representations for accomplishing the categorization task.

Now, as the reviewer asks, the question is, why not simply do this reduction? Why define L^0 as we have? The reason is twofold. First, the reduction is not actually 100% accurate on any given trial, since the average rates-of-fire of the neurons (λ_i) are not necessarily exactly the same thing as the rate-of-fire that one would get if one applied the sigmoid function to the average voltage (σ(C_i(t))). Second, the reduction makes it appear as if L^0 was not a function of the hidden layer activity. But it is, and this is key to calculating the gradient of the loss function with respect to the hidden layer synapses, W^0 (see Equation (22)).
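The first reason above, that an averaged rate need not equal the sigmoid of an averaged voltage, can be illustrated with a minimal numerical sketch. Everything in it is an assumption made for illustration: the voltage trace is synthetic Gaussian noise, the λ_max scaling is omitted, and the helper names (`sigmoid`, `averaging_gap`) are hypothetical rather than taken from the paper's code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean(values):
    return sum(values) / len(values)


def averaging_gap(voltages):
    """sigmoid(mean voltage) minus mean(sigmoid(voltage)) for one trace."""
    return sigmoid(mean(voltages)) - mean([sigmoid(v) for v in voltages])


# Hypothetical somatic voltage trace: Gaussian fluctuations around a mean of 1.0.
rng = random.Random(0)
noisy_trace = [1.0 + rng.gauss(0.0, 2.0) for _ in range(10000)]
flat_trace = [0.5] * 100  # a trace with no fluctuations at all

noisy_gap = averaging_gap(noisy_trace)
flat_gap = averaging_gap(flat_trace)
print(noisy_gap, flat_gap)
```

In this sketch the two quantities agree when the voltage does not fluctuate and differ when it does, which is why the two terms cancel only approximately on any given trial.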

At this point I turned to Materials and methods, in the hopes that the equations would clarify things. They didn't.

6) Equations (5)-(8) are standard, but are written in a very complicated form. There may be a reason for that, but it's confusing for your run of the mill computational neuroscientist.

The form of equation we used originally is common in a number of fields, but we have replaced it with a more standard format that is typical in computational neuroscience (see Equation (1)). This is more appropriate given the target audience of this paper, so we thank the reviewer for pointing this out.

7) As far as I could tell, neither Equations (7) nor (8) include the feedback from the apical dendrite. And I couldn't figure out from anywhere in Materials and methods how that was implemented.

In the original manuscript, Equation (7) did not include apical feedback, but Equation (8) did. We have replaced both equations with Equation (1) in the new manuscript, and stated clearly how we determine the level of apical feedback (i.e. using gA, see subsection “A network architecture with segregated dendritic compartments”, fourth paragraph).

8) Equation (17) seems inconsistent with Equations (19) and (20).

We are not sure why reviewer #1 felt that these equations were inconsistent. In the new version of the manuscript we have attempted to place these equations in a more appropriate location that makes their relevance more obvious.

In summary, we truly are indebted to reviewer #1 for identifying the lack of clarity in our original manuscript, and we recognize that it was not sufficiently accessible. But, we feel that with our major re-write, new figures, and new notation the paper is now well-suited to the readership of eLife. We want to emphasize that, with this paper, our goal is to get physiologists and computational neuroscientists to think differently about the reasons for pyramidal neuron morphology/physiology. That is why we feel it is important for it to be published in a journal with a broad readership, like eLife, rather than in a more specialist journal. Thanks to reviewer #1, we feel that our paper can now achieve these goals. We hope reviewer #1 agrees.

And at that point I gave up.…

Reviewer #2:

This paper takes on the valiant task of making artificial deep neural networks more biologically relevant by implementing multi-compartmental neurons. In particular, the segregation of feed-forward from feedback information processing streams within the single cell is a welcome addition for biologists to see in computational models. The authors use details about the anatomical and physiological properties of cortical pyramidal neurons to implement backprop training. They establish that these biologically-inspired features can be accommodated without significant loss of performance. We believe this paper would be a welcome early step in the direction of bringing deep artificial network and neurophysiology thinking together, but requires conceptual explanation in a few key areas, especially for biologists not familiar with the details of deep networks.

We are very happy that reviewer #2 recognizes that “…making artificial deep neural networks more biologically relevant by implementing multi-compartmental neurons.” is a valiant task, and that they view our paper as “…a welcome early step in the direction of bringing deep artificial network and neurophysiology thinking together…” We agree with the reviewer that there are a number of conceptual issues that required clarification. Below, we address reviewer #2's specific comments.

What is the conceptual reason that feedforward and feedback streams need to be separated? Is it because the error signal is computed as the difference between the forward phase and the "correct" answer imposed by the teaching signal on the output neurons in the target phase? Conceptually, it seems that the separation of the signals allows for an error to be computed, and therefore for the appropriate change in weights to be arrived at. This is in contrast to how some often think about the relationship between feedforward and feedback in the brain where the main function of the feedforward/feedback integration is to actively and directly create downstream activity (as opposed to here where it is to change the weights of synapses).

This is a key issue in our paper, and we are very grateful that reviewer #2 requested more conceptual explanation. First, we are now very clear about why feedforward information must be integrated separately from feedback information for this form of deep learning algorithm to work. We now state:

“…synaptic weight updates in the hidden layers (of previous models) depend on the difference between feedback that is generated in response to a purely feedforward propagation of sensory information, and feedback that is guided by a teaching signal (Lillicrap et al., 2016; Lee et al., 2015; Liao et al., 2015). In order to calculate this difference, sensory information must be transmitted separately from the feedback signals that are used to drive learning.”

This provides the reason for segregating feedback in apical dendrites. As the reviewer points out though, this way of viewing feedback (as a signal to drive learning, rather than a higher-order modulator of low-level activity), is not common in neuroscience. However, the two potential roles for feedback are not necessarily incompatible (as noted in previous models like Spratling and Johnson, 2006 and Körding and König, 2002). Our model focuses on the role of feedback in learning exclusively, but it is likely that future researchers will find ways of combining these functions in deep learning networks. To that end, we have added the following to the Discussion with new references:

“…framing cortico-cortical feedback as a mechanism to modulate incoming sensory activity is a more common way of viewing feedback signals in the neocortex (Larkum, 2013; Gilbert and Li, 2013; Zhang et al. 2014; Fiser et al. 2016). […] Future studies could examine how top-down modulation and a signal for credit assignment can be combined in deep learning models.”

What is the purpose of the random sampling of bursts? Why not just a fixed time? Would asynchronous bursting still be effective? Is the synchronous nature of the bursting in order to coordinate with the feedback from the teaching signal?

This is an excellent question. We ourselves were not sure of the answer immediately. Based on the definitions we give, our intuition was that explicit synchrony was not required, though the temporal relationship between the bursts/plateau potentials and the teaching signal would be important. (Note: we have renamed the “bursts” as “plateau potentials” in this version in order to make their actual form more transparent.) To determine this, we ran some simulations wherein each hidden layer neuron sampled its own inter-plateau interval during each phase, and we examined whether this affected learning. We found that strict synchrony was not, in fact, required and learning proceeded just as well with neurons engaging in plateau potentials at different times (Figure 5—figure supplement 1). However, learning would undoubtedly not work if the teaching signal input was not straddled by the two different plateau potentials. We now note this in the text:

“…(learning was still) obtained when we loosened the synchrony constraints and instead allowed each hidden layer neuron to engage in plateau potentials at different times (Figure 5—figure supplement 1). This demonstrates that strict synchrony in the plateau potentials is not required. But, our target definitions do require two different plateau potentials separated by the teaching signal input, which mandates some temporal control of plateau potentials in the system.”

Would all of this be mathematically equivalent to a separate set of neurons that deal primarily with teaching signals in feedback pathways, and whose interaction with the "normal" feedforward network be regulated through some disinhibitory mechanism? To say this another way, is there anything special about the single cells and the nonlinearity α used, or could a similar setup be created by separating the different compartments into single neurons and connecting them with normal synapses?

Indeed, the reviewer's intuition is 100% correct: we could accomplish the same error signal we use to learn in the hidden layers using a separate feedback pathway, thereby replacing our apical dendritic compartments with other neurons. We now explicitly state this (Introduction, sixth paragraph) and even provide an introductory figure that highlights this other potential solution (Figure 2A).

In fact, in order to make the motivations for the paper more obvious, we now spend some of the Introduction discussing why we are inclined to explore an alternative to a separate feedback pathway:

“…closer inspection uncovers a couple of difficulties with (using a separate feedback pathway)… First, the error signals that solve the credit assignment problem are not global error signals (like neuromodulatory signals used in reinforcement learning). […] Therefore, the real brain's specific solution to the credit assignment problem is unlikely to involve a separate feedback pathway for cell-by-cell, signed signals to instruct plasticity.”

What is the explanation for why weak apical attenuation disrupts learning? Is it because it forces an underestimation of the error by having the difference in activity between forward and target phases become eroded?

The reason that weak apical attenuation disrupts learning is precisely that it prevents the feedback regarding the forward phase (α^f) from cleanly communicating the output that the feedforward information generated. We now state this:

“This demonstrates that although total apical attenuation is not necessary, partial segregation of the apical compartment from the soma is necessary. […] Our data show that electrotonically segregated dendrites are one potential way to achieve the required separation between feedforward and feedback information.”

Local here means local in space. However, in order to compute weight updates, differences in activity still need to be taken over time. More specifically, the activity in the bursts between forward and target phases (equation 2). What is the biologically plausible mechanism for such non-temporally aligned computation?

This is a fantastic question. In some ways, our algorithm trades spatial non-locality for temporal non-locality. However, the temporal non-locality in the network is relatively small (e.g. voltage and/or plateau potential information must be stored for tens of milliseconds), which could potentially be implemented with molecular mechanisms, such as synaptic tags (Redondo and Morris, 2011). We now make this temporal non-locality explicit:

“It should be recognized, though, that although our learning algorithm achieved deep learning with spatially local update rules, we had to assume some temporal non-locality. […] Hence, our model exhibited deep learning using only local information contained within the cells.”

Is there any explanation for why sparse feedback weights improve the network?

The reviewer asks another great question here. Again, we were unsure of the answer at first. In exploring the effects of sparse feedback further, we found that the issue may be one of the scale of feedback weights. Specifically, when we ran the tests on sparse feedback weights in the original manuscript we increased the magnitude of the weights 5x (since we were eliminating 80% of the weights). However, following on this question from reviewer #2, we explored sparse feedback weights without the 5x re-scaling. In this case, we found that learning was impaired (Figure 7—figure supplement 1). Thus, we believe that sparse feedback itself is not beneficial, rather the real reason that sparse feedback weights improved learning in the network was that we were amplifying the difference signals. We now discuss this in the results and include a supplementary figure with this data:

“We found that learning actually improved slightly with sparse weights (Figure 7B, red line), achieving an average error rate of 3.7% by the 60th epoch, compared to the average 4.1% error rate achieved with fully random weights. […] This suggests that sparse feedback provides a signal that is sufficient for credit assignment, but only if it is of appropriate magnitude.”
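The effect of the 5x re-scaling described above can be sketched numerically. The snippet below is only an illustration under loudly stated assumptions: feedback weights are drawn uniformly from [-1, 1] (zero mean), the presynaptic activity is fixed at 1, there are 10 feedback inputs, and the function name `feedback_std` is hypothetical. None of these details come from the manuscript. Under these assumptions, dropping 80% of the weights and multiplying the rest by 5 increases the spread of the summed feedback by roughly sqrt(5), whereas dropping them without re-scaling shrinks it by roughly sqrt(0.2).

```python
import math
import random


def feedback_std(n_trials, n_inputs, keep_prob, scale, rng):
    """Empirical std of the summed feedback sum_j Y_j * x_j over random draws of Y."""
    totals = []
    for _ in range(n_trials):
        total = 0.0
        for _ in range(n_inputs):
            weight = rng.uniform(-1.0, 1.0)   # assumed zero-mean random feedback weight
            if rng.random() >= keep_prob:      # drop the weight with probability 1 - keep_prob
                weight = 0.0
            total += scale * weight * 1.0      # presynaptic activity fixed at 1 for simplicity
        totals.append(total)
    mean_total = sum(totals) / len(totals)
    return math.sqrt(sum((t - mean_total) ** 2 for t in totals) / len(totals))


rng = random.Random(1)
dense = feedback_std(4000, 10, keep_prob=1.0, scale=1.0, rng=rng)
sparse_rescaled = feedback_std(4000, 10, keep_prob=0.2, scale=5.0, rng=rng)
sparse_unscaled = feedback_std(4000, 10, keep_prob=0.2, scale=1.0, rng=rng)
print(sparse_rescaled / dense, sparse_unscaled / dense)
```

The two printed ratios separate the effect of sparsity from the effect of weight magnitude, which is the distinction the additional simulations were designed to test.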

In general it would be useful to have conceptual explanations for many of the issues discussed above.

Reviewer #3:

I think this is a very valuable manuscript that makes a link between deep learning and a possible biological implementation. As this link is of high scientific relevance topic and of broad interest, I consider the manuscript suited for a good journal such as eLife, even if there is still a large gap between the performance of deep learning for artificial neuronal networks and the suggested biological implementation (that only considers 2 layers with relatively humble performance). But the authors well recognize this and the manuscript represents a first step towards future research in this important field.

We are pleased that reviewer #3 recognizes that “…this is a very valuable manuscript that makes a link between deep learning and a possible biological implementation…” and that “… this link is of high scientific relevance topic and of broad interest…” and therefore well-suited to publication in eLife. As the reviewer points out, there is still a large gap between deep learning in artificial neural networks and our understanding of the neurobiology of learning, and like the reviewer, we also believe that this manuscript “… represents a first step towards future research in this important field.” We found reviewer #3's criticisms to be very constructive, and we feel that we have addressed each of their concerns. Below, we address reviewer #3's specific comments.

There is one main issue that should be addressed more thoroughly.

1) The proof of Theorem 1 assumes that the matrix product (J_β) (J_γ) is close to the identity mapping in the readout space. In the cited work by Lee et al. (Difference Propagation, 2015) this is the case because the forward and backward weights are adapted such that they get aligned. In the present case the alignment only becomes indirectly apparent by simulations showing that the error vector in the hidden layer eventually falls within 90 degrees of the true backpropagation error.

As I understand, the top-down weight matrix Y is fixed (e.g. randomly chosen). From a theoretical perspective, one may choose Y to be the pseudo-inverse of the forward weight matrix W^1. In fact, in that case a much simpler proof for Theorem 1 exists (a few lines only). But if Y is random, then the whole idea boils down to the random feedback idea (Lillicrap et al., Nature Communications 2016) and this link should be emphasized more. While in the Supplementary Information of that paper a proof is outlined for linear transfer functions, it remains unclear how for nonlinear transfer functions this alignment is achieved.

If J is chosen to be the transpose of W^1 as it is the case in backprop (and in part of the simulations), then nothing has to be proven. But if Y is random, then the big issue is to prove that the mapping γ(y) is approximately an inversion of the mapping β(x). If this were proven, Theorem 1 in the manuscript could be cited as Theorem 2 in Lee et al. (Diff prop, 2015). But in the current form, Theorem 1 replicates the idea of Lee et al. (as it is also stated by the authors) without proving the basic assumption shown to be true in the case of Lee et al. Of course, for the reader's convenience the proof of the Diff-Prop Theorem can still be reproduced.

In my view the core idea for the theory in the paper is: (1) with random top-down connections the forward weights align, as shown by Lillicrap et al.; (2) given the alignment, the idea of difference propagation with the proof given in Lee et al. can be applied. Once this theoretical foundation is introduced in this form (with simple reference to these papers), the idea of using segregated dendrites to implement the random feedback idea can be stressed.

Reviewer #3 has hit upon a major insight that we ourselves had yet to realize: in using the difference target propagation formalism and related proof of Lee et al. (2015), we essentially assumed that the forward and backward functions in the network were becoming, roughly, inverses of each other (i.e. that the “… matrix product (J_β) (J_γ) is close to the identity mapping in the readout space…”). Yet, in using random, fixed feedback weights without an inverse loss function to train the feedback, we had no guarantee that this condition actually held.

As reviewer #3 surmised, the answer to this problem lies in the behaviour of the feedforward weights from the hidden layer to the output layer, W^1. As in Lillicrap et al. (2016), we find that W^1 “aligns” with the feedback matrix Y. More precisely, we find that as learning proceeds in the first epoch, the maximum eigenvalue of the matrix product (I – J_f J_g)^T (I – J_f J_g) drops below 1, thereby meeting the conditions of the Lee et al. (2015) proof for difference target propagation (see Figure 4—figure supplement 1 which contains this new data). (Note, although this is a very important piece of data in our opinion, we put this new figure in the Supplemental Information in consideration of the general audience at eLife – expert readers like this reviewer will want to see it, but most readers will likely find its specific meaning confusing).
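The eigenvalue condition mentioned above can be checked numerically along the following lines. This is a minimal sketch, not the analysis code behind Figure 4—figure supplement 1: the Jacobians are small hand-written matrices chosen for illustration, the function names are hypothetical, and power iteration is simply one convenient way to obtain the largest eigenvalue of a symmetric positive semi-definite matrix.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def max_eigenvalue(j_f, j_g, iterations=200):
    """Largest eigenvalue of (I - J_f J_g)^T (I - J_f J_g), via power iteration."""
    product = matmul(j_f, j_g)
    n = len(product)
    residual = [[(1.0 if i == j else 0.0) - product[i][j] for j in range(n)]
                for i in range(n)]
    residual_t = [list(column) for column in zip(*residual)]
    gram = matmul(residual_t, residual)        # symmetric, positive semi-definite
    v = [1.0 + 0.1 * i for i in range(n)]      # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                        # v is mapped to zero, e.g. J_f J_g equals I
            return 0.0
        v = [x / norm for x in w]
    gram_v = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * gram_v[i] for i in range(n))  # Rayleigh quotient (v has unit norm)


# Illustrative Jacobians: 2 output units, 3 hidden units (hypothetical values).
j_f = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
aligned_j_g = [[0.9, 0.0], [0.0, 0.8], [0.3, 0.3]]    # J_f J_g is close to the identity
unaligned_j_g = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # J_f J_g is zero

aligned_value = max_eigenvalue(j_f, aligned_j_g)
unaligned_value = max_eigenvalue(j_f, unaligned_j_g)
print(aligned_value, unaligned_value)
```

In this toy case the aligned pair gives a value well below 1, while the pair whose product is zero gives exactly the boundary value of 1, so only the former satisfies the condition.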

We think that this result is exciting, because it shows that feedback alignment from Lillicrap et al. (2016) and difference target propagation from Lee et al. (2015) are intimately linked. As the reviewer suggests, once this theoretical connection is made clear the idea of using segregated dendrites to implement these sorts of deep learning algorithms can be stressed. We now have the following section in the manuscript:

“Interestingly, the correlations between L^0 and L^1 were smaller on the first epoch of training. […] Altogether, our model demonstrates that credit assignment using random feedback weights is a general principle that can be implemented using segregated dendrites.”

A bit less fundamental, but still more than minor:

2) In view of the rather deep mathematical issues related to the feedback alignment, I would suggest to defer Lemma 1 to some Supplementary Information. The approximation of PSP signaling by instantaneous Poisson rates when the rate is small as compared to the PSP duration is standard in theoretical neuroscience. But the 3-page proof is still nicely done and may be helpful for a non-specialist who wishes to go into the details.

We agree with the reviewer's assessment that this Lemma is useful, but not particularly novel for many theoretical neuroscientists. We have moved it to the back of the Supplemental Information, as recommended.

3) At the end of the subsection “A network architecture with segregated dendritic compartments” (Results) some critical issues are raised about the biological plausibility. In this context it should also be stressed that the alternation between two phases, each of which is again subdivided into two further phases (Figure 1C), is not so easy to match to the biology. The phases need a memory that is tagged with the phase information and plasticity that is only turned on in a specific phase, checking out the memory from a previous phase.

Besides mentioning this in the Results, it should also be taken up in a further paragraph in the Discussion. One should mention that synaptic eligibility traces could help out here and that this helps to bridge information across the phases. Moreover, the phases could be implemented by exploiting global (I guess γ) oscillations that are shown to be present in various cognitive states. Discussing the link of learning and γ oscillations may be of general interest in this context.

Indeed, reviewer #3 is correct that our model requires two different phases (possibly mediated by oscillations) and some form of spatially local temporal storage of information (possibly mediated by synaptic eligibility traces). To make these issues clear for the reader, we have now included the following new sections in the manuscript:

“…it is entirely plausible that neocortical micro-circuits would generate synchronized pyramidal plateaus at punctuated periods of time in response to dis-inhibition of the apical dendrites governed by neuromodulatory signals that determine “phases” of processing. Alternatively, oscillations in population activity could provide a mechanism for promoting alternating phases of processing and synaptic plasticity (Buzsáki and Draguhn, 2004).”

“It should be recognized, though, that although our learning algorithm achieved deep learning with spatially local update rules, we had to assume some temporal non-locality. […] Hence, our model exhibited deep learning using only local information contained within the cells.”

[Editors’ note: the author responses to the re-review follow.]

This paper is much improved. However, it still has a way to go before it's ready for a neuroscience audience. Given that this has been reviewed several times now and remains in an unacceptable form, we are prepared to offer only one more opportunity to provide an acceptable version of the manuscript.

The easy thing to fix is notation and writing: we believe that, even in its improved form, it would be very hard for a neuroscientist, even a computational one who is used to thinking about circuits, to read, and the main ideas would be difficult to extract. More on that below.

The potentially harder thing to fix is biological plausibility. If we understand things correctly, the neuron must estimate the average PSPs during the feedforward sweep of activity, when only the input is active, estimate them again during the training phase, when the correct output is active as well, and then subtract the two. These signals are separated in time, which means the synapses have to store one signal, wait a few tens of ms, store another, and take the difference. In addition, the difference is computed in the apical dendrite, but it must be transferred to the proximal dendrites. And finally, a global signal is required to tell a synapse in which phase it is so that the estimate can be endowed with the correct sign. All seem nontrivial for neurons and synapses to compute.

Lack of biological plausibility should not rule out a theory – synapses and neurons are, after all, complicated. However, two things are needed. First, you need to provide a mechanism for implementing the learning rule that's not inconsistent with what's known about neurons and synapses. Second, you need to suggest experiments to test these predictions. Of these, the first is probably harder.

Now for the exposition. It may seem like we're micromanaging (and we are), but if this paper is to have an impact on the neuroscience community – a prerequisite for publication in eLife – it has to be cast into familiar notation.

1) The model was much simpler, and more standard, than first impressions would imply. It can be written in the very familiar form

dV^m_i/dt = -g_L V^m_i
    + sum_n g_n (b^n_i + sum_j W^mn_ij s^n_j(t) - V^m_i)
    + g^m_iE(t) (E_E - V^m_i) + g^m_iI(t) (E_I - V^m_i)

where the s^n_j(t) are filtered spike trains,

s^n_j(t) = sum_k kappa(t-t^n_jk),

t^n_jk is the k^th spike on neuron j of type n, and spikes were generated via a Poisson process based on the voltage. (Please note: errors are possible, but the equations look something like what we wrote.)

In this form it is immediately clear to a neuroscientist what kind of network this is. If nothing else, that will save a huge amount of time for the reader – it took hours of going back and forth over the equations in the paper before it became clear that the model was very standard, something that most readers would not have the patience for.

In addition, as written above, it makes it clear exactly how the dendrites are implemented: by varying the g_n. In real dendrites they vary with voltage on the dendrite; in this model they simply vary with time.

And finally, the notation with the A's and B's used in the paper is not helpful to neuroscientists, who are very used to seeing V, or maybe U, for voltage.
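
To make the structure of the equation in point 1 concrete, the following is a minimal Python sketch of its leak and dendritic-coupling terms for a single input compartment. It is an illustration under stated assumptions, not the simulation used in the paper: the function name `simulate`, the exponential form of the kernel kappa, the parameter values, and the use of fixed-rate Poisson inputs (rather than voltage-dependent spiking) are all assumptions, and the excitatory and inhibitory conductance terms are omitted.

```python
import numpy as np

def simulate(W, b, rates, g_L=0.1, g_n=0.6, tau=10.0, dt=1.0, steps=500, seed=0):
    """Euler-integrate dV/dt = -g_L*V + g_n*(b + W @ s(t) - V).

    W     : (n_post, n_pre) synaptic weights
    b     : (n_post,) biases
    rates : (n_pre,) presynaptic firing rates in spikes per ms (assumed fixed)
    """
    rng = np.random.default_rng(seed)
    n_post, n_pre = W.shape
    V = np.zeros(n_post)            # somatic voltages
    s = np.zeros(n_pre)             # filtered spike trains s_j(t)
    trace = np.empty((steps, n_post))
    for t in range(steps):
        # Bernoulli approximation to Poisson spiking in a bin of width dt
        spikes = rng.random(n_pre) < rates * dt
        # Assumed exponential kernel: decay with time constant tau, jump by 1 per spike
        s += -s * dt / tau + spikes
        # Leak term plus coupling to the dendritic compartment with conductance g_n
        V += dt * (-g_L * V + g_n * (b + W @ s - V))
        trace[t] = V
    return trace

# Example: two postsynaptic neurons driven by three inputs firing at 50 Hz
W = np.ones((2, 3))
b = np.ones(2)
trace = simulate(W, b, rates=np.full(3, 0.05))
```

With no input spikes the voltage settles at g_n b / (g_L + g_n), which shows how the coupling conductance g_n sets the compartment's influence on the soma.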

2) Along the same lines, a better figure showing the circuit needs to be included. The circuit with multiple hidden layers needs a similar drawing, as we were not able to figure out exactly what it looked like. (We're guessing there was sufficient information in the paper, but the amount of work it would take to extract it seemed high.)

3) The cost function, L^1, seemed somewhat arbitrary. According to Equation 7,

L^1 \propto \sum_i (<σ(U_i)>^t – σ(<U_i>^f))^2

where the angle brackets represent a time average and the superscripts t and f refer to the target and feedforward phases, respectively (basically, the overline was replaced with angle brackets, mainly because we're using plain text). Why was the average taken outside the sigmoid in the target phase and inside the sigmoid in the feedforward phase?
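
The distinction raised in point 3 matters because the sigmoid is nonlinear, so averaging before or after applying it generally gives different values. The short Python sketch below illustrates this with made-up voltage samples; the function names and the sample values are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_outside(u):
    """<sigma(U)>: apply the sigmoid to each sample, then average."""
    return sigmoid(u).mean()

def mean_inside(u):
    """sigma(<U>): average the samples, then apply the sigmoid."""
    return sigmoid(u.mean())

# Hypothetical voltage samples with a large fluctuation around their mean
u_fluct = np.array([-2.0, 6.0])
# Samples with no fluctuation at all
u_const = np.array([2.0, 2.0])

gap_fluct = mean_inside(u_fluct) - mean_outside(u_fluct)
gap_const = mean_inside(u_const) - mean_outside(u_const)
```

For the constant samples the two quantities coincide, whereas for the fluctuating samples they do not, so the two phases of the cost function are not treated symmetrically unless the voltage is close to constant over the averaging window.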

4) A similar question applies to L^0, which is written

L^0 = sum_i (λ_max(<σ(C_i)>^f – σ(<C_i>^f) + α_i^t – α_i^f)^2

As far as we can tell, the first two terms are included to make the update rules work out, and they are eventually set equal to each other. But is there any reason to think that L^0 should be minimized? It seemed unmotivated.

5) In Equations 19 and 22, why are there no terms involving the derivatives of the sigmoid in the target phase?

In our last submission, we only received feedback from one reviewer. That reviewer was still concerned about the ease with which the paper could be understood by a general neuroscience audience. With the help of one of the editors, we have worked hard to make the paper easier to understand. We believe that the paper has improved immensely as a result. Now, we explain the dynamics of our simulations in a clear manner that would make it easy to replicate them, and we provide a far more intuitive explanation of how we solve the credit assignment problem with our loss functions. We have also redone the figures, and added a final figure illustrating the model’s experimental predictions. As a result, we believe that the paper is now appropriate for a general neuroscience audience, and we hope you agree.

https://doi.org/10.7554/eLife.22901.027

Article and author information

Author details

  1. Jordan Guerguiev

    1. Department of Biological Sciences, University of Toronto Scarborough, Toronto, Canada
    2. Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-6751-8782
  2. Timothy P Lillicrap

    DeepMind, London, United Kingdom
    Contribution
    Conceptualization, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
  3. Blake A Richards

    1. Department of Biological Sciences, University of Toronto Scarborough, Toronto, Canada
    2. Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
    3. Learning in Machines and Brains Program, Canadian Institute for Advanced Research, Toronto, Canada
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    blake.richards@utoronto.ca
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-9662-2151

Funding

Natural Sciences and Engineering Research Council of Canada (RGPIN-2014-04947)

  • Blake A Richards

Google (Faculty Research Award)

  • Blake A Richards

Canadian Institute for Advanced Research (Learning in Machines and Brains Program)

  • Blake A Richards

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We would like to thank Douglas Tweed, João Sacramento, and Yoshua Bengio for helpful discussions on this work. This research was supported by three grants to BAR: a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (RGPIN-2014-04947), a 2016 Google Faculty Research Award, and a Fellowship with the Canadian Institute for Advanced Research. The authors declare no competing financial interests. Some simulations were performed on the GPC supercomputer at the SciNet HPC Consortium. SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the University of Toronto.

Reviewing Editor

  1. Peter Latham, Reviewing Editor, University College London, United Kingdom

Publication history

  1. Received: November 2, 2016
  2. Accepted: October 22, 2017
  3. Version of Record published: December 5, 2017 (version 1)

Copyright

© 2017, Guerguiev et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • Page views: 10,277
  • Downloads: 1,460
  • Citations: 5

Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.
