1. Neuroscience
Download icon

Interpreting wide-band neural activity using convolutional neural networks

  1. Markus Frey  Is a corresponding author
  2. Sander Tanni
  3. Catherine Perrodin
  4. Alice O'Leary
  5. Matthias Nau
  6. Jack Kelly
  7. Andrea Banino
  8. Daniel Bendor
  9. Julie Lefort
  10. Christian F Doeller
  11. Caswell Barry  Is a corresponding author
  1. Kavli Institute for Systems Neuroscience, Centre for Neural Computation, The Egil and Pauline Braathen and Fred Kavli Centre for Cortical Microcircuits, NTNU, Norwegian University of Science and Technology, Norway
  2. Max-Planck-Insitute for Human Cognitive and Brain Sciences, Germany
  3. Cell & Developmental Biology, UCL, United Kingdom
  4. Institute of Behavioural Neuroscience, UCL, United Kingdom
  5. Open Climate Fix, United Kingdom
  6. DeepMind, United Kingdom
  7. Institute of Psychology, Leipzig University, Germany
Tools and Resources
  • Cited 0
  • Views 3,143
  • Annotations
Cite this article as: eLife 2021;10:e66551 doi: 10.7554/eLife.66551

Abstract

Rapid progress in technologies such as calcium imaging and electrophysiology has seen a dramatic increase in the size and extent of neural recordings. Even so, interpretation of this data requires considerable knowledge about the nature of the representation and often depends on manual operations. Decoding provides a means to infer the information content of such recordings but typically requires highly processed data and prior knowledge of the encoding scheme. Here, we developed a deep-learning framework able to decode sensory and behavioral variables directly from wide-band neural data. The network requires little user input and generalizes across stimuli, behaviors, brain regions, and recording techniques. Once trained, it can be analyzed to determine elements of the neural code that are informative about a given variable. We validated this approach using electrophysiological and calcium-imaging data from rodent auditory cortex and hippocampus as well as human electrocorticography (ECoG) data. We show successful decoding of finger movement, auditory stimuli, and spatial behaviors – including a novel representation of head direction - from raw neural activity.

Introduction

A central aim of neuroscience is deciphering the neural code, understanding the neural representation of sensory features and behaviors, as well as the computations that link them. The task is complex, and although there have been notable successes – such as the identification of orientation selectivity in V1 (Hubel et al., 1959) and the representation of self-location provided by hippocampal place cells (O'Keefe and Dostrovsky, 1971) – progress has been slow. Neural activity is high dimensional and often sparse, while the available datasets are typically incomplete, being both temporally and spatially limited. This problem is compounded by the fact that the code is multiplexed and functionally distributed (Walker et al., 2011). As such, activity in a single region may simultaneously represent multiple variables, to differing extents, across different elements of the neural population. Taking the entorhinal cortex for example, a typical electrophysiological recording might contain spike trains from distinct cells predominantly encoding head direction, self-location, and movement speed via their firing rates (Sargolini et al., 2006; Kropff et al., 2015; Hafting et al., 2005), while other neurons have more complex composite representations (Hardcastle et al., 2017). At the same time, information about speed and location can also be identified from the local field potential (LFP) (McFarland et al., 1975) and the relative timing of action potentials (O'Keefe and Recce, 1993). Fundamentally, although behavioral states and sensory stimuli can generally be considered to be low dimensional, finding the mapping between noisy neural representations and these less complex phenomena is far from trivial.

Historically, the approach for identifying the correspondence between neural data and external observable states – stimuli or behavior – has been one of raw discovery. An experimenter, guided by existing knowledge, must recognize the fact that the activity covaries with some other factor. Necessarily this is an incremental process, favoring identification of the simplest and most robust representations, such as the sparse firing fields of place cells (Muller et al., 1987). Classical methods, like linear regression and linear-nonlinear-Poisson cascade models (Corrado et al., 2005; Kropff et al., 2015), provide powerful tools for the characterization of existing representations but are less useful for the identification of novel responses – they typically require highly processed data in conjunction with strong assumptions about the neural response, and in the former cases are limited to one dimensional variables. Recent advances in machine learning suggest an alternative strategy. Artificial neural networks (ANNs) trained using error backpropagation regularly exceed human-level performance on tasks in which high dimensional data is mapped to lower dimensional labels (Krizhevsky et al., 2012; Mnih et al., 2015). Indeed, these tools have successfully been applied to processed neural data – accurately decoding behavioral variables from observed neural firing rates (Glaser et al., 2017; Tampuu et al., 2018). However, the true advantage of ANNs is not their impressive accuracy but rather the fact that they make few assumptions about the structure of the input data and, once trained, can be analyzed to determine which elements of the input, or indeed combination of elements, are most informative (Cichy et al., 2019; Hasson et al., 2020; Cammarata et al., 2020). Moreover, this framework provides full control over the weights, activations, and objective functions of the model, allowing fine-grained analysis of the inner workings of the network. Viewed in this way ANNs potentially provide a means to accelerate the discovery of novel neural representations.

To test this proposal, we developed a convolutional network (LeCun et al., 2015) able to take minimally processed, wide-band neural data as input and predict behaviors or other co-recorded stimuli. In the first instance, we trained the model with unfiltered and unclustered electrophysiological recordings made from the CA1 pyramidal cell layer in freely foraging rodents. As expected, the network accurately decoded the animals’ location, speed, and head direction – without spike sorting or additional user input. Analysis of the trained network showed that it had ‘discovered’ place cells (O'Keefe and Dostrovsky, 1971; O’keefe and Nadel, 1978) – frequency bands associated with pyramidal waveforms being highly informative about self-location (Epsztein et al., 2011). Equally, it successfully recognized that theta-band oscillations in the LFP were informative about running speed (McFarland et al., 1975; Jeewajee et al., 2008). Unexpectedly, the network also identified a population of putative CA1 interneurons that encoded information about head direction. We corroborated this observation using conventional tools, confirming that the firing rate of these neurons was modulated by facing-direction, a previously unreported relationship. Beyond this we found the trained network provided a means to efficiently conduct analyses which would otherwise have been complex or time consuming. For example, comparison of all frequency bands revealed positive interactions between frequencies associated with waveforms – components of the neural code that convey more information together than when considered individually. Subsequently, to demonstrate the generality of this approach, we applied the same architecture to electrophysiological data from auditory cortex, two-photon calcium imaging data acquired while mice explored a virtual environment as well as ECoG recordings in humans performing finger movements (Schalk et al., 2007).

Our model differs markedly from conventional decoding methods which often use Bayesian estimators (Zhang et al., 1998) for hippocampal recordings in conjunction with highly processed neural data or linear methods for movement decoding (Antelis et al., 2013). In the case of extracellular recordings, this usually implies that time-series are filtered and processed to detect action potentials and assign them to specific neurons. Necessarily this discards information in frequency bands outside of the spike range, potentially introducing biases implicit in the algorithm used (Pachitariu et al., 2016; Chung et al., 2017; Hyung et al., 2017) and operator’s subjective preferences (Harris et al., 2000; Wood et al., 2004), and – despite considerable advances – still demands manual input to adjust clusters (Pachitariu et al., 2016). Furthermore, accurate calculation of prior expectations regarding the way in which the data varies with the decoded variable – an essential component of Bayesian decoding – requires considerable knowledge about the structure of the neural signal being studied and appropriate noise models. Other authors have attempted to address some of these shortcomings, for example, decoding without assigning action potentials to specific neurons (Kloosterman et al., 2014; Ackermann et al., 2019; Deng et al., 2015) or combining LFP and spiking data (Stavisky et al., 2015) for cursor control in patients. However, these approaches relied on existing assumptions about neural coding statistics and did not use all available information in the recordings, while their primary focus was simply to improve decoding accuracy. In contrast, the flexible, general-purpose approach we describe here requires few assumptions and – once trained – can be interrogated to inform the discovery of novel neural representations. In addition, as the model does not rely on specific oscillations or spike waveforms, it can easily generalize across domains – a fact we demonstrate with optical imaging data acquired in rodents as well as human ECoG signals.

Results

In the following section, we present our model, the results, and describe how it was applied across different datasets. First, we report results obtained when decoding spatial behaviors from CA1 recordings made in freely moving rats. Second, we describe how the model results are interpreted and which informative features drive decoding performance, including a more detailed analysis of head direction and speed decoding. Third, we show that the model generalizes to different brain regions and recording techniques, including calcium imaging in mice and ECoG recordings in humans. In short, the framework provides a unified way of decoding continuous behaviors or stimuli from neural time-series data. The model first transforms the neural data into frequency space via a wavelet transformation. Next, the outputs (behaviors or stimuli) are resampled to match the sample rate of the neural data. Finally, the convolutional neural network takes the transformed neural data as input and decodes each output separately.

Accurate decoding of self-location from CA1 recordings

In the first instance, we sought to evaluate our network-based decoding approach on well characterised neural data with a clear behavioral correlate. To this end, we used as input extracellular electrophysiological signals recorded from hippocampal region CA1 in five freely moving rats – place cells from this area being noted for their spatially constrained firing fields that convey considerable information about an animal's self-location (O'Keefe and Dostrovsky, 1971; Muller et al., 1987). Animals were bilaterally implanted with 32 tetrodes and, after recovery and screening, 128 channel wide-band (0 Hz to 15000 Hz sampled at 30 kHz) recordings were made while the rats foraged in a 1.25 x 1.75 m arena for approximately 40 min (see methods). Raw electrophysiological data were decomposed using Morlet wavelets to generate a three-dimensional representation depicting time, channels, and frequencies from 2 Hz to 15,000 Hz (Figure 1A; Christopher and Gilbert, 1998). Using the wavelet coefficients as inputs, the model was trained in a supervised fashion using error backpropagation with the X and Y coordinates of the animal as regression targets. We report test-set performance for fully cross-validated models using five splits across the whole duration of the experiment.

Figure 1 with 11 supplements see all
Accurate decoding of self-location from unprocessed hippocampal recordings.

(A) Top, a typical ’raw’ extracellular recording from a single CA1 electrode. Bottom, wavelet decomposition of the same data, power shown for frequency bands from 2 Hz to 15 kHz (bottom to top row). (B) At each timestep wavelet coefficients (64 time points, 26 frequency bands, 128 channels) were fed to a deep network consisting of 2D convolutional layers with shared weights, followed by a fully connected layer with a regression head to decode self-location; schematic of architecture shown. (C) Example trajectory from R2478, true position (black) and decoded position (blue) shown for 3 s of data. Full test-set shown in Figure 1—video 1. (D) Distribution of decoding errors from trial shown in (C), mean error (14.2 cm ± 12.9 cm, black), chance decoding of self-location from shuffled data (62.2 cm ± 9.09 cm, red). (E) Across all five rats, the network (CNN) was more accurate than a machine learning baseline (SVM) and a Bayesian decoder (Bayesian) trained on action potentials. This was also true when the network was limited to high-frequency components (>250Hz, CNN-Spikes). When only local frequencies were used (<250Hz, CNN-LFP), network performance dropped to the level of the Bayesian decoder (distributions show the fivefold cross validated performance across each of five animals, n=25). Note that this likely reflects excitatory spikes being picked up at frequencies between 150 and 250 Hz (Figure 1—figure supplement 2). (F) Decoding accuracy for individual animals, the network outperformed the Bayesian decoder in all cases. An overview of the performance of all tested models can be seen in Figure 1—figure supplement 3. (G) The advantage of the network over the Bayesian decoder increased when the available data was reduced by downsampling the number of channels (data from R2478). Inset shows the difference between the two methods.

To reduce computational load and improve test set generalization, we use 2D-convolutions with shared weights applied to the three-dimensional input (Figure 1B, Supplementary file 1) – the first eight convolutional layers having weights shared across channels and the final six across time. Implementing weight sharing in this way is desirable as the model is able to efficiently identify features that reoccur across time and channels, for example, prominent oscillations or waveforms, while also drastically reducing model complexity. For comparison, an equivalent architecture trained to decode position from 128 channels of hippocampal electrophysiological but without shared weights had 38,144,900 hyperparameters compared to 5,299,653 – an increase of 720%. The more complex model took 4.7 hr to run per epoch, as opposed to 175 s, and ultimately yielded less accurate decoding (Figure 1—figure supplement 1).

The model accurately decoded position from the unprocessed neural data in all rats, providing a continuous estimate of location with a mean error less than 10% of the environment’s length. This demonstrates that, as expected, the network was able to identify informative signals in the raw neural data (Mean error 17.31 cm ± 4.46 cm; Median error 11.40 cm ± 3.82 cm; Chance level 65.03 cm ± 6.91 cm; Figure 1C,D). To provide a familiar benchmark, we applied a standard Bayesian decoder without a continuity prior (Ólafsdóttir et al., 2015) to the spiking data from the same datasets (see Materials and methods, see also Figure 1—figure supplement 4B for comparison to Bayesian decoder with continuity prior; Zhang et al., 1998). To this end, action potentials were identified, clustered, manually curated, and spike time vectors were used to decode location – data contained in the local field potential (LFP) was discarded. Note that while the Bayesian decoder explicitly incorporates information about the probability with which animals visit each spatial location, the CNN is effectively feed-forward – being presented with ∼2 s windows of data. Our model was consistently more accurate than the Bayesian decoder, exceeding its performance in all animals (Bayesian mean error 23.38 cm ± 4.35 cm; network error 17.31 cm ± 4.46 cm; Wilcoxon signed-rank test: T=18, p=0.0001). The CNN’s advantage over the Bayesian decoder was in part derived from the fact that it made fewer large errors (Figure 1—figure supplement 5), thus the median errors for the two methods were more similar (all channels; Bayesian decoder median error 13.99 cm ± 1.77 cm, network median error 11.40 cm ± 3.82 cm). The high accuracy and efficiency of the model for these harder samples suggest that the CNN utilizes additional information from sub-threshold spikes and those that were not successfully clustered and/or nonlinear information which is not available to the Bayesian decoder.

The relative advantage over the Bayesian decoder increased further when the number of channels used for decoding was downsampled to simulate smaller recordings (linear regression Wald-test (n=31), s=-0.65, p=1.83e-10; Figure 1G). Notably, the model achieved a similar mean decoding performance with twenty tetrodes (80 channels, 23.45 cm ± 3.15 cm) as the Bayesian decoder reached with the full data set (128 channels, 23.25 cm ± 2.79 cm, Figure 1G). Note that for these comparisons, the Bayesian decoder was only applied to periods when the animal was travelling at >3cm/s, in contrast the CNN did not have a speed threshold (i.e. is trained on moving and stationary periods). To confirm this difference did not favor the CNN, we examined how its performance compared to the Bayesian decoder if both were subject to a range of speed thresholds. As expected, limiting the CNN to only periods of movement actually improves its performance, accentuating the difference between it and the Bayes decoder (0 cm/s speed threshold: Bayesian mean error 22.23 cm ± 3.72 cm; network error 17.37 cm ± 3.58 cm; 25 cm/s speed threshold: Bayesian mean error 17.17 cm ± 2.788 cm; network error 13.40 cm ± 3.59 cm, Figure 1—figure supplement 6). To compare the CNN against standard machine learning tools, we used the wavelet transformed data to train support vector machines (SVMs). Note that in this case, the spatial structure of the input is inevitably lost as the input features are transformed to a one-dimensional representation. Both linear (53.6 cm ± 14.77 cm; Figure 1E) and non-linear SVMs (61.2 cm ± 15.67 cm) performed considerably worse than the CNN. Note that it is computationally not feasible to use a fully connected neural network as the flattening of the spatial structure will inevitably lead to a multi-million parameter model. These models are notoriously hard to train and will not fit many consumer-grade GPUs, in contrast to convolutional neural networks which share the weights across the spatial dimension.

To better understand which elements of the raw neural data the network used, we retrained our model using datasets limited to just the LFP (<250Hz) and just the spiking data (>250Hz). In both cases, the network accurately decoded location (spikes-only (CNN-Spikes) mean error 17.23 cm ± 4.69 cm; LFP-only (CNN-LFP) mean error 24.24 cm ± 6.00 cm; Figure 1E), indicating that this framework is able to extract information from varied electrophysiological sources. Consistent with the higher information content of action potentials, the spikes-only network was considerably more accurate than the LFP-only network (Wilcoxon signed-rank test, two-sided (n=25): T=0, p=1.22e-05), although the LFP-only network was still comparable with the spike-based Bayesian decoder (Bayesian 23.38 cm ± 4.35 cm; LFP 24.24 cm ± 6.00 cm; Wilcoxon signed-rank test, two-sided (n=25): T=136, p=0.475). If we retrain the model on frequency bands 0–150 Hz and 150–250 Hz, we observe that spatial information is predominantly contained in higher band frequencies (Figure 1—figure supplement 2, see also Figure 3—figure supplement 1), likely reflecting power from pyramidal cell waveforms reaching these frequencies. Note that previous studies have shown that demodulated theta is informative about the position of an animal in its environment (Agarwal et al., 2014). However, in those experiments theta oscillations were converted into a complex-valued signal, which carried both the magnitude and phase of theta – here we only used the magnitude for decoding of position.

The standard decoding model downsamples the wavelet frequencies to a rate of 30 Hz, potentially discarding transient non-local representations (e.g. replay events and theta sequences). To evaluate the potential of our model to detect these short-lived events, we applied a lower downsampling factor to achieve a sampling rate of 500 Hz. This allowed us to investigate decoded time points during immobility periods in which ripples were detected in the LFP data using standard methods (see Materials and methods). Examining these periods, we saw the CNN often decoded transient, high velocity trajectories that resembled those reported in studies of open field replay which are known to co-occur with ripples (see Figure 1—figure supplement 7A). Consistent with this interpretation the decoding error for these periods (i.e. Euclidean distance between the animal’s location and decoded location) was larger than for matched periods when the animal was stationary but ripples were not present (n=1000, p<0.001, Figure 1—figure supplement 7B). We furthermore found these putative replay trajectories were longer than trajectories decoded from the matched stopped periods (n=1000, p=0.003, Figure 1—figure supplement 7C) but were coherent, similar to trajectories decoded during movement (n=1000, p=0.257, Figure 1—figure supplement 7D). Taken together, these analyses indicate that our model captures transient high-velocity trajectories occurring during sharp-wave ripples but not during other stationary periods – it is highly likely that these correspond to replay events. Thus, it seems plausible that non-local representations are accessible to this CNN framework.

Simultaneous decoding of multiple factors

The hippocampal representation of self-location is arguably one of the most readily identifiable neural codes – at any instance a small number of sparsely active neurons are highly informative. To provide a more stringent test of the network’s ability to detect and decode behavioral variables from unprocessed neural signals, we retrained with the same data but simultaneously decoded position, speed, and head-direction within a single model. CA1 recordings are known to incorporate information about these additional factors but their representation is less pronounced than that for self-location. Thus, the spatial activity of place cells is known to be weakly modulated by head direction (Jercog et al., 2019; Yoganarasimha et al., 2006), while place cell firing rates and both the frequency and amplitude of theta, a 7–10 Hz LFP oscillation, are modulated by running speed (McFarland et al., 1975; Jeewajee et al., 2008). In this more complex scenario, the architecture and hyper-parameters remained the same with just the final fully connected layer of the network being replicated, one layer for each variable, with the provision of appropriate loss functions – cyclical mean absolute error for head direction and mean absolute error for speed (see Materials and methods). All three variables were decoded simultaneously and accurately (Position, 17.78 cm ± 4.96 cm; Head Direction, 0.80 rad ± 0.18 rad; Speed 4.94 cm/s ± 1.00 cm/s; Figure 2A and Figure 1—video 1), with no meaningful decrement in performance relative to the simpler network decoding only position (position-only model 17.31 cm ± 4.46; combined model 17.78 cm ± 4.96 cm; Wilcoxon signed-rank test two-sided (n=25): T=116, p=0.2108). Note that, a Bayesian decoder trained to decode head direction achieves a performance of 0.97 rad ± 0.14 rad using spike sorted neural data, significantly worse than our model (Wilcoxon signed-rank test two-sided (n=25): T=47, p=0.0018) but more accurate than would be expected by chance (Wilcoxon signed-rank test two-sided (n=25): T=0, p=1.22e-5). The R2-score metric from the fully trained network – a measure which represents the portion of variance explained and is independent of the loss function – indicated that mean decoding performance was above chance for all three behaviors (R2-score Position 0.86 ± 0.08, Head Direction 0.60 ± 0.12, Speed 0.72 ± 0.14, Chance R2-score Position −0.14 ± 0.13, Head Direction 0.04 ± 0.11, Speed −0.16 ± 0.22) (Figure 2B). Thus, the network was able to effectively access multiplexed information embedded in minimally processed neural data.

Figure 2 with 1 supplement see all
Simultaneous decoding of multiple variables from hippocampal data.

(A) Position, head direction, and running speed were accurately decoded in concert by a single network. Data from all five animals, each point indicates an error for a single sample. The red dashed line indicates the chance level obtained by shuffling the input relative to the output while fully retraining the model. (B) R2-scores, a loss-invariant measure of model performance – ranging from 1 (perfect decoding) to negative infinity – allowing performance to be compared between dissimilar variables. Data as in (A), each point corresponds to one of five cross-validations within each of five rats.

Interrogation of electrophysiological recordings

Although the network supports accurate decoding of self-location from electrophysiological data, this was not our main aim. Indeed, our primary goal for this framework was to provide a flexible tool capable of discovering and characterising sensory and behavioral variables represented in neural data - providing insight about the form and content of encoded information. To this end, in the fully trained network, we used a shuffling procedure to estimate the influence that each element of the 3D input (frequencies x channel x time) had on the accuracy of the decoded variables (see Materials and methods). Since this approach does not require retraining, it provides a rapid and computationally efficient means of assessing the contribution made by different channels, frequency bands, and time points.

Turning first to position decoding, we saw that the adjacent 469 Hz and 663 Hz frequency bands were by far the most influential (Figure 3A). Since these recordings were made from CA1, we hypothesized that these frequencies corresponded to place cell action potentials. To confirm this hypothesis – and demonstrate that it was possible to objectively use this network-based approach to identify the neural basis of decoded signals – we applied the following approach (see Materials and methods): First, we isolated the waveforms of place cells (n=629) and putative interneurons (n=91) in all animals, which were identified using a conventional approach (Pachitariu et al., 2016; Klausberger et al., 2003; Csicsvari et al., 1999). Second, for these two groups, we calculated the relative representation of the 26 frequency bands in their waveforms. We found that the highly informative 469 Hz and 663 Hz bands were the dominant components of place cell action potentials and that in general the power spectra of these spikes strongly resembled the frequency influence plot for position decoding (Spearman rank-order correlation, two-sided (n=26) ρ=0.84, p=7.63e-08; Figure 3B). In contrast, putative interneurons – which typically have a shorter after-hyperpolarization than place cells (English et al., 2017) – were characterized by higher frequency components (Figure 3B, Mann-Whitney rank test interneuron (n=91) vs. place cell (n=629), U=1009.5, p=2.47e-13), with the highest power at 5304 Hz and 3750 Hz, bands that were considerably less informative about self-location (Figure 3A).

Figure 3 with 3 supplements see all
Analysis of trained network identifies informative elements of the neural code.

(A) A shuffling procedure was used to determine the relative influence of different frequency bands in the network input. Left, the 469 Hz and 663 Hz components – corresponding to place cell action potentials – were highly informative about animals’ positions. Middle, both place cells and putative interneurons (5304 Hz) carried information about head direction. Right, several frequency bands were informative about running speed, including those associated with the LFP (10.4 Hz) and action potentials. Data from all animals. (B) Wavelet coefficients of place cell (top) and interneuron (bottom) waveforms are distinct and correspond to frequencies identified in A. Inset, average waveforms. Data from all animals. (C) Frequency bands associated with place cells (469 and 663 Hz) were more informative about position than those associated with putative interneurons (5304 Hz) – their elimination produced a larger decrement in decoding performance (p<0.001). Data from all animals. (D) A subset of putative CA1 interneurons encodes head direction. Thirty-three of 91 interneurons from five animals exhibited pronounced directional modulation that was stable throughout the recording (green). Depth of modulation quantified using Kullback-Leibler divergence vs. uniform circle. Stability assessed with the Pearson correlation between polar ratemaps from the first and second half of each trial (dark gray and light gray). Cells with p<0.01 for both measures were considered to be reliably modulated by head direction. Inset, example polar ratemaps. Data from all animals.

Since the frequencies associated with place cell waveforms were the most informative, this indicated that the network had correctly identified place cells as the primary source of spatial information in these recordings. To corroborate this, we used the same data and for each channel eliminated power in the 469 Hz and 663 Hz frequency bands at time points corresponding to either place cell or interneuron action potentials. As expected, position decoding was most strongly affected by removal of the place cell time points (Mann-Whitney-U-Test (n=629 place cells, n=91 interneurons): U=1497, p=2.86e-08; Figure 3C). Using the same shuffling method, we also analyzed how informative each channel was about self-location (Figure 1—figure supplement 8). In particular, we found that the number of place cells identified on a tetrode from the spike sorted data was highly correlated with the tetrode’s spatial influence (Spearman rank-order correlation (n=128) ρ=0.71, p=5.11e-06) and that the overall distribution of both number of place cells and spatial influence followed a log-normal distribution (Shapiro-Wilk test on log-transformed data, number of place cells, W=0.79, p=3.59e-05; tetrode influence W=0.59, p=3.04e-08; Figure 1—figure supplement 8B). In sum, this analysis correctly identified that the firing rates of both place cells and putative interneurons are informative about an animal’s location, place cells more so than interneurons (Wilent and Nitz, 2007). The analysis also highlighted the spatial activity of place cells, pointing to the stable place fields as a key source of spatial information.

A potential concern is that our approach might not identify multiple frequency bands if the information they contain is mutually redundant. The previous example, in which place cells and putative interneurons were both found to be informative about self-location, demonstrates this is not entirely the case. However, to further exclude this possibility, we compared the influential frequencies identified from our complete model with models trained on just a single frequency band at a time. Specifically, 26 models were trained, one for each frequency – the performance of each of these models being taken as an indication of the information present in that band. As expected we found both methods identified similar frequencies as indicated by a high correlation between our influence measure and the performance of single frequency band models (Position, Spearman rank-order correlation (n=26) ρ=0.88, p<0.001; Head Direction, ρ=0.82, p<0.001; Speed, ρ=0.47, p=0.02, Figure 3—figure supplement 1). Note that, although each model was individually faster to train than the complete model, the time to train all 26 was considerably longer than the single model applied simultaneously to all frequencies (51.2 hr vs. 8.6 hr, ∼6x faster).

To further assess the robustness of the influence measure, we conducted two additional analyses. First, we varied the volume of data by recalculating the influence measure for different amounts of training data (1–35 min, see Materials and methods). We found that our model achieves an accurate representation of the true influence with as little as 5 min of training data (mean Pearson’s r = 0.89 ± 0.06, Figure 3—figure supplement 2). Secondly, we assessed influence accuracy on a simulated behavior in which we varied the ground truth frequency information (see Materials and methods). The model trained on the simulated behavior is able to accurately represent the ground truth information (modulated frequencies 58 Hz and 3750 Hz, Figure 3—figure supplement 3).

Thus, our combined approach provides a fair and efficient means to determine the informative elements of wide-band neural data. More importantly, analyses of the full network enables multiple frequency bands to be considered in-concert, providing a means to identify interactions (e.g.Figure 1—figure supplement 9) that are not accessible to standard single-frequency methods.

CA1 interneurons are modulated by head direction

Next, having validated our approach for spatial decoding, we examined the basis upon which the network was able to decode head direction. Although place cells primarily provide an allocentric spatial code, their infield firing rate is known to be modulated by heading direction (Muller et al., 1994; Rubin et al., 2014). Consistent with the presence of this directional code, we again saw that the most influential frequencies for head direction decoding were those associated with place cells (469 Hz and 663 Hz; Figure 3A). However, the distribution also incorporated a secondary peak corresponding to the frequencies typical for interneuron waveforms (Spearman rank-order correlation (n=26) ρ=0.76, p=5.71e-06, Figure 3A&B). Presubicular interneurons have been shown to be modulated by both head direction and angular velocity (Preston-Ferrer et al., 2016) but to the best of our knowledge no similar responses have been noted in CA1. To establish if putative interneurons conveyed information about head direction, we again used an ’elimination’ analysis on data from all five animals – the two frequency bands most strongly associated with interneurons (3750 Hz and 5304 Hz) were scrambled at time points when interneuron spikes were present. Consistent with the influence plots, we found that selectively eliminating putative interneurons degraded the accuracy with which head direction was decoded (relative influence: 0.089 ± 0.043, two-sided t-test (n=91) t=4.16, p=0.014). As a final step, to verify this novel observation we reverted to a standard approach. Specifically, we calculated the directional ratemap for each interneuron using only periods when the animal was in motion (>10cm/s), determined the Kullback-Leibler divergence vs. a uniform circle (Doeller et al., 2010), and applied a shuffling procedure to determine significance – as a whole the population exhibited reliable but weak modulation of interneuronal firing rate by head direction (Kullback-Leibler Divergence (n=91): 0.0067 ± 0.009) with 58.2% (53/91) of cells being individually significant (p<0.01). Behaviors that are inhomogeneously distributed or confounded can result in spurious neural correlates (Muller et al., 1994). To control for this possibility, we repeated the analysis using only data from the centre of the environment (>25cm from the long sides of the enclosure and >20cm from the short sides). Additionally, to verify stability, we controlled that ratemaps generated from the first and second half of the trial were correlated (Pearson correlation, p<0.01). Under this more rigorous analysis, we confirmed that a sub-population (36.2%, 33/91) of putative hippocampal interneurons were modulated by head direction, a previously unrecognized spatial correlate (Figure 3D).

Multiple electrophysiological features contribute to the decoding of speed

The frequency influence plots for running speed also showed several local peaks (Figure 3A), that in all cases corresponded to established neural correlates. In rodents, theta frequency and power are well known to co-vary almost linearly with running speed (McFarland et al., 1975; Jeewajee et al., 2008), accordingly analysis of the network identified the 10.4 Hz frequency band as the most influential. Similarly, the firing rate of place cells increases with speed, an effect captured by the peak at 663 Hz. Interestingly a clear peak is also evident at 2652 Hz, indicating that interneuron firing rates are also informative – originating either from CA1 speed cells (Góis and Tort, 2018) or from theta-locked interneurons (Huh et al., 2016). Finally, a 4th peak was evident at 3.66 Hz and 5.17 Hz, a range that corresponds to type 2 (’atropine sensitive’) theta which is present during immobility (Kramis et al., 1975; Sainsbury et al., 1987). To corroborate this conclusion, we calculated the correlation between power in each frequency band and running speed (Figure 2—figure supplement 1A), confirming that the latter band showed the expected negative correlation – higher power at low speeds – while the other three peaks were positively correlated.

Generalization across brain regions and recording techniques

As a final step, we sought to determine how well our approach generalized to other recording techniques and brain areas. Addressing the latter point first, we trained the network using electrophysiological recordings (64 channels) from the primary auditory cortex of a freely moving mouse while pure tone auditory stimuli (4–64 kHz, duration 200 ms) were played from a speaker (Figure 4A). As above, the raw electrophysiological data was transformed to the frequency domain using Morlet wavelets and this wide-band frequency representation was used as input. The model architecture and hyperparameters were kept the same, reducing only the number of down-sampling steps because of the smaller input size (64 channels vs. 128 channels for CA1 recordings – each down-sampling layer halves the number of units in the previous layer). The auditory stimuli – training target – was modelled as a continuous variable with ’−1’ indicating no tone present and the log-transformed frequency of the sound at all other time points. As expected, this model architecture was also able to accurately decode tone stimuli from auditory cortex (Performance in original frequency space, R2-score: 0.734 ± 0.080, chance model: −0.432 ± 0.682, Figure 4B,C). Informative frequencies were concentrated around 663 Hz and 165 Hz, indicating that information content about tone stimuli comes mostly from pyramidal cell activity.

Figure 4 with 1 supplement see all
Model generalizes across recording techniques and brain regions.

(A) Overview of auditory recording. We recorded electrophysiological signals while the mouse is freely moving inside a small enclosure and is presented with pure tone stimuli ranging from 4 kHz to 64 kHz. (B) R2-score for decoding of frequency tone from auditory cortex (0.73 ± 0.08). Each dot describes the R2-score for a 5 s sample of the experiment. Chance level is indicated by the red line. (C) An example section for decoding of auditory tone frequencies from auditory electrophysiological recordings, real tone colored in black, decoded tone in green, the line between real and decoded indicates magnitude of error. (D) Influence plots for decoding of auditory tone stimuli, same method as used for CA1 recordings. (E) Calcium recordings from a mouse running on a linear track in VR. We record from 685 cells and use Suite2p to preprocess the raw images and extract calcium traces which we feed through the model to decode linear position. Overlay shows relative influence for decoding of position calculated for each putative cell. (F) R2-score for decoding of linear position from two-photon CA1 recordings (0.90 ± 0.03). Each dot describes the R2-score for a 5 s sample of the experiment. Chance level is indicated by the red line. (G) Example trajectory through the virtual linear track (linearized to [-π,π] with real position (black) and decoded position (orange)). (H) Influence plots for decoding of position from two-photon calcium imaging. Note that the range of frequencies is between 0 Hz and 15 Hz as the sampling rate of the calcium traces is 30 Hz.

Having shown that the model generalizes across different brain areas, we wanted to further investigate if it generalizes across different recording techniques. Therefore, in the third set of experiments, we acquired two-photon calcium fluorescence data from mouse CA1 while the head-fixed animal explored a 230 cm virtual track. Raw data was preprocessed to generate denoised activity traces for putative cells (n=685 regions of interest), these were then decomposed to a frequency representation using the same wavelet approach as before – only frequency bands between 0 Hz and 15 Hz being used because of the lower data rate (30 Hz) (see Materials and methods, Figure 4E). As before, wavelet coefficients were provided to the network as input and the only change was an increase in the number of down-sampling steps to account for the large number of ROIs (685 ROIs vs 128 channels for CA1 recordings). The network was able to accurately decode the animal’s position on the track (mean error: 15.87 ± 16.33 cm, R2-scores of 0.90 ± 0.03 vs. chance model −0.05 ± 0.127, Figure 4F,G). Using the same shuffling technique as before, we generated influence plots indicating the relative information provided by putative cells (Figure 4E) and frequencies (Figure 4H). In the frequency domain, the most informative bands were 0.33 Hz and 0.46 Hz, unsurprisingly mirroring the 1–2 s decay time of GCaMP6s (Chen et al., 2013). Interestingly the relative information content of individual cells was highly heterogeneous, a small subset (18.2%) of cells accounted for half (50%) of the influence – these units being distributed across the field of view with no discernible pattern (Figure 4E).

To further assess the ability of our model to decode continuous behavior from neural data, we investigated its performance on an Electrocorticography (ECoG) dataset recorded in humans (Schalk et al., 2007) made available as part of a BCI competition (BCI Competition IV, Dataset 4). In this dataset, three participants were instructed to move their fingers while ECoG signals were acquired. We used the available training data from three subjects to train and validate our model and report performance on the test set provided by the competition hosts. We used the same model pipeline, adjusting parameters to fit the specifics of the dataset. In particular, we used a downsampling factor of 50, as the original sampling rate is 1000 Hz (in comparison to 30000 Hz in the CA1 recordings) which leads to an effective sampling rate of 20 Hz. We trained the model with 128 timesteps (6.4 s) and used a mean squared error loss function between original finger movement and decoded finger movement. All other model parameters were kept the same. We reach an average Pearson’s r of 0.517 ± 0.160 across three subjects (Figure 4—figure supplement 1A), outperforming the winning performance of the BCI competition which reached a performance of 0.46 (note that in line with the evaluation criteria of the BCI competition we excluded the ring finger for the comparison of correlation coefficients). This allows the model to accurately predict movement of the finger, on a never before seen test set, across three different subjects (Figure 4—figure supplement 1B).

These results together show that our model can be used on a wide variety of continuous regression problems in both rodents and humans and across a wide range of recording systems, including calcium imaging and electrocorticography datasets.

Discussion

The neural code provides a complex, non-linear representation of stimuli, behaviors, and cognitive states. Reading this code is one of the primary goals of neuroscience – promising to provide insights into the computations performed by neural circuits. However, decoding is a non-trivial problem, requiring strong prior knowledge about the variables encoded and, crucially, the form in which they are represented. Not only is this information often incomplete or absent but a full characterization of the neural code is precisely the question we seek to solve. Addressing these limitations, we investigated the potential of a deep-learning framework to decode behaviors and stimuli from wide-band, minimally processed neural activity. To this end, we designed a model architecture using simple 2D convolutions with shared weights, omitting recurrent layers (Bai et al., 2018). These intentional design choices resulted in a fast, data efficient architecture that could be easily interpreted to discover which elements of the neural code provided information about specific variables – a decrease in network performance was accepted as a trade-off. We showed that this approach generalized well across brain regions and recording techniques capturing both spatial and temporal information in the signal, the only changes necessary to the network being adjustments to handle the number of channels in the input matrix.

In the first instance, we validated our model using the well-characterized spatial representations of rodent CA1 place cells. Decoding performance amply exceeded a Bayesian framework, as well as a standard machine learning approach that proved ineffective on the non-linear representation of self-location. Importantly, simple analyses of the trained network correctly indicated that place cell action potentials were the most informative spatial signal – confirming that this tool can deliver insights into the nature of the neural code. In a further set of experiments, we showed that the network was able to concurrently identify multiple representations of head direction and running speed, including several that were only recently reported and one - interneuron encoding of head direction – that was previously unreported. Importantly, this framework can also identify interactions between frequency components, an analysis that is intractable to conventional methods which consider features independently. Finally, we demonstrated the flexibility of this approach, applying the same network and hyper-parameters, with adjustments made only to the input and output layers, to two-photon calcium data and extracellular recordings from auditory cortex.

In sum, we believe deep-learning based frameworks such as this constitute a valuable tool for experimental neuroscientists, being able to provide a general overview as to whether a variable is encoded in time-series data and also providing detailed information about the nature of that encoding – when, where, and in what frequency bands it is present. That is not to say that this approach is a complete substitute for conventional analyses – it merely constrains the search space for variables that might be present and their plausible format. Indeed, we imagine this network might be best used as a first-pass analysis, followed by conventional approaches to determine explicitly if a variable is present – much as we did for the interneuron representation of head direction. While we tested the network with optical and electrophysiological data in rodents, as well as ECoG data in humans, it is highly likely that it will perform well with neural data acquired in most experimental settings, including fMRI, EEG, and MEG.

Materials and methods

Tetrode recordings from CA1

Request a detailed protocol

Five male Lister Hooded rats were used for this study. All procedures were approved by the UK Home Office, subject to the restrictions and provisions contained in the Animals Scientific Procedures Act of 1986. All rats (333–386 g/13–17 weeks old at implantation) were implanted with two single-screw microdrives (Axona Ltd.) targeted to the right and left CA1 (ML: 2.5 mm, AP: 3.8 mm posterior to bregma, DV: 1.6 mm from dura). Each microdrive was assembled with two 32 channel Omnetics connectors (A79026-001) and 16 tetrodes of twisted wires (either 17 μm H HL coated platinum iridium, 90% and 10% respectively, or 12.7 μm HM-L coated Stablohm 650; California Fine Wire), platinum plated to reduce impedance to below 150 kΩ at 1 kHz (NanoZ). After surgery, rats were housed individually on a 12 hr light/dark cycle and after one week of recovery rats were maintained at 90% of free-feeding weight with ad libitum access to water.

Screening was performed from one week after surgery. Electrophysiological data was acquired using Open Ephys recording system (Siegle et al., 2017) and a 64-channel amplifier board per drive (Intan RHD2164). Positional tracking performed using a Raspberry Pi with Camera Module V2 (synchronized to Open Ephys system) and custom software, that localized two different brightness infra-red LEDs attached to amplifier boards on camera images acquired at 30 Hz. During successive recording sessions in a separate screening environment 1.4 x 1.4 m the tetrodes were gradually advanced in 62.5 μm steps until place cells were identified. During the screening session, the animals were often being trained in a spatial navigation task for projects outside the scope of this study.

The experiments were run during the animals’ dark period of the L/D cycle. The recording sessions used in this study were around 40 min long, depending on the spatial sampling of the animal, in a rectangular environment of 1.75 x 1.25 m, on the second, third or fourth exposure, varying between animals. The environment floor was black vinyl flooring, it was constructed of 60 cm high boundaries (MDF) colored matt black, surrounded by black curtains on the sides and above. There was one large cue card raised above the boundary and two smaller cue cards distributed on the side of the boundary. Foraging was encouraged with 20 mg chocolate-flavored pellets (LBS Biotechnology) dropped into the environment by custom automated devices. The recordings used in this study were part of a longer session that involved foraging in multiple other different size open field environments.

Rats were anesthetized with isoflurane and given intraperitoneal injection of Euthanal (sodium pentobarbital) overdose (0.5 ml/100 g) after which they were transcardially perfused with saline, followed by a 10% Formalin solution. Brains were removed and stored in 10% Formalin and 30% Sucrose solution for 3–4 days prior to sectioning. Subsequently, 50 μm frozen coronal sections were cut using a cryostat, mounted on gelatine coated or positively charged glass slides, stained with cresyl violet and cleared with clearing agent (Histo-Clear II), before covering with DPX and coverslips. Sections were then inspected using Olympus microscope and tetrode tracks reaching into CA1 pyramidal cell layer were verified.

Putative interneurons were classified based on waveform shape, minimum firing rate across multiple environments and lack of spatial stability. Specifically, classified interneurons had waveform half-width less than 0.15 ms, maximum ratio of amplitude to trough of 0.4, minimum firing rate of 4 Hz and maximal 0.75 spatial correlation of ratemaps from first and last half of the recording in any environment (Klausberger et al., 2003; Csicsvari et al., 1999). Note that we used the spatial stability in order to differentiate interneurons from place cells or grid cells, with no influence on the directional stability of the head direction cell analysis.

Calcium recordings from CA1

Request a detailed protocol

All procedures were conducted in accordance to UK Home Office regulations.

One GCaMP6f mouse (C57BL/6J-Tg(Thy1-GCaMP6f)GP5.17Dkim/J, Jacksons) was implanted with an imaging cannula (a 3 mm diameter x 1.5 mm height stainless-steel cannula with a glass coverslip at the base) over CA1 (stereotaxic coordinates: AP=-2.0, ML=-2.0 from bregma). A 3 mm craniotomy was drilled at these coordinates. The cortex was removed via aspiration to reveal the external capsule of the hippocampus. The cannula was inserted into the craniotomy and secured to the skull with dental cement. A metal head-plate was glued to the skull and secured with dental cement. The animal was left to recover for at least 1 week after surgery before diet restriction and habituation to head-fixation commenced.

Following a period of handling and habituation, the mouse was head-fixed above a styrofoam wheel and trained to run for reward through virtual reality environments, presented on 3 LCD screens that surrounded the animal. ViRMEn software (Aronov and Tank, 2014) was used to design and present the animal with virtual reality linear tracks. Movement of the animal on the wheel was recorded with a rotatory encoder and lead to corresponding translation through the virtual track. During the experimental phase of the training, the animal was trained to run down a 230 cm linear track and was required to lick at a reward port at a fixed, unmarked goal location within the environment in order to trigger release of a drop of condensed milk. Licks were detected by an optical lick detector, with an IR LED and sensor positioned on either side of the animal’s mouth. When the animal reached the end of the linear track, a black screen appeared for 2 s and the animal was presented with the beginning of the linear track, starting a new trial.

Imaging was conducted using a two-photon microscope (resonant scanning vivoscope, Scientifica) using 16x/0.8-NA water-immersion objective (Nikon). GCaMP was excited using a Ti:sapphire laser (Mai Tai HP, Spectra-Physics), operated with an excitation wavelength of 940 nm. ScanImage software was used for data collection / to interface with the microscope hardware. Frames were acquired at a rate of 30 Hz.

The Suite2p toolbox (Pachitariu et al., 2017) was used to motion correct the raw imaging frames and extract regions of interest, putative cells.

Tetrode recordings from auditory cortex

Request a detailed protocol

Sound-evoked neuronal responses were obtained via chronically-implanted electrodes in the right hemisphere auditory cortex of one 17-week-old male mouse (M. musculus, C57Bl/6, Charles River). All experimental procedures were carried out in accordance with the institutional animal welfare guidelines and a UK Home Office Project License approved under the United Kingdom Animals (Scientific Procedures) Act of 1986.

During recordings, the animal was allowed to freely move within a 11x21 cm cardboard enclosure, with one wall consisting of an acoustically transparent mesh panel to allow unobstructed sound stimulation. Acoustic stimuli were delivered via two free-field electrostatic speakers (Tucker-Davis Technologies, FL, USA) placed at ear level, 7 cm from the edge of the enclosure. Recordings were performed inside a double-walled soundproof booth (IAC Acoustics), whose interior was covered by 4 cm thick acoustic absorption foam (E-foam, UK). Pure tones were generated using MATLAB (Matlab version R2015a; MathWorks, Natwick, MA, USA), and played via a digital signal processor (RX6, Tucker Davis Technologies, FL, USA). The frequency response of the loudspeaker was ±10 dB across the frequency range used for stimulation. Pure tones of a duration of 200 ms (including 5 ms linear rise and fall times) of variable frequencies (4–64 kHz in 0.1 octave increments) were used for stimulation. The tones were presented at 65 dB SPL at the edge of the testing box. The 41 frequencies were presented pseudo-randomly, separated by a randomly varying inter-stimulus interval ranging from 500 to 510 ms, for a total of 20 repetitions.

Extracellular electrophysiological recordings were obtained using a custom chronically implanted 64-channel hyperdrive with two 32-channel Omnetics connectors (A79026-001) and 16 individually movable tetrodes (FlexDrive, Voigts et al., 2013). Tetrodes were made from 12.7 μm tungsten wire (99.95%, HFV insulation, California Fine Wire, USA) gold-plated to reduce impedance to 200 kΩ at 1 kHz (NanoZ, Multichannel Systems). Neuronal signals were collected and amplified using two 32-channel amplifier boards (Intan RHD 2132 headstages) and an Open Ephys recording system (Siegle et al., 2017) at 30 kHz.

Data preprocessing

Request a detailed protocol

Raw electrophysiological traces as well as calcium traces were transformed to a frequency representation using discrete-wavelet transformation (DWT). We decided to use wavelet transformation instead of windowed Fourier transform (WFT) as we expected a wide range of dominant frequencies in our signal for which the wavelet transformation is more appropriate (Christopher and Gilbert, 1998). For the wavelet transformation, we used the morlet wavelet:

(1) ψ0(η)=π14exp(iω0η)exp(η2/2)

with a non-dimensional frequency constant w0=6. The full frequency space for tetrode recordings consisted of 26 log-space frequencies with fourier frequencies of: 2.59, 3.66, 5.18, 7.32, 10.36, 14.65, 20.72, 29.3, 41.44, 58.59, 82.88, 117.19, 165.75, 234.38, 331.5, 468.75, 663, 937.5, 1326, 1875, 2652, 3750, 5304, 7500, 15,000 Hz. For calcium imaging the fourier frequencies used are: 0.002, 0.003, 0.005 0.007, 0.01, 0.014, 0.02, 0.03, 0.04, 0.058, 0.08, 0.11, 0.16, 0.23, 0.33, 0.46, 0.66, 0.93, 1.32, 1.87, 2.65, 3.75, 5.3, 7.5, 10.6, 15 Hz. We noticed that downscaling the wavelets improved our model performance, prompting us to use an additional preprocessing step which effectively decreased the sampling rate of the wavelets to ψSRafter=ψSRbefore/𝑀 by a factor of M. This can also be seen as an additional convolutional layer with a kernel size of M, a stride of M and weights fixed to 1M. We performed a hyperparameter search for M with a simplified model and found the best performing model with M=1000, thus effectively decreasing our sampling rate from 30,000 to 30 (Figure 1—figure supplement 10).

As additional preprocessing steps, we applied channel and frequency wise normalization using a median absolute deviation (MAD) approach. We calculated the median and the corresponding median absolute deviation for each frequency and channel on the training set and normalized our inputs as follows:

(2) Xc,f=Xc,fX~c,fmedian(|Xc,fX~c,f|)

where Xi~ is the median of Xi. This approach turned out to be more robust against outliers in the signal than simple mean normalization. Additional min-max scaling did not further improve performance.

Bayesian decoder

Request a detailed protocol

As a baseline model, we used a Bayesian decoder which was trained on manually sorted and clustered spikes. Given a time window T and number of spikes K=(k1,kN) fired by N place cells, we can calculate the probability P(K|x), estimating the number of spikes K at location x:

(3) P(K|x)=Poisson(ki,Tαi(x))=i=1N(Tαi(x))kiki!

where αi(x) is the firing rate of cell i at position x. From this we can calculate the probability of the animals location given the observed spikes:

(4) P(x|K)=P(x)i=1Nαi(x)kiexp(Ti=1Nαi(x))

where P(x) is the historic position of the animal which we use to constrain P(x|K) to provide a fair comparison to the convolutional decoder. The final estimate of position is based on the peak of P(x|K):

(5) x^=argmaxxP(x|K)

We implemented a Bayesian continuity prior using a Gaussian distribution centred around the previous decoded location xt-1 of the animal, adjusting the standard deviation based on the speed of the animal in the previous four timesteps, as implemented in Zhang et al., 1998. Specifically we calculated the probability of the current location given the previous by:

(6) P(xt1|xt)=exp(||xtxt1||22σt2)

with the standard deviation adjusted based on the speed of the animal

(7) σt=vtV

with constant V making the fraction dimensionless. We used the previous four timesteps (2 s) to estimate the speed of the animal and used V=1s. We only used the Bayesian decoder with continuity prior for the comparison between Bayesian decoders (see Figure 1—figure supplement 4B). In all other analyses, we used the Bayesian decoder without the continuity prior.

We used the same cross-validation splits as for the convolutional model and calculated the Euclidean distance between the real and decoded position. We performed grid search on one representative rat to find the optimal parameters regarding bin size and bin length for the Bayesian decoder. The optimized Bayesian decoder uses a Gaussian smoothing kernel with sigma = 1.5, a bin size of 2 cm for binning the ratemaps, and uses a bin length of 0.5 s (see Figure 1—figure supplement 4A for Bayesian performance across different time windows).

Convolutional neural network

Request a detailed protocol

The model takes a three-dimensional wavelet transformed signal as input and uses convolutional layers connected to a regression head to decode continuous behavior. We use a kernel size of 3 throughout the model and keep the number of filters constant at 64 for the first eight layers while sharing the weights over the channel axis, while then doubling them for each following layer. For downsampling the input, we use a stride of 2, intermixed between the time and frequency dimension (Table S1, Figure 1B). As regularization we apply gaussian noise 𝒩(μ,σ2) with μ=0 and σ=1 to each input sample.

We extensively investigated the use of a convolutional long-short-term-memory (LSTM) after the initial convolutions, where we used backpropagation through time on the time dimension. In a simplified model, this led to a small decrease in decoding error for the model trained on position. We nevertheless decided to employ a model with only simple convolutions as one important aspect of this model is the simplicity of use for a neuroscientist. Moreover, we experimented with using a wavenet (Oord et al., 2016) inspired model directly on the raw electrophysiological signal but noticed that the model using the wavelet transformed input outperformed the wavenet approach by a margin of around 20 cm for the positional decoding. The wavenet inspired model was considerably slower to train and therefore a full hyperparameter search could not be performed.

Previous models contrasting recurrent vs. convolutional networks (Bai et al., 2018), find that convolutional layers outperform recurrent ones when trained directly on minimally processed data. The benchmarks typically used in classical sequence learning are one-dimensional, whereas we record two-dimensional raw input (time x channels) with a high sampling rate, complicating the amount of experimentation we could perform as the unprocessed data for a 2 s time window exceeds the capacity of GPU memory (30,000 x 128 time points per sample). In the related field of speech processing with sampling rates up to 48,000 Hz, the input is processed using log-mel feature banks which are computed with a 25 ms window and a 10 ms shift (Bahdanau et al., 2016; William et al., 2016; Rohit et al., 2017). We therefore opted for a similar approach by using downsampled wavelet transformed signals, resulting in a 33.3 ms window given a downsampling size of z=1000. Note that with further downsampling there might be a risk of losing decoding precision, with some of the behaviors coming close to the downsampled rate (e.g. head direction can be up to 40 deg/s) (Figure 1—figure supplement 10).

Model training

Request a detailed protocol

The model takes as input a three-dimensional wavelet transformed signal corresponding to time, frequency and channels, with frequencies logarithmically scaled between 0 Hz and 15.000Hz. An optimal temporal window of T=64 (corresponding to 2.13 s) was established by hyperparameter search taking into account the tradeoff between speed of training and model error. For training the model across the full duration of the experiment, we divided the experiment into five partitions and used cross-validation for testing the model on before unseen data partitions, that is we first used partitions 2 to 5 for training and one for testing, then 1, 3, 4, and 5 for training and two for testing and so on. The last partition uses 1 to 4 for training and five for testing. Importantly, the overlap introduced by using 2 s long samples was accounted for by using gaps (2 s) between the training partitions, making sure that training and test set are fully independent of each other. We then randomly sampled inputs and outputs from the training set. Each input had corresponding outputs for the position (X, Y in cm), head direction (in radians) and speed (in cm/s). We used Adam as our learning algorithm with a learning rate of 0.0007 and stopped training after we sampled 18,000 samples, divided into 150 batches for 15 epochs, each batch consisting of eight samples. During training, we multiplied the learning rate by 0.2 if validation performance did not improve for three epochs. We performed random hyperparameter search for the following parameters: learning rate, dropout, number of units in the fully connected layer and number of input timesteps. For calculating the chance level, we used a shuffling procedure in which the wavelet transformed electrophysiological signal is shifted relative to its corresponding position. After shuffling, we trained the model with the same setting as the unshuffled model and for the same number of epochs. The training was performed on one GTX1060 using Keras with Tensorflow as backend.

Model comparison

Request a detailed protocol

In order to compare the performance of the network against the Bayesian decoder we simulated both models in a setting with artificially reduced inputs. We used 1 to 32 tetrodes as input for both decoders, with tetrodes taken top to bottom in order of the given tetrode number. The input of run one was then comprised of tetrodes 1 to 32, while run 2 used tetrodes 1 to 31. The last run uses only the first tetrode as input to both models. We then retrained both models with the artificially reduced number of tetrodes making sure both models have the same cross-validation splits and report decoding errors as the average of each cross-validation split.

To generate a further baseline measure of performance when decoding using wavelet transformed coefficients, we trained support vector machines to decode position from wavelet transformed CA1 recordings. We used either a linear kernel or a non-linear radial-basis-function (RBF) kernel to train the model, using a regularization factor of C=100. For the non-linear RBF kernel we set gamma to the default 1num_features*var(X) as implemented in the sklearn framework. The SVM model was trained on the same wavelet coefficients as the convolutional neural network.

Model evaluation

Request a detailed protocol

For adjusting the model weights during training we use different loss terms depending on the behavior or stimuli which we decode.

(8) ED=i=1M(y^i-yi)2 MAE=1Mi=1M|y^i-yi| CMAE=min[|y^i-yi|,|y^i-yi|+π,|y^i-yi|-π]

For decoding of position from tetrode CA1 recordings we try to minimize the Euclidean loss between predicted and ground truth position (ED). We use the mean squared error for the decoding of speed (MAE) and the cyclical absolute error for decoding of head direction (CMAE). For all other behaviors or stimuli we use MAE as the default optimizer.

We decided to use R2 scores to measure model performance across different behaviors, brain areas, and recording techniques. We use the formulation of fraction of variance accounted for (FVAF) instead of the squared Pearson’s correlation coefficient. Both terms are based on the fraction of the residual sum of squares and the total sum of squares:

(9) R2=1-i=1M(yi-αy^i-β)2i=1M(yi-y¯)

with yi the ground truth of sample i, y^i the predicted value and y¯ the mean value. Here, Pearson’s correlation coefficient tries to maximize R2 by adjusting α and β while FVAF uses α=1 and β=0 (Fagg et al., 2009). This provides a more conservative measure of performance as FVAF requires that prediction and ground truth fit without scaling the predicted values. FVAF in turn has no lower bound as the prediction can be arbitrarily worse with a given scaling constant (i.e. given a ground truth value of 10, a prediction of 1000 has a lower (worse) R2 score than a prediction of 100).

Detection of replay events

Request a detailed protocol

To investigate if the model can detect replay events we reran the analysis using a lower downsampling factor. In addition, we used a new dataset in which animals performed a goal directed task with specific reward locations and in which we had previously detected sharp-wave ripples using standard methods. We applied the same preprocessing only adjusting the downsampling factor to 60, resulting in a sampling rate of 500 Hz instead of the previously used 30 Hz. We then trained the model using the default 64 samples (128 ms) in order to obtain a better estimate of decoded positions during ripples. All other model parameters were kept the same. We then trained the model using full cross-validation and decoded the animal position every 2 ms, resulting in over 1 million decoded positions for the experiment which lasted ∼35 min. Ripples in the raw neural data were extracted by finding regions in the band-pass filtered ripple band (150–250 Hz) that were above a certain threshold (five standard deviations above the mean). We expanded these until power was back at a lower threshold (0.5 standard deviations above the mean). We only kept ripples which were longer than 60 ms and occurred during stationary periods. To quantify putative replay events detected by the CNN we calculated three measures. First, we evaluated if the events showed a higher overall decoding error by calculating the average loss – Euclidean distance between decoded and true location. Second, we quantified the length of each event to detect if they are longer than the average decoded trajectory during immobility, matched to be the same length as ripple periods. Third, we generated a coherence measure to assess the smoothness or continuity of each trajectory, defined as the absolute distance between each point on the trajectory to a second degree polynomial fitted to the entire trajectory. To obtain a statistical score for all three, we permuted the events in time (keeping the absolute length of each) and recalculated each measure 1000 times. We only used periods where the speed distribution of the original replay events matched the shuffled replay events (stationary periods).

Influence maps

Request a detailed protocol

To investigate which frequencies, channels or timepoints were informative for the respective decoding we performed a bootstrapping procedure after training the models. For each sample in time, we calculated the real decoding error eo for each behavior by using the wavelets as input. We then shuffle the wavelets for a particular frequency and re-calculate the error. We then define the influence of a given frequency or channel as the relative change: es-eoeo where eo is the original error and es the shuffled error. We repeat this for the channel and time dimension to get an estimate of how much influence each channel or timepoint has on the decoding of a given behavior.

To evaluate if the influence measure accurately captures the true information content, we used simulated behaviors in which ground truth information was known. We used the preprocessed wavelet transformed data from one animal and created a simulated behavior (ysb) using uniform random noise. Two frequency bands were then modulated by the simulated behavior using fnew=fold*β*ysb. We used β=2 for 58 Hz and β=1 for 3750 Hz. We then retrained the model using five-fold cross validation and evaluated the influence measure as previously described. We report the proportion of frequency bands that fall into the correct frequencies (i.e. the frequencies we chose to be modulated, 58 Hz and 3750 Hz).

We also tried calculating sample gradients with respect to our inputs (Simonyan et al., 2013). For this, we calculated the derivative w by back-propagation for each sample and with respect to the inputs. In contrast to class saliency maps, we obtain a gradient estimate indicating how much each part of the input strongly drives the regression output. We calculate saliency maps for each sample cross-validated over the entire experiment. For deriving influence maps from the raw gradients we calculate the variance across the time dimension and use this as an estimate of how much influence each frequency band or channel has on the decoding. This method however introduces a lot of high-frequency noise in the gradients, possibly coming from the strides in the convolutional layers used throughout the model (Olah et al., 2017).

Code availability

Request a detailed protocol

We provide the full code and notebooks demonstrating the usage of our method on electrophysiological and two-photon calcium imaging data. All examples can be adapted to predict one-dimensional (default) or n-dimensional behaviors or stimuli from any desired brain area. The code is available at https://github.com/CYHSM/DeepInsight.

Data availability

Raw neural recordings from CA1 in rodents using tetrode or two-photon calcium imaging, as well as auditory cortex recordings in mice are available here: https://figshare.com/collections/DeepInsight_-_Data_sharing/5486703. The ECoG dataset is part of a BCI competition (http://www.bbci.de/competition/iv/) and can be obtained from https://purl.stanford.edu/zk881ps0522 (fingerflex.zip).

The following data sets were generated
    1. Perrodin C
    2. Bendor D
    (2021) figshare
    Tetrode recordings from auditory cortex.
    https://doi.org/10.6084/m9.figshare.14909730.v2

References

    1. Christopher T
    2. Gilbert PC
    (1998)
    A practical guide to wavelet analysis
    Bulletin of the American Meteorological Society 79:61–78.
  1. Conference
    1. Krizhevsky A
    2. Sutskever I
    3. Hinton GE
    (2012)
    ImageNet classification with deep convolutional neural networks Event-Place
    Proceedings of the 25th International Conference on Neural Information Processing Systems. pp. 1097–1105.
  2. Book
    1. O’keefe J
    2. Nadel L
    (1978)
    The Hippocampus as a Cognitive Map
    Oxford: Clarendon Press.
  3. Conference
    1. Rohit P
    2. Rao K
    3. Sainath TN
    4. Li B
    5. Johnson L
    6. Jaitly N
    (2017)
    A comparison of Sequence-to-Sequence Models for Speech Recognition
    Interspeech 2017. pp. 939–943.

Decision letter

  1. Sachin Deshmukh
    Reviewing Editor; Indian Institute of Science Bangalore, India
  2. Joshua I Gold
    Senior Editor; University of Pennsylvania, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

Frey et al. describe a convolutional neural network capable of extracting behavioral correlates from wide-band LFP recordings or even lower-frequency imaging data. The analysis program described by the authors provides a rapid "first pass" analysis using raw, unprocessed data to generate hypotheses that can be tested later with conventional in-depth analyses. This approach is of real value to the community, particularly as it becomes more commonplace for labs to acquire multi-site in vivo recordings.

Decision letter after peer review:

Thank you for submitting your article "Interpreting wide-band neural activity using convolutional neural networks" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Joshua Gold as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

The reviewers agree that the tools and resources described in the manuscript are a substantial contribution to the field, but have raised a number of concerns which need to be addressed. The requested changes do not require any new data, but do require new analyses to be performed. Complete reviewer recommendations are appended at the end of this message, and a summary of essential revisions follows.

Essential revisions:

1) The CNN described in the manuscript needs to be better characterized in terms of its history dependence, comparison to Bayesian decoding, and ability to decode non-local representations. Head direction decoding using CNN needs to be compared with that using a Bayesian decoder.

2) The ability of the CNN to discover underlying truth needs to be characterized using simulations.

3) Contributions of low frequency activity need to be better distinguished from the low frequency components of excitatory spikes.

4) The amount of data required for accurately recovering the frequency bands contributing to decoding needs to be characterized using simulations as well as sub-sampled data. Relative contributions of using the Wavelet coefficients as inputs and the CNN to improved accuracy also need to be characterized.

Reviewer #1 (Recommendations for the authors):

In the current manuscript, Frey et al. describe a convolutional neural network capable of extracting behavioral correlates from wide-band LFP recordings or even lower-frequency imaging data. Other publications (referenced by the authors) have employed similar ideas previously, but to my knowledge, the current implementation is novel. In my opinion, the real value of this method, as the authors state in their final paragraph, is that it represents a rapid, "first-pass" analysis of large-scale electrophysiological recordings to quickly identify relevant neural features which can then become the focus of more in-depth analyses. As such, I think the analysis program described by the authors is of real value to the community, particularly as it becomes more commonplace for labs to acquire multi-site in vivo recordings.

However, to maximize its utility to the community, I have several questions/concerns that I believe need to be addressed.

(1) It is obviously important to quantify the relative accuracy of the authors' method to existing methods which correlate neural activity to behavior or sensory input. The authors attempt to do this by comparing CNN decoding to Bayesian decoding with clustered cells (Figure 1). However, I think there are several points where that comparison may be flawed.

(1a) First, while some manuscripts (including Zhang et al., 1998, referenced by the authors) do use a continuity prior in their decoding algorithms, most do not (see the vast array of papers from Matt Wilson, David Redish, Loren Frank, David Foster, and others in the field). Indeed, even Olafsdottir et al., 2015 that the authors reference in apparent support of the use of a continuity prior (Line 111 of the manuscript) explicitly state that they do not use a continuity prior in their methods. It is unclear to me in reading through the methods whether the CNN-based decoding also utilizes a continuity prior to restrict the decoded location.

To truly be a useful tool to the community, the algorithm should be capable of correlating neural activity to behaviors/sensory input in a manner that is history-independent, as it seems that a fundamental advantage of this system is for unbiased probing of such correlations. If such history-independent decoding is not possible with the CNN, this should be explicitly stated to clarify the parameters for which this method is appropriate.

Thus, I would like clarification of whether the CNN uses the animal's history to constrain its decoding output. If so, is the use of the animal's history a necessary component of the CNN-based decoding? If the CNN can be used in a history-independent manner, I would like to see it compared to history-independent Bayesian decoding.

(1b) The authors report that part of the advantage of the CNN over Bayesian decoding was that the CNN made fewer large errors, and that the median error was more similar across the two methods (Line 120). When performing the Bayesian decoding, it appears as though spikes throughout the entire experiment were used. However, it is known that during periods of immobility, population bursts during sharp-wave/ripples can encode virtual trajectories across the environment, producing non-local spatial representations. If such remote trajectories were included in the Bayesian decoding, this may account for large "errors" between the decoded location and the animal's actual location (even though this may not be an error at all!). Thus, I would like to see this comparison repeated using only periods of active movement, when the literature suggests that place cells are more likely to encode local spatial information.

In addition, it appears that the authors use a 500 ms window to quantify Bayesian decoding accuracy, but, from what I can tell, they use a ~33 ms window (within 2 second 'chunks') to quantify CNN decoding accuracy. This doesn't seem like a fair apples-to-apples comparison as the animal can move quite a bit over 500 ms. I would thus like to see a comparison between these two methods using similar timescales that are appropriate for both methods, perhaps a ~100-200 ms window.

(1c) Related to the previous point, a fundamental advantage of using single unit activity to examine behavioral information is that it allows the experimenter to identify when the neural population is representing information other than the immediate sensory input or behavioral output. For example, in the place cell field, one can correlate activity of single units to position during active behavior, and then study when virtual paths are encoded during hippocampal sharp-wave/ripples (a.k.a. replay) or theta sequences.

Thus, a basic question is whether such non-local representation is accessible in the authors' CNN. Can the authors train the network on behavioral data (using only periods of active movement) and then faithfully decode virtual trajectories during immobility-based ripples? Alternatively, can the authors identify the shorter theta sequences observed during active movement if the CNN operates on a finer timescale? If the position decoding is largely based on high-frequencies in the LFP representing action potentials, it seems that such finer-scale, non-local representations should also be available. Importantly, if such non-local or fine-time-scale representations cannot be identified using this method, it is important to clarify this to avoid incorrect future use.

(2) Although the majority of self-location information was present in high-frequency bands, the data in Figure 1E show that LFP frequencies below 250 Hz were also informative above chance (and very near Bayesian decoding accuracy). However, given that Figure 3B shows excitatory spikes spread well into low 200 Hz frequency ranges, it seems that using LFP below 250 Hz is likely also including some spike information. For clarification of how accurately low-frequency bands can reflect position information, I would like to see the analysis in Figure 1E performed with LFP frequencies less than 150 Hz (fast gamma and slower).

If position information can still be faithfully extracted using <150 Hz frequencies, it is important to further rule out non-spatial correlates. For example, are there spatial locations where the rat is more likely to run at predictable velocities, allowing theta-band frequencies to effectively decode the animal's location? Is the LFP-based decoding more/less accurate at specific locations of the environment (near walls, near a rewarded location, etc.)? A heat-map of average decoding error per spatial bin (per animal) would be useful to visualize this analysis.

(3) When quantifying the accuracy of head direction decoding, they compare the CNN to chance levels. Although this is a valuable measure, given that head direction seems heavily driven by LFP frequencies associated with excitatory and inhibitory spiking, can the authors also compare head direction decoding between the CNN and Bayesian decoding from clustered spikes (including both exc. and inh. cells)? Is head direction information available in the clustered data or are there other elements in the high-frequency LFP that correlate to head direction? If head direction can be decoded via clustered spikes, why do the authors think this has not been observed in prior studies?

(4) In the Methods (line 356), I'm not sure what the authors mean by "16 eight tetrodes". Do they mean 16 tetrodes?

(5) The x-axis of Figure 4D and 4H should be labeled, especially since the scale seems to be log rather than linear.

Reviewer #3 (Recommendations for the authors):

– I think this method could be very useful for the EEG/ECoG communities, which care about frequency representations, and appealing to those communities would significantly expand the utility of your method for the broader neuroscience community. In my opinion, if there is no example of this type of use in the paper, it is much less likely for EEG/ECoG researchers to actually use your method in practice. I think that having an EEG or ECoG example would be much more beneficial than the calcium imaging example, since researchers are not trying to determine what frequency contents are important within a calcium imaging signal. This has a list of many open datasets for EEG: https://github.com/meagmohit/EEG-Datasets

– I think providing a general overview of your approach/method at the beginning of Results would be helpful for many readers.

– Please clarify when you are reporting test-set vs. training set predictions in your results.

– In the final paragraph of the introduction, you write "Our model differs markedly from conventional decoding methods which typically use Bayesian estimators…" This is overly specific to hippocampal decoding – in movement decoding Bayesian methods are not frequently used, although linear methods still commonly are.

https://doi.org/10.7554/eLife.66551.sa1

Author response

Reviewer #1 (Recommendations for the authors):

In the current manuscript, Frey et al. describe a convolutional neural network capable of extracting behavioral correlates from wide-band LFP recordings or even lower-frequency imaging data. Other publications (referenced by the authors) have employed similar ideas previously, but to my knowledge, the current implementation is novel. In my opinion, the real value of this method, as the authors state in their final paragraph, is that it represents a rapid, "first-pass" analysis of large-scale electrophysiological recordings to quickly identify relevant neural features which can then become the focus of more in-depth analyses. As such, I think the analysis program described by the authors is of real value to the community, particularly as it becomes more commonplace for labs to acquire multi-site in vivo recordings.

However, to maximize its utility to the community, I have several questions/concerns that I believe need to be addressed.

(1) It is obviously important to quantify the relative accuracy of the authors' method to existing methods which correlate neural activity to behavior or sensory input. The authors attempt to do this by comparing CNN decoding to Bayesian decoding with clustered cells (Figure 1). However, I think there are several points where that comparison may be flawed.

(1a) First, while some manuscripts (including Zhang et al., 1998, referenced by the authors) do use a continuity prior in their decoding algorithms, most do not (see the vast array of papers from Matt Wilson, David Redish, Loren Frank, David Foster, and others in the field). Indeed, even Olafsdottir et al., 2015 that the authors reference in apparent support of the use of a continuity prior (Line 111 of the manuscript) explicitly state that they do not use a continuity prior in their methods. It is unclear to me in reading through the methods whether the CNN-based decoding also utilizes a continuity prior to restrict the decoded location.

To truly be a useful tool to the community, the algorithm should be capable of correlating neural activity to behaviors/sensory input in a manner that is history-independent, as it seems that a fundamental advantage of this system is for unbiased probing of such correlations. If such history-independent decoding is not possible with the CNN, this should be explicitly stated to clarify the parameters for which this method is appropriate.

Thus, I would like clarification of whether the CNN uses the animal's history to constrain its decoding output. If so, is the use of the animal's history a necessary component of the CNN-based decoding? If the CNN can be used in a history-independent manner, I would like to see it compared to history-independent Bayesian decoding.

We thank the reviewer for raising this important point about the decoding details of the Bayesian decoder and the history dependence of our model. Indeed, as stated by the reviewer, most prior work does not use a continuity prior for Bayesian decoding. The reviewer is also correct in spotting that we made a mistake in referring to our Bayesian model as a decoder with a continuity prior when in fact we did not use such a prior. We apologise for this mistake which we have now corrected in the text.

In addition we now add a further analysis comparing the CNN with Bayesian decoders with and without a continuity prior. Specifically, we implemented the continuity prior by using a Gaussian distribution centred around the previous decoded location (t-1) of the animal, adjusting the standard deviation based on the speed of the animal in the previous timesteps (last 2 seconds). As shown in Figure 1 – Supplement 4B, the difference between the median values of the Bayesian decoder with and without continuity prior is small (median decoding error with continuity, 22.51cm; without continuity, 23.23cm), although the Bayesian decoder with continuity shows higher variance and makes more ‘catastrophic’ errors which is reflected by a higher mean decoding error (mean decoding error with continuity, 33.06cm; without continuity, 23.38cm). Regardless, in both cases, the CNN yields more accurate decoding (mean decoding error of 17.31cm, Figure 1).

With regards to the history dependence of our model, the CNN is feedforward and only has access to information from across the input window, which has a size of 2.13 seconds (64 timesteps). There is no explicit continuity prior as implemented in the Bayesian decoder, however during training, the network weights will learn an implicit prior based on the behaviour of the animal in the training data. This will likely incorporate information about the statistics of the animal’s motion (e.g. distance travelled between neighbouring timesteps).

We also ran an analysis across all time windows for both the Bayesian decoder as well as the convolutional neural network (see also Question 3). As seen in Figure 1 Supplement 4A, for very small time windows, which imply a history independent manner of decoding behaviour, both models show decreased performance in comparison to longer time windows.

Page 4:

“To provide a familiar benchmark, we applied a standard Bayesian decoder without a continuity prior (Olafsdottir et al. 2015) to the spiking data from the same datasets (see methods, see also Figure 1 – Supplement 4B for comparison to Bayesian decoder with continuity prior (Zhang et al. 1998)).”

Page 17:

“We implemented a Bayesian continuity prior using a Gaussian distribution centred around the previous decoded location xt-1 of the animal, adjusting the standard deviation based on the speed of the animal in the previous 4 timesteps, as implemented in (Zhang et al. 1998). […] We used the previous 4 timesteps (2 seconds) to estimate the speed of the animal and used V=1s.”

(1b) The authors report that part of the advantage of the CNN over Bayesian decoding was that the CNN made fewer large errors, and that the median error was more similar across the two methods (Line 120). When performing the Bayesian decoding, it appears as though spikes throughout the entire experiment were used. However, it is known that during periods of immobility, population bursts during sharp-wave/ripples can encode virtual trajectories across the environment, producing non-local spatial representations. If such remote trajectories were included in the Bayesian decoding, this may account for large "errors" between the decoded location and the animal's actual location (even though this may not be an error at all!). Thus, I would like to see this comparison repeated using only periods of active movement, when the literature suggests that place cells are more likely to encode local spatial information.

We thank the reviewer for the suggestion to compare decoding performance between stationary periods and periods of active movement.

For all analysis reported currently in the manuscript the Bayesian decoder is only applied to periods when animals are travelling >3cm/s. No speed threshold was applied to the convolutional neural network. We have now clarified this in the text. In addition, we now include a new Supplementary Figure which shows the comparison between both models across a range of speed thresholds. As expected, the performance of both models increases as the speed threshold is increased, presumably because – as the reviewer suggested – non-locomotor neural activity is excluded. The mean decoding performance of our model improves from 17.37cm ± 3.58cm with no speed threshold to 13.40cm ± 3.59cm with a speed threshold of 25cm/s, while the Bayesian decoder improves from 22.23cm ± 3.72 to 17.17cm ± 2.78.

Importantly the CNN is more accurate than the Bayesian decoder for all thresholds.

Page 4:

“Note that for these comparisons the Bayesian decoder was only applied to periods when the animal was travelling at >3cm/s, in contrast the CNN did not have a speed threshold (i.e. is trained on moving and stationary periods). […] As expected, limiting the CNN to only periods of movement actually improves its performance, accentuating the difference between it and the Bayes decoder (0cm/s speed threshold: Bayesian mean error 22.23cm ± 3.72cm; network error 17.37cm ± 3.58 cm; 25cm/s speed threshold: Bayesian mean error 17.17cm ± 2.788cm; network error 13.40cm ± 3.59 cm, Figure 1 Supplement 6).”

In addition, it appears that the authors use a 500 ms window to quantify Bayesian decoding accuracy, but, from what I can tell, they use a ~33 ms window (within 2 second 'chunks') to quantify CNN decoding accuracy. This doesn't seem like a fair apples-to-apples comparison as the animal can move quite a bit over 500 ms. I would thus like to see a comparison between these two methods using similar timescales that are appropriate for both methods, perhaps a ~100-200 ms window.

We thank the reviewer for this question regarding the timeframes of both the Bayesian decoder and our model. Indeed the Bayesian decoder uses a 500ms window to decode position in the environment, while our model uses a 2s window, from which 4 separate timesteps are decoded (every 500ms). However, we only analyse the last of the decoded sample – ensuring that the model is not able to decode based on future neural data. In effect, all reported performance scores for our model use a 2s window, in which the last timestep is decoded.

We now include an additional Supplementary Figure comparing the performance of both the Bayesian decoder and the convolutional neural network across different durations, ranging from 100ms up to 1.6 seconds. Both models yield more accurate decoding with longer time windows – which incorporate more data – Bayesian being most accurate with a 1s window, the CNN performing best with a 1.6s window (Figure 1 – Supplement 4A). Here again, CNN performance is better than the Bayes decoder for all time windows.

Page 17:

“The optimized Bayesian decoder uses a Gaussian smoothing kernel with σ = 1.5, a bin size of 2cm for binning the ratemaps, and uses a bin length of 0.5s (see Figure 1 Supplement 4A for Bayesian performance across different time windows).”

(1c) Related to the previous point, a fundamental advantage of using single unit activity to examine behavioral information is that it allows the experimenter to identify when the neural population is representing information other than the immediate sensory input or behavioral output. For example, in the place cell field, one can correlate activity of single units to position during active behavior, and then study when virtual paths are encoded during hippocampal sharp-wave/ripples (a.k.a. replay) or theta sequences.

Thus, a basic question is whether such non-local representation is accessible in the authors' CNN. Can the authors train the network on behavioral data (using only periods of active movement) and then faithfully decode virtual trajectories during immobility-based ripples? Alternatively, can the authors identify the shorter theta sequences observed during active movement if the CNN operates on a finer timescale? If the position decoding is largely based on high-frequencies in the LFP representing action potentials, it seems that such finer-scale, non-local representations should also be available. Importantly, if such non-local or fine-time-scale representations cannot be identified using this method, it is important to clarify this to avoid incorrect future use.

This is a really interesting suggestion – we thank the reviewer for making it.

In the original implementation of the convolutional model we downsample the electrophysiological signals after the wavelet transformation by a factor of 1000 to maximize the size (in seconds) of the input sample – the amount of data in a sample is constrained by GPU memory size. It’s highly likely this preprocessing would compress short replay events, making them hard to decode.

Thus to investigate if the model can detect replay events we reran the analysis using less downsampling (see Figure 1—figure supplement 7). In addition we used a new dataset in which rats were performing a navigation task and for which we had observed replay events before. We used the same preprocessing, only adjusting the downsampling factor to 60, resulting in a sampling rate of 500 Hz instead of the previously used 30 Hz. We then trained the model using 64 samples (128ms) in order to obtain a better estimate of decoded positions during replay events. We then decoded the animal position every 2 ms, resulting in over 1 million decoded positions for the experiment which lasted ~35 minutes.

To quantify if our model can detect replay we investigated model behaviour at time points where sharp-wave ripples (SWRs) had been detected using standard methods (i.e. finding LFP segments in which ripple band power (150-250Hz) was at least 5 standard deviations above the mean and expanding these regions until power dropped back to 0.5 standard deviations above the mean, only retaining segments that were longer than 60ms). Examining these SWRs we saw the CNN often decoded transient, high velocity trajectories that resembled those reported in studies of open field replay (see Figure 1 – Supplement 7A). Consistent with this interpretation the decoding error for these periods (i.e. Euclidean distance between the animal’s location and decoded location) was larger than for matched periods when the animal was stationary but in which SWRs had not been detected (n=1000, p<0.001, Figure 1 Supplement 7B). We furthermore found these putative replay trajectories were longer than the average decoded trajectories during stopped periods (n=1000, p=0.003, Figure 1 – Supplement 7C) and they were as coherent as trajectories detected during movement (n=1000, p=0.257, Figure 1 – Supplement 7D).

Page 6:

“The standard decoding model downsamples the wavelet frequencies to a rate of 30Hz, potentially discarding transient non-local representations (e.g. replay events and theta sequences). […] Thus it seems plausible that non-local representations are accessible to this CNN framework.”

Page 19-20:

“To investigate if the model can detect replay events we reran the analysis using a lower downsampling factor. […] We only used periods where the speed distribution of the original replay events matched the shuffled replay events (stationary periods).”

(2) Although the majority of self-location information was present in high-frequency bands, the data in Figure 1E show that LFP frequencies below 250 Hz were also informative above chance (and very near Bayesian decoding accuracy). However, given that Figure 3B shows excitatory spikes spread well into low 200 Hz frequency ranges, it seems that using LFP below 250 Hz is likely also including some spike information. For clarification of how accurately low-frequency bands can reflect position information, I would like to see the analysis in Figure 1E performed with LFP frequencies less than 150 Hz (fast gamma and slower).

If position information can still be faithfully extracted using <150 Hz frequencies, it is important to further rule out non-spatial correlates. For example, are there spatial locations where the rat is more likely to run at predictable velocities, allowing theta-band frequencies to effectively decode the animal's location? Is the LFP-based decoding more/less accurate at specific locations of the environment (near walls, near a rewarded location, etc.)? A heat-map of average decoding error per spatial bin (per animal) would be useful to visualize this analysis.

This was a good suggestion, it does indeed appear that the reviewer was correct – the majority of the spatial information is contained in the 125 to 250 Hz band.

We reran our LFP decoding analysis the same way as before but further segregated LFP frequencies into two bands – 0-150Hz (12 of 26 frequencies) and 150-250Hz (3 of 26 frequencies). As shown in Figure 1 – Supplement 2, spatial decoding performance was much lower when only frequencies up to 150 Hz were used, relative to the model evaluated on all frequencies up to 250 Hz. In contrast, the models trained on the 150-250Hz band and 0-250Hz band performed similarly and were sufficient to reach the same performance as the Bayesian decoder (shown in grey).

Indeed, as can be seen in Figure 3 – Supplement 1, models which were only trained on one frequency band at a time demonstrated accurate spatial decoding on frequencies from 165Hz upwards This together with the previous analysis support the reviewer’s hypothesis that excitatory spikes are picked up in frequency bands lower than the traditional 250Hz LFP cut-off. We have now revised our statements in the manuscript and point toward Figure 1 – Supplement 2 showing the difference in decoding performance for all low-frequency models.

Page 4:

“The high accuracy and efficiency of the model for these harder samples suggest that the CNN utilizes additional information from sub-threshold spikes and those that were not successfully clustered, as well as nonlinear information which is not available to the Bayesian decoder.”

Page 5: [Figure caption]

“When only local frequencies were used (<250Hz, CNN-LFP), network performance dropped to the level of the Bayesian decoder (distributions show the five-fold cross validated performance across each of five animals, n=25). Note that this likely reflects excitatory spikes being picked up at frequencies between 150 and 250 Hz (Figure 1 Supplement 2).”

Page 6:

“If we retrain the model on frequency bands 0-150 Hz and 150-250 Hz we observe that spatial information is predominantly contained in higher band frequencies (Figure 1 Supplement 2, see also Figure 3 – Supplement 1), likely reflecting power from pyramidal cell waveforms reaching these frequencies.”

(3) When quantifying the accuracy of head direction decoding, they compare the CNN to chance levels. Although this is a valuable measure, given that head direction seems heavily driven by LFP frequencies associated with excitatory and inhibitory spiking, can the authors also compare head direction decoding between the CNN and Bayesian decoding from clustered spikes (including both exc. and inh. cells)? Is head direction information available in the clustered data or are there other elements in the high-frequency LFP that correlate to head direction? If head direction can be decoded via clustered spikes, why do the authors think this has not been observed in prior studies?

We thank the reviewer for the suggestion to run our head direction decoding also with a Bayesian decoder. We now include an additional analysis in which we quantify head direction decoding using a Bayesian decoding framework, where we calculate errors based on a circular loss, similar to the way the CNN is decoding head direction. The Bayesian decoder's accuracy across all 5 rats is 55.73 ± 8.07 degrees. In comparison, our model achieves a decoding accuracy of 34.37 ± 6.87 degrees. Thus both methods are significantly better than chance (Bayesian model, Wilcoxon signed-rank test two-sided (n=25): T=0, p=1.22e-5, CNN Wilcoxon signed-rank test two-sided (n=25): T=0, p=1.22e-5), although the CNN provides a significant benefit beyond simple Bayesian decoding (Wilcoxon signed-rank test two-sided (n=25): T=47, p=0.0018). It is not entirely surprising that it is possible to decode head direction from CA1 neurons, it has been known for some time that place cells are weakly modulated by head direction (Muller et al. 1994) and, as we report in this manuscript, hippocampal interneurons are similarly modulated. However, in the case of place cells at least, the non-linear interaction between direction and position codes likely impacts the accuracy of the Bayes decoder but is less of a problem for the CNN.

Page 7:

“Note that, a Bayesian decoder trained to decode head direction achieves a performance of 0.97rad ± 0.14rad using spike sorted neural data, significantly worse than our model (Wilcoxon signed-rank test two-sided (n=25): T=47, p=0.0018) but more accurate than would be expected by chance (Wilcoxon signed-rank test two-sided (n=25): T=0, p=1.22e-5).”

(4) In the Methods (line 356), I'm not sure what the authors mean by "16 eight tetrodes". Do they mean 16 tetrodes?

We thank the reviewer for spotting this mistake which we have now revised. We indeed use 16 tetrodes per microdrive, resulting in 128 channels (16 tetrodes x 4 channels per tetrode x 2 microdrives).

(5) The x-axis of Figure 4D and 4H should be labeled, especially since the scale seems to be log rather than linear.

Indeed the frequencies are log scaled and we initially decided to only show some frequencies to not clutter the axis. We now changed the figure axis to show every frequency component. For completeness we now also report all frequencies in the methods section.

Page 16:

“The full frequency space for tetrode recordings consisted of 26 log-space frequencies with Fourier frequencies of: 2.59, 3.66, 5.18, 7.32, 10.36, 14.65, 20.72, 29.3, 41.44, 58.59, 82.88, 117.19, 165.75, 234.38, 331.5, 468.75, 663, 937.5, 1326, 1875, 2652, 3750,

5304, 7500, 15000 Hz. For calcium imaging the Fourier frequencies used are: 0.002,

0.003, 0.005 0.007, 0.01, 0.014, 0.02, 0.03, 0.04, 0.058, 0.08, 0.11, 0.16, 0.23, 0.33, 0.46, 0.66, 0.93, 1.32, 1.87, 2.65, 3.75, 5.3, 7.5, 10.6, 15 Hz.”

Reviewer #3 (Recommendations for the authors):

– I think this method could be very useful for the EEG/ECoG communities, which care about frequency representations, and appealing to those communities would significantly expand the utility of your method for the broader neuroscience community. In my opinion, if there is no example of this type of use in the paper, it is much less likely for EEG/ECoG researchers to actually use your method in practice. I think that having an EEG or ECoG example would be much more beneficial than the calcium imaging example, since researchers are not trying to determine what frequency contents are important within a calcium imaging signal. This has a list of many open datasets for EEG: https://github.com/meagmohit/EEG-Datasets

We thank the reviewer for the suggestion to evaluate our model on another dataset from a different neuroscience community to increase the potential impact of our framework.

We first want to point out that frequency contents might also be important for calcium imaging researchers. One or two-photon microscopes are now being developed for freely moving animals and the decrease in size, which is necessary to fit the animal, is most often accompanied by a decrease in sampling rate. Our decoding results and the underlying informative frequencies show that a sampling rate of around 1Hz is enough to capture most of the information contained in the signal. This information can give other researchers a lower bound for the sampling rate of the neural signal.

As suggested, we now evaluate our model on a publicly available ECoG dataset (Schalk et al. 2007), in which participants finger movements were recorded while simultaneously acquiring ECoG signals. As this dataset was part of a BCI competition (BBCI IV) we can directly compare our model results to the best models of the competition. We used the available training data from three subjects to train and validate our model and report performance on the provided test set. Each subject was instructed to move a given finger in response to a visual cue, which lasted around 2 seconds. We used the same model pipeline, adjusting parameters to fit the dataset. In particular, we used a downsampling factor of 50, as the original sampling rate is 1000Hz (in comparison to 30000 Hz in the CA1 recordings) which leads to an effective sampling rate of 20Hz. We trained the model with 128 timesteps (6.4s) and used a mean squared error loss function between original finger movement and decoded finger movement. As can be seen in Figure 4 – Supplement 1 we reach an average Pearson's r of 0.517 +- 0.160 across three subjects. The best result in the competition reached a performance of 0.46 (see competition results).

Page 12:

“To further assess the ability of our model to decode continuous behaviour from neural data, we investigated its performance on an Electrocorticography (ECoG) dataset recorded in humans (Schalk et al. 2017) made available as part of a BCI competition (BCI Competition IV, Dataset 4). […] These results together show that our model can be used on a wide variety of continuous regression problems in both rodents and humans and across a wide range of recording systems, including calcium imaging and electrocorticography datasets.”

– I think providing a general overview of your approach/method at the beginning of Results would be helpful for many readers.

We thank the reviewers for this suggestion. We have now added a paragraph explaining how the Results section is structured and provide a general overview of our approach in more detail.

Page 3:

“In the following section, we present our model, the results, and describe how it was applied across different datasets. […] The transformed data is then aligned with one or several decoding outputs, representing different behaviours or stimuli, which are fed through a convolutional neural network, decoding each output separately.”

– Please clarify when you are reporting test-set vs. training set predictions in your results.

We are sorry for the lack of clarity regarding the use of training vs. test-set. We now clarify in the manuscript that all of the reported predictions are fully cross-validated on the test set and at no point in the manuscript do we report training predictions except when explicitly stated (e.g. Figure S1).

Page 3:

“Using the wavelet coefficients as inputs, the model was trained in a supervised fashion using error backpropagation with the X and Y coordinates of the animal as regression targets. We report test-set performance for fully cross-validated models using 5 splits across the whole duration of the experiment.”

– In the final paragraph of the introduction, you write "Our model differs markedly from conventional decoding methods which typically use Bayesian estimators…" This is overly specific to hippocampal decoding – in movement decoding Bayesian methods are not frequently used, although linear methods still commonly are.

We thank the reviewer for the suggestion to open up our introduction to a wider audience. We now rewrote the manuscript in the following way.

Page 2:

“Our model differs markedly from conventional decoding methods which often use Bayesian estimators (Zhang et al. 1998) for hippocampal recordings in conjunction with highly processed neural data or linear methods for movement decoding from EEG or

ECoG signals (Antelis et al. 2013).”

https://doi.org/10.7554/eLife.66551.sa2

Article and author information

Author details

  1. Markus Frey

    1. Kavli Institute for Systems Neuroscience, Centre for Neural Computation, The Egil and Pauline Braathen and Fred Kavli Centre for Cortical Microcircuits, NTNU, Norwegian University of Science and Technology, Trondheim, Norway
    2. Max-Planck-Insitute for Human Cognitive and Brain Sciences, Leipzig, Germany
    Contribution
    Data curation, Software, Formal analysis, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    markus.frey@ntnu.no
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0291-3391
  2. Sander Tanni

    Cell & Developmental Biology, UCL, London, United Kingdom
    Contribution
    Resources, Data curation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9275-0735
  3. Catherine Perrodin

    Institute of Behavioural Neuroscience, UCL, London, United Kingdom
    Contribution
    Resources, Data curation
    Competing interests
    No competing interests declared
  4. Alice O'Leary

    Cell & Developmental Biology, UCL, London, United Kingdom
    Contribution
    Resources, Data curation
    Competing interests
    No competing interests declared
  5. Matthias Nau

    1. Kavli Institute for Systems Neuroscience, Centre for Neural Computation, The Egil and Pauline Braathen and Fred Kavli Centre for Cortical Microcircuits, NTNU, Norwegian University of Science and Technology, Trondheim, Norway
    2. Max-Planck-Insitute for Human Cognitive and Brain Sciences, Leipzig, Germany
    Contribution
    Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0956-7815
  6. Jack Kelly

    Open Climate Fix, London, United Kingdom
    Contribution
    Conceptualization
    Competing interests
    No competing interests declared
  7. Andrea Banino

    DeepMind, London, United Kingdom
    Contribution
    Supervision, Methodology
    Competing interests
    No competing interests declared
  8. Daniel Bendor

    Institute of Behavioural Neuroscience, UCL, London, United Kingdom
    Contribution
    Funding acquisition
    Competing interests
    No competing interests declared
  9. Julie Lefort

    Cell & Developmental Biology, UCL, London, United Kingdom
    Contribution
    Resources, Data curation
    Competing interests
    No competing interests declared
  10. Christian F Doeller

    1. Kavli Institute for Systems Neuroscience, Centre for Neural Computation, The Egil and Pauline Braathen and Fred Kavli Centre for Cortical Microcircuits, NTNU, Norwegian University of Science and Technology, Trondheim, Norway
    2. Max-Planck-Insitute for Human Cognitive and Brain Sciences, Leipzig, Germany
    3. Institute of Psychology, Leipzig University, Leipzig, Germany
    Contribution
    Conceptualization, Supervision, Funding acquisition
    Contributed equally with
    Caswell Barry
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-4120-4600
  11. Caswell Barry

    Cell & Developmental Biology, UCL, London, United Kingdom
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Contributed equally with
    Christian F Doeller
    For correspondence
    caswell.barry@ucl.ac.uk
    Competing interests
    No competing interests declared

Funding

H2020 European Research Council (ERC-CoG GEOCOG 724836)

  • Christian F Doeller

Wellcome Trust (212281/Z/18/Z)

  • Caswell Barry

Wellcome Trust (110238/Z/15/Z)

  • Catherine Perrodin

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics

Animal experimentation: All procedures were approved by the UK Home Office, subject to the restrictions and provisions contained in the Animals Scientific Procedures Act of 1986.

Senior Editor

  1. Joshua I Gold, University of Pennsylvania, United States

Reviewing Editor

  1. Sachin Deshmukh, Indian Institute of Science Bangalore, India

Publication history

  1. Preprint posted: December 11, 2019 (view preprint)
  2. Received: January 15, 2021
  3. Accepted: July 13, 2021
  4. Version of Record published: August 2, 2021 (version 1)

Copyright

© 2021, Frey et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 3,143
    Page views
  • 238
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Neuroscience
    Debora Fusca, Peter Kloppenburg
    Research Article

    Local interneurons (LNs) mediate complex interactions within the antennal lobe, the primary olfactory system of insects, and the functional analog of the vertebrate olfactory bulb. In the cockroach Periplaneta americana, as in other insects, several types of LNs with distinctive physiological and morphological properties can be defined. Here, we combined whole-cell patch-clamp recordings and Ca2+ imaging of individual LNs to analyze the role of spiking and nonspiking LNs in inter- and intraglomerular signaling during olfactory information processing. Spiking GABAergic LNs reacted to odorant stimulation with a uniform rise in [Ca2+]i in the ramifications of all innervated glomeruli. In contrast, in nonspiking LNs, glomerular Ca2+ signals were odorant specific and varied between glomeruli, resulting in distinct, glomerulus-specific tuning curves. The cell type-specific differences in Ca2+ dynamics support the idea that spiking LNs play a primary role in interglomerular signaling, while they assign nonspiking LNs an essential role in intraglomerular signaling.

    1. Neuroscience
    Wanhui Sheng et al.
    Research Article Updated

    Hypothalamic oxytocinergic magnocellular neurons have a fascinating ability to release peptide from both their axon terminals and from their dendrites. Existing data indicates that the relationship between somatic activity and dendritic release is not constant, but the mechanisms through which this relationship can be modulated are not completely understood. Here, we use a combination of electrical and optical recording techniques to quantify activity-induced calcium influx in proximal vs. distal dendrites of oxytocinergic magnocellular neurons located in the paraventricular nucleus of the hypothalamus (OT-MCNs). Results reveal that the dendrites of OT-MCNs are weak conductors of somatic voltage changes; however, activity-induced dendritic calcium influx can be robustly regulated by both osmosensitive and non-osmosensitive ion channels located along the dendritic membrane. Overall, this study reveals that dendritic conductivity is a dynamic and endogenously regulated feature of OT-MCNs that is likely to have substantial functional impact on central oxytocin release.