1 Introduction

One fundamental aim of neuroscience is to eventually gain insight into the ongoing perceptual experience of humans and animals. Reconstructing visual perception directly from brain activity has the potential to deepen our understanding of how the brain represents visual information. Over the past decade, there have been considerable advances in reconstructing images and videos from human brain activity [Nishimoto et al., 2011, Shen et al., 2019a,b, Rakhimberdina et al., 2021, Ren et al., 2021, Takagi and Nishimoto, 2023, Ozcelik and VanRullen, 2023, Ho et al., 2023, Scotti et al., 2023, Chen et al., 2023, Benchetrit et al., 2023, Kupershmidt et al., 2022]. These advances have largely leveraged deep learning techniques to interpret fMRI or MEG recordings, taking advantage of the fact that spatially separated clusters of neurons have distinct visual and semantic response properties [Rakhimberdina et al., 2021]. Because fMRI and MEG have low resolution relative to single neurons, the most successful models rely heavily on extracting semantic content and use diffusion models to generate semantically similar images and videos. Some approaches combine low-level perceptual (retinotopic) and semantic information in separate modules to achieve even better image similarity [Ren et al., 2021, Ozcelik and VanRullen, 2023, Scotti et al., 2023]. However, pixel-level similarity remains relatively low. These methods are powerful in humans, but their focus on semantic content may limit their usefulness for non-human subjects or when the reconstructed images are used to investigate visual processing.

Less attention has been given to image reconstruction from non-human brains. This is surprising given the advantages of large-scale single-cell-resolution recording techniques available in animal models, particularly mice. In the past, reconstructions using linear summation of receptive fields or Gabor filters have shown some success using responses from retinal ganglion cells [Brackbill et al., 2020], thalamo-cortical neurons in the lateral geniculate nucleus [Stanley et al., 1999], and primary visual cortex [Garasto et al., 2019, Yoshida and Ohki, 2020]. Recently, deep nonlinear neural networks have been used with promising results to reconstruct static images from the retina [Zhang et al., 2020, Li et al., 2023] and in particular from monkey V4 extracellular recordings [Li et al., 2023, Pierzchlewicz et al., 2023].

Here, we present a method for the reconstruction of 10-second movie clips using two-photon calcium imaging data recorded in mouse V1 [Turishcheva et al., 2023, 2024]. Our method takes advantage of a state-of-the-art (SOTA) dynamic neural encoding model (DNEM) [Baikulov, 2023] which predicts neuronal activity based on video input as well as behavior (Figure 1A). Our method allows us to successfully reconstruct videos despite the fact that V1 neuronal activity in awake mice is heavily modulated by behavioral factors such as running speed [Niell and Stryker, 2010] and pupil diameter (correlated with arousal; [Reimer et al., 2014]). We then quantify the spatio-temporal limits of this reconstruction approach and identify key aspects of our method necessary for optimal performance.

Figure 1: Video reconstruction from neuronal activity in mouse V1 (data provided by the Sensorium 2023 competition; [Turishcheva et al., 2023, 2024]) using a state-of-the-art (SOTA) dynamic neural encoding model (DNEM; [Baikulov, 2023]). A) SOTA DNEMs predict neuronal activity from mouse primary visual cortex, given a video and behavioral input. B) We use a SOTA DNEM to reconstruct part of the input video given neuronal population activity, using gradient descent to optimize the input. C) Poisson negative log-likelihood loss across training steps between ground truth neuronal activity and predicted neuronal activity in response to videos. Left: all 50 videos from 5 mice for one model. Right: average loss across all videos for 7 model instances. D) Spatio-temporal (pixel-by-pixel) correlation between reconstructed video and ground truth video.

2 Video reconstruction using state-of-the-art dynamic neural encoding models

We used publicly available data provided by the Sensorium 2023 competition [Turishcheva et al., 2023, 2024]. The data included movies that were presented to mice and the evoked activity of V1 neurons, along with pupil position, pupil diameter, and running speed. The neuronal activity was measured using two-photon imaging of GCaMP6s [Chen et al., 2013] fluorescence from 10 mice, with ≈ 8000 neurons from each mouse. We reconstructed ten 10 s natural movies from each of 5 mice (50 videos in total).

We used the winning model of the Sensorium 2023 competition, which achieved a score of 0.301 (single-trial correlation between predicted and ground truth neuronal activity; [Baikulov, 2023, Turishcheva et al., 2024]; Figure 1A and Figure S1A-C). This state-of-the-art (SOTA) dynamic neural encoding model (DNEM) is composed of three parts: core, cortex, and readout. The model takes the video as input, with the behavioral data (pupil position, pupil diameter, and running speed) broadcast to four additional channels of the video. The original model weights were not used, to avoid reconstructing movies the model was trained on. Instead, we retrained 7 instances of the model using the same training data, which did not include the movies reserved for reconstruction. From this point onward the weights of the model were frozen, i.e. not influenced by subsequent movie presentations.

To reconstruct the videos presented to mice, we iteratively optimized an initially blank input video to the SOTA DNEM until the predicted activity in response to this input matched the ground truth recorded neuronal activity. To achieve this, we optimized the input through gradient descent, inspired by the optimization of maximally exciting images [Walker et al., 2019] and the reconstruction of static images from monkey V4 extracellular recordings [Pierzchlewicz et al., 2023]. The input videos were initialized as uniform grey values and the behavioral parameters (Figure S1A) were added as additional channels, i.e. these were not reconstructed but given. The neuronal activity in response to the input video was predicted using the SOTA DNEM for a sliding window of 32 frames (1.067 s) with a stride of 8 frames. We saw slightly better results with a stride of 2 frames but, in our case, this did not warrant the increase in training time. For each window, the difference between the predicted and ground truth responses was calculated and this loss was backpropagated to the pixels of the input video to obtain the gradient of the loss with respect to each pixel. In effect, the input pixels were treated as if they were model weights. The gradients for each pixel were then averaged across all windows and the pixels of the input video were updated accordingly (see Supplementary Algorithm 1).
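
As a concrete illustration, the following is a minimal PyTorch sketch of one epoch of this sliding-window procedure. It assumes a `model` that maps a video-plus-behavior tensor to predicted activity; the shapes, names, and exact interface are illustrative and are not the Sensorium code.

```python
import torch
import torch.nn.functional as F

def epoch_gradient(model, video, behavior, target, win=32, stride=8):
    """Accumulate pixel gradients over sliding windows (one reconstruction epoch).

    video:    (1, 1, T, H, W) video being optimized, requires_grad=True
    behavior: (1, 4, T, H, W) behavioral channels (pupil x/y, pupil diameter, running speed)
    target:   (N, T) recorded (ground truth) neuronal activity
    """
    grad_sum = torch.zeros_like(video)
    starts = range(0, video.shape[2] - win + 1, stride)
    for t0 in starts:
        x = torch.cat([video[:, :, t0:t0 + win], behavior[:, :, t0:t0 + win]], dim=1)
        pred = model(x).squeeze(0)                        # (N, win) predicted activity
        loss = F.poisson_nll_loss(pred, target[:, t0:t0 + win], log_input=False)
        grad, = torch.autograd.grad(loss, video)          # backpropagate loss to the pixels
        grad_sum += grad
    return grad_sum / len(starts)                         # average gradients across windows
```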

The data from the Sensorium competition provided the activity of neurons within a 630 by 630 µm field of view for each mouse, i.e. covering roughly one-fifth of mouse V1. Due to the retinotopic organization of V1, we did not expect to get good reconstructions of the entire video frame. However, gradients still propagated to the full video frame and produced nonsensical results along the periphery of the video frames. Inspired by previous work [Mordvintsev et al., 2018, Willeke et al., 2023], we therefore applied a mask during training and evaluation. To generate these masks, we optimized a transparency layer placed at the input to the SOTA DNEM. This mask assigns high values to pixels that contribute to the accurate prediction of neuronal activity and thus represents the collective receptive field of the recorded neural population. The mask was applied during the optimization of the reconstructed movies (training mask: binarized with threshold α = 0.5) and applied again to the final reconstruction (evaluation mask: binarized with threshold α = 1) (see Supplementary Algorithm 2).

As the loss between predicted (Figure S1D) and ground truth responses (Figure S1B) decreased, the similarity between the reconstructed and ground truth input video increased (Figure 1C-D). We generated 7 separate reconstructions from 7 neural encoding models (trained on the same data) and averaged them. Finally, we applied a 3D Gaussian filter with sigma 0.5 pixels to remove the remaining static noise and applied the evaluation mask. The Gaussian filter was not applied when evaluating spatial or temporal resolution (Figure 4 and Figure S2).
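
A minimal sketch of this post-processing step (ensemble averaging, 3D Gaussian smoothing, and evaluation masking), assuming NumPy/SciPy arrays; function and variable names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def finalize_reconstruction(recons, eval_mask, sigma=0.5):
    """Average reconstructions from several model instances, denoise, and mask.

    recons:    list of (T, H, W) reconstructions, one per model instance (7 here)
    eval_mask: (H, W) binary evaluation mask (collective receptive field)
    """
    avg = np.mean(np.stack(recons, axis=0), axis=0)   # ensemble average across models
    avg = gaussian_filter(avg, sigma=sigma)           # 3D Gaussian filter over (t, y, x)
    return avg * eval_mask[None, :, :]                # zero out pixels outside the mask
```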

2.1 High-quality video reconstruction

As can be seen in Figure 2 and Supplementary Video 1, the reconstructed videos capture much of the spatial and temporal dynamics of the original input video. To evaluate the performance of the video reconstructions, we either correlated all pixels from all time points between ground truth and reconstructed videos (Pearson’s correlation r = 0.563; quantifying temporal and spatial similarity), or took the average correlation between corresponding frames (Pearson’s correlation r = 0.500; quantifying just spatial similarity) (Figure 2B and Figure S1E). Importantly, this represents a ≈ 2x improvement over previous static image reconstructions from V1 in awake mice (image correlation 0.23 ± 0.02 s.e.m. for awake mice) [Yoshida and Ohki, 2020] over a similar retinotopic area (≈ 60° diameter), while also capturing temporal dynamics (Table 1).

Table 1: Benchmarking against previous natural image reconstructions from mouse visual cortex. Correlation r values are given as mean and std across mice (except for [Garasto et al., 2019], where they are given as mean and std across reconstructed images).

Figure 2: Reconstruction performance. A) Three reconstructions of 10 s videos from different mice (see Supplementary Video 1 for the full set: YouTube link). Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth. B) The reconstructed videos have high correlation with ground truth in both spatio-temporal correlation (mean Pearson’s correlation r = 0.563 with 95% CIs 0.534 to 0.593, t-test between ground truth and random video p = 8.66 × 10⁻⁴⁵, n = 50 videos from 5 mice) and mean frame correlation (mean Pearson’s correlation r = 0.500 with 95% CIs 0.468 to 0.532, t-test between ground truth and random video p = 3.27 × 10⁻³⁸, n = 50 videos from 5 mice).

Reconstruction quality, however, was not consistent across movies (Figure 2B) or constant throughout the 10-second videos (Figure S1E). We therefore investigated which factors may cause these fluctuations by correlating video motion energy, contrast, and luminance, as well as running speed, pupil diameter, and eye movement with frame correlation. We found that only video motion energy and contrast correlated with frame correlation, and only to a moderate degree (Supplementary Figure S3).

2.2 Ensembling

We found that the 7 instances of the SOTA DNEM by themselves performed similarly in terms of reconstructed video correlation (Figure 1D), but that this correlation was significantly increased by averaging the reconstructions from different models (Figure 3), a technique known as bagging, or more generally ensembling [Breiman, 1996]. Individual models produced reconstructions with high-frequency noise in the temporal and spatial domains. We therefore think the increase in performance from ensembling is mostly an effect of averaging out this high-frequency noise.

Figure 3: Model ensembling. Mean video correlation is improved when predictions from multiple models are averaged. Dashed lines are individual animals, solid line is the mean. One-way repeated measures ANOVA p = 1.11 × 10⁻¹⁶. Bonferroni-corrected paired t-test outcomes between consecutive ensemble sizes are all p < 0.001, n = 5 mice.

In this paper we averaged over 7 model instances, which gave a performance increase of 28.3%, but the largest gain in performance, 13.8%, came from averaging across just 2 models (Figure 3). Doubling the number of models to 4 increased the performance by another 8.12%. Overall, although ensembling over multiple model instances is computationally expensive, it substantially improved reconstruction quality.
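
The ensemble-size analysis in Figure 3 can be reproduced with a short loop. Below is a hedged sketch where `video_corr` stands for the spatio-temporal correlation defined in Methods 4.6 and `recons` holds the reconstructions from the individual model instances; the names are illustrative.

```python
import numpy as np

def ensemble_size_curve(recons, ground_truth, video_corr):
    """Video correlation as a function of the number of averaged model instances."""
    return [video_corr(np.mean(recons[:k], axis=0), ground_truth)
            for k in range(1, len(recons) + 1)]
```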

2.3 Not all spatial and temporal frequencies are reconstructed equally

While the reconstructed videos achieve high correlation with ground truth, it is not entirely clear whether the remaining deviations are due to the limitations of the model or arise from the recorded neurons themselves. To assess the resolution limits of our reconstruction process, we evaluated the model’s ability to reconstruct synthetic stimuli at varying spatial and temporal resolutions in a noise-free scenario.

To quantify which spatial and temporal frequencies our reconstruction approach is able to capture, we used a Gaussian noise stimulus set generated using a Gaussian process (https://github.com/TomGeorge1234/gp_video; Figure 4A). The dataset consisted of 49 two-second, 36 by 36 pixel videos at 30 Hz, which varied in their spatial and temporal length constants. As we did not have ground truth neuronal activity in response to this stimulus set, we first predicted the neuronal responses to these videos using the ensembled SOTA DNEMs. We then used gradient descent to reconstruct the original input using these predicted neuronal responses as the target. In this way, we generated reconstructions in an ideal case with no biological noise and assuming the SOTA DNEM perfectly predicts neuronal activity (Figure 4B). Any loss in video reconstruction quality therefore reflects the limitations of the reconstruction process itself, without the additional loss or transformation of information by processes such as top-down modulation, e.g. predictive coding or selective feature attention (see Discussion). We found that the reconstruction process failed at high spatial frequencies (length constants < 1 pixel, or < 3.4° retinotopy) and performed worse at high temporal frequencies (length constants < 1 frame, or < 1/30 s) (Figure 4C and Supplementary Video 2). We repeated this analysis using full-field high-contrast square gratings drifting in the four cardinal directions and similarly found that high spatial and temporal frequency gratings were not reconstructed as well as low spatial and temporal frequency gratings (Figure S2).
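
A sketch of this noise-free procedure: predicted responses to a synthetic stimulus are generated once and then used as the reconstruction target in place of recorded activity. Function names follow the earlier sketches and are illustrative, not the exact interface of the model code.

```python
import torch

@torch.no_grad()
def predicted_target(model, stimulus, behavior):
    """Predict responses to a synthetic stimulus; these replace recorded activity
    as the reconstruction target (no biological noise, encoding assumed perfect)."""
    x = torch.cat([stimulus, behavior], dim=1)    # (1, 1+4, T, H, W) video + behavior
    return model(x).squeeze(0)                    # (N, T) predicted activity

# target = predicted_target(dnem, gaussian_noise_video, behavior)
# ...then run the same gradient-descent reconstruction against this target.
```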

To test whether model ensembling improves Gaussian noise reconstruction quality uniformly across all spatial and temporal length constants, we subtracted the average video correlation across the seven model instances from the video correlation of the averaged video (i.e. ensembled video reconstruction minus unensembled video reconstruction; Figure 4D). We found that stimuli with short temporal and spatial length constants in particular improved in correlation, supporting our hypothesis that ensembling mitigates the high-frequency noise we observed in the reconstructions from individual models.

2.4 Neuronal population size

In order to design future in vivo experiments to investigate visual processing using our video reconstruction approach, it would be useful to know how reconstruction performance scales with the number of recorded neurons. This is vital for prioritizing experimental parameters, such as the trade-off between sampling density within a given retinotopic area and total retinotopic coverage, to maximize both video reconstruction quality and visual coverage. We therefore performed an in silico ablation experiment, dropping either 50%, 75%, or 87.5% of the total recorded population of ≈ 8000 neurons per mouse by setting their activity to 0 (Figure 5). We found that dropping 50% of the neurons reduced the video correlation by only 9.8%, while dropping 75% reduced performance by 23.8%. We would therefore argue that ≈ 4000-8000 neurons within a 630 by 630 µm area (≈ 10000-20000 neurons/mm²) of mouse V1 would provide a good balance between sampling density and 2D coverage.
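
A sketch of the in silico ablation, assuming that a random subset of neurons is silenced by zeroing their activity before reconstruction (the exact selection scheme is a simplifying assumption here).

```python
import numpy as np

def ablate_population(activity, drop_frac, seed=0):
    """Set the activity of a random fraction of neurons to zero.

    activity:  (N, T) recorded responses used as the reconstruction target
    drop_frac: fraction to drop, e.g. 0.5, 0.75, or 0.875 (Figure 5)
    """
    rng = np.random.default_rng(seed)
    n = activity.shape[0]
    dropped = rng.choice(n, size=int(round(drop_frac * n)), replace=False)
    ablated = activity.copy()
    ablated[dropped] = 0.0
    return ablated
```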

Figure 4: Reconstruction of Gaussian noise across the spatial and temporal spectrum using predicted activity. A) Example Gaussian noise stimulus set with evaluation mask for one mouse. Shown is the last frame of a 2 second video. B) Reconstructed Gaussian stimuli with SOTA DNEM predicted neuronal activity as the target (see also Supplementary Video 2: YouTube link). C) Pearson’s correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal length constants. For each stimulus type, the average correlation across 5 movies reconstructed from the SOTA DNEMs of 3 mice is given. D) Ensembling effect for each stimulus. Video correlation for the ensembled prediction (averaged videos from 7 model instances) minus the mean video correlation across the 7 individual model instances.

Figure 5: Video reconstruction using fewer neurons (i.e. population ablation) leads to lower reconstruction quality. Dashed lines are individual animals, solid line is the mean. One-way repeated measures ANOVA p = 1.58 × 10⁻¹². Bonferroni-corrected paired t-test outcomes between consecutive drops in population size are all p < 0.001, n = 5 mice.

3 Discussion

3.1 Stimulus identification vs reconstruction

Stimulus identification, i.e. identifying the most likely stimulus from a constrained set, has been a popular approach for quantifying whether a population of neurons encodes the identity of a particular stimulus [Földiák, 1993, Kay et al., 2008]. This approach has, for instance, been used to decode frame identity within a movie [Deitch et al., 2021, Xia et al., 2021, Schneider et al., 2023, Chen et al., 2024]. Some of these approaches have also been used to reorder the frames of the ground truth movie based on the decoded frame identity [Schneider et al., 2023]. Importantly, stimulus identification methods are distinct from stimulus reconstruction, where the aim is to recreate the sensory content of a neuronal code in a way that generalizes to new sensory stimuli [Rakhimberdina et al., 2021]. This is inherently a more demanding task because the range of possible solutions is much larger. Although stimulus identification is a valuable tool for understanding the information content of a population code, stimulus reconstruction promises a more generalizable approach.

3.2 Comparison to other reconstruction methods

There has recently been a growing number of publications in the field of image reconstruction, primarily from fMRI data, and a comprehensive review of all the approaches is outside the scope of this paper. However, we will briefly summarize the most common approaches and how they relate to our own method. In general, image reconstruction methods can be categorized into one of four groups: direct decoding models, encoder-decoder models, invertible encoding models, and encoder model input optimization.

Direct decoders decode the input images/videos directly from neuronal activity with deep neural networks [Shen et al., 2019a, Zhang et al., 2020, Li et al., 2023]. When training direct decoders, the decoders can be pretrained [Ren et al., 2021], or additional constraints can be added to the loss function to encourage the decoder to produce images that adhere to learned image statistics [Shen et al., 2019a, Kupershmidt et al., 2022]. A direct decoder approach has been used for video reconstruction in mice [Chen et al., 2024], but in that case the training and test movies were the same, meaning it is unclear whether out-of-training-set generalization was achieved (a key distinction between sensory reconstruction and stimulus identification, see previous section).

In encoder-decoder models the aim is to combine separately trained brain encoders (brain activity to latent space) and decoders (latent space to image/video). Recently this approach has become particularly popular because it allows the use of SOTA generative image models such as Stable Diffusion [Rombach et al., 2021, Takagi and Nishimoto, 2023, Scotti et al., 2023, Chen et al., 2023, Benchetrit et al., 2023]. The encoder part of the model is first trained to translate brain activity into a latent space that the pretrained generative network can interpret. Because these latent spaces are often conditioned on semantic information, this lends itself to separate processing of low-level visual and high-level semantic information from brain activity [Scotti et al., 2023].

Invertible encoding models are encoding models which, once trained to predict neuronal activity, can implicitly be inverted to predict sensory input given brain activity. We would also include in this class models that first compute the receptive field or preferred stimulus of neurons (or voxels) and reconstruct the input as the sum of the receptive fields weighted by their activity [Stanley et al., 1999, Thirion et al., 2006, Garasto et al., 2019, Brackbill et al., 2020, Yoshida and Ohki, 2020, Nishimoto et al., 2011]. The downside of this approach is that these encoding models generally underperform in capturing the coding properties of neurons compared to more complex deep neural networks [Willeke et al., 2023].

Encoder input optimization also involves first training an encoder which predicts the activity of neurons or voxels given sensory input. Once trained, the encoder is fixed and the input to the network is optimized using backpropagation until the predicted activity matches the observed activity [Pierzchlewicz et al., 2023]. Unlike invertible encoding models, any SOTA neuronal encoding model can be used. But like invertible models, the networks are not specifically trained to reconstruct images, so they may be less prone to extrapolating information beyond what the brain encodes by relying on learned general image statistics.

Although outlined here as four distinct classes, these approaches can be combined. For instance, encoder input optimization can be combined with image diffusion [Pierzchlewicz et al., 2023], and in principle invertible models could also be combined in such a way.

We chose to pursue a pure encoder input optimization approach for single-cell mouse visual cortex activity for two reasons. First, there have been considerable advances in the performance of neuronal encoding models for dynamic visual stimuli [Sinz et al., 2018, Wang et al., 2023, Turishcheva et al., 2024], and we aimed to take advantage of these developments. Second, the addition of a generative decoder trained to produce high-quality images brings with it the risk of extrapolating information based on general image statistics rather than interpreting what the brain is representing. In some cases the brain may not be encoding coherent images, and in those cases we would argue that image reconstruction should fail, rather than produce an image when only the semantic information is present.

3.3 Key contributions and limitations

We demonstrate high-quality video reconstruction from mouse V1 using SOTA DNEMs to iteratively optimize the input video such that the resulting predicted activity matches the recorded neuronal activity. Key to achieving high-quality reconstructions are model ensembling and a sufficiently large number of recorded neurons within a given retinotopic area.

While we averaged the video reconstructions from several models, an alternative would be to average the gradients calculated by multiple models at each epoch, as has been done for the generation of maximally exciting images in the past [Walker et al., 2019]. However, this requires a large amount of GPU memory when using video models and is likely not practical on most hardware. Nevertheless, there might be situations in which averaging gradients yields better reconstructions. For instance, there may be multiple solutions for the activation pattern of a neural population, e.g. if their responses are translation/phase invariant [Ito et al., 1995, Tacchetti et al., 2018]. In such a case, averaging ‘misaligned’ reconstructions from multiple models might degrade overall quality.

The SOTA DNEM we used takes video data at an angular resolution of 3.4 °/pixel at the center of the screen, which is about 3x worse than the visual acuity of mice (≈ 0.5 cycles/° [Prusky and Douglas, 2004]). As our model can reconstruct Gaussian noise stimuli down to a spatial length constant of 1 pixel, and drifting gratings up to a spatial frequency of 0.071 cycles/°, there is still some potential for improving spatial resolution. To close this gap and achieve reconstructions at the limit of mouse visual acuity, a different dataset and model would likely need to be developed. However, the frame rate of the videos the SOTA DNEM takes as input (30 Hz) is faster than the flicker fusion frequency of mice (14 Hz [Nomura et al., 2019]), and our tests with Gaussian noise and drifting grating stimuli show that the temporal resolution of the reconstruction is close to this expected limit. Future efforts should therefore focus on the spatial rather than the temporal resolution of video reconstruction.

It is, however, unclear how closely the brain’s representation of vision is expected to match the actual input. A number of previously identified visual processing phenomena lead us to suspect that some deviations between video reconstructions and ground truth input are to be expected. One such phenomenon is predictive coding [Rao and Ballard, 1999, Fiser et al., 2016]. It is possible that the unexpected parts of visual stimuli are sharper and have higher contrast than the expected parts when reconstructed from neuronal activity. Another is perceptual learning, in which visual stimulus detection or discriminability is enhanced through prolonged training [Li, 2015] and which is associated with changes in the tuning distribution of neurons in the visual system [Goltstein et al., 2013, Poort et al., 2015, Jurjut et al., 2017, Schumacher et al., 2022]. Similarly, selective feature attention can modulate the response amplitude of neurons that have a preference for the features that are currently being attended to [Kanamori and Mrsic-Flogel, 2022]. Visual task engagement and training could therefore alter the accuracy and biases of which features of a video can be accurately reconstructed from the neuronal activity.

Such visual processing phenomena are likely difficult to investigate using existing fMRI-based reconstruction approaches, due to the low spatial and temporal resolution of the data. Additionally, many of these fMRI-based reconstruction approaches rely on the use of pretrained generative diffusion models to achieve more naturalistic and semantically interpretable images [Takagi and Nishimoto, 2023, Ozcelik and VanRullen, 2023, Scotti et al., 2023, Chen et al., 2023], but very likely at the cost of introducing information that may not be present in the actual neuronal representation. In contrast, our video reconstruction approach using single-cell resolution recordings, without a pretrained generative model, provides a more accurate method to investigate visual processing phenomena such as predictive coding, perceptual learning, and selective feature attention.

In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ≈ 2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods. This paves the way to using movie reconstruction as a tool to investigate a variety of visual processing phenomena.

4 Methods

4.1 Source data

The data was provided by the Sensorium 2023 competition [Turishcheva et al., 2023, 2024] and downloaded from https://gin.g-node.org/pollytur/Sensorium2023Data and https://gin.g-node.org/pollytur/sensorium_2023_dataset. The data included grayscale movies presented to the mice at 30 Hz on a 31.8 by 56.5 cm monitor positioned 15 cm from, and perpendicular to, the left eye. The movies were provided spatially downsampled from the original screen resolution to 36 by 64 pixels, corresponding to an angular resolution of 3.4 °/pixel at the center of the screen. The pupil position and diameter were recorded at 20 Hz and the running speed at 100 Hz. The neuronal activity was measured using two-photon imaging [Denk et al., 1990] of GCaMP6s [Chen et al., 2013] fluorescence at 8 Hz, extracted and deconvolved using the CAIMAN pipeline [Giovannucci et al., 2019]. For each of the 10 mice, the activity of ≈ 8000 neurons was provided. The different data types were resampled to 30 Hz.

4.2 State-of-the-art dynamic neural encoding model

We used the winning model of the Sensorium 2023 competition, DwiseNeuro [Turishcheva et al., 2023, 2024]. The code for the SOTA DNEM was downloaded from https://github.com/lRomul/sensorium. The full model consists of three main components: core, cortex, and readout. The core largely consists of factorized 3D convolution blocks with residual connections, positional encoding [Vaswani et al., 2017], and SiLU activations [Elfwing et al., 2017], followed by spatial average pooling. The cortex consists of three fully connected layers. The readout consists of a 1D convolution for each mouse with a final Softplus nonlinearity, which gives activity predictions for all neurons of each mouse. The kernel of the input layer had size 16 with a dilation of 2 in the time dimension, so it spanned 32 video frames.
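
For orientation, the snippet below sketches what a factorized 3D convolution block with SiLU activations and a residual connection can look like; it is illustrative only and is not the DwiseNeuro implementation (which additionally uses positional encoding, pooling, and other details).

```python
import torch.nn as nn

class Factorized3DBlock(nn.Module):
    """Illustrative factorized 3D convolution block (not the DwiseNeuro code):
    a spatial (1, k, k) convolution followed by a temporal (k, 1, 1) convolution,
    SiLU activations, and a residual connection."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, (1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(channels, channels, (k, 1, 1), padding=(k // 2, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x):                      # x: (batch, channels, T, H, W)
        y = self.act(self.spatial(x))
        y = self.act(self.temporal(y))
        return x + y                           # residual connection
```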

The original ensemble of models consisted of 7 model instances trained on a 7-fold cross-validation split of all available Sensorium 2023 competition data (≈ 1 hour of training data and ≈ 8 min of cross-validation data per fold from each mouse). Each model instance was trained on 6 of the 7 data folds, with different validation data excluded from training for each model. To allow ensembled reconstructions of videos without test set contamination, we instead retrained the models with a shared validation fold, i.e. we retrained the models leaving out the same validation data for all 7 model instances. The only other difference in the training procedure was that we retrained the models using a batch size of 24 instead of 32; this did not change the performance of neuronal response prediction on the withheld data folds (mean validation-fold predicted vs ground truth response correlation: 0.293 for the original weights and 0.291 for the retrained weights). We also did not use model distillation, while the original model did (see https://github.com/lRomul/sensorium).

4.3 Additional visual stimuli

The Gaussian noise stimuli were downloaded from https://github.com/TomGeorge1234/gp_video and spanned spatial length constants of 0 to 32 pixels and temporal length constants of 0 to 32 frames in the generating Gaussian process. The drifting grating stimuli were produced using PsychoPy [Peirce et al., 2019] and ranged from 0.5 to 0.062 cycles/degree and 0.5 to 0 cycles/second, with 2 seconds of movie for each cardinal direction. These ranges were chosen to avoid aliasing effects in the 36 by 64 pixel videos. The highest temporal frequency corresponds to a flicker stimulus.
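
As an illustration of the grating parameterization only, a NumPy sketch of a full-field square-wave drifting grating (not the original PsychoPy script) could look as follows, assuming the angular resolution of 3.4 °/pixel from Section 4.1.

```python
import numpy as np

def drifting_grating(sf_cpd, tf_hz, ori_deg, shape=(36, 64), deg_per_px=3.4,
                     n_frames=60, fps=30):
    """Full-field square-wave drifting grating (illustrative sketch).

    sf_cpd: spatial frequency (cycles/degree); tf_hz: temporal frequency (cycles/second);
    ori_deg: drift direction in degrees; returns frames scaled to [0, 255].
    """
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]].astype(float) * deg_per_px
    theta = np.deg2rad(ori_deg)
    spatial_phase = 2 * np.pi * sf_cpd * (xx * np.cos(theta) + yy * np.sin(theta))
    frames = [np.sign(np.sin(spatial_phase - 2 * np.pi * tf_hz * f / fps))
              for f in range(n_frames)]                    # square wave in [-1, 1]
    return (np.stack(frames) + 1) / 2 * 255
```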

4.4 Mask training

To generate the transparency masks we used an alpha blending approach inspired by previous work [Mordvintsev et al., 2018, Willeke et al., 2023]. A transparency layer was placed at the input to the SOTA DNEM. This transparency layer was used to alpha blend the true video V with another randomly selected background video BG from the data:

V_BG = α · V + (1 − α) · BG    (1)

where α is the 2D transparency mask, V_BG is the blended input video, and · denotes element-wise (per-pixel) multiplication. This mask was optimized using stochastic gradient descent (for 1000 epochs with learning rate 10) with a mean squared error (MSE) loss between the true responses y and the predicted responses ŷ, scaled by the average weight ᾱ of the transparency mask:

L = ᾱ · (1/n) Σᵢ (yᵢ − ŷᵢ)²    (3)

where n is the total number of neurons. The mask was initialized as uniform noise between 0 and 0.05. At each epoch, the neuronal activity in response to a randomly selected 32-frame video segment from the training set was predicted, and the gradients of the loss (Equation 3) with respect to the pixels in the transparency mask α were calculated for each video frame. The gradients were normalized by their matrix norm, clipped to between -1 and 1, and averaged across frames. The gradients were smoothed with a 2D Gaussian kernel of σ = 5 and subtracted from the transparency mask. The transparency mask was only calculated using one SOTA DNEM instance and its validation fold. See Supplementary Algorithm 2.
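
A simplified sketch of one mask-training step (cf. Supplementary Algorithm 2), assuming a 2D mask broadcast across frames so that the per-frame gradient averaging described above happens implicitly; names, shapes, and the CPU-tensor assumption are illustrative.

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def mask_training_step(model, alpha, video, background, behavior, target, lr=10.0):
    """One step of transparency-mask optimization.

    alpha:      (H, W) transparency mask being optimized
    video:      (1, 1, 32, H, W) true video segment
    background: (1, 1, 32, H, W) randomly selected background video
    target:     (N, 32) recorded responses to the true video segment
    """
    alpha = alpha.detach().requires_grad_(True)
    blended = alpha * video + (1 - alpha) * background      # alpha blending (Equation 1)
    x = torch.cat([blended, behavior], dim=1)
    pred = model(x).squeeze(0)
    loss = F.mse_loss(pred, target) * alpha.mean()          # MSE scaled by mean mask weight (Equation 3)
    grad, = torch.autograd.grad(loss, alpha)
    grad = (grad / grad.norm().clamp(min=1e-8)).clamp(-1, 1)                 # normalize and clip
    grad = torch.from_numpy(gaussian_filter(grad.numpy(), sigma=5)).float()  # smooth with σ = 5
    return (alpha - lr * grad).detach()                     # SGD update
```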

The transparency mask was thresholded and binarized at 0.5 for the masked gradients ∇masked, or at 1 for the masked videos used for evaluation Veval:

∇masked = ∇ · 1[α > 0.5],    Veval = V · 1[α ≥ 1]    (4)

where ∇ is the gradient of the loss with respect to each pixel in the video, V is the reconstructed video before masking, and 1[·] is the element-wise indicator function. These masks were trained independently for each mouse using one model instance with the original weights of the model (https://github.com/lRomul/sensorium), not the retrained models used in the rest of this paper to reconstruct the videos.

4.5 Video reconstruction

To reconstruct the input video we initialized the video as uniform gray values and concatenated the ground truth behavioral parameters. The SOTA DNEM took 32 frames at a time and we shifted this window by 8 frames until all frames of the whole 10 s video were covered. For each 32-frame window, the Poisson negative log-likelihood loss between the predicted and true neuronal responses was calculated:

L = (1/n) Σᵢ (ŷᵢ − yᵢ · log ŷᵢ)    (5)

where ŷ are the predicted responses and y are the ground truth responses. The gradients of the loss with respect to each pixel of the input video were calculated for each window of frames and averaged across all windows. The gradients for each pixel were normalized by the matrix norm across all gradients and clipped to between -1 and 1. The gradients were masked (Equation 4) and applied to the input video using Adam without second-order momentum [Kingma and Ba, 2014] (β1 = 0.9), with a learning rate of 1000 and a learning rate warm-up for the first 10 epochs. After each epoch, the video was clipped to between 0 and 255. The optimization was run for 1000 epochs. The 7 reconstructions from the 7 model instances were averaged, denoised with a 3D Gaussian filter of σ = 0.5 (unless specified otherwise), and masked with the evaluation mask. See Supplementary Algorithm 1. Optimizing each 10-second video with one model instance for 1000 epochs took ≈ 60 min on a desktop with an RTX 4070 GPU.
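
The gradient post-processing and pixel update can be sketched as follows; interpreting “Adam without second-order momentum” as a bias-corrected first-moment (momentum) step is an assumption about the exact implementation, and the names and shapes are illustrative.

```python
import torch

def apply_update(video, grad, train_mask, m, epoch, lr=1000.0, beta1=0.9, warmup=10):
    """Pixel update for one epoch (cf. the tail of Supplementary Algorithm 1).

    video: (1, 1, T, H, W) current reconstruction; grad: window-averaged gradients;
    train_mask: (H, W) binarized training mask; m: running first-moment estimate.
    """
    grad = (grad / grad.norm().clamp(min=1e-8)).clamp(-1, 1)  # normalize and clip gradients
    grad = grad * train_mask                                  # masked gradients (Equation 4)
    m = beta1 * m + (1 - beta1) * grad                        # first moment only (no second moment)
    m_hat = m / (1 - beta1 ** (epoch + 1))                    # bias correction
    step = lr * min(1.0, (epoch + 1) / warmup)                # linear warm-up over the first 10 epochs
    video = (video - step * m_hat).clamp(0, 255)              # update and clip pixel range
    return video, m
```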

4.6 Reconstruction quality assessment

To evaluate the similarity between reconstructed and ground truth videos, we used the mean Pearson’s correlation between pixels of corresponding frames to evaluate spatial similarity:

r_frame = (1/f) Σᵢ corr(xᵢ, x̂ᵢ)    (6)

where f is the number of frames, and xᵢ and x̂ᵢ are the i-th ground truth and reconstructed frames. To evaluate temporal and spatial similarity between ground truth and reconstructed videos, we used the Pearson’s correlation between all pixels of the whole movie:

r_video = corr(vec(X), vec(X̂))    (7)

where X and X̂ are the ground truth and reconstructed videos and vec(·) flattens all pixels and frames into a single vector.
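
Both metrics can be computed directly with NumPy; a short sketch, assuming (T, H, W) arrays:

```python
import numpy as np

def video_correlations(gt, recon):
    """Return (spatio-temporal correlation, mean frame correlation), as in Methods 4.6.

    gt, recon: (T, H, W) ground truth and reconstructed videos.
    """
    r_video = np.corrcoef(gt.ravel(), recon.ravel())[0, 1]          # all pixels, all frames
    r_frame = np.mean([np.corrcoef(g.ravel(), r.ravel())[0, 1]
                       for g, r in zip(gt, recon)])                 # average over frames
    return r_video, float(r_frame)
```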

A Appendix / supplemental material

Figure S1: Summary ethogram of SOTA DNEM inputs, output predictions, and video reconstruction over time for three videos from three mice (same as Figure 2A). A) Top: motion energy of the input video. Bottom: pupil diameter and running speed of the mouse during the video. B) Ground truth neuronal activity. C) Predicted neuronal activity in response to the input video and behavioral parameters. D) Predicted neuronal activity given the reconstructed video and ground truth behavior as input. E) Frame-by-frame correlation between reconstructed and ground truth video.

Figure S2: Reconstruction of drifting grating stimuli with different spatial and temporal frequencies using predicted activity. A) Example drifting grating stimuli (rightwards moving) masked with the evaluation mask for one mouse. Shown is the 31st frame of a 2 second video. B) Reconstructed drifting grating stimuli with SOTA DNEM predicted neuronal activity as the target (see also Supplementary Video 3: YouTube link). C) Pearson’s correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal frequencies. For each stimulus type, the average correlation across 4 directions (up, down, left, right) reconstructed from the SOTA DNEM of 1 mouse is given. Interestingly, video correlation at 15 cycles/second (half the video frame rate of 30 Hz) is much higher than at 7.5 cycles/second. This is an artifact of using predicted responses rather than true neural responses: the DNEM input layer convolution has a dilation of 2, so the predicted activity is based on every second frame, with the effect that the activity is predicted as the responses to two static images which are then interleaved.

Figure S3: Reconstruction performance correlates with video motion energy and frame contrast but not with behavioral parameters. Pearson’s correlation between mean frame correlation per movie and 3 movie parameters and 3 behavioral parameters. Linear fit as black line.

Algorithm 1: Movie reconstruction

Algorithm 2: Mask training

Acknowledgements

We would like to thank Emmanuel Bauer, Sandra Reinert and the anonymous reviewers for useful input and discussions, and Tom George for the Gaussian noise stimulus set. T.W.M. is funded by The Wellcome Trust (219627/Z/19/Z; 214333/Z/18/Z) and Gatsby Charitable Foundation (GAT3755) and J.B. is funded by EMBO (ALTF 415-2024).

Additional information

5 Code

The code is available at https://github.com/Joel-Bauer/movie_reconstruction_code.

Additional files

Supplementary Videos.

Supplementary Video 1: Reconstructed natural videos from mouse brain activity. Odd rows are ground truth (GT) movie clips presented to mice. Even rows are the reconstructed movies from the activity of 8000 V1 neurons. Reconstructed movies are smoothed (σ = 0.5 pixels), masked, and contrast (std) and luminance (mean) matched to ground truth movies.

Supplementary Video 2: Gaussian noise stimuli and reconstructions. Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for 1 mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.

Supplementary Video 3: Drifting grating stimuli and reconstructions. Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for 1 mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.