1 Introduction

One fundamental aim of neuroscience is to eventually gain insight into the ongoing perceptual experience of humans and animals. Reconstructing visual perception directly from brain activity has the potential to deepen our understanding of how the brain represents visual information. Over the past decade, there have been considerable advances in reconstructing images and videos from human brain activity [Nishimoto et al., 2011, Shen et al., 2019a,b, Rakhimberdina et al., 2021, Ren et al., 2021, Takagi and Nishimoto, 2023, Ozcelik and VanRullen, 2023, Ho et al., 2023, Scotti et al., 2023, Chen et al., 2023, Benchetrit et al., 2023, Kupershmidt et al., 2022]. These advances have largely leveraged deep learning techniques to interpret fMRI or MEG recordings, taking advantage of the fact that spatially separated clusters of neurons have distinct visual and semantic response properties [Rakhimberdina et al., 2021]. Because fMRI and MEG have low resolution relative to single neurons, the most successful models rely heavily on extracting semantic content and use diffusion models to generate semantically similar images and videos. Some approaches combine low-level perceptual (retinotopic) and semantic information in separate modules to achieve even better image similarity [Ren et al., 2021, Ozcelik and VanRullen, 2023, Scotti et al., 2023]. However, pixel-level similarity remains relatively low. These methods are powerful in humans, but their focus on semantic content may limit their usefulness for non-human subjects or when the reconstructed images are used to investigate visual processing.

Less attention has been given to image reconstruction from non-human brains. This is surprising given the advantages of large-scale single-cell-resolution recording techniques available in animal models, particularly mice. In the past, reconstructions using linear summation of receptive fields or Gabor filters have shown some success using responses from retinal ganglion cells [Brackbill et al., 2020], thalamo-cortical neurons in the lateral geniculate nucleus [Stanley et al., 1999], and primary visual cortex [Garasto et al., 2019, Yoshida and Ohki, 2020]. Recently, deep nonlinear neural networks have been used with promising results to reconstruct static images from the retina [Zhang et al., 2020, Li et al., 2023] and in particular from monkey V4 extracellular recordings [Li et al., 2023, Pierzchlewicz et al., 2023].

Here, we present a method for the reconstruction of 10-second movie clips using two-photon calcium imaging data recorded in mouse V1 [Turishcheva et al., 2023, 2024]. Our method takes advantage of a state-of-the-art (SOTA) dynamic neural encoding model (DNEM) [Baikulov, 2023] which predicts neuronal activity based on video input as well as behavior (Figure 1A). Our method allows us to successfully reconstruct videos despite the fact that V1 neuronal activity in awake mice is heavily modulated by behavioral factors such as running speed [Niell and Stryker, 2010] and pupil diameter (correlated with arousal; [Reimer et al., 2014]). We then quantify the spatio-temporal limits of this reconstruction approach and identify key aspects of our method necessary for optimal performance.

Figure 1: Video reconstruction from neuronal activity in mouse V1 (data provided by the Sensorium 2023 competition; [Turishcheva et al., 2023, 2024]) using a state-of-the-art (SOTA) dynamic neural encoding model (DNEM; [Baikulov, 2023]). A) SOTA DNEMs predict neuronal activity from mouse primary visual cortex, given a video and behavioral input. B) We use a SOTA DNEM to reconstruct part of the input video given neuronal population activity, using gradient descent to optimize the input. C) Poisson negative log-likelihood loss across training steps between ground truth neuronal activity and predicted neuronal activity in response to videos. Left: all 50 videos from 5 mice for one model. Right: average loss across all videos for 7 model instances. D) Spatio-temporal (pixel-by-pixel) correlation between reconstructed video and ground truth video.

2 Video reconstruction using state-of-the-art dynamic neural encoding models

We used publicly available data provided by the Sensorium 2023 competition [Turishcheva et al., 2023, 2024]. The data included movies that were presented to mice and the evoked activity of V1 neurons, along with pupil position, pupil diameter, and running speed. The neuronal activity was measured using two-photon imaging of GCaMP6s [Chen et al., 2013] fluorescence from 10 mice, with ≈ 8000 neurons from each mouse. We reconstructed ten 10 s natural movies from each of 5 mice (50 videos in total).

We used the winning model of the Sensorium 2023 competition, which achieved a score of 0.301 (single-trial correlation between predicted and ground truth neuronal activity; [Baikulov, 2023, Turishcheva et al., 2024]; Figure 1A and Figure S1A-C). This state-of-the-art (SOTA) dynamic neural encoding model (DNEM) is composed of three parts: core, cortex, and readout. The model takes the video as input, with the behavioral data (pupil position, pupil diameter, and running speed) broadcast to four additional channels of the video. The original model weights were not used, to avoid reconstructing movies the model was trained on. Instead, we retrained 7 instances of the model using the same training data, which did not include the movies reserved for reconstruction. From this point onward the weights of the model were frozen, i.e. not influenced by subsequent movie presentations.

To reconstruct the videos presented to mice, we iteratively optimized an initially blank input video to the SOTA DNEM until the predicted activity in response to this input matched the ground truth recorded neuronal activity. To achieve this, we optimized the input through gradient descent, inspired by the optimization of maximally exciting images [Walker et al., 2019] and the reconstruction of static images from monkey V4 extracellular recordings [Pierzchlewicz et al., 2023]. The input videos were initialized as uniform grey values and the behavioral parameters (Figure S1A) were added as additional channels, i.e. these were not reconstructed but given. The neuronal activity in response to the input video was predicted using the SOTA DNEM for a sliding window of 32 frames (1.067 s) with a stride of 8 frames. We saw slightly better results with a stride of 2 frames but, in our case, this did not warrant the increase in training time. For each window, the difference between the predicted and ground truth responses was calculated and this loss was backpropagated to the pixels of the input video to obtain the gradient of the loss with respect to each pixel. In effect, the input pixels were treated as if they were model weights. The gradients for each pixel were then averaged across all windows and the pixels of the input video were updated accordingly (see Supplementary Algorithm 1).
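
As a concrete illustration, the following is a minimal PyTorch sketch of one epoch of this sliding-window procedure. It assumes a `model` that maps a video-plus-behavior tensor to predicted activity; the shapes, names, and exact interface are illustrative and are not the Sensorium code.

```python
import torch
import torch.nn.functional as F

def epoch_gradient(model, video, behavior, target, win=32, stride=8):
    """Accumulate pixel gradients over sliding windows (one reconstruction epoch).

    video:    (1, 1, T, H, W) video being optimized, requires_grad=True
    behavior: (1, 4, T, H, W) behavioral channels (pupil x/y, pupil diameter, running speed)
    target:   (N, T) recorded (ground truth) neuronal activity
    """
    grad_sum = torch.zeros_like(video)
    starts = range(0, video.shape[2] - win + 1, stride)
    for t0 in starts:
        x = torch.cat([video[:, :, t0:t0 + win], behavior[:, :, t0:t0 + win]], dim=1)
        pred = model(x).squeeze(0)                        # (N, win) predicted activity
        loss = F.poisson_nll_loss(pred, target[:, t0:t0 + win], log_input=False)
        grad, = torch.autograd.grad(loss, video)          # backpropagate loss to the pixels
        grad_sum += grad
    return grad_sum / len(starts)                         # average gradients across windows
```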

The data from the Sensorium competition provided the activity of neurons within a 630 by 630 µm field of view for each mouse, i.e. covering roughly one-fifth of mouse V1. Due to the retinotopic organization of V1, we did not expect to get good reconstructions of the entire video frame. However, gradients still propagated to the full video frame and produced nonsensical results along the periphery of the video frames. Inspired by previous work [Mordvintsev et al., 2018, Willeke et al., 2023], we therefore applied a mask during training and evaluation. To generate these masks, we optimized a transparency layer placed at the input to the SOTA DNEM. This mask assigns high values to pixels that contribute to the accurate prediction of neuronal activity and thus represents the collective receptive field of the recorded neural population. The mask was applied during the optimization of the reconstructed movies (training mask: binarized with threshold α = 0.5) and applied again to the final reconstruction (evaluation mask: binarized with threshold α = 1) (see Supplementary Algorithm 2).

As the loss between predicted (Figure S1D) and ground truth responses (Figure S1B) decreased, the similarity between the reconstructed and ground truth input video increased (Figure 1C-D). We generated 7 separate reconstructions from 7 neural encoding models (trained on the same data) and averaged them. Finally, we applied a 3D Gaussian filter with sigma 0.5 pixels to remove the remaining static noise and applied the evaluation mask. The Gaussian filter was not applied when evaluating spatial or temporal resolution (Figure 4 and Figure S2).
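
A minimal sketch of this post-processing step (ensemble averaging, 3D Gaussian smoothing, and evaluation masking), assuming NumPy/SciPy arrays; function and variable names are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def finalize_reconstruction(recons, eval_mask, sigma=0.5):
    """Average reconstructions from several model instances, denoise, and mask.

    recons:    list of (T, H, W) reconstructions, one per model instance (7 here)
    eval_mask: (H, W) binary evaluation mask (collective receptive field)
    """
    avg = np.mean(np.stack(recons, axis=0), axis=0)   # ensemble average across models
    avg = gaussian_filter(avg, sigma=sigma)           # 3D Gaussian filter over (t, y, x)
    return avg * eval_mask[None, :, :]                # zero out pixels outside the mask
```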

2.1 High-quality video reconstruction

As can be seen in Figure 2 and Supplementary Video 1, the reconstructed videos capture much of the spatial and temporal dynamics of the original input video. To evaluate the performance of the video reconstructions, we either correlated all pixels from all time points between ground truth and reconstructed videos (Pearson’s correlation r = 0.563; quantifying temporal and spatial similarity), or took the average correlation between corresponding frames (Pearson’s correlation r = 0.500; quantifying just spatial similarity) (Figure 2B and Figure S1E). Importantly, this represents a ≈ 2x improvement over previous static image reconstructions from V1 in awake mice (image correlation 0.23 ± 0.02 s.e.m. for awake mice) [Yoshida and Ohki, 2020] over a similar retinotopic area (≈ 60° diameter), while also capturing temporal dynamics (Table 1).

Table 1: Benchmarking against previous natural image reconstructions from mouse visual cortex. Correlation r values are given as mean and std across mice (except for [Garasto et al., 2019], where they are given as mean and std across reconstructed images).

Figure 2: Reconstruction performance. A) Three reconstructions of 10 s videos from different mice (see Supplementary Video 1 for the full set: YouTube link). Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth. B) The reconstructed videos have high correlation with ground truth in both spatio-temporal correlation (mean Pearson’s correlation r = 0.563 with 95% CIs 0.534 to 0.593, t-test between ground truth and random video p = 8.66 × 10⁻⁴⁵, n = 50 videos from 5 mice) and mean frame correlation (mean Pearson’s correlation r = 0.500 with 95% CIs 0.468 to 0.532, t-test between ground truth and random video p = 3.27 × 10⁻³⁸, n = 50 videos from 5 mice).

Reconstruction quality, however, was not consistent across movies (Figure 2B) or constant throughout the 10-second videos (Figure S1E). We therefore investigated which factors may cause these fluctuations by correlating video motion energy, contrast, and luminance, as well as running speed, pupil diameter, and eye movement with frame correlation. We found that only video motion energy and contrast correlated with frame correlation, and only to a moderate degree (Supplementary Figure S3).

2.2 Ensembling

We found that the 7 instances of the SOTA DNEM by themselves performed similarly in terms of reconstructed video correlation (Figure 1D), but that this correlation was significantly increased by averaging the reconstructions from different models (Figure 3), a technique known as bagging, or more generally ensembling [Breiman, 1996]. Individual models produced reconstructions with high-frequency noise in the temporal and spatial domains. We therefore think the increase in performance from ensembling is mostly an effect of averaging out this high-frequency noise.

Figure 3: Model ensembling. Mean video correlation is improved when predictions from multiple models are averaged. Dashed lines are individual animals, solid line is the mean. One-way repeated measures ANOVA p = 1.11 × 10⁻¹⁶. Bonferroni-corrected paired t-test outcomes between consecutive ensemble sizes are all p < 0.001, n = 5 mice.

In this paper we averaged over 7 model instances, which gave a performance increase of 28.3%, but the largest gain in performance, 13.8%, came from averaging across just 2 models (Figure 3). Doubling the number of models to 4 increased the performance by another 8.12%. Overall, although ensembling over multiple model instances is computationally expensive, it substantially improved reconstruction quality.
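
The ensemble-size analysis in Figure 3 can be reproduced with a short loop. Below is a hedged sketch where `video_corr` stands for the spatio-temporal correlation defined in Methods 4.6 and `recons` holds the reconstructions from the individual model instances; the names are illustrative.

```python
import numpy as np

def ensemble_size_curve(recons, ground_truth, video_corr):
    """Video correlation as a function of the number of averaged model instances."""
    return [video_corr(np.mean(recons[:k], axis=0), ground_truth)
            for k in range(1, len(recons) + 1)]
```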

2.3 Not all spatial and temporal frequencies are reconstructed equally

While the reconstructed videos achieve high correlation with ground truth, it is not entirely clear whether the remaining deviations are due to the limitations of the model or arise from the recorded neurons themselves. To assess the resolution limits of our reconstruction process, we evaluated the model’s ability to reconstruct synthetic stimuli at varying spatial and temporal resolutions in a noise-free scenario.

To quantify which spatial and temporal frequencies our reconstruction approach is able to capture, we used a Gaussian noise stimulus set generated using a Gaussian process (https://github.com/TomGeorge1234/gp_video; Figure 4A). The dataset consisted of 49 two-second, 36 by 36 pixel videos at 30 Hz, which varied in their spatial and temporal length constants. As we did not have ground truth neuronal activity in response to this stimulus set, we first predicted the neuronal responses to these videos using the ensembled SOTA DNEMs. We then used gradient descent to reconstruct the original input using these predicted neuronal responses as the target. In this way, we generated reconstructions in an ideal case with no biological noise and assuming the SOTA DNEM perfectly predicts neuronal activity (Figure 4B). Any loss in video reconstruction quality therefore reflects the limitations of the reconstruction process itself, without the additional loss or transformation of information by processes such as top-down modulation, e.g. predictive coding or selective feature attention (see Discussion). We found that the reconstruction process failed at high spatial frequencies (length constants < 1 pixel, or < 3.4° retinotopy) and performed worse at high temporal frequencies (length constants < 1 frame, or < 1/30 s) (Figure 4C and Supplementary Video 2). We repeated this analysis using full-field high-contrast square gratings drifting in the four cardinal directions and similarly found that high spatial and temporal frequency gratings were not reconstructed as well as low spatial and temporal frequency gratings (Figure S2).
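
A sketch of this noise-free procedure: predicted responses to a synthetic stimulus are generated once and then used as the reconstruction target in place of recorded activity. Function names follow the earlier sketches and are illustrative, not the exact interface of the model code.

```python
import torch

@torch.no_grad()
def predicted_target(model, stimulus, behavior):
    """Predict responses to a synthetic stimulus; these replace recorded activity
    as the reconstruction target (no biological noise, encoding assumed perfect)."""
    x = torch.cat([stimulus, behavior], dim=1)    # (1, 1+4, T, H, W) video + behavior
    return model(x).squeeze(0)                    # (N, T) predicted activity

# target = predicted_target(dnem, gaussian_noise_video, behavior)
# ...then run the same gradient-descent reconstruction against this target.
```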

To test whether model ensembling improves Gaussian noise reconstruction quality uniformly across all spatial and temporal length constants, we subtracted the average video correlation across the seven model instances from the video correlation of the averaged video (i.e. ensembled video reconstruction minus unensembled video reconstruction; Figure 4D). We found that stimuli with short temporal and spatial length constants in particular improved in correlation, supporting our hypothesis that ensembling mitigates the high-frequency noise we observed in the reconstructions from individual models.

2.4 Neuronal population size

In order to design future in vivo experiments to investigate visual processing using our video reconstruction approach, it would be useful to know how reconstruction performance scales with the number of recorded neurons. This is vital for prioritizing experimental parameters, such as the trade-off between sampling density within a given retinotopic area and total retinotopic coverage, to maximize both video reconstruction quality and visual coverage. We therefore performed an in silico ablation experiment, dropping either 50%, 75%, or 87.5% of the total recorded population of ≈ 8000 neurons per mouse by setting their activity to 0 (Figure 5). We found that dropping 50% of the neurons reduced the video correlation by only 9.8%, while dropping 75% reduced performance by 23.8%. We would therefore argue that ≈ 4000-8000 neurons within a 630 by 630 µm area (≈ 10000-20000 neurons/mm²) of mouse V1 would provide a good balance between sampling density and 2D coverage.
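
A sketch of the in silico ablation, assuming that a random subset of neurons is silenced by zeroing their activity before reconstruction (the exact selection scheme is a simplifying assumption here).

```python
import numpy as np

def ablate_population(activity, drop_frac, seed=0):
    """Set the activity of a random fraction of neurons to zero.

    activity:  (N, T) recorded responses used as the reconstruction target
    drop_frac: fraction to drop, e.g. 0.5, 0.75, or 0.875 (Figure 5)
    """
    rng = np.random.default_rng(seed)
    n = activity.shape[0]
    dropped = rng.choice(n, size=int(round(drop_frac * n)), replace=False)
    ablated = activity.copy()
    ablated[dropped] = 0.0
    return ablated
```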

Figure 4: Reconstruction of Gaussian noise across the spatial and temporal spectrum using predicted activity. A) Example Gaussian noise stimulus set with evaluation mask for one mouse. Shown is the last frame of a 2 second video. B) Reconstructed Gaussian stimuli with SOTA DNEM predicted neuronal activity as the target (see also Supplementary Video 2: YouTube link). C) Pearson’s correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal length constants. For each stimulus type, the average correlation across 5 movies reconstructed from the SOTA DNEMs of 3 mice is given. D) Ensembling effect for each stimulus. Video correlation for the ensembled prediction (averaged videos from 7 model instances) minus the mean video correlation across the 7 individual model instances.

Figure 5: Video reconstruction using fewer neurons (i.e. population ablation) leads to lower reconstruction quality. Dashed lines are individual animals, solid line is the mean. One-way repeated measures ANOVA p = 1.58 × 10⁻¹². Bonferroni-corrected paired t-test outcomes between consecutive drops in population size are all p < 0.001, n = 5 mice.

3 Discussion

3.1 Stimulus identification vs reconstruction

Stimulus identification, i.e. identifying the most likely stimulus from a constrained set, has been a popular approach for quantifying whether a population of neurons encodes the identity of a particular stimulus [Földiák, 1993, Kay et al., 2008]. This approach has, for instance, been used to decode frame identity within a movie [Deitch et al., 2021, Xia et al., 2021, Schneider et al., 2023, Chen et al., 2024]. Some of these approaches have also been used to reorder the frames of the ground truth movie based on the decoded frame identity [Schneider et al., 2023]. Importantly, stimulus identification methods are distinct from stimulus reconstruction, where the aim is to recreate the sensory content of a neuronal code in a way that generalizes to new sensory stimuli [Rakhimberdina et al., 2021]. This is inherently a more demanding task because the range of possible solutions is much larger. Although stimulus identification is a valuable tool for understanding the information content of a population code, stimulus reconstruction promises a more generalizable approach.

3.2 Comparison to other reconstruction methods

There has recently been a growing number of publications in the field of image reconstruction, primarily from fMRI data, and a comprehensive review of all the approaches is outside the scope of this paper. However, we will briefly summarize the most common approaches and how they relate to our own method. In general, image reconstruction methods can be categorized into one of four groups: direct decoding models, encoder-decoder models, invertible encoding models, and encoder model input optimization.

Direct decoders decode the input images/videos directly from neuronal activity with deep neural networks [Shen et al., 2019a, Zhang et al., 2020, Li et al., 2023]. When training direct decoders, the decoders can be pretrained [Ren et al., 2021], or additional constraints can be added to the loss function to encourage the decoder to produce images that adhere to learned image statistics [Shen et al., 2019a, Kupershmidt et al., 2022]. A direct decoder approach has been used for video reconstruction in mice [Chen et al., 2024], but in that case the training and test movies were the same, meaning it is unclear whether out-of-training-set generalization was achieved (a key distinction between sensory reconstruction and stimulus identification, see previous section).

In encoder-decoder models the aim is to combine separately trained brain encoders (brain activity to latent space) and decoders (latent space to image/video). Recently this approach has become particularly popular because it allows the use of SOTA generative image models such as Stable Diffusion [Rombach et al., 2021, Takagi and Nishimoto, 2023, Scotti et al., 2023, Chen et al., 2023, Benchetrit et al., 2023]. The encoder part of the model is first trained to translate brain activity into a latent space that the pretrained generative network can interpret. Because these latent spaces are often conditioned on semantic information, this lends itself to separate processing of low-level visual and high-level semantic information from brain activity [Scotti et al., 2023].

Invertible encoding models are encoding models which, once trained to predict neuronal activity, can implicitly be inverted to predict sensory input given brain activity. We would also include in this class models that first compute the receptive field or preferred stimulus of neurons (or voxels) and reconstruct the input as the sum of the receptive fields weighted by their activity [Stanley et al., 1999, Thirion et al., 2006, Garasto et al., 2019, Brackbill et al., 2020, Yoshida and Ohki, 2020, Nishimoto et al., 2011]. The downside of this approach is that these encoding models generally underperform in capturing the coding properties of neurons compared to more complex deep neural networks [Willeke et al., 2023].

Encoder input optimization also involves first training an encoder which predicts the activity of neurons or voxels given sensory input. Once trained, the encoder is fixed and the input to the network is optimized using backpropagation until the predicted activity matches the observed activity [Pierzchlewicz et al., 2023]. Unlike invertible encoding models, any SOTA neuronal encoding model can be used. But like invertible models, the networks are not specifically trained to reconstruct images, so they may be less prone to extrapolating information beyond what the brain encodes by relying on learned general image statistics.

Although outlined here as four distinct classes, these approaches can be combined. For instance, encoder input optimization can be combined with image diffusion [Pierzchlewicz et al., 2023], and in principle invertible models could also be combined in such a way.

We chose to pursue a pure encoder input optimization approach for single-cell mouse visual cortex activity for two reasons. First, there have been considerable advances in the performance of neuronal encoding models for dynamic visual stimuli [Sinz et al., 2018, Wang et al., 2023, Turishcheva et al., 2024], and we aimed to take advantage of these developments. Second, the addition of a generative decoder trained to produce high-quality images brings with it the risk of extrapolating information based on general image statistics rather than interpreting what the brain is representing. In some cases the brain may not be encoding coherent images, and in those cases we would argue that image reconstruction should fail, rather than produce an image when only the semantic information is present.

3.3 Key contributions and limitations

We demonstrate high-quality video reconstruction from mouse V1 using SOTA DNEMs to iteratively optimize the input video such that the resulting predicted activity matches the recorded neuronal activity. Key to achieving high-quality reconstructions are model ensembling and a sufficiently large number of recorded neurons within a given retinotopic area.

While we averaged the video reconstructions from several models, an alternative would be to average the gradients calculated by multiple models at each epoch, as has been done for the generation of maximally exciting images in the past [Walker et al., 2019]. However, this requires a large amount of GPU memory when using video models and is likely not practical on most hardware. Nevertheless, there might be situations in which averaging gradients yields better reconstructions. For instance, there may be multiple solutions for the activation pattern of a neural population, e.g. if their responses are translation/phase invariant [Ito et al., 1995, Tacchetti et al., 2018]. In such a case, averaging ‘misaligned’ reconstructions from multiple models might degrade overall quality.

The SOTA DNEM we used takes video data at an angular resolution of 3.4 °/pixel at the center of the screen, which is about 3x worse than the visual acuity of mice (≈ 0.5 cycles/° [Prusky and Douglas, 2004]). As our model can reconstruct Gaussian noise stimuli down to a spatial length constant of 1 pixel, and drifting gratings up to a spatial frequency of 0.071 cycles/°, there is still some potential for improving spatial resolution. To close this gap and achieve reconstructions at the limit of mouse visual acuity, a different dataset and model would likely need to be developed. However, the frame rate of the videos the SOTA DNEM takes as input (30 Hz) is faster than the flicker fusion frequency of mice (14 Hz [Nomura et al., 2019]), and our tests with Gaussian noise and drifting grating stimuli show that the temporal resolution of the reconstruction is close to this expected limit. Future efforts should therefore focus on the spatial rather than the temporal resolution of video reconstruction.

It is, however, unclear how closely the brain’s representation of vision is expected to match the actual input. A number of previously identified visual processing phenomena lead us to suspect that some deviations between video reconstructions and ground truth input are to be expected. One such phenomenon is predictive coding [Rao and Ballard, 1999, Fiser et al., 2016]. It is possible that the unexpected parts of visual stimuli are sharper and have higher contrast than the expected parts when reconstructed from neuronal activity. Another is perceptual learning, in which visual stimulus detection or discriminability is enhanced through prolonged training [Li, 2015] and which is associated with changes in the tuning distribution of neurons in the visual system [Goltstein et al., 2013, Poort et al., 2015, Jurjut et al., 2017, Schumacher et al., 2022]. Similarly, selective feature attention can modulate the response amplitude of neurons that have a preference for the features that are currently being attended to [Kanamori and Mrsic-Flogel, 2022]. Visual task engagement and training could therefore alter the accuracy and biases of which features of a video can be accurately reconstructed from the neuronal activity.

Such visual processing phenomena are likely difficult to investigate using existing fMRI-based reconstruction approaches, due to the low spatial and temporal resolution of the data. Additionally, many of these fMRI-based reconstruction approaches rely on the use of pretrained generative diffusion models to achieve more naturalistic and semantically interpretable images [Takagi and Nishimoto, 2023, Ozcelik and VanRullen, 2023, Scotti et al., 2023, Chen et al., 2023], but very likely at the cost of introducing information that may not be present in the actual neuronal representation. In contrast, our video reconstruction approach using single-cell resolution recordings, without a pretrained generative model, provides a more accurate method to investigate visual processing phenomena such as predictive coding, perceptual learning, and selective feature attention.

In conclusion, we reconstruct videos presented to mice based on the activity of neurons in the mouse visual cortex, with a ≈ 2-fold improvement in pixel-by-pixel correlation compared to previous static image reconstruction methods. This paves the way to using movie reconstruction as a tool to investigate a variety of visual processing phenomena.

4 Methods

4.1 Source data

The data was provided by the Sensorium 2023 competition [Turishcheva et al., 2023, 2024] and downloaded from https://gin.g-node.org/pollytur/Sensorium2023Data and https://gin.g-node.org/pollytur/sensorium_2023_dataset. The data included grayscale movies presented to the mice at 30 Hz on a 31.8 by 56.5 cm monitor positioned 15 cm from, and perpendicular to, the left eye. The movies were provided spatially downsampled from the original screen resolution to 36 by 64 pixels, corresponding to an angular resolution of 3.4 °/pixel at the center of the screen. The pupil position and diameter were recorded at 20 Hz and the running speed at 100 Hz. The neuronal activity was measured using two-photon imaging [Denk et al., 1990] of GCaMP6s [Chen et al., 2013] fluorescence at 8 Hz, extracted and deconvolved using the CAIMAN pipeline [Giovannucci et al., 2019]. For each of the 10 mice, the activity of ≈ 8000 neurons was provided. The different data types were resampled to 30 Hz.

4.2 State-of-the-art dynamic neural encoding model

We used the winning model of the Sensorium 2023 competition, DwiseNeuro [Turishcheva et al., 2023, 2024]. The code for the SOTA DNEM was downloaded from https://github.com/lRomul/sensorium. The full model consists of three main components: core, cortex, and readout. The core largely consists of factorized 3D convolution blocks with residual connections, positional encoding [Vaswani et al., 2017], and SiLU activations [Elfwing et al., 2017], followed by spatial average pooling. The cortex consists of three fully connected layers. The readout consists of a 1D convolution for each mouse with a final Softplus nonlinearity, which gives activity predictions for all neurons of each mouse. The kernel of the input layer had size 16 with a dilation of 2 in the time dimension, so it spanned 32 video frames.
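
For orientation, the snippet below sketches what a factorized 3D convolution block with SiLU activations and a residual connection can look like; it is illustrative only and is not the DwiseNeuro implementation (which additionally uses positional encoding, pooling, and other details).

```python
import torch.nn as nn

class Factorized3DBlock(nn.Module):
    """Illustrative factorized 3D convolution block (not the DwiseNeuro code):
    a spatial (1, k, k) convolution followed by a temporal (k, 1, 1) convolution,
    SiLU activations, and a residual connection."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, (1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(channels, channels, (k, 1, 1), padding=(k // 2, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x):                      # x: (batch, channels, T, H, W)
        y = self.act(self.spatial(x))
        y = self.act(self.temporal(y))
        return x + y                           # residual connection
```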

The original ensemble of models consisted of 7 model instances trained on a 7-fold cross-validation split of all available Sensorium 2023 competition data (≈ 1 hour of training data and ≈ 8 min of cross-validation data per fold from each mouse). Each model instance was trained on 6 of the 7 data folds, with different validation data excluded from training for each model. To allow ensembled reconstructions of videos without test set contamination, we instead retrained the models with a shared validation fold, i.e. we retrained the models leaving out the same validation data for all 7 model instances. The only other difference in the training procedure was that we retrained the models using a batch size of 24 instead of 32; this did not change the performance of neuronal response prediction on the withheld data folds (mean validation-fold predicted vs ground truth response correlation: 0.293 for the original weights and 0.291 for the retrained weights). We also did not use model distillation, while the original model did (see https://github.com/lRomul/sensorium).

4.3 Additional visual stimuli

The Gaussian noise stimuli were downloaded from https://github.com/TomGeorge1234/gp_video and spanned spatial length constants of 0 to 32 pixels and temporal length constants of 0 to 32 frames in the generating Gaussian process. The drifting grating stimuli were produced using PsychoPy [Peirce et al., 2019] and ranged from 0.5 to 0.062 cycles/degree and 0.5 to 0 cycles/second, with 2 seconds of movie for each cardinal direction. These ranges were chosen to avoid aliasing effects in the 36 by 64 pixel videos. The highest temporal frequency corresponds to a flicker stimulus.
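
As an illustration of the grating parameterization only, a NumPy sketch of a full-field square-wave drifting grating (not the original PsychoPy script) could look as follows, assuming the angular resolution of 3.4 °/pixel from Section 4.1.

```python
import numpy as np

def drifting_grating(sf_cpd, tf_hz, ori_deg, shape=(36, 64), deg_per_px=3.4,
                     n_frames=60, fps=30):
    """Full-field square-wave drifting grating (illustrative sketch).

    sf_cpd: spatial frequency (cycles/degree); tf_hz: temporal frequency (cycles/second);
    ori_deg: drift direction in degrees; returns frames scaled to [0, 255].
    """
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]].astype(float) * deg_per_px
    theta = np.deg2rad(ori_deg)
    spatial_phase = 2 * np.pi * sf_cpd * (xx * np.cos(theta) + yy * np.sin(theta))
    frames = [np.sign(np.sin(spatial_phase - 2 * np.pi * tf_hz * f / fps))
              for f in range(n_frames)]                    # square wave in [-1, 1]
    return (np.stack(frames) + 1) / 2 * 255
```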

4.4 Mask training

To generate the transparency masks we used an alpha blending approach inspired by previous work [Mordvintsev et al., 2018, Willeke et al., 2023]. A transparency layer was placed at the input to the SOTA DNEM. This transparency layer was used to alpha blend the true video V with another randomly selected background video BG from the data:

V_BG = α · V + (1 − α) · BG    (1)

where α is the 2D transparency mask, V_BG is the blended input video, and · denotes element-wise (per-pixel) multiplication. This mask was optimized using stochastic gradient descent (for 1000 epochs with learning rate 10) with a mean squared error (MSE) loss between the true responses y and the predicted responses ŷ, scaled by the average weight ᾱ of the transparency mask:

L = ᾱ · (1/n) Σᵢ (yᵢ − ŷᵢ)²    (3)

where n is the total number of neurons. The mask was initialized as uniform noise between 0 and 0.05. At each epoch, the neuronal activity in response to a randomly selected 32-frame video segment from the training set was predicted, and the gradients of the loss (Equation 3) with respect to the pixels in the transparency mask α were calculated for each video frame. The gradients were normalized by their matrix norm, clipped to between -1 and 1, and averaged across frames. The gradients were smoothed with a 2D Gaussian kernel of σ = 5 and subtracted from the transparency mask. The transparency mask was only calculated using one SOTA DNEM instance and its validation fold. See Supplementary Algorithm 2.
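
A simplified sketch of one mask-training step (cf. Supplementary Algorithm 2), assuming a 2D mask broadcast across frames so that the per-frame gradient averaging described above happens implicitly; names, shapes, and the CPU-tensor assumption are illustrative.

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter

def mask_training_step(model, alpha, video, background, behavior, target, lr=10.0):
    """One step of transparency-mask optimization.

    alpha:      (H, W) transparency mask being optimized
    video:      (1, 1, 32, H, W) true video segment
    background: (1, 1, 32, H, W) randomly selected background video
    target:     (N, 32) recorded responses to the true video segment
    """
    alpha = alpha.detach().requires_grad_(True)
    blended = alpha * video + (1 - alpha) * background      # alpha blending (Equation 1)
    x = torch.cat([blended, behavior], dim=1)
    pred = model(x).squeeze(0)
    loss = F.mse_loss(pred, target) * alpha.mean()          # MSE scaled by mean mask weight (Equation 3)
    grad, = torch.autograd.grad(loss, alpha)
    grad = (grad / grad.norm().clamp(min=1e-8)).clamp(-1, 1)                 # normalize and clip
    grad = torch.from_numpy(gaussian_filter(grad.numpy(), sigma=5)).float()  # smooth with σ = 5
    return (alpha - lr * grad).detach()                     # SGD update
```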

The transparency mask was thresholded and binarized at 0.5 for the masked gradients ∇masked, or at 1 for the masked videos used for evaluation Veval:

∇masked = ∇ · 1[α > 0.5],    Veval = V · 1[α ≥ 1]    (4)

where ∇ is the gradient of the loss with respect to each pixel in the video, V is the reconstructed video before masking, and 1[·] is the element-wise indicator function. These masks were trained independently for each mouse using one model instance with the original weights of the model (https://github.com/lRomul/sensorium), not the retrained models used in the rest of this paper to reconstruct the videos.

4.5 Video reconstruction

To reconstruct the input video we initialized the video as uniform gray values and concatenated the ground truth behavioral parameters. The SOTA DNEM took 32 frames at a time and we shifted this window by 8 frames until all frames of the whole 10 s video were covered. For each 32-frame window, the Poisson negative log-likelihood loss between the predicted and true neuronal responses was calculated:

L = (1/n) Σᵢ (ŷᵢ − yᵢ · log ŷᵢ)    (5)

where ŷ are the predicted responses and y are the ground truth responses. The gradients of the loss with respect to each pixel of the input video were calculated for each window of frames and averaged across all windows. The gradients for each pixel were normalized by the matrix norm across all gradients and clipped to between -1 and 1. The gradients were masked (Equation 4) and applied to the input video using Adam without second-order momentum [Kingma and Ba, 2014] (β1 = 0.9), with a learning rate of 1000 and a learning rate warm-up for the first 10 epochs. After each epoch, the video was clipped to between 0 and 255. The optimization was run for 1000 epochs. The 7 reconstructions from the 7 model instances were averaged, denoised with a 3D Gaussian filter of σ = 0.5 (unless specified otherwise), and masked with the evaluation mask. See Supplementary Algorithm 1. Optimizing each 10-second video with one model instance for 1000 epochs took ≈ 60 min on a desktop with an RTX 4070 GPU.
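
The gradient post-processing and pixel update can be sketched as follows; interpreting “Adam without second-order momentum” as a bias-corrected first-moment (momentum) step is an assumption about the exact implementation, and the names and shapes are illustrative.

```python
import torch

def apply_update(video, grad, train_mask, m, epoch, lr=1000.0, beta1=0.9, warmup=10):
    """Pixel update for one epoch (cf. the tail of Supplementary Algorithm 1).

    video: (1, 1, T, H, W) current reconstruction; grad: window-averaged gradients;
    train_mask: (H, W) binarized training mask; m: running first-moment estimate.
    """
    grad = (grad / grad.norm().clamp(min=1e-8)).clamp(-1, 1)  # normalize and clip gradients
    grad = grad * train_mask                                  # masked gradients (Equation 4)
    m = beta1 * m + (1 - beta1) * grad                        # first moment only (no second moment)
    m_hat = m / (1 - beta1 ** (epoch + 1))                    # bias correction
    step = lr * min(1.0, (epoch + 1) / warmup)                # linear warm-up over the first 10 epochs
    video = (video - step * m_hat).clamp(0, 255)              # update and clip pixel range
    return video, m
```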

4.6 Reconstruction quality assessment

To evaluate the similarity between reconstructed and ground truth videos, we used the mean Pearson’s correlation between pixels of corresponding frames to evaluate spatial similarity:

r_frame = (1/f) Σᵢ corr(xᵢ, x̂ᵢ)    (6)

where f is the number of frames, and xᵢ and x̂ᵢ are the i-th ground truth and reconstructed frames. To evaluate temporal and spatial similarity between ground truth and reconstructed videos, we used the Pearson’s correlation between all pixels of the whole movie:

r_video = corr(vec(X), vec(X̂))    (7)

where X and X̂ are the ground truth and reconstructed videos and vec(·) flattens all pixels and frames into a single vector.
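
Both metrics can be computed directly with NumPy; a short sketch, assuming (T, H, W) arrays:

```python
import numpy as np

def video_correlations(gt, recon):
    """Return (spatio-temporal correlation, mean frame correlation), as in Methods 4.6.

    gt, recon: (T, H, W) ground truth and reconstructed videos.
    """
    r_video = np.corrcoef(gt.ravel(), recon.ravel())[0, 1]          # all pixels, all frames
    r_frame = np.mean([np.corrcoef(g.ravel(), r.ravel())[0, 1]
                       for g, r in zip(gt, recon)])                 # average over frames
    return r_video, float(r_frame)
```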

A Appendix / supplemental material

Figure S1: Summary ethogram of SOTA DNEM inputs, output predictions, and video reconstruction over time for three videos from three mice (same as Figure 2A). A) Top: motion energy of the input video. Bottom: pupil diameter and running speed of the mouse during the video. B) Ground truth neuronal activity. C) Predicted neuronal activity in response to the input video and behavioral parameters. D) Predicted neuronal activity given the reconstructed video and ground truth behavior as input. E) Frame-by-frame correlation between reconstructed and ground truth video.

Figure S2: Reconstruction of drifting grating stimuli with different spatial and temporal frequencies using predicted activity. A) Example drifting grating stimuli (rightwards moving) masked with the evaluation mask for one mouse. Shown is the 31st frame of a 2 second video. B) Reconstructed drifting grating stimuli with SOTA DNEM predicted neuronal activity as the target (see also Supplementary Video 3: YouTube link). C) Pearson’s correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal frequencies. For each stimulus type, the average correlation across 4 directions (up, down, left, right) reconstructed from the SOTA DNEM of 1 mouse is given. Interestingly, video correlation at 15 cycles/second (half the video frame rate of 30 Hz) is much higher than at 7.5 cycles/second. This is an artifact of using predicted responses rather than true neural responses: the DNEM input layer convolution has a dilation of 2, so the predicted activity is based on every second frame, with the effect that the activity is predicted as the responses to two static images which are then interleaved.

Figure S3: Reconstruction performance correlates with video motion energy and frame contrast but not with behavioral parameters. Pearson’s correlation between mean frame correlation per movie and 3 movie parameters and 3 behavioral parameters. Linear fit as black line.

Algorithm 1: Movie reconstruction

Algorithm 2: Mask training

Acknowledgements

We would like to thank Emmanuel Bauer, Sandra Reinert and the anonymous reviewers for useful input and discussions, and Tom George for the Gaussian noise stimulus set. T.W.M. is funded by The Wellcome Trust (219627/Z/19/Z; 214333/Z/18/Z) and Gatsby Charitable Foundation (GAT3755) and J.B. is funded by EMBO (ALTF 415-2024).

Additional information

5 Code

The code is available at https://github.com/Joel-Bauer/movie_reconstruction_code.

Additional files

Supplementary Videos.

Supplementary Video 1: Reconstructed natural videos from mouse brain activity. Odd rows are ground truth (GT) movie clips presented to mice. Even rows are the reconstructed movies from the activity of 8000 V1 neurons. Reconstructed movies are smoothed (σ = 0.5 pixels), masked, and contrast (std) and luminance (mean) matched to ground truth movies.

Supplementary Video 2: Gaussian noise stimuli and reconstructions. Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for 1 mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.

Supplementary Video 3: Drifting grating stimuli and reconstructions. Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for 1 mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.