Movie reconstruction from mouse visual cortex activity
Figures
Video reconstruction from neuronal activity in mouse V1 (data provided by the Sensorium 2023 competition; Turishcheva et al., 2023; Turishcheva et al., 2024) using a state-of-the-art (SOTA) dynamic neural encoding model (DNEM; Baikulov, 2023a).
(A) Dynamic neural encoding models (DNEMs) predict neuronal activity from mouse primary visual cortex, given a video and behavioral input. (B) We use a SOTA DNEM to reconstruct part of the input video given neuronal population activity, using gradient descent to optimize the input. (C) Poisson negative log-likelihood loss across training steps between ground truth neuronal activity and predicted neuronal activity in response to reconstructed videos. Left: all 50 videos from 5 mice for one model. Right: average loss across all videos for seven model instances. (D) Spatio-temporal (pixel-by-pixel) correlation between reconstructed video and ground truth video.
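The optimization in (B) can be sketched with a toy stand-in for the DNEM. The linear-exponential encoder below is an illustrative assumption, not the actual SOTA DNEM architecture, but the principle is the same: treat the input pixels as free parameters and run gradient descent on the Poisson negative log-likelihood between target and predicted activity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the frozen encoding model: a linear-exponential
# encoder mapping a flattened video x to firing rates of n_neurons.
n_pixels, n_neurons = 64, 32
W = rng.normal(scale=0.1, size=(n_neurons, n_pixels))

def predict(x):
    return np.exp(W @ x)  # predicted rates, always positive

def poisson_nll(rate, target):
    # Poisson negative log-likelihood (constant terms dropped)
    return np.sum(rate - target * np.log(rate + 1e-9))

# "Ground truth" activity comes from a hidden true video.
x_true = rng.normal(size=n_pixels)
target = predict(x_true)

# Reconstruction: gradient descent on the input pixels themselves.
x = np.zeros(n_pixels)
lr = 0.05
losses = []
for _ in range(200):
    rate = predict(x)
    losses.append(poisson_nll(rate, target))
    # for this encoder, d(nll)/dx = W.T @ (rate - target)
    grad = W.T @ (rate - target)
    x -= lr * grad
```

In the paper the gradient flows through the full DNEM (e.g. via automatic differentiation) rather than this closed-form toy gradient, but the loss curve behaves as in panel (C): it decreases as the input video converges.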
-
Figure 1—source data 1
Source data to Figure 1.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig1-data1-v1.xlsx
Summary ethogram of state-of-the-art (SOTA) dynamic neural encoding model (DNEM) inputs, output predictions, and video reconstruction over time for three videos from three mice (same as Figure 2A).
(A) Top: motion energy of the input video. Bottom: pupil diameter and running speed of the mouse during the video. (B) Ground truth neuronal activity. (C) Predicted neuronal activity in response to the input video and behavioral parameters. (D) Predicted neuronal activity given the reconstructed video and ground truth behavior as input. (E) Frame-by-frame correlation between reconstructed and ground truth video.
-
Figure 1—figure supplement 1—source data 1
Source data to Figure 1—figure supplement 1.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig1-figsupp1-data1-v1.xlsx
Variations on the reconstruction method.
(A) Example movie frames (from mouse 4, trial 1) using variations of the reconstruction method. (B) Same as A but using different training mask thresholds. No evaluation mask is applied here. (C) Left: reconstruction performance for different reconstruction versions. The version without contrast and luminance adjustment is not included because video correlation is always calculated before contrast and luminance adjustment. Reconstruction from predicted activity vs standard method (34.7% increase; paired t-test p=1.5 × 10⁻⁵, n=5 mice), standard method vs gradient ensembling (3.00% decrease; paired t-test p=0.0198, n=5 mice), standard method vs no Gaussian smoothing (6.28% decrease; paired t-test p=6.26 × 10⁻⁶, n=5 mice). Middle: reconstruction performance with different evaluation mask thresholds compared across the three training mask thresholds shown in B. Right: same as middle but plotting the mask diameter for each evaluation mask threshold. (D) Neural activity prediction performance for different movie inputs. Left: Poisson loss, used to train the dynamic neural encoding model (DNEM) and the movie reconstruction. Predicted activity from full video vs masked with alpha = 0.5 (5.02% increase, t-test p=3.71 × 10⁻⁶, n=5 mice), reconstruction vs reconstruction after contrast and luminance matching to ground truth video (0.227% increase, paired t-test p=0.776, n=5 mice). Right: correlation across all neurons and frames (note this is a different metric from the one used in the Sensorium competition). Predicted activity from full video vs masked with alpha = 0.5 (3.73% increase, t-test p=2.49 × 10⁻⁵, n=5 mice), reconstruction vs reconstruction after contrast and luminance matching to ground truth video (2.49% increase, paired t-test p=0.0195, n=5 mice). In C-D, dashed lines are single mice, and solid lines are means across mice.
-
Figure 1—figure supplement 2—source data 1
Source data to Figure 1—figure supplement 2.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig1-figsupp2-data1-v1.xlsx
Receptive fields and transparency masks.
(A) One example receptive field for one neuron from each mouse, mapped using on and off patch stimuli in silico. (B) Average population receptive fields for each mouse. (C) Distribution of on and off receptive field centers for each mouse. (D) Unthresholded alpha masks, i.e., transparency masks, for each mouse. (E) Pixel-wise temporal correlation between ground truth and reconstructed videos with either the training or the evaluation mask applied. Dashed lines in C-E indicate retinotopic eccentricity in steps of 10°. Plot limits correspond to screen size.
-
Figure 1—figure supplement 3—source data 1
Source data to Figure 1—figure supplement 3.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig1-figsupp3-data1-v1.xlsx
Reconstruction performance.
(A) Three reconstructions of 10 s videos from different mice (see Video 1 for the full set). Reconstructions have been luminance (mean pixel value across video) and contrast (standard deviation of pixel values across video) matched to ground truth. (B) The reconstructed videos are highly correlated with ground truth in both spatio-temporal correlation (mean Pearson’s correlation r=0.569 with 95% CIs 0.542–0.596, t-test between ground truth and random video p=6.69 × 10⁻⁴⁹, n=50 videos from 5 mice) and mean frame correlation (mean Pearson’s correlation r=0.512 with 95% CIs 0.481–0.543, t-test between ground truth and random video p=4.29 × 10⁻⁴⁵, n=50 videos from 5 mice).
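The luminance/contrast matching and the spatio-temporal correlation metric described above can be sketched as follows, assuming videos are NumPy arrays of shape frames × height × width (the function names are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def match_lum_contrast(recon, gt):
    # z-score the reconstruction, then rescale to the ground truth's
    # contrast (std of pixel values) and luminance (mean pixel value)
    z = (recon - recon.mean()) / (recon.std() + 1e-9)
    return z * gt.std() + gt.mean()

def spatiotemporal_corr(a, b):
    # Pearson's r over all pixels and frames jointly
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

gt = rng.random((30, 36, 64))                 # frames x height x width
recon = 2.0 * rng.random((30, 36, 64)) + 1.0  # wrong mean and std
matched = match_lum_contrast(recon, gt)       # now matches gt's stats
```

Because matching only shifts and rescales pixel values globally, it leaves Pearson's correlation unchanged, which is why the legend of Figure 1—figure supplement 2 notes that video correlation is always computed before the adjustment.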
-
Figure 2—source data 1
Source data to Figure 2.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig2-data1-v1.xlsx
Reconstruction performance correlates with frame contrast but not with behavioral parameters.
(A) Pearson’s correlation between the mean frame correlation per movie and three movie parameters as well as three behavioral parameters. Linear fit shown as a black line. (B) Left: Pearson’s correlation between activity prediction accuracy and movie reconstruction accuracy. Right: cross-correlation of frame-by-frame activity prediction accuracy and video frame correlation. In other words, the more predictable the neural activity, the better the reconstruction performance.
-
Figure 2—figure supplement 1—source data 1
Source data to Figure 2—figure supplement 1.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig2-figsupp1-data1-v1.xlsx
Model ensembling.
Mean video correlation improves when predictions from multiple models are averaged. Dashed lines are individual animals, and the solid line is the mean. One-way repeated measures ANOVA p=1.11 × 10⁻¹⁶. Bonferroni-corrected paired t-tests between consecutive ensemble sizes are all p<0.001, n=5 mice.
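The ensembling benefit has a simple statistical reading: independent model errors average out. A minimal numerical sketch with toy signals (not the paper's data or models):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend each "model instance" recovers the true signal plus
# independent noise; averaging instances cancels the noise.
truth = rng.normal(size=1000)
recons = [truth + rng.normal(scale=1.0, size=1000) for _ in range(7)]

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

single = corr(recons[0], truth)                  # one model instance
ensemble = corr(np.mean(recons, axis=0), truth)  # seven averaged
```

With independent, equal-variance errors the expected correlation rises from roughly 1/√2 for one instance toward 1 as more instances are averaged, consistent with the monotonic improvement across ensemble sizes shown in the figure.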
-
Figure 3—source data 1
Source data to Figure 3.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig3-data1-v1.xlsx
Reconstruction of Gaussian noise across the spatial and temporal spectrum using predicted activity.
(A) Example Gaussian noise stimulus set with evaluation mask for one mouse. Shown is the last frame of a 2 s video. (B) Reconstructed Gaussian stimuli with state-of-the-art (SOTA) dynamic neural encoding model (DNEM) predicted neuronal activity as the target (see also Video 2). (C) Video correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal length constants. For each stimulus type, the average correlation across five movies reconstructed from the SOTA DNEM of 3 mice is given. (D) Video correlation between reconstructions from phase-inverted Gaussian noise stimuli.
Reconstruction of drifting grating stimuli with different spatial and temporal frequencies using predicted activity.
(A) Example drifting grating stimuli (rightwards moving) masked with the evaluation mask for one mouse. Shown is the 31st frame of a 2 s video. (B) Reconstructed drifting grating stimuli with state-of-the-art (SOTA) dynamic neural encoding model (DNEM) predicted neuronal activity as the target (see also Video 3). (C) Video correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal frequencies. For each stimulus type, the average correlation across four directions (up, down, left, right) reconstructed from the SOTA DNEM of 1 mouse is given. Interestingly, video correlation at 15 cycles/s (half the video frame rate of 30 Hz) is much higher than at 7.5 cycles/s. This is an artifact of using predicted responses rather than true neural responses: the DNEM input-layer convolution has dilation 2, so the predicted activity is based on every second frame, with the effect that the activity is predicted as the responses to two static images that are then interleaved.
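The dilation-2 artifact described above can be illustrated numerically with a toy single-pixel signal (an illustration only, not the actual DNEM input): at 15 cycles/s and a 30 Hz frame rate, the grating's phase advances by exactly half a cycle per frame, so every second frame is identical and each dilated stream sees a static image.

```python
import numpy as np

fps = 30                 # video frame rate
t = np.arange(8) / fps   # eight consecutive frame times

# temporal luminance of one pixel under a drifting grating
sig_15 = np.cos(2 * np.pi * 15.0 * t)   # 15 cycles/s = half the frame rate
sig_7_5 = np.cos(2 * np.pi * 7.5 * t)   # 7.5 cycles/s

# a dilation-2 convolution only ever sees every second frame:
even_15 = sig_15[::2]    # constant -> looks like a static image
even_7_5 = sig_7_5[::2]  # still varies frame to frame
```

For the 15 cycles/s grating the even-indexed frames are all identical (and the odd-indexed frames are their phase-inverted counterparts), so each subsampled stream reduces to a static image; at 7.5 cycles/s the subsampled frames still change, and no such shortcut exists.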
-
Figure 4—figure supplement 1—source data 1
Source data to Figure 4—figure supplement 1.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig4-figsupp1-data1-v1.xlsx
Gaussian noise reconstruction: Shannon entropy and motion energy.
(A) Average frame Shannon entropy (a measure of variance in the spatial domain) across Gaussian noise stimuli with various spatial and temporal Gaussian length constants. Left: ground truth stimuli. Right: reconstructions from predicted activity. (B) Same as A but for motion energy (a measure of variance in the temporal domain). (C) Ensembling effect for each stimulus. Video correlation for the ensembled prediction (averaged videos from seven model instances) minus the mean video correlation across the seven individual model instances.
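The two variance measures used in (A) and (B) can be sketched as follows, assuming videos as NumPy arrays with pixel values in [0, 1] (function names and the histogram bin count are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_entropy(frame, bins=32):
    # Shannon entropy (bits) of the pixel-intensity histogram:
    # higher values indicate more variance in the spatial domain
    hist, _ = np.histogram(frame, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def motion_energy(video):
    # mean absolute frame-to-frame pixel change:
    # a measure of variance in the temporal domain
    return np.mean(np.abs(np.diff(video, axis=0)))

noise_frame = rng.random((36, 64))           # high spatial variance
flat_frame = np.full((36, 64), 0.5)          # no spatial variance
static_video = np.stack([noise_frame] * 10)  # no temporal variance
```

A uniform-noise frame approaches the maximum entropy of log2(bins) bits, a constant frame has zero entropy, and a static video has zero motion energy, matching the intuition behind the left/right and top/bottom extremes of the heat maps.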
-
Figure 4—figure supplement 2—source data 1
Source data to Figure 4—figure supplement 2.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig4-figsupp2-data1-v1.xlsx
Video reconstruction using fewer neurons (i.e. population ablation) leads to lower reconstruction quality.
Dashed lines are individual animals, and the solid line is the mean. One-way repeated measures ANOVA p=5.70 × 10⁻¹³. Bonferroni-corrected paired t-tests between consecutive drops in population size are all p<0.001, n=5 mice.
-
Figure 5—source data 1
Source data to Figure 5.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig5-data1-v1.xlsx
Comparison of reconstructions from experimental responses vs expected responses and their visualization as error maps.
(A) Frame-by-frame correlation between reconstructed and ground truth video for mouse 1, trial 7 (same as Figure 1—figure supplement 1E). (B) From left to right: experimental (ground truth) neural activity, neural activity predicted by the dynamic neural encoding model (DNEM) from the ground truth video, and neural activity predicted by the DNEM based on the reconstructed movie. (C) Difference between the correlation of the true neural response with the predicted neural response from the ground truth movie, and the correlation of the true neural response with the predicted neural response from the reconstructed movie. (D) Nine frames from mouse 1, trial 7. From top to bottom: reconstructed movie, reconstructed movie from the predicted neural response to the ground truth movie, and ground truth movie with an overlaid heatmap of the difference between the two reconstructions (error map). (E) Error map of one frame from all 50 movie clips. Each row is 10 trials from one mouse. See also Video 4.
-
Figure 6—source data 1
Source data to Figure 6.
- https://cdn.elifesciences.org/articles/105081/elife-105081-fig6-data1-v1.xlsx
Exclusion of movie clips with duplicates in the DNEM training data.
(A) Example frame of a reconstructed movie (ground truth) and the most correlated first frame from the training data. (B) All movie clips and their corresponding most correlated clip from the training data. Red boxes indicate excluded duplicates.
Videos
Reconstructed natural videos from mouse brain activity.
Odd rows are ground truth (GT) movie clips presented to mice. Even rows are the reconstructed movies from the activity of ≈8000 V1 neurons. Reconstructed movies are smoothed (σ=0.5 pixels), masked, and contrast (std) and luminance (mean) matched to ground truth movies.
Gaussian noise stimuli and reconstructions.
Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for one mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.
Drifting grating stimuli and reconstructions.
Odd rows are ground truth (GT) video inputs to the model. Even rows are the reconstructed videos from the predicted neuronal activity for one mouse. Reconstructed movies are masked, and contrast (std) and luminance (mean) matched to ground truth videos.
Reconstruction error maps.
Pixel error = reconstructions from experimental neural activity – reconstructions from expected neural activity. Over- and underestimations of pixel values are shown as hot and cold heat maps, respectively. Ground truth movies in gray. Each row is 10 trials from one mouse, played at half speed.
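The pixel-error definition above amounts to a signed difference image. A minimal sketch (the function and array names are hypothetical):

```python
import numpy as np

def error_map(recon_experimental, recon_expected):
    # signed pixel error: positive values = overestimation (hot),
    # negative values = underestimation (cold), zero = agreement
    return recon_experimental - recon_expected

recon_exp = np.array([[0.8, 0.2],
                      [0.5, 0.5]])  # from experimental activity
recon_prd = np.array([[0.5, 0.5],
                      [0.5, 0.5]])  # from expected (predicted) activity
err = error_map(recon_exp, recon_prd)
```

For display, the positive and negative parts of `err` would be mapped to the hot and cold colormaps respectively and overlaid on the grayscale ground truth frame.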