Figures and data

Video reconstruction from neuronal activity in mouse V1 (data provided by the Sensorium 2023 competition; [Turishcheva et al., 2023, 2024]) using a state-of-the-art (SOTA) dynamic neural encoding model (DNEM; [Baikulov, 2023]).
A) DNEMs predict neuronal activity from mouse primary visual cortex, given a video and behavioural input. B) We use a SOTA DNEM to reconstruct part of the input video given neuronal population activity, using gradient descent to optimize the input. C) Poisson negative log-likelihood loss across training steps between ground truth neuronal activity and predicted neuronal activity in response to videos. Left: all 50 videos from 5 mice for one model. Right: average loss across all videos for 7 model instances. D) Spatiotemporal (pixel-by-pixel) correlation between reconstructed video and ground truth video.
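
The optimization in panel B can be summarized in a few lines. The sketch below is illustrative only, assuming a differentiable PyTorch DNEM callable as model(video, behaviour) that returns predicted firing rates; the function name, argument layout and optimizer settings are assumptions, not the Sensorium 2023 code.

```python
import torch

def reconstruct_video(model, target_activity, behaviour, video_shape,
                      n_steps=1000, lr=0.05):
    # Start from a uniform grey video and optimize its pixels directly.
    video = torch.full(video_shape, 0.5, requires_grad=True)
    optimizer = torch.optim.Adam([video], lr=lr)
    for _ in range(n_steps):
        optimizer.zero_grad()
        predicted = model(video, behaviour)  # predicted neuronal activity
        # Poisson negative log-likelihood between predicted and ground truth activity.
        loss = torch.nn.functional.poisson_nll_loss(
            predicted, target_activity, log_input=False)
        loss.backward()        # gradients flow back to the video pixels
        optimizer.step()
        with torch.no_grad():
            video.clamp_(0.0, 1.0)  # keep pixel values in a valid range
    return video.detach()
```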

Reconstruction performance.
A) Three reconstructions of 10s videos from different mice (see Supplementary Video 1). Reconstructions have been luminance-matched (mean pixel value across the video) and contrast-matched (standard deviation of pixel values across the video) to the ground truth. B) The reconstructed videos have high correlation to ground truth in both spatiotemporal correlation (mean Pearson’s correlation r = 0.569 with 95% CIs 0.542 to 0.596, t-test between ground truth and random video p = 6.69 * 10^-49, n = 50 videos from 5 mice) and mean frame correlation (mean Pearson’s correlation r = 0.512 with 95% CIs 0.481 to 0.543, t-test between ground truth and random video p = 4.29 * 10^-45, n = 50 videos from 5 mice).
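
As a concrete reading of the luminance/contrast matching and the two correlation metrics, here is a small numpy sketch; array shapes (frames, height, width) and function names are assumptions for illustration.

```python
import numpy as np

def match_luminance_contrast(recon, ground_truth):
    # Shift/scale the reconstruction so its mean (luminance) and standard
    # deviation (contrast) over all pixels match the ground truth video.
    z = (recon - recon.mean()) / recon.std()
    return z * ground_truth.std() + ground_truth.mean()

def spatiotemporal_correlation(recon, ground_truth):
    # Pearson correlation over every pixel of every frame.
    return np.corrcoef(recon.ravel(), ground_truth.ravel())[0, 1]

def mean_frame_correlation(recon, ground_truth):
    # Pearson correlation computed per frame, then averaged over frames.
    return np.mean([np.corrcoef(r.ravel(), g.ravel())[0, 1]
                    for r, g in zip(recon, ground_truth)])
```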

Model ensembling.
Mean video correlation is improved when predictions from multiple models are averaged. Dashed lines are individual animals, solid line is mean. One-way repeated measures ANOVA p = 1.11 * 10^-16. Bonferroni corrected paired t-test outcomes between consecutive ensemble sizes are all p < 0.001, n = 5 mice.
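
Ensembling here amounts to averaging the reconstructed videos from several independently trained model instances before computing video correlation; a minimal sketch, assuming the reconstructions share a common shape:

```python
import numpy as np

def ensemble_reconstruction(reconstructions):
    # Pixel-wise mean over reconstructions from multiple model instances,
    # e.g. for ensemble sizes 1..7 as in the figure.
    return np.mean(np.stack(reconstructions, axis=0), axis=0)
```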

Reconstruction of Gaussian noise across the spatial and temporal spectrum using predicted activity.
A) Example Gaussian noise stimulus set with evaluation mask for one mouse. Shown is the last frame of a 2-second video. B) Reconstructed Gaussian stimuli with SOTA DNEM predicted neuronal activity as the target (see also Supplementary Video 2). C) Pearson’s correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal length constants. For each stimulus type, the average correlation across 5 movies reconstructed from the SOTA DNEMs of 3 mice is given. D) Pearson’s correlation between reconstructions from phase-inverted Gaussian noise stimuli.
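
A plausible way to generate such stimuli is to smooth white Gaussian noise with separate temporal and spatial Gaussian kernels; the resolution, frame count and sigma units below are assumptions, not the exact stimulus parameters used here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_noise_stimulus(n_frames=60, height=36, width=64,
                            sigma_t=2.0, sigma_s=4.0, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_frames, height, width))
    # Smooth along time (axis 0) and space (axes 1 and 2); the sigmas play the
    # role of the temporal and spatial length constants varied in the figure.
    smoothed = gaussian_filter(noise, sigma=(sigma_t, sigma_s, sigma_s))
    # Rescale to the [0, 1] pixel range before presenting to the model.
    smoothed -= smoothed.min()
    return smoothed / smoothed.max()
```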

Video reconstruction using fewer neurons (i.e. population ablation) leads to lower reconstruction quality.
Dashed lines are individual animals, solid line is mean. One-way repeated measures ANOVA p = 5.70 * 10^-13. Bonferroni corrected paired t-test outcomes between consecutive drops in population size are all p < 0.001, n = 5 mice.
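
Population ablation can be read as restricting the reconstruction target to a subset of the recorded neurons; the random subsampling below is one way to implement it and is an illustrative assumption.

```python
import numpy as np

def ablate_population(activity, fraction_kept, seed=0):
    # activity: array of shape (neurons, time); keep a random subset of neurons
    # and use only their responses as the reconstruction target.
    rng = np.random.default_rng(seed)
    n_keep = int(round(activity.shape[0] * fraction_kept))
    kept = rng.choice(activity.shape[0], size=n_keep, replace=False)
    return activity[kept], kept
```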

Comparison of reconstructions from experimental responses vs expected responses and their visualization as error maps.
A) Frame-by-frame correlation between reconstructed and ground truth video for mouse 1 trial 7 (same as Figure S1). B) From left to right: experimental (ground truth) neural activity y, neural activity predicted by DNEM from ground truth video 

Summary ethogram of SOTA DNEM inputs, output predictions, and video reconstruction over time for three videos from three mice (same as Figure 2A).
A) Top: motion energy of the input video. Bottom: pupil diameter and running speed of the mouse during the video. B) Ground truth neuronal activity. C) Predicted neuronal activity in response to input video and behavioural parameters. D) Predicted neuronal activity given reconstructed video and ground truth behaviour as input. E) Frame-by-frame correlation between reconstructed and ground truth video.
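
Motion energy in panel A can be quantified, for example, as the mean absolute difference between consecutive frames; this is one common definition and may differ from the exact one used for the figure.

```python
import numpy as np

def motion_energy(video):
    # video: array of shape (frames, height, width); returns one value per
    # frame transition, the mean absolute change between consecutive frames.
    return np.abs(np.diff(video, axis=0)).mean(axis=(1, 2))
```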

Receptive fields and transparency masks.
A) One example receptive field for one neuron from each mouse mapped using on and off patch stimuli in silico. B) Average population receptive fields from each mouse. C) Distribution of on and off receptive field centers for each mouse. D) Unthresholded alpha masks, i.e. transparency masks, for each mouse. E) Pixel-wise temporal correlation between ground truth and reconstructed videos with either the training or the evaluation mask applied. Dashed lines in C-E indicate retinotopic eccentricity in steps of 10°. Plot limits correspond to screen size.
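
In silico receptive-field mapping (panel A) presents small patch stimuli to the model and reads out the predicted response of a single neuron; the sketch below uses bright ("on") square patches on a grey background, with resolution, patch size and output indexing as illustrative assumptions.

```python
import torch

def map_rf_in_silico(model, behaviour, neuron, frame_shape=(36, 64),
                     patch=4, n_frames=30, background=0.5):
    h, w = frame_shape
    rf = torch.zeros(h // patch, w // patch)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            # Grey video with one bright ("on") patch; "off" patches would be dark.
            video = torch.full((n_frames, h, w), background)
            video[:, i:i + patch, j:j + patch] = 1.0
            with torch.no_grad():
                response = model(video, behaviour)  # assumed shape (frames, neurons)
            rf[i // patch, j // patch] = response[:, neuron].mean()
    return rf
```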

Variations on reconstruction method.
A) Example movie frames (from mouse 4 trial 1) using variations of the reconstruction method. B) Same as A but using different training mask thresholds. No evaluation mask is applied here. C) Left: reconstruction performance for different reconstruction versions. The version without contrast and luminance adjustment is not included because video correlation is always calculated before contrast and luminance adjustment. Reconstruction from predicted activity vs standard method (34.7% increase; paired t-test p = 1.5 * 10^-5, n = 5 mice), standard method vs gradient ensembling (3.00% decrease; paired t-test p = 0.0198, n = 5 mice), standard method vs no Gaussian smoothing (6.28% decrease; paired t-test p = 6.26 * 10^-6, n = 5 mice). Middle: reconstruction performance with different evaluation mask thresholds compared across the three training mask thresholds shown in B. Right: same as middle but plotting the mask diameter for each evaluation mask threshold. D) Neural activity prediction performance for different movie inputs. Left: Poisson loss, used to train the DNEM and movie reconstruction. Predicted activity from full video vs masked with alpha = 0.5 (5.02% increase, t-test p = 3.71 * 10^-6, n = 5 mice), reconstruction vs reconstruction after contrast & luminance matching to ground truth video (0.227% increase, paired t-test p = 0.776, n = 5 mice). Right: correlation across all neurons and frames (note this is a different metric to the one used in the Sensorium competition). Predicted activity from full video vs masked with alpha = 0.5 (3.73% increase, t-test p = 2.49 * 10^-5, n = 5 mice), reconstruction vs reconstruction after contrast & luminance matching to ground truth video (2.49% increase, paired t-test p = 0.0195, n = 5 mice). In C-D dashed lines are single mice, solid lines are means across mice.
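
The mask comparisons in panels C-D threshold the alpha (transparency) mask and restrict the correlation to pixels inside it; a small sketch follows, where the mask semantics and threshold handling are assumptions based on the caption.

```python
import numpy as np

def masked_video_correlation(recon, ground_truth, alpha_mask, threshold=0.5):
    # Keep only pixels whose alpha (transparency) value exceeds the threshold,
    # then compute the Pearson correlation over those pixels and all frames.
    keep = alpha_mask > threshold
    return np.corrcoef(recon[:, keep].ravel(),
                       ground_truth[:, keep].ravel())[0, 1]
```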

Reconstruction performance correlates with frame contrast but not with behavioural parameters.
A) Pearson’s correlation between mean frame correlation per movie and each of 3 movie parameters and 3 behavioural parameters. Linear fit as black line. B) Left: Pearson’s correlation between activity prediction accuracy and movie reconstruction accuracy. Right: cross-correlation plot of frame-by-frame activity prediction accuracy and video frame correlation. In other words, the more predictable the neural activity, the better the reconstruction performance.

Gaussian noise reconstruction.
A) Average frame Shannon entropy (a measure of variance in the spatial domain) across Gaussian noise stimuli with various spatial and temporal Gaussian length constants. Left: ground truth stimuli. Right: reconstructions from predicted activity. B) Same as A but for motion energy (a measure of variance in the temporal domain). C) Ensembling effect for each stimulus. Video correlation for ensembled prediction (average videos from 7 model instances) minus the mean video correlation across the 7 individual model instances.
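
Frame Shannon entropy (panel A) can be computed from the pixel-intensity histogram of each frame and averaged across frames; the bin count and intensity range below are assumptions.

```python
import numpy as np

def mean_frame_entropy(video, n_bins=256):
    # Shannon entropy (bits) of each frame's pixel-intensity histogram,
    # averaged over frames; higher values indicate more spatial variance.
    entropies = []
    for frame in video:
        counts, _ = np.histogram(frame, bins=n_bins, range=(0.0, 1.0))
        p = counts[counts > 0] / counts.sum()
        entropies.append(-(p * np.log2(p)).sum())
    return float(np.mean(entropies))
```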

Reconstruction of drifting grating stimuli with different spatial and temporal frequencies using predicted activity.
A) Example drifting grating stimuli (rightwards moving) masked with the evaluation mask for one mouse. Shown is the 31st frame of a 2-second video. B) Reconstructed drifting grating stimuli with SOTA DNEM predicted neuronal activity as the target (see also Supplementary Video 3). C) Pearson’s correlation between ground truth (A) and reconstructed videos (B) across the range of spatial and temporal frequencies. For each stimulus type, the average correlation across 4 directions (up, down, left, right) reconstructed from the SOTA DNEM of 1 mouse is given. Interestingly, video correlation at 15 cycles/second (half the video frame rate of 30 Hz) is much higher than at 7.5 cycles/second. This is an artifact of using predicted responses rather than true neural responses: the DNEM input-layer convolution has dilation 2, so the predicted activity is based on every second frame, with the effect that the activity is predicted as the response to two static images that are then interleaved.
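
The dilation-2 artifact is easiest to see with an explicit grating: at 15 cycles/second and a 30 Hz frame rate, the grating phase advances by half a cycle each frame, so a model that only sees every second frame receives a static image. The generator below is an illustrative sketch; spatial frequency is given in cycles across the frame width rather than the units used in the figure.

```python
import numpy as np

def drifting_grating(n_frames=60, height=36, width=64, fps=30,
                     cycles_across_width=3.0, temporal_freq_hz=2.0):
    # Horizontal sinusoidal grating drifting rightwards; pixel values in [0, 1].
    x = np.linspace(0.0, 2 * np.pi * cycles_across_width, width, endpoint=False)
    frames = []
    for t in range(n_frames):
        phase = 2 * np.pi * temporal_freq_hz * t / fps
        frames.append(np.tile(0.5 + 0.5 * np.sin(x - phase), (height, 1)))
    return np.stack(frames)

# At temporal_freq_hz = fps / 2 the phase advances by pi per frame, so frames
# taken two apart (as seen by an input convolution with dilation 2) are identical.
```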