Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife’s peer review process.

Editors
- Reviewing Editor: Gordon Berman, Emory University, Atlanta, United States of America
- Senior Editor: Timothy Behrens, University of Oxford, Oxford, United Kingdom
Reviewer #1 (Public Review):
In this manuscript, the authors present a valuable new method to represent animal behavior from video data using a variational autoencoder framework that disentangles individual-specific and background variance from variables that can be more reliably compared across individuals. They achieve this aim through the use of a novel Cauchy-Schwarz (C-S) regularization term in their loss function that leads to latents that model continuously varying features in the images. The authors present a variety of validations for the method, including testing across sessions and individuals for a head-fixed task. They also show how the method could be used for behavioral decoding from neural data and for quantifying social behavior in mice, demonstrating its applicability outside of head-fixed environments and for different measurement modalities. While some areas of confusion and questions about the validation remain, this is an overall strong paper and an important contribution to the field.
Strengths:
- The use of the C-S regularizer is a novel approach that has potential for wide use across experimental paradigms and model organisms
- The extent of the validations performed was solid, although perhaps not as convincing in a couple of cases as might be ideal
- The GitHub code demo worked well, and the code appears to be accessible and well-written
Weaknesses:
- Some of the validation figures were a bit unclear in their presentation, making it difficult to assess exactly what had been tested
- It is possible that I missed this, but the authors didn't really provide a sense of how to pick a particular distribution to match using the CS term for a specific paradigm/modality, or of how this choice affects the results
- While the authors' statements about individual training vs. transfer learning accuracy and efficiency in Figure 6 are technically true, the effect size is rather small (a few percent at most in each case), so I am not sure how much should be made of these results
- In general, I would have liked to have seen the Discussion section speak more to the choices and limitations inherent in applying the method. How does the choice of prior/metaparameters/architecture/etc. affect the results? In what situations would this method fail? What are the next advances that are necessary for the field to progress?
Reviewer #2 (Public Review):
This paper presents a valuable contribution to ongoing methods for understanding and modeling structure via latent variable models for neural and behavioral data. Building on the PS-VAE model of Whiteway et al. (2021), which posited a division of latent variables into unsupervised (i.e., useful for reconstruction) and supervised (useful for predicting selected labeled features) variables, the authors propose an additional set of "constrained subspace" latent variables that are regularized toward a prespecified prior via a previously proposed Cauchy-Schwarz divergence.
The authors contend that the added CS latents aid in capturing both patterns of covariance across the data and individual-specific features that are of particular benefit in multi-animal experiments, all without requiring additional labels. They substantiate these claims with a series of computational experiments demonstrating that their CS-VAE outperforms the PS-VAE on several tasks, particularly capturing differences between individuals, maintaining consistency in behavioral phenotyping, and predicting correlations with neural data.
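For readers unfamiliar with the regularizer at the center of these claims: the Cauchy-Schwarz divergence can be estimated directly from samples using Gaussian kernels, which is what makes it usable as a VAE penalty against an arbitrary prior. The following is a minimal sketch of such a sample-based estimator, not the authors' implementation; the function names and the NumPy setting are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(x, y, sigma):
    """Pairwise Gaussian kernel values exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cs_divergence(p_samples, q_samples, sigma=1.0):
    """Empirical Cauchy-Schwarz divergence between two sample sets.

    Estimates D_CS = -log( <p,q>^2 / (<p,p> <q,q>) ) via kernel density
    estimates; it is non-negative and near zero when the two sample sets
    come from the same distribution.
    """
    pq = gaussian_gram(p_samples, q_samples, sigma).mean()
    pp = gaussian_gram(p_samples, p_samples, sigma).mean()
    qq = gaussian_gram(q_samples, q_samples, sigma).mean()
    return -np.log(pq ** 2 / (pp * qq))
```

In a CS-VAE-style loss, a term of this form would be computed between a minibatch of encoder outputs and samples drawn from the chosen prior, pulling the constrained-subspace latents toward that prior without requiring a closed-form density.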
Strengths of the present work include an extensive and rigorous set of validation experiments that will be of interest to those analyzing behavioral video. Weaknesses include a lack of discussion of key theoretical ideas motivating the design of the model, including the choice of a Cauchy-Schwarz divergence, the specific form of the prior, and arguments for what sorts of information the CS latents might capture and why. In addition, the model makes use of a moderate number of key hyperparameters whose effect on training outcomes is not extensively analyzed. As a result, the model may be difficult for less experienced users to apply to their own data. Finally, as with many similar VAE approaches, the lack of a ground truth against which to validate means that much of the evidence provided for the model is necessarily subjective, and its appeal lies in the degree to which the discovered latent spaces appear interpretable in particular applications.
In all, this work is a valuable contribution that is likely to have appeal to those interested in applying latent space methods, particularly to multi-animal video data.
Reviewer #3 (Public Review):
As naturalistic neuroscience becomes increasingly popular, the importance of new computational tools that facilitate the study of animals behaving in minimally constrained environments grows. Yi et al. convincingly demonstrate the usefulness of their new method on data from neuroethological studies involving multiple animals, including those with social interactions. Briefly, their method improves upon prior semi-supervised machine learning methods in that extracted latent variables can be more cleanly separated into those representing the behavior of individual subjects and those representing social interactions between subjects. Such an improvement is broadly useful for downstream analysis tasks in multi-subject or social neuroethological studies.
Strengths:
The authors tackle an important problem encountered in behavior analyses in an emerging subfield of neuroscience, naturalistic social neuroscience. They make a case for doing so using semi-supervised methods, a toolbox that balances the competing scientific needs of building models using large neural-behavioral datasets and of model explainability. The paper is well written, with well-designed figures and relevant analyses that make for an enjoyable reading experience.
The authors provide a remarkable variety of examples that make a convincing case for the utility of their method when used by itself or in conjunction with other data analysis techniques commonly used in modern neuroscience (behavioral motif extraction, neural decoding, etc.). The examples show not just that the extracted latents are more disentangled, but also that the improvement in disentangling has positive effects in downstream analysis tasks.
Weaknesses:
While the paper does a great job of applying the method to real-world data, the components of the method itself are not as thoroughly investigated. For example, the contribution of the novel Cauchy-Schwarz regularization technique has not been systematically investigated. This could be done either by sharing additional data in which hyperparameters control the contribution of the regularizer, or by citing relevant papers where such an analysis has been carried out. It would also be valuable to understand what other regularization techniques might potentially have been applicable here.
The authors conclude from their empirical investigations that the specific prior distribution does not matter to the regularization process. This seems reasonable given that the neural network can learn a complex and arbitrary transformation of the data during training. It would be helpful if the authors could cite prior work where this type of prior distribution does matter and explain how their approach differs from such prior work. If there is a visualization- or explainability-related motivation for choosing one prior distribution over another, this could be clarified.