Unsupervised Representation Learning of C. elegans Poses and Behavior Sequences From Microscope Video Recordings

  1. Center for Molecular Medicine Cologne (CMMC), Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
  2. Institute for Biomedical Informatics, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
  3. Faculty of Mathematics and Natural Sciences, University of Cologne, Cologne, Germany
  4. Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), University of Cologne, Cologne, Germany

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Gordon Berman
    Emory University, Atlanta, United States of America
  • Senior Editor
    Aleksandra Walczak
    CNRS, Paris, France

Reviewer #1 (Public review):

Summary:

The submitted article reports the development of an unsupervised learning method that enables quantification of behaviour and poses of C. elegans from 15-minute-long videos and presents a spatial map of both. The pipeline is a two-part process: the first part uses contrastive learning to map spatial poses into an embedding space, while the second part uses a transformer encoder to estimate masked parts of a spatiotemporal sequence.
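To make the two-part structure concrete, the following is a minimal, hypothetical sketch in PyTorch of such a pipeline: a convolutional encoder that embeds single worm frames (trained with a contrastive objective), followed by a transformer encoder that processes short sequences of frame embeddings in which some positions are masked. All module names, layer sizes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (not the authors' code): a convolutional encoder embeds
# single worm frames, and a transformer encoder models short sequences of those
# embeddings with a subset of positions replaced by a learned mask token.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Stage 1: map a single-channel worm image to a pose embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, embed_dim)

    def forward(self, x):                        # x: (batch, 1, H, W)
        return self.head(self.features(x))       # (batch, embed_dim)

class SequenceEncoder(nn.Module):
    """Stage 2: encode a sequence of pose embeddings with some positions masked."""
    def __init__(self, embed_dim=64, seq_len=16):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, pose_seq, mask):           # pose_seq: (B, T, D); mask: (B, T) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(pose_seq), pose_seq)
        return self.encoder(x + self.pos_embed)  # predictions at masked positions

# Toy forward pass with random data (8 clips of 16 frames, 64x64 pixels each)
frames = torch.randn(8 * 16, 1, 64, 64)
poses = FrameEncoder()(frames).reshape(8, 16, 64)
masked_out = SequenceEncoder()(poses, torch.rand(8, 16) < 0.25)   # (8, 16, 64)
```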

Strengths:

This analysis approach will prove to be useful for the C. elegans community. The application of the method to videos of various strains across a range of ages presents a good use case for the approach. The manuscript is well written and presented.

Specific comments:

(1) One of the main motivations, as stated in the introduction and emphasized in the discussion section, is that this approach does not require key-point estimation for skeletonization and does not depend on the eigenworm approach for pose estimation. However, eigenworm data were estimated for the videos used in this work with the Tierpsy tracker and stored as metadata, and these data are subsequently used for interpretation. It is not clear at this point how else the spatial embedded map may be interpreted without this kind of pose estimate obtained from other approaches. Please elaborate and comment.

(2) As per the manuscript, the second part of the pipeline is used to estimate the masked sequences of the spatiotemporal behavioral features. However, it is not clear what the numbers listed in Fig. 2.3 represent.

(3) It is not clear how motion speed is linked to individual poses, as mentioned in Figs. 4(b) and (c).

Reviewer #2 (Public review):

Summary:

The manuscript by Maurice and Katarzyna describes a self-supervised, annotation-free deep-learning approach capable of quantitatively representing complex poses and behaviors of C. elegans directly from video pixel values. Their method overcomes limitations inherent to traditional methods that rely on skeletonization or keypoint tracking, which often fail with highly coiled or self-intersecting worms. By applying self-supervised contrastive learning and a Transformer-based network architecture, the authors successfully capture diverse behavioral patterns and depict the aging trajectory of the behavioral repertoire. This provides a useful new tool for behavioral research in C. elegans and other flexible-bodied organisms.

Strengths:

Reliable tracking and segmentation of complex poses remain significant bottlenecks in C. elegans behavioral research, and the authors made valuable attempts to address these challenges. The presented method offers several advantages over existing tools, including freedom from manual labeling, independence from explicit skeletonization or keypoint tracking, and the capability to capture highly coiled or overlapping poses. Thus, the proposed method would be useful to the C. elegans research community.

The research question is clearly defined. Methods and results are engagingly presented, and the manuscript is concise and well-organized.

Weaknesses:

(1) In the abstract, the claim of an 'unbiased' approach is not well-supported. The method is still affected by dataset biases, as mentioned in the aging results (section 4.3).
(2) In section 3.2, the rationale behind rotating worm images to a vertical orientation is unclear.
(3) The methods section is clearly written but uses overly technical language, making it less accessible to the audience of eLife, the majority of whom are biologists. Clearer explanations of key methods and the rationale behind their selection are needed. For example, in section 3.3, the authors should briefly explain in simple language what contrastive learning is, why they chose it, and why this method potentially achieves their goal.
(4) The reason why the gray data points could not be resolved by Tierpsy is not quantitatively described. Are they all due to heavily coiled or overlapping poses?
(5) In section 4.1, generating pose representations grouped by genetic strains would provide insights into strain-specific differences resolved by the proposed method.
(6) Fig. 3a requires clarification. Highly bent poses (red points) intuitively should be close to highly coiled poses (gray points). The authors should explain the observed greenish/blueish points interfacing with the gray points.
(7) In Fig. 3a, some colored points overlap with the gray point cloud. Why can Tierpsy resolve these overlapping points representing highly coiled poses? A more systematic quantitative comparison between Tierpsy and the proposed method is required.
(8) The claim in section 4.2 regarding strain separation in pose embedding spaces is unsupported by Fig. 3a, which lacks strain-based distinctions. As mentioned in point #5, showing pose representations grouped by different strains is required.
(9) In section 4.2, how could the authors verify the statement, "This likely occurs since most strains share common behaviors such as simple forward locomotion"?
(10) An important weakness of the proposed method is its low direct interpretability, as it is not based on handcrafted features. To better interpret the pose/behavior embedding space, it would be helpful to compare it against more basic Tierpsy features in Figs. 3 and 4. This comparison could reveal which understandable features were learned by the neural network, thereby increasing human interpretability.
(11) The main conclusion of section 4.3 is not sufficiently tested. Is Fig. 5a generated only from data of N2 animals? To quantitatively verify the statement, "Young individuals appear to display a wide range of behaviors, while as they age their behavior repertoire reduces," the authors should perform a formal analysis of behavioral variability throughout aging.
(12) In Fig. 5a, better visualization of aging trajectories could include plotting the center of mass along with variance of the point cloud over time.
(13) To better reveal aging trajectories of behavioral changes for different genetic backgrounds, it would be meaningful to generate behavior representations for different strains as they age.
(14) As a methods paper, the ease of use for other researchers should be explicitly addressed, and source code and datasets should be provided.

Reviewer #3 (Public review):

Summary:

In this paper, the authors present an unsupervised learning approach to represent C. elegans poses and temporal sequences of poses in low-dimensional spaces by directly using pixel values from video frames. The method does not rely on the exact identification of the worm's contour/midline, nor on the identification of the head and tail prior to analyzing behavioral parameters. In particular, using contrastive learning, the model represents worm poses in low-dimensional spaces, while a transformer encoder neural network embeds sequences of worm postures over short time scales. The study evaluates this newly developed method using a dataset of different C. elegans genetic strains and aging individuals. The authors compared the representations inferred by the unsupervised learning with features extracted by an established approach, which relies on direct identification of the worm's posture and its head-tail direction.

Strengths:

The newly developed method provides a coarse classification of C. elegans posture types in a low-dimensional space using a relatively simple approach that directly analyzes video frames. The authors demonstrate that representations of postures or movements of different genotypes, based on pixel values, can be distinguishable to some extent.

Weaknesses:

- A significant disadvantage of the presented method is that it does not include the direction of the worm's body (e.g., head/tail identification). This highly limits the detailed and comprehensive identification of the worm's behavioral repertoire (on- and off-food), which requires body directionality in order to infer behaviors (for example, classifying forward vs. reverse movements). In addition, including a mix of opposite postures as input to the new method may create significant classification artifacts in the low-dimensional representation, such that, for example, curvature at opposite parts of the body could cluster together. This concern applies both to the representation of individual postures and to the representation of sequences of postures.
- The authors state that head-tail direction can be inferred during forward movement. This is true when individuals are measured off-food, where they are highly likely to move forward. However, when animals are grown on food, head-tail identification can also be based on quantifying the speed of the two ends of the worm (the head shows side-to-side movements). This does not require identifying morphological features. See, for example, Harel et al. (2024) or Yemini et al. (2013).
- Another confounding parameter that cannot be distinguished using the presented method is the size of individuals. Size can differ between genotypes, as well as with aging. This can potentially lead to clustering of individuals based on their size rather than behavior.
- There is no quantitative comparison between classification based on the presented method and methods that rely on identifying the skeleton.

Author response:

We thank the editors and the reviewers for their valuable comments and for taking the time to evaluate our manuscript.

Answers to Reviewer 1:

(1) The core contribution of our method is that it learns meaningful spatiotemporal embeddings directly from image data without requiring pose estimation or eigenworm-based features as input. The learned embedding space can serve as a foundation for downstream tasks such as behavioral classification, clustering, or anomaly detection, further supporting its utility beyond visualization through eigenworm-derived features. Here we use the Tierpsy-derived features for latent space interpretation and to validate that our approach does indeed encode meaningful postural information. Additionally, without any Tierpsy-calculated features, users can still color embeddings by known metadata such as mutation or age and compare different strains to each other.
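As an illustration of such downstream use, a hedged sketch that clusters behavior-sequence embeddings and compares cluster occupancy across strains; the file names, array shapes, and cluster count are assumptions for demonstration only.

```python
# Hypothetical downstream analysis (file names, shapes, and cluster count assumed):
# cluster learned behavior-sequence embeddings and compare cluster occupancy per strain.
import numpy as np
from sklearn.cluster import KMeans

embeddings = np.load("behavior_embeddings.npy")   # assumed: (n_sequences, embed_dim)
strains = np.load("strain_labels.npy")            # assumed: one strain label per sequence

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)
for strain in np.unique(strains):
    occupancy = np.bincount(labels[strains == strain], minlength=8)
    print(strain, occupancy / occupancy.sum())    # fraction of sequences per cluster
```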

(2) The numbers shown in Fig. 2.3 are illustrative placeholders intended to conceptually represent a vector of behavioral features. They do not correspond to any specific measurements or carry intrinsic meaning. We agree that this may lead to confusion, and we will clarify this in the revised manuscript.

(3) The visualizations in Figs. 4 (b) and (c) show the embeddings of sequences of behavior, rather than individual poses. Therefore, motion-related features such as speed are related to temporal patterns in those sequences rather than static postures. The color overlays reflect average motion characteristics (e.g., speed) of short behavior clips projected into the embedding space, rather than being directly linked to any single frame or pose.
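A minimal sketch of the kind of overlay described here, assuming a precomputed 2-D projection of the clip embeddings and a per-clip mean speed (file names and shapes are hypothetical):

```python
# Hypothetical visualization of the overlay described above (inputs assumed):
# a 2-D projection of behavior-sequence embeddings colored by each clip's mean speed.
import numpy as np
import matplotlib.pyplot as plt

embedding_2d = np.load("behavior_embedding_2d.npy")   # assumed: (n_clips, 2)
clip_speed = np.load("clip_mean_speed.npy")           # assumed: (n_clips,)

plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], c=clip_speed, s=4, cmap="viridis")
plt.colorbar(label="mean speed of clip")
plt.xlabel("embedding dim 1")
plt.ylabel("embedding dim 2")
plt.show()
```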

Answers to Reviewer 2:

(1) In the abstract, our use of the term "unbiased" refers specifically to the avoidance of human-generated bias through feature engineering: the model does not rely on handcrafted features or predefined pose representations, and the learned representations are based on the data alone. However, we agree that the model is still subject to dataset biases and will clarify this in the revised manuscript.

(2) The worm images are rotated to a common vertical orientation to remove orientation as a source of variability in the input. This ensures that the model focuses on learning pose and behavioral dynamics rather than arbitrary head-tail or angular positioning. While data augmentation could in theory account for this variability, we found in our preliminary experiments that applying this preprocessing step led to more stable and interpretable embeddings.
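One plausible way to implement such an orientation normalization, shown purely as an assumption rather than our exact procedure, is to align the principal axis of the segmented worm pixels with the vertical image axis:

```python
# A possible implementation of this orientation normalization (an assumption, not
# necessarily the authors' procedure): rotate the crop so that the principal axis
# of the segmented worm pixels is aligned with the vertical (row) axis.
import numpy as np
from scipy import ndimage

def rotate_to_vertical(crop, threshold=0.5):
    rows, cols = np.nonzero(crop > threshold)             # foreground (worm) pixels
    coords = np.stack([rows, cols], axis=1).astype(float)
    coords -= coords.mean(axis=0)
    # principal axis of the pixel cloud = eigenvector with the largest eigenvalue
    eigvals, eigvecs = np.linalg.eigh(np.cov(coords, rowvar=False))
    u = eigvecs[:, np.argmax(eigvals)]                    # unit vector (d_row, d_col)
    # rotation matrix mapping output coordinates to input coordinates such that
    # the vertical axis of the output corresponds to the worm's principal axis
    A = np.array([[u[0], -u[1]],
                  [u[1],  u[0]]])
    center = (np.array(crop.shape) - 1) / 2.0
    return ndimage.affine_transform(crop, A, offset=center - A @ center, order=1)
```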

(3) We agree that simplifying the technical explanations would enhance the manuscript's accessibility. In the revised version, we will briefly introduce contrastive learning in less technical language.
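For readers unfamiliar with the idea, a generic contrastive objective (an NT-Xent, SimCLR-style loss, which may differ from the exact loss used in the manuscript) pulls embeddings of two augmented views of the same frame together while pushing embeddings of different frames apart:

```python
# A generic NT-Xent (SimCLR-style) contrastive loss, shown only to illustrate the
# idea; the exact objective in the manuscript may differ. Embeddings of two
# augmented views of the same frame are pulled together, all others pushed apart.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same frames."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # (2B, D), unit length
    sim = z @ z.t() / temperature                         # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    batch = z1.shape[0]
    # the positive for view i of frame k is the other view of the same frame k
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)]).to(z1.device)
    return F.cross_entropy(sim, targets)
```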

(4) The gray points in Fig. 3a represent frames that Tierpsy could not resolve, primarily due to coiled, self-intersecting, or overlapping worm postures, as Tierpsy uses skeletonization to estimate the centerline. This approach can fail when such challenging elements are present in the image.

(5) We appreciate this suggestion and will consider it for a revised version of the manuscript.

(6) Although it may seem intuitive for highly bent (red) poses to lie near coiled (gray) ones in the embedding space, the clustering pattern observed reflects how the network organizes pose information. The red/orange cluster consists of distinguishable bent poses that are visually distinct and consistently separable from other postures. In contrast, the greenish and blueish poses are less strongly bent and may share more visual overlap with the unresolved (gray) images.

(7) The overlap occurs because some highly bent or coiled worms can still be (partially) resolved by Tierpsy, depending on specific pose conditions (e.g., head and tail not touching, not self-overlapping). However, Tierpsy fails to consistently resolve such frames. We will describe these cases in more detail in the revised manuscript.

(8) Thank you, we agree this claim needs to be better supported and will develop it in the revision.

(9) To support this statement, we visualized the respective sequences embedded in this area of the embedding space and found that it mostly consists of common behaviors such as forward locomotion.

(10) We agree that interpretability is important and plan to include additional figures and quantifications of the embedding space using more basic Tierpsy features.

(11) Fig. 5a is indeed based solely on N2 animals. In the revised manuscript we will include quantitative measures of behavioral variability and its change with age.

(12) We appreciate this suggestion and will consider it for a revised version of the manuscript.

(13) We agree this would be a valuable analysis. However, our current dataset primarily includes aging data for N2 animals. We acknowledge this limitation and will consider adding more strains in future work.

(14) We will include links to our source code in the revised manuscript.

Answers to Reviewer 3:

(1-2) Our current method is agnostic to head-tail orientation, which indeed restricts the ability to distinguish behaviors that rely on directional cues. We made this design choice as we believe that correctly identifying head/tail orientation can be a challenging task that may introduce additional biases or fail in difficult imaging conditions. However, we fully agree that integrating directional information would improve behavioral resolution, and this is a natural extension of our current framework. In future work, we aim to incorporate head-tail disambiguation.

(3) We explicitly designed our preprocessing and training pipeline to encourage size invariance, for example by resizing individuals to a consistent scale, as the focus of our work is to encode posture and movement only. However, we acknowledge that absolute size information is lost in this process, which can be informative for distinguishing genotypes or age-related changes.
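A minimal sketch of this kind of size normalization, under the assumption (not confirmed in the manuscript text above) that each frame is cropped to the segmented worm, rescaled isotropically, and padded to a fixed resolution:

```python
# A minimal sketch of size normalization (assumed procedure): crop to the segmented
# worm, rescale isotropically so the longer side fits the target, and pad to a square.
import numpy as np
from scipy import ndimage

def normalize_size(frame, mask, target=96):
    rows, cols = np.nonzero(mask)                         # segmented worm pixels
    crop = frame[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    scale = (target - 2) / max(crop.shape)                # keep a small border
    small = ndimage.zoom(crop, scale, order=1)            # isotropic rescaling
    out = np.zeros((target, target), dtype=frame.dtype)
    r0 = (target - small.shape[0]) // 2
    c0 = (target - small.shape[1]) // 2
    out[r0:r0 + small.shape[0], c0:c0 + small.shape[1]] = small
    return out
```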

(4) We agree that a direct quantitative comparison between our embedding-based representations and skeleton-based feature sets would strengthen the paper. Our current focus was to assess whether meaningful behavioral features could be learned from a skeleton-free representation.
