DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila

Abstract

Studying how neural circuits orchestrate limbed behaviors requires the precise measurement of the positions of each appendage in three-dimensional (3D) space. Deep neural networks can estimate two-dimensional (2D) pose in freely behaving and tethered animals. However, the unique challenges associated with transforming these 2D measurements into reliable and precise 3D poses have not been addressed for small animals including the fly, Drosophila melanogaster. Here, we present DeepFly3D, software that infers the 3D pose of tethered, adult Drosophila using multiple camera images. DeepFly3D does not require manual calibration, uses pictorial structures to automatically detect and correct pose estimation errors, and uses active learning to iteratively improve performance. We demonstrate more accurate unsupervised behavioral embedding using 3D joint angles rather than commonly used 2D pose data. Thus, DeepFly3D enables the automated acquisition of Drosophila behavioral measurements at an unprecedented level of detail for a variety of biological applications.

https://doi.org/10.7554/eLife.48571.001

Introduction

The precise quantification of movements is critical for understanding how neurons, biomechanics, and the environment influence and give rise to animal behaviors. For organisms with skeletons and exoskeletons, these measurements are naturally made with reference to 3D joint and appendage locations. Paired with modern approaches to simultaneously record the activity of neural populations in tethered, behaving animals (Dombeck et al., 2007; Seelig et al., 2010; Chen et al., 2018), 3D joint and appendage tracking promises to accelerate the discovery of neural control principles, particularly in the genetically tractable and numerically simple nervous system of the fly, Drosophila melanogaster.

However, algorithms for reliably estimating 3D pose in such small, Drosophila-sized animals have not yet been developed. Instead, multiple alternative approaches have been taken. For example, one can affix small markers (reflective, colored, or fluorescent particles) and use them to identify and reconstruct keypoints from video data (Bender et al., 2010; Kain et al., 2013; Todd et al., 2017). Although this approach works well on humans (Moeslund and Granum, 2000), in smaller, Drosophila-sized animals markers likely hamper movements and are difficult to mount on sub-millimeter scale limbs. Most importantly, measurements of one or even two markers for each leg (Todd et al., 2017) cannot fully describe 3D limb kinematics. Another strategy has been to use computer vision techniques that operate without markers. However, these measurements have been restricted to 2D pose in freely behaving flies. Before the advent of deep learning, this was accomplished by matching the contours of animals seen against uniform backgrounds (Isakov et al., 2016), measuring limb tip positions using complex TIRF-based imaging approaches (Mendes et al., 2013), or measuring limb segments using active contours (Uhlmann et al., 2017). In addition to being limited to 2D rather than 3D pose, these methods are complex, time-consuming, and error-prone in the face of long data sequences, cluttered backgrounds, fast motion, and occlusions that naturally occur when animals are observed from a single 2D perspective.

As a result, in recent years the computer vision community has largely forsaken these techniques in favor of deep learning-based methods, and the efficacy of monocular 3D human pose estimation algorithms has greatly improved. This is especially true when capturing human movements for which there is enough annotated data to train deep networks effectively. Walking and upright poses are prime examples of this, and state-of-the-art algorithms (Pavlakos et al., 2017a; Tome et al., 2017; Popa et al., 2017; Moreno-Noguer, 2017; Martinez et al., 2017; Mehta et al., 2017; Rogez et al., 2017; Pavlakos et al., 2017b; Zhou et al., 2017; Tekin et al., 2017; Sun et al., 2017) now deliver impressive real-time results in uncontrolled environments. Increased robustness to occlusions can be obtained by using multi-camera setups (Elhayek et al., 2015; Rhodin et al., 2016; Simon et al., 2017; Pavlakos et al., 2017b) and triangulating the 2D detections. This improves accuracy while making it possible to eliminate false detections.

These advances in 2D pose estimation have also recently been used to measure behavior in laboratory animals. For example, DeepLabCut provides a user-friendly interface to DeeperCut, a state-of-the-art human pose estimation network (Mathis et al., 2018), and LEAP (Pereira et al., 2019) can successfully track limb and appendage landmarks using a shallower network. Still, 2D pose provides an incomplete representation of animal behavior: important information can be lost due to occlusions, and movement quantification is heavily influenced by perspective.

Approaches used to translate human 2D to 3D pose have also been applied to larger animals, like lab mice and cheetahs (Nath et al., 2019), but require the use of calibration boards. These techniques cannot be easily transferred for the study of small animals like Drosophila: adult flies are approximately 2.5 mm long and precisely registering multiple camera viewpoints using traditional approaches would require the fabrication of a prohibitively small checkerboard pattern, along with the tedious labor of using a small, external calibration pattern. Moreover, flies have many appendages and joints, are translucent, and in most laboratory experiments are only illuminated using infrared light (to avoid visual stimulation)—precluding the use of color information.

To overcome these challenges, we introduce DeepFly3D, a deep learning-based software pipeline that achieves comprehensive, rapid, and reliable 3D pose estimation in tethered, behaving adult Drosophila (Figure 1, Figure 1—video 1). DeepFly3D is applied to synchronized videos acquired from multiple cameras. It first uses a state-of-the-art deep network (Newell et al., 2016) and then enforces consistency across views. This makes it possible to eliminate spurious detections, achieve high 3D accuracy, and use 3D pose errors to further fine-tune the deep network to achieve even better accuracy. To register the cameras, DeepFly3D uses a novel calibration mechanism in which the fly itself is the calibration target. During the calibration process, we also employ sparse bundle adjustment methods, as previously used for human pose estimation (Takahashi et al., 2018; Triggs et al., 2000; Puwein et al., 2014). Thus, the user does not need to manufacture a prohibitively small calibration pattern, or repeat cumbersome calibration protocols. We explain how users can modify the codebase to extend DeepFly3D for 3D pose estimation in other animals (see Materials and methods). Finally, we demonstrate that unsupervised behavioral embedding of 3D joint angle data is robust against problematic artifacts present in embeddings of 2D pose data. In short, DeepFly3D delivers 3D pose estimates reliably, accurately, and with minimal manual intervention while also providing a critical tool for automated behavioral data analysis.

Figure 1

Deriving 3D pose from multiple camera views.

(A) Raw image inputs to the Stacked Hourglass deep network. (B) Probability maps output from the trained deep network. For visualization purposes, multiple probability maps have been overlaid for each camera view. (C) 2D pose estimates from the Stacked Hourglass deep network after applying pictorial structures and multi-view algorithms. (D) 3D pose derived from combining multiple camera views. For visualization purposes, 3D pose has been projected onto the original 2D camera perspectives. (E) 3D pose rendered in 3D coordinates. Immobile thorax-coxa joints and antennal joints have been removed for clarity.

https://doi.org/10.7554/eLife.48571.002

Results

DeepFly3D

The input to DeepFly3D is video data from seven cameras. These images are used to identify the 3D positions of 38 landmarks per animal: (i) five on each limb (the thorax-coxa, coxa-femur, femur-tibia, and tibia-tarsus joints, as well as the pretarsus), (ii) six on the abdomen (three on each side), and (iii) one on each antenna, for measuring head rotations. Our software incorporates the following innovations designed to ensure automated, high-fidelity, and reliable 3D pose estimation.

Calibration without an external calibration pattern

Estimating 3D pose from multiple images requires calibrating the cameras to achieve a level of accuracy commensurate with the target size—a difficult challenge when measuring leg movements for an animal as small as Drosophila. Therefore, instead of using a typical external calibration grid, DeepFly3D uses the fly itself as a calibration target. It detects arbitrary points on the fly’s body and relies on bundle-adjustment (Chavdarova et al., 2018) to simultaneously assign 3D locations to these points and to estimate the positions and orientations of each camera. To increase robustness, it enforces geometric constraints that apply to tethered flies with respect to limb segment lengths and ranges of motion.
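The quantity that bundle adjustment minimizes can be illustrated with a minimal sketch. The code below is not the DeepFly3D implementation: it assumes a simple pinhole camera described by a rotation, a translation, and a single focal length (the names `project` and `reprojection_error` are hypothetical), and it only evaluates the objective. An optimizer would then adjust both the 3D points and the camera parameters to reduce this value.

```python
def project(point, cam):
    """Project a 3D point with a simple pinhole camera (assumed model).

    cam = (R, t, f): 3x3 rotation as nested lists, translation vector,
    and focal length in pixels.
    """
    R, t, f = cam
    # Transform the point into the camera frame: Xc = R X + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective division onto the image plane
    return (f * xc[0] / xc[2], f * xc[1] / xc[2])


def reprojection_error(points3d, cams, observations):
    """Sum of squared pixel residuals over all observations.

    observations maps (camera_index, point_index) -> observed (u, v).
    """
    total = 0.0
    for (ci, pi), (u, v) in observations.items():
        pu, pv = project(points3d[pi], cams[ci])
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


# Toy setup: two cameras with identical orientation, offset along x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cams = [(identity, [0.0, 0.0, 5.0], 100.0), (identity, [1.0, 0.0, 5.0], 100.0)]
truth = [[0.0, 0.0, 0.0], [0.5, -0.2, 0.3]]
obs = {(ci, pi): project(truth[pi], cams[ci])
       for ci in range(len(cams)) for pi in range(len(truth))}

# A wrong guess for one point yields a larger objective than the true points.
guess = [[0.1, 0.0, 0.0], [0.5, -0.2, 0.3]]
print(reprojection_error(truth, cams, obs), reprojection_error(guess, cams, obs))
```

Because the points on the fly's body and the camera poses enter the same objective, no external pattern with known geometry is needed; the geometric constraints mentioned above would appear as additional penalty terms, which this sketch omits.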

Geometrically consistent reconstructions

Starting with a state-of-the-art deep network for 2D keypoint detection in individual images (Newell et al., 2016), DeepFly3D enforces geometric consistency constraints across multiple synchronized camera views. When triangulating 2D detections to produce 3D joint locations, it relies on pictorial structures and belief propagation message passing (Felzenszwalb and Huttenlocher, 2005) to detect and further correct erroneous pose estimates.
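As a simplified illustration of how a pictorial structure can reject an implausible detection, the sketch below runs max-product dynamic programming along a single chain of joints. It assumes that each joint has a few candidate 3D positions with log-scores and that the only pairwise term is a quadratic penalty on deviation from an expected segment length. The function name `best_chain`, the `weight` parameter, and the toy numbers are hypothetical, and the model used by DeepFly3D is richer than this.

```python
import math


def best_chain(candidates, unary, seg_lengths, weight=1.0):
    """Pick one candidate per joint along a chain by max-product DP.

    candidates[j]  : list of 3D candidate positions for joint j
    unary[j][k]    : log-score of candidate k for joint j
    seg_lengths[j] : expected distance between joint j and joint j + 1
    """
    score = [list(unary[0])]
    back = []
    for j in range(1, len(candidates)):
        row, ptr = [], []
        for k, p in enumerate(candidates[j]):
            best_val, best_idx = -math.inf, -1
            for m, q in enumerate(candidates[j - 1]):
                # Pairwise term: penalize implausible segment lengths
                pair = -weight * (math.dist(p, q) - seg_lengths[j - 1]) ** 2
                val = score[j - 1][m] + pair
                if val > best_val:
                    best_val, best_idx = val, m
            row.append(best_val + unary[j][k])
            ptr.append(best_idx)
        score.append(row)
        back.append(ptr)
    # Backtrack from the best final candidate
    k = max(range(len(score[-1])), key=score[-1].__getitem__)
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return path


# The middle joint has a confident but geometrically implausible candidate
# (index 0) and a slightly less confident, consistent one (index 1).
candidates = [[(0, 0, 0)], [(5, 0, 0), (1, 0, 0)], [(2, 0, 0)]]
unary = [[0.0], [0.0, -0.5], [0.0]]
print(best_chain(candidates, unary, seg_lengths=[1.0, 1.0]))
```

In this toy case the geometrically consistent candidate wins even though its detection score is lower, which is the behavior the text describes as correcting erroneous pose estimates.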

Self-supervision and active learning

DeepFly3D also uses multiple view geometry as a basis for active learning. Thanks to the redundancy inherent in obtaining multiple views of the same animal, we can detect erroneous 2D predictions for correction that would most efficiently train the 2D pose deep network. This approach greatly reduces the need for time-consuming manual labeling (Simon et al., 2017). We also use pictorial structure corrections to fine-tune the 2D pose deep network. Self-supervision constitutes 85% of our training data.
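One way such a selection could work is sketched below, under the assumption that each frame has already been assigned a single multi-view consistency error in pixels. The function name `select_for_annotation`, the threshold, and the annotation budget are hypothetical; the actual policy used by DeepFly3D may differ.

```python
def select_for_annotation(frame_errors, threshold, budget):
    """Return indices of the worst frames whose error exceeds a threshold.

    frame_errors : per-frame multi-view consistency error (pixels)
    threshold    : errors at or below this value are treated as acceptable
    budget       : maximum number of frames to hand to the annotator
    """
    flagged = [(err, idx) for idx, err in enumerate(frame_errors) if err > threshold]
    # Largest errors first: these are assumed to be the most informative
    flagged.sort(reverse=True)
    return [idx for _, idx in flagged[:budget]]


errors = [2.0, 40.0, 10.0, 55.0]
print(select_for_annotation(errors, threshold=35.0, budget=1))
```

Frames below the threshold never reach the annotator, which is how multi-view redundancy can reduce manual labeling in a scheme of this kind.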

2D pose performance and improvement using pictorial structures

We validated our approach using a challenging dataset of 2,063 image frames manually annotated using the DeepFly3D annotation tool and sampled uniformly from each camera. Images for testing and training were 480 × 960 pixels. The test dataset included challenging frames and occasional motion blur to increase the difficulty of pose estimation. For training, we used a final training dataset of 37,000 frames, an overwhelming majority of which were first automatically corrected using pictorial structures. On test data, we achieved a Root Mean Square Error (RMSE) of 13.9 pixels. Compared with a ground truth RMSE of 12.4 pixels – via manual annotation of 210 images by a new human expert – our Network Annotation/Manual Annotation ratio of 1.12 (13.9 pixels / 12.4 pixels) is similar to the ratio of another state-of-the-art network (Mathis et al., 2018): 1.07 (2.88 pixels / 2.69 pixels). Setting a 50 pixel threshold (approximately one third the length of the femur) for PCK (percentage of correct keypoints) computation, we observed a 98.2% general accuracy before applying pictorial structures. Notably, if we reduced our threshold to 30 or 20 pixels, we still achieved 95% or 89% accuracy, respectively (Figure 2A).
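The metrics in this paragraph can be made concrete with a short sketch. It assumes that per-keypoint errors are Euclidean pixel distances and that a keypoint counts as correct when its error is at or below the threshold; the function names and the toy error list are illustrative only, while the two ratios use the values reported above.

```python
import math


def rmse(errors):
    """Root mean square of per-keypoint pixel errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def pck(errors, threshold):
    """Percentage of keypoints whose error is at or below the threshold."""
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)


# Network-to-human error ratios reported in the text
print(round(13.9 / 12.4, 2))   # this work
print(round(2.88 / 2.69, 2))   # Mathis et al., 2018

# Toy per-keypoint errors in pixels (not real data)
toy_errors = [5.0, 25.0, 40.0, 60.0]
print(pck(toy_errors, 50), pck(toy_errors, 30), rmse([3.0, 4.0]))
```

Normalizing network error by human annotation error is what allows the comparison with another network despite the difference in absolute pixel errors between the two studies.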

Figure 2

Mean absolute error distribution.

(A) PCK (percentage of correct keypoints) accuracy as a function of mean absolute error (MAE) threshold. (B) Evaluating network prediction error in a low data regime. The Stacked Hourglass network (blue circles) shows near asymptotic prediction error (red dashed line), even when trained with only 400 annotated images. After 800 annotations, there are minimal improvements to the MAE. (C) MAE for different limb landmarks. Violin plots are overlaid with raw data points (white circles).

https://doi.org/10.7554/eLife.48571.004

To test the performance of our network in a low data regime, we trained a two-stacked network using ground-truth annotation data from seven cameras (Figure 2B). We compared the results to an asymptotic prediction error (i.e. the error observed when the network is trained using the full dataset of 40,000 annotated images) and to the variability observed in human annotations of 210 randomly selected images. We measured an asymptotic MAE (mean absolute error) of 10.5 pixels and a human variability MAE of 9.2 pixels. With 800 annotations, our network achieved an accuracy similar to that of manual annotation and was near the asymptotic prediction error. Further annotation yielded diminishing returns.

Although our network achieves high accuracy, the error is not isotropic (Figure 2C). The tarsus tips (i.e. pretarsus) exhibited larger error than the other joints, perhaps due to occlusions from the spherical treadmill, and higher positional variance. Increased error observed for body-coxa joints might be due to the difficulty of annotating these landmarks from certain camera views.

To correct the residual errors, we applied pictorial structures. This strategy fixed 59% of the remaining erroneous predictions, increasing the final accuracy to 99.2%, from 98.2%. These improvements are illustrated in Figure 3. Pictorial structure failures were often due to pose ambiguities resulting from heavy motion blur. These remaining errors were automatically detected with multi-view redundancy using Equation 6, and earmarked for manual correction using the DeepFly3D GUI.

Figure 3

Pose estimation accuracy before and after using pictorial structures.

Pixel-wise 2D pose errors/residuals (top) and their respective distributions (bottom) (A) before, or (B) after applying pictorial structures. Residuals larger than 35 pixels (red circles) represent incorrect keypoint detections. Those below this threshold (blue circles) represent correct keypoint detections.

https://doi.org/10.7554/eLife.48571.005

3D pose permits robust unsupervised behavioral classification

Unsupervised behavioral classification approaches enable the unbiased quantification of animal behavior by processing data features—image pixel intensities (Berman et al., 2014; Cande et al., 2018), limb markers (Todd et al., 2017), or 2D pose (Pereira et al., 2019)—to cluster similar behavioral epochs without user intervention and to automatically distinguish between otherwise similar actions. However, with this sensitivity may come a susceptibility to features unrelated to behavior including changes in image size or perspective resulting from differences in camera angle across experimental systems, variable mounting of tethered animals, and inter-animal morphological variability. In theory, each of these issues can be overcome—providing scale and rotational invariance—by using 3D joint angles rather than 2D pose for unsupervised embedding.
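The invariance argument can be checked numerically. The sketch below computes the angle at a joint from three points and shows that rotating and scaling the 3D points leaves the 3D angle unchanged, whereas the angle measured in a 2D projection of the same points changes with viewing direction. The helper names and the orthographic projection (dropping the z coordinate) are assumptions made for illustration.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, between segments b->a and b->c."""
    u = [x - y for x, y in zip(a, b)]
    v = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    # Clamp to guard against rounding slightly outside [-1, 1]
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def rotate_y(p, theta):
    """Rotate a 3D point about the y axis by theta radians."""
    x, y, z = p
    return (x * math.cos(theta) + z * math.sin(theta),
            y,
            -x * math.sin(theta) + z * math.cos(theta))


def scale(p, s):
    """Uniformly scale a 3D point."""
    return tuple(s * x for x in p)


a, b, c = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 0.0)
theta = math.radians(60)
ra, rb, rc = (scale(rotate_y(p, theta), 2.5) for p in (a, b, c))

print(joint_angle(a, b, c))      # 3D angle in the original pose
print(joint_angle(ra, rb, rc))   # same 3D angle after rotation and scaling
# 2D angle seen by an assumed orthographic camera that drops z
print(joint_angle(ra[:2], rb[:2], rc[:2]))
```

The last value differs from the first two, which is the kind of viewpoint dependence that an embedding built on 2D pose would inherit.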

To test this possibility, we performed unsupervised behavioral classification (Figure 4 and Figure 5) on video data taken during optogenetic stimulation experiments that repeatedly and reliably drove certain behaviors. Specifically, we optically activated CsChrimson (Klapoetke et al., 2014) to elicit backward walking in MDN>CsChrimson animals (Figure 5—video 1) (Bidaye et al., 2014), or antennal grooming in aDN>CsChrimson animals (Figure 5—video 2) (Hampel et al., 2015). We also stimulated control animals lacking the UAS-CsChrimson transgene (Figure 5—video 3) (MDN-GAL4/+ and aDN-GAL4/+). First, we performed unsupervised behavioral classification using 2D pose data from three adjacent cameras containing keypoints for three limbs on one side of the body. Using these data, we generated a behavioral map (Figure 4A). In this map each individual cluster would ideally represent a single behavior (e.g. backward walking, or grooming) and be populated by nearly equal amounts of data from each of the three cameras. This was not the case: data from each camera covered non-overlapping regions and clusters (Figure 4B–D). This effect was most pronounced when comparing regions populated by cameras 1 and 2 versus camera 3. Therefore, because the underlying behaviors were otherwise identical (data across cameras were from the same animals and experimental time points), we can conclude that unsupervised behavioral classification of 2D pose data is sensitive to being corrupted by viewing angle differences.

Figure 4

Unsupervised behavioral classification of 2D pose data is sensitive to viewing angle.

(A) Behavioral map derived using 2D pose data from three adjacent cameras (Cameras 1, 2, and 3) but the same animals and experimental time points. Shown are clusters (black outlines) that are enriched (yellow), or sparsely populated (blue) with data. Different clusters are enriched for data from either (B) camera 1, (C) camera 2, or (D) camera 3. Behavioral embeddings were derived using 1 million frames during 4 s of optogenetic stimulation of MDN>CsChrimson (n = 6 flies, n = 29 trials), aDN>CsChrimson (n = 6 flies, n = 30 trials), and wild-type control animals (MDN-GAL4/+: n = 4 flies, n = 20 trials. aDN-GAL4/+: n = 4 flies, n = 23 trials).

https://doi.org/10.7554/eLife.48571.006

Figure 5

Unsupervised behavioral classification of 3D joint angle data.

Behavioral embeddings were calculated using 3D joint angles from the same 1 million frames used in Figure 4A. (A) Behavioral map combining all data during 4 s of optogenetic stimulation of MDN>CsChrimson (n = 6 flies, n = 29 trials), aDN>CsChrimson (n = 6 flies, n = 30 trials), and wild-type control animals (MDN-GAL4/+: n = 4 flies, n = 20 trials. aDN-GAL4/+: n = 4 flies, n = 23 trials). The same behavioral map is shown with only the data from (B) MDN>CsChrimson stimulation, (C) aDN>CsChrimson stimulation, or (D) control animal stimulation. Associated videos reveal that these distinct map regions are enriched for backward walking, antennal grooming, and forward walking, respectively.

https://doi.org/10.7554/eLife.48571.007

By contrast, performing unsupervised behavioral classification using DeepFly3D-derived 3D joint angles resulted in a map (Figure 5) with a clear segregation and enrichment of clusters for different GAL4 driver lines and their associated behaviors, i.e. backward walking (Figure 5—video 4), grooming (Figure 5—video 5), and forward walking (Figure 5—video 6). Thus, 3D pose overcomes serious issues arising from unsupervised embedding of 2D pose data, enabling more reliable and robust behavioral data analysis.

Discussion

We have developed DeepFly3D, a deep learning-based 3D pose estimation system that is optimized for quantifying limb and appendage movements in tethered, behaving Drosophila. By using multiple synchronized cameras and exploiting multiview redundancy, our software delivers robust and accurate pose estimation at the sub-millimeter scale. Ultimately, we may work solely with monocular images by lifting the 2D detections (Pavlakos et al., 2017b) to 3D or by directly regressing to 3D (Tekin et al., 2017) as has been achieved in human pose estimation studies. Our approach relies on supervised deep learning to train a neural network that detects 2D joint locations in individual camera images. Importantly, our network becomes increasingly competent as it runs: By leveraging the redundancy inherent to a multiple-camera setup, we iteratively reproject 3D pose to automatically detect and correct 2D errors, and then use these corrections to further train the network without user intervention.

None of the techniques we have put together are fly-specific. These comprise an approach for multiple-camera calibration that uses the animal itself rather than an external apparatus; an iterative approach to inferring 3D pose using graphical models as well as optimization based on dynamic programming and belief propagation; and a graphical user interface and active learning policy for interacting with, annotating, and correcting 3D pose data. They could easily be adapted to other limbed animals, from mice to primates and humans. The only thing that would have to change significantly is the dimensions of the experimental setup. This would remove the need to deal with the very small scales Drosophila requires and would, in practice, make pose estimation easier. In the Materials and methods section, we explain in detail how organism-specific features of DeepFly3D (bone segment length, number of legs, and camera focal distance) can be modified to study, for example, humans, primates, rodents, or other insects.

As in the past, we anticipate that the development of new technologies for quantifying behavior will open new avenues and enhance existing lines of investigation. For example, deriving 3D pose using DeepFly3D can improve the resolution of studies examining how neuronal stimulation influences animal behavior (Cande et al., 2018; McKellar et al., 2019), the precision and predictive power of efforts to define natural action sequences (Seeds et al., 2014; McKellar et al., 2019), the assessment of interventions that target models of human disease (Feany and Bender, 2000; Hewitt and Whitworth, 2017), and links between neural activity and animal behavior—when coupled with recording technologies like 2-photon microscopy (Seelig et al., 2010; Chen et al., 2018). Importantly, 3D pose improves the robustness of unsupervised behavioral classification approaches. Therefore, DeepFly3D is a critical step toward the ultimate goal of achieving fully-automated, high-fidelity behavioral data analysis.

Materials and methods

With synchronized Drosophila video sequences from seven cameras in hand, the first task for DeepFly3D is to detect the 2D location of 38 landmarks. These 2D locations of the same landmarks seen across multiple views are then triangulated to generate 3D pose estimates. This pipeline is depicted in Figure 6. First, we will describe our deep learning-based approach to detect landmarks in images. Then, we will explain the triangulation process that yields full 3D trajectories. Finally, we will describe how we identify and correct erroneous 2D detections automatically.
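The triangulation step can be sketched in a simplified form. The code below assumes that each 2D detection has already been back-projected into a viewing ray (a camera center and a direction), and it finds the 3D point closest to all rays in the least-squares sense. This ray-based formulation and the names `solve3` and `triangulate` are assumptions made for illustration, not the formulation used in the DeepFly3D codebase.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for k in range(col, 4):
                M[r][k] -= factor * M[col][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][k] * x[k] for k in range(r + 1, 3))) / M[r][r]
    return x


def triangulate(origins, directions):
    """Least-squares 3D point closest to a set of rays.

    Minimizes the sum over rays of |(I - d d^T)(X - c)|^2, where c is the
    ray origin and d its unit direction. Requires non-parallel rays.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(origins, directions):
        norm = math.sqrt(sum(x * x for x in d))
        d = [x / norm for x in d]
        for i in range(3):
            for j in range(3):
                # Entry of the projector onto the plane orthogonal to d
                p_ij = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += p_ij
                b[i] += p_ij * c[j]
    return solve3(A, b)


# Three cameras looking at the same landmark from different positions
target = (1.0, 2.0, 3.0)
origins = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
directions = [tuple(t - o for t, o in zip(target, origin)) for origin in origins]
print(triangulate(origins, directions))
```

With noisy detections the rays no longer intersect, and the distance from the estimated point to each ray gives a per-view residual that can be used to flag inconsistent 2D detections.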

Figure 6

The DeepFly3D pose estimation pipeline.

Author details

Semih Günel (for correspondence), Helge Rhodin, Daniel Morales, João Campagnolo, Pavan Ramdya (for correspondence), Pascal Fua
