Overview of AVN song analysis pipeline.

a. Schematic timeline of zebra finch song learning. b. Overview of AVN song analysis pipeline. Spectrograms of songs are automatically segmented into syllables, then the syllables are labeled. The raw spectrograms are used to calculate features describing the rhythm of a bird’s song, the segmentations are used to calculate syllable-level timing features, and the labeled syllables are used to calculate syntax-related features and acoustic features of a bird’s song. c. Birds from different research groups, with multiple different song phenotypes, can all be processed by the AVN pipeline, generating a matrix of directly comparable, interpretable features, which can be used for downstream analyses including phenotype comparisons, tracking the emergence of a phenotype over time, investigating song development, and detecting individual outlier birds with atypical song phenotypes.

Automated syllable annotation metrics.

a. F1 scores for syllable onset detections within 10ms of a syllable onset in the manual annotations of each bird (n=35 from UTSW and n=25 from Rockefeller) across segmentation methods. b. Distribution of time-differences between predicted syllable onsets and their best matches in the manual annotation, across segmentation methods. Distributions include all matched syllables across all 35 birds from the UT Southwestern colony (UTSW). c. As in b, for all 25 birds from Rockefeller. d. Example spectrogram of a typical adult zebra finch. The song was segmented with WhisperSeg and labeled using UMAP and HDBSCAN clustering. Colored rectangles reflect the labels of each syllable. e. Example UMAP plot of 3131 syllables from the same bird as in d and f. Each point represents one syllable segmented with WhisperSeg, and colors reflect the AVN label of each syllable. f. Example confusion matrix for the bird depicted in d and e. The matrix shows the percentage of syllables bearing each manual annotation label which fall into each of the possible AVN labels. g. V-measure scores for AVN syllable labels compared to manual annotations for each bird (n=35 from UTSW and n=25 from Rockefeller), across segmentation methods.
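
The onset-detection score in panel a can be illustrated with a minimal sketch. It assumes onsets are given in seconds and uses a simple greedy one-to-one matching within the 10ms tolerance; the function name `onset_f1` and the matching strategy are illustrative assumptions, not necessarily the procedure used to produce the figure.

```python
def onset_f1(predicted, manual, tolerance=0.010):
    """Precision, recall and F1 for onset detection (times in seconds).

    Assumption: each manual onset can be matched by at most one predicted
    onset, and matches are made greedily in time order.
    """
    predicted = sorted(predicted)
    manual = sorted(manual)
    matched = 0
    i = j = 0
    while i < len(predicted) and j < len(manual):
        diff = predicted[i] - manual[j]
        if abs(diff) <= tolerance:
            # Prediction falls within the tolerance window: count a match.
            matched += 1
            i += 1
            j += 1
        elif diff < 0:
            # Prediction is too early to match this or any later manual onset.
            i += 1
        else:
            # Manual onset has no prediction close enough: it is a miss.
            j += 1
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(manual) if manual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up onset times, for illustration only.
    print(onset_f1([0.100, 0.305, 0.520, 0.900], [0.102, 0.300, 0.550]))
```

The same function applies to offsets by passing offset times and a 20ms tolerance.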

Song syntax and timing analysis with AVN.

a. Example syntax raster plot for a typical adult zebra finch made with AVN labels. Each row represents a song bout, and each colored block represents a syllable, colored according to its AVN label. b. Example transition matrix from the bird featured in a. Each cell gives the probability of the bird producing the ‘following syllable’, given that it just produced a syllable with the ‘preceding syllable’ label. c. Correlation between normalized entropy rate scores calculated for each bird using manual annotations or AVN labels (n=35 birds from UTSW, r = 0.89, p<0.005). d. Comparison of normalized entropy rates calculated with AVN labels across typical (n=20), isolate (n=8), and FP1 KD (n=7) adult zebra finches (One Way ANOVA F(2, 32) = 15.05, p <0.005, Tukey HSD * indicates p-adj < 0.005). e. Schematic representing the generation of rhythm spectrograms. The amplitude trace of each song file is calculated, then the spectrum of the first derivative of the amplitude trace is computed. The spectra of multiple song files are concatenated to form a rhythm spectrogram, with bout index on the x-axis and frequency along the y-axis. The example rhythm spectrograms show the expected banding structure of a typical adult zebra finch, and the less structured rhythm of a typical juvenile zebra finch (50dph). f. Comparison of rhythm spectrum entropies across typical (n=20), isolate (n=8), FP1 KD (n=7) adult zebra finches (>90dph), and juvenile zebra finches (n = 11, 50-51dph) (One Way ANOVA F(3, 43) = 17.0, p < 0.05, Tukey HSD * indicates p-adj < 0.05).
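
The transition matrix in panel b and an entropy rate of the kind compared in panels c and d can be sketched as follows. The transition matrix follows the caption's definition directly. The normalization is an assumption made for illustration: the conditional entropy of the following syllable, weighted by how often each preceding syllable occurs, divided by the log of the number of syllable types. The exact definition used by AVN may differ, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def transition_counts(bouts):
    """Count preceding -> following syllable pairs within each bout."""
    counts = defaultdict(Counter)
    for bout in bouts:
        for preceding, following in zip(bout, bout[1:]):
            counts[preceding][following] += 1
    return counts


def transition_matrix(bouts):
    """P(following syllable | preceding syllable) as a nested dict."""
    counts = transition_counts(bouts)
    return {
        preceding: {f: c / sum(row.values()) for f, c in row.items()}
        for preceding, row in counts.items()
    }


def normalized_entropy_rate(bouts):
    """Weighted conditional entropy of transitions, scaled to [0, 1].

    Assumed normalization: divide by log2(number of syllable types).
    """
    counts = transition_counts(bouts)
    n_types = len({syllable for bout in bouts for syllable in bout})
    total = sum(sum(row.values()) for row in counts.values())
    if total == 0 or n_types < 2:
        return 0.0
    rate = 0.0
    for row in counts.values():
        row_total = sum(row.values())
        row_entropy = -sum(
            (c / row_total) * math.log2(c / row_total) for c in row.values()
        )
        rate += (row_total / total) * row_entropy
    return rate / math.log2(n_types)


if __name__ == "__main__":
    # Made-up label sequences: a fully stereotyped a-b-c motif.
    bouts = [list("abcabc"), list("abcabc")]
    print(transition_matrix(bouts))
    print(normalized_entropy_rate(bouts))
```

Under this definition a perfectly stereotyped song scores 0, and a song in which every syllable type is equally likely to follow any other scores 1.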

Song phenotype classification with AVN features.

a. Linear discriminant values for multiple groups of birds generated from a model trained to discriminate between typical and isolate zebra finches (n=16 isolate birds, 7 FP1 KD birds, 5 deaf birds, 4 sham deafening birds, 53 typical zebra finches from the UTSW colony and 25 typical zebra finches from Rockefeller). b. Linear discriminant values for multiple groups of birds generated from a model trained to discriminate between typical and deaf zebra finches. Same birds as in a. c. Confusion matrix indicating the LDA model’s classification of typical, deaf, isolate and FP1 KD birds from a model trained to discriminate between typical, deaf, and isolate birds. Scores for typical, deaf, and isolate birds were obtained using leave-one-out cross validation, and FP1 KD scores were obtained using a model fit to all typical, deaf and isolate birds. d. Plot of the linear discriminant coordinates of isolate (n=16), typical (n=78), and FP1 KD birds (n=7) for a model trained to discriminate between typical, deaf, and isolate birds. FP1 KD birds overlap most with isolate birds in this LDA space, indicating that their song production most closely resembles that of isolates.

Age prediction with AVN features.

a. Generalized additive model (GAM) age predictions vs. true ages for 103 days of song recordings across 19 individual birds. Model predictions were generated using leave-one-bird-out cross validation. The grey line indicates where points would lie if the model were perfectly accurate. b. Partial dependence functions for each feature in the GAM. The values of each feature along the x-axis map onto learned contributions to the age prediction along the y-axis. The GAM’s prediction is the sum of these age contributions based on each day of song’s feature values, plus an intercept term.

Illustration and validation of AVN’s song similarity scoring method.

a. Schematic of the similarity scoring method. A deep convolutional neural network is used to embed syllables in an 8-dimensional space, where each syllable is a single point, and similar syllables are embedded close together. The first 2 principal components of the 8-dimensional space are used for visualization purposes only here. The syllable embedding distributions for two random subsets of syllables produced by the same pupil on the same day have a high degree of overlap. The empirical distributions of all syllables from a pupil and his song tutor are less similar than a pupil compared to himself, but still much more similar than a pupil and a random unrelated bird. b. Maximum Mean Discrepancy (MMD) dissimilarity score distribution for comparisons between a pupil and itself (n=30 comparisons for UTSW, n = 25 for Rockefeller), a pupil and its tutor (n=30 comparisons for UTSW, n=25 for Rockefeller), two pupils which share the same tutor (aka pupil vs. ‘Sibling’ comparisons, n = 58 comparisons for UTSW, n = 64 for Rockefeller), and between two pupils who don’t share a song tutor (aka pupil vs. unrelated bird, n = 90 comparisons for UTSW, n = 75 for Rockefeller). Calculated with a dataset of 30 typical tutor-pupil pairs from UTSW and 25 from Rockefeller. c. Correlation between MMD dissimilarity scores and human expert judgements of song similarity for 14 tutor-pupil comparisons from the UTSW colony (r = -0.80, p<0.005). d. Tutor-pupil MMD dissimilarity scores for typical pupils from the UTSW colony (n = 30), typical pupils from the Rockefeller Song Library (n = 25), and FP1 KD pupils from the UTSW colony (n = 7) (One Way ANOVA F(2, 57) = 9.57, p < 0.005. * Indicates Tukey HSD post hoc p-adj < 0.05). e. MMD dissimilarity score between birds at various age points across development, compared to their mature song recorded when the bird is over 90dph. Each point represents one comparison (n = 91 comparisons across 11 birds). The grey line is an exponential function fit to the data to emphasize the slowing of song maturation as birds approach maturity.
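
To make the dissimilarity score concrete, here is a minimal sketch of a squared MMD between two sets of syllable embeddings. It assumes a Gaussian kernel with a fixed bandwidth and the biased (V-statistic) estimator; the kernel, bandwidth, and estimator used by AVN are not stated in the caption, so these choices and the function names are illustrative assumptions.

```python
import math


def gaussian_kernel(u, v, bandwidth):
    """Gaussian (RBF) kernel between two embedding vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of points.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Made-up 2-D points standing in for 8-dimensional syllable embeddings.
    pupil = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2)]
    tutor = [(0.1, 0.1), (0.2, 0.0), (0.0, 0.1)]
    unrelated = [(5.0, 5.0), (5.1, 5.0), (4.9, 5.2)]
    print(mmd_squared(pupil, tutor), mmd_squared(pupil, unrelated))
```

Identical sets give a score of 0, and the score grows as the two embedding distributions overlap less, which matches the ordering of self, tutor, and unrelated comparisons described in panels a and b.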

Screenshots of the AVN graphical application, showing a. how syllable labels can be generated from a table of syllable segmentations, including visual inspection of labels overlaid on a spectrogram, b. how the complete AVN feature set can be generated from a table of labeled syllables with a single click, and c. how users can adjust different hyperparameters pertaining to spectrogram generation or feature calculations, with all hyperparameters clearly explained. Hyperparameters can also be saved and loaded from files, to encourage reproducibility of analyses.

a-c. The precision, recall, and F1 scores for syllable onset predictions within 10ms of a syllable onset in the manual annotation for each bird (n = 20 typical birds, 8 isolate birds, and 7 FP1 KD birds) across segmentation methods. d-f. The precision, recall and F1 scores for syllable offset predictions within 20ms of a syllable offset in the manual annotation for each bird (n = 20 typical birds, 8 isolate birds, and 7 FP1 KD birds) across segmentation methods. g. Distribution of time-differences between predicted syllable offsets and their best matches in the manual annotation, across segmentation methods. Distributions include all matched syllables across all 35 birds from UTSW.

a-c. The precision, recall and F1 scores for syllable offset predictions within 20ms of a syllable offset in the manual annotation for each bird (n = 25 typical birds from the Rockefeller song library) across segmentation methods. d. Distribution of time-differences between predicted syllable offsets and their best matches in the manual annotation, across segmentation methods. Distributions include all matched syllables across all 25 birds from Rockefeller.

a & d. Example UMAP plot of syllables from a typical adult zebra finch. Each point represents a syllable, colored according to its AVN label. b & e. Confusion matrix showing the proportion of syllables bearing each manual annotation label which fall into each of the possible AVN labels for an example typical adult zebra finch. AVN label 1000 refers to syllables which were not segmented correctly by WhisperSeg, and therefore don’t carry an AVN label. Hand label ‘x’ refers to ‘syllables’ which were segmented by WhisperSeg, but don’t have a counterpart in the manual annotation. c & f. Example spectrogram of a song bout produced by a typical adult zebra finch, with overlaid AVN syllable labels.

a & b. Four replicate UMAP embeddings with different random initializations, and the corresponding clustering validation, for each of 2 different birds. c. The standard deviation of v-measure scores for each bird (n = 35, 20 typical birds, 8 isolate birds, and 7 FP1 KD birds from the UTSW colony) across 30 different random initializations for UMAP dimensionality reduction and HDBSCAN clustering. Despite the stochasticity inherent in UMAP embeddings, the clustering and its accuracy are very consistent within each bird.

a-c. The homogeneity, completeness, and v-measure scores for AVN labels compared to manual annotations for each bird (n = 20 typical birds, 8 isolate birds, and 7 FP1 KD birds), across segmentation methods.
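
The three scores in these panels follow the standard entropy-based definitions: homogeneity is 1 - H(manual | predicted) / H(manual), completeness is 1 - H(predicted | manual) / H(predicted), and the v-measure is their harmonic mean. The sketch below implements those definitions in plain Python on made-up labels; the function names are illustrative, and it is not the code used to generate the figure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two aligned label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(manual, predicted):
    """Return (homogeneity, completeness, v-measure)."""
    h_manual = entropy(manual)
    h_predicted = entropy(predicted)
    homogeneity = (
        1.0 if h_manual == 0
        else 1 - conditional_entropy(manual, predicted) / h_manual
    )
    completeness = (
        1.0 if h_predicted == 0
        else 1 - conditional_entropy(predicted, manual) / h_predicted
    )
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


if __name__ == "__main__":
    # Made-up example: manual syllable type 'a' is split across two clusters.
    print(v_measure(["a", "a", "b", "b"], [0, 1, 2, 2]))
```

Splitting one manual syllable type across two clusters leaves homogeneity at 1 but lowers completeness, while merging two manual types into one cluster does the reverse.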

a-b. Example syntax raster plot for a. an example FP1 KD zebra finch with disrupted syntax, and b. a typical adult zebra finch with stereotyped syntax, made using AVN labels. c-d. Transition matrix of an example c. FP1 KD zebra finch with disrupted syntax, and d. a typical adult zebra finch with stereotyped syntax, made using AVN labels. Cells represent the probability of the bird producing a ‘following syllable’ with a particular label, given that the last syllable produced belonged to the type ‘preceding syllable’. e. Mean repetition bout lengths (number of times a syllable is produced in a row each time it is sung) for the syllable type with the highest mean repetition bout length per bird (n = 20 typical birds, 8 isolate birds and 7 FP1 KD birds) (One Way ANOVA F(2, 32) = 1.09, p = 0.35). f. Coefficient of variation of repetition bout length for the syllable type with the highest mean repetition bout length per bird (n = 20 typical birds, 8 isolate birds and 7 FP1 KD birds) (One Way ANOVA F(2, 32) = 4.65, p <0.05. * indicates Tukey HSD p-adj < 0.05).
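
The repetition features in panels e and f can be sketched from label sequences alone. The example below collects the run lengths of each syllable type, picks the type with the highest mean run length, and reports its mean and coefficient of variation. Using the population standard deviation for the CV is an assumption, and the function names are hypothetical.

```python
from itertools import groupby
from statistics import mean, pstdev


def repetition_bout_lengths(bouts):
    """Map each syllable label to the lengths of its consecutive runs."""
    lengths = {}
    for bout in bouts:
        for label, run in groupby(bout):
            lengths.setdefault(label, []).append(len(list(run)))
    return lengths


def most_repeated_syllable(bouts):
    """Label, mean run length and CV for the most-repeated syllable type.

    Assumption: CV is the population standard deviation over the mean.
    """
    lengths = repetition_bout_lengths(bouts)
    label = max(lengths, key=lambda key: mean(lengths[key]))
    runs = lengths[label]
    run_mean = mean(runs)
    return label, run_mean, pstdev(runs) / run_mean


if __name__ == "__main__":
    # Made-up label sequences: 'i' is repeated 3 times, then once.
    print(most_repeated_syllable([list("iiiabc"), list("iabcc")]))
```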

a. Correlation between the syllable duration distribution entropies calculated using AVN labels or manual annotation (n = 35 birds, r = 0.85, p <0.005). b. Example syllable duration distribution calculated using manual annotations or WhisperSeg segmentations of syllables from a typical adult zebra finch. c. Correlation between the short silent gap duration distribution entropies calculated using AVN labels or manual annotation (n = 35 birds, r = 0.88, p<0.005). d. Example silent gap duration distribution calculated using manual annotations or WhisperSeg segmentations of syllables from a typical adult zebra finch. e. Comparison of syllable duration entropies across typical (n = 20), isolate (n = 8) and FP1 KD (n = 7) adult zebra finches (>90dph) and juvenile zebra finches (n = 11, 50-51 dph) (One Way ANOVA F(3, 43) = 17.43, p < 0.005. * indicates Tukey HSD p-adj < 0.05). f. Comparison of silent gap duration entropies across typical (n = 20), isolate (n = 8) and FP1 KD (n = 7) adult zebra finches (>90dph) and juvenile zebra finches (n = 11, 50-51 dph) (One Way ANOVA F(3, 43) = 8.03, p < 0.005. * indicates Tukey HSD p-adj < 0.05).

a-b. Example rhythm spectrogram for a. a typical adult zebra finch and b. a typical juvenile zebra finch (50dph). Cyan points indicate the frequency band with the highest power in each song file’s rhythm spectrum. These frequencies are used to calculate the rhythm spectrogram peak frequency CVs, as in c. c. CV of the peak frequency across song files in a bird’s rhythm spectrogram for typical (n = 20), isolate (n = 8) and FP1 KD (n = 7) adult zebra finches (>90dph) and juvenile zebra finches (n = 11, 50-51 dph) (One Way ANOVA F(3, 43) = 8.26, p < 0.005. * indicates Tukey HSD p-adj < 0.05). d. Additional example rhythm spectrograms from 3 typical adult zebra finches (>90dph) and e. 3 juvenile zebra finches (50-51 dph).
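
The peak frequency CV in panel c can be sketched as follows, assuming each song file's rhythm spectrum is available as a list of power values aligned with a shared list of frequencies. The input layout, the use of the population standard deviation, and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev


def peak_frequency_cv(spectra, frequencies):
    """CV of the highest-power frequency across song files.

    spectra: one list of power values per song file.
    frequencies: frequency of each bin, shared by all spectra.
    """
    peaks = []
    for spectrum in spectra:
        # Index of the bin with the highest power in this file's spectrum.
        peak_index = max(range(len(spectrum)), key=spectrum.__getitem__)
        peaks.append(frequencies[peak_index])
    return pstdev(peaks) / mean(peaks)


if __name__ == "__main__":
    # Made-up spectra: two files peak at 2 Hz, one at 4 Hz.
    frequencies = [1.0, 2.0, 3.0, 4.0]
    spectra = [
        [0.1, 0.9, 0.2, 0.1],
        [0.1, 0.8, 0.3, 0.1],
        [0.2, 0.1, 0.1, 0.9],
    ]
    print(peak_frequency_cv(spectra, frequencies))
```

A bird whose peak sits in the same band in every file gets a CV of 0, consistent with the stable banding of the adult example in panel a.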

a. Feature weights of an LDA model fit to discriminate between typical and isolate zebra finches. Features with positive weights tend to have higher values for isolate birds, and features with negative weights tend to have higher values for typical birds. b. Feature weights of an LDA model fit to discriminate between typical and deaf zebra finches. Features with positive weights tend to have higher values for deaf birds, and features with negative weights tend to have higher values for typical birds. c. Feature weights for an LDA model trained to discriminate between typical, isolate, and deaf zebra finches. Each column reflects the feature weights for binary classification between the named group and both other groups combined. Features are ordered according to their total absolute weights.

a-d. Dissimilarity score distributions for comparisons between a pupil and itself (n = 30 comparisons), a pupil and its tutor (n = 30 comparisons), two pupils which share the same tutor (aka pupil vs. ‘Sibling’, n = 58 comparisons), and between two pupils who don’t share a tutor (aka pupil vs. unrelated bird, n = 90 comparisons). Calculated with a dataset of 30 typical tutor-pupil pairs from the UTSW colony using different combinations of VAE or Triplet loss for dimensionality reduction and EMD or MMD for dissimilarity calculations. e. Contrast index values for pupil vs. self and pupil vs. unrelated bird comparisons for 30 birds from UTSW across different dimensionality reduction and dissimilarity calculation methods. f. Tutor contrast index values for pupil vs. tutor and pupil vs. unrelated bird comparisons for 30 birds from UTSW across different dimensionality reduction and dissimilarity calculation methods.

a. MMD dissimilarity score distributions for comparisons between a pupil and itself (n = 5 comparisons), a pupil and its tutor (n = 5 comparisons), two pupils which share the same tutor (aka pupil vs. ‘Sibling’, n = 14 comparisons), and between two pupils who don’t share a tutor (aka pupil vs. unrelated bird, n = 15 comparisons). Calculated with a dataset of 5 FP1 KD pupils from the UTSW colony, and an additional 30 typical pupils for the ‘Sibling’ and ‘unrelated’ comparisons. b. Correlation between Sound Analysis Pro 2011 % similarity scores and human expert judgements of song similarity for 14 tutor-pupil pairs from UTSW (r = 0.32, p = 0.23). c. Contrast index values for pupil vs. self and pupil vs. unrelated bird comparisons for 30 birds from UTSW and 25 birds from Rockefeller (t-test, p = 0.47). d. Tutor contrast index values for pupil vs. tutor and pupil vs. unrelated bird comparisons for 30 birds from UTSW and 25 birds from Rockefeller (t-test, p = 0.06).

Similarity scoring deep neural network architecture.