Statistical modeling of subcoda structure in sperm whales.

A Sperm whale communication consists of rhythmic sequences of clicks, called codas. A coda is specified by a sequence of inter-click intervals (ICIs). Codas are classified into types based on their rhythmic pattern, which can have various degrees of regularity (e.g., 4R2 vs 1+3). B Social groups of sperm whales employ specific vocal repertoires: the set of coda types they use and their associated usage frequencies. As an illustration, we show those of the EC1 and EC2 clans from the Dominica dataset [12]. Only the most numerous coda types are shown: the rest of the vocal repertoires consists of more coda types with residual frequencies. C The subcoda structure can be modeled by considering rhythmic variations within codas of the same coda type. To do so, codas are represented as sequences of discrete inter-click intervals (dICIs), by discretizing absolute ICIs into discrete bins, which can then be considered akin to symbols (e.g. A, B, C, …), providing a tokenization of codas. Different instances of a single coda type can correspond to slightly different dICI sequences. The resulting dICI sequences are modeled using variable-length Markov chains, which can be represented as subcoda trees. These trees can be built for an individual speaker or for a group of speakers, and capture the statistical and memory structures of rhythm variations within codas and in the transitions between those. In other words, the tree captures a vocal style—how they say what they say. The vocal styles of different groups of sperm whales can be quantitatively compared by calculating a distance between their subcoda trees.

Summary of key concepts

Vocal style recovers social structure of vocal clans in Dominica sperm whales.

A We show the similarity of vocal style, measured as subcoda tree-distance, among social units within a vocal clan (within, darker color shade) and between two clans (between, lighter color shade). We used the manual clan assignments from [21] as ground truth. Vocal style is more similar within clans than between clans. B We show the hierarchical clustering of social unit subcoda trees. Each leaf corresponds to a social unit, and the colors below show their known clan assignments. The clustering recovers the two-clan structure observed in past work [21].

Vocal style recovers social structure of vocal clans in Pacific Ocean sperm whales.

A We show the similarity of vocal style, measured as subcoda tree-distance, between coda samples within a vocal clan (within, darker color shade) and between a clan and all others (between, lighter color shade). We used the vocal clans identified in [11]. Vocal style is more similar within a clan than between clans. B We show the hierarchical clustering of subcoda trees. Each leaf corresponds to a coda sample, and the colors below show their vocal clan assignments (based on coda usage) from [11]. We find generally good overlap between the groups obtained from clustering vocal style and those from vocal repertoire, with the exception of the Short clan (red) that is somewhat mixed with the Palindrome (orange) and Rapid Increasing (yellow) clans.

Clan overlap influences non-ID coda vocal styles

Comparing the similarities of different VLMC models fit for each Pacific Ocean clan for both ID and non-ID coda samples. The y-axis represents the measured distance between the subcoda trees, and the x-axis shows the geographical clan overlap (as calculated in [11]). Each point represents a pairwise comparison between two clans. The effect of overlap on ID coda vocal style similarity is minimal and non-significant while the opposite is true for non-ID codas: overlapping clans produce non-ID codas with a more similar vocal style. The VLMC distances are also typically much greater for ID codas than for non-ID codas. Note that while these results are visually opposite to those reported in Hersh et al. [10], they support the same final conclusions (see “Identity and non-identity codas show different trends” for details).

Representing sperm whale communication as a sequence of symbols.

a The temporal patterns of click onsets are equivalently encoded as sequences of inter-click intervals (ICIs). b We discard ICI values larger than a threshold tmax seconds, and (c) discretize the others into a finite set of bins of width δt seconds. Each bin is then assigned a symbol (here a letter of the Latin alphabet), so that each symbol represents an ICI in a given range. d We then represent sperm whale communication as a sequence of these symbols.
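The preprocessing in panels a to d can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `clicks_to_symbols`, the choice to keep intervals equal to the threshold, and the restriction to 26 uppercase letters are all assumptions made for the example.

```python
import string


def clicks_to_symbols(click_times, t_max, delta_t):
    """Map click onset times (in seconds) to a list of ICI symbols.

    Hypothetical helper: names and edge-case handling are assumptions.
    """
    # (a) ICIs are the differences between consecutive click onsets.
    icis = [later - earlier for earlier, later in zip(click_times, click_times[1:])]
    # (b) Discard ICIs larger than the threshold t_max.
    kept = [ici for ici in icis if ici <= t_max]
    # (c) Discretize into bins of width delta_t; bin 0 covers [0, delta_t).
    symbols = []
    for ici in kept:
        bin_index = int(ici // delta_t)
        if bin_index >= len(string.ascii_uppercase):
            raise ValueError("more bins than letters; increase delta_t or lower t_max")
        # (d) Each bin is named by one letter of the Latin alphabet.
        symbols.append(string.ascii_uppercase[bin_index])
    return symbols


if __name__ == "__main__":
    print(clicks_to_symbols([0.0, 0.12, 0.27, 1.5, 1.55], t_max=0.5, delta_t=0.1))
```

In this toy input the 1.23 s gap exceeds `t_max` and is dropped, while the remaining intervals fall into the second, second, and first bins.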

Count of Codas

Count of Coda Types

Fixed length Markov model bifurcates as a function of memory length h.

We show (a) the mean, and (b) the variance of the transition probabilities as a function of h. We also show (c) the number of non-zero transition probabilities per state, and (d) the AIC. The dashed and dotted lines represent two different temporal resolutions (i.e., bin size used during Preprocessing). There is a bifurcation around h ≈ 3 that suggests subcoda structure.

Steps for fitting a VLMC from a sequence.

Step 1 (a): Construction of the full (or saturated) suffix tree up to maximum depth D. Step 2 (b): Assigning a probability measure to every element of the tree. Step 3 (c): Pruning of nodes that carry the same information content as their parents according to Eq. (S10). In this example, context CBA is relevant because it changes the distribution of transition probabilities with respect to its parent BA, and therefore it is not pruned.
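The three steps can be illustrated with a short sketch. Eq. (S10) is not reproduced in this caption, so the pruning rule below is an assumption: it uses a common VLMC criterion, the count-weighted KL divergence between a context's next-symbol distribution and that of its parent, compared against a cutoff. The function name `fit_vlmc` and the convention that contexts are stored oldest symbol first are also choices made for this example.

```python
import math
from collections import Counter, defaultdict


def fit_vlmc(seq, max_depth, cutoff):
    """Return {context: next-symbol distribution} for the retained contexts."""
    # Step 1: count which symbol follows every context of length 0..max_depth.
    counts = defaultdict(Counter)
    for i in range(len(seq)):
        for depth in range(max_depth + 1):
            if i - depth < 0:
                break
            counts[tuple(seq[i - depth:i])][seq[i]] += 1

    # Step 2: turn the counts of a context into transition probabilities.
    def probs(context):
        total = sum(counts[context].values())
        return {sym: c / total for sym, c in counts[context].items()}

    # Step 3: keep a context only if it changes the prediction of its parent
    # (the same context without its oldest symbol) by at least `cutoff`.
    kept = {()}  # the root (empty context) is always kept
    for context in list(counts):
        if not context:
            continue
        p, q = probs(context), probs(context[1:])
        n = sum(counts[context].values())
        gain = n * sum(p[s] * math.log(p[s] / q[s]) for s in p)
        if gain >= cutoff:
            # Keep the context and all of its suffixes so the tree stays connected.
            for k in range(len(context)):
                kept.add(context[k:])
    return {context: probs(context) for context in kept}


if __name__ == "__main__":
    model = fit_vlmc("ABCABCABCABC", max_depth=2, cutoff=1.0)
    for context in sorted(model, key=len):
        print(context, model[context])
```

On the periodic toy sequence, one symbol of memory already determines the next symbol, so every depth-2 context adds no information over its parent and is pruned.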

Selecting the optimal information cutoff K.

The AIC for different VLMC models as a function of the cutoff value (K) used to build them. Above are context trees associated with three different VLMCs. The trees decrease in size as the threshold increases, since the information gain has to be greater for a context to be “accepted” in the tree. This effect is also visible in the average path length, which can be seen as the average memory of the model.

Selecting the optimal temporal resolution.

The temporal resolution parameter δt for the preprocessing steps is chosen by subtracting the AIC of the 0th-order (memoryless) Markov model from the AIC of a VLMC model. This difference weighs how well the model predicts against how easy the data are to predict. The AIC alone is not sufficient because the fitted dataset itself changes with the resolution parameter, so models with different δt are not directly comparable. The minimum value gives the best resolution parameter, although any value between 0.01 and 0.1 seems acceptable. We show this for three individual sperm whales (colors) from the Dominica dataset.
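The selection score can be written out as a small sketch. It assumes the standard definition AIC = 2k − 2 ln L and a 0th-order model whose parameters are the symbol frequencies; the function names, and the idea that the VLMC log-likelihood and parameter count are supplied by a separate fitting routine, are assumptions for illustration.

```python
import math
from collections import Counter


def aic(log_likelihood, n_params):
    """Standard Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def order0_aic(symbols):
    """AIC of the memoryless model that predicts symbols by their frequency."""
    counts = Counter(symbols)
    n = len(symbols)
    log_likelihood = sum(c * math.log(c / n) for c in counts.values())
    # One free probability per symbol, minus one for normalization.
    return aic(log_likelihood, len(counts) - 1)


def resolution_score(vlmc_log_likelihood, vlmc_n_params, symbols):
    """AIC(VLMC) - AIC(0th order) on the same discretized sequence."""
    return aic(vlmc_log_likelihood, vlmc_n_params) - order0_aic(symbols)


if __name__ == "__main__":
    # Placeholder numbers only: a hypothetical VLMC fit on the sequence "AABB".
    print(resolution_score(-1.0, 3, "AABB"))
```

Under this reading, the score would be evaluated once per candidate δt, each time on the sequence discretized at that resolution, and the δt with the lowest score retained.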

Measuring dissimilarity between two VLMC models.

The two trees represent two VLMC models built over the same alphabet χ. We measure their dissimilarity, or distance, with the KL divergence.

Comparison of the distribution of distances between VLMC models, with statistical significance.

Each color represents a clan of a given dataset: Pacific (A) and Dominica (B). The within distribution represents the distances between VLMC models fit on elements of the same clan. The between distribution represents the distances between VLMC models of different clans.

VLMC distance and geographical overlap relation by coda type.

Each bar represents a number of unique coda types. Note that almost all coda types have a negative correlation with geographical overlap, although only a small portion is significant. The p-values shown are with respect to the Pearson correlation (see Table S4).

Distance distribution statistics.

Table with the p-values and effect sizes for the comparison of the within/between distance distributions of coda samples from the Dominica clans (top) and the Pacific clans (bottom), under the Kolmogorov-Smirnov test (K-S test) and the t-test.

Results are significant for different resolutions.

The dotted line represents the 0.05 significance threshold. When varying the time resolution used for fitting the VLMCs, we observe that our main results persist: the vocal style of ID codas is not related to geographical overlap, while that of non-ID codas always is. The bottom row shows two illustrations of the bin sizes (and consequently the number of different states) resulting from preprocessing with time resolutions of 0.01 and 0.1 seconds, respectively.

Clans vocalize different coda types in a distinctive manner

Boxplots of the distances between VLMCs, segmented by coda type and clan (Pacific A and Dominica B). We observe that VLMCs fit on different coda types from the same clan are more similar to each other than to VLMCs from other clans.

A) and C) Rhythmic variations within coda types.

All codas of type 8R that were vocalized by whales of the Dominica clans EC1 (blue) and EC2 (green), along with the average position and variation over all vocalizations. B) No correlation with coda type distribution. Comparison between the VLMC distance and the KL divergence (entropy) between coda type distributions on the data used to fit the VLMCs.

Adjusted Rand Score for VLMC clustering dendrogram

Comparison of the dendrogram obtained from the VLMC distance between the trees with the vocal clan labels. The values of the Adjusted Rand Score go from −0.5 to 1.
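For reference, the score can be computed from two label lists with the standard pair-counting formula. The sketch below is a plain-Python illustration; it assumes the dendrogram has already been cut into flat cluster labels, a step not described in this caption.

```python
from collections import Counter
from math import comb


def adjusted_rand_score(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of items placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    clans = ["EC1", "EC1", "EC2", "EC2"]
    print(adjusted_rand_score(clans, [0, 0, 1, 1]))  # clusters match the clans
    print(adjusted_rand_score(clans, [0, 1, 0, 1]))  # clusters cut across the clans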

Bootstrapping regression results of Figure 4.

Resulting distribution of slope values for subsamples of the data for the ID coda case A and the non-ID case B. The shaded area represents the 95% confidence interval.
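A slope distribution of this kind can be produced with a resampling loop such as the one below. The caption does not state the resampling scheme, so resampling points with replacement, the ordinary least-squares fit, the percentile interval, and all function names are assumptions of this sketch.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slopes(xs, ys, n_boot=1000, seed=0):
    """Slopes fitted on n_boot resamples (with replacement) of the points.

    Assumes xs contains at least two distinct values.
    """
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_x = [xs[i] for i in idx]
        if len(set(sample_x)) < 2:
            continue  # slope undefined when every resampled x is identical
        slopes.append(ols_slope(sample_x, [ys[i] for i in idx]))
    return slopes


def percentile_interval(values, level=0.95):
    """Simple percentile interval taken from the sorted values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int((1 - level) / 2 * last)], ordered[int((1 + level) / 2 * last)]


if __name__ == "__main__":
    # Toy data only: overlap on x, tree distance on y.
    overlap = [0.0, 0.1, 0.2, 0.4, 0.5, 0.7]
    distance = [0.9, 0.85, 0.8, 0.6, 0.55, 0.4]
    print(percentile_interval(bootstrap_slopes(overlap, distance)))
```

If the resulting interval excludes zero, the slope would be read as significantly different from zero at the chosen level.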

VLMCs capture structure of sperm whale communication.

We show context trees associated with the variable length models built for two individual sperm whales: (a) ATWOOD, and (b) FORK. For comparison, we show (c) the corresponding tree after randomly shuffling ICIs from the same timeseries data used to build (a). Note how the shuffled version does not exhibit any structure. The orange node represents the root node, and the size of each node represents the number of occurrences of the associated context.

Deep learning classifier trained on real communication data generalises well to synthetic data generated by the VLMCs.

We show (a) the accuracy of the classifier on the real data (in black) and the synthetic data (in blue), as a function of the temporal resolution δt. The dashed grey curve highlights the maximum accuracy on the generated data. The classifier’s task was to identify which of (b) two clans the data belonged to: EC1 or EC2.

Vector Space embedding using UMAP

Embedding of all the Pacific Ocean clans according to their VLMC distance into two dimensions using UMAP.

Geographical overlap and subcoda distance by coda type.

Results of comparing the geographical overlap and VLMC distance segmented by coda types. Negative correlations are highlighted as well as p-values that are below 0.05. “Number of Clans” represents how many clans were compared.