Songbirds can learn flexible contextual control over syllable sequencing

  1. Lena Veit (corresponding author)
  2. Lucas Y Tian
  3. Christian J Monroy Hernandez
  4. Michael S Brainard (corresponding author)
  1. Center for Integrative Neuroscience and Howard Hughes Medical Institute, University of California, San Francisco, United States

Abstract

The flexible control of sequential behavior is a fundamental aspect of speech, enabling endless reordering of a limited set of learned vocal elements (syllables or words). Songbirds are phylogenetically distant from humans but share both the capacity for vocal learning and neural circuitry for vocal control that includes direct pallial-brainstem projections. Based on these similarities, we hypothesized that songbirds might likewise be able to learn flexible, moment-by-moment control over vocalizations. Here, we demonstrate that Bengalese finches (Lonchura striata domestica), which sing variable syllable sequences, can learn to rapidly modify the probability of specific sequences (e.g. ‘ab-c’ versus ‘ab-d’) in response to arbitrary visual cues. Moreover, once learned, this modulation of sequencing occurs immediately following changes in contextual cues and persists without external reinforcement. Our findings reveal a capacity in songbirds for learned contextual control over syllable sequencing that parallels human cognitive control over syllable sequencing in speech.

eLife digest

Human speech and birdsong share numerous parallels. Both humans and birds learn their vocalizations during critical phases early in life, and both learn by imitating adults. Moreover, both humans and songbirds possess specific circuits in the brain that connect the forebrain to midbrain vocal centers.

Humans can flexibly control what they say and how they say it by reordering a fixed set of syllables into endless combinations, an ability critical to human speech and language. Birdsong also varies with context: melodies to seduce a mate differ from aggressive songs that warn other males to stay away. However, it was unclear whether songbirds can also modify their songs independently of social or other naturally relevant contexts.

To test whether birds can control their songs in a purposeful way, Veit et al. trained adult male Bengalese finches to change the sequence of their songs in response to arbitrary colored lights that had no natural meaning to the birds. A computer program detected different variants of a sequence that the bird naturally produced (for example, “ab-c” versus “ab-d”) and rewarded the bird for singing one sequence when the light was yellow and the other when it was green. Gradually, the finches learned to modify their songs and were able to switch between the appropriate sequences as soon as the light cues changed. This ability persisted for days, even without any further training.

This suggests that songbirds can learn to flexibly and purposefully modify the way in which they sequence the notes in their songs, in a manner that parallels how humans control syllable sequencing in speech. Moreover, birds can learn to do this ‘on command’ in response to an arbitrarily chosen signal, even if it is not something that would impact their song in nature.

Songbirds are an important model to study brain circuits involved in vocal learning. They are one of the few animals that, like humans, learn their vocalizations by imitating conspecifics. The finding that they can also flexibly control vocalizations may help shed light on the interactions between cognitive processing and sophisticated vocal learning abilities.

Introduction

A crucial aspect of the evolution of human speech is the development of flexible control over learned vocalizations (Ackermann et al., 2014; Belyk and Brown, 2017). Humans have unparalleled control over their vocal output, with a capacity to reorder a limited number of learned elements to produce an endless combination of vocal sequences that are appropriate for current contextual demands (Hauser et al., 2002). This cognitive control over vocal production is thought to rely on the direct innervation of brainstem and midbrain vocal networks by executive control structures in the frontal cortex, which have become more elaborate over the course of primate evolution (Hage and Nieder, 2016; Simonyan and Horwitz, 2011). However, because of the comparatively limited flexibility of vocal production in nonhuman primates (Nieder and Mooney, 2020), the evolutionary and neural circuit mechanisms that have enabled the development of this flexibility remain poorly understood.

Songbirds are phylogenetically distant from humans, but they have proven a powerful model for investigating neural mechanisms underlying learned vocal behavior. Song learning exhibits many parallels to human speech learning (Doupe and Kuhl, 1999); in particular, juveniles need to hear an adult tutor during a sensitive period, followed by a period of highly variable sensory-motor exploration and practice, during which auditory feedback is used to arrive at a precise imitation of the tutor song (Brainard and Doupe, 2002). This capacity for vocal learning is subserved by a well-understood network of telencephalic song control nuclei. Moreover, as in humans, this vocal control network includes strong projections directly from cortical (pallial) to brainstem vocal control centers (Doupe and Kuhl, 1999; Simonyan and Horwitz, 2011). These shared behavioral features and neural specializations raise the question of whether songbirds might also share the capacity to learn flexible control over syllable sequencing.

Contextual variation of song in natural settings, such as territorial counter-singing or female-directed courtship song, indicates that songbirds can rapidly alter aspects of their song, including syllable sequencing and selection of song types (Chen et al., 2016; Heinig et al., 2014; King and McGregor, 2016; Sakata et al., 2008; Searcy and Beecher, 2009; Trillo and Vehrencamp, 2005). However, such modulation of song structure is often described as affectively controlled (Berwick et al., 2011; Nieder and Mooney, 2020). For example, the presence of potential mates or rivals elicits a global and unlearned modulation of song intensity (James et al., 2018) related to the singer’s level of arousal or aggression (Alcami et al., 2021; Heinig et al., 2014; Jaffe and Brainard, 2020). Hence, while prior observations suggest that a variety of ethologically relevant factors can be integrated to influence song production in natural settings, it remains unclear whether song can be modified more flexibly by learned or cognitive factors.

Here, we tested whether Bengalese finches can learn to alter specifically targeted vocal sequences within their songs in response to arbitrarily chosen visual cues, independent of social or other natural contexts. Each Bengalese finch song repertoire includes ~5–12 acoustically distinct elements (‘syllables’) that are strung together into sequences in variable but non-random order. For a given bird, the relative probabilities of specific transitions between syllables normally remain constant over time (Okanoya, 2004; Warren et al., 2012), but previous work has shown that birds can gradually adjust the probabilities of alternative sequences in response to training that reinforces the production of some sequences over others. In this case, changes to syllable sequencing develop over a period of hours to days (Warren et al., 2012). In contrast, we investigate here whether birds can learn to change syllable sequencing on a moment-by-moment basis in response to arbitrary visual cues that signal which sequences are adaptive at any given time. Our findings reveal that songbirds can learn to immediately, flexibly, and adaptively adjust the sequencing of selected vocal elements in response to learned contextual cues.

Results

Bengalese finches can learn context-dependent syllable sequencing

For each bird in the study, we first identified variably produced syllable sequences that could be gradually modified using a previously described aversive reinforcement protocol (‘single context training’; Tumer and Brainard, 2007; Warren et al., 2012). For example, a bird that normally transitioned from the fixed syllable sequence ‘ab’ to either ‘c’ or ‘d’ (Figure 1A,B, sequence probability of ~36% for ‘ab-c’ and ~64% for ‘ab-d’) was exposed to an aversive burst of white noise (WN) feedback immediately after the ‘target sequence’ ‘ab-d’ was sung. In response, the bird learned over a period of days to gradually decrease the relative probability of that sequence in favor of the alternative sequence ‘ab-c’ (Figure 1C). This change in sequence probabilities was adaptive in that it enabled the bird to escape from WN feedback. Likewise, when the sequence ‘ab-c’ was targeted, the probability of ‘ab-d’ increased gradually over several days of training (Figure 1D). These examples are consistent with prior work that showed such sequence modifications develop over a period of several days, with the slow time course suggesting a gradual updating of synaptic connections within syllable control networks in response to performance-related feedback (Warren et al., 2012). In contrast, the ability to immediately and flexibly reorder vocal elements in speech must reflect mechanisms that enable contextual factors to exert moment-by-moment control over selection and sequencing of alternative vocal motor programs. Having identified sequences for each bird for which the probability of production could be gradually modified in this manner, we then tested whether birds could be trained to rapidly switch between those same sequences in a context-dependent manner.

Bengalese finches can learn context-dependent sequencing.

(A) Example spectrogram highlighting points in song with variable sequencing. Syllables are labeled based on their spectral structure, target sequences for the different experiments (ab-c and ab-d) are marked with colored bars. Y-axis shows frequency in Hz. (B) Transition diagram with probabilities for sequences ab-c and ab-d. The sequence probability of ab-d (and complementary probability ab-c) stayed relatively constant over five days. Shaded area shows 95% confidence interval for sequence probability. Source data in Figure 1—source data 3. (C) Aversive reinforcement training. Schematic showing aversive WN after target sequence ab-d; spectrogram shows WN stimulus, covering part of syllable d. WN targeted to sequence ab-d led to a gradual decrease in the probability of that sequence over several days, and a complementary increase in the probability of ab-c. (D) WN targeted to ab-c led to a gradual increase in the sequence probability of ab-d. Source data in Figure 1—source data 2. (E) Schematic of the contextual learning protocol, with target for WN signaled by colored lights. (F) Left: Two example days of baseline without WN but with alternating blocks of green and yellow context. Colors indicate light context (black indicates periods of lights off during the night), error bars indicate SEM across song bouts in each block. Right: Average sequence probability in yellow and green blocks during baseline. Open circles show individual blocks, error bars show SEM across blocks. (G) Left: Two example days after training (WN on). Right: Average sequence probability in yellow and green blocks after training. (H) Contextual difference in sequence probability for eight trained birds before and after training (**p<0.01 signed rank test). Source data in Figure 1—source data 1.

Figure 1—source data 1

Switch magnitude during baseline and after training for all birds, to generate Figure 1H, and plots like Figure 1F,G for all birds.

https://cdn.elifesciences.org/articles/61610/elife-61610-fig1-data1-v1.mat
Figure 1—source data 2

Sequence data for the example bird during single-context training, to generate Figure 1C,D.

https://cdn.elifesciences.org/articles/61610/elife-61610-fig1-data2-v1.mat
Figure 1—source data 3

Sequence data for the example bird during baseline, to generate Figure 1B.

https://cdn.elifesciences.org/articles/61610/elife-61610-fig1-data3-v1.mat

To determine whether Bengalese finches can learn to flexibly select syllable sequences on a moment-by-moment basis, we paired WN targeting of specific sequences with distinct contextual cues. In this context-dependent training protocol, WN was targeted to defined sequences in the bird’s song as before, but the specific target sequence varied across alternating blocks, signaled by different colored lights in the home cage (see Materials and methods). Figure 1E shows an example experiment, with ‘ab-d’ targeted in yellow light, and ‘ab-c’ in green light. At baseline, without WN, switches between yellow and green contexts (at random intervals of 0.5–1.5 hr) did not lead to significant changes in the relative proportion of the target sequences, indicating that there was no inherent influence of the light cues on sequence probabilities (Figure 1F, p(ab-d) in yellow vs. green context was 67 ± 1.6% vs. 64 ± 1.5%, p=0.17, rank-sum test, n = 53 context blocks from baseline period). Training was then initiated in which WN was alternately targeted to each sequence, over blocks that were signaled by light cues. After 2 weeks of such context-specific training, significant sequencing differences developed between light contexts that were appropriate to reduce aversive feedback in each context (Figure 1G, p(ab-d) in yellow vs. green context shifted to 36.5 ± 4.8% vs. 83.1 ± 3.5%, p<0.01, rank-sum test, n = 22 context blocks, block duration between 1 and 2.5 hr). Likewise, for all birds trained on this protocol (n = 8), context-dependent sequencing differences developed in the appropriate direction over a period of weeks (27 ± 6% difference in probabilities between contexts after a mean of 33 days training, versus 1% ± 2% average difference in probabilities at baseline; p<0.01, n = 8, signed rank test, Figure 1H). Thus, Bengalese finches are able to learn context-specific modifications to syllable sequencing.

Syllable sequencing shifts immediately following switches in context

Contextual differences between different blocks could arise through an immediate shift in sequence probabilities upon entry into a new context and/or by rapid learning within each block. We examined whether trained birds exhibited any immediate shifts in their syllable sequencing when entering a new light context by computing the average probability of target sequences across songs aligned with the switch between contexts (Figure 2A,B, example experiment). This ‘switch-triggered average’ revealed that across all birds, switches to the yellow context were accompanied by an immediate decrease in the probability of the yellow target sequence, whereas switches out of the yellow context (and into the green context) led to an immediate increase in the yellow target sequence (Figure 2C,D, p<0.05, signed rank test comparing first and last song, n = 8). To quantify the size of these immediate shifts, we calculated the difference in sequence probability from the last five songs in the previous context to the first five songs in the current context; this difference averaged 0.24 ± 0.06 for switches to green light and −0.22 ± 0.06 for switches to yellow light (Figure 2E,F). These results indicate that birds could learn to immediately recall an acquired memory of context-appropriate sequencing upon entry into each context, even before having the chance to learn from reinforcing feedback within that context.
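
As an illustration of this analysis, the following Python sketch shows one way to compute a switch-triggered average from per-song sequence probabilities. It is not the analysis code used in the study (which was written in MATLAB); the data layout, function names, and toy values are hypothetical.

```python
# Minimal sketch (not the authors' code): switch-triggered average of
# per-song sequence probabilities. `blocks` is assumed to be a
# chronologically ordered list of lists, one list of p(target) values
# per light block.
import numpy as np

def switch_triggered_average(blocks, n_pre=20, n_post=80):
    """Mean and SEM of sequence probability aligned to context switches.
    Positions 0..n_pre-1 hold the last songs of the previous block,
    positions n_pre.. hold the first songs of the new block."""
    aligned = []
    for prev, curr in zip(blocks[:-1], blocks[1:]):
        row = np.full(n_pre + n_post, np.nan)
        pre = prev[-n_pre:]                     # last songs before the switch
        post = curr[:n_post]                    # first songs after the switch
        row[n_pre - len(pre):n_pre] = pre
        row[n_pre:n_pre + len(post)] = post
        aligned.append(row)
    aligned = np.vstack(aligned)
    mean = np.nanmean(aligned, axis=0)
    sem = np.nanstd(aligned, axis=0) / np.sqrt(len(aligned))
    return mean, sem

# Toy example: probability of the yellow target drops in yellow blocks
yellow = [0.40, 0.35, 0.30, 0.38]
green = [0.80, 0.85, 0.90, 0.82]
mean_p, sem_p = switch_triggered_average([green, yellow, green, yellow],
                                          n_pre=3, n_post=3)
```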

Sequence probabilities shift immediately following a switch in context.

(A, B) Average sequence probability per song for example Bird 1 aligned to switches from green to yellow context (A) and from yellow to green context (B). Error bars indicate SEM across song bouts (n = 35 switches (A), n = 33 switches (B)). (C) Changes in sequence probability from the last song in green context to the first song in yellow context for all eight birds. Example bird in (AB) highlighted in bold. **p<0.01 signed rank test. (D) Changes in sequence probability from the last song in yellow context to the first song in green context. *p<0.05 signed rank test. (E) Shift magnitudes for all birds, defined as the changes in sequence probability from the last five songs in the green context to the first five songs in the yellow context. Open circles show individual birds, error bars indicate SEM across birds. (F) Same as (E) for switches from yellow to green. Source data in Figure 2—source data 1. (G) Shift magnitudes over training time for the example bird (11 days and 49 context switches; seven of the original 56 context switches are excluded from calculations of shift magnitudes because at least one of the involved blocks contained only one or two song bouts.). (H) Trajectory of switch-aligned sequence probabilities for the example bird early in training (red) and late in training (blue). Probabilities are normalized by the sequence probability in preceding block, and plotted so that the adaptive direction is positive for both switch directions (i.e. inverting the probabilities for switches to yellow.) (I) Slopes of fits to the sequence probability trajectories over song bouts within block. Units in change of relative sequence probability per song bout. (K) Intercepts of fits to sequence probability trajectories over song bouts within block. Units in relative sequence probability. (L) Changes in slopes and changes in intercepts for five birds over the training process, determined as the slopes of linear fits to curves as in (I and K) for each bird. Source data in Figure 2—source data 2.

Figure 2—source data 1

Switch magnitude between all contexts after training, to generate Figures 2C–F and 3E–H.

https://cdn.elifesciences.org/articles/61610/elife-61610-fig2-data1-v1.mat
Figure 2—source data 2

Summary of training data, to generate Figure 2L.

https://cdn.elifesciences.org/articles/61610/elife-61610-fig2-data2-v1.mat

We next asked whether training additionally led to an increased rate of learning within each context, which also might contribute to increased contextual differences over time. Indeed, such faster re-learning for consecutive encounters of the same training context, or ‘savings’, is sometimes observed in contextual motor adaptation experiments (Lee and Schweighofer, 2009). To compare the magnitude of the immediate shift and the magnitude of within-block learning over the course of training, we plotted the switch-aligned sequence probabilities at different points in the training process. Figure 2G shows for the example bird that the magnitude of the shift (computed between the first and last five songs across context switches) gradually increased over 11 days of training. Figure 2H shows the switch-aligned sequence probability trajectories (as in Figure 2A,B) for this bird early in training (red) and late in training (blue), binned into groups of seven context switches. Qualitatively, there was both an abrupt change in sequence probability at the onset of each block (immediate shift at time point 0) and a gradual adjustment of sequence probability within each block (within-block learning over the first 80 songs following light switch). Over the course of training, the immediate shift at the onset of each block got larger, while the gradual change within blocks stayed approximately the same (learning trajectories remained parallel over training, Figure 2H). Linear fits to the sequence probabilities for each learning trajectory (i.e. the right side of Figure 2H) reveal that, indeed, the change in sequence probability at the onset of blocks (i.e. intercepts) increased over the training process (Figure 2K), while the rate of change within blocks (i.e. slopes) stayed constant (Figure 2I). To quantify this across birds, we measured the change over the course of learning in both the magnitude of immediate shifts (estimated as the intercepts from linear fits) and the rate of within-block learning (estimated as the slopes from linear fits). As for the example bird, we found that the rate of learning within each block stayed constant over time for all five birds (Figure 2L). In contrast, the magnitude of immediate shifts increased over time for all birds (Figure 2L). These analyses indicate that adjustments to sequence probability reflect two dissociable processes: an immediate cue-dependent shift in sequence probability at the beginning of blocks, which increases with contextual training, and a gradual adaptation of sequence probability within blocks, which does not.

Visual cues in the absence of reinforcement are sufficient to evoke sequencing changes

The ability of Bengalese finches to implement an immediate shift in sequencing on the first rendition in a block – and thus before they have a chance to learn from reinforcing feedback – argues that they can maintain context-specific motor memories and use contextual visual cues to anticipate correct sequencing in each context. To explicitly test whether birds can flexibly switch between sequencing appropriate for distinct contexts using only visual cues, we included short probe blocks which presented the same light cues without WN stimulation. Probe blocks were interspersed in the sequence of training blocks so that each switch between types of blocks was possible and, on average, every third switch was into a probe block (see Materials and methods). Light switches into probe blocks were associated with similar magnitude shifts in sequence probability as switches into WN blocks of the corresponding color (−0.22 ± 0.06 to both yellow WN and yellow probe blocks from green WN blocks, p=0.94, signed rank test; 0.24 ± 0.06 to green WN and 0.23 ± 0.07 to green probe blocks from yellow WN blocks, p=0.64, signed rank test). As the most direct test of whether light cues alone evoke adaptive sequencing changes, we compared songs immediately before and after switches between probe blocks without intervening WN training blocks (probe-probe switches). Figure 3A,B shows song bouts for one example bird (Bird 2) which were sung consecutively across a switch from yellow probe to green probe blocks. In the first song following the probe-probe switch, the yellow target sequence (‘f-ab’) was more prevalent, and the green target sequence (‘n-ab’) was less prevalent, and such an immediate effect was also apparent in the average sequence probabilities for this bird aligned to probe–probe switches (Figure 3C,D). Similar immediate and appropriately directed shifts in sequencing at switches between probe blocks were observed for all eight birds (Figure 3E,F, p<0.05 signed rank test, n = 8), with average shifts in sequence probabilities of −0.21 ± 0.09 and 0.17 ± 0.08 (Figure 3G,H). The presence of such changes in the first songs sung after probe–probe switches indicates that visual cues alone are sufficient to cause anticipatory, learned shifts between syllable sequences.

Figure 3 with 1 supplement
Contextual cues alone are sufficient to enable immediate shifts in syllable sequencing.

(A,B) Examples of songs sung by Bird 2 immediately before (A) and after (B) a switch from a yellow probe block to a green probe block (full song bouts in Figure 3—figure supplement 1). Scale for x-axis is 500 ms; y-axis shows frequency in Hz. (C, D) Average sequence probability per song for Bird 2 aligned to switches from green probe to yellow probe blocks (C) and from yellow probe to green probe blocks (D). Error bars indicate SEM across song bouts (n = 14 switches (C), 11 switches (D)). (E, F) Average sequence probabilities for all eight birds at the switch from the last song in green probe context and the first song in yellow probe context, and vice versa. Example Bird 2 is shown in bold. *p<0.05 signed rank test. (G, H) Shift magnitudes for probe–probe switches for all birds. Open circles show individual birds; error bars indicate SEM across birds. Source data in Figure 2—source data 1.

Contextual changes are specific to target sequences

A decrease in the probability of a target sequence in response to contextual cues must reflect changes in the probabilities of transitions leading up to the target sequence. However, such changes could be restricted to the transitions that immediately precede the target sequence, or alternatively could affect other transitions throughout the song. For example, for the experiment illustrated in Figure 1, the prevalence of the target sequence ‘ab-d’ was appropriately decreased in the yellow context, in which it was targeted. The complete transition diagram and corresponding transition matrix for this bird (Figure 4A,B) reveal that there were four distinct branch points at which syllables were variably sequenced (after ‘cr’, ‘wr’, ‘i’, and ‘aab’). Therefore, the decrease in the target sequence ‘ab-d’ could have resulted exclusively from an increase in the probability of the alternative transition ‘ab-c’ at the branch point following ‘aab’. However, a reduction in the prevalence of the target sequence could also have been achieved by changes in the probability of transitions earlier in song such that the sequence ‘aab’ was sung less frequently. To investigate the extent to which contextual changes in probability were specific to transitions immediately preceding target sequences, we calculated the difference between transition matrices in the yellow and green probe contexts (Figure 4C). This difference matrix indicates that changes to transition probabilities were highly specific to the branch point immediately preceding the target sequences (specificity was defined as the proportion of total changes which could be attributed to the branch points immediately preceding target sequences; specificity for branch point ‘aab’ was 83.2%). Such specificity to branch points that immediately precede target sequences was typical across experiments, including cases in which different branch points preceded each target sequence (Figure 4D–F, specificity 96.9%). Across all eight experiments, the median specificity of changes to the most proximal branch points was 84.95%, and only one bird, which was also the worst learner in the contextual training paradigm, had a specificity of less than 50% (Figure 4G). Hence, contextual changes were specific to target sequences and did not reflect the kind of global sequencing changes that characterize innate social modulation of song structure (Sakata et al., 2008; Sossinka and Böhner, 1980).

Figure 4 with 1 supplement
Contextual changes are local to the target sequences.

(A) Transition diagram for the song of Bird 6 (spectrogram in Figure 1) in yellow probe context. Sequences of syllables with fixed transition patterns (e.g. ‘aab’) as well as repeat phrases and introductory notes have been summarized as single states to simplify the diagram. (B) Transition matrix for the same bird, showing same data as in (A). (C) Differences between the two contexts are illustrated by subtracting the transition matrix in the yellow context from the one in the green context, so that sequence transitions which are more frequent in green context are positive (colored green) and sequence transitions which are more frequent in yellow are negative (colored yellow). For this bird, the majority of contextual differences occurred at the branch point (‘aab’) which most closely preceded the target sequences (‘ab-c’ and ‘ab-d’), while very little contextual difference occurred at the other three branch points (‘i’, ‘wr’, ‘cr’). (D–F) Same for Bird 2 for which two different branch points (‘f’ and ‘n’) preceded the target sequences (‘f-abcd’ and ‘n-abcd’) (spectrogram in Figure 3). (G) Proportion of changes at the branch point(s) most closely preceding the target sequences, relative to the total magnitude of context differences for each bird (see Materials and methods). Most birds exhibited high specificity of contextual changes to the relevant branch points. Source data in Figure 4—source data 1.

Figure 4—source data 1

Overview of different experimental parameters and song features for each bird, to generate Figure 4G and Figure 4—figure supplement 1.

https://cdn.elifesciences.org/articles/61610/elife-61610-fig4-data1-v1.mat

Distinct sequence probabilities are specifically associated with different visual cues

Our experiments establish that birds can shift between two distinct sequencing states in response to contextual cues. In order to test whether birds were capable of learning to shift to these two states from a third neutral context, we trained a subset of three birds with three different color-cued contexts. For these birds, after completion of training with WN targeted to distinct sequences in yellow and green contexts (as described above), we introduced interleaved blocks cued by white light in which there was no reinforcement. After this additional training, switches from the unreinforced context elicited changes in opposite directions for the green and yellow contexts (example bird Figure 5A). All birds (n = 3) showed adaptive sequencing changes for the first song bout in probe blocks (Figure 5B,C) as well as immediate shifts in the adaptive directions for all color contexts (Figure 5D, 0.11 ± 0.04 and 0.19 ± 0.05 for switches to green WN and green probe blocks, respectively; −0.15 ± 0.06 and −0.09 ± 0.02 for switches to yellow WN and yellow probe blocks, respectively). While additional data would be required to establish the number of distinct associations between contexts and sequencing states that can be learned, these findings suggest that birds can maintain at least two distinct sequencing states separate from a ‘neutral’ state and use specific associations between cue colors and sequencing states to rapidly shift sequencing in distinct directions for each context.

Contextual cues allow shifts in both directions.

(A) Sequence probability for Bird 2 at the switch from neutral context to yellow and green WN contexts, as well as yellow and green probe contexts (no WN). Error bars indicate SEM across song bouts (n = 68 switches [green WN], 78 switches [yellow WN], 27 switches [green probe], 24 switches [yellow probe]). (B, C) Sequence probabilities for three birds for the last song in neutral context and the first song in the following probe context. Example bird in (A) highlighted in bold. (D) Shift magnitude for three birds at the switch from neutral context to all other contexts. Open circles show individual birds; error bars indicate SEM across birds. Source data in Figure 5—source data 1.

Figure 5—source data 1

Switch magnitude during third context experiment, to generate Figure 5B–D.

https://cdn.elifesciences.org/articles/61610/elife-61610-fig5-data1-v1.mat

Discussion

Speech, thought, and many other behaviors are composed of ordered sequences of simpler elements. The flexible control of sequencing is thus a fundamental aspect of cognition and motor function (Aldridge and Berridge, 2002; Jin and Costa, 2015; Tanji, 2001). While the flexibility of human speech is unrivaled, our contextual training paradigm revealed a simpler, parallel capacity in birds to produce distinct vocal sequences in response to arbitrary contextual cues. The colors of the cues had no prior relevance to the birds, so that their meaning had to be learned as a new association between cues and the specific vocal sequences that were contextually appropriate (i.e. that escaped WN, given the current cues). Learned modulation of sequencing was immediately expressed in response to changes in cues, persisted following termination of training, and was largely restricted to the targeted sequences, without gross modifications of global song structure. Hence, for song, like speech, the ordering of vocal elements can be rapidly and specifically reconfigured to achieve learned, contextually appropriate goals. This shared capacity for moment-by-moment control of vocal sequencing in humans and songbirds suggests that the avian song system could be an excellent model for investigating how neural circuits enable flexible and adaptive reconfiguration of motor output in response to different cognitive demands.

Flexible control of vocalizations

Our demonstration of contextual control over the ordering of vocal elements in the songbird builds on previous work showing that a variety of animals can learn to emit or withhold innate vocalizations in response to environmental or experimentally imposed cues. For example, nonhuman primates and other animals can produce alarm calls that are innate in their acoustic structure, but that are deployed in a contextually appropriate fashion (Nieder and Mooney, 2020; Suzuki and Zuberbühler, 2019; Wheeler and Fischer, 2012). Similarly, animals, including birds, can be trained to control their vocalizations in an experimental setting, by reinforcing the production of innate vocalizations in response to arbitrary cues to obtain food or water rewards (Brecht et al., 2019; Hage and Nieder, 2013; Nieder and Mooney, 2020; Reichmuth and Casey, 2014). In relation to these prior findings, our results demonstrate a capacity to flexibly reorganize the sequencing of learned vocal elements, rather than select from a fixed set of innate vocalizations, in response to arbitrary cues. This ability to contextually control the ordering, or syntax, of specifically targeted syllable transitions within the overall structure of learned song parallels the human capacity to differentially sequence a fixed set of syllables in speech.

The ability to alter syllable sequencing in a flexible fashion also contrasts with prior studies that have demonstrated modulation of vocalizations in more naturalistic settings. For example, songs produced in the context of courtship and territorial or aggressive encounters (‘directed song’) differ in acoustic structure from songs produced in isolation (‘undirected song’) (Sakata et al., 2008; Searcy and Beecher, 2009). This modulation of song structure by social context is characterized by global changes to the intensity of song production, with directed songs exhibiting faster tempo, and greater stereotypy of both syllable structure and syllable sequencing, than undirected songs (Sakata et al., 2008; Searcy and Beecher, 2009; Sossinka and Böhner, 1980). This and other ethologically relevant modulation of song intensity may serve to communicate the singer’s affective state, such as level of arousal or aggression (Alcami et al., 2021; Hedley et al., 2017; Heinig et al., 2014), and may largely reflect innate mechanisms (James et al., 2018; Kojima and Doupe, 2011) mediated by hypothalamic and neuromodulatory inputs to premotor regions (Berwick et al., 2011; Gadagkar et al., 2019; James et al., 2018; Nieder and Mooney, 2020). In contrast, here we show that birds can learn to locally modulate specific features of their songs (i.e. individually targeted syllable transitions) in response to arbitrarily assigned contextual cues that have no prior ethological relevance.

Evolution of control over vocal sequencing

The capacity for moment-by-moment adjustment of vocalizations in response to arbitrary learned cues may depend on similar capacities that evolved to enable appropriate modulation of vocalizations in ethologically relevant natural contexts. For example, some species of songbirds preferentially sing different song types depending on factors such as time of day, location of the singer, or the presence of an audience (Alcami et al., 2021; Hedley et al., 2017; King and McGregor, 2016; Searcy and Beecher, 2009; Trillo and Vehrencamp, 2005). Even birds with only a single song type, such as Bengalese finches, vary parameters of their song depending on social context, including the specific identity of the listener (Chen et al., 2016; Heinig et al., 2014; Sakata et al., 2008). The ability to contextually control vocalizations is also relevant for the customization of vocal signatures for purposes of individual and group recognition (Vignal et al., 2004) and to avoid overlap and enhance communication during vocal turn-taking and in response to environmental noises (Benichov and Vallentin, 2020; Brumm and Zollinger, 2013). Such capacities for vocal control likely reflect evolutionary advantages of incorporating sensory and contextual information about conspecifics and the environment in generating increasingly sophisticated vocal signaling. Our results indicate a latent capacity to integrate arbitrary sensory signals into the adaptive deployment of vocalizations in songbirds and suggest that some of the contextual control observed in natural settings may likewise rely on learned associations and other cognitive factors. Perhaps evolutionary pressures to develop nuanced social communication led to the elaboration of cortical (pallial) control over brainstem vocal circuitry (Hage and Nieder, 2016), and thereby established a conduit that facilitated the integration of progressively more abstract cues and internal states in that control.

Neural implementation of context-dependent vocal motor sequencing

The ability of birds to switch between distinct motor programs using visual cues is reminiscent of contextual speech and motor control studies in humans. For example, human subjects in both laboratory studies and natural settings can learn multiple ‘states’ of vocal motor adaptation and rapidly switch between them using contextual information (Houde and Jordan, 2002; Keough and Jones, 2011; Rochet-Capellan and Ostry, 2011). Similarly, subjects can learn two separate states of motor adaptation for other motor skills, such as reaching, and switch between them using cues or other cognitive strategies (Cunningham and Welch, 1994). Models of such context-dependent motor adaptation frequently assume at least two parallel processes (Abrahamse et al., 2013; Ashe et al., 2006; Green and Abutalebi, 2013; Hikosaka et al., 1999; Lee and Schweighofer, 2009; McDougle et al., 2016; Rochet-Capellan and Ostry, 2011; Wolpert et al., 2011), one that is more flexible, and sensitive to contextual information (McDougle et al., 2016), and a second that cannot readily be associated with contextual cues and is only gradually updated during motor adaptation (Howard et al., 2013). Specifically, in support of such a two-process model, Imamizu and Kawato, 2009 and Imamizu et al., 2007 found that contextual information can drive rapid shifts in adaptation at the beginning of new blocks, without affecting the rate of adaptation within blocks. The similar separation in our study between rapid context-dependent shifts in sequence probability at the onset of blocks, and gradual adaptation within blocks that does not improve with training (Figure 2G–L), suggests that such contextual sequence learning in the Bengalese finch may also be enabled by two distinct processes.

Human studies of two-process models suggest that slow adaptation occurs primarily within primary motor structures, while fast context-dependent state switches, including for cued switching between languages in bilinguals, engage more frontal areas involved in executive control (Bialystok, 2017; Blanco-Elorrieta and Pylkkänen, 2016; De Baene et al., 2015; Imamizu and Kawato, 2009). In songbirds, the gradual adaptation of sequence probabilities within blocks might likewise be controlled by motor and premotor song control structures, while visual contextual cues could be processed in avian structures analogous to mammalian prefrontal cortex, outside the song system. For example, the association area nidopallium caudolaterale (Güntürkün, 2005) is activated by arbitrary visual cues that encode learned rules (Veit and Nieder, 2013; Veit et al., 2015), and this or other avian association areas (Jarvis et al., 2013) may serve as an intermediate representation of the arbitrary contextual cues that can drive rapid learned shifts in syllable sequencing.

At the level of song motor control, our results indicate a greater capacity for rapid and flexible adjustment of syllable transition probabilities than previously appreciated. Current models of song production include networks of neurons in the vocal premotor nucleus HVC responsible for the temporal control of individual syllables, which are linked together by activity in a recurrent loop through brainstem vocal centers (Andalman et al., 2011; Ashmore et al., 2005; Cohen et al., 2020; Hamaguchi et al., 2016). At branch points in songs with variable syllable sequencing, one influential model posits that which syllable follows a branch point is determined by stochastic processes that depend on the strength of the connections between alternative syllable production networks, and thus dynamics local to HVC (Jin, 2009; Jin and Kozhevnikov, 2011; Troyer et al., 2017; Zhang et al., 2017). Such models could account for a gradual adjustment of sequence probabilities over a period of hours or days (Lipkind et al., 2013; Warren et al., 2012) through plasticity of motor control parameters, such as the strength of synaptic connections within HVC. However, our results demonstrate that there is not a single set of relatively fixed transition probabilities that undergo gradual adjustments, as could be captured in synaptic connectivity of branched syllable control networks. Rather, the song system has the capacity to maintain distinct representations of transition probabilities and can immediately switch between those in response to visual cues. HVC receives a variety of inputs that potentially could convey such visual or cognitive influences on sequencing (Bischof and Engelage, 1985; Cynx, 1990; Seki et al., 2008; Ullrich et al., 2016; Wild, 1994), and one of these inputs, Nif, has previously been shown to be relevant for sequencing (Hosino and Okanoya, 2000; Vyssotski et al., 2016). It therefore is likely that the control of syllable sequence in Bengalese finches involves a mix of processes local to nuclei of the song motor pathway (Basista et al., 2014; Zhang et al., 2017) as well as inputs that convey a variety of sensory feedback and contextual information. The well-understood circuitry of the avian song system makes this an attractive model to investigate how such top-down pathways orchestrate the kind of contextual control of vocalizations demonstrated in this study, and more broadly to uncover how differing cognitive demands can flexibly and adaptively reconfigure motor output.

Materials and methods

Subjects and sound recordings

The experiments were carried out on eight adult male Bengalese finches (Lonchura striata) obtained from the lab’s breeding colony (age range 128–320 days post-hatch, median 178 days, at start of experiment). Birds were placed in individual sound-attenuating boxes with continuous monitoring and auditory recording of song. Song was recorded using an omnidirectional microphone above the cage. We used custom software for the online recognition of target syllables and real-time delivery of short 40 ms bursts of WN depending on the syllable sequence (Tumer and Brainard, 2007; Warren et al., 2012). This LabView program, EvTAF, is included as an executable file with this submission, and further support is available from the corresponding authors upon request. All procedures were performed in accordance with animal care protocols approved by the University of California, San Francisco Institutional Animal Care and Use Committee (IACUC).

Training procedure and blocks

Bengalese finch song consists of a discrete number of vocal elements, called syllables, that are separated by periods of silence. At the start of each experiment, a template was generated to recognize a specific sequence of syllables (the target sequence) for each bird based on their unique spectral structure. In the context-dependent auditory feedback protocol, the target sequence that received aversive WN feedback switched between blocks of different light contexts. Colored LEDs (superbrightleds.com, St. Louis, MO; green 520 nm, amber 600 nm) produced two visually distinct environments (green and yellow) to serve as contextual cues to indicate which sequences would elicit WN and which would ‘escape’ (i.e. not trigger WN). Because we aimed to test whether birds could associate song changes with any arbitrary visual stimulus, the specific colors were chosen arbitrarily; the birds’ color perception in this range should not matter, as long as they were able to discriminate the two cues. The entire day was used for data acquisition by alternating the two possible light contexts. We determined sensitivity and specificity of the template to the target sequence on a randomly selected set of 20 song bouts on which labels and delivery of WN were hand-checked. Template sensitivity was defined as follows: sensitivity = (number of correct hits)/(total number of target sequences). The average template sensitivity across experiments was 91.3% (range 75.2–100%). Template specificity was defined as: specificity = (number of correct escapes)/(number of correct escapes plus number of false alarms), where correct escapes were defined as the number of target sequences of the currently inactive context that were not hit by WN, and false alarms were defined as any WN that was delivered either on the target sequence of the currently inactive context, or anywhere else in song. The average template specificity was 96.7% (range 90.6–100%).
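
The sensitivity and specificity definitions above can be written compactly as in the following Python sketch; the counts in the example are illustrative, not real hand-checked data.

```python
# Sketch of the template sensitivity/specificity definitions given above.
# The counts passed in the example calls are illustrative placeholders.
def template_sensitivity(n_correct_hits, n_target_sequences):
    # sensitivity = correct hits / total target sequences (active context)
    return n_correct_hits / n_target_sequences

def template_specificity(n_correct_escapes, n_false_alarms):
    # specificity = correct escapes / (correct escapes + false alarms);
    # false alarms include WN delivered on the inactive-context target
    # sequence or anywhere else in song
    return n_correct_escapes / (n_correct_escapes + n_false_alarms)

print(template_sensitivity(110, 120))   # ~0.92
print(template_specificity(145, 5))     # ~0.97
```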

At the start of each experiment, before WN training, songs were recorded during a baseline period in which cage illumination was switched between colors at random intervals. Songs from this baseline period were separately analyzed for each light color to confirm that there was no systematic, unlearned effect of light cues on sequencing before training. During initial training, cage illumination was alternately switched between colors at random intervals. Intervals were drawn from uniform distributions which differed between birds (60–150 min [four birds], 10–30 min [two birds], 60–240 min [one bird], 30–150 min [one bird]). Different training schedules were assigned to birds arbitrarily and were not related to a bird’s performance. After an extended period of training (average 33 days, range 12–79 days), probe blocks without WN were included to test whether sequencing changes could be elicited by visual cues alone. During this period, probe blocks were interspersed with WN training blocks. Probe blocks made up approximately one third of total blocks (10 of 34 blocks in the sequence) and 7–35% of total time, depending on the bird. The duration of probe blocks was typically shorter than or equal to the duration of WN blocks (10–30 min for six birds, 30–120 min for one bird, 18–46 min for one bird). The total duration of the experiment, consisting of baseline, training, and probe periods, was on average 52 days. During this period, birds sang 226 (range 66–356) bouts per day during baseline days and 258 (range 171–368) bouts per day during the period of probe collection at the end of training (14% increase). The average duration of song bouts also changed little, with both the average number of target sequences per bout (8.7 during baseline, 7.7 during probes, 7% decrease) and the average number of syllables per bout (74 during baseline, 71 during probes, 2% decrease) decreasing slightly. In addition to the eight birds that completed this training paradigm, three birds were started on contextual training but never progressed to testing with probe blocks, because they did not exhibit single-context learning (n = 1); because of technical issues with consistent targeting at branch points (n = 1); or because they lost sequence variability during initial stages of training (n = 1); these birds are excluded from the results. Of the eight birds that completed training, three birds exhibited relatively small context-dependent changes in sequencing (Figure 1H). We examined several variables to assess whether they could account for differences in the magnitude of learning across birds, including the bird’s age, overall transition entropy of the song (Katahira et al., 2013), transition entropy at the targeted branch points (Warren et al., 2012), as well as the distance between the WN target and the closest preceding branch point in the sequence. None of these variables were significantly correlated with the degree of contextual learning that birds expressed (Figure 4—figure supplement 1), and consequently, all birds were treated as a single group in analysis and reporting of results. In a subset of experiments (n = 3), after completing measurements with probe blocks, we added a third, neutral context (Figure 5), signaled by white light, in which there was no WN reinforcement.
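
For concreteness, a block schedule of the kind described above could be generated as in the following Python sketch. The structure is an assumption (the actual experiments were controlled by custom LabView software), and the interval and probe parameters are placeholders drawn from the ranges reported here.

```python
# Hypothetical sketch of a daily block schedule: light contexts alternate at
# random intervals drawn from a uniform distribution, and roughly one third
# of blocks are probe blocks (same light cue, WN disabled).
import random

def make_schedule(day_minutes=840, wn_range=(60, 150),
                  probe_range=(10, 30), probe_frac=1/3):
    schedule, t = [], 0.0
    color = random.choice(["yellow", "green"])
    while t < day_minutes:
        is_probe = random.random() < probe_frac
        lo, hi = probe_range if is_probe else wn_range
        duration = random.uniform(lo, hi)
        schedule.append({"color": color, "wn_on": not is_probe,
                         "start_min": round(t), "dur_min": round(duration)})
        t += duration
        color = "green" if color == "yellow" else "yellow"  # alternate cue
    return schedule

for block in make_schedule()[:5]:
    print(block)
```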

Syllable sequence annotation

Syllable annotation for data analysis was performed offline. Each continuous period of singing that was separated from others by at least 2 s of silence was treated as an individual ‘song’ or ‘song bout’. Song was bandpass filtered between 500 Hz and 10,000 Hz and segmented into syllables and gaps based on amplitude threshold and timing parameters determined manually for each bird. A small sample of songs (approximately 20 song bouts) was then annotated manually based on visual inspection of spectrograms. These data were used to train an offline autolabeler (‘hybrid-vocal-classifier’, Nicholson, 2021), which was then used to label the remaining song bouts. Autolabeled songs were processed further in a semi-automated way depending on each bird’s unique song, for example to separate or merge syllables that were not segmented correctly (detected by their duration distributions), to deal with WN covering syllables (detected by its amplitude), and to correct autolabeling errors detected based on the syllable sequence. A subset of songs was inspected manually for each bird to confirm correct labeling.
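
The amplitude-threshold segmentation step can be sketched as follows; this is not the authors' pipeline, and the filter order, smoothing window, and threshold are assumptions (in practice these parameters were set per bird).

```python
# Minimal sketch of amplitude-threshold segmentation into syllables and gaps.
# Assumes a sampling rate above 20 kHz so the 10 kHz band edge is valid.
import numpy as np
from scipy.signal import butter, filtfilt

def segment_syllables(audio, fs, threshold, min_syl_dur=0.01):
    # Bandpass filter 500-10,000 Hz, as described above
    b, a = butter(4, [500, 10000], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, audio)
    # Smoothed amplitude envelope: rectify, then 2 ms moving average
    win = max(1, int(0.002 * fs))
    envelope = np.convolve(np.abs(filtered), np.ones(win) / win, mode="same")
    # Threshold crossings give candidate syllable onsets and offsets
    above = (envelope > threshold).astype(int)
    onsets = np.where(np.diff(above) == 1)[0] / fs
    offsets = np.where(np.diff(above) == -1)[0] / fs
    if len(offsets) and len(onsets) and offsets[0] < onsets[0]:
        offsets = offsets[1:]                   # drop offset with no onset
    n = min(len(onsets), len(offsets))          # guard against edge effects
    onsets, offsets = onsets[:n], offsets[:n]
    keep = (offsets - onsets) > min_syl_dur     # discard very short segments
    return onsets[keep], offsets[keep]
```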

Sequence probability analyses

Sequence probability was first calculated within each song bout as the frequency of the yellow target sequence relative to the total number of yellow and green target sequences: p = n(target_Y)/(n(target_Y) + n(target_G)). Note that this differs from transition probabilities at branch points in song in that it ignores possible additional syllable transitions at the branch point, and does not require the targeted sequences to be directly following the same branch point. For example for the experiment in Figure 3, the target sequences were ‘n-ab’ and ‘f-ab’, so the syllable covered by WN (‘b’ in both contexts) was two to three syllables removed from the respective branch point in the syllable sequence (‘n-f’ vs. ‘n-a’ or ‘f-n’ vs. ‘f-a’). Note also that units of sequence probability are in percent; therefore, reported changes in percentages (e.g. Figures 1H and 2E,F) describe absolute changes in sequence probability, which reflect the proportion of each target sequence, not percent changes. Song bouts that did not contain either of the two target sequences were discarded. In the plots of sequence probability over several days in Figure 1A–C, we calculated sequence probability for all bouts on a given day (average n = 1854 renditions of both target sequences per day). We estimated 95% confidence intervals by approximation with a normal distribution as p ± z*sqrt(p*(1-p)/n), with n = n(target_Y) + n(target_G) and z = 1.96. Context switches were processed to include only switches between adjacent blocks during the same day, that is excluding overnight switches and treating blocks as separate contexts if one day started with the same color that had been the last color on the previous day. If a bird did not produce any song during one block, this block was merged with any neighboring block of the same color (e.g. green probe without songs before green WN, where the context switch would not be noticeable for the bird). If the light color switched twice (or more) without any song bouts, those context switches were discarded.
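
The per-bout calculation and its confidence interval can be expressed as in the following Python sketch; the target-sequence labels and counts are placeholders for one bird, not real data.

```python
# Sketch of the per-bout sequence probability and its normal-approximation
# 95% confidence interval, following the formulas above.
import numpy as np

def sequence_probability(labels, target_y="abd", target_g="abc"):
    """p = n(target_Y) / (n(target_Y) + n(target_G)) for one labeled bout."""
    n_y = labels.count(target_y)
    n_g = labels.count(target_g)
    if n_y + n_g == 0:
        return None  # bouts without either target sequence are discarded
    return n_y / (n_y + n_g)

def binomial_ci(p, n, z=1.96):
    """95% CI by normal approximation: p +/- z * sqrt(p*(1-p)/n)."""
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

# Illustrative day: 'abd' sung 640 times, 'abc' sung 360 times
p = 640 / (640 + 360)
print(binomial_ci(p, 1000))  # approximately (0.61, 0.67)
```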

In order to reduce variability associated with changes across individual song bouts, shift magnitude was calculated as the difference between the first five song bouts in the new context and the last five song bouts in the old context. Only context switches with at least three song bouts in each adjacent block were included in analyses of shift magnitude. In plots showing songs aligned to context switches, the x-axis is limited to show only points for which at least half of the blocks contributed data (i.e. in Figure 2D, half of the green probe blocks contained at least six songs). All statistical tests were performed with MATLAB. We used non-parametric tests to compare changes across birds (Wilcoxon rank-sum test for unpaired data, Wilcoxon signed-rank test for paired data), because with only eight birds/data points, it is more conservative to assume that data are not Gaussian distributed.
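
A sketch of the shift-magnitude calculation, with hypothetical per-bout probabilities, is given below; it mirrors the definition above but is not the MATLAB code used for the paper.

```python
# Sketch: shift magnitude = mean of the first five song bouts in the new
# context minus mean of the last five bouts in the old context. The block
# lists of per-bout sequence probabilities are placeholder data structures.
import numpy as np

def shift_magnitude(prev_block, new_block, n_bouts=5, min_bouts=3):
    # Only switches with at least three bouts in each adjacent block are used
    if len(prev_block) < min_bouts or len(new_block) < min_bouts:
        return None
    return np.mean(new_block[:n_bouts]) - np.mean(prev_block[-n_bouts:])

print(shift_magnitude([0.35, 0.40, 0.30, 0.38, 0.36],
                      [0.60, 0.65, 0.70, 0.62]))  # ~0.28
```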

Analysis of acquisition

In order to investigate how context-dependent performance developed over training (Figure 2G–L), we quantified changes to sequence probabilities across block switches for five birds for which we had a continuous record from the onset of training. Sequence probability curves (e.g. Figure 2H) for yellow switches were inverted so that both yellow and green switches were plotted in the same direction, aligned by the time of context switches, and were cut off at a time point relative to context switches where fewer than five switches contributed data. We then subtracted the mean pre-switch value from each sequence probability curve. For visual display of the example bird, sequence probability curves were smoothed with a nine bout boxcar window and displayed in bins of seven context switches. To calculate the slope of slopes and slope of intercepts (Figure 2L), we calculated a linear fit to the post-switch parts of the unsmoothed sequence probability curve for each individual context switch.
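
The fitting procedure can be sketched as follows; the data structures and example values are hypothetical, and this is not the analysis code used for Figure 2.

```python
# Sketch of the acquisition analysis: fit a line to each post-switch,
# baseline-subtracted sequence-probability trajectory, then ask how the
# fitted intercepts and slopes change across switches over training.
import numpy as np

def fit_post_switch(pre, post):
    """Return (intercept, slope) of a linear fit to one post-switch curve."""
    post = np.asarray(post, dtype=float) - np.mean(pre)  # subtract pre-switch mean
    x = np.arange(len(post))                             # song bouts within block
    slope, intercept = np.polyfit(x, post, 1)
    return intercept, slope

def change_over_training(values):
    """Slope of a linear fit to per-switch values across training
    (e.g. the 'slope of intercepts' summarized in Figure 2L)."""
    return np.polyfit(np.arange(len(values)), values, 1)[0]

# Toy example: two switches, each with a pre-switch and post-switch segment
fits = [fit_post_switch([0.40, 0.42], [0.50, 0.55, 0.60]),
        fit_post_switch([0.38, 0.40], [0.62, 0.66, 0.70])]
intercepts = [f[0] for f in fits]
print(change_over_training(intercepts))
```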

Specificity to relevant branch points

To calculate the specificity of the context difference to the targeted branch points in song, we generated transition diagrams for each bird. To simplify the diagrams, introductory notes were summarized into a single introductory state. Introductory notes were defined for each bird as up to three syllables occurring at the start of song bouts before the main motif, which tended to be quieter, more variable, with high probabilities to repeat and to transition to other introductory notes. Repeat phrases were also summarized into a single state. Motifs, or chunks, in the song with fixed order of syllables were identified by the stereotyped transitions and short gap durations between syllables in the motif (Isola et al., 2020; Suge and Okanoya, 2010) and were also summarized as a single state in the diagram. Sometimes, the same syllable can be part of several fixed chunks (Katahira et al., 2013), in which case it may appear several times in the transition diagram. We then calculated the difference between the transition matrices for the two probe contexts at each transition that was a branch point (defined as more than 3% and less than 97% transition probability). These context differences were split into ‘targeted branch points’, i.e., the branch point or branch points most closely preceding the target sequences in the two contexts, and ‘non-targeted branch points’, i.e., all other branch points in the song. We calculated the proportion of absolute contextual difference in the transition matrix that fell to the targeted branch points, for example for the matrix in Figure 4C (44 + 45)/(44 + 45 + 6+6 + 1+1 + 2+2)=83.2%. Typically, birds with clear contextual differences at the target sequence also had high specificity of sequence changes to the targeted branch points.
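
The specificity measure can be sketched as follows; the matrix values are illustrative (loosely mirroring the worked example above), not real data.

```python
# Sketch of the branch-point specificity measure: the proportion of the
# absolute context difference in the transition matrix that falls on the
# branch point(s) immediately preceding the target sequences.
import numpy as np

def branchpoint_specificity(diff_matrix, targeted_rows):
    """diff_matrix: context difference in transition probability
    (rows = preceding branch point, cols = following state).
    targeted_rows: indices of branch points preceding the target sequences."""
    total = np.abs(diff_matrix).sum()
    targeted = np.abs(diff_matrix[targeted_rows, :]).sum()
    return targeted / total

# Toy matrix mirroring the worked example: row 0 is the targeted branch point
diff = np.array([[44, -45],   # targeted branch point
                 [ 6,  -6],   # other branch points
                 [ 1,  -1],
                 [ 2,  -2]])
print(branchpoint_specificity(diff, [0]))  # ~0.832
```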

To calculate the transition entropy of baseline song, we again summarized introductory notes into a single introductory state. In addition, the same syllables as part of multiple fixed motifs, or in multiple positions within the same fixed motif, were renamed as different syllables, so as not to count as sequence variability what was really a stereotyped sequence (i.e. b-b 50% and b-c 50% in the fixed sequence ‘abbc’). Transition entropy was then calculated as in Katahira et al., 2013, as H = −Σ_x p(x) Σ_y p(y|x) log2 p(y|x), with x denoting the preceding syllable and y denoting the current syllable, summed over all syllables in the song.
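
The transition-entropy computation can be sketched as follows; the syllable strings are placeholders, and the collapsing of introductory notes and fixed chunks is assumed to have been done upstream as described above.

```python
# Sketch of transition entropy: conditional entropy of the current syllable
# given the preceding syllable, weighted by how often each preceding
# syllable occurs (in bits).
import numpy as np
from collections import Counter

def transition_entropy(sequences):
    """sequences: list of syllable-label strings for individual song bouts."""
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq[:-1], seq[1:]))
    total = sum(pair_counts.values())
    prev_counts = Counter()                # p(x): occurrences as preceding syllable
    for (x, _), c in pair_counts.items():
        prev_counts[x] += c
    entropy = 0.0
    for (x, y), c in pair_counts.items():
        p_x = prev_counts[x] / total
        p_y_given_x = c / prev_counts[x]
        entropy -= p_x * p_y_given_x * np.log2(p_y_given_x)
    return entropy

print(transition_entropy(["iabcd", "iabdd", "iabcc"]))
```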

Data availability

Raw data are included in the manuscript and supporting files. Source data have been provided for all summary analyses, along with code to reproduce the figures.

References

  1. Book: Aldridge JW, Berridge KC (2002) Coding of Behavioral Sequences in the Basal Ganglia. In: Nicholson LFB, Faull RLM, editors. The Basal Ganglia VII. Boston: Springer. pp. 53–66. https://doi.org/10.1007/978-1-4615-0715-4
  2. Book: Brumm H, Zollinger SA (2013) Avian Vocal Production in Noise. In: Brumm H, editor. Animal Communication and Noise. Heidelberg, Berlin: Springer. pp. 187–227. https://doi.org/10.1007/978-3-642-41494-7_7

Decision letter

  1. Jesse H Goldberg
    Reviewing Editor; Cornell University, United States
  2. Barbara G Shinn-Cunningham
    Senior Editor; Carnegie Mellon University, United States
  3. Jesse H Goldberg
    Reviewer; Cornell University, United States
  4. Constance Scharff
    Reviewer; Freie Universitaet Berlin, Germany

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

Bengalese finches sing syntactically complex songs with flexible transitions between specific syllables. Here, the authors show that birds can modify syllable transitions depending on an arbitrary light cue. This surprising result shows that learning in the 'song system' – a neural circuit in songbirds known to drive vocal output – can be controlled by yet-to-be-defined visual inputs, setting the stage for new directions in the songbird field that move beyond sequence production and into a more cognitive realm.

Decision letter after peer review:

Thank you for submitting your article "Songbirds can learn flexible contextual control over syllable sequencing" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Jesse H Goldberg as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Barbara Shinn-Cunningham as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Constance Scharff (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

Summary:

Veit et al. test if arbitrary visual cues can influence syllable sequence 'choices' in adult Bengalese finches. Using light- and sequence-contingent distorted auditory feedback, they find that birds robustly learn to produce context-dependent syllables. The learning slowly proceeds over weeks, but once trained, birds can rapidly transition between two sequence probabilities. This is a really interesting finding because it shows that the HVC chains that drive syllable phonology and sequencing can 'learn' to be gated by yet-to-be-defined visual inputs. The paper is only a behavioral study without neural correlates or a candidate neural architecture that could even solve the problem – but the reviewers did not see this as a major problem for the paper. By analogy, the Tumer et al., 2007 paper from the Brainard lab was also only a behavioral study, yet it has launched dozens of follow-up studies and a new branch of songbird neuroscience. This paper has the potential to do the same. Follow-up studies that figure out exactly how the visual system interfaces with the song system to dictate syllable selection will be a really interesting direction for the field – moving birdsong beyond simple sequence production and into a more cognitive realm. Thus, this paper is likely to be high impact, highly cited, and important for the field.

Essential revisions:

1. Please address the question, raised by Reviewer 2, about whether or not the production of the syllable initiating the target sequence was affected by the light context. i.e. if the training was abc vs abd, did the probability of producing a or ab change depending on context? Reviewer 2 wondered, if this is indeed the case, how this would affect the interpretations of the paper, especially "postulated parallels to language?"

2. Please also address the comments below, of which there are many. A point-by-point response will not be necessary, but it should be clear that all reviewers wanted some of the claims of novelty and connection to human cognition to be tempered, i.e. keep the conclusions closer to the data. This will mostly involve re-wording in the Introduction and Discussion. There are also several sources of confusion where the methods and analyses should be presented more clearly. Please see more details below.

Reviewer 1:

In Figure 4: n=3 birds on Figure 4 is a bit thin. New data is not absolutely necessary because the effect is robust and consistent across birds, but the authors may want to replicate this in 1-2 more birds to really nail the finding that specific light cues can drive shifts bidirectionally.

The discussion would be improved if it included examples of natural context-dependent changes in song syllable selection – for example song matching by buntings, great tits, and sparrows. In these cases, birds sequence and select syllables in a context-dependent way, even depending on where they are in a territory. Thus there may be some precedent and evolutionary context for the core finding here, i.e. that a 'place' representation can access the song system to influence syllable sequence and song selection. This consideration would fit in the "Evolution of control over vocal sequencing" section.

Reviewer 2:

Line 28: 'parallels aspects of human cognitive control over speech'. I find that an overstatement, unless I misunderstand the data. The authors condition birds to avoid a particular sequence by punishing ('aversively reinforcing') it with white noise and link this to a visual stimulus. How does that parallel human cognitive control over speech? Can the authors please provide more explanation?

Line 35. Please provide a reference with evidence for the part in italics (mine) in the following statement, or rephrase more evidence-based: 'This flexibility ingrained in human language stands in striking contrast to the largely innate and stereotypic vocalization patterns of most animal species, including our closest relatives, the non-human primates.' Most of the roughly 8.7 million animal species? How many have been analyzed? Of those, how many vocalization patterns are 'largely innate'? And stereotypic? At what level stereotypic?

Line 57: '…affective behavior, elicited instinctually by contact with potential mates, rivals, or performed..' What do the authors mean by 'affective' and 'instinctually'? Human speech also has affective components (prosody for instance) and we instinctually change aspects of our speech (and language) when we talk to children, partners, strangers. This is an unsophisticated dichotomy 'humans/birdsong', please consider rephrasing (rethinking).

Line 59: 'There are differences between songs produced in distinct social contexts…' Refs to this sentence should include Heinig et al. 2014 Male mate preferences in mutual mate choice: finches modulate their songs across and within male-female interactions. Anim Behav. 97:1-12.

Line 62: 'However, these social influences likely reflect a general modulation of song structure related to the animal's affective state (Berwick, Okanoya, Beckers, and Bolhuis, 2011).' What is the concrete evidence for the 'likely' in this sentence?

Line 64: 'and do not reveal whether song can be modified more flexibly by different cognitive factors.' But the fact that Bengalese finches sing different song sequences to different females (Heinig et al. paper, above) raises the possibility that 'cognitive factors' could play a role, since it's all 'affective' courtship song but different depending on which female is being sung to.

Line 78 'immediately, flexibly, and adaptively adjust their sequencing of vocal elements in response to learned contextual cues, in a manner that parallels key aspects of human cognitive control over speech' Same comment as to line 28. I think the authors are not doing themselves a favor in phrasing this claim so broadly and non-specifically.

line 110: 'alternating blocks' first mentioned here. Please include a section in the methods about blocks. I found it hard to extract the information from various points in the text how long blocks were, how many blocks per day on average (from the figures it seems that the entire day was used for data acquisition?) and what 'short probe blocks' (line 227) meant in terms of timing. Also, why were there many more block switches (Figure 1F) during baseline than during training (Figure 1G)?

Line 125: Figure 1A: it would help to point out the individual 'songs' in that figure, since song is defined differently in different species. In zebra finches a song as defined by the authors would be called 'a motif': (line 69) a 'song consist of ca 5-12 acoustically distinct elements'. Where do songs start and end? How is that determined? This relates also to my question above, whether sequences are modified in probability of occurrence or songs (or song types).

Line 237: 'Figure 3A,B shows song bouts for one example bird'. Since song bouts are defined by authors as 'separated by at least 2 sec' I would like to know whether the shown spectrogram is the entire bout and the silence before and after are just not shown or whether A and B show part of a bout. If so, can you show the entire bout, including the time when the light changes?

Line 332: 'The ability to alter syllable sequencing in a flexible fashion also contrasts with prior studies that have demonstrated modulation of vocalizations in more naturalistic settings.(…). In contrast, here we show that birds can learn to locally modulate specific features of their songs (i.e. individually targeted syllable transitions) in response to arbitrarily assigned contextual stimuli that have no prior ethological relevance.' Could the authors please comment on the following conundrum: If flexible use of song sequences under natural conditions were 'hard-wired/innate/reflexive/affective' as the authors suggest, how would the ability to pair an arbitrary contextual cue with a particular song sequence have evolved in Bengalese finches? Why would neural connections exist that allow this pairing of visual input to motor output? Isn't it more parsimonious to postulate that under natural conditions, visual stimuli do lead to different vocal motor responses because in addition to the known 'affective' mediators (hormones, dopamine etc) there is some 'top down', 'cognitive' control? (Reviewer 1 agrees with this point).

Line 348: The 'Evolution of control over vocal sequencing' section is in line with my above comment, e.g. it suggests that some animals might use contextual visual information for adaptive motor output, but again negates that Bengalese finches actually use this ability in their current behavior; instead the authors call it a 'latent capacity'. I do not follow their logic.

Line 367: Neural implementation. Does the two process model relate to human speech and language? Please explain.

Line 428: add 'male'. The manuscript does not mention anywhere whether males and females in Bengalese finches sing….Or add it to line 23 in the Abstract.

Line 429: 'age range 128-320 days' was there any age-related difference in learning? Looking at the figures some birds seemed to have performed quite a bit better than others. See also line 456 below.

Line 455: 'within an interval of one to several hours' please provide more information whether this was randomly chosen or based on the birds performance. If random, what was the rationale for this large difference in block duration?

Line 456: 'after several days of training (average 33)' Please also provide the range and whether shorter training was related to age of the birds.

Line 461: 'three birds.…never progressed to full probe sequence either because they did not exhibit single-context learning or because of technical issues with consistent targeting of branch points'. Did two birds not learn and one have a technical issue, or the other way round? How common is it in WN-escape experimental set-ups that birds do not learn? And what does 'single-context learning' mean? That they did not learn to associate yellow light with one target? This would imply that context 1 was learned first and then context 2, but in line 454 it sounds like both colors were paired with their particular target after one to several hours. Please explain.

Line 466: Please specify in the methods how many days the entire experiment lasted. How variable was the song output during that time and between individual birds? Did song output decline over time? Can the authors provide an estimate how many songs or bouts on average (and range) the birds sang?

Reviewer 3:

Although the present study describes the syllable sequence switching abilities of Bengalese finches within the framework of an elegantly designed behavioral paradigm, the links to the potential neural mechanisms are poorly presented or even obsolete since the authors do not provide any evidence about underlying brain dynamics. I recommend to rather discuss the results in a behavioral framework unless the authors add results from neural recordings or brain manipulations.

Line 34: Citation for reordering of finite elements to achieve infinite meaning, see Hauser, Chomsky and Fitch, 2002.

Line 47 f: Lipkind et al. 2017 showed that zebra finches can learn to re-order syllables during song learning. This paper is highly relevant and should be discussed.

Line 52: Reference to Doupe and Kuhl, 1999 should be moved to line 46?

Line 88 f: The authors decide to present most data in percentages. It would be useful to provide the actual number to assess the quality of the data.

Line 89: How reliable was the software in targeting syllables?

Line 284: The authors refer to white light as a neutral state. What is the color perception for Bengalese finches? Is white perceived rather as yellow or green? A novel light condition that the birds had not been exposed to before would probably be better.

Line 306 f: Light cues are not arbitrary as the birds are initially trained to connect white noise with light of a certain color.

Line 444: What was the reason to specifically choose green and yellow as colors for this experiment?

Line 445: How much does visual perception of Bengalese finches differ between 520 and 600nm?

Line 453-455: What is the maximum duration of several hours? This would also help to understand the difference in amounts of switches in Figure 1 F and G.

Line 456: How did the white noise training look? Can learning curves be added?

Figure 1 B: It would be helpful to plot both probability curves (ab-d and ab-c) and color code them accordingly as a general probability plot (y axis).

Figure 1 F: Why is the amount of color switches different between baseline and training? When within the training did baseline days occur or was this prior to training?

Figure 1 G: Why are light phases for green/yellow differently long? The error bars are misleading, as they show the SEM of individual blocks rather than the entire sample.

https://doi.org/10.7554/eLife.61610.sa1

Author response

Essential revisions:

1. Please address the question, raised by Reviewer 2, about whether or not the production of the syllable initiating the target sequence was affected by the light context. i.e. if the training was abc vs abd, did the probability of producing a or ab change depending on context? Reviewer 2 wondered, if this is indeed the case, how this would affect the interpretations of the paper, especially "postulated parallels to language?"

We understand this to be a question about how specific to the target sequence were the changes in the overall transition structure of the song. We have added substantial new analysis, and a new figure (new Figure 4), to address the specificity of contextual differences to the branch points preceding the target sequences. These new analyses demonstrate that for the majority of birds, 80% or more of total contextual differences were restricted to the targeted branch points. With respect to the example cited in the question above, this means that the majority of change to sequencing in an experiment targeting ‘abc’ vs. ‘abd’ occurs to transitions at the branchpoint following syllable ‘b’ and that there is little or no contextual difference in the probability of producing an ‘a’ or ‘ab’. These new analyses indicate that the learned contextual changes to syllable sequencing reflect a capacity for modulation of specific sequences within song, rather than the kind of global modulation of structure that occurs (for example) between songs produced in different social contexts. While we have reduced comparisons with speech throughout, we note that this specificity parallels a feature of contextual modulation of sequencing in speech, which similarly reflects a capacity for flexible, local and specific reordering of elements.

2. Please also address the comments below, of which there are many. A point-by-point response will not be necessary, but it should be clear that all reviewers wanted some of the claims of novelty and connection to human cognition to be tempered, i.e. keep the conclusions closer to the data. This will mostly involve re-wording in the Introduction and Discussion. There are also several sources of confusion where the methods and analyses should be presented more clearly. Please see more details below.

We have attempted to address all comments, especially the points noted immediately above.

Reviewer 1:

In Figure 4: n=3 birds on Figure 4 is a bit thin. New data is not absolutely necessary because the effect is robust and consistent across birds, but the authors may want to replicate this in 1-2 more birds to really nail the finding that specific light cues can drive shifts bidirectionally.

We have added a sentence noting that the relevant conclusions would benefit from additional experiments beyond those presented in Figure 5 (previously Figure 4).

The discussion would be improved if it included examples of natural context-dependent changes in song syllable selection – for example song matching by buntings, great tits, and sparrows. In these cases, birds sequence and select syllables in a context dependent way, even depending on where they are in a territory. Thus there may be some precedent and evolutionarily context for the core finding here, i.e. that a 'place' representation can access the song system to influence syllable sequence and song selection. This consideration would fit in the "Evolution of control over vocal sequencing" section.

We have substantially expanded the discussion of natural context-dependent changes (l.57ff, l.385ff), including addition of concrete examples, and have adjusted the logic in the Introduction and Discussion (paragraphs on evolution) to explicitly note that these examples of natural context-dependent control suggest that birds might also be able to exert such control in response to more arbitrary, learned contexts:

L.416ff: “Such capacities for vocal control likely reflect evolutionary advantages of incorporating sensory and contextual information about conspecifics and the environment in generating increasingly sophisticated vocal signaling. […] Perhaps evolutionary pressures to develop nuanced social communication led to the elaboration of cortical (pallial) control over brainstem vocal circuitry (Hage and Nieder, 2016), and thereby established a conduit that facilitated the integration of progressively more abstract cues and internal states in that control.

Reviewer 2:

Line 28: 'parallels aspects of human cognitive control over speech'. I find that an overstatement, unless I misunderstand the data. The authors condition birds to avoid a particular sequence by punishing ('aversively reinforcing') it with white noise and link this to a visual stimulus. How does that parallel human cognitive control over speech? Can the authors please provide more explanation?

Line 78 'immediately, flexibly, and adaptively adjust their sequencing of vocal elements in response to learned contextual cues, in a manner that parallels key aspects of human cognitive control over speech' Same comment as to line 28. I think the authors are not doing themselves a favor in phrasing this claim so broadly and non-specifically.

We have tempered, specified, or removed comparisons to contextual control of human speech throughout the text, and have provided additional explanation about similarities we see to speech control in the Discussion. We particularly focus on what we see as a shared capacity for learned, context-dependent control over the sequencing of vocal elements that is immediate, flexible, and adaptive. For example, contextual shifts appear immediately after context switches (Figure 2), they are learned, in the appropriate, arbitrarily chosen, direction in response to cues, which do not elicit such changes without prior training (Figure 1, Figure 5), and they are adaptive, in that they avoid aversive WN. We do not suggest that the context-dependent learning we have demonstrated reflects a capacity for conveying the kind of rich semantic content that is central to human language. Rather, that the underlying ability manifest in speech motor control to immediately and flexibly reorganize sequences of constituent elements (phonemes/syllables/words) to achieve a communicative ‘goal’ has some formal similarities to the simpler contextual control of vocalizations demonstrated here. In particular, we construe both to include a capacity for learned, moment-by-moment, “top-down” influences on the organization and sequencing of vocal elements to achieve contextually appropriate, adaptive outcomes. For human speech, complex cognitive processes and semantic “intent” can inform those top-down influences with an adaptive goal of influencing the listener (“conveying meaning”). For our experiments, bird vocalizations are similarly deployed in a learned and contextually appropriate fashion to achieve an adaptive goal (escaping from white noise). Correspondingly, we suggest that context-dependent modulation of vocal sequencing in the Bengalese finch may provide a particularly tractable behavioral model for examining how different arbitrary learned cues can drive the kind of top-down control of vocal motor output that forms a building block of speech. However, we appreciate the reviewer’s perspective that it is a long way from the capacities demonstrated here to insights about speech motor control, and correspondingly have largely curtailed a discussion of these parallels.

Line 35. Please provide a reference with evidence for the part in italics (mine) in the following statement, or rephrase more evidence-based: 'This flexibility ingrained in human language stands in striking contrast to the largely innate and stereotypic vocalization patterns of most animal species, including our closest relatives, the non-human primates.' Most of the roughly 8.7 million animal species? How many have been analyzed? Of those, how many vocalization patterns are 'largely innate'? And stereotypic? At what level stereotypic?

We acknowledge that the previous statement was too broad, given that most of the 8.7 million species noted by the reviewer have not been characterized in depth. We have rephrased this section (l. 37f) to focus on primates, where there has been considerable prior work:

“This cognitive control over vocal production is thought to rely on the direct innervation of brainstem and midbrain vocal networks by executive control structures in the frontal cortex, which have become more elaborate over the course of primate evolution (Hage and Nieder, 2016, Simonyan and Horwitz 2011). However, because of the comparatively limited flexibility of vocal production in nonhuman primates (Nieder and Mooney, 2020), the evolutionary and neural circuit mechanisms that have enabled the development of this flexibility remain poorly understood.“

Line 57: '…affective behavior, elicited instinctually by contact with potential mates, rivals, or performed..' What do the authors mean by 'affective' and 'instinctually'? Human speech also has affective components (prosody for instance) and we instinctually change aspects of our speech (and language) when we talk to children, partners, strangers. This is an unsophisticated dichotomy 'humans/birdsong', please consider rephrasing (rethinking).

We did not mean to imply that human speech lacks affective components. Rather, we wanted to emphasize the flexible top-down control of human speech production, which is typically considered a cognitive process involving the reorganization of vocal elements to achieve some communicative intent. In contrast, contextual changes in birdsong have typically been ascribed to affective processes, such as hormonal and neuromodulatory changes related to the production of directed song, and these changes have been shown to be unaffected by learning ("instinctual"). Here we test whether cognitive influences on birdsong exist beyond these possibly entirely instinctual contextual changes, building on the previously known examples of contextual differences in birdsong. We have revised the paragraph to better explain examples of naturally occurring contextual changes in birdsong (see also other Reviewer comments), and clarified the logic in the Introduction and Discussion.

Line 59: 'There are differences between songs produced in distinct social contexts…' Refs to this sentence should include Heinig et al. 2014 Male mate preferences in mutual mate choice: finches modulate their songs across and within male-female interactions. Anim Behav. 97:1-12.

We have added the reference.

Line 62. ‚However, these social influences likely reflect a general modulation of song structure related to the animal's affective state (Berwick, Okanoya, Beckers, and Bolhuis, 2011). What is the concrete evidence for the 'likely' in this sentence?

Changes in directed song typically reflect a general or global modulation of song structure, such that song overall is faster, louder, higher pitched, and more stereotyped. We have rephrased and included references in the paragraph to clarify.

l. 57ff: “Contextual variation of song in natural settings, such as territorial counter-singing or female-directed courtship song, indicate that songbirds can rapidly alter aspects of their song, including syllable sequencing and selection of song types (Chen, Matheson, and Sakata, 2016; Heinig et al., 2014; King and McGregor, 2016; Sakata, Hampton, and Brainard, 2008; Searcy and Beecher, 2009; Trillo and Vehrencamp, 2005). […] For example, the presence of potential mates or rivals elicits a global and unlearned modulation of song intensity (James, Dai, and Sakata, 2018a) related to the singer’s level of arousal or aggression (Alcami, Ma, and Gahr, 2021; Heinig et al., 2014; Jaffe and Brainard, 2020).”

Line 64 'and do not reveal whether song can be modified more flexibly by different cognitive factors.' But the fact that Bengalese finches sing different song sequences to different females (Heinig et al. paper, above) raise the possibility that 'cognitive factors' could play a role, since it's all 'affective' courtship song but different depending on which female is being sung to.

To our understanding, the Heinig et al. paper shows that birds sing different intensity of directed song to different females, which is compatible with an interpretation that they have different levels of general motivation to sing directed song to different females. This and other papers about directed song, referenced in response to the previous comment, suggest that directed song can elicit global changes (increased speed, amplitude, stereotypy, including sequence stereotypy) that are not learned. In contrast, the changes we show here are learned in response to arbitrary contextual cues, and are specific to the targeted position in the song bout. We have added new analysis (Figure 4) that demonstrates this specificity of the contextual changes in the current experiment, and have also clarified logic in Introduction and Discussion, to note that song changes elicited in natural contexts – including song type selection in other species – raise the possibility that learned, cognitive factors could play a role in modulating vocal output, an idea that we attempt to specifically test in our study.

line 110: 'alternating blocks' first mentioned here. Please include a section in the methods about blocks. I found it hard to extract the information from various points in the text how long blocks were, how many blocks per day on average (from the figures it seems that the entire day was used for data acquisition?) and what 'short probe blocks' (line 227) meant in terms of timing. Also, why were there many more block switches (Figure 1F) during baseline than during training (Figure 1G)?

Line 455: 'within an interval of one to several hours' please provide more information whether this was randomly chosen or based on the birds performance. If random, what was the rationale for this large difference in block duration?

These details have been added to the “Training procedure and blocks” section of the Methods.

Line 125: Figure 1A: it would help to point out the individual 'songs' in that figure, since song is defined differently in different species. In zebra finches a song as defined by the authors would be called 'a motif': (line 69) a 'song consist of ca 5-12 acoustically distinct elements'. Where do songs start and end? How is that determined? This relates also to my question above, whether sequences are modified in probability of occurrence or songs (or song types).

We believe these questions are mainly based on a misunderstanding of the referenced sentence in l.72. For clarification, we have rephrased as follows: “Each Bengalese finch song repertoire included ~5-12 acoustically distinct elements (‘syllables’) that are strung together into long sequences in variable but non-random order”. A song, or song bout (terms we previously used interchangeably, but now refer to exclusively as ‘song bout’ in the manuscript), is defined, as is typical in the field for zebra finches and Bengalese finches, as a period of continuous vocalization separated by at least 2 s of silence (now defined in the Methods). There are no different song types in Bengalese finches, and each song bout typically contains several renditions of both target sequences.

Line 237: 'Figure 3A,B shows song bouts for one example bird'. Since song bouts are defined by authors as 'separated by at least 2 sec' I would like to know whether the shown spectrogram is the entire bout and the silence before and after are just not shown or whether A and B show part of a bout. If so, can you show the entire bout, including the time when the light changes?

The spectrograms do not show the entire bout, but rather an exemplary section of the bout, to make it easier to recognize target sequences. We have added a figure supplement to Figure 3 to show the entire bout. We cannot show the time of the light change, as the recording program was set up to never change lights in the middle of song, i.e. the light changed as soon as the recording for the first bout ended, and before the recording for any following bout started.

Line 332: 'The ability to alter syllable sequencing in a flexible fashion also contrasts with prior studies that have demonstrated modulation of vocalizations in more naturalistic settings.(…). In contrast, here we show that birds can learn to locally modulate specific features of their songs (i.e. individually targeted syllable transitions) in response to arbitrarily assigned contextual stimuli that have no prior ethological relevance.' Could the authors please comment on the following conundrum: If flexible use of song sequences under natural conditions were 'hard-wired/innate/reflexive/affective' as the authors suggest, how would the ability to pair an arbitrary contextual cue with a particular song sequence have evolved in Bengalese finches? Why would neural connections exist that allow this pairing of visual input to motor output? Isn't it more parsimonious to postulate that under natural conditions, visual stimuli do lead to different vocal motor responses because in addition to the known 'affective' mediators (hormones, dopamine etc) there is some 'top down', 'cognitive' control? (Reviewer 1 agrees with this point).

Line 348: ' Evolution of control over vocal sequencing' section is in line with my above comment, e.g. suggests what some animals might use the ability to use contextual visual information for adaptive motor output but again negates that Bengalese finches actually use it in their current behavior, instead the authors call it 'latent capacity'. I do not follow their logic.

We largely agree with this interpretation and have now added discussion to clarify this logic in the “Evolution” paragraph, and explicitly state that some examples of natural contextual variation likely also involve more cognitive processing.

l. 420f: “and suggest that some of the contextual control observed in natural settings may likewise rely on learned associations and other cognitive factors.”

Line 367: Neural implementation. Does the two process model relate to human speech and language? Please explain.

We have added references to speech motor adaptation and language selection studies, and clarified that the rest of this paragraph concerns models related to more general motor control processes.

Line 428: add 'male'. The manuscript does not mention anywhere whether males and females in Bengalese finches sing….Or add it to line 23 in the Abstract.

Done.

Line 429: 'age range 128-320 days' was there any age-related difference in learning? Looking at the figures some birds seemed to have performed quite a bit better than others. See also line 456 below.

Line 456: 'after several days of training (average 33)' Please also provide the range and whether shorter training was related to age of the birds.

We have added analyses in Sup. Figure 4 to examine whether the birds' performance depended on age and other possible explanatory variables. We did not find a significant correlation with any tested variable, although that may be expected given the small sample size and the idiosyncratic features of each song and choice of branch point. Follow-up studies that systematically test learning ability at different branch points within the same bird would be needed.

We now also note that training duration (range 12-79 days) was not varied across birds in a fashion that was explicitly related to magnitude of sequence changes, and indeed was not a tightly controlled variable in these experiments.

Author response image 1

Line 461: 'three birds.…never progressed to full probe sequence either because they did not exhibit single-context learning or because of technical issues with consistent targeting of branch points'. Did two birds not learn and one had technical issue or other way round? How common is it in WN-escape experimental set-ups that birds do not learn? And what does 'single-context learning' mean? That they did not learn to associate yellow light with one target? This would imply that context 1 was learned first and then context 2, but in line 454 it sounds like both colors were paired with their particular target after one to several hours. Please explain.

We have clarified that one bird did not learn in single context training, one bird was abandoned due to technical difficulties, and one bird exhibited a loss of sequence variability during initial training that prevented further differential training.

We first tested each bird to ensure that it was capable of “single context learning” before initiating context dependent training. Single context learning as now defined in results means learning in one direction with constant light color, as in Figure 1C,D, and we construed this as a likely pre-requisite for context-dependent learning. It happens sometimes that birds do not learn in WN-escape experiments, typically because of some higher-order structure in the song (such as history dependence as described by Warren et al. 2012). However, for one pilot bird that did not exhibit single context learning in these initial experiments (similar to 1C,D) we nonetheless initiated dual context-dependent training to see if it might develop learning over time. We never saw evidence of learning in that bird, and training was abandoned.

A second bird was excluded because of technical issues with maintaining accurate targeting of syllables through template matching (see Methods) to deliver WN.

The third excluded bird learned well in single context training, but lost sequence variability over the course of initial training for reasons that are unclear and in a manner that was not observed in other birds. This resulted in the elimination of one of the target sequences, precluding differential training, and the bird was abandoned. These further details are now noted in Methods.

Line 466: Please specify in the methods how many days the entire experiment lasted. How variable was the song output during that time and between individual birds? Did song output decline over time? Can the authors provide an estimate how many songs or bouts on average (and range) the birds sang?

Consistent with prior observations (Yamahachi et al., 2020 Plos One), we found that some birds increased and others decreased the average number of song bouts per day. On average, birds sang 226 (range 66-356) bouts during baseline days and 258 (range 171-368) bouts per day during the period of probe collection at the end of training (14% increase). The average duration of song bouts also changed little, with both the average number of target sequences per bout (8.7 during baseline, 7.7 during probes, 7% decrease) and the average number of syllables per bout (74 during baseline, 71 during probes, 2% decrease) decreasing slightly. These numbers are now included in Methods.

Author response image 2

Reviewer 3:

Although the present study describes the syllable sequence switching abilities of Bengalese finches within the framework of an elegantly designed behavioral paradigm, the links to the potential neural mechanisms are poorly presented or even obsolete since the authors do not provide any evidence about underlying brain dynamics. I recommend to rather discuss the results in a behavioral framework unless the authors add results from neural recordings or brain manipulations.

We have expanded Introduction and Discussion on behavioral studies in birds and humans, and have reserved any speculation about neural mechanisms for the discussion. We retained some discussion of this point, as songbirds are an extensively studied model for neural mechanisms of vocal motor control; this large prior body of work on neural mechanisms enables some informed speculation about how the ability to rapidly adjust song in response to learned, visual cues could be accomplished by the song system, and we felt this would be of potential interest to more mechanistically inclined readers.

Line 34: Citation for reordering of finite elements to achieve infinite meaning, see Hauser, Chomsky and Fitch, 2002

Line 47 f: Lipkind et al. 2017 showed that zebra finches can learn to re-order syllables during song learning. This paper is highly relevant and should be discussed.

Line 52: Reference to Doupe and Kuhl, 1999 should be moved to line 46?

We have added these references.

Line 88 f: The authors decide to present most data in percentages. It would be useful to provide the actual number to assess the quality of the data.

We have clarified that the measure used throughout this study, sequence probability, is in units of percent. The changes that we describe in percentages are absolute changes in sequence probability, which reflect the proportion of each target sequence, not percent changes. For example, the plots in Figure 2 A-D (and similar ones throughout the manuscript) show raw, absolute values of sequence probability. To provide a measure of the quality of the data, we provide error bars, reflecting s.e.m. across song bouts.

We previously had not provided error bars on Figure 1 B-D, as these data points are based on all target sequences sung on a given day (i.e. there is only one data point per day). We have now added confidence intervals to Figure 1 B-D, estimated from a normal approximation of the binomial probability of the proportion of ‘abd’ and ‘abc’ target sequences. The actual numbers of target sequences are 1854 per day during baseline, 808 per day during ab-d targeting (fewer, because days 1 and 4 are not full days of singing but belong partly to baseline or ab-c targeting), and 1888 per day during ab-c targeting.
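For reference, a confidence interval of this kind can be sketched as follows (the counts in the usage line are hypothetical; this is not the authors' code):

```python
import math

def binomial_ci(k, n, z=1.96):
    """95% confidence interval for a sequence probability p = k/n, using the normal (Wald)
    approximation to the binomial proportion."""
    p = k / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

# Illustrative (hypothetical) counts: 900 'abd' renditions out of 1854 target sequences in one day
print(binomial_ci(900, 1854))   # approximately (0.463, 0.508)
```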

Line 89: How reliable was the software in targeting syllables?

The reliability of the templates was checked continuously throughout the experiment, but we did not keep careful notes on this for each bird. We have therefore retroactively assessed the specificity and sensitivity of the template by hand-checking 20 randomly selected song bouts from a single day of training for each bird, and added this information to Methods:

“We determined sensitivity and specificity of the template to the target sequence on a randomly selected set of 20 song bouts on which labels and delivery of WN was hand-checked. […] The average template specificity was 96.7% (range 90.6-100%).”

Line 284: The authors refer to white light as a neutral state. What is the color perception for Bengalese finches? Is white perceived rather as yellow or green? A novel light condition that the birds had not been exposed before would probably be better.

Line 306 f: Light cues are not arbitrary as the birds are initially trained to connect white noise with light of a certain color.

Line 444: What was the reason to specifically choose green and yellow as colors for this experiment?

Line 445: How much does visual perception of Bengalese finches differ between 520 and 600nm?

We mean by ‘arbitrary’ that these colors have no prior ethological meaning for the behavior of the bird. Hence, the color perception should not matter, as long as the birds are able to discriminate the colors. We set out to demonstrate that the birds would be able to associate song changes with any arbitrary visual stimulus. There was no reason to choose these specific colors. The white light is “neutral” only insofar as the birds had learned that aversive WN would never occur in the white context, not because it is spectrally in the middle of green and yellow. We did use a white LED light which was different from the home light in the cage, which was also white and the birds might have had experience with prior to any context training.

Line 453-455: What is the maximum duration of several hours? This would also help to understand the difference in amounts of switches in Figure 1 F and G.

Figure 1 F: Why is the amount of color switches different between baseline and training? When within the training did baseline days occur or was this prior to training?

Figure 1 G: Why are light phases for green/yellow differently long? The error bars are misleading, as they show the SEM of individual blocks rather than the entire sample.

We have added further explanation of block durations in the Methods (see also Reviewer 2). Probes were collected before WN training. The training schedule was changed between 1F and 1G. The individual light phases are drawn from random intervals; therefore, it might randomly happen that on one day the yellow contexts appear longer than the green contexts (or vice versa), but the two colors are drawn from the same intervals, so this should even out over time. We think that the SEM per block should be informative about effect reliability, and have expanded the legend for Figure 1G so as to clarify this measure.

Line 456: How did the white noise training look? Can learning curves be added?

Learning data are shown in Figure 1 C and D for single context training, and Figure 2 G,H for the contextual training protocol.

Figure 1 B: It would be helpful to plot both probability curves (ab-d and ab-c) and color code them accordingly as a general probability plot (y axis).

We now have added the probability for the other target sequence; thank you for the suggestion.

https://doi.org/10.7554/eLife.61610.sa2

Article and author information

Author details

  1. Lena Veit

    Center for Integrative Neuroscience and Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, United States
    Present address
    Institute for Neurobiology, University of Tübingen, Tübingen, Germany
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    lena.veit@uni-tuebingen.de
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-9566-5253
  2. Lucas Y Tian

    Center for Integrative Neuroscience and Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, United States
    Present address
    The Rockefeller University, New York, United States
    Contribution
    Conceptualization, Software, Supervision, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-7346-7360
  3. Christian J Monroy Hernandez

    Center for Integrative Neuroscience and Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, United States
    Contribution
    Investigation, Visualization
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-3796-989X
  4. Michael S Brainard

    Center for Integrative Neuroscience and Howard Hughes Medical Institute, University of California, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Writing - review and editing
    For correspondence
    msb@phy.ucsf.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0002-9425-9907

Funding

Leopoldina German National Academy of Sciences (Postdoc Fellowship)

  • Lena Veit

Life Sciences Research Foundation (Howard Hughes Medical Institute Fellowship)

  • Lena Veit

Howard Hughes Medical Institute

  • Michael S Brainard

Howard Hughes Medical Institute (EXROP summer fellowship)

  • Christian J Monroy Hernandez

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Alla Karpova, Jon Sakata, Dave Mets, William Mehaffey, Assaf Breska, and Guy Avraham for helpful discussions and comments on earlier versions of this manuscript. This work was supported by the Howard Hughes Medical Institute. Lena Veit was supported as a Howard Hughes Medical Institute Fellow of the Life Sciences Research Foundation and by a postdoctoral fellowship from Leopoldina German National Academy of Sciences. Christian J Monroy Hernandez was supported by an HHMI EXROP summer fellowship.

Ethics

Animal experimentation: All procedures were performed in accordance with protocols (#AN170723- 02) approved by the University of California, San Francisco Institutional Animal Care Use Committee (IACUC).

Senior Editor

  1. Barbara G Shinn-Cunningham, Carnegie Mellon University, United States

Reviewing Editor

  1. Jesse H Goldberg, Cornell University, United States

Reviewers

  1. Jesse H Goldberg, Cornell University, United States
  2. Constance Scharff, Freie Universitaet Berlin, Germany

Version history

  1. Received: July 30, 2020
  2. Accepted: April 25, 2021
  3. Version of Record published: June 1, 2021 (version 1)

Copyright

© 2021, Veit et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


