Humans parsimoniously represent auditory sequences by pruning and completing the underlying network structure

  1. Lucas Benjamin (corresponding author)
  2. Ana Fló
  3. Fosca Al Roumi
  4. Ghislaine Dehaene-Lambertz
  1. Cognitive Neuroimaging Unit, CNRS ERL 9003, INSERM U992, Université Paris-Saclay, NeuroSpin center, France

Abstract

Successive auditory inputs are rarely independent, their relationships ranging from local transitions between elements to hierarchical and nested representations. In many situations, humans retrieve these dependencies even from limited datasets. However, how this learning operates at multiple scales is poorly understood. Here, we used the formalism proposed by network science to study the representation of local and higher-order structures and their interaction in auditory sequences. We show that human adults exhibited biases in their perception of local transitions between elements, which made them sensitive to high-order network structures such as communities. This behavior is consistent with the creation of a parsimonious simplified model from the evidence they receive, achieved by pruning and completing relationships between network elements. This observation suggests that the brain does not rely on exact memories but on a parsimonious representation of the world. Moreover, this bias can be analytically modeled by a memory/efficiency trade-off. This model correctly accounts for previous findings, including local transition probabilities as well as high-order network structures, unifying sequence learning across scales. Finally, we propose putative brain implementations of such a bias.

Editor's evaluation

This paper communicates important findings on the learning of local and higher order structures in auditory sequences and will be of interest to researchers studying statistical learning, learning of graph structures, and auditory learning. The strength of the evidence is convincing, including a compelling demonstration that humans do not encode objective transition probabilities and the implementation of a wide range of sequence learning models that have been proposed in the literature.

https://doi.org/10.7554/eLife.86430.sa0

Introduction

“The fact, then, that many complex systems have a nearly decomposable, hierarchic structure is a major facilitating factor enabling us to understand, describe, and even ‘see’ such systems and their parts” – H. Simon, The architecture of complexity (1962).

To interact efficiently with their environment, humans have to learn how to structure its complexity. In fact, far from being random, the sensory inputs we face are highly interdependent and often follow an underlying hidden structure that the brain tries to capture from the incomplete or noisy input it receives. For instance, Tenenbaum et al., 2011, proposed that learning implies building the simplest underlying relational model that can explain the data. Indeed, evidence suggests that humans can infer structures from data at different scales, ranging from local statistics on consecutive items (Saffran et al., 1996) to local and global statistical dependencies across sequences of notes (Basirat et al., 2014; Bekinschtein et al., 2009) or higher-order and abstract relationships such as pattern repetitions (Barascud et al., 2016), hierarchical patterns and nested structures (Dehaene et al., 2015), networks (Garvert et al., 2017; Schapiro et al., 2013), and rules (Maheu et al., 2020).

At first, the extraction of local regularities in auditory streams was proposed as a major mechanism to structure the input, available from an early age, since Saffran et al., 1996, showed that 8-month-old infants can use transition probabilities (TPs) - $P(E_t|E_{t-1})$ - between syllables to extract words from a monotonous stream with no other available cues. Since then, the sensitivity of humans to local dependencies has been robustly demonstrated in the auditory and visual domains (Fiser and Aslin, 2002), without the focus of attention (Batterink and Choi, 2021; Batterink and Paller, 2019; Benjamin et al., 2021), and even in asleep neonates (Benjamin et al., 2023; Fló et al., 2022). Moreover, it is not limited to adjacent elements but can be extended to non-adjacent syllables - $P(E_t|XE_{t-2})$ - which could account for non-adjacent dependencies in language (Peña et al., 2002).

However, the computation of TPs between adjacent - $P(E_t|E_{t-1})$ - and non-adjacent elements - $P(E_t|XE_{t-2})$ - seems too limited to allow the extraction of higher-order properties without an infinite memory that the human brain does not have. Network science - an emerging interdisciplinary field - thus proposed a different description to characterize more complex streams (Lynn et al., 2020). In this framework, a stream of stimuli corresponds to a random walk through the associated probabilistic network. Several studies used this network approach to investigate how humans encode visual sequential information (Garvert et al., 2017; Mark et al., 2020). Schapiro et al., 2013, tested human adults with a network consisting of three communities (i.e. sets of nodes densely connected with each other and poorly connected with the rest of the graph; Newman, 2003) where transitions between all elements were equiprobable (each node had the same degree). This community structure is an extreme version of the communities and clustering properties that are often found in real-life networks, whether social, biological, or phonological (Girvan and Newman, 2002; Karuza et al., 2016; Siew, 2013). The authors reported that subjects discriminated transitions between communities from those within communities. Since local properties (TPs) were not informative, this result revealed participants’ sensitivity to higher-order properties not covered by local probabilistic models. This sensitivity already seems to be in place in 6-year-olds (Pudhiyidath et al., 2020). Recently, Lynn et al., 2020, replicated a similar effect with a probabilistic sequential response task. They presented subjects with sequences of visual stimuli that followed a random walk through a network composed of three communities. After each stimulus, subjects were asked to press one or two computer keys, and their reaction time was measured as a proxy of the predictability of the stimulus. To explain the response pattern, the authors proposed an analytical model that optimizes the trade-off between accuracy and computational complexity by minimizing the free energy function. This model takes into account the probability of memory errors in the computation of the TPs between the elements of the stream. From now on, we will refer to this model as the free energy minimization model (FEMM: model D, explained below).

In this paper, we aim to merge these two lines of research and validate a model that can explain how humans learn local and high-order relations simultaneously present in sequences generated from noisy or incomplete structures. Moreover, we propose that adults do not encode the exact input but a parsimonious version based on the generalization of the underlying structure. To this end, we leveraged the community network framework and adapted it to expose adult participants to rapid sequences of sounds that followed a random walk through a network, building on the studies described above (Lynn et al., 2020; Schapiro et al., 2013), but using sparse communities with missing transitions between elements of the same community (see Figure 1). This design allows investigating whether participants are able to complete the network according to the high-order structure or whether, on the contrary, they rely on local transitions and reject impossible transitions, ignoring the high-order structure. In other words, after training with an incomplete network, if new (‘unheard’) transitions are presented, are participants more willing to accept them if they belong to a community (i.e. within community transitions) than if they occur between communities? Moreover, while several papers have studied network learning in the visual domain (Karuza et al., 2019; Lynn et al., 2020; Schapiro et al., 2013), to our knowledge, it has never been tested in the auditory domain, despite the better statistical learning capacities in the auditory modality (Conway and Christiansen, 2005), the sophisticated auditory sequence processing abilities observed in humans compared to other primates (Dehaene et al., 2015), and their potential importance in language acquisition. In addition, the original designs used a very slow presentation rate, leaving room for conscious decisions about how well each element of the sequence fit the structure. Here, we used a 4 Hz presentation rate, typical of auditory sequence learning tasks, in order to force rapid processing of each element of the sequence and to be more comparable to the sequence learning literature. Finally, we compared how the different models proposed in the literature fit our data and propose a unified hypothesis of how any structure (local or global) might be extracted from a sequence.

Experimental design.

(A) Graph structure to which adult subjects were exposed in three different paradigms. (B) Graph design with color-coded conditions. Blue and pink lines represent transitions that were never presented during the stream presentation but only during the forced-choice task. (C) Test procedure used for behavioral testing. In the press task phase, participants had to press a key when they felt there was a natural break in the sequence. In the forced-choice task, they had to choose, between two quadruplets, the one most congruent with the sequence they had heard. In each proposed pair, one quadruplet always contained a familiar within community transition (purple transitions), and the other a transition from one of the three other conditions.

For this purpose, we tested three different experimental paradigms in an online task, using sequences of pure tones or of syllables (~240 adult participants tested in each paradigm). The first paradigm - full community - tested a network composed of two communities of six elements each, with all nodes within a community connected to each other (except two nodes at the border of the community, to keep an equal degree for each node). In the second and third paradigms, the communities were incomplete, with some connections never presented during exposure to the continuous sequence: in the sparse and high sparse community paradigms, one and two possible edges per node were removed, respectively. The performance in these two ‘sparse’ designs, relative to the full community design, is crucial for investigating the participants’ underlying representations of the sequences.

In each paradigm, participants were first asked to listen carefully to a continuous sequence for about 4 min and then to press a key when they felt there was a natural break in the sequence (~2 min). This task allowed us to measure participants’ ability to parse the sequence and to compare their performance in the auditory domain with that published in the visual domain. In a subsequent test phase, they were asked to choose, between two isolated quadruplets, the one most congruent with what they had heard during the familiarization sequence. With this test phase, we could present previously unheard transitions (‘new transitions’) and study whether participants were able to generalize the network structure (Figure 1), notably in the two incomplete networks (sparse and high sparse paradigms). These two tasks were done twice.

In the forced-choice task between isolated quadruplets, we tested each of the other conditions against the familiar within community transitions (the condition considered as the reference; Figure 1C). If participants did not learn the graph structure of the sequence, their familiarity choices between familiar within and familiar between community transitions should be at chance, because all of these quadruplets had been presented and had the same local TPs between their elements. By contrast, if they had indeed learned the graph, their familiarity score should be below 50%, denoting a preference for the familiar within community transitions (i.e. the reference). The performance for the unheard transitions, which can be either within or between community transitions (i.e. the new within community and new between community conditions), relative to the reference, should allow us to separate the different models proposed in the literature to explain how structures are perceived. Therefore, we compared the participants’ behavior (i.e. their familiarity rating for the presented transition relative to the reference) to the predictions of different theoretical models proposed in the stream processing and graph learning literature (Figure 2).

Model predictions.

Model description and predictions for the three paradigms tested. For each model, we computed the estimated familiarity (a.u.) predicted for each condition in the full, sparse, and high sparse paradigms. Although the models are partially correlated, they differ in their predictions about the familiarity of new within community transitions (light blue), which allows the models to be separated. Models D and E (free energy minimization model [FEMM] and hitting time) are two variations of the same sequence property, from a statistical modeling or a sequential point of view; their predictions are thus almost identical. Models A, B, C, D, and E are theoretical metrics over the graph structure that predict more or less familiarity with the different types of transitions. Models F and G are biologically plausible neural encodings of those metrics. The box colors correspond to the conditions labeled in the top-left panel.

  • Model A: TPs and Ngrams: Local transitions between consecutive elements - $P(E_t|E_{t-1})$ - have been proposed as an efficient learning mechanism to structure streams of input. We tested the limits of this simple local computation in the presence of a high-order structure. Ngrams are similar to TPs but take the n previous items into account in the computation of the transition; for example, for trigrams, $P(E_t|E_{t-1}E_{t-2}E_{t-3})$. Note that because our designs are random walks through Markovian networks, the TP and Ngram models are identical: $P(E_t|E_{t-1}E_{t-2}E_{t-3}) = P(E_t|E_{t-1})$. Chunking-based models, such as PARSER (Perruchet and Vinter, 1998), rely on the repetition of chunks of consecutive elements and, like the TP and Ngram models, would reject any chunk with new transitions as they never occurred during familiarization.

  • Model B: Non-adjacent TP: This metric is similar to the TPs but computed on non-consecutive items, $P(E_t|XE_{t-2})$. We included it in our analysis because several studies have shown human sensitivity to such properties in streams (Peña et al., 2002).

  • Model C: Graph communicability: This model comes from the network science literature and computes the relative proximity between nodes in the network, making it sensitive to cluster-like structures such as communities. Interestingly, a recent study showed that this measure correlates with fMRI data (Garvert et al., 2017), suggesting a potential relevance in human cognition.

  • Model D: FEMM: This model, recently proposed by Lynn et al., 2020, to account for community sensitivity in humans, is a trade-off between accuracy and computational complexity. It can be explained by memory errors while computing TPs between elements in a stream. Participants exposed to a stream of elements reinforce the association between elements i and i−1. However, errors in this process may lead participants to sometimes bind element i with element i−2, i−3, i−4, ... with decreasing probability (for a full description of the model, see Lynn et al., 2020). Mathematically, the distribution of error sizes that minimizes the free energy function is a decreasing exponential (Boltzmann distribution). Therefore, the estimated mental model of TPs is biased compared to the stream’s objective TPs, enabling participants to encode high-order structure. In more detail, the mental model is a linear combination of the TP matrix ($A$) and the non-adjacent TPs of every order ($A^{\Delta t}$), with a weight of $P(\Delta t)$, where $\Delta t$ is the order of non-adjacency (or size of the memory error; i.e. $\Delta t = n$ corresponds to $P(E_t|X \ldots X E_{t-n})$). The estimated model can then be written as:

  • $\hat{A} = \sum_{\Delta t=0}^{+\infty} P(\Delta t)\, A^{\Delta t+1}$
  • with

  • $P(\Delta t) = \frac{1}{Z}\, e^{-\beta \Delta t}$
  • where A is the TP matrix of the graph. β was previously estimated at 0.06 in a comparable task with human adults (Lynn et al., 2020). We therefore first used this value to test this model on our behavioral data and later confirmed this estimate with our data (see SI). In the reinforcement learning literature, hippocampal place cells have been proposed to represent maps of probabilistic future states and reward by encoding a successor representation instead of positional cognitive maps (Dayan, 1993; Stachenfeld et al., 2017). The successor representation has been formally defined as the sum of probabilistic future states and can be written $SR = \sum_{\Delta t} \gamma^{\Delta t} A^{\Delta t}$. This approach is very similar to the FEMM, with an infinite sum of all powers of the transition matrix, weighted by an exponentially decreasing factor. Here, the factor is $\gamma^{\Delta t}$ with $0 < \gamma < 1$ and generally $\gamma = 0.85\,\lambda_{max}$, with $\lambda_{max}$ the largest eigenvalue of the transition matrix (Garvert et al., 2017). This approach has been proposed to account for community perception (Pudhiyidath et al., 2022), but here we only included the FEMM in our study, as the two models are identical with $\gamma = e^{-\beta}$ (up to a different constant).

Another metric computing the same property but from a sequence point of view is the hitting time.

  • Model E: Hitting time: This metric, also coming from network science, estimates the distance between two nodes in a graph as the average number of edges needed (path length) to move from one node to another during a random walk. Similar to communicability (model C) and FEMM (model D), it measures a ‘proximity’ between nodes in a network. To make it more comparable with the other models, we computed its inverse value.
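To make these metrics concrete, the analytical models can be computed in a few lines from a graph's transition matrix. The sketch below is ours, not the authors' code (Python with NumPy/SciPy, using the FEMM closed form derived in the Materials and methods); it returns the predicted familiarity matrices for models A-D, while model E is approximated by simulating a random walk, as described in the Materials and methods.

```python
import numpy as np
from scipy.linalg import expm

def model_predictions(A, beta=0.06):
    """Predicted familiarity matrices for models A-D, given a row-stochastic
    transition matrix A. beta is the FEMM decay (0.06, from Lynn et al., 2020)."""
    tp = A                                   # Model A: transition probabilities
    non_adjacent_tp = A @ A                  # Model B: non-adjacent TPs (A^2)
    communicability = expm(A)                # Model C: sum over dt of A^dt / dt!
    g = np.exp(-beta)
    femm = (1 - g) * A @ np.linalg.inv(np.eye(len(A)) - g * A)  # Model D
    return tp, non_adjacent_tp, communicability, femm
```

Each matrix can then be averaged over the transitions belonging to each condition (familiar/new × within/between) to obtain per-condition predictions.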

Although the different models are partially correlated with each other, they give different predictions about participants’ familiarity responses. First, there were two kinds of local transitions: familiar transitions and new transitions (TP = 0). Since the TP calculation does not consider the community structure (model A), participants should equally reject new transitions regardless of their relation to communities (new within communities = new between communities). Second, concerning the new transitions, the FEMM and hitting time models predict that participants should better detect new between community than new within community transitions (completion effect). This is also partly the case for the communicability model, but not for the TP and non-adjacent TP models (models A and B). The similarity of the predictions of the FEMM, hitting time, and communicability models is not surprising, as they all describe the same property of the network: proximity between nodes. Intuitively, items from the same community will appear closer together than items from different communities, even if the two nodes are not connected. In fact, FEMM and communicability are mathematically very close but with a different decay (exponential vs. factorial). However, they can still be differentiated thanks to the high sparse paradigm, where the relative predicted familiarities of new within and familiar between transitions differ between the two models.

In addition to those theoretical models, we considered two putative brain implementations using biologically realistic neural networks:

  • Model F: Hippocampus CA1 similarity: This neural network aims to reproduce the hippocampus structure (Norman and O’Reilly, 2003), which is often described as a key structure in statistical and structure learning (Henin et al., 2021; Schapiro et al., 2017; Schapiro et al., 2016). We computed the similarity in the CA1 layer, as it has been proposed to capture community-like structures in previous studies (Schapiro et al., 2017). Indeed, thanks to its overlapping representations of the input and its direct connection with the entorhinal cortex through the monosynaptic pathway, the CA1 layer is also sensitive to long-distance dependencies, allowing high-order structure learning.

  • Model G: Hebbian learning with decay: Hebbian learning is a biologically plausible implementation of associative learning. Some neurons fire selectively for specific objects in the environment; when two such neurons co-fire, their connection is reinforced. It has been suggested that learning TPs is based on such a mechanism in the cortex. Here, we adapted this idea to implement the FEMM computation instead of TPs, specifically by adding a temporal exponential decay in the probability of a neuron firing after a stimulus’s presentation. When the exponential decay has the same β parameter as the FEMM, the results of the FEMM and of Hebbian learning with decay are mostly similar.

Results

Human behavior

Key presses distribution during active listening

All participants were exposed to a stream of either tones or syllables adhering to one of three possible graphs (Figure 1A and B). After a 4 min familiarization period, they were instructed to press the spacebar when they perceived a natural break in the sequence (2 min). This task was a sanity check to corroborate that participants were listening to the stream and that their performance was comparable to previous studies testing graph learning in the visual modality at a much slower pace than used here. Figure 3, top row, shows the normalized probability distribution of key presses after a transition, using a kernel approach (see Materials and methods for the detailed computation). In all three paradigms (each corresponding to a graph in Figure 1), the significant increase in key presses after between community vs. within community transitions (p<0.05, indicated by bold lines) reveals that participants were sensitive to the switch between sound communities. The full community and sparse community designs showed a similar effect size, while the high sparse community design elicited a small but significant effect. Unpaired t-tests computed every millisecond in the [–0.1, 2.75] s window, contrasting the full community vs. the high sparse community, showed a significant difference between 1 and 2.6 s post-transition (p<0.05 Bonferroni corrected). Similarly, the sparse community vs. high sparse community contrast differed between 0.8 and 2.5 s (p<0.05 Bonferroni corrected).

Behavioral results.

Top panel: parsing probability during the active listening phase (distribution of key presses after the offset of a given transition); purple lines: familiar within community transitions, red line: familiar between community transitions. Thin purple lines each represent a bootstrap occurrence of the parsing probability for the familiar within community transitions. The bold red line indicates the time points where there was a significant increase in parsing probability after a familiar between community transition compared to a familiar within community transition. Bottom panel: familiarity measure in each paradigm: percentage of responses for each condition during the forced-choice task. By design, the chance level (50%) represents the familiar within community estimated familiarity (reference). The stars indicate significance against the reference and between conditions (p<0.05 FDR corrected); the dotted line indicates marginal significance (p=0.046 uncorrected). The error bars represent the standard error for each condition. N=727 participants were tested in the Full Community (N=250), Sparse Community (N=249), or High Sparse Community (N=228) paradigms.

Two-forced-choice task

Participants were given a two-forced-choice task, in which they had to choose, between two sequences, the one that best matched the structure of the stream they had listened to (Figure 1C). This task is the crucial test for comparing models because it allows presenting new transitions that matched, or not, the familiar structure and thus assessing the representation of the memorized graph. We report the results at the end of learning (second block). Results separated by group and testing block are presented in SI. In contrast with the three other data points, participants’ choices were close to random after the first block in the syllable experiment, and their performance could not be explained by any of the models. As pointed out in other experiments on statistical learning using syllables (Elazar et al., 2022; Onnis and Thiessen, 2013; Siegelman et al., 2018), familiarity with speech and the phonetic rules of the native language creates priors on the probability of syllable sequences that might compete with the real syllable distribution in the task. At the end of learning, no difference was found between the groups using tones and syllables (unpaired t-test for each condition, all ps >0.2), so we merged the data of the tone and syllable groups.

In this task, scores below 50% indicate that the reference (familiar within community transitions) was judged more familiar than the tested condition. We postulated that if participants were only sensitive to familiar transitions, any novel transitions should be judged less familiar than the familiar between community transition. On the other hand, if participants encoded the underlying structure of the communities, they should not notice the novelty of the new within community transitions and reject the two between community conditions (familiar and new).

As can be seen in Figure 3, participants first significantly rejected the new between community transitions in each paradigm (ps <0.01 FDR); these transitions are both novel and cross community boundaries. The familiar between community transition condition was only significantly rejected in the sparse community paradigm (p<0.01 FDR). Second, the new within community transitions were chosen/rejected at chance in the sparse and high sparse community paradigms, indicating a similar perception of familiarity for these never heard transitions and the reference. Third, in the sparse community paradigm, the familiarity score was larger for the new within community transitions than for both between community transitions (new: p<0.01 FDR; familiar: p<0.05 FDR). These comparisons were only marginally significant in the high sparse paradigm (uncorrected p=0.046). In other words, participants encoded the graph structure, as revealed by the difference in familiarity between within and between community transitions, and naturally completed the graph, as indicated by the at-chance scores for never heard transitions compatible with the graph structure.

Which model best fits the participants’ behavior

Correlation between human data and theoretical model predictions

To estimate how well the theoretical models explain the behavioral data, we pooled the three paradigms together and estimated the correlation with each model. We normalized each model’s predictions by the model’s value for familiar within community transitions to make them comparable with the behavioral results of the two-forced-choice task. It is worth noting that models A, B, C, and F predict differences in familiarity for familiar within transitions between the three paradigms (full, sparse, high sparse); however, our experimental design does not allow us to estimate differences in these transitions between paradigms, only differences relative to the familiar within community condition within paradigms. To estimate the significance of the correlation differences, we used a bootstrapping approach, resampling subjects with replacement, and counted the number of bootstrap occurrences in favor of one model against another. Figure 4A shows the distribution of correlations between the data and each model (presented on the diagonal) and between pairs of models. We estimated the significance of the difference in correlation strength between models i and j by counting the percentage of occurrences in which model i correlated more strongly with the data than model j. All models were significantly correlated with the data (all p<0.01 FDR corrected), with correlation strengths following the order FEMM ≈ hitting time > communicability > non-adjacent TPs ≈ TPs (Figure 4C). Note that the FEMM and hitting time are similar models, and their predictions are thus almost identical. They had the best correlation with the data (81%) and were significantly better than all the other theoretical models (p<0.05 FDR).
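For illustration, a minimal sketch of this bootstrap comparison (ours, not the authors' code; it assumes a scores array of shape subjects × conditions and per-condition model predictions) could be:

```python
import numpy as np

def bootstrap_model_comparison(scores, pred_i, pred_j, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples (subjects drawn with replacement) in
    which model i correlates better with the group familiarity pattern than
    model j. scores: (n_subjects, n_conditions); pred_*: (n_conditions,)."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    wins = 0
    for _ in range(n_boot):
        pattern = scores[rng.integers(0, n, n)].mean(axis=0)  # resampled mean
        r_i = np.corrcoef(pattern, pred_i)[0, 1]
        r_j = np.corrcoef(pattern, pred_j)[0, 1]
        wins += int(r_i > r_j)
    return wins / n_boot
```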

Figure 4 with 2 supplements.
Model and data comparisons.

(A) Estimation of the correlation of the participants’ familiarity score pattern with each theoretical model (A to E) using bootstrap re-sampling. The diagonal of the matrix displays the distribution of correlations between the participants’ familiarity pattern across conditions and the predictions generated by each model. Each panel of the diagonal presents the same result, with the color of the relevant model highlighted to facilitate comparison between models. For each pair, the significance between models (indicated by stars) is estimated by counting the number of bootstrap occurrences for which one model was more correlated with the data than the other. We plotted this bootstrap as a cloud of dots in the Correlation with Model1 × Correlation with Model2 subspace. Significance is then represented by the percentage of dots above the diagonal. Models with similar predictions display a thin cloud of dots aligned along the diagonal. (B) We did the same comparison with the two neural models (F&G). (C) Summary of the correlations between each model and the behavioral data. Plain lines above the boxes indicate significant differences between models. FEMM and hitting time (D&E) are equivalent, equally good, and significantly better than all other theoretical models. For the neural models, the Hebbian model (G) shows a slightly, but highly significantly, better fit with the participants’ scores. The dotted line indicates the ceiling fit level estimated for this dataset.

Correlation between human data and neural model predictions

As the FEMM computation and the hitting time were the best theoretical models, we translated them into a realistic biological architecture using Hebbian rules. We estimated this implementation on a 50,000 item-long stream for each paradigm. The correlation between the analytical computation and the Hebbian learning implementation exceeded 99%. Using the same bootstrap approach, we compared this Hebbian approach with a neural network reproducing the hippocampus architecture proposed by Norman and O’Reilly, 2003. Both models were highly and significantly correlated with the data and with each other. However, the Hebbian implementation of the FEMM was slightly but significantly more correlated with our data than the hippocampus model (Figure 4), mainly because of the lack of agreement between the hippocampus model and the data in the high sparse paradigm. Nevertheless, because the hippocampus model fits our data well, we cannot rule out the hippocampus as a potentially crucial structure for such tasks.

Estimation of the ceiling correlation with our data

We also used the same bootstrapping approach to estimate the noise ceiling for the model fit. For each bootstrap, we randomly selected n subjects with replacement twice and correlated the data of those two random samples. We found an average correlation of 84%, which serves as the noise ceiling for these data. Our best fit with any model is the 77% average bootstrap correlation between our data and the FEMM, which is relatively close to the ceiling fit for this dataset, showing that the FEMM accounts for the data very well.
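The noise ceiling computation follows the same logic; a sketch (ours, with the same assumed scores array as above):

```python
import numpy as np

def noise_ceiling(scores, n_boot=1000, seed=0):
    """Average correlation between the condition means of two independent
    subject resamples (with replacement): the best fit any model can reach."""
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    rs = [np.corrcoef(scores[rng.integers(0, n, n)].mean(axis=0),
                      scores[rng.integers(0, n, n)].mean(axis=0))[0, 1]
          for _ in range(n_boot)]
    return float(np.mean(rs))
```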

Discussion

TPs between elements of the sequence are biased by the structure of the underlying generative network

Our results show that human adults do not encode TPs objectively when familiarized with a stream of sounds. Instead, they seem to have a systematic bias to complete the transitions within a community, suggesting a subjective internal representation that differs from the objective distribution of the transitions they heard. This behavior is compatible with two of the proposed theoretical models: the FEMM and the hitting time.

The high agreement we observed between the FEMM and the data suggests that the bias can be analytically estimated using the FEMM: $\hat{A} = \sum_{\Delta t=0}^{+\infty} P(\Delta t)\, A^{\Delta t+1}$ with $P(\Delta t) = \frac{1}{Z}\, e^{-\beta \Delta t}$. Lynn et al., 2020, proposed that this bias corresponds to memory errors when recalling the previous item of the stream during the TP computation. The bias in the encoding of TPs between successive elements enabled the extraction and encoding of high-order structures in graphs, that is, a community structure. We can distinguish two distinct bias effects: first, the pruning of familiar transitions that do not conform to the community structure (i.e. familiar between community transitions are rejected); second, the completion of the structure by overgeneralizing new transitions when they are compatible with the high-order structure (i.e. new within community transitions are accepted). These perceptual biases lead to a more parsimonious internal representation of graphs.

Putative brain implementation of such computation

We showed that the computation of TPs is biased in humans, and analytically, this bias is characterized as an optimal trade-off between accuracy and computational complexity. Indeed, perfect accuracy in the encoding would result in no sensitivity to the high-order structure, while too low accuracy would result in no learning at all. We also presented putative brain implementations and tested to what extent two previously described mechanisms might explain our results: Hebbian learning and hippocampus episodic memory.

Hebbian learning is a very simple mechanism that consists of reinforcing co-occurrences in a signal. It has been proposed as a learning mechanism in statistical learning tasks (Endress and Johnson, 2021). Here, we minimally modified it as described above to introduce the bias in TP computation. Such learning could be implemented in many brain regions through learning-induced synaptic plasticity and does not require any specific structural organization of neurons. In contrast, the CA1 similarity model relies on the specific architecture of the hippocampus. Testing a hippocampus-specific model is essential because several authors have proposed that statistical learning and graph learning might be represented as the construction of an abstract map of relational knowledge, analogous to topographic maps (Constantinescu et al., 2016; Garvert et al., 2017), which are known to involve the hippocampus. Moreover, the hippocampus has also been proposed as a good candidate for the implementation of the successor representation, giving this structure the role of a predictive map unifying temporal and spatial relational knowledge under a common framework (Stachenfeld et al., 2017).

A recent experimental study (Henin et al., 2021) showed that when exposed to statistically organized auditory or visual streams, the hippocampus activity measured with ECoG exhibited a cluster-like behavior, with all elements belonging to the same group being similarly encoded. Using the community paradigm with fMRI, Schapiro et al., 2016, also reported an increased pattern similarity in the hippocampus for elements belonging to the same community (see also Pudhiyidath et al., 2022). Another piece of evidence comes from modeling the hippocampus activity in different statistical learning tasks (Schapiro et al., 2017). In this study, the authors used a neural model mimicking the hippocampus architecture and trained it on different statistical learning tasks including community structure learning. They showed that the pattern of activity in CA1 might account for both pair learning (episodic memory) and community structure learning, and thus is partially consistent with two mechanisms observed in the hippocampus: pattern completion (i.e. the similarity of the neural representations of close stimuli increases, which allows generalization) and pattern separation (i.e. the similarity of neural representation of close stimuli decreases, to disambiguate them) (Bakker et al., 2008; Liu et al., 2016; Yassa and Stark, 2011).

Here, we showed that both a general Hebbian model and a more specific hippocampal model fit the pattern of familiarity scores given by the participants very well, with a slightly, yet significantly, better result for the Hebbian learning approach. Since we only have behavioral results, it is difficult to draw conclusions about the exact brain regions involved, especially since recent work proposed the joint use of several computations involving cortical and hippocampal learning in similar tasks (Varga et al., 2022; Whittington et al., 2020). In any case, the agreement between the behavioral data and two brain models shows that the FEMM (an analytical model) not only explains behavioral data but also has biologically valid candidate implementations.

A general model of statistical learning for sequence acquisition

Statistical learning has been proposed as a powerful general learning mechanism that might be particularly useful in language acquisition to extract words from the speech stream (Saffran et al., 1996). However, the exact model explaining statistical learning remains under-specified: what is computed remains unclear (Fló et al., 2022; Henin et al., 2021), and authors have often tailored the computation to suit the paradigm (TPs in some studies, non-adjacent or backward TPs in others, biased transition probabilities in network studies, etc.). We argue that the FEMM is a more general model that, beyond explaining community separation, as shown above, can also account for results traditionally explained by the computation of local transition probabilities and for those that require the computation of long-distance dependencies. Indeed, the first-order approximation of the FEMM corresponds to the objective TP model ($\hat{A}_0$, see SI). Thus, the predictions of the FEMM are the same as those of the TP model in many tasks, notably in classical speech segmentation experiments, where a drop in TP signals word edges (Saffran et al., 1996). Another approach in the sequence learning literature considers the recognition of chunks, rather than statistical learning, as the primary mechanism for segmenting sequences. Based on this approach, PARSER and TRACX detect frequently occurring chunks in sequences but do not associate a familiarity rating with each transition. In a previous experiment (Benjamin et al., 2023), we showed that familiarity based on statistical learning does not always lead to sequence chunking, and here we focused on this sense of familiarity, which does not require the construction of the repertoire of possible chunks postulated by chunking models. Therefore, we did not consider these models here.

Another part of the statistical learning literature focuses on AxC structures, in which the first syllable of a triplet predicts the last one (Buiatti et al., 2009; Endress and Johnson, 2021; Kabdebon et al., 2015; Marchetto and Bonatti, 2015; Peña et al., 2002). The computation of first-order TPs is insufficient to solve this task, which requires the encoding of non-adjacent TPs. However, a biased estimation of TPs following the FEMM is sensitive to non-adjacent dependencies and can explain the learning of AxC structures. Additionally, as previous papers and our results show, the FEMM can also explain subjects’ behavior in different kinds of network learning (Karuza et al., 2016; Lynn et al., 2020; Schapiro et al., 2013). Lynn et al., 2020, interpret the FEMM as errors in the associations between elements, whose probability decays with the distance between associated elements. We propose that implementing the TP computation through Hebbian learning with a firing decay results in a computation comparable to the free energy model.

Finally, a similar Hebbian learning approach can explain the sensitivity to backward TPs reported in the literature (Endress and Johnson, 2021; Pelucchi et al., 2009). A related idea has recently been proposed by Endress and Johnson, 2021. However, these authors did not refer to a free energy optimum or provide an analytical approach. Instead, they proposed a Hebbian learning rule with the same idea of mixing TPs with non-adjacent TPs (which corresponds to a second-order approximation of the FEMM that we propose here, see $\hat{A}_1$ in SI). As we do here, they argued that this mechanism could account for results currently explained by different models in the literature. Thus, the FEMM and its putative neural implementation through Hebbian rules unify different proposals concerning statistical learning on the one hand and network learning results on the other under a common principle. It is important to note that we investigated how the FEMM - and the other models - account for the extraction of regularities from a sequence, which is the first step needed by many other processes. We did not test for further abstract representations of the sequence that could be subsequently computed.

Information compression and stream complexity

Our results showed that adult humans have a biased subjective representation of first-order TPs compared to the actual TPs, which makes them sensitive to high-order structure in the underlying graph and leads them to overgeneralize transitions that they never experienced. What is the advantage of such a computational bias for human cognition? We postulate three main advantages.

Higher-order structures and generalization can be relevant information to learn. Unlike random networks, many real-world networks have transitivity properties (Girvan and Newman, 2002; Newman, 2006; Newman, 2003) - if A is connected to B and B to C, there is a high chance for A and C to be connected (a friend of my friend is likely to be my friend).

Overgeneralizing enables faster learning. Overgeneralizing means accepting transitions congruent with the structure even before they appear in the stream. Thus, for short exposures, the estimation of the FEMM is closer to the real TP matrix than the estimation of the TP model based on the input because it infers transitions that have not been presented yet. This fast learning might be of importance, for example, for language acquisition, given that human infants are exposed to a limited amount of speech.

As a third advantage, we propose that this learning bias in extracting statistical information might subsequently be used to form abstract condensed network representations. In fact, the extraction of high-order structures might enable information compression in long-term memory. Because of the computational cost and the pressure on memory to encode long sequences, compressing information is a major advantage. In a community paradigm, the learned representation could later be simplified to reduce the stream complexity to a binary sequence with a certain probability of switching between communities A and B (Figure 5). Instead of remembering all the transitions of the stream, remembering community labels and the probability of transition between communities is sufficient. Recent data (Al Roumi et al., 2021; Dehaene et al., 2014; Planton et al., 2021; Sablé-Meyer et al., 2022; Sablé-Meyer et al., 2021) showed that in some circumstances, human performance is highly sensitive to input compressibility, arguing for a condensed encoding of inputs. Note that the familiarity measure we report here does not itself show compression of the structure. Still, the familiarity bias could be at the basis of a later abstract condensed network representation (this hypothesis is presented in Figure 5). Along the same lines, a recent study using a graph perspective (Whittington et al., 2020) proposes that the representation of the abstract relational structure of a sequence and the mapping between nodes and stimulus identity could be factorized. In the case of the community paradigm, Pudhiyidath et al., 2022, even proposed that the formation of such an abstract structure could allow humans to transfer learnt properties between elements belonging to the same community. Mark et al., 2020, showed that the learned structure of a network could be re-used the next day to allow fast and generalizable learning, arguing for a factorized brain representation between the stimulus mapping and the abstract network encoding. This compressibility hypothesis, represented in Figure 5, needs formal testing to be confirmed or refuted.

Network compression hypothesis.

Compressibility hypothesis. The left panel shows the real underlying structure of the input. The middle panel shows the representation learned by humans: as described above, this representation does not exactly reflect the real input structure but is a biased, parsimonious version of it, including pruning and generalization of transitions. The right panel shows the condensed representation that we hypothesize might be formed subsequently to simplify and compress the information; in this representation, the identity of the elements would be ignored in favor of their community label. The familiarity of each transition is represented by the transparency of the edges in the network representation, and the familiarity pattern of each condition is shown in the barplots below.

Finally, the human sensitivity to communities is in line with Simon’s postulate that the complexity of a system can only be handled thanks to its hierarchical, nearly decomposable properties (Simon, 1962). In other words, a complex structure is no more than the sparse assembly of less complex dense substructures. Here, we provide empirical support by demonstrating that human adults are sensitive to the decomposition of a complex network into two simpler sub-networks.

Methodological remarks

In this study, we used two different metrics. The key-press task during attentive listening showed high sensitivity, but it only allowed testing within vs. between community transitions during learning and thus assessing clustering (different perception of familiar within and familiar between transitions). The forced-choice task on the isolated quadruplets allowed testing more conditions after learning and thus distinguishing between models. However, this second metric had low sensitivity because only a few trials could be collected, resulting in high error variance that was compensated for by a very large sample of participants (N=727).

This design also did not allow us to efficiently study the dynamics of learning. We had only two time points to estimate graph learning through explicit quadruplet familiarity judgments. This is particularly insufficient when, as here, the speech or non-speech nature of the stimuli modulates performance because of different priors on the possible composition of the sequences. Even for tones, we could not determine when learning took place, as it seemed stable from the first measurement point.

Materials and methods

Behavioral task

Participants

A total of 727 French adults were recruited via social media (424 of whom were compensated $2.50 via the Prolific platform). They were required to have no hearing or language problems and to be native French speakers. They were assigned to one version of the experiments and instructed to carefully listen for 4.4 min to a nonsense language composed of nonsense words that they had to learn, because they would have to answer questions about the words afterward. Participants were exposed to either the full community (N=250), the sparse community (N=249), or the high sparse community (N=228) paradigm, with either pure tones or syllables as stimuli.

Ethics approval


All participants gave informed consent for participation and publication, and this research was approved by the ethics committee of Paris-Saclay University under the reference CER-Paris-Saclay-2019-063.

Stimuli


We generated 12 tones of 275 ms duration, linearly distributed from 300 to 1800 Hz. We also generated syllables with the same duration and flat intonation using the MBROLA text-to-speech software (Dutoit et al., 1996) with French diphones. There was no coarticulation between syllables.
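For illustration, the tone alphabet could be generated as follows (a sketch; the sampling rate and the 5 ms onset/offset ramps are our assumptions, not specified in the text):

```python
import numpy as np

def make_tone_alphabet(n_tones=12, f_lo=300.0, f_hi=1800.0, dur=0.275, sr=44_100):
    """12 pure tones, 275 ms each, frequencies linearly spaced 300-1800 Hz."""
    t = np.arange(int(dur * sr)) / sr
    ramp = np.minimum(1.0, np.minimum(t, t[::-1]) / 0.005)  # 5 ms linear ramps
    return [ramp * np.sin(2 * np.pi * f * t)
            for f in np.linspace(f_lo, f_hi, n_tones)]
```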

Each experiment was composed of 4.4 min of an artificial monotonous stream of concatenated tones (or syllables) without any pause, resulting from a random walk through the tested graph. The graph was either complete (full community), missing one transition per node (sparse community), or missing two transitions per node (high sparse community), creating three experimental paradigms. To avoid any putative acoustical bias, we collected eight groups of subjects for each paradigm. For each of the eight groups, we randomly generated a new graph (except for the full community graph, for which only one graph was possible), a new correspondence between the alphabet of tones (or syllables) and the nodes of the graph, and, finally, new random walks through the graph.
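A random walk through such a graph is straightforward to generate from its transition matrix; a minimal sketch (ours, not the authors' code) of the stream generation:

```python
import numpy as np

def random_walk(A, n_items, seed=0):
    """Stream of node indices from a random walk through the graph defined by
    the row-stochastic transition matrix A; each index maps to a tone/syllable."""
    rng = np.random.default_rng(seed)
    walk = np.empty(n_items, dtype=int)
    walk[0] = rng.integers(len(A))
    for t in range(1, n_items):
        walk[t] = rng.choice(len(A), p=A[walk[t - 1]])  # next node given current
    return walk
```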

In the original study (Schapiro et al., 2013), the authors explored different graph traversals: random walks and Hamiltonian paths. In a Hamiltonian path, each node is visited only once, avoiding short-distance repetitions and thus controlling for a putative novelty effect when there is a change of community, which could potentially serve as a parsing cue in a random walk. However, participants did not parse the sequences better in the case of random walks relative to Hamiltonian walks (Figure 2 in Schapiro et al., 2016), minimizing the concern of a possible habituation effect when random walks are used. Here, we chose a random walk because the Hamiltonian path introduces more predictability into the sequence. Because already presented stimuli of a community can no longer be presented, the predictability of the next element increases with the length of the path within a community, up to perfect predictability for the fifth and sixth elements (nodes at the border of communities) and the next element in the other community, whereas a random walk keeps predictability flat. Thus, learning a graph through a Hamiltonian walk can be fully explained with Ngram approaches and cannot disentangle the different learning models proposed. Moreover, the number of available Hamiltonian paths drastically decreases with sparsity, up to the point where, in the high sparse paradigm, only a single sequence is possible for a given first element, leading to a trivial repeating pattern of 12 elements.

With a random walk, the tones belonging to the same community are presented on average closer in time than those belonging to different communities. However, the length of the walk within one community can be short, without repetition or without going through all the tones of the community, or longer, with repetition of some tones at a random distance. Therefore, there is no consistency over time that could allow capturing a repetition pattern. Furthermore, the absolute frequency of each tone is equal within the stream, which avoids long-term habituation effects, and the local TP is flat, which avoids the possibility of predicting the next tone. Finally, the tone frequencies were distributed across the two communities to prevent separation based on an auditory spectral partition. However, for the design reasons explained above, Hamiltonian walks are not usable, and thus we could not formally control for a potential habituation effect in our design. The key-press results of this study (but not the two-forced-choice results) are therefore potentially subject to confounding by habituation.

For the isolated quadruplets, we concatenated four sounds so that the first and last transitions were always non-deviant (familiar within transitions), while the middle transition instantiated each type of transition. We used quadruplets in this study for consistency with our team’s previous work, especially to allow comparison of developmental ERP latencies in possible future electrophysiological work.

Procedure


Participants started with a 4.4 min familiarization phase of exposure to the stream (960 items). Then, learning was tested with two tasks. First, participants were told that the order of the tones/syllables was not random and that they had to press the spacebar when there was a noticeable change in the group of tones (or syllables) used in the stream. Second, they were presented with a two-forced-choice task in which they had to choose, between two four-element sequences, the one most likely to be part of the language they had learned.

The two-forced-choice trials always comprised a familiar within community transition and one quadruplet representing one of the other conditions. These conditions were new within community transitions, new between community transitions, and familiar between community transitions (Figure 1). Participants were exposed to eight trials per type (with different sounds each time), except for the new within community type, for which they were only exposed to four trials because, by design, there are only four such transitions in the graphs. Each transition used in the set was presented in both directions (AB and BA). Four catch trials were also included to control participants’ engagement in the task. These catch trials were two consecutive identical quadruplets that subjects had to detect. Then, they were again exposed to a random walk stream for 2.2 min (active listening - 479 transitions), followed by the same forced-choice task as before.

Data processing: active listening task


Participants who pressed fewer than 10, or more than 200, times during the experiments were excluded from further analysis (FC: 52/250; SC: 24/249; HSC: 23/228). A null array of the stream size was built and filled with ones at the times when participants pressed the spacebar (Dirac impulses). To convert it into a continuous signal, we convolved it with an exponential window. Then, we epoched this continuous signal from –2.75 to 2.75 s around each transition’s offset. Finally, we averaged all the epochs corresponding to the four familiar between community transitions and to four out of all the familiar within community transitions, and compared them. We repeated this with 1000 random groups of four familiar within community transitions in each subject. By normalizing and averaging across subjects, we were able to estimate the increase in pressing probability after a familiar between community transition compared to a familiar within community transition at each time point. This method is similar to the kernel approach for estimating a probability density from discrete observations.
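A minimal sketch of this computation (ours; the sampling rate and the exact decay constant of the exponential kernel are assumptions):

```python
import numpy as np

def parsing_probability(press_times, transition_times, sr=100, tau=0.3,
                        win=(-2.75, 2.75)):
    """Key presses become Dirac impulses, convolved with a causal exponential
    window, then epoched around transition offsets (times in seconds)."""
    n = int((max(press_times) + win[1]) * sr) + 1
    dirac = np.zeros(n)
    dirac[(np.asarray(press_times) * sr).astype(int)] = 1.0
    kernel = np.exp(-np.arange(int(3 * tau * sr)) / (tau * sr))
    signal = np.convolve(dirac, kernel)[:n]       # continuous press signal
    lo, hi = int(win[0] * sr), int(win[1] * sr)
    epochs = [signal[int(t * sr) + lo:int(t * sr) + hi]
              for t in transition_times
              if int(t * sr) + lo >= 0 and int(t * sr) + hi <= n]
    return np.asarray(epochs).mean(axis=0)        # mean epoch across transitions
```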

Data processing: forced-choice task


Participants who failed more than two of the four catch trials (two identical quadruplets) were excluded from further analysis (FC: 35/250; SC: 45/249; HSC: 34/228). For each subject, we computed a percentage of preference for the tested transition relative to the reference (familiar within community transition) in each condition (i.e. the ratio between the number of trials in which the subject chose the tested sequence and the total number of trials of that condition). The measure ranges from 0 (the familiar within community transition is always selected) to 100 (the other transition is always selected), with a chance level of 50%. We estimated the familiarity score of each condition vs. the chance level (50%) using paired t-tests. We report the data from the second forced-choice-task session, corresponding to the maximum exposure to the streams. For the tone stream, results were similar in the first and second sessions. For the syllable stream, results from the first session were poorly consistent across participants, probably because the task was more difficult with syllables. Indeed, flat transitions between syllables violate language structure and participants’ priors on syllable sequences. The conflict between priors and the real structure of the sequence might take a variable time to be resolved by each participant (Elazar et al., 2022; Lew-Williams and Saffran, 2012; Onnis and Thiessen, 2013; Siegelman et al., 2018). For completeness, we performed the correlation analysis with each subgroup of data (first vs. second session and tones vs. syllables). These analyses are presented in Figure 4—figure supplement 1. None of the models could adequately explain the first session of the syllable group. To further investigate the learning dynamics, and in particular the influence of priors, another paradigm would be needed, which is beyond the scope of the present study.

Modeling

Theoretical models

For the four models that could be analytically computed from the TP matrix (A, B, C, and D), we computed each model’s predictions for each of our graphs (eight with syllables, eight with tones). Given A, the transition matrix of the graph, the models were computed using the following analytical descriptions (a minimal code sketch of models A-E follows this list):

  • Model A: TP and Ngrams:

  • By construction of the transition matrix, the TPs between nodes are the elements of A.

  • $\hat{A} = A$

  • Model B: Non-adjacent TP:

  • Non-adjacent TPs are computed by taking the square of the transition matrix:

  • $\hat{A} = A^{2}$

  • Model C: Communicability:

  • $\hat{A} = \sum_{\Delta t=0}^{+\infty} P(\Delta t)\, A^{\Delta t}$ with $P(\Delta t) = \frac{1}{\Delta t!}$

  • Thus, $\hat{A}$ corresponds to the exponential series: $\hat{A} = e^{A}$. We used the Matlab function ‘expm’ to compute this value.

  • The communicability model as described in Garvert et al., 2017, uses the adjacency matrix. Here, we used the TP matrix instead. We believe it is more appropriate to consider the relative weight of each transition, and not only its existence, because a random walk on a weighted graph follows the transition matrix rather than the adjacency matrix. It also makes the model more comparable with the others.

  • Model D: FEMM:

  • $\hat{A} = \sum_{\Delta t=0}^{+\infty} P(\Delta t)\, A^{\Delta t+1}$ with $P(\Delta t) = \frac{e^{-\beta \Delta t}}{\sum_{\Delta t'=0}^{+\infty} e^{-\beta \Delta t'}}$
  • which can be rewritten:

  • $\hat{A} = (1 - e^{-\beta})\, A\, (I - e^{-\beta} A)^{-1}$
  • We then computed the average estimate for each condition in each design. Only the FEMM (model D) had a free parameter in its equation. To remove this free parameter and make the model more comparable to the others, we used a previously estimated value of $\beta = 0.06$ reported in the literature (Lynn et al., 2020). To confirm that this estimate was appropriate for our data, we computed the correlation between the subjects’ data and the predictions for $\beta$ ranging from $10^{-15}$ to $10^{15}$. We smoothed this correlation vector to avoid local variations and found a plateau of high correlation for $\beta \in [10^{-4}; 10^{-1}]$, with a maximum at $\beta = 0.049$ (correlation = 81%). Similarly, we computed the correlation between the FEMM and the hitting time estimate as a function of $\beta$. Following the same procedure, we again found a plateau of high correlation for $\beta \in [10^{-4}; 10^{-1}]$, with a maximum at $\beta = 0.053$ (correlation = 99.3%). The two models can therefore be considered quasi-equivalent for the $\beta$ value used in this paper (0.06).

  • Model E: Hitting time: For this model, we approximated its value by generating a 50,000-item-long stream corresponding to each graph and computing the average number of elements between each pair of stimuli. We took the inverse of this value to make it more directly comparable with the other models.

Neural models

  • Model F: CA1 similarity: We used the neural network and procedure described in Schapiro et al., 2017, originally published by Norman and O’Reilly, 2003. We did not change any parameters from this original study because our goal was to assess how well this model predicted our paradigms. We trained it 25 times on each of our graph structures (for each paradigm, 25 batches for 8 groups with syllables and 8 groups with tones: 25*8*2=400 replications). After each training, we presented each node as input in isolation and recorded the pattern of activity in the CA1 layer. To estimate the similarity in the nodes’ encoding, we computed the correlation between the CA1 activity patterns for each pair of elements. Finally, we derived predictions for our task by comparing the similarity between two nodes linked by each of our four types of transitions (a sketch of the similarity step follows).

  • Model G: Hebbian learning with decay: This model aims to implement the FEMM computation by adapting the Hebbian approach proposed for associative learning. To achieve this, we declared a layer of neurons with at least one neuron per node of the graph (it can contain more, for generalization to bigger networks). For each sound in the sequence, the corresponding neuron started firing with an exponential decay matching the FEMM decay. Thus, if another sound was presented before the previous neuron stopped firing, several neurons encoding different nodes co-fired simultaneously, biasing the estimation of the TPs between elements. This co-firing can be exploited by a Hebbian learning rule to update the weights between neurons. The resulting weight matrix is then an estimate of the Free Energy Minimization Model that converges as the length of the input stream increases. To estimate this model, we followed the same procedure as for the hitting time: we created a 50,000-item-long stream corresponding to each graph and used it as input to the neural network, updating the weight matrix at each step with the Hebbian rule described above. The weight matrix after the 50,000 items was used as the estimate of the model (see the sketch below).

Model comparison

To compare models and data, we considered all experimental paradigms together. To make the predictions comparable with the two-forced-choice data, we normalized each design’s predictions by the model’s value for familiar within community transitions. We then pooled the data from all paradigms and estimated the correlation between the data and the models’ predictions using 5000 bootstrap re-sampling occurrences. The p-values were estimated by counting the percentage of bootstrap occurrences correlating more with one model than with another. All the bootstrap occurrences and their correlations with each pair of models are presented in Figure 4B. Each dot represents one bootstrap occurrence. The distribution of these dots below and above the diagonal indicates the comparison between two models, and the scatterplot’s shape shows the correlation, independence, or anti-correlation between them. This main analysis of data and model comparison was also performed on each subgroup of data (first/second session; tones/syllables) and is presented in Figure 4—figure supplement 1. To better differentiate communicability from the other models, we recomputed the same correlation analysis restricted to the conditions where communicability makes qualitatively different predictions (new within vs. familiar between transitions in the sparse and high sparse designs). By doing so, we removed most of the correlation between models and tested only the specific contradictory predictions. We again found that the hitting time, FEMM, and Hebbian models are equivalent and better than the other models (see Figure 4—figure supplement 2).

Conclusion

The results of this study reveal (1) community representation in the auditory domain; (2) the persistence of a biased, subjective representation of TPs after learning; and, most importantly, (3) pruning and completion effects that allow participants to build a parsimonious representation of the underlying network structure. TPs are thus not exactly encoded by the participants but biased in a way that can be predicted by the free energy minimization computation. Importantly, the same model might explain human sensitivity to both local and high-level regularities without the need for a specific model for each task.

More research is needed to characterize how and where such computations take place in the human brain and how this bias varies across individuals and with development. However, Hebbian rules in the cortex and/or the hippocampus are plausible candidates for a biological implementation of this analytical model. Finally, finding appropriate metrics to cluster graphs is an active research topic in applied mathematics (Newman, 2006). We therefore believe that understanding the cognitive processes at stake when humans are exposed to such structured networks might provide insight into cognitively and biologically plausible computations.

Data availability

All data and analyses are publicly available at https://osf.io/e8u7f/.

The following data sets were generated
    1. Fló A
    2. Al Roumi F
    3. Dehaene-Lambertz G
    4. Benjamin L
    (2022) Open Science Framework
    Data and Analysis for "Humans parsimoniously represent auditory sequences by pruning and completing the underlying network structure".
    https://doi.org/10.17605/OSF.IO/E8U7F

References

    1. Simon HA
    (1962)
    The architecture of complexity
    Proceedings of the American Philosophical Society 106:467–482.

Decision letter

  1. Floris P de Lange
    Senior and Reviewing Editor; Donders Institute for Brain, Cognition and Behaviour, Netherlands
  2. Cameron Ellis
    Reviewer; Haskins Laboratories, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting the paper "Humans parsimoniously represent auditory sequences by pruning and completing the underlying network structure" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by a Senior Editor. The following individual involved in the review of your submission has agreed to reveal their identity: Cameron Ellis (Reviewer #3).

Comments to the Authors:

We are sorry to say that, after consultation with the reviewers, we have decided that this work will not be considered further for publication by eLife. Despite excitement about the approach and many strengths of the work outlined by the reviewers below, we felt that the confound in the parsing results (as explained below) was problematic, and there were several other analysis and central framing concerns that led us to the conclusion that this paper would not cross the high bar for publication at eLife.

Reviewer #1 (Recommendations for the authors):

In this paper, participants are exposed to auditory sequences generated by graphs with community structure. Transitions between community nodes are sometimes left out during exposure, allowing tests of generalization to those unseen transitions. The authors find that participants are sensitive to the structure in general, as well as to the novel within-community transitions, indicating an understanding of the structure that goes beyond the directly-experienced information. They apply several theoretical and neural models to the data and find a range of matches to the empirical results. The best-fitting models are FEMM (Free-Energy Minimization Model) and Hitting Time, and the authors conclude that the mechanisms of those models may underlie the patterns observed in humans.

The observation that participants choose unseen within-community transitions at a high rate is novel and a compelling demonstration that humans do not objectively encode transition probabilities in a stream of sounds. The many implemented and compared models are also a considerable strength of this work. However, I believe there is a confound in the pressing probability results, and I am also concerned that the behavioral data may be too noisy across participants to confidently test between some of the highly correlated models.

1) The pressing probability results (top row, Figure 3) are interpreted as evidence that the participants have learned the community structure and thus can parse the sequences at community boundaries. However, this effect can arise without there having been any learning: Within a community, stimuli are repeated many times before moving to the next community, which should result in stimulus adaptation. At the transition to a new community, stimuli are observed that have not been repeated as many times as recently, so simple adaptation can serve as a strong parsing cue. The paper that introduced this paradigm (Schapiro et al. 2013) included Hamiltonian paths (where every stimulus is visited exactly once) during the parsing task to avoid this confound, but this paper does not include that condition.

2) The authors acknowledge that the behavioral data are quite noisy across participants, requiring a very large sample to detect differences between conditions. Even with the large sample, many of the pairwise comparisons shown in the bottom row of Figure 3 are not significant. This raises concerns about whether a detailed test between correlated models is possible based on these data. My understanding is that the authors pooled data across all participants and designs and then did bootstrap resampling for statistical tests. I am concerned that this procedure is inflating the seeming reliability of small differences in the data, and sacrificing the ability to statistically generalize to the population. This particular dataset does not seem likely to allow reliable model comparison, at least between the top four or five models here, which are highly correlated.

3) I did not follow the reasoning for the argument that Hebbian learning must be cortical instead of hippocampal. There is a long history in the literature of considering Hebbian learning within the hippocampus.

4) I did not understand the design decision to always include a familiar within-community transition in the forced choice trials nor the analysis/display decision to set those options to 50% condition preference.

Reviewer #2 (Recommendations for the authors):

By testing statistical learning in auditory streams generated based on full and sparse community structures, the authors aimed to clarify what types of representations of structure arise. In order to disentangle different accounts regarding the nature of such representations, they contrast learners' preference for sound quadruplets containing within- versus between-community transitions that either were already presented during the stream or were never presented before. Predictions of 7 different models are outlined and correlated with the human forced-choice data. The main result is that learners show a bias in their representation of local transitions, making them sensitive to the high-order structure that characterizes the environment. This result is in line with previous findings in a different behavioral task and with the predictions of models that implement an accuracy-complexity trade-off.

Strengths:

Directly comparing community structures with different levels of sparseness provides a unique way of generating contrasting model predictions for models that generate highly comparable predictions in most learning situations. The results, especially those of the forced-choice task, are compelling.

The number of models that are directly compared is impressive and data visualizations do a very good job getting across the main conclusions for people without a modeling background.

Weaknesses:

The main result provides a conceptual replication of the finding by Lynn et al. (2020) in the visual domain. I do not think that the current work by definition has insufficient novelty, yet how the current findings relate to but also extend this previous work could be further clarified.

There is very little embedding of the current work within the existing literature. To exemplify, the authors write that "Many studies on sequence learning proposed different and not always compatible ad-hoc models to account for their results" (p. 4). This claim does not do justice to the modeling work that has been done in the domains of statistical learning and sequence learning (e.g., SNR, PARSER, TRACX) targeting specific conditions where models do differ in their predictions (e.g., phantom words).

Analyses in the manuscript itself focus only on the second forced-choice test, but it seems that the trajectory of how representations are formed over time (first vs. second forced-choice test) could also be modeled and could be highly informative. Data for the experiments with syllables and tones are collapsed but there seems to be a large difference in the learning trajectory for the two stimulus types (as reflected in figure 7), which currently remains unexplained.

– Authors like Friston might claim that not only learning of structure but also processes like decision-making and action selection can be understood as minimizing expected free energy. What could the finding that the FEMM model explains the current learning data very well say about the overlap between cognitive representations for very different tasks?

– Multiple statements seem in need of references. Some examples:

"… and their potential importance in language acquisition" (p. 3)

"Many studies on sequence learning proposed different and not always compatible ad-hoc models to account for their results." (p. 4)

"… the classical poverty of the stimulus argument" (p. 14)

– P(A|B) and later notations of adjacent and non-adjacent transitional regularities: Unless you specifically refer to backward transitional probabilities P(B|A) is the more intuitive form to denote the transitional probability of sequence AB, i.e., probability of B given that A has been encountered. Positional subscripts as used for Ngrams could also be used to clarify.

Methods

– Some methodological choices are not clearly motivated:

Why are only quadruplets used in the forced-choice task, and not also pairs?

Why is the judgement always with a familiar-within transition rather than contrasting the other conditions directly as well (e.g. familiar-between vs. novel-between or new-within vs. new-between)?

– What were the instructions participants received before performing the forced-choice task? Relatedly, how might the fact that there were two separate forced-choice tasks, with more active listening in between (now potentially with more awareness), have affected the results?

– "The press bar task during attentive listening showed high sensitivity, but it only allowed to test within vs between community transitions during learning and thus assess for pruning effect and clustering." (p. 15). Whereas I follow how these data are informative about clustering I am not clear on how they assess pruning.

Results

– Figure 3: the grey bar presents chance, but would it not make more sense to plot actual preference for familiar within-community?

Are results for the press task collapsed over the two blocks?

– For the analysis of key presses:

Is this test a significant difference in the difference scores (familiar-within vs. familiar-between)? Would a nonparametric cluster-based test not be a better option?

Parsing probability peaks after 1000 ms. Given that individual auditory stimuli last 250 ms, is it fair to say participants are sensitive to switching between communities (which might suggest they detect the between-community transition), or rather do they detect that they are in a new community after hearing several stimuli of the new community?

– P. 8 "In contrast, the New Within Community transitions were never rejected", unless I misunderstand the preference measure this should be "were rejected at chance", for half of the trials people prefer familiar within-transitions, the other half of these (no preference).

– p. 8 "No differences were found between the experiments using tones and syllables. Thus, data were merged in the following analyses." This should be supported by including basic results, preferably separately for the first and second forced-choice tests. (for example in the supplementary materials).

– One reasonable explanation for slower learning with syllables could be the prior knowledge individuals have about the structure of language (i.e., "linguistic entrenchment").

Reviewer #3 (Recommendations for the authors):

Benjamin and colleagues present a compellingly designed study to address a question currently interesting to the learning/memory community: how do we extract sophisticated structure from statistically regular input? I think the design is elegant, albeit similar to visual analogs. The sample size is high and the analyses are mostly sound. The biggest strength of the analyses is the breadth with which they surveyed different viable models and the surprisingly high model fits they achieved. I raise a few concerns that I believe the authors can likely address.

1. The nature of the forced choice model comparisons

The way that the authors compared their forced-choice data to the model predictions is central to their paper, but two fundamental ambiguities need to be resolved.

Firstly, the authors state that they pool the data across the experiment conditions. Does this mean concatenating the bootstrap average choices per choice lure and experiment condition, and then comparing those with the model? If so, state this explicitly.

Secondly, and more importantly, is the 'Familiar Within-Community' condition included in that correlation? Due to the nature of the forced choice the authors performed, this condition is always one minus the average for the lure condition. My understanding is that the authors choose to peg this value to 50% because it isn't clear what they should do otherwise. For instance, this could be the average of the three lure conditions, but those data points are not independent.

I think including this pegged value in the model comparisons is unfair because 1) this pegged value is arbitrary and not real data, and 2) this unfairly hurts Models A, B, C, and F, which predict that there will be a change in this condition under different levels of sparsity.

2. First vs. Second test epoch

I was surprised to read Figure 7 in the supplement. Until this point in the paper, I believed that the authors used both testing epochs for their analyses and that there were no differences between the tones and syllables conditions. In fact, the authors stated: "No differences were found between the experiments using tones and syllables. Thus, data were merged in the following analyses."

However, in the methods section and the supplement, they show that this was not the case. Figure 7 shows substantial and meaningful differences between the conditions in the first half but the results reported in the main text are only from the second half of the testing. It seems ad hoc to only include the last epoch of testing because 1) it seems they used both epochs for the press task, and 2) if the authors thought that 4 mins of passive listening weren't enough to max out learning, then they should have made the passive listening longer.

Compounding this, it is currently hard to interpret the difference between syllables vs. tones and first vs. second training epoch because no behavioral data is reported for these (akin to Figure 3).

I think learning effects are interesting and should be discussed. It is especially interesting that the syllables condition is so radically different. Syllables are the more typical stimulus used in auditory statistical learning tasks. In fact, tones typically lead to worse statistical learning (Schapiro, et al., 2014 is one example that comes to mind). Hence I think it is worth considering why there is such a drastic learning effect.

3. Conclusions about the brain from model comparisons

I believe that the authors overstate the brain-based conclusions that can be drawn from the model comparisons they perform. Model G uses associative learning to model the participant's choices. This is described as akin to cortical Hebbian learning, in contrast to the hippocampal learning in Model F. I think this juxtaposition is overstated given the nature of the modeling performed.

Modeling behavior can help elucidate the brain basis of phenomena, but I think this is difficult and ought to be done carefully. In this case, Hebbian learning is ubiquitous in the brain (and simple nervous systems). This lack of specificity means it cannot be easily used to discriminate the brain locus of community structure computation. Perhaps there is some reason to think that the exponential decay the authors include in Model G is specific to the cortex, but I am unaware of such a reason.

4. Compression

I am not sure about the relevance of compression to the work presented here. To start, the authors state in their abstract that "the brain does not rely on exact memories but compressed representations of the world". However, I am not sure how the results of this study contribute to our understanding of compression in memory. If the authors want to say that information is lost during memory encoding, I think that is an unnecessary point to make since it is uniformly agreed on. If the authors want to bring in concepts from information theory about compression, I think they need to be more precise since the type of compression they presumably mean is lossy compression (i.e., information is lost), but compression can be lossless (i.e., information is retained but reformatted). In fact, based on how the authors use this term elsewhere, I think chunking might be a better concept to refer to than compression.

In figure 5, the authors suggest that the participants might only learn and retain two chunks. I think this speculation is provocative, but I could not see any evidence in this paper to support this claim. This is peculiar because the authors could have treated this hypothesis as one of the models that they used to try to account for the participants' data. If they did that, I suspect this model would not excel because it assumes that participants treat the edge nodes the same as the non-edge nodes. From multiple studies published using this community structure design, edge nodes are treated differently. Hence, if I were to speculate on a 'compressed representation' that the participants might have, I would assume they have four chunks: a pair of edge nodes and non-edge nodes for each cluster.

Miscellaneous

5. Could the authors report how many of each transition the participants are exposed to during the passive learning and press task? It seems like there are a lot of possible transitions, so it is possible that even in the fully connected condition, their input is still sparse.

6. The authors have two exclusion criteria for the different tests (button presses and choices), which result in different sample sizes. However, if they failed the forced-choice attention check, I don't think they should be included in the key presses and vice versa. Instead, I believe the exclusions should be the union of these two exclusion criteria.

7. Why were twice as many participants excluded from the press task in the fully connected condition compared to the other conditions? Perhaps participants really did think there were fewer or more events? To address this, the authors could report the histograms for each condition before exclusions to see if those differences exist.

8. Why is RT so delayed for the press task? I expect that participants should only take 500-750ms to respond, which suggests that perhaps participants are responding to something after the time locking is done here. What offset is the time locking to? Based on the data, I would guess it is time-locked to the offset of the edge node before the transition rather than the offset of the edge node after the transition.

9. The authors state that they did Bonferroni correction to test whether the likelihood of participants pressing a key after a transition has diverged from chance. Does this mean that every time point (millisecond?) is used as an independent sample in the Bonferroni correction? This doesn't seem plausible with the data shown in Figure 3 and is unnecessarily conservative.

10. What is the noise ceiling for the model fit? There are lots of ways to find this, but one would be to split the behavioral data in half and see how correlated they are. I ask because I suspect you are close to that ceiling, which is impressive.

Regarding point 1:

I think that the authors should exclude the "Familiar Between Condition" values from the model comparison. I firmly believe that removing this condition will make the model comparisons fairer and may change the conclusions.

Regarding point 2:

The authors should more fully discuss the results of Figure 7 in the main text. They may also want to show the learning trajectory. Furthermore, I think they must report the forced choice data for the first testing epoch for syllables and tones separately, so it is clear what is driving the differences shown in Figure 7.

Regarding point 3:

I recommend drastically reducing the amount of commentary that compares the likely brain bases of the community structure computations. In particular, I think Model G can just be referred to as a biologically plausible alternative model to Model F.

Regarding point 4:

I recommend cutting all commentary about compression in the paper, but I am interested to hear justification otherwise.

If the authors wish to keep some of the compression content in, I think the authors could test my conjecture raised in point 4 by looking at the different nodes used in the 'New Between Community' lure trials. In particular, is there a difference when the 'New Between Community' is between an edge node and a non-edge node compared to when it is just between two non-edge nodes (vs. two edge nodes, i.e., 'Familiar Between Community')? I suspect there will be.

Miscellaneous recommendations:

1. Mention in the main text that these were online studies.

2. Ali Preston's work on transitivity is relevant to several points raised here, so she should be cited.

3. I think it would be helpful to explain the logic of Model F more fully. For instance, it would be beneficial to say what the CA1 layer is doing in the model and why you chose to look at it in particular (i.e., the MSP).

4. It would be helpful to expand on the differences between hitting time and communicability since they seem like they would produce predictions that are more similar to what you report.

5. Please report the ISI of the stimuli. They are 250ms long, but I didn't see any mention of an ISI, so I assume the items are back-to-back.

6. Good job releasing the data, although because it was uploaded as a zip, I couldn't download it without my client crashing. Consider breaking up the zip into smaller files.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Humans parsimoniously represent auditory sequences by pruning and completing the underlying network structure" for further consideration by eLife. Your revised article has been evaluated by Floris de Lange (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1) Analysis of learning dynamics

Reviewer 1 points out that "they do not analyze learning dynamics. They report behavior from the press task aggregated across all exposure. Only the second testing epoch is used because participants "have not learned and thus have not stabilized behavior" in the first epoch. This presents a missed opportunity to study learning dynamics and evaluate how well the models fit those dynamics – an important piece of evidence for model fit."

2) Make explicit assumptions/limitations

Both reviewers ask you to make more explicit some assumptions/limitations in several instances.

3) Reviewer 2 suggests that "an additional experiment designed to disentangle those could further strengthen the impact of this work". Please treat this as a suggestion for future research, rather than a requirement for the revised version of this manuscript.

Reviewer #1 (Recommendations for the authors):

The authors did a good job of addressing my major concerns; however, in doing so they reinforced, in my opinion, a concern that reviewer 2 raised explicitly and I alluded to in my first round of review. Namely, are the authors capturing the dynamics of sequence learning, and, if not, what does that mean for their claims? Below I summarize my concern and then list a few minor issues that remain.

Throughout the paper and their response, the authors state that they are interested in understanding the dynamics of sequence learning and how hierarchical structure is acquired more generally. One of the most tantalizing outputs of this work is the possibility that they have a "general model of sequence learning". The task was explicitly designed to "have intermediate points to check the learning status" so they could study the dynamic aspects of learning, rather than examine it as a snapshot. Moreover, the FEMM is described as a normative account of how learning should unfold in order to quickly generalize to an appropriate hierarchical structure.

However, I believe there are a few ways the authors fail to meet these goals.

First, they do not analyze learning dynamics. They report behavior from the press task aggregated across all exposure. Only the second testing epoch is used because participants "have not learned and thus have not stabilized behavior" in the first epoch. This presents a missed opportunity to study learning dynamics and evaluate how well the models fit those dynamics – an important piece of evidence for model fit.

As an aside, what evidence is there that the participants haven't learned? In both tones and syllable conditions, they show reliable evidence of preference (according to Figure S2) in the first pressing epoch, which indicates learning. In the press task there is evidence of learning in syllables and tones for the first epoch (according to the figure used to respond to Reviewer 2). As stated previously, it seems ad hoc not to include the first epoch in the analyses when the experiment was designed to include it.

Second, the authors argue that FEMM is an optimal way to overgeneralize to learn quickly. If so, it seems that FEMM should be excellent at approximating behavior when participants are initially learning. This is not the case for the syllable condition, where the authors argue more learning has to occur to overcome pre-existing biases.

Third, in the authors' new and improved discussion about compression, they state that there is a stage after what is being tested here in which compressed representations form. At this stage, the FEMM would no longer apply, and so it is not a "general model of sequence learning".

I think the authors' argument is that FEMM and their other high-performing models capture an initial stage of learning and that some other models (e.g. compression) will explain subsequent representations later in learning. If this is true, they ought to clarify this explicitly by stating that their paper is a snapshot in the dynamics of learning. In doing so, they would need to revise their section "A general model of sequence learning", where they imply that the FEMM captures the entire learning process, instead of a portion. Moreover, they should be clear about what portion of the learning process they think FEMM accounts for: their normative claims about the value of overgeneralization suggest early stages of learning, but their exclusion of the first epoch suggests they are not testing early learning. Finally, I think the authors should still shorten their description of the compression section since its presence in the discussion implies that the data in the paper supported this hypothesis.

Reviewer #2 (Recommendations for the authors):

In general, the authors produced a stronger manuscript that addresses many of the concerns that were raised. The analysis (figure S2) of the syllable and tone tasks separately reassures me that, at least for Session 2, a similar behavioral pattern and model comparison result is observed for both tasks. Methodological choices are better motivated, and the adjusted visualization of the AFC results is a lot clearer. Finally, the split between theoretical models and possible neural implementation, in addition to the more careful brain-based conclusion, is also an improvement.

The novelty of the paper is now communicated more clearly, yet I do still share the concern that noisy behavioral data from a single type of learning measure (AFC) limit the possibility for strong conclusions based on model comparison. For many reviewer suggestions, the authors argue that their experiment was not designed to look at those comparisons. Especially for the highly correlated models that do a good job explaining the current AFC data (i.e. Hitting time, FEMM, Hebbian), I am left thinking that an additional experiment designed to disentangle those could further strengthen the impact of this work.

Currently, the AFC judgment is always one against a familiar-within transition "because all models postulate their correctness", but could such different contrasts not help to disentangle correlated models?

In their response to Reviewer 3, the authors mention that the difference between transitions between edge nodes and transitions between non-edge nodes is a prediction made by the FEMM. If this is a unique prediction of FEMM this seems worth testing.

Methodological remarks section (p. 20-21):

– Whereas I found the arguments against the use of a Hamiltonian path convincing, I do think it would be good to explicitly acknowledge the potential confound outlined by reviewer 1 and the reason why you believe simple adaptation is not what is going on in the pressing probability results.

– “However, this second metric has a low sensitivity as only a few trials can be collected resulting in data variability that was compensated by a very large sample of participants (N=727).” Data variability seems strange phrasing, as true score variance is not a problem; maybe what is meant is high error variance?

Regarding prior knowledge of syllable sequences:

– The authors use both the phrasing "priors on syllable sequences" and "a prioris that syllables are ordered in words" (p. 31), I think the latter is a lot less precise and can better be avoided. The authors also write "This a priori does not exist with tones", but is that true given our exposure to music?

– The idea that participants have prior knowledge of syllable sequences affecting their ability to learn about new transitions between syllables is introduced without references, suggesting that it might be a new idea, whereas there is literature on this: see for example Siegelman et al. (2018)'s paper titled Linguistic entrenchment: Prior knowledge impacts statistical learning performance.

– The Participants section states that participants were recruited via social media but mentions nothing about their language background. The linguistic stimuli were French diphones. So far I assumed the experiment language was French, but maybe that was not the case?

https://doi.org/10.7554/eLife.86430.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

Reviewer #1 (Recommendations for the authors):

In this paper, participants are exposed to auditory sequences generated by graphs with community structure. Transitions between community nodes are sometimes left out during exposure, allowing tests of generalization to those unseen transitions. The authors find that participants are sensitive to the structure in general, as well as to the novel within-community transitions, indicating an understanding of the structure that goes beyond the directly-experienced information. They apply several theoretical and neural models to the data and find a range of matches to the empirical results. The best-fitting models are FEMM (Free-Energy Minimization Model) and Hitting Time, and the authors conclude that the mechanisms of those models may underlie the patterns observed in humans.

The observation that participants choose unseen within-community transitions at a high rate is novel and a compelling demonstration that humans do not objectively encode transition probabilities in a stream of sounds. The many implemented and compared models are also a considerable strength of this work. However, I believe there is a confound in the pressing probability results, and I am also concerned that the behavioral data may be too noisy across participants to confidently test between some of the highly correlated models.

1) The pressing probability results (top row, Figure 3) are interpreted as evidence that the participants have learned the community structure and thus can parse the sequences at community boundaries. However, this effect can arise without there having been any learning: Within a community, stimuli are repeated many times before moving to the next community, which should result in stimulus adaptation. At the transition to a new community, stimuli are observed that have not been repeated as many times as recently, so simple adaptation can serve as a strong parsing cue. The paper that introduced this paradigm (Schapiro et al. 2013) included Hamiltonian paths (where every stimulus is visited exactly once) during the parsing task to avoid this confound, but this paper does not include that condition.

Indeed, among the three possible ways to traverse the graph (a random walk, a Hamiltonian walk, in which each node is visited once, and an Eulerian walk, in which each edge is crossed once), we chose to present participants with random walks. Although Schapiro et al. (2013) showed that this choice had a limited effect on participants’ learning (a lack of effect replicated by Lynn et al., 2020), we agree that this choice needs better justification and discussion:

1) The Hamiltonian path introduces an increased predictability of the sequence, since stimuli already presented can no longer be presented. Learning the graph can then be fully explained by n-gram approaches, which cannot disentangle the different learning models proposed. Indeed, as a node can be visited only once per passage through a community, the predictability of the next element increases as the traversal goes on, up to perfect predictability for the last element and for the transition to the other community (after the 5th element of a community has been visited, the last one is perfectly predictable, as is the change of community). This would lead to periodic n-gram predictions (cf. the model presented in Author response image 1), whereas a random walk keeps the predictions flat. Only a random walk therefore allows rigorously disentangling a local TP (or n-gram) computation from higher-order computations such as the FEMM or hitting time.

Author response image 1

2) The number of different Hamiltonian paths compatible with the structure decreases drastically with sparsity, and the very sparse community has only one possible Hamiltonian path given the input (each step is imposed by the previous one), so the stream would become a trivial looped repetition of a twelve-tone pattern. An example is presented in Author response image 2. While the consequences of a Hamiltonian walk are less dramatic in the sparse design, the limited number of paths would still induce a large number of pattern repetitions, which are highly salient to humans (Barascud et al., 2016; Southwell et al., 2018).

Author response image 2
Example of the only Hamiltonian walk compatible with a high sparse community design.

Given the first item, the full sequence becomes completely deterministic and repeats with loops of 12 elements.

3) Finally, Hamiltonian paths would lead to rhythmic switches between communities that could be used as a segmentation cue and bias the learning.

Concerning the random walk, it is true that, on average, tones belonging to the same community are presented closer in time than tones belonging to different communities. However, the length of the walk within one community can be short, without repetition or without going through all the tones of the community, or longer, with some tones repeated at random distances. There is therefore no temporal consistency that would allow capturing a repetition pattern. Note also that the absolute frequency of each tone is equal within the stream, avoiding long-term habituation effects, and that the local transition probabilities are flat, preventing prediction of the next tone. Finally, the tone frequencies were distributed between the two communities, preventing a separation based on an auditory spectral partition.

“Stimulus (i.e. sensory) adaptation”, or repetition suppression in electrophysiological recordings, is observed for immediate repetitions of the same stimulus, which do not occur in our stream. Some adaptation could nevertheless occur because of stimulus prediction (Todorovic et al., 2012), but this would not be a confound: it would be the actual phenomenon explaining the learning of the structure, given that no prediction is possible from TPs alone. These two types of adaptation notably differ in their timing; we are currently running MEG with this paradigm to better investigate these questions.

In any case, we would like to emphasize that the adaptation concern raised by the reviewer does not affect the two-forced-choice analysis (because elements were presented in isolation), which is the main analysis and novelty of our work. The pressing task was only used to keep participants attentive during the stream and to allow comparison with the previous literature in which this task was used (Schapiro et al., 2013). The previous framing of the paper might have been ambiguous, and we have reframed this part.

We have now included all these elements in the Material and methods / Stimuli paragraph.

2) The authors acknowledge that the behavioral data are quite noisy across participants, requiring a very large sample to detect differences between conditions. Even with the large sample, many of the pairwise comparisons shown in the bottom row of Figure 3 are not significant. This raises concerns about whether a detailed test between correlated models is possible based on these data. My understanding is that the authors pooled data across all participants and designs and then did bootstrap resampling for statistical tests. I am concerned that this procedure is inflating the seeming reliability of small differences in the data, and sacrificing the ability to statistically generalize to the population. This particular dataset does not seem likely to allow reliable model comparison, at least between the top four or five models here, which are highly correlated.

The two-forced-choice data are indeed noisy, for two main reasons: the difficulty of the task and the few trials presented to each subject, to keep the experiment relatively short and compatible with online data collection. However, most pairwise comparisons for which the FEMM predicts a large difference are in fact significant. The non-significant comparisons mostly concern conditions for which the FEMM predicts no or marginal differences.

Among all the models from the literature that we wanted to compare, four are highly correlated: two theoretical models (hitting time, FEMM) and two neural implementations (Hebbian, CA1). We agree that it does not make much sense to compare theoretical models and neural implementations directly, as they are not mutually exclusive and do not sit at the same Marr level (computational theory vs. hardware implementation). Therefore, we split the theoretical models and the neural implementations into two subplots in Figure 4.

We did not postulate a difference between hitting time and FEMM because these two models describe a similar property of the graph (distance between nodes) with two different approaches and are essentially equivalent (>99% correlation between the two models). We have rewritten the paragraph to make this point clearer.

Concerning the neural implementation models, Hebbian modeling and CA1, they are very similar for most conditions.

However, they differ qualitatively and substantially in their predictions regarding new within and familiar between transitions in the high sparse design, for which the CA1 model is clearly at odds with our data. Furthermore, when we estimated the correlations with bootstrapping, both models ranked relatively high, but the Hebbian approach was systematically higher (small effect size but high significance).

Regarding communicability vs. hitting time and FEMM, we believe that the claim that hitting time and FEMM are both better models than communicability is fair given our data. First, the correlation is largely and significantly stronger for these two models than for communicability. Moreover, communicability makes predictions opposite to what we observed in the data: according to communicability, new within transitions should be accepted significantly less than familiar between transitions, whereas we observed the contrary. Therefore, communicability is not a valid model of human behavior in this task.

To be more precise in our assumptions when testing the models, we redid the correlation analysis but limited it to the conditions for which the four correlated models and communicability make qualitatively different predictions (new within vs. familiar between in the sparse and high sparse communities). By doing so, we removed most of the correlation between models and tested only the specific contradictory predictions. We again found that the hitting time, FEMM, and Hebbian models are equivalent and better than the other models (see Figure 4—figure supplement 1).

To conclude, although these models are indeed generally correlated, they make different predictions for some of the tested conditions, allowing us to disentangle them, thanks notably to our large sample, the testing of three different paradigms (full, sparse, and high sparse), and the careful bootstrap comparisons resampling subjects within a large pool. Because our goal was to compare all the models proposed in the literature, what matters is how they account for the relative pattern of performance across conditions rather than the effect size per se.

3) I did not follow the reasoning for the argument that Hebbian learning must be cortical instead of hippocampal. There is a long history in the literature of considering Hebbian learning within the hippocampus.

The discussion of this part was badly framed. We wanted to propose that (contrary to the hippocampus-only suggestion) Hebbian learning could be either cortical or hippocampal, not to claim that it must be cortical. We have reframed the argument.

4) I did not understand the design decision to always include a familiar within-community transition in the forced choice trials nor the analysis/display decision to set those options to 50% condition preference.

Our goal was to measure familiarity with each type of transition. Because learning might remain implicit for most participants, we were afraid that a familiarity ranking would not be sensitive enough. Therefore, we chose a two-forced-choice task and assessed familiarity relative to a reference. The familiar within condition is ideal as a reference because, first, it has many possible transitions and, second, these transitions are locally correct and belong to the structure. Therefore, all other conditions were contrasted with this one.

The Figure 3 plots were misleading because the first column did not represent data but the chance level. This was done for congruence with the plots showing the models. We rewrote this part and redid the plots to be clearer.

Reviewer #2 (Recommendations for the authors):

By testing statistical learning in auditory streams generated based on full and sparse community structures, the authors aimed to clarify what types of representations of structure arise. In order to disentangle different accounts regarding the nature of such representations, they contrast learners' preference for sound quadruplets containing within- versus between-community transitions that either were already presented during the stream or were never presented before. Predictions of 7 different models are outlined and correlated with the human forced-choice data. The main result is that learners show a bias in their representation of local transitions, making them sensitive to the high-order structure that characterizes the environment. This result is in line with previous findings in a different behavioral task and with the predictions of models that implement an accuracy-complexity trade-off.

Strengths:

Directly comparing community structures with different levels of sparseness provides a unique way of generating contrasting model predictions for models that generate highly comparable predictions in most learning situations. The results, especially those of the forced-choice task, are compelling.

The number of models that are directly compared is impressive and data visualizations do a very good job getting across the main conclusions for people without a modeling background.

Weaknesses:

The main result provides a conceptual replication of the finding by Lynn et al. (2020) in the visual domain. I do not think that the current work by definition has insufficient novelty, yet how the current findings relate to but also extend this previous work could be further clarified.

We thank the reviewer for the kind comments. The study by Lynn and colleagues nicely showed that community learning was compatible with a new model they proposed. Here, our goal was different: we wanted to disentangle the many different models of sequence learning, which is why we developed this sparse design. We believe that the current study greatly extends previous findings as:

  • It shows the completion and generalization of missing data

  • It disentangles between many different sequence learning models.

  • It strengthens the bridge between two literatures that are often independently considered : sequence and network learning.

  • It shows that the results are preserved with very fast sequence presentation, leaving less time for explicit decision making than in the original paradigms.

  • It extends the previous results to the auditory domain (the replication of Lynn et al.’s study concerned only the first paradigm).

We made the goal and novelty of the study clearer in the text.

There is very little embedding of the current work within the existing literature. To exemplify, the authors write that "Many studies on sequence learning proposed different and not always compatible ad-hoc models to account for their results" (p. 4). This claim does not do justice to the modeling work that has been done in the domains of statistical learning and sequence learning (e.g., SNR, PARSER, TRACX) targeting specific conditions where models do differ in their predictions (e.g., phantom words).

We acknowledge that this sentence was not well formulated and have rephrased it. We also added a full paragraph to better explain the relationship between this work and previous modeling approaches such as PARSER or TRACX. However, we want to point out that those models focus on how chunks are extracted from a stream, whereas here we only study how the familiarity of transitions is perceived. The two processes are not equivalent, as we showed in Benjamin et al. (2022). Thus, the approach is different, which explains why we did not add those models to the comparisons. Furthermore, this type of model, based on chunk recognition, should by construction reject the new within transition generalizations, which our participants accepted. We added comments on these models in the main text.

Analyses in the manuscript itself focus only on the second forced-choice test, but it seems that the trajectory of how representations are formed over time (first vs. second forced-choice test) could also be modeled and could be highly informative. Data for the experiments with syllables and tones are collapsed but there seems to be a large difference in the learning trajectory for the two stimulus types (as reflected in figure 7), which currently remains unexplained.

Indeed, we focused only on the second forced-choice test to build on maximal learning performance, because the models predict familiarity only after learning. The models even assume learning from an infinite amount of data; however, a sufficiently long learning phase can be considered to converge to this asymptotic state.

Tones and syllables indeed became similar only at the second test session, and the learning trajectory seems different for the two types of stimuli. We postulate that, because syllables are the building blocks of language, participants have stronger priors about syllable sequences than about tone sequences. For example, flat transition probabilities violate language structure, and this was the main argument for proposing statistical computations as a crucial mechanism for infants to acquire language, notably to build their lexicon (Saffran et al., 1996). Subjects' performance on the first test after the syllable streams was not consistent with any of the models (nor between paradigms), contrary to the other three data points, which are also consistent with one another. Thus, the task is probably harder with syllables than with tones, due to the adjustment that needs to be made between the priors and the real structure of the stream.

We did not study the tone-syllable difference in this paper because we believe that the data and design are not well suited for it: we would have to design a specific experiment to model the resolution of the conflict between priors and structure that leads to learning.

We used syllables because the team works on language acquisition, and aspects of this experiment were intended for use with neonates, whose performance is generally better with speech than with non-speech stimuli.

– Authors like Friston might claim that not only learning of structure but also processes like decision-making and action selection can be understood as minimizing expected free energy. What could the finding that the FEMM model explains the current learning data very well say about the overlap between cognitive representations for very different tasks?

We thank the reviewer for this comment; however, we have no data to support or refute any claim in that direction. We therefore believe that this point goes beyond the focus of our paper. Nonetheless, we think that sharing a common principle or mechanism does not imply that the representations of these domains necessarily overlap.

– Multiple statements seem in need of references. Some examples:

"… and their potential importance in language acquisition" (p. 3)

"Many studies on sequence learning proposed different and not always compatible ad-hoc models to account for their results." (p. 4)

"… the classical poverty of the stimulus argument" (p. 14)

We added references.

– P(A|B) and later notations of adjacent and non-adjacent transitional regularities: Unless you specifically refer to backward transitional probabilities, P(B|A) is the more intuitive form to denote the transitional probability of the sequence AB, i.e., the probability of B given that A has been encountered. Positional subscripts as used for Ngrams could also be used to clarify.

We have changed our notation and used positional subscripts for more clarity and consistency in the text.

Methods

– Some methodological choices are not clearly motivated:

Why are only quadruplets used in the forced-choice task, and not also pairs?

This design choice was made for consistency with other studies of the PhD project (L.B.), especially with the idea of comparing latencies and ERPs in future electrophysiology work. We added a sentence to inform the reader.

Why is the judgement always with a familiar-within transition rather than contrasting the other conditions directly as well (e.g. familiar-between vs. novel-between or new-within vs. new-between)?

We used the familiar within-community transitions as the reference because all models postulate their correctness: they are both locally and globally congruent. Furthermore, there are many different usable transitions in this condition. We have added details in the text. See also answer 4 to R1.

– What were the instructions participants received before performing the forced-choice task? Relatedly, how might the fact that there were two separate forced-choice tasks, with more active listening in between (now potentially with more awareness), have affected the results?

They were only asked to choose which of the two quadruplets was most likely to belong to the artificial sequence they had previously heard.

The effect of awareness and of more active learning is hard to test in such a task. For the tone group, it seems to have changed nothing. For the syllable group, it is hard to disentangle the effect of longer exposure from a change in participants' priors about syllable sequences (see response above).

Remember that the data were acquired online and that it is impossible to verify participants’ attention to the stimuli when long unanswered minutes elapse. We therefore preferred two shorter streams to a single long one.

A passive MEG experiment (with no instruction and no task) is currently being run to address these kinds of questions.

– "The press bar task during attentive listening showed high sensitivity, but it only allowed to test within vs between community transitions during learning and thus assess for pruning effect and clustering." (p. 15). Whereas I follow how these data are informative about clustering I am not clear on how they assess pruning.

It is true that these data only assess a difference between familiar within and familiar between transitions but do not inform on the relative familiarity of each (familiar between < familiar within), which is what we referred to as pruning in the text. We changed this sentence to avoid any confusion.

Results

– Figure 3: the grey bar presents chance, but would it not make more sense to plot actual preference for familiar within-community?

The grey bar represented chance, and all the other bars represented subjects' familiarity with the tested condition relative to the reference (familiar within transitions). There was no trial in which familiar within transitions were contrasted with themselves (see response to R1).

We changed the figure because it was confusing.

Are results for the press task collapsed over the two blocks?

Yes, we collapsed both blocks (we added a sentence in the methods to mention it). For completeness, we report in Author response image 3 the results for the two sessions and the two groups (tones and syllables) separately:

Author response image 3

– For the analysis of key presses:

Is this test a significant difference in the difference scores (familiar-within vs. familiar-between)? Would a nonparametric cluster-based test not be a better option?

We compared the average key-press probability after familiar-between transitions (corresponding to 4 transitions) with each combination of 4 familiar-within transitions (each purple line visible on the graph), making the comparison exhaustive.

Parsing probability peaks after 1000 ms. Given that individual auditory stimuli last 250 ms, is it fair to say participants are sensitive to switching between communities (which might suggest they detect the between-community transition), or rather do they detect that they are in a new community after hearing several stimuli of the new community?

Indeed, it is difficult to tell from behavior alone in such a fast design. However, a previous slow design (Schapiro et al., 2013) showed that participants were sensitive to the transition itself and did not need multiple items before pressing the key to signal their detection of a change of community. The 2-forced-choice task also provided very limited context and yet showed a difference between within and between transitions. We are currently conducting an MEG experiment to better understand the time course of the effect (we do find a significant effect ~120 ms after the community change, arguing for sensitivity to the transition rather than for evidence accumulation).

– P. 8 "In contrast, the New Within Community transitions were never rejected", unless I misunderstand the preference measure this should be "were rejected at chance", for half of the trials people prefer familiar within-transitions, the other half of these (no preference).

Indeed, we corrected the sentence.

– p. 8 "No differences were found between the experiments using tones and syllables. Thus, data were merged in the following analyses." This should be supported by including basic results, preferably separately for the first and second forced-choice tests. (for example in the supplementary materials).

All the results from the 2-forced-choice task are now reported in the supplementary results. We modified the sentence of the main text presenting the statistical analyses: “We report the results at the end of the learning (second block). Because no difference was found between the groups using tones and syllables (unpaired t-test for each condition, all ps>0.2), the data of the tone and syllable groups were merged in the following analyses”.

– One reasonable explanation for slower learning with syllables could be the prior knowledge individuals have about the structure of language (i.e., "linguistic entrenchment").

Indeed, we believe that this is what happened (see above), and we included this hypothesis in the text.

Reviewer #3 (Recommendations for the authors):

Benjamin and colleagues present a compellingly designed study to address a question currently interesting to the learning/memory community: how do we extract sophisticated structure from statistically regular input? I think the design is elegant, albeit similar to visual analogs. The sample size is high and the analyses are mostly sound. The biggest strength of the analyses is the breadth with which they surveyed different viable models and the surprisingly high model fits they achieved. I raise a few concerns that I believe the authors can likely address.

1. The nature of the forced choice model comparisons

The way that the authors compared their forced-choice data to the model predictions is central to their paper, but two fundamental ambiguities need to be resolved.

Firstly, the authors state that they pool the data across the experiment conditions. Does this mean concatenating the bootstrap average choices per choice lure and experiment condition, and then comparing those with the model? If so, state this explicitly.

Yes, that is what we did, with one bootstrap sampling per paradigm. Because different participants were tested in the three paradigms, we computed the average performance of one bootstrap resampling per paradigm and concatenated these averages for comparison with the predictions of the different models.
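For concreteness, a minimal sketch of this resampling scheme (hypothetical variable names and data layout, not the authors' actual analysis code):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_concatenated_means(choices_by_paradigm, n_boot=10000):
    """One bootstrap resampling of participants per paradigm; the
    per-condition averages of the three paradigms are concatenated
    into a single vector to be correlated with model predictions.

    choices_by_paradigm: list of arrays (n_participants, n_conditions),
    one array per paradigm (participants differ across paradigms).
    Returns an array of shape (n_boot, 3 * n_conditions).
    """
    boots = []
    for _ in range(n_boot):
        vec = []
        for data in choices_by_paradigm:
            idx = rng.integers(0, len(data), size=len(data))  # resample with replacement
            vec.append(data[idx].mean(axis=0))                # average per condition
        boots.append(np.concatenate(vec))
    return np.asarray(boots)
```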

Secondly, and more importantly, is the 'Familiar Within-Community' condition included in that correlation? Due to the nature of the forced choice the authors performed, this condition is always one minus the average for the lure condition. My understanding is that the authors choose to peg this value to 50% because it isn't clear what they should do otherwise. For instance, this could be the average of the three lure conditions, but those data points are not independent.

This part was unclear in the original paper. Our goal was to estimate the familiarity of each condition, and we used a 2-forced-choice task between each condition and a reference. We chose a forced-choice test because we thought it would be more sensitive than a familiarity ranking. We used the Familiar Within-Community transitions as the reference because all models postulate their correctness: they are both locally and globally congruent. Furthermore, there are many different usable transitions in this condition. This implies that every forced-choice test contrasted one sequence composed only of Familiar Within-Community transitions with one sequence containing a middle transition belonging to another type. While we did not have a test comparing two different Familiar Within-Community sequences, this comparison can be assumed to be 50% by design. The other conditions can range from 0% (participants always preferred the reference) to 100% (participants always preferred the tested condition), with 50% implying no preference.

One minus the average of the other conditions would give the actual number of times participants preferred the reference in the task, but not an estimated familiarity with this condition. We changed the figure to better reflect what we did.

I think including this pegged value in the model comparisons is unfair because 1) this pegged value is arbitrary and not real data, and 2) this unfairly hurts Models A, B, C, and F, which predict that there will be a change in this condition under different levels of sparsity.

Although we did not test the familiar within-community vs familiar within-community condition, we can fairly assume that there is no difference between sequences of the same condition. We need this milestone in the correlations because it is not only the relative familiarity between the three tested conditions that matters, but also their familiarity relative to the familiar within-community reference; in other words, it is important to take into account that none of our conditions was preferred over the reference in our behavioral task. We effectively did not measure the true value of this condition (familiar within-community transitions versus other familiar within-community transitions), which should correspond to 50% plus or minus noise. We omitted this comparison in order to keep the experiment short while collecting as much data as possible, so as to obtain more accurate measures of the familiarity of the other, crucial conditions.

To illustrate the necessity of adding this milestone value to the models, we present two fake models (see Author response image 4) that could not be differentiated if this 50% baseline value were ignored. In our dataset, ignoring it would, for example, artificially increase the correlation of the non-adjacent transition probability model with our data, whereas the predictions of that model are quite different from the pattern observed in our data.

Author response image 4
Two fake models are presented here that differ only in their predictions of the familiarity score for the within-community transitions (dark purple bar) compared to the other conditions, that is, whether or not its score is higher than that of the other conditions.

If we consider only the three conditions on the right without their relationship to the purple bar, both models would be considered similar and equally correlated with our data, but only Fake Model 1 is consistent with our observations (all conditions below chance level). Fake Model 2 does not represent our data, because all tested conditions are preferred to within-community transitions and thus would have been chosen in our forced-choice task. Therefore, the milestone value of Familiar within-community transitions = chance level = 50% is essential to disentangle the models.

Regarding the second point, adding this value does not unfairly hurt models that predict a change in familiarity estimates of familiar within-community transitions, because the models in each paradigm were normalized based on this value. Indeed, because each participant took part in only one of the paradigms (to avoid contamination of the learned structure into the next paradigm), there is no way to compare overall changes in familiarity between experiments. We therefore only sought to compare familiarity estimates within an experimental paradigm. For this reason, all models were also normalized relative to their familiar within-community value before being concatenated for correlation (ModelValuesForCorrelation = ModelValues / FamiliarWithinCommunityValue). After normalization, every model's estimate for the familiar within-community transitions was 1 in each of the three paradigms, thus not unfairly penalizing any model in the comparison with the data. Not including this value in the correlation would ignore the crucial information that the familiarity ranking of the other transitions was actually lower than chance (<50%) in each paradigm.
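As a sketch of this normalization step (hypothetical array layout: one row of condition predictions per paradigm, with the reference in the first column):

```python
import numpy as np

def normalize_to_within_reference(model_values, ref_col=0):
    """Divide each paradigm's predictions by its Familiar Within-Community
    value, so the reference equals 1 in every paradigm before the three
    paradigms are concatenated for correlation with the data."""
    model_values = np.asarray(model_values, dtype=float)
    return model_values / model_values[:, [ref_col]]

# Hypothetical predictions: 3 paradigms x 4 conditions (reference first).
preds = np.array([[0.80, 0.55, 0.40, 0.30],
                  [0.90, 0.60, 0.50, 0.25],
                  [0.70, 0.45, 0.40, 0.30]])
normalized = normalize_to_within_reference(preds)   # reference column is now 1
vector_for_correlation = normalized.ravel()         # concatenated across paradigms
```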

2. First vs. Second test epoch

I was surprised to read Figure 7 in the supplement. Until this point in the paper, I believed that the authors used both testing epochs for their analyses and that there were no differences between the tones and syllables conditions. In fact, the authors stated: "No differences were found between the experiments using tones and syllables. Thus, data were merged in the following analyses."

We consider the measures at the end of learning. The results of the first session for the syllable group are uninterpretable, while the other three data points (first and second session in the tone group and second session of the syllable group) are congruent. Participants probably had stronger priors about syllable sequences than about tone sequences. For example, flat transition probabilities violate language structure; this was indeed the main argument for proposing statistical computations as a crucial mechanism for infants to acquire language, and notably to build their lexicon (Saffran et al., 1996). The task is thus probably harder with syllables than with tones because of the adjustment participants must make between their priors and the real structure of the stream.

We have corrected the sentence to avoid misunderstanding.

However, in the methods section and the supplement, they show that this was not the case. Figure 7 shows substantial and meaningful differences between the conditions in the first half but the results reported in the main text are only from the second half of the testing. It seems ad hoc to only include the last epoch of testing because 1) it seems they used both epochs for the press task, and 2) if the authors thought that 4 mins of passive listening weren't enough to max out learning, then they should have made the passive listening longer.

As this type of paradigm had never been run in the auditory domain, we did not know the time needed for successful learning, nor the difference between the syllable and tone learning curves. We would like to point out that our models only predict the final learning state for infinite sequences. It seems reasonable that human performance converges toward this state with increased learning.

Furthermore, because it is difficult, in an online experiment, to ensure that participants are really doing the task, we wanted to avoid overly long streams and to have intermediate points to check the learning status.

Compounding this, it is currently hard to interpret the difference between syllables vs. tones and first vs. second training epoch because no behavioral data is reported for these (akin to Figure 3).

We added the 2-forced-choice data for each group in the supplemental figure. Note that the results are very similar for three of the four blocks (first and second session for tones, second session for syllables). Only the first session with syllables differs and does not present any reliable pattern matching any of the possible models. Moreover, that pattern of results is not even consistent between the three paradigms, nor with any of the proposed models, making it most probable that a majority of participants had not yet learned and thus did not show stabilized behavior.

I think learning effects are interesting and should be discussed. It is especially interesting that the syllables condition is so radically different. Syllables are the more typical stimulus used in auditory statistical learning tasks. In fact, tones typically lead to worse statistical learning (Schapiro, et al., 2014 is one example that comes to mind). Hence I think it is worth considering why there is such a drastic learning effect.

Indeed, the question of why syllables and tones differ is an interesting point. However, since the experiment was not designed to disentangle different hypotheses on this aspect, we can only speculate on the possible reasons. As mentioned above, transitions between syllables are constrained in speech, and it is very likely that adults have internalized the transition matrix corresponding to their native language and first tried to apply this model to the new data rather than rebuilding an entire model. Moreover, in daily life, sequences of syllables correspond to successions of words, biasing participants to search for stable syllable transitions, which were not present in the stream. Although we had no formal discussion with participants, some pilots/participants reported that they were looking for familiar words hidden in a nonsense, noisy sequence. These priors might explain why the task is harder with syllables and why participants needed some time before changing strategy and learning the real structure of the sequences.

We believe this hypothesis to be too speculative to make a strong point in the paper. However, we briefly discuss this idea in the text.

3. Conclusions about the brain from model comparisons

I believe that the authors overstate the brain-based conclusions that can be drawn from the model comparisons they perform. Model G uses associative learning to model the participant's choices. This is described as akin to cortical Hebbian learning, in contrast to the hippocampal learning in Model F. I think this juxtaposition is overstated given the nature of the modeling performed.

Indeed, the cortical localization of Hebbian learning was overstated in the text; we corrected it (see response to R1).

Modeling behavior can help elucidate the brain basis of phenomena, but I think this is difficult and ought to be done carefully. In this case, Hebbian learning is ubiquitous in the brain (and simple nervous systems). This lack of specificity means it cannot be easily used to discriminate the brain locus of community structure computation. Perhaps there is some reason to think that the exponential decay the authors include in Model G is specific to the cortex, but I am unaware of such a reason.

We agree with this point; see our response to the third remark of Reviewer 1.

4. Compression

I am not sure about the relevance of compression to the work presented here. To start, the authors state in their abstract that "the brain does not rely on exact memories but compressed representations of the world". However, I am not sure how the results of this study contribute to our understanding of compression in memory. If the authors want to say that information is lost during memory encoding, I think that is an unnecessary point to make since it is uniformly agreed on. If the authors want to bring in concepts from information theory about compression, I think they need to be more precise since the type of compression they presumably mean is lossy compression (i.e., information is lost), but compression can be lossless (i.e., information is retained but reformatted). In fact, based on how the authors use this term elsewhere, I think chunking might be a better concept to refer to than compression.

In figure 5, the authors suggest that the participants might only learn and retain two chunks. I think this speculation is provocative, but I could not see any evidence in this paper to support this claim. This is peculiar because the authors could have treated this hypothesis as one of the models that they used to try to account for the participants' data. If they did that, I suspect this model would not excel because it assumes that participants treat the edge nodes the same as the non-edge nodes. From multiple studies published using this community structure design, edge nodes are treated differently. Hence, if I were to speculate on a 'compressed representation' that the participants might have, I would assume they have four chunks: a pair of edge nodes and non-edge nodes for each cluster.

We agree with the reviewer that this part of the discussion was not relevant in its previous form. However, in the discussion we would like to propose the hypothesis that the biased computation might be the basis of a later condensed abstract representation. We believe that what we observed in this study is not yet an abstract graph representation, but that this bias could lead (after sleep?) to an abstract compressed mental model relying on only two groups.

We have removed most references to compression from the main text and abstract, mentioning it only in a short paragraph in which we explain that it is a hypothesis for interpreting the utility of such a bias for humans. We also completely redesigned Figure 5 to better reflect this line of thought. We hope that this part of the discussion is now better written and will not be interpreted as a result but as a speculative hypothesis linking this study to the literature on abstract mental representations.

Regarding the difference between transitions between edge nodes and transitions between non-edge nodes: it is a prediction made by the FEMM. However, in this study we did not have enough data per subject to divide each condition into these two subtypes and properly investigate this question.

Miscellaneous

5. Could the authors report how many of each transition the participants are exposed to during the passive learning and press task? It seems like there are a lot of possible transitions, so it is possible that even in the fully connected condition, their input is still sparse.

The training stream is composed of 960 items (959 transitions), and each press-task stream is composed of 480 items (2 × 479 transitions).

In the full community there are 48 individual transitions with the same probability, which represents an average of about 20 presentations of each transition during passive listening and about 10 per press task, making it quite unlikely that the input remains sparse. We added these values to the text.

6. The authors have two exclusion criteria for the different tests (button presses and choices), which result in different sample sizes. However, if they failed the force choice attention check, I don't think they should be included in the key presses and vice versa. Instead, I believe the exclusions should be the union of these two exclusion criteria.

We used two independent criteria because the number of button presses in the press task could reveal either a lack of attention or an inability to learn the structure. It seemed unfair to remove subjects from the two-forced-choice task simply because they might have failed to learn the structure. In any case, this rejection criterion is not crucial, and the results remain similar when all subjects are included (see Figure 3).

In the forced-choice task, we included catch trials to identify subjects who pressed as fast as they could to shorten the experiment, or who pressed randomly without listening, and we excluded these subjects, as is standard.

7. Why were twice as many participants excluded from the press task in the fully connected condition compared to the other conditions? Perhaps participants really did think there were fewer or more events? To address this, the authors could report the histograms for each condition before exclusions to see if those differences exist

We apologize for the error: the numbers reported in the paper were wrong. The real numbers are FC 28/250, SC 24/249, HSC 23/228, which are similar across paradigms. We corrected it.

8. Why is RT so delayed for the press task? I expect that participants should only take 500-750ms to respond, which suggests that perhaps participants are responding to something after the time locking is done here. What offset is the time locking to? Based on the data, I would guess it is time-locked to the offset of the edge node before the transition rather than the offset of the edge node after the transition.

First, we checked and found no error in the reported timing; the data are time-locked to the offset of the transition. The task is quite hard for participants, and they hesitate a lot before responding. We observe that the increase in pressing behavior begins between 550 and 825 ms after the transition but is maximal around 1–1.2 s after the transition.

9. The authors state that they did Bonferroni correction to test whether the likelihood of participants pressing a key after a transition has diverged from chance. Does this mean that every time point (millisecond?) is used as an independent sample in the Bonferroni correction? This doesn't seem plausible with the data shown in Figure 3 and is unnecessarily conservative.

We indeed used Bonferroni correction over 2851 points (time window from -100 to 2750 ms). The difference between HSC and the other paradigms is significant enough to survive this conservative approach, and the difference between FC and SC is not significant even without any correction (all ps>0.1). The correction method therefore does not affect the results here.
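For reference, with α = 0.05 this amounts to a per-timepoint significance threshold of 0.05 / 2851 ≈ 1.75 × 10⁻⁵.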

10. What is the noise ceiling for the model fit? Lots of ways to find this but one would be to split the behavioral data in half and see how correlated they are. I ask because I suspect you are close to that ceiling, which is impressive

We added an estimation of the noise ceiling using the same bootstrapping approach. For each bootstrap, we randomly selected n subjects with replacement twice and correlated the data of those two random samples. We found an average correlation of 84% as the noise ceiling for these data. The maximum fit with FEMM is around 81%, while the average of the bootstrap estimation with FEMM is around 77%.

We are indeed close to the ceiling, as the model precisely predicts our data and a very large sample was collected. We added this remark to the text and the figure.
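A minimal sketch of this noise-ceiling estimate (assumed data layout: one preference score per participant and condition; not the authors' actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_ceiling(data, n_boot=10000):
    """Correlate the condition means of two independent bootstrap
    resamples of the same participants; the average correlation gives
    an upper bound on how well any model can fit the group data.

    data: array (n_participants, n_conditions) of preference scores.
    """
    n = len(data)
    r = np.empty(n_boot)
    for b in range(n_boot):
        sample1 = data[rng.integers(0, n, size=n)].mean(axis=0)
        sample2 = data[rng.integers(0, n, size=n)].mean(axis=0)
        r[b] = np.corrcoef(sample1, sample2)[0, 1]
    return r.mean()
```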

Regarding point 1:

I think that the authors should exclude the "Familiar Between Condition" values from the model comparison. I firmly believe that removing this condition will make the model comparisons fairer and may change the conclusions.

We re-explained why we used this 50% value and made clearer that it does not penalize any model, because each model was normalized based on this value.

Regarding point 2:

The authors should more fully discuss the results of Figure 7 in the main text. They may also want to show the learning trajectory. Furthermore, I think they must report the forced choice data for the first testing epoch for syllables and tones separately, so it is clear what is driving the differences shown in Figure 7.

As suggested, we reported all the data in the supplementary material and mentioned the difference in the main text. As for the trajectory, we unfortunately believe that the experimental design is not suited to studying this difference, which is also not related to the goal of the study.

Regarding point 3:

I recommend drastically reducing the amount of commentary that compares the likely brain bases of the community structure computations. In particular, I think Model G can just be referred to as a biologically plausible alternative model to Model F.

We removed a full paragraph arguing that pattern completion and pattern separation in the hippocampus are the origin of the computation and now only mention it. We also rephrased all the sentences comparing the models to avoid discussing cortical vs hippocampal computation. Instead, we emphasized that biologically plausible models exist for this kind of computation.

Regarding point 4:

I recommend cutting all commentary about compression in the paper, but I am interested to hear justification otherwise.

If the authors wish to keep some of the compression content in, I think the authors could test my conjecture raised in point 4 by looking at the different nodes used in the 'New Between Community' lure trials. In particular, is there a difference when the 'New Between Community' is between an edge node and a non-edge node compared to when it is just between two non-edge nodes (vs. two edge nodes, i.e., 'Familiar Between Community')? I suspect there will be.

As explained above, we drastically reduced the discussion on this point.

Miscellaneous recommendations:

1. Mention in the main text that these were online studies.

Done.

2. Ali Preston's work on transitivity is relevant to several points raised here, so she should be cited.

Done.

3. I think it would be helpful to explain the logic of Model F more fully. For instance, it would be beneficial to say what the CA1 layer is doing in the model and why you chose to look at it in particular (i.e., MSP.)

We increased the description in the text.

4. It would be helpful to expand on the differences between hitting time and communicability since they seem like they would produce predictions that are more similar to what you report.

The predictions of hitting time and communicability are largely similar except for the strength of the generalization effect, which is greater for hitting time. This leads hitting time to be significantly better correlated with our data than communicability.

5. Please report the ISI of the stimuli. They are 250ms long, but I didn't see any mention of an ISI, so I assume the items are back-to-back.

Yes, they are. We added it to the methods.

6. Good job releasing the data, although because it was uploaded as a zip, I couldn't download it without my client crashing. Consider breaking up the zip into smaller files.

We checked the folder; it should be downloadable now.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

1) Analysis of learning dynamics

Reviewer 1 points out that "they do not analyze learning dynamics. They report behavior from the press task aggregated across all exposure. Only the second testing epoch is used because participants "have not learned and thus have not stabilized behavior" in the first epoch. This presents a missed opportunity to study learning dynamics and evaluate how well the models fit those dynamics – an important piece of evidence for model fit."

2) Make explicit assumptions/limitations

Both reviewers ask you to make more explicit some assumptions/limitations in several instances.

3) Reviewer 2 suggests that "an additional experiment designed to disentangle those could further strengthen the impact of this work". Please treat this as a suggestion for future research, rather than a requirement for the revised version of this manuscript.

To answer reviewers’ comments:

1) We provide a simple model of the theoretical learning dynamics during this kind of task, with and without prior knowledge of the statistics between elements, and report it in our response. However, the experiment was not designed to test these dynamics (only two time points, no control measure of the priors); such an analysis would suffer from these limitations and would not change our main message. We revised the text to explain the message better and to insist that sequence learning in general is not limited to statistical computation; our study and model therefore do not constitute a general model of sequence learning but a general model of statistical learning in sequences.

2) In general, we tried to better delimit the assumptions and limitations of this study. We further comment on when the model is appropriate (statistical computations in sequences). We also discuss its limitations (not taking priors into account, no precise exploration of the learning dynamics, risk of a habituation confound, noisy behavioral data).

3) Regarding the discrimination between Hitting Time and FEMM, we compared the predictions of both models on 1,000 networks randomly sampled from the space of 12-node networks and looked at the most divergent predictions between the two models. We show that they are very similar in most situations and that the networks with differing predictions are not well suited for experiments like the current one. Nevertheless, they may provide some clues on which directions further investigations should take. We tried to articulate the two models in the text to show their conceptual similarities and to better explain why FEMM and Hitting Time measure the same property of the network.

Reviewer #1 (Recommendations for the authors):

The authors did a good job of addressing my major concerns; however, in doing so they reinforced, in my opinion, a concern that reviewer 2 raised explicitly and I alluded to in my first round of review. Namely, are the authors capturing the dynamics of sequence learning, and, if not, what does that mean for their claims? Below I summarize my concern and then list a few minor issues that remain.

Throughout the paper and their response, the authors state that they are interested in understanding the dynamics of sequence learning and how hierarchical structure is acquired more generally. One of the most tantalizing outputs of this work is the possibility that they have a "general model of sequence learning". The task was explicitly designed to "have intermediate points to check the learning status" so they could study the dynamic aspects of learning, rather than examine it as a snapshot. Moreover, the FEMM is described as a normative account of how learning should unfold in order to quickly generalize to an appropriate hierarchical structure.

However, I believe there are a few ways the authors fail to meet these goals.

First, they do not analyze learning dynamics. They report behavior from the press task aggregated across all exposure. Only the second testing epoch is used because participants "have not learned and thus have not stabilized behavior" in the first epoch. This presents a missed opportunity to study learning dynamics and evaluate how well the models fit those dynamics – an important piece of evidence for model fit.

As an aside, what evidence is there that the participants haven't learned? In both tones and syllable conditions, they show reliable evidence of preference (according to Figure S2) in the first pressing epoch, which indicates learning. In the press task there is evidence of learning in syllables and tones for the first epoch (according to the figure used to respond to Reviewer 2). As stated previously, it seems ad hoc not to include the first epoch in the analyses when the experiment was designed to include it.

We agree with the reviewer that the claim that participants did not learn is unfair, given the results in the key-press task. However, the two tasks (i.e., during and after the stream) are different: during the stream, participants react to a change, whereas after the stream, isolated quadri-element sequences are presented and participants must judge which of the two is more congruent with the language they heard. This part might elicit more interference from prior knowledge about the correctness of isolated segments (which, in the case of syllables, are close to words). Indeed, several experiments have shown how the native language affects statistical learning in adults (Elazar et al., 2022; Onnis and Thiessen, 2013; Siegelman et al., 2018). We modeled the dynamics of learning with FEMM in the presence and absence of priors (see the figure below). Moreover, in the case of syllables, human adults know a priori that syllables form words with a fixed order of syllables. In our experiment, the borders lie between communities and, within communities, the elements appear randomly. In other words, language follows a completely different structure from the networks used here, which may hinder learning of this type of graph when it is built from syllables. This second type of prior is more difficult to model.

In the absence of priors, the learning dynamics can be modeled by comparing the FEMM prediction computed on a sequence of n elements with the real structure. In Author response image 5, we plot the learning dynamics for the full community structure. The learning curve takes much longer to converge when we add strong priors on the TP matrix between syllables that are not congruent with the structure (for representation purposes we chose a prior strength of approximately 1000 elements: learning rate α = sequence length / 1000).

Author response image 5
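One way to implement the no-priors curve described above (a sketch, not the authors' code; the femm function uses the closed form from Lynn et al., 2020, and β = 0.06, the value used elsewhere in the paper):

```python
import numpy as np

def femm(A, beta=0.06):
    """FEMM estimate: exponentially decayed mixture of all TP orders,
    A_hat = (1 - e^-beta) * A @ inv(I - e^-beta * A)."""
    eta = np.exp(-beta)
    n = len(A)
    return (1 - eta) * A @ np.linalg.inv(np.eye(n) - eta * A)

def empirical_tp(seq, n_nodes):
    """Count-based transition-probability matrix from a sequence."""
    counts = np.full((n_nodes, n_nodes), 1e-9)  # small floor for unseen rows
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def learning_curve(true_A, lengths, beta=0.06, seed=0):
    """Correlation between the FEMM estimate built from the first n items
    of a random walk on true_A and the true structure, for each n."""
    rng = np.random.default_rng(seed)
    n = len(true_A)
    walk = [int(rng.integers(n))]
    for _ in range(max(lengths) - 1):
        walk.append(int(rng.choice(n, p=true_A[walk[-1]])))
    walk = np.array(walk)
    return [np.corrcoef(femm(empirical_tp(walk[:L], n), beta).ravel(),
                        true_A.ravel())[0, 1] for L in lengths]
```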

Our results show that previous experience can alter the learning dynamics; however, the two learning curves seem to converge. In our experiment, we cannot properly study participants’ learning curves because we only have two data points, and we did not design the experiment to specifically test the effect of priors (we did not use syllables with different transition probabilities in the participants' native language). Thus, further research is needed to address this point. Note that we used syllables because we wanted to test preverbal infants on a similar stream (see our work on statistical learning in neonates: Benjamin et al., 2022; Fló et al., 2022). Infants respond better to speech stimuli than to non-speech stimuli, probably due to attentional biases and/or neural networks dedicated to speech processing. Furthermore, they have not accumulated enough evidence about their native language to have strong priors at the ages we test. We are now more cautious in the text about what we can and cannot conclude from these results and have added a paragraph in the methodological remarks section to explicitly state this limitation of the study. We have also strengthened the reference to Figure S2 in the main text so that readers interested in tone/syllable differences can easily refer to it.

Second, the authors argue that FEMM is an optimal way to overgeneralize to learn quickly. If so, it seems that FEMM should be excellent at approximating behavior when participants are initially learning. This is not the case for the syllable condition, where the authors argue more learning has to occur to overcome pre-existing biases.

This sentence referred to the case without priors. Indeed, we re-ran the dynamic modeling without priors, comparing FEMM and TP. Zooming in on the early dynamics, we can see that FEMM converges faster than TP and describes the structure better for short sequences (it has already inferred the communities while the TP model needs to see each transition). However, FEMM converges to a lower correlation with the true transition matrix because it has learned biased transitions (as shown by the pruning and completion effects).

Author response image 6

It is true that our text was not explicit about what this referred to. Given the small size of the difference in the modeling and the fact that we did not directly test it, we did not include this test, to avoid overinterpretation of a small difference, but we modified the text to be more specific.

Third, in the author's new and improved discussion about compression, they state that there is a stage after what is being tested here in which compressed representations form. At this stage, the FEMM would no longer apply, and so it is not a "general model of sequence learning".

I think the author's argument is that FEMM and their other high-performing models capture an initial stage of learning and that some other models (e.g. compression) will explain subsequent representations later in learning. If this is true, they ought to clarify this explicitly by stating their paper is a snapshot in the dynamics of learning. In doing so, they would need to revise their section "A general model of sequence learning" where they imply that the FEMM captures the entire learning process, instead of a portion. Moreover, they should be clear about what portion of the learning process they think FEMM accounts for: their normative claims about the value of overgeneralization suggest early stages of learning, but their exclusion of the first epoch suggests they are not testing early learning. Finally, I think the authors should still shorten their description of the compression section since its presence in the discussion implies that the data in the paper supported this hypothesis.

Indeed, we agree with the reviewer that FEMM is not a general model of sequence representations in the brain but only a general model of how statistical information is extracted from a continuous sequence. It says nothing about how subsequent processes might use this information. We have corrected the text to make clearer the distinction between structure learning and its flexible use. We have limited the implications of our experiment to the statistical learning part only and reduced the paragraph on the compression hypothesis.

Reviewer #2 (Recommendations for the authors):

In general, the authors produced a stronger manuscript that addresses many of the concerns that were raised. The analysis (figure S2) of the syllable and tone tasks separately reassures me that, at least for Session 2, a similar behavioral pattern and model comparison result is observed for both tasks. Methodological choices are better motivated, and the adjusted visualization of the AFC results is a lot clearer. Finally, the split between theoretical models and possible neural implementation, in addition to the more careful brain-based conclusion, is also an improvement.

The novelty of the paper is now communicated more clearly, yet I do still share the concern that noisy behavioral data from a single type of learning measure (AFC) limit the possibility for strong conclusions based on model comparison. For many reviewer suggestions, the authors argue that their experiment was not designed to look at those comparisons. Especially for the highly correlated models that do a good job explaining the current AFC data (i.e. Hitting time, FEMM, Hebbian), I am left thinking that an additional experiment designed to disentangle those could further strengthen the impact of this work.

First of all, it should be noted that we used all these models because most have been proposed in the literature; we wanted to be as complete as possible and to underline the similarities of some of them.

Regarding the FEMM and Hebbian models, it is not possible to disentangle them, because the Hebbian model was designed as an implementation of the FEMM using the Hebb rule for neural binding. Indeed, both models share the same linear mixture of all TP orders with the same exponential decay.
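For reference, this shared quantity can be written in closed form (the formulation introduced by Lynn et al., 2020, which we assume here), with A the transition matrix, I the identity, and β governing the exponential decay across orders:

```latex
\hat{A} \;=\; (1 - e^{-\beta}) \sum_{k=0}^{\infty} e^{-\beta k}\, A^{\,k+1}
        \;=\; (1 - e^{-\beta})\, A \,\bigl(I - e^{-\beta} A\bigr)^{-1}
```

As β → ∞ the mixture reduces to the exact transition matrix A, while smaller β blends in higher-order transitions, producing the pruning and completion biases.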

Regarding FEMM and Hitting Time, both models capture the same graph property (the average distance between two nodes) and are thus also difficult to disentangle. Nevertheless, we tried to find whether the properties of certain networks might help to differentiate them. As there is no analytical solution for Hitting Time in the general case, we used simulation to look for such a graph. We simulated 1,000 networks randomly sampled from all possible 12-node graphs, with some constraints reasonable for a cognitive experiment: connectivity is respected (every node can be reached from any other) and networks are neither nearly complete nor nearly empty. We then computed the two metrics and the correlation between the FEMM and HT predictions. The distribution of the correlations is plotted in Author response image 7.

Author response image 7

We can see that the two metrics are always highly correlated. The network with the lowest correlation had a 53% correlation (red square in the first plot below); its TP, FEMM, and Hitting Time matrices are shown below it. The difference between FEMM and Hitting Time is almost entirely driven by the value at position [11,9], due to a single possible transition from node 11 to node 9 with a transition probability of 1 (the hitting time is thus exactly 1, an extreme value compared with all other transitions). After removing this outlier transition, the correlation rises to 96%. The same holds (to varying degrees) for all graphs we could find with a correlation below 80%, suggesting that the difference between the models is restricted to “aberrant” cases driven by the presence of a transition with probability 1, and thus does not apply to the general case. We therefore believe that FEMM, Hitting Time, and Hebbian learning are similar models, capturing the same properties of transitions in a network. We included all three in the paper because they refer to different formalisms: FEMM is more accurate and uses a β parameter that could account for possible differences between ages and populations; Hitting Time is easier to understand if one thinks in terms of sequences rather than statistics and networks; Hebbian learning is simply a proposed implementation of FEMM using the already-described Hebb rule (different levels in Marr’s classification). There may be some formalism for differentiating FEMM from Hitting Time, but networks do not seem well suited for this purpose, as they almost always yield similar predictions from both models.
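As a sketch, mean hitting times can be obtained from a transition matrix with one linear solve per target node (a standard Markov-chain identity, not the authors' code); their negated values can then be correlated with FEMM predictions as described above:

```python
import numpy as np

def hitting_times(P):
    """H[i, j] = expected number of steps to first reach j from i under
    transition matrix P (assumes every j is reachable from every i).
    For each target j, solve
        h[i] = 1 + sum_k P[i, k] * h[k]  for i != j,  with h[j] = 0,
    i.e. (I - Q) h = 1, where Q is P restricted to non-target states."""
    n = len(P)
    H = np.zeros((n, n))
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        Q = P[np.ix_(keep, keep)]
        h = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        H[keep, j] = h
    return H

# Example: a deterministic transition (probability 1) yields a hitting
# time of exactly 1, the extreme case that drives the model divergence.
P = np.array([[0.0, 1.0],
              [0.5, 0.5]])
print(hitting_times(P))  # H[0, 1] == 1, H[1, 0] == 2
```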

For better readability of the paper, we have swapped Hitting Time and FEMM in the model description and Figure 2. We now present the FEMM computation first and later present Hitting Time as an alternative view of the same property from a sequential perspective. We hope that this new organization will help readers understand the conceptual similarities of the two models. We have also modified the text in several places to better describe the two models and their similarities.

Currently, the AFC judgment is always one against a familiar-within transition "because all models postulate their correctness", but could such different contrasts not help to disentangle correlated models?

Unfortunately, most of the other contrasts between conditions are equivalent across all models. The only difference in this design could come from New Within vs Familiar Between in the High Sparse Community design, where Hitting Time and FEMM (with β = 0.06) make slightly different predictions. However, FEMM with a slightly different β could make predictions similar to Hitting Time on this condition.

In their response to Reviewer 3, the authors mention that the difference between transitions between edge nodes and transitions between non-edge nodes is a prediction made by the FEMM. If this is a unique prediction of FEMM this seems worth testing.

Here again, by design, the three correlated models inherit this difference, because it comes from the average distance between nodes in the network, which is the common metric that all three models approximate.

Methodological remarks section (p. 20-21):

– Whereas I found the arguments countering the use of a Hamiltonian path convincing, I do think it would be good to explicitly acknowledge the potential confound outlined by reviewer 1 and the reason why you believe simple adaptation is not what is going on in the pressing probability results.

We have added a paragraph to explicitly warn the reader about this: “However, for the design reasons explained above, Hamiltonian walks are not usable, and thus we could not formally control for a potential habituation effect in our design. The key-press results of this study (but not the 2-forced-choice results) are therefore potentially subject to confounding by habituation.”

– "However, this second metric has a low sensitivity as only a few trials can be collected resulting in data variability that was compensated by a very large sample of participants (N=727)." Data variability seems strange phrasing, as true score variance is not a problem, maybe what is meant is high error variance?

We corrected this in the text.

Regarding prior knowledge of syllable sequences:

– The authors use both the phrasing "priors on syllable sequences" and "a prioris that syllables are ordered in words" (p. 31), I think the latter is a lot less precise and can better be avoided. The authors also write "This a priori does not exist with tones", but is that true given our exposure to music?

We corrected this in the text and avoided the second formulation. For tones, habituation to music might also create priors; however, the tones we used do not belong to any particular musical scale, avoiding familiarity with musical grammar. We changed the text accordingly.

– The idea that participants have prior knowledge of syllable sequences affecting their ability to learn about new transitions between syllables is introduced without references, suggesting that it might be a new idea, whereas there is literature on this: see for example Siegelman et al. (2018)'s paper titled Linguistic entrenchment: Prior knowledge impacts statistical learning performance.

We added this reference and two others to support this claim (Elazar et al., 2022; Onnis and Thiessen, 2013; Siegelman et al., 2018).

– The Participants section states that participants were recruited via social media but mentions nothing about their language background. The linguistic stimuli were French diphones. So far I assumed the experiment language was French, but maybe that was not the case?

Indeed, only French participants took part. We added this information to the text.

References

Benjamin L, Fló A, Palu M, Naik S, Melloni L, Dehaene‐Lambertz G. 2022. Tracking transitional probabilities and segmenting auditory sequences are dissociable processes in adults and neonates. Developmental Science. doi:10.1111/desc.13300

Elazar A, Alhama RG, Bogaerts L, Siegelman N, Baus C, Frost R. 2022. When the “Tabula” is Anything but “Rasa:” What Determines Performance in the Auditory Statistical Learning Task? Cogn Sci 46:e13102. doi:10.1111/cogs.13102

Fló A, Benjamin L, Palu M, Dehaene-Lambertz G. 2022. Sleeping neonates track transitional probabilities in speech but only retain the first syllable of words. Sci Rep 12:4391. doi:10.1038/s41598-022-08411-w

Onnis L, Thiessen E. 2013. Language experience changes subsequent learning. Cognition 126:268–284. doi:10.1016/j.cognition.2012.10.008

Siegelman N, Bogaerts L, Elazar A, Arciuli J, Frost R. 2018. Linguistic entrenchment: Prior knowledge impacts statistical learning performance. Cognition 177:198–213. doi:10.1016/j.cognition.2018.04.011

Whittington JCR, Muller TH, Mark S, Chen G, Barry C, Burgess N, Behrens TEJ. 2020. The Tolman-Eichenbaum Machine: Unifying Space and Relational Memory through Generalization in the Hippocampal Formation. Cell 183:1249-1263.e23. doi:10.1016/j.cell.2020.10.024

https://doi.org/10.7554/eLife.86430.sa2

Article and author information

Author details

  1. Lucas Benjamin

    Cognitive Neuroimaging Unit, CNRS ERL 9003, INSERM U992, Université Paris-Saclay, NeuroSpin center, Gif/Yvette, France
    Contribution
    Conceptualization, Data curation, Formal analysis, Visualization, Methodology, Writing - original draft, Writing – review and editing
    For correspondence
    lucas.benjamin@cea.fr
    Competing interests
    No competing interests declared
ORCID: 0000-0002-9578-6039
  2. Ana Fló

    Cognitive Neuroimaging Unit, CNRS ERL 9003, INSERM U992, Université Paris-Saclay, NeuroSpin center, Gif/Yvette, France
    Contribution
    Conceptualization, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0002-3260-0559
  3. Fosca Al Roumi

    Cognitive Neuroimaging Unit, CNRS ERL 9003, INSERM U992, Université Paris-Saclay, NeuroSpin center, Gif/Yvette, France
    Contribution
    Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0001-9590-080X
  4. Ghislaine Dehaene-Lambertz

    Cognitive Neuroimaging Unit, CNRS ERL 9003, INSERM U992, Université Paris-Saclay, NeuroSpin center, Gif/Yvette, France
    Contribution
    Conceptualization, Supervision, Funding acquisition, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
ORCID: 0000-0003-2221-9081

Funding

Horizon 2020 - Research and Innovation Framework Programme (695710)

  • Ghislaine Dehaene-Lambertz

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This research has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 695710 to GDL). We thank Stanislas Dehaene and Mathias Sablé-Meyer for discussions and remarks during the design and the interpretation of the experiment. We also thank the Foundation Les Treilles for supporting this work (LB).

Ethics

All participants gave their informed consent for participation and publication, and this research was approved by the Ethical research committee of Paris-Saclay University under the reference CER-Paris-Saclay-2019-063.

Senior and Reviewing Editor

  1. Floris P de Lange, Donders Institute for Brain, Cognition and Behaviour, Netherlands

Reviewer

  1. Cameron Ellis, Haskins Laboratories, United States

Version history

  1. Preprint posted: May 19, 2022 (view preprint)
  2. Received: January 25, 2023
  3. Accepted: April 28, 2023
  4. Accepted Manuscript published: May 2, 2023 (version 1)
  5. Version of Record published: June 5, 2023 (version 2)

Copyright

© 2023, Benjamin et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.



Lucas Benjamin, Ana Fló, Fosca Al Roumi, Ghislaine Dehaene-Lambertz (2023) Humans parsimoniously represent auditory sequences by pruning and completing the underlying network structure. eLife 12:e86430. https://doi.org/10.7554/eLife.86430
