Introduction

Age-related hearing loss, defined as declines in hearing sensitivity, is exceedingly common; according to some estimates, ∼45 million adults in the United States over 50 years of age have age-related hearing loss that is significant enough to interfere with communication (1). Untreated hearing loss decreases quality of life and is considered to be the single-largest modifiable risk factor in middle-age for other age-related comorbidities such as cognitive impairment and dementia (2). However, current measures of hearing sensitivity fail to capture critical aspects of real-world hearing difficulties in this population (3, 4). Hearing difficulties experienced by up to 10% of adults seeking help in the hearing clinic are ‘hidden’ to current diagnostic procedures (3–6). Peripheral deafferentation caused by cochlear neural degeneration (CND) may underlie many of these perceptual difficulties (7, 8). Anatomical evidence for progressive CND with aging is clear – postmortem studies using human temporal bones estimate a 40% deafferentation caused by CND by the fifth decade of life (9–11). CND causes neural coding deficits in the peripheral auditory pathway, affecting the faithful representation of spectrotemporally complex auditory stimuli (12–14). However, the evidence linking CND with perceptual deficits is mixed: current assessments focus primarily on behavioral measures of speech perception in noise, and these show inconsistent deficits in individuals with putative CND (15–18).

Two challenges impede our understanding of the perceptual consequences of CND. First, while many non-invasive markers of CND have been proposed and validated in animal models (7, 14, 19, 20), non-invasive estimates of putative CND in humans cannot be confirmed with histological assessment of synapses in the same participants. Cross-species comparative studies and computational modeling provide promising avenues for overcoming this gap (21, 22). Second, behavioral readouts of perceptual difficulties in humans with putative CND show mixed results, depending on the specific test used and the degree of spectrotemporal and contextual information provided in that test (17, 23, 24). The most promising tests for CND are ones with no linguistic context and short spectrotemporal processing windows (13, 24). However, these behavioral readouts may underestimate subliminal changes in perception that are reflected in listening effort but not in accuracy (25–27). Specifically, two individuals may show similar accuracies on a listening task, but one individual may need to exert substantially more listening effort to achieve the same accuracy as the other. Here, we used a cross-species approach, combined with simultaneous measurements of behavior and listening effort, to show that CND was associated with decreased neural coding fidelity and increased listening effort in middle-aged adults with normal audiometric thresholds. We measured putative CND using the envelope following response (EFR) to rapid (∼1000Hz) modulation frequencies – a suggested marker for CND (12, 14). Cross-species comparisons with identical recordings in a low-frequency hearing animal model, the Mongolian gerbil, confirmed that decreases in EFRs were selective for responses with generators in the auditory nerve. These EFRs were also associated with histologically-confirmed CND in gerbils.
In the human model, we simultaneously measured pupil-indexed listening effort in participants as they performed a speech-in-noise task and showed that increased listening effort was present despite matched behavioral accuracies. These results point to hitherto underexplored aspects of auditory perceptual difficulties associated with listening effort and CND.

Results

“Normal” hearing middle-aged adults show evidence of peripheral neural coding deficits that are associated with CND

Middle-aged (40-55 years) and young adult (18-25 years) listeners were recruited to participate in this study (Fig. 1A). All participants had clinically normal hearing thresholds and spoke fluent American English. Participants had normal otoscopy by visual examination and air conduction thresholds ≤ 25dB HL for octave frequencies from 250Hz to 8 kHz (Fig. 1B, Table 1), consistent with WHO guidelines for normal hearing (28). Threshold differences were exaggerated in middle-aged adults (MAs) at extended high frequencies (>8kHz), which are seldom clinically measured but may be a marker for accumulated lifetime noise damage ((17, 29–31), Fig. 1B, Table 2). Outer hair cell function, assessed using distortion product otoacoustic emissions (DPOAEs), was comparable between young adult and middle-aged listeners up to 4 kHz, the frequency region that contains most of the spectral information in speech (Fig. 1C, Table 3). Participants also had no severe symptoms of tinnitus (Fig. 1D) assessed using the Tinnitus Handicap Inventory (THI; (32)) and Loudness Discomfort Levels (LDLs; (33)) above 80 dB SPL for frequencies up to 3 kHz (Fig. 1E, Table 4). Self-reported noise exposure using the Noise Exposure Questionnaire (NEQ; (34)) was not significantly different between age groups (Fig. 1F, Table 4). Participants also had normal cognitive function indexed by the Montreal Cognitive Assessment (MoCA ≥ 25; (35)) and comparable working memory scores assessed using the operation span task (OSPAN) ((36), Fig. 1G, Table 4). Hence, the middle-aged adults recruited for this study were all “normal” by currently administered behavioral and audiological assessments in the hearing clinic, while exhibiting some sub-clinical outer hair cell dysfunction, especially at frequencies above 4kHz.

Age-related CND occurs prior to overt changes in hearing thresholds and can be assessed non-invasively by measuring phase-locked neural envelope following responses.

(A) Thirty middle-aged (MA, 40-55 yrs, mean = 46.1 ± 4.6 yrs) and 36 young adults (YA, 18-25 years, mean = 21.17 ± 1.8 yrs) participated in this study. (B) All participants had clinically normal hearing thresholds with some evidence of threshold losses at extended high frequencies above 8 kHz typically not tested in the clinic. Hearing thresholds in dB HL are shown on the Y axis and frequency in kHz is plotted on the X axis. (C) Outer hair cell function assessed using DPOAEs is comparable between YA and MA up to 4kHz and showed age-related decreases at higher frequencies. Both cohorts show no evidence of self-reported tinnitus (D) or hyperacusis measured as LDLs (E), have comparable self-reported noise exposure levels (F), and comparable working memory scores assessed using OSPAN (G). (H) EFRs to modulation frequencies of 1024Hz can be reliably recorded in young and middle-aged adults using ‘tiptrodes’. The panel shows grand-averaged FFT traces for YA and MA. (I) Middle-aged adults showed significant declines in EFR amplitudes at 1024Hz AM, with putative neural generators in the auditory nerve. (J) Signal-to-noise ratios were 8dB on average for YA and 4dB for MA. (K) Statistically significant decreases in EFR amplitudes were selective for 1024Hz AM, the modulation frequency with putative generators in the auditory nerve. All panels: Error bars and shading represent standard error of the mean (SEM). Asterisks represent p<0.05, ANOVA.

Comparison of air conduction thresholds using a 3-way ANOVA (MA = 37, YA = 35)

Comparison of extended high frequencies using 3-way ANOVA (MA = 37, YA = 35)

Comparison of right ear distortion product otoacoustic emissions using a 2-way ANOVA (MA = 34, YA = 31)

Comparisons using 1-way ANOVAs

We then measured putative CND using neural ensemble responses from the auditory periphery phase-locked to the stimulus amplitude envelope via the EFR. EFRs can be used to emphasize neural generators in the auditory periphery by exploiting divergent phase-locking abilities along the ascending auditory pathway. EFRs at rapid amplitude modulation (AM) frequencies above 600Hz have been shown to relate to underlying CND in animal models (12, 14) and in humans (37). Here, we measured EFRs to AM frequencies that have putative neural generators in the central auditory pathway such as the cortex (40Hz AM) (12, 38), as well as faster modulation rates (110Hz, 512Hz, and 1024Hz AM) that emphasize progressively peripheral auditory regions (12). We were able to reliably record EFRs up to 1024Hz by using gold-foil tipped electrodes (‘tiptrodes’) placed in the ear canal, closer to the presumptive neural generators in the auditory nerve (Fig. 1H). EFR peaks analyzed in the spectral domain were above the noise floor, with average signal-to-noise ratios (SNRs) of 8dB in younger and 4dB in middle-aged adults (Fig. 1I, J). Statistically significant age-related decreases in EFR amplitudes were only present for EFRs to the 1024Hz AM rate, which has putative generators in the auditory nerve (12, 14), but were not present for slower AM rates with putative generators in the midbrain or cortex (Fig. 1K, Table 5).
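The spectral-domain SNR described above can be illustrated with a minimal sketch. It is not the authors' analysis pipeline: the synthetic signal, the sampling rate, the choice of flanking bins for the noise floor, and the function names `bin_amplitude` and `efr_snr_db` are all assumptions made for illustration.

```python
import cmath
import math
import random


def bin_amplitude(signal, fs, freq):
    """Amplitude of `signal` at `freq` (Hz) from a single-frequency DFT."""
    n = len(signal)
    acc = sum(x * cmath.exp(-2j * math.pi * freq * i / fs)
              for i, x in enumerate(signal))
    return 2.0 * abs(acc) / n


def efr_snr_db(signal, fs, mod_freq, n_flank=5, skip=2):
    """Peak amplitude at `mod_freq` and its SNR (dB) against nearby bins.

    The noise floor is the mean amplitude of `n_flank` bins on each side,
    skipping the `skip - 1` bins closest to the peak (an assumed convention).
    """
    resolution = fs / len(signal)  # spacing between DFT bins in Hz
    peak = bin_amplitude(signal, fs, mod_freq)
    offsets = [k * resolution for k in range(skip, skip + n_flank)]
    flank = [bin_amplitude(signal, fs, mod_freq + sign * off)
             for off in offsets for sign in (-1, 1)]
    noise_floor = sum(flank) / len(flank)
    return peak, 20.0 * math.log10(peak / noise_floor)


# Synthetic 1-s recording: a 1024 Hz component buried in Gaussian noise.
rng = random.Random(0)
fs = 8192
trace = [math.sin(2 * math.pi * 1024 * i / fs) + rng.gauss(0.0, 0.5)
         for i in range(fs)]
peak_amp, snr_db = efr_snr_db(trace, fs, 1024)
```

Estimating the noise floor from bins adjacent to the modulation frequency keeps the signal and noise estimates within the same recording, so slow drifts in overall noise level affect both equally.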

Comparison of EFRs using 2-way ANOVAs (MA = 29, YA = 28)

To confirm that the EFR parameters used here were indeed sensitive to putative CND, we measured EFRs using identical stimuli, acquisition, and analysis parameters in young (22wk) and middle-aged (80wk) Mongolian gerbils (Fig. 2A). The hearing range of gerbils largely overlaps with that of humans at speech frequencies (39), making them an ideal animal model for direct comparison in cross-species studies. Middle-aged gerbils showed no loss of hearing thresholds, similar to middle-aged humans (Fig. 2B). Remarkably, gerbils also exhibited a selective decrease in EFR amplitudes for AM rates at 1024Hz, similar to middle-aged humans (Fig. 2C, Table 6). CND in gerbils was assessed using immunohistological analysis of cochlear whole mounts, where the cell bodies, presynaptic ribbon terminals and the post-synaptic glutamate receptor patches were immunostained, visualized using confocal microscopy, and quantified from 3D reconstructed images (Fig. 2D). Significant decreases in afferent synapse counts were present in middle-aged gerbils, with losses reaching up to 20% compared to the young gerbils (Fig. 2E, Table 7). Further, EFR amplitudes were significantly correlated with the number of remaining cochlear synapses (Fig. 2F), thus confirming that our EFRs were a sensitive metric of CND.

Cross-species experiments in a rodent model show that EFRs are a sensitive biomarker for histologically confirmed CND.

(A) Cross-species comparisons were made with young (22 ± 0.86 weeks, n = 14) and middle-aged (80 ± 0.76 weeks, n = 13) Mongolian gerbils, with identical stimuli, recording, and analysis parameters. (B) Middle-aged gerbils did not show any age-related decreases in hearing thresholds. (C) Age-related decreases in EFR amplitudes were isolated to the 1024Hz modulation frequency, similar to middle-aged humans in Fig. 1K. (D) CND was quantified for a subset of these gerbils (n = 10 young and 10 middle-aged) using immunostained organ of Corti whole mounts, where afferent excitatory synapses were quantified using 3D reconstructed images. (E) Cochlear synapse counts at the 3kHz cochlear region corresponding to the carrier frequency for the EFRs were significantly decreased in middle-aged gerbils, despite matched auditory thresholds. (F) EFR amplitudes at 1024Hz AM were significantly correlated with the number of remaining cochlear synapses, suggesting that these EFRs are a sensitive metric for CND with age. All panels: Error bars and shading represent standard error of the mean (SEM). Asterisks represent p<0.05, ANOVA.

Comparison of 22 week-old gerbil (n= 14) and 80 week-old gerbil (n = 12) EFRs using 2-way ANOVAs

Comparison of synapse counts at 3000 Hz in 22 and 80 week-old gerbils using 1-way ANOVA

Perceptual deficits manifest as increased listening effort prior to behavioral deficits in middle-aged adults

Do middle-aged adults with putative CND experience challenges with hearing in noise despite having clinically normal hearing thresholds? We measured speech perception in noise abilities with the clinically-used Quick Speech-in-Noise (QuickSIN; (40)) task, to assess hearing in noise changes that were closer to real-world listening scenarios. QuickSIN tests suprathreshold hearing of medium context sentences presented in varying levels of four-talker background babble ranging from 25 to 0 dB SNR levels in 5 dB steps (Fig. 3A). Further, QuickSIN is a clinically relevant test that we recently identified as being sensitive to perceptual deficits in adult populations with normal audiograms (5). On each trial, participants were required to repeat a target sentence, which contained five key words for identification. Clinically, QuickSIN is scored as dB SNR loss, reflecting the SNR level required to correctly identify key words in noise half the time. No significant age-related differences were observed in clinically scored QuickSIN dB SNR loss (Fig. 3B, Table 4). When analyzing performance at each SNR, accuracy was at near-ceiling from 25 dB SNR to 10 dB SNR, but dropped from 5dB SNR in both young and middle-aged adults. Statistically significant behavioral deficits with age were observed on QuickSIN only in the most challenging SNR of 0 dB (Fig. 3C, Table 8).
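The two ways of scoring QuickSIN mentioned above, clinical dB SNR loss and per-SNR accuracy, can be sketched as follows. The sketch assumes the standard clinical formula for a single six-sentence list (SNR loss = 25.5 minus the total number of key words repeated correctly); the text does not state how many lists were administered or how they were combined, and the function names and example scores are hypothetical.

```python
def quicksin_snr_loss(key_words_correct):
    """Clinical QuickSIN score for one list of six sentences.

    `key_words_correct` holds the number of key words (0-5) repeated correctly
    at 25, 20, 15, 10, 5, and 0 dB SNR, in that order.
    """
    if len(key_words_correct) != 6:
        raise ValueError("a QuickSIN list has six sentences")
    if any(not 0 <= k <= 5 for k in key_words_correct):
        raise ValueError("each sentence has five key words")
    return 25.5 - sum(key_words_correct)


def percent_correct_by_snr(key_words_correct):
    """Per-SNR accuracy (percent of the five key words repeated correctly)."""
    snrs = (25, 20, 15, 10, 5, 0)
    return {snr: 100.0 * k / 5 for snr, k in zip(snrs, key_words_correct)}


# Hypothetical listener: near-ceiling down to 10 dB SNR, then declining.
example = [5, 5, 5, 4, 3, 1]
snr_loss = quicksin_snr_loss(example)        # 25.5 - 23 = 2.5 dB SNR loss
accuracy = percent_correct_by_snr(example)   # accuracy[0] is 20.0 percent
```

Because the clinical score sums across all six SNRs, two listeners who differ only at 0 dB SNR can receive similar dB SNR loss values, which is one reason the per-SNR breakdown can reveal differences that the summary score hides.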

Increased listening effort precedes behavioral deficits in speech in noise perception in middle-aged adults.

(A) Speech perception in noise was assessed using the QuickSIN test, which presents moderate context sentences in varying levels of multi-talker babble. Pupillary measures were analyzed in two time-windows – 1. during stimulus presentation, and 2. after target sentence offset and prior to response initiation. (B) No significant age-related differences were observed in clinical QuickSIN scores presented as dB SNR loss. (C) QuickSIN performance is matched between middle-aged (MA) and younger adults (YA) until the most difficult noise condition (SNR 0). The x-axis shows the SNR condition that the target sentences were presented in, with 25dB being the easiest noise condition, and 0dB being the most difficult noise condition. The y-axis shows participant accuracy in repeating key words from the target sentences as percent correct. (D) Grand-averaged pupillary responses measured during task listening as an index of effort exhibit modulation with task difficulty, with greater pupillary dilations observed in harder conditions for both groups. (E) Middle-aged adults show consistently higher pupillary responses during performance on the QuickSIN task and at SNR levels prior to when overt behavioral deficits are observed. (F) Grand-averaged pupillary responses measured after target sentence offset as an index of effort exhibit greater modulation with task difficulty, compared to changes in the listening window. (G) Trends seen in the listening window were amplified in this integration window, with middle-aged adults showing even greater effort, especially at moderate SNRs where behavior was matched.

Comparison of QuickSIN performance using a 2-way ANOVA (MA = 34, YA = 31)

Are there perceptual deficits experienced by middle-aged adults that are not captured by traditional behavioral readouts? We addressed this question by measuring isoluminous task-related changes in pupil diameter as an index of listening effort (41–43) while participants performed the QuickSIN task (Fig. 3A). Pupillary changes were analyzed using growth curve analysis (GCA, (44)). GCA provides a statistical approach to modeling changes over time in the timing and shape of the pupillary response and has several advantages over traditional approaches to analyzing pupillary responses. First, GCA does not require time-binned samples, thus removing the trade-off between temporal resolution and statistical power, and second, GCA can account for individual variability. Two second-order GCAs were fit to different time-windows (Tables 9-10, see Methods). One time window encompassed the onset of the masker through the first 2.8s of the target sentence (listening window). The second window spanned from the end of the target sentence up to the verbal response prompt (integration window). These two time-windows were hypothesized to represent effort associated with differing sensory and cognitive processes. The listening window reflects linguistic and semantic processing of ongoing speech stimuli and is a physiological response to auditory processing (45), while the integration window reflects error correction, working memory and comparisons with predictive internal models (46, 47). The linear term from the GCA was further analyzed as a marker for the slope of pupillary change over time.
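To make the role of the linear term concrete, the sketch below builds second-order orthogonal polynomial time terms and recovers the linear and quadratic weights of a single trace. It is a simplified illustration only: a full GCA fits these terms within a mixed-effects model with participant-level random effects, which is omitted here, and the trace, sample count, and function names are assumptions.

```python
import math


def orthogonal_time_terms(n_samples, order=2):
    """Orthonormal polynomial time terms (constant, linear, quadratic, ...)."""
    basis = []
    for power in range(order + 1):
        vec = [float(i) ** power for i in range(n_samples)]
        # Gram-Schmidt: remove what lower-order terms already explain.
        for b in basis:
            proj = sum(v * w for v, w in zip(vec, b))
            vec = [v - proj * w for v, w in zip(vec, b)]
        norm = math.sqrt(sum(v * v for v in vec))
        basis.append([v / norm for v in vec])
    return basis


def polynomial_coefficients(trace, basis):
    """Least-squares weights; a dot product suffices for an orthonormal basis."""
    return [sum(y * b for y, b in zip(trace, term)) for term in basis]


# Hypothetical pupil trace: steady dilation with slight curvature.
terms = orthogonal_time_terms(n_samples=100, order=2)
pupil = [1.0 + 2.0 * lin + 0.5 * quad for lin, quad in zip(terms[1], terms[2])]
intercept, linear, quadratic = polynomial_coefficients(pupil, terms)
```

Because the time terms are mutually orthogonal, the linear weight can be read as the overall slope of the response independently of its mean level and curvature, which is why it is a convenient single-number summary of pupillary change over time.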

Fixed-effect estimates for model of pupillary responses from 0 to 5.8 seconds time-locked to babble masker onset to examine the effect of SNR and age group (observations = 96,612, groups: participant x SNR = 332, participant = 63)

Fixed-effect estimates for model of pupillary responses from 0 to 3 seconds time-locked to QuickSIN target sentence offset to examine the effect of SNR and age group (observations = 63,184, groups: participant x SNR = 359, participant = 63)

Pupil-indexed listening effort measured during listening was modulated by task difficulty, with pupil diameters showing a larger increase at more challenging SNRs (Fig. 3D). Both younger and middle-aged adults showed increases in pupil-indexed effort prior to overt decreases in behavioral performance (Fig. 3E). While MAs exhibited larger increases in listening effort compared to YA, this change was not statistically significant (Fig. 3E, Supp. Table 9). Trends seen in the pupillary responses for the listening window were further amplified in the integration window (Fig. 3F). Pupillary slopes obtained from the GCA increased with task difficulty for both YA and MA. However, middle-aged adults showed a larger increase in listening effort than younger adults with decreasing SNRs, with significant age group listening effort differences at 10dB SNR, even though behavioral performance was matched (Figure 3G, Supp. Table 10). These results suggest that middle-aged adults may maintain comparable performance to younger listeners at moderate task difficulty but at the cost of greater listening effort.

Pupil-indexed listening effort and CND provide synergistic contributions to speech in noise intelligibility

We sought to understand the relationships between CND, listening effort, and speech-in-noise intelligibility in normal-hearing middle-aged adults. Behavioral performance in QuickSIN at 0dB SNR, where there was a group effect of age, was significantly correlated with putative CND assessed using EFRs at 1024 Hz (Fig. 4A). This suggests that peripheral deafferentation may manifest as overt behavioral deficits under the most challenging listening conditions. Pupil-indexed listening effort was also greater in the integration window in middle-aged adults at 10dB SNR compared to younger adults (Fig. 3G), even though behavioral performance was near ceiling for both age groups. Pupillary slopes at 10dB SNR in the integration window were correlated with behavioral deficits at 0 dB SNR (Fig. 4B). These results add to the growing evidence suggesting that pupil-indexed listening effort to maintain behavioral performance at moderate task difficulties is predictive of behavioral performance at more challenging listening conditions (48). Pupillary slopes in the listening window were also significantly correlated with behavioral deficits at 0 dB SNR, even though there were no group level differences with age (Fig. 4C). These data suggest that CND and increased listening effort are both associated with listening challenges in middle-aged adults.

Listening effort and CND provide complementary contributions to speech in noise intelligibility.

(A) Behavioral performance at the most challenging SNR was significantly correlated with the EFR measures of CND, with lower EFR amplitudes being associated with poorer behavioral performance. (B) Pupillary responses at 10 dB SNR from the integration window were significantly correlated with behavioral performance at 0dB SNR. (C) These correlations between pupillary responses at 10 dB SNR and behavioral performance at 0dB SNR were also found in the listening window, even though there were no group differences in age, further strengthening the link between listening effort at moderate SNRs and behavioral performance at challenging SNRs. (D) An elastic net regression model with 10-fold cross validation (cv) was fit to the QuickSIN scores at 0dB SNR. The tuning parameter Lambda controls the extent to which coefficients contributing least to predictive accuracy are suppressed. (E) A lollipop plot displaying the coefficients (β) contributing to explaining variance on QuickSIN performance suggests that CND, listening effort and subclinical changes in hearing thresholds all contribute to QuickSIN performance. (F) QuickSIN scores predicted by the elastic net regression are correlated with actual participant QuickSIN scores.

Is the increase in listening effort synergistic with CND? To understand the multifactorial contributions of sensory and top-down factors that may affect speech perception in noise, we performed a penalized regression with elastic net penalty (49). QuickSIN performance at 0dB SNR (scaled to 0-100) was used as the outcome variable and all other measured variables were entered as predictors. The elastic net penalized regression framework is a robust method that blends Lasso’s ability to perform variable selection and Ridge’s ability to handle multicollinearity and grouped covariates. The fitted elastic net regression model showed an R² value of 0.5981, and five significant predictors – hearing thresholds averaged across 500Hz to 4kHz (PTA4k), EFR amplitudes at 1024Hz AM, pupillary slopes at 10dB SNR and 0 dB SNR in the listening window, and pupillary slopes at 10dB SNR in the integration window (Fig. 4D-E). This model was significantly related to QuickSIN performance and predicted the observed QuickSIN scores across younger and middle-aged adults (r = 0.64/(pseudo-)R² = 0.41, Fig. 4F). Hence, the output of the elastic net regression suggests that CND and pupil-indexed listening effort, in addition to subclinical changes in hearing thresholds, all provided complementary contributions to speech perception in noise.
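The way the elastic net penalty blends Lasso-style selection with Ridge-style shrinkage can be seen in a small coordinate-descent sketch. This is an illustration under stated assumptions, not the model fitted in the study: the data are synthetic, the penalty strength `lam` and mixing weight `alpha` are fixed by hand rather than chosen by 10-fold cross-validation, and the function names are hypothetical.

```python
import random


def soft_threshold(z, gamma):
    """Lasso shrinkage operator: move z toward zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def standardize(column):
    """Center a column and scale it to unit variance."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]


def elastic_net(columns, y, lam, alpha, n_iter=200):
    """Coordinate descent on standardized predictors for the objective
    (1/2n)*||y - X b||^2 + lam*(alpha*|b|_1 + (1 - alpha)/2*||b||^2)."""
    n = len(y)
    cols = [standardize(c) for c in columns]
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # all coefficients start at zero
    beta = [0.0] * len(cols)
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            # Covariance of predictor j with the partial residual.
            rho = sum(x * (r + beta[j] * x) for x, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
            residual = [r - (new - beta[j]) * x for x, r in zip(col, residual)]
            beta[j] = new
    return beta


# Synthetic data: only the first two of four predictors drive the outcome.
rng = random.Random(0)
n_obs = 60
X = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(4)]
y = [3 * a - 2 * b + rng.gauss(0, 0.1) for a, b in zip(X[0], X[1])]
beta = elastic_net(X, y, lam=0.1, alpha=0.5)
```

The soft-threshold step is the Lasso part, which can set weak coefficients to exactly zero, while the division by `1 + lam * (1 - alpha)` is the Ridge part, which shrinks the remaining coefficients and stabilizes them when predictors are correlated.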

Discussion

Middle-age, typically defined as the fifth and sixth decade of life, has been historically understudied compared to older age ranges (50). Increasing evidence suggests that middle-age is a critical period of rapid changes in brain function (51, 52). The resilience of the brain in keeping up with degenerative processes that begin to occur in middle-age predicts further age-related degeneration in later life and presents a critical opportunity for early intervention (50, 53–55). Hearing loss in middle-age has recently been identified as the largest modifiable risk factor for dementia and Alzheimer’s disease later in life (2). However, the number of middle-aged patients who seek help for hearing difficulties but show no abnormal clinical indicators suggests the need for the development of sensitive biomarkers for hearing challenges experienced by this population (3, 5, 6, 56).

Anatomical evidence from human temporal bones suggests a 40% deafferentation of cochlear synapses in middle-aged adults, even without substantial noise exposure history (9–11). Peripheral deafferentation triggers compensatory mechanisms across sensory, language, and attentional systems (57–60). But our understanding of the perceptual consequences of cochlear deafferentation is limited by the lack of consensus on sensitive biomarkers for CND (61). Recent studies have identified multiple promising biomarkers for CND in animal models and human populations (21, 37, 62). Reduced wave I amplitudes in the auditory brainstem response are a reliable marker of CND in animal models (7, 12, 63) but can be challenging to obtain in humans (20, 62). The middle-ear muscle reflex, an acoustic measurement of middle-ear immittance driven by efferent feedback to the middle-ear muscles, has also been identified as a promising marker for CND (19, 21, 64). Here, we used the EFR to identify CND in middle-aged adults with normal audiometric thresholds. As opposed to the middle-ear muscle reflex, EFRs measure peripheral neural coding and central auditory activity by exploiting the divergent phase-locking abilities of the ascending auditory pathway (65, 66). EFRs with modulation rates greater than ∼1000 Hz have been associated with CND and are considered to reflect the integrity of the auditory nerve (12, 14), given that midbrain and cortical neurons cannot phase-lock to such high rates (65). We observed decreases in EFRs at modulation rates that were selective to the auditory periphery (i.e., 1024 Hz) in middle-aged adults, while EFRs at slower modulation rates, likely generated from the central auditory structures, were not different from those in younger adults (Fig. 1K).
The use of a more rapid onset time in the stimulus modulation envelope, such as the rectangular amplitude modulated tones (RAM EFRs), may result in a larger separation of these groups even at slower modulation rates (67, 68), as sharper onset times result in greater EFR amplitudes (37, 69). However, a more intriguing possibility is that middle-aged adults exhibited an increase in relative central auditory activity, or ‘gain’, in the presence of decreased peripheral neural coding (57, 59). The perceptual consequences of this gain are unclear, but our findings align with emerging evidence suggesting that gain is associated with selective deficits in speech-in-noise abilities (59, 70, 71). EFRs at suprathreshold levels presented here also have contributions from higher frequency regions due to a broader excitation at the cochlea (72, 73). Since cochlear synapse loss is also believed to be flat across frequencies with age, EFRs used here likely index cochlear synapse loss equally across a broad range of frequencies (9, 12, 63). This notion is further supported by emerging evidence that suggests that phase-locking measured to lower frequency pure tones also indexes cochlear synaptopathy in ways that are similar to using a faster modulation rate on a higher frequency tone (74, 75).

The Mongolian gerbil provides a robust model for cross-species comparisons with aging humans, due to overlapping hearing frequency ranges and experimentally tractable lifespans. Here, using young and middle-aged gerbils, we showed similar EFR decreases as seen in human listeners (Fig. 2C). Additionally, age-related changes in the EFR were associated with confirmed CND (Fig. 2F). CND in gerbils reached ∼20% in the middle-aged 80 week group tested here, which is less than what has been observed in middle-aged humans, where CND estimates typically reach 40-50% by the fifth decade of life (9). However, our EFRs were still sensitive to this degree of CND, reiterating that EFRs are a sensitive metric for measuring cochlear deafferentation. Additionally, we confirmed that the gerbils used in this study did not show any changes in hearing thresholds (Fig. 2B). Hence, they were unlikely to have strial degenerations that are known to occur in older gerbils that affect auditory thresholds (76). The synapse loss patterns and EFR amplitude changes seen here in gerbils were in agreement with earlier studies using alternate rodent models (12, 14, 69), further confirming that age-related cochlear synapse loss is a pervasive mammalian phenomenon that can be captured using EFRs to rapid modulation frequencies (∼1000 Hz).

Strong evidence links CND with altered neural coding of sounds in multiple ascending auditory stations (12, 58, 59). However, the perceptual consequences of CND on speech-in-noise abilities remain unclear (61). Evidence for overt behavioral deficits has been mixed and may depend on the specific type of task used for assessment (17, 23). Here we used QuickSIN, a clinically relevant test that we recently identified as being sensitive to changes in adult normal hearing populations with perceived hearing deficits (5). However, tests with greater spectrotemporal complexity, such as those adding time compression or reverberation, may further tease apart these differences (17, 37). In the current study, behavioral deficits began to emerge only at the most challenging SNR levels (Fig. 3). However, perceptual deficits in terms of listening effort began to appear prior to behavioral changes.

Listening effort is an umbrella term that may assess multiple forms of executive function such as cognitive resource allocation, working memory, and attention, and can be assessed by measuring isoluminous task-related changes in pupil diameter (26, 41–43, 77). The mechanisms underlying these pupillary changes are still under study (78, 79) but are hypothesized to involve the Locus Coeruleus – Norepinephrine (LC-NE) system (80, 81). Here, we observed that pupil-indexed listening effort increased in middle-aged adults, even when behavioral performance was matched (Fig. 3E, F). This suggests that middle-aged adults expend more effort to maintain behavioral performance, which may lead to more listening fatigue or disengagement from conversations (25, 82, 83). Potentially confounding factors impacting pupil measurement such as the decrease of pupil dynamic range with aging (84, 85), participant fatigue, or task habituation (45, 77, 86), can vary between individuals for a multitude of reasons (87). Here, the effects of these factors were minimized by applying trial-by-trial baseline corrections prior to analysis to match the magnitude of response between young and middle-aged adults.
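A trial-by-trial baseline correction of the kind mentioned above can be sketched as follows. The passage does not specify whether the correction was subtractive or divisive, nor the length of the baseline window; the subtractive form, the three-sample baseline, and the example values below are assumptions made purely for illustration.

```python
def baseline_correct(trial, n_baseline):
    """Subtract the mean of the first `n_baseline` samples from a trial."""
    baseline = sum(trial[:n_baseline]) / n_baseline
    return [sample - baseline for sample in trial]


# Two hypothetical trials with different resting pupil sizes (arbitrary units)
# but the same task-evoked dilation of 0.5.
young_trial = [4.0, 4.0, 4.0, 4.25, 4.5]
older_trial = [3.0, 3.0, 3.0, 3.25, 3.5]
corrected = [baseline_correct(t, n_baseline=3) for t in (young_trial, older_trial)]
```

After correction, both trials start at zero and differ only in their task-evoked change, so a smaller resting pupil in one group does not by itself appear as a difference in effort.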

Interestingly, pupil-indexed listening effort at a moderate SNR was a better predictor of behavioral performance at a more challenging SNR using two separate approaches – a Pearson’s correlation and the elastic net regression model (Fig. 4B-D). We have previously demonstrated similar results in a different test group of young adult participants (48). These results suggest that the amount of effort required to maintain ceiling performance at moderate SNRs is predictive of behavioral performance at harder task difficulties. Pupillary indices at the harder task conditions may be rolling over into hyperexcitability (78, 79) and thus be a poorer predictor of concomitant behavioral performance. Additionally, our elastic net regression model suggested that CND and listening effort provided complementary contributions to explaining variance on the QuickSIN task.

Even though both young and middle-aged adults had clinically normal hearing thresholds, subtle changes within this normal range affected speech-in-noise performance (Fig. 4D), lending support to studies suggesting that the definition of clinically ‘normal’ may need revision (3, 88). Our findings demonstrate a need for next-generation diagnostic measures of auditory processing that incorporate both neurophysiological encoding of the temporal elements of sound and cognitive factors associated with listening effort to better capture one’s listening abilities. Future studies will directly test the link between cochlear and peripheral neural deficits and listening effort, and explore further contributions of other top-down mechanisms that may influence listening effort such as selective attention or semantic load (89, 90).

Methods

Humans

Participants

Recruitment

Young (n = 38; 18-25 years old, male = 10) and middle-aged (n = 45; 40-55 years old, male = 16) adult participants were recruited from the University of Pittsburgh Pitt + Me research participant registry, the University of Pittsburgh Department of Communication Science and Disorders research participant pool, and the broader community under a protocol approved by the University of Pittsburgh Institutional Review Board (IRB#21040125). Participants were compensated for their time and travel and were given an additional monetary incentive for completing all study sessions.

Eligibility

Participant eligibility was determined during the first session of the study. Eligible participants had normal cognition determined by the Montreal Cognitive Assessment (MoCA ≥ 25; Nasreddine et al., 2005), normal hearing thresholds (≤ 25 dB HL from 250 to 8000 Hz), no severe tinnitus as self-reported via the Tinnitus Handicap Inventory (THI; 32), and Loudness Discomfort Levels (LDLs) ≥ 80 dB HL at 0.5, 1, and 3 kHz (33). Participants were not required to have specific complaints of speech perception in noise difficulties. The Beck Depression Inventory (BDI; 91) was administered, and participants were excluded if they reported thoughts of self-harm, determined by any response greater than 0 to survey item nine. Participants self-reported American English fluency. Thirty-five young (18-25 years old, male = 10) and 37 middle-aged participants (40-55 years old, male = 10) met all eligibility criteria and were tested further using the battery described below.

Audiological assessment

Otoscopy

An otoscopic examination was conducted using a Welch Allyn otoscope to examine the participant’s external auditory canal, tympanic membrane, and middle ear space for excess cerumen, ear drainage, and other abnormalities. The presence of any such abnormality resulted in exclusion from the study, as these may lead to a conductive hearing loss.

Audiogram

Hearing thresholds were collected inside a sound attenuating booth using a MADSEN Astera2 audiometer, Otometrics transducers [Natus Medical, Inc. Middleton, WI], and foam insert eartips sized to the participants’ ear canal width. Tones were presented using a pulsed beat and participants were instructed to press a response plunger if they believed that they perceived a tone being played, even if they were unsure. Extended high frequency hearing thresholds (EHFs) were collected at frequencies 8, 12.5, and 16kHz using Sennheiser circumaural headphones and Sennheiser HDA 300 transducers using the same response instructions.

Loudness Discomfort Levels (LDLs)

LDLs were collected binaurally using Otometrics transducers [Natus Medical, Inc., Middleton, WI] and foam tip ear inserts. Warble tones were presented, and participants were instructed to rate the loudness on a scale of one to seven, with seven being so loud that they would leave the room.

Distortion Product Otoacoustic Emissions (DPOAEs)

Outer hair cell function was assessed using DPOAEs. DPOAEs were collected from the right and left ears individually, with a starting frequency of 500 Hz and an ending frequency of 16 kHz. The stimulus had an L1 of 75 dB SPL and an L2 of 65 dB SPL and was presented in 8 blocks of 24 sweeps in alternating polarities. Responses were collected using rubber ear inserts sized to participants’ ear canal width and an ER-10D DPOAE probe transducer [Etymotic Research Inc., Elk Grove, IL].

Noise Exposure History

Participants completed the Noise Exposure Questionnaire (NEQ; 34) as a self-reported assay of annual noise exposure, accounting for both occupational and non-occupational sources. Annual noise exposure was expressed as LAeq8760h, the equivalent continuous A-weighted sound level, in dB, over the 8760 hours in a year. Calculation of the LAeq8760h followed the original article (34).

OSPAN

Participants also completed the automated version of the OSPAN task (92) as a metric of working memory (36). Participants were shown simple arithmetic problems and asked to decide whether presented solutions to the problems were correct or incorrect. A letter was displayed on the screen after each problem. Following a series of arithmetic-letter presentations, participants were required to recall the letters that were displayed in the order that they appeared. The task consisted of 15 letter sequences that spanned three to seven letters (three repetitions of each span). If a participant correctly recalled all letters from a sequence, the span length was added to their score. The maximum possible score on the OSPAN task was 75.
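The scoring rule described above can be illustrated with a minimal sketch. It assumes all-or-nothing credit per sequence, as stated in the text; the function name and the letter set are hypothetical and are not taken from the task software.

```python
def ospan_absolute_score(trials):
    """Sum the span lengths of sequences recalled perfectly and in order.

    trials: list of (presented_letters, recalled_letters) pairs.
    """
    score = 0
    for presented, recalled in trials:
        # Credit the full span length only when every letter matches in order.
        if list(presented) == list(recalled):
            score += len(presented)
    return score


# Hypothetical session: three repetitions of each span from three to seven.
spans = [3, 4, 5, 6, 7] * 3
letters = "FHJKLNPQRSTY"  # illustrative letter pool, an assumption
perfect_session = [(letters[:n], letters[:n]) for n in spans]

print(ospan_absolute_score(perfect_session))  # 75
```

A perfect session therefore yields 3 × (3 + 4 + 5 + 6 + 7) = 75, which matches the maximum score given in the text.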

Speech perception in noise

Sentence-level speech perception in noise

Speech perception in noise was indexed using moderate-predictability sentences masked in multitalker babble at six different signal-to-noise ratios (SNR) from the Quick Speech in Noise test (QuickSIN; 40). QuickSIN is a standardized measure of speech perception in noise that is commonly used in audiology clinics and is representative of a naturalistic listening environment (93). Each QuickSIN test list consisted of six sentences masked in four-talker babble at the following SNR levels: 25, 20, 15, 10, 5, and 0 dB. All participants completed four test lists. Participants listened to the sentences through Sennheiser circumaural headphones. The masker was presented at 60 dB SPL, and the sound level of the target sentences was varied to obtain the required SNR level. Participants were instructed to repeat the target sentence to the best of their ability. Each target sentence contained five keywords for identification. The number of keywords identified per sentence was recorded. Then, the proportion of keywords correctly identified for each SNR across all four test lists (20 total keywords per SNR) was calculated for each participant (40, 94). In addition, we calculated the standard clinical QuickSIN score of dB SNR loss, which reflects the lowest SNR level at which an individual can accurately identify words 50% of the time. For each participant, the dB SNR loss score was calculated for each test list separately using the following equation: dB SNR loss = 25.5 − (sum of keywords identified in list) (40). Then, the mean dB SNR loss across all four test lists was calculated and used for analysis.
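As an illustration of the two QuickSIN outcomes described above, the following sketch computes the proportion of keywords correct at each SNR and the mean dB SNR loss. The keyword counts are invented for the example, and the function names are hypothetical.

```python
SNR_LEVELS = [25, 20, 15, 10, 5, 0]  # dB; one sentence per level in each list
KEYWORDS_PER_SENTENCE = 5


def snr_loss(list_keywords):
    """dB SNR loss for one list: 25.5 minus the total keywords identified."""
    return 25.5 - sum(list_keywords)


def score_quicksin(lists):
    """lists: one entry per test list, each holding six keyword counts
    (0-5) ordered as in SNR_LEVELS."""
    n_lists = len(lists)
    proportion = {
        snr: sum(lst[i] for lst in lists) / (KEYWORDS_PER_SENTENCE * n_lists)
        for i, snr in enumerate(SNR_LEVELS)
    }
    mean_loss = sum(snr_loss(lst) for lst in lists) / n_lists
    return proportion, mean_loss


# Invented keyword counts for one participant across four lists.
example_lists = [
    [5, 5, 5, 4, 3, 1],
    [5, 5, 4, 4, 2, 0],
    [5, 5, 5, 5, 3, 1],
    [5, 5, 5, 3, 2, 1],
]
proportion, mean_loss = score_quicksin(example_lists)
print(proportion[5], mean_loss)  # 0.5 3.5
```

In this invented example the four lists give SNR losses of 2.5, 5.5, 1.5, and 4.5 dB, so the mean entered into analysis would be 3.5 dB.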

Pupillometry

Acquisition

Pupillary responses were recorded while participants completed the QuickSIN task. Participants were seated in a testing room with consistent, moderate ambient lighting facing a monitor. Monocular left-eye pupillary responses were recorded at a 1000 Hz sampling rate using an EyeLink 1000 Plus Desktop Mount camera and chin rest (SR Research). Nine-point eye-tracker calibration was performed prior to the start of the experiment. To start each trial, participants were required to fixate on a cross in the center of the screen for a minimum of 500 ms. This fixation criterion was applied to control for the effects of saccades, which can alter pupil diameter, and to minimize pupil foreshortening errors (95–97). Once the 500 ms fixation criterion was met, a 100 ms 1000 Hz beep was presented to alert the participant to the start of the trial. There was a two second delay after the beep before the QuickSIN stimulus was presented. The background masker began three seconds before the target sentence and continued for two seconds after the target sentence. After the end of the background masker, there was a two second delay followed by a 100 ms 1000 Hz beep to signal the start of the verbal response period. Manual drift correction was performed at the end of each trial by the experimenter to ensure high quality tracking of the pupil.

Preprocessing

Pupillary data were processed in R (98) using the eyelinker package (99) and custom written scripts. Pupillary responses were analyzed in two windows of interest: 1) listening window, from multi-talker babble onset through 5800 ms, and 2) integration window, from target sentence offset to 1000 ms prior to behavioral response period. Separately for each window of interest, data were first processed to remove noise from blinks and saccades. Any trial with more than fifteen percent of the samples detected as saccades or blinks was removed. For the remaining trials, blinks were linearly interpolated from 60 ms before to 160 ms after the detected blinks. Saccades were linearly interpolated from 60 ms before to 60 ms after any detected saccade. The de-blinked data were then downsampled to 50 Hz. Pupillary responses were baseline corrected and normalized on a trial-by-trial basis to account for a downward drift in baseline that can occur across a task and for individual differences in pupil dynamic range (96). Baseline pupil size was defined as the average pupil size in the 1000 ms period prior to the start of the window of interest. The pupillary response was then averaged across all four test lists for each SNR per participant in each window of interest. The outcome reported is percent change in pupil size from baseline.
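The preprocessing steps above were implemented in R; the following Python sketch only illustrates the logic under stated assumptions. It assumes one sample per millisecond (so padding is given in samples), boolean blink or saccade flags per sample, and a percent-change normalization of the form 100 × (pupil − baseline) / baseline. All function names are hypothetical.

```python
def keep_trial(flags, max_fraction=0.15):
    """Keep a trial unless more than 15% of samples are blinks or saccades."""
    return sum(flags) / len(flags) <= max_fraction


def pad_mask(flags, before, after):
    """Extend each flagged sample `before` samples earlier and `after` later."""
    n = len(flags)
    padded = [False] * n
    for i, flagged in enumerate(flags):
        if flagged:
            for j in range(max(0, i - before), min(n, i + after + 1)):
                padded[j] = True
    return padded


def interpolate(trace, mask):
    """Linearly interpolate masked runs between the nearest valid samples."""
    out = list(trace)
    n = len(out)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        start = i
        while i < n and mask[i]:
            i += 1
        left = out[start - 1] if start > 0 else None
        right = out[i] if i < n else None
        if left is None and right is None:
            break  # whole trace masked, nothing to anchor the line
        if left is None:
            left = right  # run touches the start: hold the first valid value
        if right is None:
            right = left  # run touches the end: hold the last valid value
        gap = i - start + 1
        for k in range(start, i):
            out[k] = left + (right - left) * (k - start + 1) / gap
    return out


def percent_change(window, baseline):
    """Percent change from the mean of the baseline samples."""
    base = sum(baseline) / len(baseline)
    return [100.0 * (x - base) / base for x in window]


# Blinks would use pad_mask(flags, 60, 160) and saccades pad_mask(flags, 60, 60)
# at 1000 Hz; a tiny toy trace is used here for readability.
trace = [1.0, 1.0, 0.0, 0.0, 0.0, 4.0]
mask = [False, False, True, True, True, False]
print(interpolate(trace, mask))  # [1.0, 1.0, 1.75, 2.5, 3.25, 4.0]
```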

Growth curve analyses (GCA; Mirman, 2014) were used to obtain a measure of the slope of the pupillary response during QuickSIN. GCA uses orthogonal polynomial time terms to model distinct functional forms of the pupillary response over time. Two GCAs were fit using a second-order orthogonal polynomial to model the interaction of age group with SNR level, separately for the listening window and the integration window. This second-order model provides three parameters to explain the pupillary response. The first is the intercept, which refers to the overall change in the pupillary response over the time-window of interest. The second is the linear term (ot1), which represents the slope of the pupillary response over time, or the rate of dilation. The third is the quadratic term (ot2), representing curvature of the pupil response, or the change in rate of the pupillary response over time. GCAs were conducted in R (R Core Team, 2022) using the lme4 package (100) and p-values were estimated using the lmerTest package (101).
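To make the orthogonal time terms concrete, the sketch below builds ot1 and ot2 for a set of equally spaced time bins by Gram-Schmidt orthogonalization against a constant term. The text does not state how the terms were constructed, so the unit-length scaling used here is an assumption, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonal_time_terms(n_bins, degree=2):
    """Return [ot1, ot2, ...]: polynomial time terms that are orthogonal to
    the intercept and to each other, each scaled to unit length."""
    t = [float(i) for i in range(n_bins)]
    basis = [[1.0] * n_bins]  # constant term, used only for orthogonalization
    for d in range(1, degree + 1):
        v = [x ** d for x in t]
        for b in basis:
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis[1:]


ot1, ot2 = orthogonal_time_terms(5)
print([round(x, 3) for x in ot1])  # [-0.632, -0.316, 0.0, 0.316, 0.632]
print([round(x, 3) for x in ot2])  # [0.535, -0.267, -0.535, -0.267, 0.535]
```

Because ot1 and ot2 are uncorrelated with each other and with the intercept, the linear and quadratic coefficients can be interpreted independently, which is why the slope (ot1) can be read as the rate of dilation.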

For the listening window, the best-fit GCA model included fixed effects of each time term (ot1, ot2), SNR (reference = 25), Group (reference = younger), and all 2- and 3-way interactions between SNR, Group, and time terms. The random effect structure consisted of a random slope of each time term per participant that removed the correlation between random effects, and a random slope of each time term per the interaction of participant and SNR level.

For the integration window, the best-fit GCA model included fixed effects of each time term (ot1, ot2), SNR (reference = 25), Group (reference = younger), and all 2- and 3-way interactions between SNR, Group, and time terms. The random effect structure consisted of a random slope of each time term per participant, and a random slope of each time term per the interaction of participant and SNR level.

Electrophysiology

Envelope Following Responses (EFRs)

EFRs were collected in a sound attenuating booth using a BioSemi ActiveTwo EEG system while participants were seated in a recliner. Stimuli were presented using ER-3C transducers [Etymotic Research Inc., Elk Grove, IL] with gold-foil tiptrodes placed in the ear canals to deliver sound stimuli and record additional channels of evoked potentials. EFRs were recorded to a 250 ms tone with a carrier frequency of 3000Hz, amplitude modulated (AM) at 40, 110, 512, and 1024Hz. Stimuli were presented in alternating polarity, with 500 repetitions each at 85dB SPL to the right ear. Each token was presented at 3.1 repetitions/second, for a period of 322ms.

Preprocessing

EFRs from the Fz to the ipsilateral (right) tiptrode were processed and analyzed using custom written scripts in MATLAB v. 2022a (Mathworks Inc., Natick, Massachusetts). EFRs were processed using a fourth-order Butterworth filter with a lowpass filter of 3000 Hz. The highpass filter cutoffs used were 5 Hz, 80 Hz, 200 Hz, and 300 Hz for the 40 Hz, 110 Hz, 512 Hz, and 1024 Hz AM stimuli, respectively. Fast Fourier transforms (FFTs) were performed on the averaged time domain waveforms for each participant at each AM rate starting 10 ms after stimulus onset to exclude auditory brainstem responses (ABRs) and ending 10 ms after stimulus offset. The maximum amplitude of the FFT peak at one of three adjacent bins (∼3 Hz) around the modulation frequency of the AM rate was reported as the EFR amplitude.
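The peak-picking step can be sketched as follows. This is not the authors' MATLAB code: the filtering stage is omitted, the sampling rate (8192 Hz) and window length are hypothetical values chosen so that the modulation frequency falls on an exact bin, and a direct discrete Fourier transform is evaluated only at the three bins of interest.

```python
import cmath
import math


def dft_amplitude(x, k):
    """Single-sided amplitude of DFT bin k for a real-valued signal x."""
    n = len(x)
    acc = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
    return 2.0 * abs(acc) / n


def efr_amplitude(x, fs, mod_freq):
    """Largest amplitude among the three bins centred on mod_freq."""
    n = len(x)
    centre = round(mod_freq * n / fs)
    return max(dft_amplitude(x, k) for k in (centre - 1, centre, centre + 1))


# Synthetic averaged waveform: a 512 Hz component of amplitude 0.5
# (arbitrary units) sampled for 250 ms at a hypothetical 8192 Hz.
fs = 8192
waveform = [0.5 * math.sin(2 * math.pi * 512 * i / fs) for i in range(2048)]
print(round(efr_amplitude(waveform, fs, 512), 6))  # 0.5
```

With these hypothetical values the bin spacing is 4 Hz; the ∼3 Hz spacing reported in the text would depend on the actual sampling rate and FFT length, which are not stated.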

Animals

Subjects

Fourteen young adult Mongolian gerbils aged 18-27 weeks (male = 9) and thirteen middle-aged Mongolian gerbils aged 75-82 weeks (male = 6) were used in this study. All animals were born and raised in our animal care facility from breeders obtained from Charles River. The acoustic environment within the holding facility was characterized by noise-level data logging and was periodically monitored. Data logging revealed an average noise level of 56 dB, with transients not exceeding 74 dB during regular housing conditions and 88 dB once a week during cage changes. All animal procedures were approved by the Institutional Animal Care and Use Committee of the University of Pittsburgh (Protocol #21046600).

Experimental Setup

Experiments were performed in a double walled acoustic chamber. Animals were placed on a water circulated warming blanket set to 37 °C with the pump placed outside the recording chamber to eliminate audio and electrical interferences. Gerbils were initially anesthetized with isoflurane gas anesthesia (4%) in an induction chamber. The animals were transferred post induction to a manifold and maintained at 1%–1.5% isoflurane. Subdermal electrodes (Ambu) were then placed on the animals’ scalps for the recordings. A positive electrode was placed along the vertex. The negative electrode was placed under the ipsilateral ear, along the mastoid, while the ground electrode was placed in the base of the tail. Impedances from the electrodes were always less than 1 kΩ as tested using the head-stage (RA4LI, Tucker Davis Technologies (TDT)). The average duration of isoflurane anesthesia during the electrode setup process was approximately 10 min. After placing electrodes, animals were injected with dexmedetomidine (Dexdomitor, 0.3 mg/kg subdermal) and taken off the isoflurane. Dexmedetomidine is an alpha-adrenergic agonist that acts as a sedative and an analgesic and is known to decrease motivation but preserve behavioral and neural responses in rodents (102, 103). This helps to maintain animals in an un-anesthetized state, where they still respond to pain stimuli, such as a foot pinch, but are otherwise compliant to recordings for a period of about 3 hours. The time window for the effects of isoflurane to wear off was determined empirically as 9 minutes, based on ABR waveforms and latencies, as well as the response to foot pinch stimuli. Recordings then commenced 15 minutes after cessation of isoflurane.

Stimulus presentation, acquisition, and analysis

Stimuli were presented to the right ear of the animal using insert earphones (ER3C, Etymotic), which matched the stimulus presentation in humans. Stimulus presentation and acquisition were done by a custom program for gerbils in LabView. The output from the insert earphones was calibrated using a Brüel & Kjær microphone and was found to be within ±6 dB for the frequency range tested. Digitized waveforms were recorded with a multichannel recording and stimulation system (RZ-6, TDT) and analyzed with custom written programs in MATLAB (Mathworks).

Hearing thresholds were obtained using ABRs elicited by tone stimuli that were 5 ms long, with a 2.5 ms on and off ramp, at 27.1 repetitions per second. ABRs were filtered from 300Hz to 30000Hz, and thresholds were determined as the minimum sound level that produced a response as assessed using visual inspection by two blinded, trained observers.

EFRs were elicited to sinusoidally AM tones (5 ms rise/fall, 250 ms duration, 3.1 repetitions/s, alternating polarity) at a 3 kHz carrier frequency presented 30 dB above auditory thresholds obtained using ABRs at 3 kHz. The modulation frequency was systematically varied from 16 Hz to 1024 Hz AM. Responses were amplified (×10,000; TDT Medusa 4z amplifier) and filtered (0.1–3 kHz). Trials in which the response amplitude exceeded 200 μV were rejected. 250 artifact-free trials of each polarity were averaged to compute the EFR waveform. FFTs were performed on the averaged time-domain waveforms starting 10 ms after stimulus onset to exclude ABRs and ending at stimulus offset using custom-written programs in MATLAB (MathWorks). The maximum amplitude of the FFT peak at 1 of 3 frequency bins (∼3 Hz each) around the modulation frequency was recorded as the peak FFT amplitude. The FFT amplitude at the AM frequency was reported as the EFR amplitude. The noise floor of the EFR was calculated as the average of 5 frequency bins (∼3 Hz each) above and below the central three bins. A response was deemed as significantly above the noise floor if the FFT amplitude was at least 6 dB greater than the noise floor.
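The noise-floor criterion can be sketched as follows, given an amplitude spectrum and the index of the bin nearest the modulation frequency. Two points are assumptions rather than details from the text: the noise floor is taken as the mean over all ten flanking bins (five on each side of the central three), and the 6 dB criterion is applied to the amplitude ratio as 20·log10(peak/noise). The function name and spectrum values are hypothetical.

```python
import math


def efr_with_noise_floor(spectrum, centre, n_side=5, criterion_db=6.0):
    """Return (peak, noise_floor, is_significant) for one modulation rate.

    spectrum: list of FFT amplitudes, one per frequency bin.
    centre:   index of the bin closest to the modulation frequency.
    """
    # Peak: largest of the three bins around the modulation frequency.
    peak = max(spectrum[centre - 1:centre + 2])
    # Noise floor: mean of n_side bins on each side of the central three.
    below = spectrum[centre - 1 - n_side:centre - 1]
    above = spectrum[centre + 2:centre + 2 + n_side]
    noise = sum(below + above) / (2 * n_side)
    snr_db = 20.0 * math.log10(peak / noise)
    return peak, noise, snr_db >= criterion_db


# Invented spectra: flat noise of 0.1 with a peak at bin 10.
strong = [0.1] * 20
strong[10] = 0.25  # about 8 dB above the noise floor
weak = [0.1] * 20
weak[10] = 0.15    # about 3.5 dB above the noise floor

print(efr_with_noise_floor(strong, 10)[2])  # True
print(efr_with_noise_floor(weak, 10)[2])    # False
```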

Immunohistology

Animals were transcardially perfused using a 4% paraformaldehyde solution (Sigma-Aldrich, 441244) for approximately five minutes before decapitation and isolation of the right and left cochleae. Following intra-labyrinthine perfusion with 4% paraformaldehyde, cochleae were stored in paraformaldehyde for one hour. Cochleae were decalcified in EDTA (Fisher Scientific, BP120500) for 3 to 5 days, followed by cryoprotection with sucrose (Fisher Scientific, D16500) and flash freezing. All chemicals were of reagent grade. Cochleae were thawed prior to dissection, then dissected in PBS solution. Immunostaining was accomplished by incubation with the following primary antibodies: 1) mouse anti-CtBP2 (BD Biosciences) at 1:200, 2) mouse anti-GluA2 (Millipore) at 1:2000, 3) rabbit anti-myosin VIIa (Proteus Biosciences) at 1:200; followed by incubation with secondary antibodies coupled to AlexaFluors in the red, green, and blue channels. Piece lengths were measured and converted to cochlear frequency using established cochlear maps (104) and custom plugins in ImageJ. Cochlear stacks were obtained at the target frequency (3 kHz) spanning the cuticular plate to the synaptic pole of ∼10 hair cells (in 0.25 μm z-steps). Images were collected in a 1024 × 1024 raster using a high-resolution, oil-immersion objective (x60) and 1.59x digital zoom using a Nikon A1 confocal microscope. Images were denoised in NIS elements and loaded into an image-processing software platform (Imaris; Oxford Instruments), where inner hair cells were quantified based on their Myosin VIIa-stained cell bodies and CtBP2-stained nuclei. Presynaptic ribbons and postsynaptic glutamate receptor patches were counted using 3D representations of each confocal z-stack. Juxtaposed ribbons and receptor puncta constitute a synapse, and these synaptic associations were determined using Imaris workflows that calculated and displayed the x–y projection of the voxel space (12, 105).

Statistical analysis

Analysis of Variance (ANOVA)

Normality of all variables was first checked visually using Q-Q plots and statistically using the Shapiro-Wilk test with alpha = 0.05. Homogeneity of variance was assessed using Levene’s test. N-way ANOVAs were completed using R 2022.07.1 for each measure to determine statistically significant differences between groups (106). The function employed, aov, uses treatment contrasts in which the first baseline level is compared to each of the following levels. The number of factors was determined based on the conditions tested in each measure. Bonferroni corrections were used to control familywise error rate due to multiple comparisons.

Correlations

Outliers were detected using Tukey’s Fence with a boundary distance of k = 1.5 and removed. Correlations were computed using Pearson’s correlations. Degrees of freedom, r, and p-values were reported.
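A minimal sketch of this procedure follows. Two details are assumptions not stated in the text: quartiles are computed with the inclusive interpolation method of Python's `statistics.quantiles`, and an observation pair is dropped when either variable falls outside its fence. The p-value is omitted because it requires the t distribution, which is outside the standard library. Function names and data are hypothetical.

```python
import math
import statistics


def tukey_bounds(values, k=1.5):
    """Lower and upper Tukey fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def pearson_without_outliers(x, y, k=1.5):
    """Return (r, degrees_of_freedom) after removing fenced-out pairs."""
    x_lo, x_hi = tukey_bounds(x, k)
    y_lo, y_hi = tukey_bounds(y, k)
    pairs = [(a, b) for a, b in zip(x, y)
             if x_lo <= a <= x_hi and y_lo <= b <= y_hi]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy), n - 2


# Invented data: eight perfectly linear pairs plus one extreme value in x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 100]
y = [2, 4, 6, 8, 10, 12, 14, 16, 0]
r, df = pearson_without_outliers(x, y)
print(round(r, 6), df)  # 1.0 6
```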

Elastic Net Regression

We used a linear model with an elastic net penalization/regularization (49) to simultaneously estimate the underlying contributions of the various predictor variables measured in our studies and to perform model selection. This approach has been previously validated for model selection using multidimensional data related to hearing pathologies like tinnitus and hyperacusis (107). The relative strength of selection and shrinkage is controlled by the hyper-parameters $\lambda$ and $\alpha$: a higher $\lambda$ implies more stringent penalization pushing towards the null model, and $0 \le \alpha \le 1$ controls the degree of convexity and hence the amount of sparsity, with $\alpha = 0$ implying a Ridge regression with no variable selection. Elastic net is a regularized regression method that minimizes the negative log-likelihood with a penalty on the parameters that combines the $\ell_1$ (LASSO) and $\ell_2$ (Ridge) penalties, i.e. the elastic net penalty on the regression parameters $\beta$ can be written as $\mathrm{Pen}(\beta) = \lambda\left(\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right)$. An elastic net regularization has several advantages over both LASSO and Ridge as well as a simple linear model. The $\ell_1$ part of the elastic net ($\|\beta\|_1$) leads to a sparse model where some of the coefficients are shrunk to exact zeroes, thereby performing an automatic model selection without the combinatorial computational complexities of a best-subset selection approach. Further, the quadratic $\ell_2$ part ($\|\beta\|_2^2$) encourages grouped variable selection and, unlike LASSO, removes the limitation on the number of selected variables while stabilizing the selection path. To choose the tuning parameters $\lambda$ and $\alpha$, we used a 10-fold cross-validation that minimizes the out-of-sample root mean-squared error (RMSE). We used the R packages glmnet (108) and caret (109) for training the elastic net regularizer.
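The role of the two hyper-parameters can be illustrated numerically. The sketch below evaluates the penalty, reading the quadratic term as the squared ℓ2 norm (the form used by glmnet), and shows the single-coefficient update from the standard coordinate-descent formulation for standardized predictors. That update is included only to show how the ℓ1 part produces exact zeros; it is an assumption about the solver, not a description of the analysis code, and the function names are hypothetical.

```python
import math


def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


def coordinate_update(z, lam, alpha):
    """Penalized update for one standardized predictor, where z is its
    unpenalized least-squares coefficient on the partial residual."""
    shrunk = max(abs(z) - lam * alpha, 0.0)  # soft threshold from the l1 part
    return math.copysign(shrunk, z) / (1.0 + lam * (1.0 - alpha))


beta = [1.0, -2.0, 0.0]
print(elastic_net_penalty(beta, lam=0.5, alpha=1.0))  # 1.5   (pure LASSO)
print(elastic_net_penalty(beta, lam=0.5, alpha=0.0))  # 1.25  (pure Ridge)
print(elastic_net_penalty(beta, lam=0.5, alpha=0.5))  # 1.375

# A small coefficient is set exactly to zero when alpha = 1, but is only
# shrunk towards zero when alpha = 0.
print(coordinate_update(0.3, lam=0.5, alpha=1.0))  # 0.0
print(coordinate_update(0.3, lam=0.5, alpha=0.0))
```

In a cross-validated fit, a grid of candidate values for both hyper-parameters would be compared on held-out RMSE, and the pair with the lowest error retained.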

Data Availability

All data reported and analyzed in this study can be found on the Open Science Framework at http://doi.org/10.17605/OSF.IO/4BGDA

Acknowledgements

This work was supported by the National Institute on Deafness and Other Communication Disorders-National Institutes of Health Grants R21DC018882 to A.P., T32DC011499 to K. Kandler and B. Yates (Trainee: M.E.Z.) and F31DC020085 to J.R.M., and the PNC-Trees Charitable Trust (PNC to B.C. and A.P.). We thank Dr. Carl Snyderman for collaboration on the PNC-Trees grant, and Megan Hallihan, Kathryn Bergstrom, Sarah Anthony, and Shaina Wasileski for their assistance with participant recruitment and data collection. Thanks also to Dr. Simon Warkins, Katherine Helfrich and Mike Calderon at the Center for Biological Imaging at the University of Pittsburgh, supported by NIH grant 1S10RR028478-01, for collaboration on confocal imaging, and the Clinical and Translational Science Institute at the University of Pittsburgh, supported by the NIH Clinical and Translational Science Award (CTSA) program, grant UL1 TR001857, for assistance with participant recruitment.

Additional information

Author Contributions

Conceptualization: AP; Methodology: AP, BC, JM, JD; Data collection: MEZ, JK, KY, VC, OF, CM; Data analysis: MEZ, LZ, JRM, KY, VC, OF, CM; Statistical analysis: JRM, MEZ, LZ, JD; Writing: MEZ, JRM; Editing: AP, JD, BC; Supervision, Project administration: AP, BC; Funding acquisition: AP, BC, JRM

Funding

National Institute on Deafness and Other Communication Disorders (R21DC018882)

National Institute on Deafness and Other Communication Disorders (T32DC011499)

National Institute on Deafness and Other Communication Disorders (F31DC020085)

National Institutes of Health (1S10RR028478)

National Institutes of Health (UL1TR001857)