Mesolimbic confidence signals guide perceptual learning in the absence of external feedback

Abstract
eLife digest
Introduction
Results
Discussion
Materials and methods
References
Article and author information
Metrics

Abstract

It is well established that learning can occur without external feedback, yet normative reinforcement learning theories have difficulties explaining such instances of learning. Here, we propose that human observers are capable of generating their own feedback signals by monitoring internal decision variables. We investigated this hypothesis in a visual perceptual learning task using fMRI and confidence reports as a measure for this monitoring process. Employing a novel computational model in which learning is guided by confidence-based reinforcement signals, we found that mesolimbic brain areas encoded both anticipation and prediction error of confidence—in remarkable similarity to previous findings for external reward-based feedback. We demonstrate that the model accounts for choice and confidence reports and show that the mesolimbic confidence prediction error modulation derived through the model predicts individual learning success. These results provide a mechanistic neurobiological explanation for learning without external feedback by augmenting reinforcement models with confidence-based feedback.

https://doi.org/10.7554/eLife.13388.001

eLife digest

Much of our behavior is shaped by feedback from the environment. We repeat behaviors that previously led to rewards and avoid those with negative outcomes. At the same time, we can learn in many situations without such feedback. Our ability to perceive sensory stimuli, for example, improves with training even in the absence of external feedback.

Guggenmos et al. hypothesized that this form of perceptual learning may be guided by self-generated feedback that is based on the confidence in our performance. The general idea is that the brain reinforces behaviors associated with states of high confidence, and weakens behaviors that lead to low confidence.

To test this idea, Guggenmos et al. used a technique called functional magnetic resonance imaging to record the brain activity of healthy volunteers as they performed a visual learning task. In this task, the participants had to judge the orientation of barely visible line gratings and then state how confident they were in their decisions.

Feedback signals derived from the participants’ confidence reports activated the same brain areas typically engaged for external feedback or reward. Moreover, just as these regions were previously found to signal the difference between actual and expected rewards, so did they signal the difference between actual confidence levels and those expected on the basis of previous confidence levels. This parallel suggests that confidence may take over the role of external feedback in cases where no such feedback is available. Finally, the extent to which an individual exhibited these signals predicted overall learning success.

Future studies could investigate whether these confidence signals are automatically generated, or whether they only emerge when participants are required to report their confidence levels. Another open question is whether such self-generated feedback applies in non-perceptual forms of learning, where learning without feedback has likewise been a long-standing puzzle.

https://doi.org/10.7554/eLife.13388.002

Introduction

Learning is an integral part of our everyday life and necessary for survival in a dynamic environment. The behavioral changes arising from learning have quite successfully been described by the reinforcement learning principle (Sutton and Barto, 1998), according to which biological agents continuously adapt their behavior based on the consequences of their actions. Thus, reinforcement learning models and most other learning models depend on feedback from the environment. Yet, there are important instances of learning where no such external feedback is provided, challenging the generality of these learning models in shaping our behavior.

A well-studied case of learning is the improvement of performance in perceptually demanding tasks through training or repeated exposure (Gibson, 1963). Such perceptual learning has repeatedly been demonstrated to occur without feedback (Herzog and Fahle, 1997; Gibson and Gibson, 1955; McKee and Westheimer, 1978; Karni and Sagi, 1991) and is therefore ideally suited as a test case to study learning in the absence of external feedback. Previous work has emphasized the role of reinforcement learning in perceptual learning (Kahnt et al., 2011; Law and Gold, 2009). However, these accounts were based on perceptual learning with external feedback and therefore cannot account for instances in which learning occurs without external feedback. Here, we pursued the idea that, in the absence of external feedback, learning is guided by internal feedback processes that evaluate current perceptual information in relation to prior knowledge about the sensory world. We reasoned that introspective reports of perceptual confidence could serve as a window into such internal feedback processes. In this scenario, low or high confidence would correspond to a negative or positive self-evaluation of one’s own perceptual performance, respectively. Accordingly, confidence could act as a teaching signal in the same way as external feedback in normative theories of reinforcement learning (Daniel and Pollmann, 2012; Hebart et al., 2014). Applied to the case of perceptual learning, a confidence-based reinforcement signal could serve to strengthen neural circuitry that gave rise to high-confidence percepts and weaken circuitry that led to low-confidence percepts, thereby enhancing the quality of future percepts.

We tested this idea in a challenging perceptual learning task, in which participants continuously reported their confidence in perceptual choices while undergoing functional magnetic resonance imaging (fMRI). No external feedback was provided; instead, confidence ratings were used as a proxy of internal monitoring processes. To account for perceptual learning in the absence of feedback, we devised a confidence-based associative reinforcement learning model. In the model, confidence prediction errors (Daniel and Pollmann, 2012) serve as teaching signals that indicate the mismatch between the current level of confidence and a running average of previous confidence experiences (expected confidence). Based on recent evidence of confidence signals in the mesolimbic dopamine system (Daniel and Pollmann, 2012; Hebart et al., 2014; Schwarze et al., 2013), we hypothesized to find neural correlates of confidence prediction errors in mesolimbic brain areas such as the ventral striatum and the ventral tegmental area. Since confidence prediction errors act as a teaching signal in our model, we hypothesized that the strength of these mesolimbic confidence signals should be linked to individual perceptual learning success.

Results

Human participants (N=29) learned to detect the orientation of peripheral noise-embedded Gabor patches relative to a horizontal or vertical reference axis while undergoing functional magnetic resonance imaging (fMRI). Overall, the experiment comprised four sessions: (i) an initial behavioral test session to establish participants’ baseline contrast thresholds for a performance level of 80.35% correct responses, (ii) an intensive perceptual learning session (training) in the MRI scanner with a continuous threshold determination, and two behavioral post-training test sessions to examine (iii) short-term and (iv) long-term stimulus-specific training effects (Figure 1A). While the training session was based on one reference axis, all test sessions comprised a contrast threshold determination for both reference axes. The training session additionally included a control condition in interleaved presentation, for which the contrast was kept constant to enable an exploratory multivariate analysis of changes in neural stimulus representation. The Gabor stimuli were flashed briefly in the upper right quadrant and participants had to judge their orientation with respect to the current reference axis (Figures 1B,C). Eyetracking ensured that participants maintained fixation throughout the training session (Figure 2—figure supplement 1). Importantly, participants did not receive external cognitive or rewarding feedback during the entire experiment. Rather, in addition to their choice, they reported their confidence about the stimulus orientation on a visual analogue scale (for a verification of accurate usage, see Figure 2—figure supplement 2). The confidence reports were used to compute the internal feedback in our model on a trial-by-trial basis.

Figure 1

Download asset Open asset

Experimental design.

(A) Overview over experimental sessions. The experiment consisted of one training session and three test sessions (pre-test, post-test and longterm-test). The test sessions included both reference axes and were used to determine the contrast threshold for a performance of 80.35 percent correct at different stages of the experiment. In the training session, only one reference axis was shown. Here too, a staircase procedure was used to continuously determine the contrast threshold for a performance level of 80.35%. In addition, the training session included a condition with constant contrast as a control for stimulus factors. (B) Procedure of an experimental trial. Participants were presented with Gabor stimuli, which were oriented either clockwise or counterclockwise with respect to a reference axis. In the unspeeded response phase participants indicated their level of confidence about the stimulus orientation on an analogue scale and subsequently made a binary orientation judgment. (C) Examples of the stimuli. Gabor patches were oriented 20° clockwise (cw) or 20° counterclockwise (ccw) relative to either the vertical or the horizontal reference axis. Three exemplary contrast levels are shown, where 8% corresponds to the participant average during training, 16% to the highest obtained thresholds and 100% to full contrast.

https://doi.org/10.7554/eLife.13388.003

Stimulus-specific perceptual learning

To establish stimulus-specific perceptual learning, we compared perceptual thresholds in pre- and post-experimental sessions between the trained and untrained reference axis (Figure 2A). The contrast thresholds improved for the trained (t₂₈ = 6.73, p < 0.001, two-tailed), but not for the untrained reference axis (t₂₈ = 0.41, p = 0.68; interaction of training × time: F_1,28 = 14.2, p < 0.001), demonstrating clear and specific effects of perceptual learning. These stimulus-specific training effects could still be detected 10 weeks later (F_1,28 = 4.3, p = 0.047), indicating long-term stability and thus demonstrating a key characteristic of perceptual learning (Karni and Sagi, 1993). To test whether the effects of learning could already be detected during the training session, we linearly fitted the contrast thresholds across trials in the critical constant performance condition. The analysis showed that contrast threshold consistently decreased across runs (linear slope: −0.006 ± 0.002, t₂₈ = −2.38, p = 0.024), from 8.64% ± 0.47 (mean ± SEM) in the first training run to 7.68% ± 0.52 in the last training run.

Figure 2 with 2 supplements see all

Download asset Open asset

Behavioral results.

(A) Contrast thresholds across the runs of the training session and in the three test-sessions (pre/post/long-term). (B) Relationship between confidence ratings and performance during the training session. Percent correct responses were computed by means of a sliding window across sorted confidence values (window size: 5% of all trials).

https://doi.org/10.7554/eLife.13388.004

To assess whether participants adequately used the continuous confidence rating scale during training, the relationship between reported confidence and performance (proportion correct) was analyzed by means of a sliding window across sorted confidence values. The performance increased monotonically with confidence (main effect of percentile: F_22,2442 = 5.76, p < 0.001, one-way ANOVA with repeated measures), approaching chance at low levels of confidence, without showing ceiling effects at high levels of confidence (Figure 2B). This pattern indicates a close link between the decisional certainty of the choice and the reported confidence and shows that confidence represents valuable self-generated feedback.

A confidence-based model of perceptual learning

To account for perceptual learning without external feedback, we devised an associative reinforcement learning model with confidence as internal feedback (Figure 3). Learning in this model was guided by the combination of a confidence-based reinforcement signal and Hebbian plasticity, inspired by the previously proposed three-factor learning rule (dopaminergic reinforcement signal, pre-synaptic activity, post-synaptic activity) for neural plasticity in the mesolimbic system (Reynolds et al., 2001; Schultz, 2002). Our model assumes that observers improve perceptual performance by optimizing a filter on incoming sensory evidence. The filter is represented by two components: signal weights for clockwise (cw) and counterclockwise (ccw) stimulus orientation (w_ccw_,ccw; w_cw_,cw), connecting orientation energy detectors E_ccw_/cw to decision units A_ccw_/cw with same orientations; and noise weights (w_ccw_,cw; w_cw_,ccw), connecting detectors E_ccw_/cw to decision units A_cw_/ccw with opposing orientations. The clockwise and counterclockwise orientation energy contained in the stimuli is computed by a simple model of primary visual cortex (Petrov et al., 2005). The weighted sums of E_ccw_/cw determine the activities of decision units A_cw_/ccw which, in a next step, are integrated in a decision value DV = A_cw − A_ccw. DV translates into probabilities for clockwise or counterclockwise choices via a softmax action selection rule, and to the model’s equivalent of confidence—decisional certainty—through its absolute value. Finally, perceptual learning is based on an associative reinforcement learning rule with two separate components: a reinforcement component utilizes a confidence prediction error (CPE; denoted as δ) as internal feedback, representing the mismatch between current confidence and a long-term estimate of expected confidence (via a learning rate α_c); and a Hebbian component ensures that the weights are updated in proportion to how strongly orientation detectors and decision units co-activate. In addition, a learning rate α_w accounts for inter-individual differences in learning speed. Figure 3—figure supplement 1 provides an exemplary time slice of one participant's behavioral reports and accompanying model variables.

Figure 3 with 1 supplement see all

Download asset Open asset

Confidence-based model of perceptual learning.

Counterclockwise (*E_cw*) and clockwise (*E_ccw*) orientation energy detectors of a dedicated representational subsystem are connected via *signal weights* (horizontal) and *noise weights* (diagonal) to decision units (*A_ccw, A_cw*). Reported choices (decisions) d are probabilistically modeled by a *decision value DV = A_ccw*− *A_cw* and the reported confidence c is modeled through the absolute value of x. Weights are updated through an associative reinforcement learning update rule. The reinforcement component is based on a *confidence prediction error δ*, reflecting the difference between reported confidence and a weighted running average of previous confidence experiences (*expected confidence $\bar{c}$* ). The Hebbian component (*E_i*× *A_j*) ensures that the update more strongly affects those connections that contribute more to the final choice. Grey-shaded boxes indicate observed variables.

https://doi.org/10.7554/eLife.13388.007

In a first step, we validated the representational subsystem by computing the average energy content for a range of spatial frequencies (one octave above and below the actual frequency of 1.25 cycles/degree) and for a range of orientations (−40° to +40° relative to the reference axes). As expected, the energy content was higher for the spatial frequency and orientations used to generate the Gabor patches relative to other frequencies and orientations (Figure 4—figure supplement 1). Further, when orientation energy was computed separately for correct and incorrect responses, the energy for designated orientations (i.e., the orientations used to generate the Gabor patches) was higher for correct than for incorrect responses (t₂₈ = 8.1, p < 0.001), whereas the energy for opposite orientations (i.e., ∓20° if ±20° was presented) was lower for correct than for incorrect responses (t₂₈ = −3.6, p = 0.001) (Figure 4A). This pattern demonstrated that the varying orientation energy was directly associated with behavior and adds to the validation of the model. This pattern did also hold when the analysis was restricted to trials of the constant contrast condition (designated orientation: t₂₈ = 6.64, p < 0.001; opposite orientation: t₂₈ = −2.38, p = 0.024), thereby showing that the representational subsystem accounted for additional variance due to the random noise field over and above the variance due to changes in stimulus contrast.

Figure 4 with 3 supplements see all

Download asset Open asset

Modeling results.

(A) Orientation energy computed by the model’s representational subsystem. The energy is depicted separately for correct and incorrect responses as well as for designated and opposing orientations. (B) Binned choice probabilities (clockwise) for observed data (black) and model predictions (red) as a function of the model-derived DV (gry: logistic fit to data). (C) Correspondence between participants’ binned confidence ratings and model-based decisional certainty (grey: linear fit). (D) Change of signal and noise weights across training runs. All error bars denote SEM corrected for between-subject variance (Cousineau, 2005).

https://doi.org/10.7554/eLife.13388.009

We then went on to fit the model parameters to participants’ orientation and confidence reports in the training session using maximum likelihood approximation (median ± SE of the median: α_w= 0.0018 ± 0.0007, α_c = 0.533 ± 0.077; see Supplementary file 1 for other parameters). To assess the model fit, we correlated model-based choice probabilities with participants’ actual choices, separately for each participant. This analysis showed that the model accounted well for participants’ choices (mean ± SE of individual z-transformed correlation coefficients: r_Pearson = 0.64 ± 0.03; one-sample t-test against Fisher z’ = 0: t₂₈ = 26.2, p < 0.001). This correspondence is reflected in the fact that participants’ and model-based choice probabilities show a nearly identical (sigmoidal) dependency on DV (Figure 4A; see Figure 4—figure supplement 2 for single-subject fits). We next assessed whether the model could predict participants’ trial-wise confidence reports. A correlation between the model-based decisional certainty and participants’ confidence reports confirmed that confidence, too, was captured by the model (mean ± SE of r_Pearson = 0.32 ± 0.02, t₂₈ = 15.4, p < 0.001; Figure 4B; see Figure 4—figure supplement 3 for single-subject data).

Finally, we evaluated how perceptual learning was reflected in the update of the model’s sensory filter by computing the change of signal and noise weights across runs. We expected that an increase of signal weights and a decrease of noise weights over the course of the training session was responsible for perceptual improvements. As depicted in Figure 4C, we found a linear increase for signal weights across runs (mean ± SEM of slope = 0.0147 ± 0.0036, t₂₈ = 4.1, p < 0.001), and a linear decrease for noise weights (slope = −0.0036 ± 0.0010, t₂₈ = −3.5, p = 0.001). Furthermore, the individual contrast threshold learning slopes correlated negatively with the slopes of signal weights (r_Pearson = −0.45, p = 0.013) and positively with the slopes of noise weights (r_Pearson = 0.46, p = 0.011). Thus, individual learning was well captured by the signal and noise weights of the model.

Model-free analysis of brain activation

We reasoned that if confidence-based internal feedback and reward-based external feedback share a common neural basis, neural responses in the ventral striatum to high-, average- and low-confidence events should exhibit a qualitatively similar pattern as reported for rewarding, neutral and punishing outcomes in reward-based learning (Delgado et al., 2000; Knutson et al., 2001). The results of these previous studies suggest that striatal activation reflects a positive anticipatory response at the beginning of a trial as well as a subsequent prediction error response related to the outcome. To simulate the BOLD response that arises from such a scheme, we convolved vectors coding the neural activation for an initial anticipatory response and three different outcome scenarios (positive, absent and negative prediction error) with a canonical double-gamma hemodynamic response function (Figure 5A). In accordance with the fMRI results of these previous studies, the simulation shows (i) an increase of striatal BOLD responses related to trial onset reflecting the anticipatory signal, and (ii) a subsequent positive, absent, or negative deflection of the BOLD response reflecting prediction errors.

Figure 5 with 1 supplement see all

Download asset Open asset

Confidence signals in the mesolimbic system and their relation to perceptual learning.

(A) Neural activation time courses consisting of an anticipatory peak at trial onset and a positive, absent, or negative reward prediction error (PE) during outcome (stimulus onset). To simulate the associated BOLD response, the time courses were convolved with the standard canonical hemodynamic response function provided by SPM. (B) Event-related BOLD time courses in the ventral striatum for three tertiles of the behavioral confidence reports (representing 'low', 'middle' and 'high' confidence trials). The shaded areas denote SEM. (**C, D**) Whole-brain t-maps showing brain regions with a positive relationship between BOLD signal and expected confidence at trial onset (C), and between BOLD signal and CPE at stimulus onset (D). The t-maps were thresholded at p<0.005 (C) and p<0.001 (D), uncorrected, for illustration purposes. (E) Scatter plot for the relation between the strength of striatal modulation by confidence prediction errors (peak values, after age correction) and individual perceptual learning success.

https://doi.org/10.7554/eLife.13388.013

To relate the neural signature of confidence in the present study to the simulation and to the results of previous reward-based studies, we binned the data into tertiles of the behavioral confidence rating (low, middle, and high confidence) and extracted the average BOLD time course in an anatomical mask of the ventral striatum using the SPM toolbox rfxplot (Gläscher, 2009). As shown in Figure 5B, the obtained event-related BOLD time courses are in remarkable agreement with the predictions of the simulation of Figure 5A and previous empirical findings of reward studies (Delgado et al., 2000). Specifically, 4–6 s after trial onset (reflecting the hemodynamic delay), the BOLD time courses exhibited a first peak, consistent with an anticipatory confidence signal at the start of a trial; 4–6 s after stimulus onset, the BOLD time courses displayed a positive deflection for high-confidence trials and a negative deflection for low-confidence trials. A statistical analysis confirmed above-baseline striatal activation at trial onset, indicative of an anticipatory signal (left peak at [−10 14 −6], t₂₈ = 6.42, p_rFWE < 0.001; right peak at [12 14 −8], t₂₈ = 7.78, p_rFWE < 0.001), as well as a main effect of confidence at stimulus onset in the bilateral ventral striatum (left peak at [−10 14 −4], t₂₈ = 10.56, p_rFWE < 0.001; right peak at [16 12 −8], t₂₈ = 11.46, p_rFWE < 0.001). This model-free assessment provides initial support for the idea that reinforcement based on reward and based on confidence share a common neural substrate both in the anticipation and the outcome period. In addition, the results lend plausibility to a model utilizing confidence as a reinforcement signal.

Model-based analysis of brain activation

Expected confidence

To link the confidence-based reinforcement learning model to brain activity, we estimated a new general linear model (GLM) using two time-varying parametric variables generated from the model: expected confidence at trial onset and CPE at stimulus onset. As conjectured, we found a significant parametric modulation of striatal activation by expected confidence at trial onset (right peak at [8 14 −4], t₂₈ = 4.12, p_rFWE = 0.018; left peak at [−12 20 −2], t₂₈ = 3.97, p_rFWE = 0.026; Figure 5C). This model-based result corroborates the notion that striatal activation at trial onset reflects anticipated confidence for the upcoming stimulus presentation, analogous to previous findings of an anticipatory reward signal in the ventral striatum (Delgado et al., 2000; Preuschoff et al., 2006; Knutson et al., 2001).

Confidence prediction error

Using the parametric CPE regressor, we next tested for a positive linear relationship between BOLD and the CPE at the time of stimulus presentation (Figure 5D). As hypothesized, the bilateral ventral striatum showed a strong positive relationship with the CPE (left peak at [−16 8 −10], t₂₈ = 7.64, p_rFWE < 0.001; right peak at [16 14 −8], t₂₈ = 7.81, p_rFWE < 0.001). This modulation was also present in our second region of interest, the ventral tegmental area (peak at [−6 −22 −16], t₂₈ = 3.02, p_rFWE=0.027; see Supplementary file 2 for a whole-brain list of active brain regions). A complementary analysis ruled out that this modulation was due to stimulus salience (Figure 5—figure supplement 1). In sum, the CPE was represented by the same key brain structures that have been implicated in signaling reward prediction errors (Schultz et al., 1997; O’Doherty et al., 2004; Berns et al., 2001).

Orientation energy and decision value

To assess whether the model variables associated with the sensory and decisional subsystem would be reflected in brain activity, we performed a multivariate whole-brain searchlight (Kriegeskorte et al., 2006) analysis using cross-validated MANOVA (Allefeld and Haynes, 2014). As a variable describing the sensory evidence, we used orientation energy (OE), defined as the sign of the stimulus orientation (ccw = −1, cw = 1) in conjunction with the corresponding orientation energy tertile (low = 0.5, middle = 1.5, high = 2.5). We then searched for brain areas showing a multivariate linear relationship with OE. We found that OE was encoded in left occipital cortex (peak at [−44 70 0], t₂₈ = 4.77, p_cFWE = 0.033), an area overlapping with voxels activated by the stimulus localizer (Figure 6A). An additional ROI analysis to track the distinctness of stimulus-coding patterns across training runs within the control condition (constant contrast) was not feasible due to a lack of statistical power, i.e., decoding of orientation energy within a localizer-based visual cortex ROI was not possible when the data were restricted to the constant contrast condition (mean pattern distinctness ± SEM, D = 0.0029 ± 0.0033, t₂₈ = 0.86, p = 0.198, one-tailed t-test against the null hypothesis of a pattern distinctness equal to zero).

Figure 6

Download asset Open asset

The neural basis of perceptual and decisional model variables.

(A) Model-derived signed orientation energy (OE). The panel shows the t-map for multivariate decoding of OE. Red outlines indicate areas generally responding to the stimulus as measured with the independent stimulus localizer (t-contrast: stimulus > baseline). (B) Model-derived decision value (DV). T-map for multivariate decoding of the model-derived DV. All t-maps are thresholded at p < 0.005, for illustration.

https://doi.org/10.7554/eLife.13388.015

For the analysis of the decision value (DV), trials were sorted in an analogous manner, such that negative and positive DVs (reflecting the model’s tendency for counterclockwise and clockwise choices) were separately sorted into DV tertiles. Interestingly, decoding of DV showed only a trend in sensory occipital cortices (peak at [−40 −72 26], t₂₈ = 3.35, p = 0.001, uncorrected). Instead, we found significant encoding of the DV in right middle frontal gyrus (peak at [32 14 44], t₂₈ = 5.87, p_FWE = 0.026; Figure 6B), in line with a recent report (Hebart et al., 2014). Thus, perceptual and decision variables of our model can be mapped to visual and frontal cortex, respectively.

Relation to individual perceptual learning success

Finally, we investigated whether the strength of the striatal confidence prediction error modulation translated into improvements in perceptual performance. For that purpose, we correlated the parameter estimates at the peaks of the bilateral striatal CPE contrast (coordinates [−16 8 −10] and [16 14 −8], see above) with an index for the perceptual learning success that quantified the threshold change for the trained reference axis while accounting for baseline thresholds. Age was included as an additional factor in the regression model to preclude confounding effects of age-related variation in striatal BOLD signal (Duijvenvoorde et al., 2014). As hypothesized, we found a significant relationship between the striatal CPE signal and individual perceptual learning success in the left ventral striatum (r_Pearson = 0.400, p = 0.016, one-tailed; without age correction: r_Pearson = 0.37, p = 0.026) and a trend in the right ventral striatum (r_Pearson = 0.268, p = 0.080; without age correction: r_Pearson = 0.24, p = 0.11) (see Figure 5E). This result is congruent with a viable role of CPE-based feedback signals for perceptual learning in the absence of external feedback.

Discussion

In this study, we used perceptual learning to address the question of how humans can improve performance in the absence of external feedback. Previous reinforcement learning accounts of perceptual learning were based on external cognitive and rewarding feedback (Law and Gold, 2009; Kahnt et al., 2011) and could not explain the established phenomenon of perceptual learning without such feedback (Herzog and Fahle, 1997; Gibson and Gibson, 1955; McKee and Westheimer, 1978; Karni and Sagi, 1991). Here, we suggest that observers are capable of generating internal feedback by utilizing confidence signals that provide a graded evaluation of the correctness of a perceptual decision. In this way, confidence may serve as a reinforcement signal similar to reward and guide perceptual learning in cases where no external feedback is provided.

In support of this view, our model-free fMRI analyses revealed that mesolimbic confidence signals mirror those typically found for reward, both in the anticipation period (Preuschoff et al., 2006; Delgado et al., 2000; Knutson et al., 2001) and for prediction errors (Schultz et al., 1997; O’Doherty et al., 2004; Berns et al., 2001). To establish a mechanistic ground for this suggested parallel, we devised an associate reinforcement learning model, which links behavior to computational variables that each account for a different aspect of the learning process. CPEs served as feedback in the model, defined as the difference between the current level of confidence and a long-term estimate of expected confidence. The model successfully described the learning process as a continuous adjustment of a perceptual filter linking sensory and decision units. Our model-based fMRI analyses confirmed and extended the results of the model-free analyses by demonstrating a parametric modulation in mesolimbic brain areas both by expected confidence and confidence prediction. Importantly, the strength of the striatal modulation by CPEs predicted participants’ perceptual improvements, further corroborating the behavioral relevance of these internally-generated feedback signals.

The observed pattern of confidence-related activity in the mesolimbic system, including the co-modulation of the ventral tegmental area, fit well with the prediction error hypothesis of dopamine, which posits that dopaminergic midbrain neurons and their targets respond at two time points during a learning trial (Schultz et al., 1997). In this framework, the first response is triggered by an outcome-predictive cue and reflects an anticipatory signal. In the case of classical reinforcement learning, such a cue may be probabilistically coupled with rewards of possibly varying magnitudes. The anticipated value of the cue is then assumed to be computed as the average reward magnitude—contingent on the cue—in previous trials (Schultz, 2006). Here, we argue that the same principle could hold for confidence: participants learn to anticipate a certain level of confidence for the upcoming trial based on past confidence experiences, and this anticipatory state is activated when the beginning of a new trial is indicated (equivalent to a cue). In congruence with this postulation, we indeed found a modulation of striatal activity by expected confidence at trial onsets—as previously reported for expected reward (Preuschoff et al., 2006; Delgado et al., 2000; Knutson et al., 2001). The second response is triggered by the actual outcome and corresponds to a prediction error signal. In classical reinforcement learning, the reward prediction error represents the difference between expected value and actual outcome. In the confidence domain, the outcome would correspond to the level of confidence calculated from the stimulus and the prediction error would be computed as the difference between expected confidence and actual confidence. Overall, our results may therefore indicate that self-generated confidence assumes the role of external reward in dopaminergic prediction-error-based reinforcement learning when no external feedback is available.

A number of previous studies have used reinforcement learning models to capture the neural underpinnings of perceptual learning (Law and Gold, 2009; Kahnt et al., 2011) and category learning (Daniel and Pollmann, 2012). In particular, an fMRI study by Kahnt and colleagues (Kahnt et al., 2011) investigated perceptual learning with external reward and found that behavioral improvements were well explained by a reinforcement learning model. Their results exhibit a notable parallel to the present findings: the authors reported stimulus information encoded in visual cortex and model-derived decision value in frontal cortices, in agreement with the findings of the present study. In addition, this previous study identified a perceptual learning-related reward prediction error in the ventral striatum, dovetailing with our finding of a perceptual learning-related confidence prediction error in the same brain region. Importantly, our combined Hebbian and reinforcement learning model extends and improves previous models in several ways. First and foremost, by implementing confidence prediction errors in replacement of reward prediction errors, it extends previous reward reinforcement learning models of perceptual learning (Law and Gold, 2009; Kahnt et al., 2011) to cases without feedback. Second, these previous models were based on the assumption that perceptual performance is determined by a single 'readout weight', representing the amplification of stimulus information in sensory areas. While the simplicity of these models is appealing, they are limited in the sense that negative prediction errors have an unreasonable influence on behavior: according to these models, worse-than-expected feedback reduces the readout weight, which leads to an additional reduction in performance. This property runs counter to the idea that reinforcement learning agents improve their behavior through both positive and negative prediction errors. By contrast, the associative reinforcement learning rule of the present model entails a behaviorally advantageous and plausible function of negative prediction errors: inhibition of sensory noise. Third, a conceptually related reinforcement learning model for perceptual categorization (Daniel and Pollmann, 2012) implies that stimuli exclusively activate the correct stimulus category, an assumption that disregards the fact that the ambiguity of incoming stimulus information is an essential property of perceptually demanding tasks. In contrast, the present model utilizes a dedicated representational subsystem (Petrov et al., 2005) to estimate the activation of all implemented input units, and it is their differential activity that determines perceptual choices.

The present model and results are biologically plausible and fit well with theoretical accounts of the neural basis of learning. The associative reinforcement learning rule in the model was inspired by the three-factor learning rule (Schultz, 2002; Reynolds et al., 2001), which has been proposed to underlie the potentiation of synapses in the striatum. It proposes that changes in neural transmission in cortico-striatal synapses not only depend on coincident presynaptic and postsynaptic activity (Hebbian learning), but also on the presence of dopamine error signals. Indeed, Ashby and colleagues (Ashby et al., 2007; Hélie et al., 2015) have previously suggested that the basal ganglia, which represent the predominant site of dopaminergic synaptic plasticity, are themselves a key region for learning in perceptual tasks. They proposed that (i) the basal ganglia serve to activate the appropriate target regions in executive frontal cortices shortly after sensory cortex activation; and (ii) such basal ganglia learning is superseded by cortico-cortical Hebbian learning, once the correct cortico-cortical synapses are built. This account fits well with the present model, in which perceptual learning corresponds to the process of reweighting connections between sensory and decisional units. These considerations in combination with the present results thus lend support to the hypothesis that the optimization of perceptual read-out (as implicated by our model) could be mediated via reinforcement learning in the basal ganglia.

While our study represents a first but important step towards understanding the role of confidence signals in perceptual learning, future studies are needed to investigate in more detail the characteristics of these signals which were not addressed in the current study. First, are these signals triggered independent of whether participants have to report their level of confidence after the percept or independent of whether they receive external feedback? Investigating these questions could clarify whether the observed activity in the reward network is an automatic response or depends on the task of the observer. Second, are these learning signals independent of making a perceptual decision? In other words, are they triggered only when participants have to engage in a subsequent perceptual choice? Similarly, can these confidence signals be disentangled from choice accuracy, for instance by manipulating stimulus luminance (Busey et al., 2000)? An answer to this latter question would shed light on the nature of the confidence signals, i.e. whether they can also be affected by metacognitive biases.

In summary, our study devised and tested a novel model of perceptual learning in the absence of external feedback, utilizing confidence prediction errors to guide the learning process. Our analyses revealed a compelling analogy between confidence-based and reward-based feedback, suggesting a similar neural mechanism for learning with and without external feedback. Future work could investigate whether a learning mechanism based on such self-generated feedback is also applicable outside the realm of perception, where learning without feedback has likewise been a long-standing puzzle (Köhler, 1925).

Materials and methods

Participants

Thirty healthy, right-handed female participants took part in the experiment in return for payment after giving written informed consent. Participants of only one gender were selected in view of the planned between-subject analysis, because male and female brains on average have different brain volumes (Ruigrok et al., 2014), introducing between-subject noise in the spatial normalization procedure, and slightly different hemodynamic response profiles (Jaušovec and Jaušovec, 2010), which could impact the detectable BOLD signal. One participant was excluded due to fixation failure, leaving 29 valid participants (24.1 ± 2.5 years, range 19–31 years). The relatively large total sample size of 30 size was chosen in view of the planned between-subject correlation between striatal modulation and learning success. As the best available study for comparison, we determined Schlagenhauf et al. (2013) (N = 28), which investigated the association between striatal prediction error signaling and fluid intelligence. The sample size for the present study was estimated through a method recommended by Hulley et al. (2013) ( $N \approx {[\frac{z_{α} + z_{β}}{0.5 \ln [(1 + r) / (1 - r)]}]}^{2} + 3$ ), thereby applying the observed associative strength in this previous study (r = 0.47), a Type I error rate α = 0.05 (Z_α = 1.960) and a Type II error rate β = 0.2 (Z_β = 0.842). The present study was conducted according to the declaration of Helsinki, and approved by the ethics committee of the Charité Universitätsmedizin Berlin.

Share this article

Cite this article

Experimental design.

Behavioral results.

Confidence-based model of perceptual learning.

Modeling results.

Confidence signals in the mesolimbic system and their relation to perceptual learning.

The neural basis of perceptual and decisional model variables.

Author details

Matthias Guggenmos

Contribution

For correspondence

Competing interests

Gregor Wilbertz

Contribution

Competing interests

Martin N Hebart

Contribution

Contributed equally with

Competing interests

Philipp Sterzer

Contribution

Contributed equally with

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism