Introduction

Humans and animals have a remarkable ability to navigate complex environments and infer the likely state of the world from observed phenomena. Such adaptive behavior requires the ability to learn about causal relationships between one’s choices and subsequent outcomes. A key challenge for learning systems in the brain arises when a task involves temporal delays between choices and their outcomes. Cooking is one such task in which many decisions may be made about how to adjust the flavor profile of a dish, but the resultant outcomes of these choices typically will not be evaluated until sitting down to eat. Moreover, cooking often requires juggling multiple sub-tasks simultaneously, meaning that interim decisions need to be performed in between adding an ingredient and observing its effect on the dish’s flavor. In such cases, discerning the causal relationship between a particular choice and possible outcomes is nontrivial. While this ability to link choices and outcomes is critical to success in real-world tasks, little is known about how these links are forged at the neural level.

A large body of pioneering work focusing on the role of the lateral orbitofrontal cortex (lOFC) has highlighted the importance of this region in contingent learning (Gardner & Schoenbaum, 2021; Murray & Rudebeck, 2018; Rushworth et al., 2011). Recent studies in multiple species have emphasized a special role for lOFC in leveraging task knowledge for credit assignment, linking specific reinforcement outcomes to specific past choices (Boorman et al., 2013; Jocham et al., 2016; Lamba et al., 2023; Stalnaker et al., 2015; Sutton & Barto, 2014; Walton et al., 2010). In one key study, lesions to the macaque lOFC impaired the ability of animals to use a model of the task structure to track the contingency between specific choices and the outcomes they caused, with credit erroneously spreading to non-causal choices (Walton et al., 2010). These results suggest that lOFC is required for using a model of the task structure to form, or update, an association between specific choices and outcomes. Such findings were subsequently replicated and extended in both rats and humans (Costa et al., 2023; Noonan et al., 2017). Other studies in humans have shown that outcome-related blood oxygen-level-dependent (BOLD) activity in lOFC is specific to contingent, but not non-contingent, reward observations (Jocham et al., 2016), and that the magnitude of this activity reflects the degree to which credit for an outcome is assigned (Boorman et al., 2013, 2016). Collectively, these findings suggest that computations within the lOFC are critical to credit assignment; however, little is known about the mechanisms by which the lOFC supports assigning credit for outcomes to specific causes.

One possible mechanism by which the brain assigns credit when reinforcement is delayed is by reinstating a representation of the causal choice at the time of feedback. In principle, this could enable the choice representation to be associated with the online encoding of the outcome, potentially via changes in synaptic plasticity between co-active neuronal ensembles. Such coding of past choices specifically at the time of feedback has been identified in macaque lOFC neuronal ensembles, albeit in the absence of any task requirement for contingent learning (Tsujimoto et al., 2009). Likewise, altered dopaminergic prediction error responses in lOFC-lesioned rats were elegantly accounted for by a computational model that incorporates a loss of internal representations of an outcome-linked choice, leading to misattributing value across states (Takahashi et al., 2011). Information about previous choices is also found in regions with which the lOFC shares reciprocal connectivity, particularly the hippocampus (HC) (Barbas & Blatt, 1995; Wikenheiser & Schoenbaum, 2016). A largely separate literature focusing on HC has shown reinstatement of neural activity patterns previously elicited by a stimulus both at the time of choice and reward in sensory pre-conditioning paradigms (Barron et al., 2020; Kurth-Nelson et al., 2015; Wimmer & Shohamy, 2012), and likewise during associative inference and integration (Koster et al., 2018; Park et al., 2020; Zeithamova et al., 2012). Such hippocampal reinstatement of stimulus identity representations might be expected to support lOFC coding of relevant past choices for credit assignment, particularly following lengthier delays (Foerde & Shohamy, 2011; Shohamy et al., 2009; Wang et al., 2020).

In complex tasks where subsequent decisions intervene on the transitions between choices and resultant outcomes, the neural regions supporting credit assignment may extend to encompass regions that also support maintaining information about causal choices pending their resultant outcome. This would allow learning systems to precisely assign credit to causal choices by bridging over interim decisions that may otherwise be inappropriately linked to the observed outcome. A key region for maintaining such “pending” information is the lateral frontal pole (FPl), which has been implicated in maintaining information about prospective actions or cognitive processes that must be delayed and performed in the future (Burgess et al., 2007, 2011, 2022). Other research has shown that FPl activity reflects the reliability of pending alternative task sets (Donoso et al., 2014; Koechlin et al., 2003; Koechlin & Hyafil, 2007), and that it tracks evidence favoring adapting behavior to specific counterfactual alternatives, and directed exploratory choices, in the future (Badre et al., 2012; Boorman et al., 2009, 2011; Zajkowski et al., 2017). On this basis, we hypothesized that the FPl would play a critical role in maintaining information about previous choices that will be needed for future credit assignment during interim decisions.

In the current study, we test these hypotheses using a learning task in which participants must track contingencies between specific choices and outcomes under conditions where choice-outcome transitions are direct following a delay, or indirect and involve an intervening decision. We show that in both conditions, the lOFC and HC reinstate representations of causal choices at the time of feedback. In the indirect condition, this information is critically dependent on representations of the causal choice maintained in a “pending state” in FPl, which predict subsequent reinstatement in lOFC and HC. Finally, we show that lOFC and HC code task-independent stimulus identity representations during feedback, suggesting a link between coding of a state’s identity and precise credit assignment.

Results

Learning task with direct and indirect choice-outcome transitions

Participants completed a learning task in which they chose between two abstract shapes to obtain one of two distinct outcomes (gift cards to locally available stores rated to be approximately equally desirable). Each shape had a certain probability of leading to one gift card and the inverse probability of leading to the other. These probabilities drifted over time but could be tracked based on the recent choice-outcome observations made in each trial (see Fig. S1 for probability trajectories and Bayesian model fitting). Participants were informed of how many points each gift card would yield on each trial by colored numbers on the top of the screen, and that these points changed randomly from one trial to the next (Fig. 1A). They were further told that at the end of the experiment one trial would be selected at random to count “for real”. That is, they would receive the gift card obtained on that trial with a value proportional to the number of points won. Thus, participants were incentivized to maximize their potential winnings on every trial by accurately tracking the probability that each shape would lead to each outcome, but not the history of reward amounts.

A) Two abstract shapes were probabilistically related to each of two outcome identities by independent transition probabilities p1 and p2. B) Schematic of the direct transition condition. Participants chose one of the two shapes on each trial based on two pieces of information: their estimates of the probability that each would lead to either outcome identity (gift cards) and the randomly generated number of points they could potentially win if that outcome was obtained. The color of each number indicated the identity of the outcome on which that number of points could be won. In the example, green indicates the number of points for the Starbucks gift card, while pink indicates the number of points for iTunes. Next, participants observed the outcome of their choice (the gift card and amount) after a delay. C) Schematic of the indirect transition condition. Same as (B) except that after participants made their choice they transitioned into another independent decision. After this second decision was made, participants observed the outcome of their first decision. D) Results of logistic regression analysis predicting the current choice based on previously observed choice-outcome relationships. Each cell represents the combination of a previously observed choice with an observed outcome. The color of each cell shows the value of beta estimates for each combination of previous choice and observed outcome, averaged across participants. Positive values indicate that the choice-outcome pair predicted choosing the same shape again when that shape previously led to the currently desired outcome. E) Theoretical decomposition of the matrix in (D) into groups of cells which reflect “appropriate credit assignment” given the task structure (orange) and “credit spreading” (pink). F) Mean (±SEM) of beta coefficients for specific choice-outcome combinations averaged across the groupings of cells shown in E for each condition.

The task had two conditions which proceeded in a blocked fashion. In the “direct transition” condition, participants saw the outcome of a choice after a delay period (Fig. 1B). In the “indirect transition” condition, participants did not see the outcome of their choice until after another choice had been made, requiring them to delay assigning credit to the initial choice until the appropriate outcome was observed (Fig. 1C). Finally, at the beginning of each block participants passively viewed each of the two abstract shapes and two outcome stimuli in a random order, without making decisions or observing outcomes. This “template” block allowed us to measure neural responses to stimuli independently of the learning task.

Predicting current choice based on previous choice-outcome relationships

To test whether participants were using the structure of each condition to appropriately assign credit to causal choices, we performed a multiple logistic regression analysis testing the influence of previous choice-outcome combinations on the current choice. For each participant, independently in each condition, we constructed a GLM that predicted the current choice as a function of nine different combinations of previous choices and outcomes (Eq. 1). For example, the first regressor predicted the current choice based on the previous choice and the previous outcome (trial t-1). These values were coded as 1 if the past choice led to the currently desired outcome, assumed to be the outcome with the largest monetary point value on the current trial, and −1 if it did not. Results were virtually identical when we instead used the participant-specific indifference point (α) to define the desired outcome (see Eq. 9). The second regressor predicted the current choice based on the previous choice (t-1) and the outcome received two trials in the past (t-2), and so on for all nine combinations of previous choices and outcomes covering the previous three trials.
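The regressor construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `build_design_matrix`, the ±1 coding of the chosen shape, and the 0/1 coding of outcome identity are assumptions introduced here, since Eq. 1 is not reproduced in this section. The column order follows the text (choice lag in the outer loop, outcome lag in the inner loop).

```python
import numpy as np

def build_design_matrix(choices, outcomes, desired, max_lag=3):
    """Build lagged choice-by-outcome regressors for one participant.

    choices  : +1/-1 per trial, the shape chosen (+1 = shape A; assumed coding)
    outcomes : 0/1 per trial, identity of the gift card observed (assumed coding)
    desired  : 0/1 per trial, identity of the outcome worth more points
    Returns X with max_lag**2 columns and y (1 if shape A was chosen).
    """
    choices = np.asarray(choices)
    outcomes = np.asarray(outcomes)
    desired = np.asarray(desired)
    rows, targets = [], []
    for t in range(max_lag, len(choices)):
        row = []
        for i in range(1, max_lag + 1):        # lag of the past choice
            for j in range(1, max_lag + 1):    # lag of the past outcome
                # +1 if the outcome seen at t-j is the one desired now, else -1
                match = 1 if outcomes[t - j] == desired[t] else -1
                # Signed by the shape chosen at t-i, so a positive weight means
                # "repeat that shape when it is paired with the desired outcome"
                row.append(choices[t - i] * match)
        rows.append(row)
        targets.append(1 if choices[t] == 1 else 0)
    return np.array(rows), np.array(targets)

# Toy data, purely illustrative
choices = [1, -1, 1, 1, -1, 1]
outcomes = [0, 1, 1, 0, 0, 1]
desired = [0, 0, 1, 1, 0, 1]
X, y = build_design_matrix(choices, outcomes, desired)
print(X.shape)  # one row per trial with a full three-trial history
```

The resulting `X` and `y` could then be passed to any standard logistic regression routine, yielding one β per choice-outcome combination.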

In the direct transition condition, we observed significant positive effects along the diagonal of the matrix (choice(t−1) × outcome(t−1): β = 6.09, t(19) = 4.81, p < 0.001; choice(t−2) × outcome(t−2): β = 8.78, t(19) = 5.41, p < 0.001; choice(t−3) × outcome(t−3): β = 6.76, t(19) = 4.16, p < 0.001; Fig. 1D), indicating that participants assigned credit for each outcome to the choice made in the same trial. In the indirect transition condition, current choices were significantly predicted by the most recently observed outcomes combined with choices made in the trial preceding those outcomes (choice(t−2) × outcome(t−1): β = 4.20, t(19) = 2.92, p < 0.01; choice(t−3) × outcome(t−2): β = 5.07, t(19) = 4.75, p < 0.001). Furthermore, the mean of the β-values reflecting appropriate credit assignment in each condition was significantly higher than the mean of the β-values reflecting credit spreading (direct transition condition: t(19) = 5.39, p < 0.001; indirect transition condition: t(19) = 4.34, p < 0.001; Fig. 1E and F). Follow-up analysis showed that participants’ choices in each trial integrated expectations about the probability of receiving a particular outcome and its magnitude and did not rely on estimates of a cached option value (Fig. S1). These results show that participants used the appropriate task structure when assigning credit for observed outcomes in each condition.
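To make the grouping of cells concrete, the sketch below averages a participant's 3 × 3 matrix of β-values over the cells that reflect appropriate credit assignment and over the remaining cells. The masks follow the cells named in the text; treating every other cell as "credit spreading" is an assumption made here for illustration and may differ from the exact grouping in Fig. 1E. The names `DIRECT_MASK`, `INDIRECT_MASK` and `summarize_betas` are hypothetical.

```python
import numpy as np

# Rows index the lag of the past choice (t-1, t-2, t-3);
# columns index the lag of the past outcome (t-1, t-2, t-3).
DIRECT_MASK = np.eye(3, dtype=bool)             # choice(t-k) with outcome(t-k)
INDIRECT_MASK = np.zeros((3, 3), dtype=bool)
INDIRECT_MASK[1, 0] = True                      # choice(t-2) with outcome(t-1)
INDIRECT_MASK[2, 1] = True                      # choice(t-3) with outcome(t-2)

def summarize_betas(betas, mask):
    """Return (mean appropriate beta, mean of all remaining betas)."""
    betas = np.asarray(betas, dtype=float).reshape(3, 3)
    return betas[mask].mean(), betas[~mask].mean()

# Illustrative values only, not data from the study
example = np.array([[6.0, 1.0, 0.0],
                    [1.0, 8.0, 1.0],
                    [0.0, 1.0, 7.0]])
appropriate, spreading = summarize_betas(example, DIRECT_MASK)
print(appropriate, spreading)
```

Comparing the two summaries across participants (for example with a paired test) corresponds to the contrast reported for each condition.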

Next, we compared the precision of credit assignment between conditions. We predicted that credit assignment would be less precise in the indirect transition condition than in the direct transition condition, owing to the additional task complexity. We found that β-values representing appropriate credit assignment in the direct transition condition were higher than those in the indirect transition condition (t(19) = 1.81, p < 0.05). However, β-values in cells that represent credit spreading in the direct transition condition were not significantly lower than those in the indirect transition condition (t(19) = 1.11, p = 0.14). These results indicate that credit assignment was less precise in the indirect transition condition compared to the direct transition condition, despite each being appropriate for the respective task structure overall.

Causal choice codes are reinstated in lOFC and HC when viewing the outcome of choices

For the direct transition condition, our main hypothesis was that lOFC codes for the specific causal choice when participants view the outcome of their choice. We also reasoned that, due to the delay between choice and feedback, this lOFC choice code would be supported by choice reinstatement in the interconnected HC (Barbas & Blatt, 1995; Wimmer & Shohamy, 2012). We tested this hypothesis by training a linear support vector machine (SVM) to distinguish BOLD activity patterns at the time of feedback based on the previously chosen shape, cross-validated across scanning runs (see Methods for details on decoding procedure). We used a searchlight analysis within a priori defined ROIs for lOFC and HC to estimate decoding accuracy for each voxel within the ROI (Kriegeskorte et al., 2008).
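The cross-validation scheme can be illustrated with a small, self-contained sketch. It is not the pipeline described in Methods: the linear SVM here is a bare-bones stand-in trained by subgradient descent on the hinge loss, the data are synthetic, and the searchlight step is omitted. The function names and all parameter values are assumptions made for illustration.

```python
import numpy as np

def fit_linear_svm(X, y, lam=0.01, lr=0.1, n_iter=300):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        active = margins < 1                    # trials violating the margin
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def leave_one_run_out_accuracy(X, y, runs):
    """Train on all runs but one, test on the held-out run, and average."""
    accuracies = []
    for held_out in np.unique(runs):
        train = runs != held_out
        test = ~train
        w, b = fit_linear_svm(X[train], y[train])
        predictions = np.where(X[test] @ w + b >= 0, 1, -1)
        accuracies.append(np.mean(predictions == y[test]))
    return float(np.mean(accuracies))

# Synthetic "feedback period" patterns: 4 runs x 20 trials x 10 voxels
rng = np.random.default_rng(0)
runs = np.repeat(np.arange(4), 20)
y = np.tile([1, -1], 40)                        # previously chosen shape
X = rng.normal(size=(80, 10))
X[:, :3] += 1.5 * y[:, None]                    # choice signal in 3 voxels
accuracy = leave_one_run_out_accuracy(X, y, runs)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Holding out whole runs, rather than random trials, keeps training and test patterns from sharing run-specific noise. In a searchlight analysis the same procedure would be repeated for the voxels surrounding each location in the ROI.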

We found evidence for choice decoding in the predicted network of regions. Specifically, we found significant decoding of the causal choice in left lOFC ([x,y,z] = [−26, 42, −8], t(19) = 4.22, pTFCE < 0.05, ROI-corrected using threshold-free cluster enhancement (TFCE; Smith & Nichols, 2009)) and marginally significant decoding in right lOFC ([x,y,z] = [24, 46, −8], t(19) = 3.45, pTFCE = 0.081, ROI-corrected) (Fig. 2A). A similar pattern was also apparent in the HC, where right HC showed significant decoding ([x,y,z] = [36, −20, −16], t(19) = 4.02, pTFCE < 0.05, ROI-corrected), while left HC showed a marginal effect ([x,y,z] = [−22, −10, −24], t(19) = 2.86, pTFCE = 0.080, ROI-corrected). Together, these results show that the lOFC and HC represent the causal choice at the time when credit is assigned in the direct condition of our task.

Left side shows the analysis scheme for decoding representations of the causal choice at feedback in the direct transition condition. An SVM decoder was used to differentiate trials at the time of the outcome (purple) based on the causal choice selected during the “choice period” (cyan). The right side shows axial and coronal slices through a t-statistic map showing significant decoding in OFC and HC during feedback. For illustration, all maps are displayed at a threshold of t(19) = 2.54, p < 0.01 uncorrected. All effects survive small volume correction in a priori defined anatomical ROIs.

Pending item representations in FPl during indirect transitions predict credit assignment in lOFC

The indirect transition condition allowed us to test whether similar reinstatement mechanisms, as described above, support credit assignment when choice-outcome transitions are punctuated by interim decisions. We anticipated that the structure of the indirect transition condition would render credit assignment more difficult compared to the direct transition condition; a prediction borne out by our behavioral analysis of learning (Fig. 1F). Repeating the causal choice decoding analysis on this condition did not reveal a significant effect in any a priori defined ROI (all pTFCE >0.05 ROI corrected), nor did we find significant decoding elsewhere in the brain (all pTFCE >0.05 whole brain corrected). However, a key attribute of this condition is that causal choices must be held in a pending state during interim choices until a prospective outcome is observed. Thus, we reasoned that the fidelity of credit assignment at the time of feedback would be intimately related to the fidelity with which representations were maintained during the interim decision.

Following previous work suggesting that prospective representations of to-be-completed tasks are supported by FPl (Burgess et al., 2011; Koechlin & Hyafil, 2007), we predicted that FPl would hold causal choices in a “pending state” when credit assignment needs to be deferred until the resulting outcome is observed. To test this hypothesis, we used a linear SVM to classify neural activity at the time of feedback based on the immediately preceding choice. Note that in this condition the immediately preceding choice is not the cause of the currently observed outcome, but is the cause of the outcome for which credit will be assigned in the next trial. We call this the “pending causal choice”. Our analysis revealed a cluster of voxels specifically within the predicted right FPl ([x,y,z] = [28, 54, 8], t(19) = 3.74, pTFCE < 0.05 ROI-corrected; left hemisphere all pTFCE > 0.1), consistent with right FPl coding for the pending causal choice at feedback time, precisely when the outcome of the prior causal choice needed to be evaluated.

To test whether pending choice information held in FPl was directly related to the causal choice information coded during subsequent credit assignment, we used an “information connectivity” (IC) analysis, which seeks to identify how information is shared between brain regions (Coutanche & Thompson-Schill, 2013). Specifically, we tested the correlation between the fidelity of the previous choice representation when in a pending state, and the same causal choice representation during subsequent credit assignment. We began by using an SVM to classify representations of the causal choice during the interim feedback period in voxels in the FPl that were shown to code this information in our previous analysis (thresholded at t(19) = 2.54, p < 0.01). Note that this relatively liberal threshold simply allows for the inclusion of more voxels for a statistically independent test in a left-out set of trials, thereby obviating selection bias. In a left-out set of trials, we calculated the distances between the estimated hyperplane and trial-level voxel activation patterns, and then signed these distances such that positive distances reflected “correct” classifications and negative distances reflected “incorrect” classifications. Next, we applied the same method to quantify and sign the distances when decoding the same causal choices at the time of credit assignment, that is, when viewing the relevant outcome in the next trial. Finally, we correlated the decoding distances of causal choices in a pending state in FPl with decoding distances of these choices during credit assignment in our lOFC and HC ROIs. This allowed us to assess whether the fidelity of pending causal choice representations in FPl predicts the fidelity of representations during credit assignment in the lOFC and HC.
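The two steps of the IC analysis, signing hyperplane distances and correlating them across regions, can be sketched as follows. This is an illustration under stated assumptions: the classifier weights are taken as given (in practice they would come from an SVM fit on independent trials), labels are coded ±1, the trial arrays are assumed to be already aligned so that entry k in the seed region (feedback on trial t) and entry k in the target region (feedback on trial t+1) refer to the same causal choice, and a Pearson correlation is used. The function names and toy values are hypothetical.

```python
import numpy as np

def signed_distances(X, y, w, b):
    """Distance of each trial pattern from the hyperplane (w, b).

    Multiplying by the true label y (+1/-1) makes the distance positive
    for correctly classified trials and negative for misclassified ones.
    """
    return y * (X @ w + b) / np.linalg.norm(w)

def information_connectivity(dist_seed, dist_target):
    """Pearson correlation of trial-wise signed distances in two regions."""
    return float(np.corrcoef(dist_seed, dist_target)[0, 1])

# Toy example with two-voxel "regions"; values are illustrative only
y = np.array([1, 1, -1, -1])                    # causal choice on each trial
X_fpl = np.array([[2.0, 0.0], [-1.0, 0.0], [0.5, 0.0], [-3.0, 0.0]])
X_lofc = np.array([[0.0, 1.0], [0.0, -1.0], [0.0, 0.5], [0.0, -2.0]])

d_fpl = signed_distances(X_fpl, y, w=np.array([1.0, 0.0]), b=0.0)
d_lofc = signed_distances(X_lofc, y, w=np.array([0.0, 2.0]), b=0.0)
ic = information_connectivity(d_fpl, d_lofc)
print(d_fpl, d_lofc, round(ic, 3))
```

A positive value indicates that trials on which the seed region codes the pending choice with high fidelity tend to be trials on which the target region also codes that choice with high fidelity at the next feedback.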

This analysis revealed strong IC between representations in FPl at feedback on trial t and the representations in lOFC and HC during feedback on trial t+1. Specifically, we found significant correlations in decoding distance between FPl and bilateral lOFC ([x,y,z] = [−32, 24, −22], t(19) = 3.81; [x,y,z] = [20, 38, −14], t(19) = 3.87; pTFCE < 0.05 ROI-corrected) and bilateral HC ([x,y,z] = [−28, −10, −24], t(19) = 3.41; [x,y,z] = [22, −10, −24], t(19) = 4.21; pTFCE < 0.05 ROI-corrected) (Fig. 2C). Subsequent analyses confirmed that this effect was due to these regions showing a significant increase in decoding accuracy when FPl correctly codes the pending causal choice, and not simply lOFC or HC decoding becoming “less incorrect” (see Fig. S7). This finding is consistent with the coding of the causal choice during feedback in lOFC and HC being dependent on that causal choice being faithfully maintained in a pending state in the FPl.

HC represents task-independent stimulus identity at feedback

Next, we tested whether the content of past choice coding at feedback includes a stimulus identity code that is reinstated during credit assignment. To test for task-independent representations of the causal stimuli, we trained a linear SVM to distinguish neural patterns evoked when participants passively viewed each shape in “template trials” (see Methods). Importantly, these were presented outside the context of the learning task and were not connected to a specific action or outcome. We then tested the classifier on neural patterns evoked at the time of feedback during the learning task. This revealed significant decoding of the causal stimulus identity at the time of feedback when averaged across direct and indirect conditions, in the left HC (Fig. 4A; [x,y,z] = [−26, −16, −16], t(19) = 5.20, pTFCE < 0.001 ROI-corrected; right hemisphere all pTFCE > 0.1). Follow-up analyses showed a marginally significant effect in the direct transition condition alone ([x,y,z] = [−24, −16, −14], t(19) = 3.41, pTFCE = 0.08 ROI-corrected), and a significant effect in the indirect transition condition alone ([x,y,z] = [−28, −16, −18], t(19) = 3.65, pTFCE < 0.05). These results show that when observing an outcome, the HC reinstates task-independent representations of causal stimuli, suggesting a role for the HC in retrieving the causal stimulus identity during credit assignment.

A) Left side shows the analysis scheme for decoding information about the causal choice in “pending state” (pink) in the indirect transition condition. We decoded information about the previous choice during the feedback period, during which the causal stimulus should be “pending” credit assignment in the next trial. The image on the right shows a coronal slice through a t-statistic map, showing significant decoding in FPl. B) The analysis scheme for the information connectivity analysis which uses the trial-by-trial fidelity of causal choice representations in the “pending state” (pink) to predict the fidelity of these same choices when the outcome is observed (purple). The right side shows axial and coronal slices of a t-statistic map showing effects in lOFC and HC. All maps are displayed using the same conventions as Fig. 2 and all effects survive small volume correction in a priori defined anatomical ROIs.

A) Schematic of the decoding procedure. In task-independent “template trials”, participants passively viewed images corresponding to the two choice stimuli and two outcome stimuli in the main task. We used these trials to train an SVM to differentiate stimuli outside the task context and then tested for representations of the causal choice stimulus at the time of feedback during the learning task. B) A coronal slice through a t-statistic map showing regions of the HC with significantly above chance decoding for the causal choice stimulus identity at the time of feedback, across conditions. In this figure, “CA” refers to “credit-assignment”. C) Analysis scheme for generating each participant’s overall credit assignment precision. β-values for each participant were taken from the behavioral model predicting current choices given all combinations of the previous three choices and outcomes (Eq. 1). Each participant’s pattern of β-values (left side matrices) was correlated with a matrix representing an optimal pattern of regression betas given the task structure (right side matrices). The optimal matrix was a binary matrix with ones where credit should be assigned for a given outcome and zeros everywhere else. D) Axial slice through a t-statistic map showing regions where decoding of the stimulus identity was significantly correlated with estimates of credit assignment precision. All maps are displayed using the same conventions as Fig. 2 and all effects survive small volume correction in a priori defined anatomical ROIs.

We reasoned further that if the HC supports credit assignment by evoking task-independent identity representations, then the extent to which this information is coded in the HC should be intimately tied to behavioral estimates of credit assignment precision. Alternatively, identity representations in the HC might support credit assignment processes in lOFC, such that the extent to which this information is represented in lOFC is predictive of precise credit assignment. To test these predictions, we estimated each participant’s overall credit assignment precision by correlating their pattern of β-values from the logistic regression models predicting choice with those of an “ideal learner” (Fig. 4B). The pattern for an ideal learner was taken to be 1 for any choice-outcome combination that reflected the true task structure, and 0 everywhere else. Higher correlations between these patterns meant that participants appropriately assigned credit to causal choices without attribution spreading to non-causal choices. We then correlated each participant’s estimated credit assignment precision with the average decoding accuracy in HC and lOFC. We found that there was a significant correlation between credit assignment precision and decoding accuracy of the causal stimulus identity reinstatement in lOFC ([x,y,z] = [−24, 34, −16], t(19) = 3.24, pTFCE <0.05 ROI-corrected), but not HC (all pTFCE >0.09 ROI-corrected) (Fig. 4C). These results suggest that the extent to which identity information is reinstated in lOFC is directly related to the precision with which participants link appropriate choices and outcomes together.
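The precision estimate described above can be sketched as a correlation between a participant's β matrix and a binary "ideal learner" matrix. The placement of ones follows the task structure stated in the text; the matrix layout (rows as choice lags, columns as outcome lags), the use of a Pearson correlation, and the function names are assumptions made for illustration. How the two conditions were combined into one overall score is not specified in this section and is not modeled here.

```python
import numpy as np

def ideal_matrix(condition):
    """Binary 3x3 pattern: 1 where credit should be assigned, 0 elsewhere."""
    ideal = np.zeros((3, 3))
    if condition == "direct":
        ideal[[0, 1, 2], [0, 1, 2]] = 1.0       # choice(t-k) with outcome(t-k)
    elif condition == "indirect":
        ideal[[1, 2], [0, 1]] = 1.0             # choice(t-k-1) with outcome(t-k)
    else:
        raise ValueError("condition must be 'direct' or 'indirect'")
    return ideal

def credit_assignment_precision(betas, condition):
    """Correlate a participant's beta pattern with the ideal pattern."""
    betas = np.asarray(betas, dtype=float).reshape(3, 3)
    ideal = ideal_matrix(condition)
    return float(np.corrcoef(betas.ravel(), ideal.ravel())[0, 1])

# Illustrative beta patterns, not data from the study
precise = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
diffuse = [[5, 3, 2], [3, 5, 3], [2, 3, 5]]
p_precise = credit_assignment_precision(precise, "direct")
p_diffuse = credit_assignment_precision(diffuse, "direct")
print(p_precise, p_diffuse)
```

Because the correlation is insensitive to overall scale, a participant with uniformly small but well-targeted β-values scores as highly as one with large ones; the measure captures the pattern of credit, not its magnitude.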

Discussion

Flexible decision making in dynamic environments requires an ability to learn choice-outcome relationships across prolonged delays, which may often be punctuated by interim decisions. Understanding how the brain assigns credit for specific outcomes, and forges connections with their causal choices, is essential for models of learning and decision-making that seek to explain how organisms implement such goal-directed behaviors. The current study reveals critical roles of the lOFC and HC in such credit assignment by showing that these regions specifically represent the causal choice at the time the outcome is observed. Importantly, we show that when credit assignment must be delayed due to an intervening choice, representations of the causal stimulus are maintained in a “pending state” in FPl. The fidelity of these representations determines the strength of causal choice representations in lOFC and HC when the outcome is subsequently observed. Finally, we show that the content of representations in HC includes the task-independent stimulus identities of the causal choice at the time of feedback, and the extent to which these are also represented in lOFC predicts precise credit assignment. Together, these results show that lOFC and HC adaptively use the task structure to associate identity-specific representations of causal choices to their resultant outcomes during learning and provide novel evidence for interactions between learning systems and FPl in elaborated task structures which emulate real-world complexity.

Our finding that the lOFC instantiates a representation of the causal stimulus at the time of feedback contributes to a broader literature concerning the role of the lOFC in credit assignment. Previous research has shown that monkeys with lOFC lesions exhibit deficits in appropriately assigning credit to causal choices (Walton et al., 2010). Similarly, activity in human lOFC has been consistently associated with learning about contingencies between choices and rewards (Boorman et al., 2016; Jocham et al., 2016; Lamba et al., 2023; Noonan et al., 2017; Witkowski et al., 2022). We add to this literature by showing that the lOFC and HC contain specific multivariate patterns for inferred causal choices when an outcome is observed, suggesting that these regions are involved in updating links between choices and outcomes. Our results from the “indirect transition” condition show that these patterns are not merely representations of the most recent choice but are representations of the causal choice given the current task structure. These findings highlight a key role for the lOFC and HC in creating links between causal states and goal-states (Boorman et al., 2021; Gardner & Schoenbaum, 2021; Howard & Kahnt, 2021; Wang & Kahnt, 2021), and suggest these regions use the specific task structure to construct causal associations between states.

Importantly, we present novel evidence that representations of “pending” causal choices are stored online in the FPl and predict the strength of causal choice representations at the time of the outcome. Our results fit well with theoretical accounts of FPl function, which propose that this region is involved in “prospective memory” and in tracking alternative behaviors or task sets during ongoing behaviors which may be returned to in the future (Boorman et al., 2009; Burgess et al., 2011; Koechlin & Hyafil, 2007; Tsujimoto et al., 2011). In the “indirect transition” condition, participants needed to delay assigning credit when the first outcome was presented but return to this process when a prospective outcome was observed in the future. We show that when participants viewed outcomes for an unrelated choice, the FPl held the content of the pending causal choice. These “pending” representations predicted the strength of subsequent causal choice representations in lOFC and HC during the next feedback period, replicating the same network we observed in the “direct transition” condition. The results extend prior work by showing that FPl activity reflects not only statistics related to the evidence favoring pending options (Badre et al., 2009; Boorman et al., 2009, 2011; Donoso et al., 2014), but also the content of information held in a pending state. Furthermore, the findings provide new evidence for the involvement of the FPl in learning within complex task structures where the transitions between choices and outcomes are indirect, structures which abound in the real world. This highlights the critical nature of interactions between cognitive subsystems that make different contributions to the learning process in these complex tasks.

A revealing aspect of our study was the inclusion of “template” trials, which allowed us to measure task-independent neural responses to the stimuli used during the learning task. By training a classifier to decode stimulus representation during passive viewing, we were able to test which regions of the brain coded the specific stimulus identity of the causal choices during credit assignment. Consistent with previous accounts of hippocampal involvement in associative learning and inference (Barron et al., 2020; Kurth-Nelson et al., 2015; Luettgau et al., 2020; Mack & Preston, 2016; Ranganath & Ritchey, 2012; Schuck & Niv, 2019; Wimmer & Shohamy, 2012), we found significant decoding of task-independent choice identities in HC across participants in both direct and indirect conditions. This suggests that the HC retrieves a representation of the stimulus identity to bind together outcomes with causal choice information at the time of credit assignment, supporting the idea that the HC is involved in linking together previous experiences of sensory information (McClelland et al., 1995). Interestingly, recent work has shown that HC neuronal ensembles code a veridical representation of stimulus identities and predicted outcomes, which are critical to inference-guided choices (Barron et al., 2020). Together, these findings imply that a state’s identity relationships constructed during credit assignment in the HC may be critical for future simulation of state-to-state transitions during outcome-guided inferences.

Interestingly, we found that the strength with which a stimulus identity could be decoded in the lOFC, but not in the HC, was correlated with behavioral measures of credit assignment. Recent work has shown that synchronized theta oscillations in macaques support information transfer from HC to the lOFC during value learning (Knudsen & Wallis, 2020). Disrupting these signals leads to learning deficits, suggesting that these regions work in concert to support value learning based on a relational cognitive map of the task. This synchrony between regions also finds support in human work showing strong functional connectivity and shared information between the anterior medial temporal cortex and OFC (Barnett et al., 2021; Mızrak et al., 2021; Ranganath & Ritchey, 2012). In our task, it is possible that while the HC coded task-independent identities of causal stimuli, the extent to which this information was transferred to, and represented in, the lOFC determined the efficacy of credit assignment. Future studies using methods with higher temporal resolution can elaborate on this idea by testing whether the HC and lOFC also share coherent stimulus identity information that is likewise channeled via theta phase coupling at the time of outcome, and how this information influences the credit assignment process.

In conclusion, we find that the lOFC and HC are critical to using model-based knowledge for efficiently forging links between outcomes and causal choices. Further, we show that in complex tasks where choice-outcome transitions may be interrupted, this credit assignment network relies on interactions with the FPl, which maintains “pending” representations of causal stimuli during the interim decision. Collectively, these findings make a novel contribution to our understanding of credit assignment in the brain by illuminating the neural mechanisms which underlie linking causal choices to outcomes in complex, real-world tasks.

Acknowledgements

Funding was provided by a Sir Henry Wellcome Postdoctoral Fellowship to EDB, a Senior Research Fellowship from the Wellcome Trust and an award from the James S. McDonnell Foundation to TEB, and a Principal Research Fellowship from the Wellcome Trust to RJD. This work was also in part supported by the Intramural Research Program at the National Institute on Drug Abuse (ZIA DA000642). The opinions expressed in this work are the authors’ own and do not reflect the view of the NIH/DHHS.

Additional information

Author Contributions

P.P.W., L.R. and E.D.B. led the data analysis and writing of the manuscript. L.R. and E.B. acquired the data. E.D.B., Z.K.-N., M.G. and T.E.B. designed and performed analyses. T.E.B. and E.D.B. conceived and designed the experiment and research question; R.J.D., T.E.B. and E.D.B. obtained funding and supervised the study.

Declaration of Interests

The authors declare no competing interests.

Data and Code Availability

Unthresholded group-level statistical maps have been deposited at NeuroVault (https://neurovault.org/collections/17702/) and are publicly available as of the date of publication. Links are listed in the key resources table. All original code has been deposited at Open Science Framework (https://osf.io/b9m6q/?view_only=eb58dd2f2076477c9bb01a8bd430b53d) and is publicly available as of the date of publication. Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Methods

Participants

Twenty participants (11 females; 9 males; mean age = 23.5) were recruited from the general population around University College London to participate in the study. This sample size was commensurate with previous studies of similar design (Boorman et al., 2016; Howard et al., 2015; Jocham et al., 2016). Participants were paid £10 and received a gift card whose value depended on their performance in the task. None of the participants reported a history of neurological or psychiatric disorder. All participants spoke fluent English and had normal or corrected-to-normal vision. The study was approved by the UCL Research Ethics Committee (Project ID Number: 3450/002), and all participants gave written informed consent.

Task Design

Learning task

Participants completed a learning task in which they tracked associations between abstract shapes and specific reward identities (gift cards to two different stores), which were rated for approximately equal desirability. In each trial, participants selected one of two abstract shapes, which were randomly presented on either the left or right side of the screen. Decisions were based on two pieces of information: (1) inferred estimates of the probability that a particular shape would lead to each gift card based on the history of previous trials, and (2) the point value of each gift card on the current trial (Fig. 1A-C). Participants were informed prior to starting the task that one of the trials would be chosen at random to count “for real” at the end of the experiment. For this trial, they would receive money on the awarded gift card that was commensurate with the number of associated points (number of points divided by four). Point values for each outcome were presented as two numbers at the top of the screen, with the color of each number indicating the associated gift card identity. Their position relative to each other (top or bottom) was determined randomly on each trial.

Each shape had a specific probability of leading to each outcome and an inverse probability of leading to the other outcome. For example, shape 1 (S1) might lead to a Starbucks gift card with probability p1 and to an iTunes gift card with probability 1-p1. Shape 2 (S2) would lead to the same outcomes but with independent probabilities p2 and 1-p2, respectively. These true probabilities would drift independently over the course of the experiment, meaning that information about outcome probabilities could not be shared across shapes. On any given trial, the number of points that could be won for each gift card ranged from 20 to 100, with a minimum difference of at least 15 points. Although these magnitudes were predetermined, participants were told they were randomly generated at the beginning of each trial and that it was not useful to track them (Pearson correlation between magnitudes in trial n and n+1 was less than .2). Instead, to maximize rewards, participants had to track the probability that a shape led to each outcome and combine this with the reward magnitudes associated with each outcome on the current trial.
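As an illustration of the schedule structure described above, the following is a minimal sketch and not the schedule actually used in the study. It assumes a bounded Gaussian random walk for each drifting probability (the drift process is not specified in the text), and the names `drifting_probability`, `sample_magnitudes`, and `make_schedule`, the step size, and the probability bounds are all hypothetical. The default of 74 trials corresponds to one run of 37 direct plus 37 indirect trials.

```python
import random


def drifting_probability(n_trials, start=0.5, step_sd=0.05, rng=None):
    """Bounded Gaussian random walk (an assumed drift process)."""
    rng = rng or random.Random()
    p, walk = start, []
    for _ in range(n_trials):
        # Keep the probability away from 0 and 1 (bounds are assumptions).
        p = min(0.95, max(0.05, p + rng.gauss(0.0, step_sd)))
        walk.append(p)
    return walk


def sample_magnitudes(rng, low=20, high=100, min_gap=15):
    """Draw two point values in [low, high] differing by at least min_gap."""
    while True:
        m1, m2 = rng.randint(low, high), rng.randint(low, high)
        if abs(m1 - m2) >= min_gap:
            return m1, m2


def make_schedule(n_trials=74, seed=0):
    """Predetermine outcomes and magnitudes for both shapes on every trial."""
    rng = random.Random(seed)
    p1 = drifting_probability(n_trials, rng=rng)  # P(S1 -> O1)
    p2 = drifting_probability(n_trials, rng=rng)  # P(S2 -> O1), independent walk
    schedule = []
    for t in range(n_trials):
        m_o1, m_o2 = sample_magnitudes(rng)
        schedule.append({
            "p1": p1[t],
            "p2": p2[t],
            "outcome_if_s1": 1 if rng.random() < p1[t] else 2,
            "outcome_if_s2": 1 if rng.random() < p2[t] else 2,
            "m_o1": m_o1,
            "m_o2": m_o2,
        })
    return schedule
```

Because the two walks are generated independently, an outcome observed for one shape carries no information about the other, which is the property the task relies on.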

Each trial began with the two possible choices shown for 0.5s, during which selection was not possible. Participants then had 3.5s to make their selection between the two options. The selected shape was highlighted for 0.5s, before proceeding to the interstimulus interval (ISI), which lasted for a randomly selected duration between 4s and 8s. The outcome was then presented for 2s before a jittered inter-trial interval (ITI) of 4s to 8s.

Participants did not have any prior knowledge about choice-outcome associations or how quickly these associations might change, but they knew that they could change throughout the task.

Therefore, participants needed to infer both the current associative contingency for each shape and when these contingencies changed from their history of choices and observed outcomes.

Template task

Each run of the scanning session began with a “template task”. In this task, participants passively viewed a sequence of all four stimuli (two shapes and two gift cards), individually presented in random order. To ensure that participants were paying attention during passive viewing, they were presented with 4 “catch trials”, which occurred at random between images. In catch trials, all four stimuli were presented simultaneously, and participants were asked to indicate which stimulus had just been presented (Fig. S6). Participants were told they could earn an additional £10 on the selected gift card if they responded correctly. However, £1 would be deducted for each incorrect response or for failing to respond in time (max response time = 3s). Average accuracy for these catch trials was generally high (mean = .75, std = .15). Participants viewed each item for 1s followed by a 2.5s ISI.

Stimuli

Two visually distinct abstract shapes were used as choice objects. These shapes were randomly assigned to serve as S1 or S2 for each participant. The two gift cards were chosen to serve as reward identities during the experiment from 6 different possible gift cards (iTunes, Argos, Blackwells, Marks & Spencers, Boots, and Starbucks). Each participant rated the 6 gift cards on a scale from 0 (not preferable) to 100 (extremely preferable). The two gift cards were selected to have the minimal difference in ratings among the highest rated gift cards. This was done to prevent a strong preference for one outcome over the other. All stimuli were presented on a computer running Presentation® software (Version 18.1, www.neurobs.com).

Task-schedule and procedure

We generated a reward schedule that predetermined the outcome obtained for each choice on each trial, but this schedule was unknown to the participants. We optimized the schedule such that an ideal Bayesian learner (see Bayesian Computational model) would choose each shape and receive each outcome approximately an equal number of times (percent of overall trials where S1 was chosen was between 42% and 57%). This was done to reduce the potential for sampling bias in planned multivariate analyses. The schedule of outcomes for each shape was generated with independently drifting probabilities so participants could not learn anything about one shape from observing the outcome of the other shape (see Fig. S1).

Participants completed three scanning runs in one session. The first two runs began with the template task, which was followed by the learning task (37 trials of the direct transition condition, then 37 trials of the indirect transition condition). The third run consisted of only the template task. The learning task began with instructions stating, “Your latest choice”, indicating that participants were in the direct transition condition. After 37 trials, a second instruction screen showed “Your previous choice”, indicating that participants were about to start the indirect transition condition. Participants knew that in the indirect transition condition, the first outcome observed was not linked to any choice.

In each run, we included three “bonus trials” (two in the direct transition condition and one in the indirect transition condition), distributed throughout choice trials, which occurred between a choice and the outcome. Participants were shown the two gift cards on either side of a question mark and were given the chance to predict which outcome they would receive in the upcoming feedback period. For each correct gift card prediction, they received an additional £3 on the gift card they would receive at the end.

Behavioral Training

Prior to each scanning session, participants completed a shortened (76 trials) behavioral training session. In the training session, participants completed a practice version of the choice task, which had a unique reward schedule. Prior to the practice trials, participants were verbally given a “comprehension quiz” to verify they understood key elements of the task, such as the difference between choice-outcome transitions in each condition. Finally, the distribution of ISI and ITI durations for this session was constrained to 2s to 4s.

MRI data acquisition and preprocessing

The brain images were acquired on a 3 Tesla Siemens Trio scanner using a 32-channel head coil. We used a T2*-weighted echo-planar imaging (EPI) sequence to collect 43 2mm slices in ascending order, with 1 mm gaps. The in-plane resolution was 3 × 3 mm, with a repetition time (TR) of 3.01s and echo-time (TE) of 70ms. We set the slice angle to a 30-degree tilt relative to the rostro-caudal axis to minimize signal loss from the lOFC (Weiskopf et al., 2006) and applied a local z-shim with a moment of −0.4 mT/m to the OFC. The first five volumes of each block were discarded to allow for T1 equilibration effects. For accurate registration of the EPI to a standard space, we acquired a T1-weighted anatomical scan with a magnetization-prepared rapid gradient echo sequence (MPRAGE) with a 1 × 1 × 1 mm resolution. Finally, to measure and correct for geometric distortions due to susceptibility-induced field inhomogeneities, a whole-brain field map with dual echo-time images (TE1 = 10 ms, TE2 = 14.76 ms, resolution 3 × 3 × 3 mm) was also acquired.

We performed slice time correction, corrected for signal bias, and realigned functional scans to the first volume in the sequence using a six-parameter rigid body transformation to correct for motion. Images were then spatially normalized by warping participant-specific images to the MNI (Montreal Neurological Institute) reference brain and smoothed using an 8-mm full-width at half maximum Gaussian kernel. Pre-processing was done in SPM12 (Wellcome Trust Centre for Neuroimaging, http://www.fil.ion.ucl.ac.uk/spm) using Matlab 2018a.

Quantification and Statistical Analyses

Regression Analysis

To test whether participants showed a behavioral effect of learning on choice, we fit logistic regression models estimating the influence of past choice-outcome observations on choices in the current trial t. The regression model included the effect of the past three choices (Ct-n) in combination with the past three observed outcomes (Ot-n). For example, Ct-1Ot-1 represents the influence of the most recent choice and the most recent outcome on the current choice. The model estimates the probability of making choice C on trial t given all 9 combinations of previous choices and outcomes:
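One form consistent with this description is sketched below. It is offered as an assumption about the structure of the model (an intercept plus nine choice-by-outcome interaction terms), not as the exact published equation; the coefficient labels β0 and βij are introduced here for illustration.

```latex
\log \frac{P(C_t = S_1)}{1 - P(C_t = S_1)}
  = \beta_0 + \sum_{i=1}^{3} \sum_{j=1}^{3} \beta_{ij} \, C_{t-i} \, O_{t-j}
```

Under this form, the terms with i = j capture the influence of an outcome paired with the choice made on the same trial, and the remaining terms capture pairings across different trials.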

The value of Ct-n was taken to be 1 if they chose shape S1 on trial t-n and −1 if they chose S2. The value of Ot-n was taken to be 1 if the outcome on trial t-n matched the currently desired outcome, on trial t, and −1 if it did not. The currently desired outcome was assumed to be the outcome with the largest point value in each trial. Thus, the value of Ct-nOt-n for each trial was 1 if choice C led to the currently desired outcome n-trials back and −1 if it did not:

We fit separate regression models for each condition in each run for every participant. We then averaged the resulting regression coefficients (β) across runs, resulting in the participant-specific influence of previous decisions on the current choice.
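The coding scheme for the nine regressors can be sketched as follows. This is a minimal illustration of the ±1 coding described above, under the assumption that shapes and outcomes are stored as the integers 1 and 2; the function names and data layout are hypothetical.

```python
def desired_outcome(m_o1, m_o2):
    """The currently desired outcome is the one with the larger point value."""
    return 1 if m_o1 > m_o2 else 2


def interaction_regressors(choices, outcomes, desired_now, t, n_back=3):
    """Build the n_back x n_back products C(t-i) * O(t-j) for trial t.

    choices[k] and outcomes[k] are 1 or 2 on trial k; desired_now is the
    desired outcome (1 or 2) on the current trial t.
    """
    feats = {}
    for i in range(1, n_back + 1):
        # +1 if S1 was chosen i trials back, -1 if S2 was chosen.
        c = 1 if choices[t - i] == 1 else -1
        for j in range(1, n_back + 1):
            # +1 if the outcome j trials back matches the currently desired one.
            o = 1 if outcomes[t - j] == desired_now else -1
            feats[(i, j)] = c * o
    return feats
```

For example, `interaction_regressors(choices, outcomes, desired_outcome(40, 90), t)` returns a dictionary keyed by (i, j), whose values could be stacked across trials to form the design matrix of the logistic regression.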

Bayesian Computational model

We used a Bayesian computational model to predict choices in each trial t based on each participant’s previously observed shape-outcome relationships (i.e., the estimated associative probability) and the reward magnitudes in the current trial. We briefly describe the model here, but a full description can be found in Behrens et al. (2007; see also Arulampalam et al., 2002, for a related model).

Since the true probability of the associative contingencies cannot be observed, the model estimated, in a Markovian fashion, the subjective belief that choosing a given shape (S) would lead to outcome 1 (O1), and to outcome 2 (O2) with the inverse probability:

where ps denotes the associative probability of a given shape S leading to O1. On each trial (t) the model estimated the current value of pst, based on the previous observations of outcomes y1:t. We modeled beliefs about the likelihood of each contingency as a beta distribution over possible values of pst:

where pst is the mean of the beta distribution and V = exp (v) describes the variance. A large value of v means that the value of pst is likely to change in the next trial whereas low values of v mean that it is unlikely to change. Here, v is referred to as the “volatility” because it controls the learning rate for shape-outcome associations. The change in the estimated volatility from previous trial to the current trial is controlled by k. This describes the model’s belief that some level of change in the volatility is going to occur in the next trial. Because there are no constraints on values for vt, this distribution can be modeled as a Gaussian:

After observing each piece of evidence about the contingency between shape S and the outcome, the estimate of each parameter could then be updated following Bayes rule

This gives us the three-dimensional joint probability distribution over the parameters. On each trial, the learner only needs to know the estimated contingency between a shape and outcome, which is obtained by first marginalizing over v and k:

And then taking the mean of the resulting distribution.
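The displayed equations for these steps are not reproduced in the text above. As a hedged summary, the sequence can be written in the general form of the volatility model of Behrens et al. (2007), using the symbols defined in this section. The exact parameterization used in this study may differ, and the symbol K = exp(k) is an assumption carried over from that earlier model.

```latex
\begin{aligned}
& P(O_1 \mid S) = p^{s}, \qquad P(O_2 \mid S) = 1 - p^{s} \\[4pt]
& p\!\left(p^{s}_{t+1} \mid p^{s}_{t}, v_{t+1}\right) \sim \mathrm{Beta}\!\left(p^{s}_{t}, V\right),
  \qquad V = \exp(v_{t+1}) \\[4pt]
& p\!\left(v_{t+1} \mid v_{t}, k\right) \sim \mathcal{N}\!\left(v_{t}, K\right),
  \qquad K = \exp(k) \\[4pt]
& p\!\left(p^{s}_{t+1}, v_{t+1}, k \mid y_{1:t+1}\right) \propto
  p\!\left(y_{t+1} \mid p^{s}_{t+1}\right)
  \iint p\!\left(p^{s}_{t}, v_{t}, k \mid y_{1:t}\right)
  p\!\left(v_{t+1} \mid v_{t}, k\right)
  p\!\left(p^{s}_{t+1} \mid p^{s}_{t}, v_{t+1}\right)
  \, dp^{s}_{t} \, dv_{t} \\[4pt]
& p\!\left(p^{s}_{t} \mid y_{1:t}\right) =
  \iint p\!\left(p^{s}_{t}, v_{t}, k \mid y_{1:t}\right) \, dv_{t} \, dk,
  \qquad
  \hat{p}^{s}_{t} = \mathbb{E}\!\left[p^{s}_{t} \mid y_{1:t}\right]
\end{aligned}
```

Here Beta(m, V) denotes a beta distribution parameterized by its mean m and a spread term V, as in the description above.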

For each participant, we initialized the model with a uniform prior over the entire parameter space. All integral computations are performed using numerical grid integration. We then used the prior belief in the associative contingencies to compute the expected value of each shape on each trial according to the following formula:

where α was a free parameter that reflected a participant’s preference for O1 over O2 (0< α <2), and mo1t and mo2t indicated the reward magnitudes of the outcomes available in the current trial, t. We then measured the likelihood of each participant’s choice on each trial according to a softmax function:

where the free parameter b captured the level of sensitivity of choices to expected values (inverse temperature; 0<b<1). Free parameters were fitted using Markov Chain Monte Carlo (see below).
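The mapping from beliefs to choice probabilities can be sketched as below. The expected-value equation is not reproduced in the text, so the weighting used here (α on O1 and 2 − α on O2) is only one assumption that is consistent with the stated bound 0 < α < 2; the function names are hypothetical. The two-option softmax is written in its equivalent logistic form.

```python
import math


def expected_value(p_o1, m_o1, m_o2, alpha=1.0):
    """Assumed form: outcome-preference-weighted expected points for one shape.

    p_o1 is the believed probability that the shape leads to O1;
    alpha = 1 means no preference between the two gift cards.
    """
    return alpha * p_o1 * m_o1 + (2.0 - alpha) * (1.0 - p_o1) * m_o2


def softmax_choice_prob(ev_s1, ev_s2, b=0.5):
    """Probability of choosing S1 under a two-option softmax with inverse temperature b."""
    return 1.0 / (1.0 + math.exp(-b * (ev_s1 - ev_s2)))
```

With b close to 0 choices approach chance regardless of the expected values, whereas larger b makes choices track the difference in expected value more deterministically.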

Value Based RL-model

This model estimated the value of each shape given the history of rewards received from choosing the shape. The value of each shape was initiated at 0, then updated using the following equation:

where Rt is the magnitude of the reward on trial t and α is an individual difference term estimating a participant’s preference for one outcome over the other (0< α <2). The learning rate (δ) was estimated for each participant to capture the magnitude of the update (0< δ <1). We entered these values into a softmax function to generate choice probabilities:

where the free parameter b captured the level of sensitivity of choices to expected values (inverse temperature; 0<b<1). Free parameters were fitted using Markov Chain Monte Carlo (see below).
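A value update of this kind is commonly written as a delta rule. The sketch below assumes that form, and it assumes that α scales the reward for O1 while 2 − α scales the reward for O2; neither detail is confirmed by the text, and the function name is hypothetical.

```python
def update_value(value, reward, outcome, alpha=1.0, delta=0.5):
    """Delta-rule update of a shape's value (an assumed form).

    reward is the point magnitude received, outcome is 1 or 2,
    alpha weights the two outcomes, and delta is the learning rate.
    """
    weight = alpha if outcome == 1 else 2.0 - alpha
    return value + delta * (weight * reward - value)
```

Starting from a value of 0, repeated rewards of the same magnitude move the value toward the weighted reward at a rate set by δ.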

Parameter estimates

The Bayesian learning model had two free parameters, α and b. The value-based RL model had an additional parameter, δ. We fit these parameters independently for each participant using custom Markov Chain Monte Carlo (MCMC) code in MATLAB R2018a. Model parameters were bounded as follows: [0<α<2], [0<b<1], [0< δ <1], and were initialized at α=1, b=.5, and δ=.5. Each model was fit to maximize the likelihood of a participant’s choices given model estimates of the expected value of each choice on each trial (Eq.10; Eq.12).
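The authors' custom MATLAB sampler is not described in detail. As a generic illustration of bounded MCMC fitting, the sketch below uses a random-walk Metropolis sampler with a flat prior inside the parameter bounds; the choice of sampler, the proposal width, and the function name `metropolis` are assumptions, not a description of the code used in the study.

```python
import math
import random


def metropolis(log_lik, init, bounds, n_samples=5000, step_sd=0.05, seed=0):
    """Random-walk Metropolis sampling with a flat prior inside `bounds`.

    log_lik maps a parameter list to a log-likelihood; bounds is a list of
    (low, high) pairs; proposals outside the bounds are rejected.
    """
    rng = random.Random(seed)
    current = list(init)
    current_ll = log_lik(current)
    samples = []
    for _ in range(n_samples):
        proposal = [x + rng.gauss(0.0, step_sd) for x in current]
        inside = all(lo < x < hi for x, (lo, hi) in zip(proposal, bounds))
        if inside:
            proposal_ll = log_lik(proposal)
            # Accept with probability min(1, likelihood ratio).
            accept_prob = math.exp(min(0.0, proposal_ll - current_ll))
            if rng.random() < accept_prob:
                current, current_ll = proposal, proposal_ll
        samples.append(tuple(current))
    return samples
```

In use, `log_lik` would sum the log choice probabilities over trials for a given (α, b) or (α, b, δ), the chain would be initialized at the values given above, and point estimates could be taken from the retained samples after discarding an initial burn-in.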

Multivariate decoding of causal choice and pending causal choice representations

Using multivariate pattern analysis (MVPA), we aimed to identify regions of the brain that coded knowledge of causal choices during the feedback period. To test this, we estimated the BOLD activity patterns during the feedback phase for each trial using unsmoothed preprocessed images. The feedback periods were modeled as boxcars that had a constant duration lasting 2000ms from the onset of the outcome presentation in each trial. The GLM also included regressors for the decision period (modeled as boxcars with a duration equal to RT) and template presentations (modeled as boxcars with a 1000ms duration). No parametric modulators were added. Each trial was labeled according to which shape was chosen during the choice period (either S1 or S2). For our analysis of “pending” representations in the indirect transition condition, we linked these labels to the immediately following, interim feedback phase - a time when participants should be delaying credit assignment in anticipation of assigning credit in the next trial.

We used a searchlight procedure to identify regions of the brain that contained representations of the causal choice. Each searchlight consisted of a 5×5×5 voxel cube placed around a centroid voxel in the brain. Each centroid was required to have values in at least 10 of the surrounding voxels to be considered for further processing. The activity in each trial was standardized by z-scoring the β-values across voxels within each searchlight. The data were then split into training and test sets by run. We used LIBSVM (Chang & Lin, 2011) to fit linear classifiers with training data, which were subsequently used to classify data points from the test set. We iterated through this process for each of the two runs and then computed the mean decoding accuracy (average proportion of correct classifications) across both classifiers. The mean decoding accuracy for each voxel was compared to a voxel-specific null distribution, which was estimated by repeating this procedure while randomly assigning the labels for 100 permutations at each searchlight. The mean classification accuracy of this null distribution was subtracted from the classification accuracy of each searchlight to give us a measure of how reliably information about the causal choices could be decoded above chance. The resulting maps were then spatially smoothed using a Gaussian kernel with full width at half maximum of 8mm.
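The logic applied within a single searchlight (z-scoring across voxels, cross-run classification, and subtraction of a permutation baseline) can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier is swapped in for the linear SVM fitted with LIBSVM, so this illustrates the procedure and is not a reimplementation of the analysis; all function names are hypothetical.

```python
import random
import statistics


def zscore(pattern):
    """Standardize one trial's pattern across the voxels of the searchlight."""
    mu = statistics.fmean(pattern)
    sd = statistics.pstdev(pattern) or 1.0
    return [(x - mu) / sd for x in pattern]


def fit_predict(train_x, train_y, test_x):
    """Nearest-centroid classifier (a stand-in for the linear SVM)."""
    cents = {}
    for lab in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == lab]
        cents[lab] = [statistics.fmean(col) for col in zip(*rows)]

    def nearest(x):
        return min(cents, key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, cents[lab])))

    return [nearest(x) for x in test_x]


def cross_run_accuracy(runs):
    """Train on all other runs, test on each run, and average the accuracies."""
    accs = []
    for i, (test_x, test_y) in enumerate(runs):
        train_x = [x for j, (xs, _) in enumerate(runs) if j != i for x in xs]
        train_y = [y for j, (_, ys) in enumerate(runs) if j != i for y in ys]
        pred = fit_predict(train_x, train_y, test_x)
        accs.append(sum(p == y for p, y in zip(pred, test_y)) / len(test_y))
    return statistics.fmean(accs)


def corrected_accuracy(runs, n_perm=100, seed=0):
    """True accuracy minus the mean accuracy obtained with shuffled labels."""
    rng = random.Random(seed)
    runs = [([zscore(x) for x in xs], list(ys)) for xs, ys in runs]
    true_acc = cross_run_accuracy(runs)
    null = []
    for _ in range(n_perm):
        shuffled = [(xs, rng.sample(ys, len(ys))) for xs, ys in runs]
        null.append(cross_run_accuracy(shuffled))
    return true_acc - statistics.fmean(null)
```

Here `runs` is a list with one `(patterns, labels)` pair per run, where each pattern is the list of β-values for one trial within the searchlight. The returned value would be written to the centroid voxel of that searchlight.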

Group-level analyses were performed using a one-sample t-test on accuracy maps across participants (see Group-level statistical inference). We corrected for multiple comparisons over a priori defined ROIs in lOFC, HC, and FPl, and used functionally defined ROIs for lOFC in a data-driven ROI analysis (see Fig. S3-5). Multiple comparisons were corrected using TFCE with small-volume correction. The threshold for significance remained the same in all analyses (pTFCE <.05).

Multivariate analyses of information connectivity between regions

To test whether decoding of the causal choice at feedback in the indirect transition condition depended on the strength of “pending” representations held during the interim trial, we tested whether the fidelity of representations of the pending causal choice in FPl was associated with the fidelity of those same choices at the time of credit assignment (i.e., in the feedback phase of the next trial). We used the same decoding procedure mentioned above to classify voxel patterns at feedback in each trial, but additionally calculated the distance of each pattern from the hyperplane that divides categories. Distances were obtained using the equation specified on the LIBSVM webpage (https://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html). Patterns that are more distant from the hyperplane can be thought of as having higher fidelity, and those that are closer to the hyperplane as having less (Schuck & Niv, 2019). We then signed the distance of each point according to whether the predicted category label was correct (+ for correct, – for incorrect).
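The signed-distance measure can be sketched for a linear classifier as below. The distance of a pattern from the hyperplane is the absolute decision value divided by the norm of the weight vector, and the sign encodes whether the predicted label was correct. The function name and the ±1 label coding are assumptions made for the example.

```python
import math


def signed_hyperplane_distance(w, b, x, true_label):
    """Distance |w.x + b| / ||w||, positive if the prediction is correct.

    w and b define a linear decision boundary; labels are coded +1 / -1.
    """
    decision = sum(wi * xi for wi, xi in zip(w, x)) + b
    distance = abs(decision) / math.sqrt(sum(wi * wi for wi in w))
    predicted = 1 if decision >= 0 else -1
    return distance if predicted == true_label else -distance
```

A pattern that is classified correctly and lies far from the boundary therefore receives a large positive value, whereas a confidently misclassified pattern receives a large negative value.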

First, we calculated the trial-by-trial distance from the hyperplane when causal choice information was believed to be held in a “pending” state, focusing on FPl as our “seed-region”. For this, we calculated the average distances for voxels within the FPl that showed significant decoding of the pending choice during the interim feedback period (t(19)=2.54, p<.01 uncorrected). This gave us a measure of the information about the pending item on each trial. We calculated the decoding strength of these same choices when the true outcome was shown, as a measure of the information about the causal choice during credit assignment. Here, we calculated distances for every 5×5×5 voxel cube using the same searchlight procedure we described above. Note that the decoding fidelity metric at each time point represents the decodability of the same choice at different phases of the task. We then correlated the decoding distance for representations in FPl during the “pending” state with the decoding distance of those same choices at credit assignment. Thus, the correlation between them gives us a measure of whether strong representations of pending causal choices in FPl predict stronger representations at credit assignment.

To confirm that this correlation did not simply arise because the classifier in each region is “less wrong” when the decoder in FPl makes correct classifications (i.e., all classifications were wrong, but the test region was less wrong), we performed two control analyses. First, we calculated the frequency of correct classifications for the subset of trials in which FPl also showed correct classifications. We then compared this frequency to a permuted baseline obtained by randomizing trial distances in the searchlight and recomputing the frequency of correct classifications. We subtracted the mean of the randomized baseline from the true frequency of correct classifications. This gave us a measure of decoding accuracy in each searchlight when FPl showed correct decoding. Our second control analysis involved rerunning the classification procedure (see Multivariate decoding of causal choice and pending causal choice representations), but only for trials in which the FPl had already shown correct decoding of the causal choice in a pending state. Again, we compared the accuracy of the classifier in each searchlight to a randomized baseline by randomizing trial labels and recomputing the accuracy of the classifier. The mean of the randomized distribution was then subtracted from the classification accuracy using the true labels.

Group-level analyses were performed by Fisher-z transforming the correlation values then using a one-sample t-test on each voxel. We corrected for multiple comparisons using TFCE correction on the resulting volumes within a priori defined ROIs. The same thresholds were applied for group level statistical correction (pTFCE <.05).
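The per-voxel computation behind this analysis (a trial-by-trial correlation per participant, a Fisher z-transform, and a one-sample t-test across participants) can be sketched as follows. The function names are hypothetical, and the TFCE correction step is not included.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def fisher_z(r):
    """Fisher z-transform; undefined for r = +1 or -1."""
    return math.atanh(r)


def one_sample_t(values):
    """t-statistic testing whether the mean of `values` differs from zero."""
    n = len(values)
    return statistics.fmean(values) / (statistics.stdev(values) / math.sqrt(n))
```

For each participant, `pearson_r` would take the signed FPl distances from the interim feedback period and the signed distances in a given searchlight at the subsequent feedback period; the resulting values, after `fisher_z`, would be passed to `one_sample_t` across the 20 participants.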

Multivariate analyses of identity codes during credit assignment

To test whether the task-independent identity of the causal choice was reinstated during feedback, we trained a linear support vector machine (SVM) to decode representations of causal choice stimuli, but trained the classifier on periods when participants passively viewed the stimuli outside of the task context (see “Template task”). In each condition, the SVM was trained on all the trials of the three template runs and tested during the feedback period of the learning task. For each participant and in each trial, we estimated the BOLD activity patterns using the same GLM as described above (see “Multivariate decoding of causal choice and pending causal choice representations”). Further, we used the same procedure in which we randomly permuted the training labels 100 times to create a null distribution of decoding accuracy. We then averaged decoding accuracy over runs and subtracted the mean of the null distribution from the true decoding accuracy of the classifier.

To test for associations between credit assignment precision and causal choice identity decoding accuracy, we first generated estimates of credit assignment precision based on each participant’s behavior during the task. For each participant, we created a behavioral matrix, which included β-values from the nine combinations of possible choice-outcome relationships used to assign credit when an outcome is observed (see “Regression Analysis”). For the direct transition condition, values along the diagonal of this matrix represent appropriate credit assignment given the task structure and should have high positive values if the participant is assigning credit precisely. All other values should be near 0. A similar matrix can be generated for the indirect transition condition, with the appropriate cells determined by the causal structure of this condition (see Fig. 1E). Next, we created a comparison matrix based on an idealized learner, with values of 1 in each cell that represented appropriate credit assignment for the condition, and values of 0 for non-causal relationships. We then correlated each participant-specific behavioral matrix with the comparison matrix. High correlation values represent more precise credit assignment, and the average across conditions was taken to be a measure of the overall credit precision in the learning task. We then regressed each participant’s overall credit precision estimate against voxel-level decoding accuracy across participants. We corrected for multiple comparisons using TFCE correction applied to volumes within predefined ROIs. The same thresholds were applied for group-level statistical correction (pTFCE <.05).
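The credit precision measure can be sketched as a correlation between a participant's 3×3 matrix of regression coefficients and an idealized matrix. The example below uses the identity matrix for the direct transition condition, as described above; the ideal matrix for the indirect condition depends on the causal structure shown in Fig. 1E and is therefore left as an input. The function names are hypothetical.

```python
import math
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def credit_precision(beta_matrix, ideal_matrix):
    """Correlate the flattened behavioral matrix with the idealized matrix."""
    flat_beta = [v for row in beta_matrix for v in row]
    flat_ideal = [v for row in ideal_matrix for v in row]
    return pearson_r(flat_beta, flat_ideal)


# Idealized learner for the direct transition condition: credit on the diagonal only.
IDEAL_DIRECT = [[1, 0, 0],
                [0, 1, 0],
                [0, 0, 1]]
```

A participant whose coefficients are large on the diagonal and near zero elsewhere yields a value close to 1, whereas credit spreading to non-causal choice-outcome pairings lowers the value.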

Group-level statistical inference

Group-level testing was done using a one-sample t-test (df=19) on the cumulative functional maps generated by the first-level analysis. All first-level maps were smoothed prior to being combined and tested at the group level. To correct for multiple comparisons, we first extracted voxels from each ROI in each participant’s first-level activation map, then applied Threshold-Free Cluster Enhancement (TFCE), which uses permutation testing and accounts for both the height and extent of the cluster (Smith & Nichols, 2009). TFCE parameters were set to their defaults (H=2, E=0.5), and 5000 permutations were used for the analysis. We report effects that surpassed a pTFCE< .05 threshold in each ROI.
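For reference, the role of the H and E parameters can be seen in the standard definition of the TFCE score (Smith & Nichols, 2009). The notation below is a sketch: h_p denotes the statistic value at voxel p, e(h) the extent of the cluster containing p when the map is thresholded at height h, and h_0 the lower integration limit (typically zero).

```latex
\mathrm{TFCE}(p) = \int_{h = h_0}^{h_p} e(h)^{E} \, h^{H} \, dh,
\qquad E = 0.5, \quad H = 2
```

The exponent H weights the height of the statistic and E weights the spatial extent, so the enhanced value at each voxel reflects support from both without requiring a fixed cluster-forming threshold.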

Region of interest selection

Regions of interest in the prefrontal cortex were generated from anatomically defined regions with unique functional connectivity fingerprints (Neubert et al., 2015). The lOFC ROIs corresponded to bilateral area BA11 (indexes 9 and 30). We included these regions because they have been previously implicated in credit assignment for causal choices, particularly in similar contingency learning tasks (Boorman et al., 2016; Jocham et al., 2016). For the lateral frontal pole, we used indexes 14 and 35. All of these ROIs were thresholded at a 60% inclusion criterion, although our results did not qualitatively change at different thresholds. Finally, we used a priori anatomically defined bilateral HC ROIs to test for effects in hippocampus (Yushkevich et al., 2015). These ROIs are illustrated in Fig. S6.

Supplemental Figures

Follow up behavioral analyses

A. Example trajectory across the experiment of the belief estimates generated from the Bayesian learner. Top is the trajectory of S1, and the bottom is the trajectory of S2. The true probability trajectory is shown in white and the estimated belief is shown in pink. Color heatmap shows the probability mass for each possible belief in Sx ->O1. B. Comparison of model fits between our Bayesian model and a value-based RL model (vRL) which used an interactive updating procedure to track the value of each shape based on the history of received rewards. The exceedance probability for the Bayesian model was 1, and 0 for the vRL model, suggesting that the Bayesian model, which tracked transition probabilities between choices and outcomes, better fit participants’ actual choices compared to a value tracking model. C. Logistic regression curves estimating the change in choice probabilities given the expected value difference between choices. Gray lines show participant-specific fits, and the black line shows the effect across groups (associated t-statistics are calculated across participants). The left side shows the effect in the direct transition condition and the right side shows the indirect transition condition.

Pre-selected anatomical ROIs

Illustrations of pre-selected anatomical ROIs taken from Neubert et al., 2015. The lOFC ROI corresponds to indexes 9 and 30; FPl corresponds to indexes 14 and 35. The HC ROI was defined in Yushkevich et al., 2015.

Functionally defined ROIs in the direct transition condition

A) Despite having a priori defined anatomical ROIs for our decoding analysis of the causal choice, we wanted to test whether our results depended on these ROI definitions by using a data-driven approach. Here, we trained an SVM classifier to decode representations of the causal choice in run 1 of the direct transition condition, then tested the decoder on run 2 to find regions of the orbitofrontal cortex (OFC) and hippocampus (HC) that significantly decoded causal choice representations at a significance level of t(19) > 2.54, p < .01, uncorrected. We then used these regions as ROIs for a separate analysis which trained the classifier in run 1 and tested the classifier in run 2. B) Shows ROIs generated from the same procedure as described in A, but the use of each run for training and testing are switched.

Main effect of choice decoding accuracy at the time of feedback, TFCE corrected, in each run of the direct transition condition

A. Regions of the OFC showing significant decoding of the causal choice in run 1 of the direct transition condition. Significance was tested using TFCE correction over voxels within the ROI generated from run 2, using the procedure described above (Fig. S1). For illustration, we show voxels that survive a threshold of t(19) = 1.73, p < .05, uncorrected. B. Same as A, but for voxels in run 2, using the ROI generated from run 1.

Significant information connectivity between FPl and OFC in functionally defined ROIs from the direct transition condition

A. We did not observe significant decoding of the causal choice in a bilateral OFC ROI defined by significant clusters in the indirect transition condition. Thus, we used the accuracy map for decoding choices at feedback during the direct transition condition (t(19) > 1.73; p < .05) in the OFC, averaged across runs. B. We then used those clusters as ROIs for TFCE correction when testing for regions of the lOFC that showed significant information connectivity with FPl. We did this by testing for significant correlations between the trial-by-trial fidelity of pending representations in the FPl and causal choice representations during feedback in lOFC (see Methods).
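The core of an information connectivity test is a correlation, across trials, between decoding fidelity in two regions. The sketch below illustrates this for a single participant, assuming a Pearson correlation followed by a Fisher z-transform for group-level testing; the correlation measure actually used is specified in the Methods, and the fidelity values here are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher z-transform, commonly applied before group-level t-tests."""
    return math.atanh(r)


# Hypothetical trial-by-trial fidelity values (e.g. signed classifier
# evidence for the correct class) for one participant.
fpl_pending_fidelity = [0.8, -0.2, 0.5, 1.1, -0.4, 0.3]
lofc_feedback_fidelity = [0.6, -0.1, 0.2, 0.9, -0.5, 0.4]

r = pearson_r(fpl_pending_fidelity, lofc_feedback_fidelity)
z = fisher_z(r)
```

A positive `z` indicates that trials with higher-fidelity pending representations in FPl tended to have higher-fidelity causal choice representations in lOFC at feedback.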

Depiction of catch trials

A. To ensure that participants remained attentive, we included valuable catch trials in the passive-observation “template task”. Participants were asked to report which image out of the four (2 gift cards and 2 stimuli) was the last one presented on the screen. They were endowed with an extra £10, from which we removed £1 for every incorrect response. There were four catch trials per template run. B. The decision task included “bonus trials” in which participants could predict which gift card they expected to see on the subsequent feedback screen given their choice. For every correct answer, they were given an extra £3 on the final gift card that was given to them. The first run of the direct transition condition had two catch trials; the second run had one. Both runs of the indirect transition condition had one catch trial each.

Control Analysis for Pending-to-Credit Assignment Information Connectivity in the Indirect Transition Condition

A. Axial (left) and coronal (right) slices through a t-statistic map showing the results of a control analysis in which we tested the proportion of correct classifications of causal stimulus information in OFC and HPC at the time of the outcome for trials in which the FPl showed correct classification of the causal stimulus during pending trials. The proportion of correct trials was compared to a permuted baseline of randomly drawn trials for each participant, then combined over participants to create a t-statistic. B. Secondary control analysis in which we reran the classification analysis for causal choice stimulus information at the time of outcome, but only on trials where FPl was found to correctly decode pending causal choice information. Note that this test differs from A because we allowed the classifier to create a new hyperplane separating categories for only those trials in which the FPl decoding was “correct”. For illustration, all maps are displayed at a threshold of t(19) = 2.54, p < .01, uncorrected. All effects survive small volume correction in a priori defined anatomical ROIs.
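The comparison in panel A can be sketched as follows: for one participant, compute the proportion of correct outcome-time classifications on trials where FPl decoding was correct, compare it with a baseline from randomly drawn trial subsets of the same size, and then combine the per-participant differences into a one-sample t-statistic. This is a minimal sketch under assumptions: the number of random draws, the seed, the trial vectors, and the function names are all hypothetical.

```python
import math
import random


def conditional_accuracy(target_correct, fpl_correct):
    """Proportion correct in the target region on FPl-correct trials."""
    selected = [t for t, f in zip(target_correct, fpl_correct) if f]
    return sum(selected) / len(selected)


def permuted_baseline(target_correct, n_selected, n_perm=1000, seed=0):
    """Mean accuracy over random trial subsets of the same size."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_perm):
        sample = rng.sample(target_correct, n_selected)
        draws.append(sum(sample) / n_selected)
    return sum(draws) / n_perm


def one_sample_t(values):
    """t-statistic for the mean of `values` against zero."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean / math.sqrt(variance / n)


# Hypothetical per-trial decoding outcomes for one participant.
fpl_correct = [True, True, False, True, False, True, False, False]
ofc_correct = [True, True, False, True, False, False, True, False]

observed = conditional_accuracy(ofc_correct, fpl_correct)
baseline = permuted_baseline(ofc_correct, n_selected=sum(fpl_correct))
difference = observed - baseline  # one value per participant
```

Collecting `difference` for every participant and passing the list to `one_sample_t` gives a group-level statistic of the kind mapped in panel A.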