Neuroscience

Computer code comprehension shares neural resources with formal logical inference in the fronto-parietal network

  1. Yun-Fei Liu (corresponding author)
  2. Judy Kim
  3. Colin Wilson
  4. Marina Bedny
  1. Johns Hopkins University, United States
Research Article
Cite this article as: eLife 2020;9:e59340 doi: 10.7554/eLife.59340

Abstract

Despite the importance of programming to modern society, the cognitive and neural bases of code comprehension are largely unknown. Programming languages might ‘recycle’ neurocognitive mechanisms originally developed for natural languages. Alternatively, comprehension of code could depend on fronto-parietal networks shared with other culturally-invented symbol systems, such as formal logic and symbolic math (e.g. algebra). Expert programmers (average 11 years of programming experience) performed code comprehension and memory control tasks while undergoing fMRI. The same participants also performed formal logic, symbolic math, executive control, and language localizer tasks. A left-lateralized fronto-parietal network was recruited for code comprehension. Patterns of activity within this network distinguished between ‘for’ loops and ‘if’ conditional code functions. In terms of the underlying neural basis, code comprehension overlapped extensively with formal logic and to a lesser degree with math. Overlap with executive processes and language was low, but laterality of language and code covaried across individuals. Cultural symbol systems, including code, depend on a distinctive fronto-parietal cortical network.

Introduction

In 1800, only twelve percent of the world’s population knew how to read, while today the world literacy rate is over eighty-five percent (https://ourworldindata.org/literacy). The ability to comprehend programming languages may follow a similar trajectory. Although only an estimated 0.5% of the world’s population is currently proficient at computer programming, the number of jobs that require programming continues to grow. Coding is essential in scientific fields and in areas as diverse as artistic design, finance, and healthcare. As many industries incorporate artificial intelligence or other information technologies, more people seek to acquire programming literacy. However, the cognitive and neural mechanisms supporting coding remain largely unknown. Apart from its intrinsic and societal interest, programming is a case study of ‘neural recycling’ (Dehaene and Cohen, 2007). Computer programming is a very recent cultural invention that the human brain was not evolutionarily adapted to support. Studying the neural basis of code offers an opportunity to investigate how the brain performs novel complex skills.

Hypotheses about how the human brain accommodates programming range widely. One recently popular view is that code comprehension recycles mechanisms developed for human language (Fedorenko et al., 2019; Fitch et al., 2005; Pandža, 2016; Portnoff, 2018; Prat et al., 2020). Consistent with this idea, a recent study reported that individual differences in the ability to learn a second language predict aptitude in learning to program (Prat et al., 2020). Computer languages borrow letters and words from natural language and, in some programming languages like Python, the meanings of the borrowed symbols (e.g. if, return, print) relate to the meanings of the same symbols in English. As in natural languages, the symbols of code combine generatively according to a set of rules (i.e. a formal grammar). The grammars of language and code share common features, including recursive structure (Fitch et al., 2005). In natural languages, a phrase can be embedded within another phrase of the same syntactic category (Hauser et al., 2002; Yang et al., 2017). Analogously, in programming languages, data structures, such as lists and trees, can be recursive and a function can call itself. For example, in Python, IF conditionals can be embedded within IF conditionals:

if (condition_1):
        if (condition_2):
                print('Both conditions are True.')
        else:
                print('Condition_1 is True, condition_2 is False.')
else:
        print('Condition_1 is False. Condition_2 not evaluated.')

To give another textbook example, the factorial of a positive integer N can be computed using a recursive Python function:

def factorial(N):
        return N*factorial(N-1) if (N > 1) else 1

Here, the function ‘factorial’ is called in the definition of itself. Given these similarities between programming languages and natural languages, one possibility then is that coding recycles neurocognitive mechanisms involved in producing and comprehending natural language. Other culturally-invented symbol systems, such as formal logic and mathematics, do not appear to depend on the same neural network as natural language. Like code, formal logic and mathematics borrow symbols from language and are also hierarchical and recursive (e.g. (7*(7*(3+4)))). Unlike language, however, culturally-invented symbol systems are explicitly taught later in life. Computer coding, mathematics, and logic all involve manipulation of arbitrary variables without inherent meaning (e.g. X, Y, input, ii) according to a set of learned rules (McCoy and Burton, 1988). While each symbol system has its own conventionalized way of referring to variables and its own set of rules — indeed, these aspects differ somewhat among programming languages — there are nevertheless many common features. For example, conditional and other connectives (e.g. ‘if…then’, ‘and’, ‘or’, ‘not’) occur in both formal logic and programming languages with closely related meanings. Consider a function containing an if conditional written in Python:

def fun(input):
        result = "result: "
        if input[0]=='a':
        result += input[0].upper()
        return result

The value of the ‘result’ variable depends on whether the ‘input’ meets the specific conditions of the if statement. Similarly, in the logical statement ‘If both X and Z then not Y’ the value of the result (Y) depends on the truth value of the condition ‘both X and Z’. One hypothesis, then, is that coding depends on similar neural resources as other culturally-invented symbol systems, such as formal logic and math.
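The dependence of ‘result’ on the input can be seen by running a self-contained copy of the function above on two concrete inputs (the example inputs are our own):

```python
def fun(input):
        result = "result: "
        if input[0] == 'a':
                result += input[0].upper()
        return result

# The if branch fires only when the input starts with 'a'
print(fun('apple'))   # prints "result: A"
print(fun('berry'))   # prints "result: "
```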

Rather than recruiting perisylvian fronto-temporal areas, logic and math recruit a fronto-parietal network, including the dorsolateral prefrontal cortex (PFC) and the intraparietal sulcus (IPS) as well as putative symbol representations (i.e. the number form area) in inferior temporal cortex (Amalric and Dehaene, 2016; Coetzee and Monti, 2018; Goel et al., 2007; Monti et al., 2009). This fronto-parietal network overlaps partially with the so-called central executive/working memory system, which is implicated in a variety of cognitive tasks that involve maintaining and manipulating information in working memory, processes that are part and parcel of understanding and writing code (Brooks, 1977; Duncan, 2010; Letovsky, 1987; Miller and Cohen, 2001; Soloway and Ehrlich, 1984; Weinberg, 1971; Zanto and Gazzaley, 2013) (for a review of the cognitive models of code comprehension, see Von Mayrhauser and Vans, 1995). The central executive system is usually studied using simple rule-based tasks, such as the multisource interference task (MSIT), Stroop task, and spatial or verbal working memory (Banich et al., 2000; Bunge et al., 2000; Bush and Shin, 2006; January et al., 2009; Milham et al., 2001; Woolgar et al., 2011; Zanto and Gazzaley, 2013; Zhang et al., 2013). Logic and math activate a similar network but also have unique neural signatures. Within the PFC, logic in particular recruits more anterior regions associated with more advanced forms of reasoning and symbol manipulation (Coetzee and Monti, 2018; Ramnani and Owen, 2004). The degree to which code comprehension relies on the same network as these other cultural symbol systems is not known.

Only a handful of previous studies have looked at the neural basis of code processing (Duraes et al., 2016; Floyd et al., 2017; Ikutani and Uwano, 2014; Peitek et al., 2018; Siegmund et al., 2014; Huang et al., 2019). Two studies observed larger fronto-parietal responses when comparing code writing and editing to prose writing and editing (Floyd et al., 2017; Krueger et al., 2020). When comprehension of code was compared to detection of syntactic errors in code, activity in both fronto-parietal and putative language areas was observed (Siegmund et al., 2014, Siegmund, 2017). None of these prior studies localized neural networks involved in language, or cultural symbol systems such as math and logic, in the same participants — leaving the question of direct neural overlap unanswered.

The goal of the current study was to ask whether basic computer code comprehension has a consistent neural signature across people, and, if so, whether this signature is more similar to those of other culturally-invented symbol systems (i.e. logic and math) or of natural language.

A group of expert programmers (average 11 years of programming experience) performed a code comprehension task while undergoing functional magnetic resonance imaging (fMRI). We chose a comprehension task partly because it could be analogous to understanding language vignettes and because it is arguably simpler than writing or debugging code. On each real-code trial, participants saw a short function definition, followed by an input and a possible output, and judged whether the output was valid. In fake code control trials, participants performed a memory task with unstructured text. A fake function was generated by scrambling each line of a real function at the level of individual words and symbols. Each fake function preserved the perceptual and lexical elements of a real function, but was devoid of syntactic structure. The real-code condition contained two subtypes or ‘control structures’, for loops and if conditionals. We used multi-voxel pattern analysis (MVPA) to decode for from if functions to test whether the code-responsive cortical system encodes code-relevant information. Finally, we examined the overlap of code comprehension with language (sentence comprehension), formal logic, and mathematical tasks. We also tested overlap of code with the MSIT to determine whether the overlap with culturally-invented symbol systems (i.e. logic and math) is more extensive than with simpler experimentally defined rule-based tasks.
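The per-line scrambling used to generate fake code can be sketched as follows (an illustrative sketch only; the study's exact tokenization and scrambling procedure are not described here, so the whitespace-based tokenizer is an assumption):

```python
import random

def scramble_function(code, seed=0):
    """Shuffle the tokens within each line of a function, preserving
    its words/symbols but destroying the syntactic structure."""
    rng = random.Random(seed)
    scrambled_lines = []
    for line in code.splitlines():
        tokens = line.split()       # crude word/symbol tokenization (assumption)
        rng.shuffle(tokens)         # scramble within the line only
        scrambled_lines.append(' '.join(tokens))
    return '\n'.join(scrambled_lines)

real = "def fun(input):\n    result = 'result: '\n    return result"
print(scramble_function(real))
```

Because the shuffle operates line by line, each fake function retains exactly the lexical content of its source line while losing any parse.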

Results

Behavioral results

Accuracy was similar across real and fake code trials (real M = 0.92, SD = 0.045; fake M = 0.90, SD = 0.069; binary logistic mixed regression, real to fake odds ratio β = 1.27; Wald’s z statistic, z = 1.21; p=0.23). Accuracy was also similar across for and if trials (for M = 0.92, SD = 0.056; if M = 0.92, SD = 0.076; if to for odds ratio β = 0.95; Wald’s z statistic, z = −0.28; p=0.77). Participants were slower to respond to fake as compared to real-code trials (real M = 1.73 s, SD = 0.416; fake M = 2.03 s, SD = 0.37; t(73) = 2.329, p=0.023) and slower to respond to for as compared to if trials (for M = 1.85 s, SD = 0.46; if M = 1.60 s, SD = 0.44; t(58) = 2.127, p=0.038) (Figure 1—figure supplement 1).

In the language/math/logic localizer task, participants performed least accurately on logic trials, followed by math and language (logic M = 0.82, SD = 0.13; math M = 0.94, SD = 0.028; language M = 0.98, SD = 0.023; one-way-ANOVA, F(2, 42)=18.29, p<0.001). Participants were slowest to respond to logic trials, followed by math trials, and fastest on the language trials (logic M = 6.47 s, SD = 2.42; math M = 4.93 s, SD = 1.32; language M = 4.03 s, SD = 1.27; one-way-ANOVA F(2, 42)=7.42, p=0.0017) (Figure 1—figure supplement 1).

In the MSIT experiment, hard and easy conditions did not differ in terms of accuracy (hard M = 0.97, SD = 0.038; easy M = 0.98, SD = 0.034; t(28) = −1.363, p=0.18), but the hard trials took significantly longer to respond to than the easy trials (hard M = 0.792 s, SD = 0.092; easy M = 0.506 s, SD = 0.090; t(28)=8.59, p<0.001) (Figure 1—figure supplement 1).

fMRI results

Code comprehension experiment

As compared to fake code, real code elicited activation in a left-lateralized network of regions, including the lateral PFC (middle/inferior frontal gyri, inferior frontal sulcus; mainly BA 44 and 46, with partial activation in BA 6, 8, 9, 10, 47), the parietal cortex (the IPS, angular, and supramarginal gyri; BA 7) and the posterior middle temporal gyrus (pMTG) and superior temporal sulcus (BA 22 and 37). Activity was also observed in early visual cortices (Occ) (p<0.01 FWER, Figure 1; Supplementary file 2).

Figure 1 (with 1 supplement). Whole-brain contrasts.

Areas shown survived cluster correction at p<0.05, with intensity (both warm and cold colors) representing uncorrected vertex-wise probability. In the map for each localizer contrast, both warm and cold colors indicate activated vertices, with the cold color labelling the overlap with the code contrast.

MVPA revealed that for and if functions could be distinguished based on patterns of activity within PFC (accuracy = 64.7%, p<0.001), IPS (accuracy = 67.4%, p<0.001) and pMTG (accuracy = 68.4%, p<0.001). for and if functions could also be distinguished within the early visual cortex (accuracy = 55.7%, p=0.015); however, decoding accuracy there was lower than in the other regions (F(3, 56)=4.78, p=0.0048) (Figure 2).
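A decoding analysis of this kind can be illustrated with leave-one-out cross-validation and a nearest-centroid classifier on synthetic ‘voxel’ patterns (a minimal sketch with made-up data; the study's actual classifier, cross-validation scheme, and pattern estimation are not shown here and may differ):

```python
import numpy as np

def loo_nearest_centroid(patterns, labels):
    """Leave-one-out decoding: hold out one trial, compute each class's
    mean pattern ('centroid') from the rest, classify the held-out trial
    by the nearer centroid, and return the overall accuracy."""
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(labels)):
        train = np.ones(len(labels), dtype=bool)
        train[i] = False
        centroids = {c: patterns[train & (labels == c)].mean(axis=0)
                     for c in np.unique(labels)}
        pred = min(centroids,
                   key=lambda c: np.linalg.norm(patterns[i] - centroids[c]))
        correct += pred == labels[i]
    return correct / len(labels)

# Synthetic 'for' vs 'if' trial patterns (20 trials x 50 voxels each),
# with a small mean activation difference between the two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 50)),
               rng.normal(0.8, 1.0, (20, 50))])
y = np.array(['for'] * 20 + ['if'] * 20)
print(loo_nearest_centroid(X, y))  # well above the 0.5 chance level
```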

Figure 2. MVPA decoding accuracy in ROIs revealed by the code contrast.

(a) The four search spaces (IPS, pMTG, PFC, OCC in the left hemisphere) within which functional ROIs were defined for the MVPA. (b) The MVPA decoding accuracy in the four ROIs. Error bars are mean ± SEM. *p<0.05. ***p<0.001.

Overlap between code comprehension and other cognitive domains

The language/math/logic localizer task activated previously identified networks involved in these respective domains. Responses to language were observed in a left perisylvian fronto-temporal language network; to math in parietal and anterior prefrontal areas as well as the posterior aspect of the inferior temporal gyrus; and to logic in a similar network to math, that is, parietal and anterior prefrontal areas and the posterior inferior temporal gyrus. Logic activated more anterior and more extensive regions in PFC than math. The MSIT hard > easy contrast also activated a fronto-parietal network including the IPS; however, the activation in the lateral frontal cortex was posterior, close to the precentral gyrus (Figure 3; see Supplementary file 2 for a full description of activity patterns associated with language, logic, math and MSIT). Note that although in the current experiment logic, math and language were compared to each other, the networks observed for each domain are similar to those previously identified with other control conditions (e.g. lists of non-words for language, and a hard vs. easy contrast in a logic task) (e.g. Coetzee and Monti, 2018; Fedorenko et al., 2011).

Figure 3. Overlap between the brain map revealed by the code contrast and each of the brain maps revealed by the localizer contrasts.

(a) Brain map with the activated regions in the five contrasts reported in Figure 1 overlain. The language network is shown in transparent blue, math in transparent red, and logic in transparent green. The regions activated in the MSIT contrast are enclosed in black outlines, and the code-responsive regions are enclosed in yellow outlines. (b) Cosine similarity between code contrast and each localizer contrast, in each hemisphere. Each dot represents the data from one participant. The dotted line on each bar indicates the null similarity between code contrast and the given localizer contrast. The yellow dashed line in each hemisphere indicates the empirical upper bound of the cosine similarity, the similarity between code comprehension and itself, averaged across participants. Error bars are mean ± SEM. *p<0.05. **p<0.01. ***p<0.001.

Because code comprehension was highly left-lateralized, overlap analyses focused on the left hemisphere. Right hemisphere results are reported in the appendix. Code comprehension (real > fake) overlapped significantly above chance with all localizer tasks: logic, math, language and MSIT (each task compared to chance p’s < 0.001, compared to code split-half overlap p’s < 0.005) (Figure 3). The degree of overlap differed significantly across tasks (repeated-measures ANOVA: F(3,42) = 5.04, p=0.0045). Code comprehension overlapped most with logic (logic > language), followed by math, and least with MSIT and language (Figure 3). Overlap with logic was significantly higher than with all other tasks, while the overlaps with the other three tasks (language, math, MSIT) were statistically indistinguishable from each other (post-hoc paired t-tests, FDR-corrected p’s < 0.05) (Supplementary file 3). Note that the overlap analyses control for the overall number of activated vertices in each contrast.
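Overlap was quantified as cosine similarity between activation maps; on binarized maps this can be sketched as follows (toy vectors; the study's map construction and thresholding details are assumptions). Because the cosine normalizes by each map's own norm, a contrast that simply activates more vertices does not automatically score higher:

```python
import numpy as np

def cosine_overlap(map_a, map_b):
    """Cosine similarity between two binary activation maps
    (1 = vertex activated in the contrast, 0 = not activated)."""
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 8-vertex maps: 3 vertices activated in both contrasts
code_map  = np.array([1, 1, 1, 0, 0, 0, 1, 0])
logic_map = np.array([1, 1, 0, 0, 0, 0, 1, 1])
print(round(cosine_overlap(code_map, logic_map), 3))  # prints 0.75
```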

The overlap of code with logic and math was observed in the IPS, PFC, and a posterior portion of the inferior temporal gyrus (IT). PFC overlap was localized to the anterior middle frontal gyrus (aMFG, BA 46) and posteriorly in the precentral gyrus (BA 6). Overlap of code and the MSIT (hard > easy) was also observed in the IPS, precentral gyrus and a small portion of the inferior temporal sulcus. Although MSIT and code overlapped in frontal and parietal areas, like code with logic/math, the precise regions of overlap within these general locations differed.

Finally, code overlapped with language (language > math) in portions of the inferior frontal gyrus and the posterior aspect of the superior temporal sulcus/middle temporal gyrus. The overlap between language and code was on average low, and the degree of overlap varied considerably across participants (cosine similarity range: [0.105, 0.480]), with only half of the participants showing above-chance overlap. Notably, there was no relationship between code-language overlap and level of expertise, as measured either by years of coding experience (regression against code-language overlap: R2 = 0, p=0.99; regression against code-math overlap: R2 = 0.033, p=0.52) or performance on coding assessments (regression against code-language overlap: R2 = 0.033, p=0.52; regression against code-math overlap: R2 = 0.064, p=0.36).

Lateralization

The group activation map suggested that code comprehension is left-lateralized. Analyses of individual lateralization indices showed that indeed, code comprehension was as left-lateralized as language (Code lateralization index mean = 0.451, one sample t-test against 0: t(14) = 5.501, p<0.001; Language mean = 0.393, t(14) = 5.523, p<0.001; paired t-test between code and language: t(14) = 1.203, p=0.25). Moreover, lateralization indices of code and language were highly correlated across individuals (R2 = 0.658, p<0.001) (Figure 4).
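A lateralization index of this kind is conventionally computed as (L − R)/(L + R) over per-hemisphere activation measures; the sketch below assumes that standard formula (the study's exact computation is not given here and may differ):

```python
def lateralization_index(left, right):
    """Conventional lateralization index: +1 = fully left-lateralized,
    -1 = fully right-lateralized, 0 = bilateral. The inputs could be,
    e.g., counts of activated vertices per hemisphere (an assumption)."""
    return (left - right) / (left + right)

# Hypothetical vertex counts giving an index near the reported code mean
print(lateralization_index(870, 330))  # prints 0.45 (left-lateralized)
```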

Figure 4. The lateralization index of the code contrast and the localizer contrasts.

(a) The lateralization index of the code contrast and the localizer contrasts. Each white dot stands for one participant, and the enlarged dots represent the mean values. (b) The lateralization indices of code contrast and language contrast are highly correlated.

Discussion

A consistent network of left-lateralized regions was activated across individuals during Python code comprehension. This network included the intraparietal sulcus (IPS), several regions within the lateral PFC and the posterior-inferior aspect of the middle temporal gyrus (pMTG). This code-responsive network was more active during real than fake code trials, even though for expert Python coders, the fake code task was more difficult (as measured by reaction time) than the real-code task. Involvement of the fronto-parietal system, as opposed to the fronto-temporal language network, in code processing is consistent with prior work using more complex coding tasks, such as debugging and editing (Siegmund et al., 2014, Siegmund, 2017; Huang et al., 2019; Krueger et al., 2020; Floyd et al., 2017). The fact that fronto-parietal involvement is observed even for simple code function comprehension suggests that it is not related solely to cognitive control processes specific to these more complex coding tasks.

Within this code-responsive neural network, spatial patterns of activation distinguished between for vs. if code functions, suggesting that this network represents code-relevant information and is not merely activated during the coding task due to general difficulty demands. In overlap analyses, the code comprehension network was most similar to the fronto-parietal system involved in formal logical reasoning and, to a lesser degree, math. By contrast, overlap with the perisylvian fronto-temporal language network was low. Notably, in the current study, neural responses associated with language, math and logic were localized partly relative to each other (e.g. logic > language). This approach focuses on networks that are uniquely involved in one domain vs. another. The degree to which areas shared across language, logic, and math are implicated in code comprehension remains to be addressed in future work.

Code overlaps with logic

Code, logical reasoning, math and the MSIT task all activated aspects of the so-called fronto-parietal executive control system. However, overlap of code with logic was most extensive, followed by math and finally the MSIT. The difference between the MSIT task on the one hand and code comprehension, logic and math on the other was particularly pronounced in the frontal lobe. There, only code, logic and math activated more anterior regions of PFC, including BA 46 and BA 9, although logic-associated activation extended even more anteriorly than code. These findings suggest that neural overlap between logic and code is specific, and not fully accounted for by the general involvement of the central executive system. Note that although the logical reasoning task was more difficult than the language task, the larger overlap with logic is unlikely to relate to task difficulty, since the current overlap analyses control for the overall number of activated vertices.

Previous studies also find that the fronto-parietal network, including anterior prefrontal areas, are involved in logical reasoning (Prado et al., 2011; Tsujii et al., 2011). For example, anterior PFC is active when participants solve formal logical problems with quantifiers (e.g. ‘all X are Y; Z is a X; therefore Z is Y’) and connectives (e.g. ‘if X then Y; not Y; therefore not X’) and plays a key role in deductive reasoning with variables (Coetzee and Monti, 2018; Goel, 2007; Goel and Dolan, 2004; Monti et al., 2009; Reverberi et al., 2010; Reverberi et al., 2007; Rodriguez-Moreno and Hirsch, 2009).

A fronto-parietal network has also been consistently implicated in math (Friedrich and Friederici, 2013; Maruyama et al., 2012; Piazza et al., 2007; Wendelken, 2014). Some of the parietal responses to math have been linked to the processing of quantity information (Eger et al., 2009; Nieder, 2016; Nieder and Miller, 2004; Piazza and Eger, 2016; Roitman et al., 2007; Tudusciuc and Nieder, 2009). For example, neurons in the IPS of monkeys code the numerosity of dots (Nieder, 2016). However, much of the same fronto-parietal network is also active during the processing of mathematical statements free of digits and arithmetic operations (Amalric and Dehaene, 2018; Wendelken, 2014). In the current study, both the anterior prefrontal areas and parietal areas involved in math also overlapped with code and logical reasoning. Some of this activation could therefore reflect common operations, such as the manipulation of rules and symbols in working memory. On the other hand, the lower overlap between coding and math, as compared to coding and logic, could stem from math uniquely involving quantitative processing in the current study.

The present evidence suggests that culturally-invented symbol systems (i.e. code comprehension, formal logic and math) depend on a common fronto-parietal network, including the executive system. As noted in the introduction, although each of these symbol systems has its unique cognitive properties, they also have much in common. All involve the manipulation of abstract arbitrary symbols without inherent semantic content (e.g. X, Y, input, result) according to explicit rules. In the current logical inference and code experimental tasks, mental representations of several unknown variables are constructed (for logic ‘X’, ‘Y’, and ‘Z’, for code ‘input’ and ‘result’) and the relationships between them deduced according to rules of formal logic or code.

There are also important differences between the rules of logical inference and programming. Take the ‘if’ conditional again as an example. In formal logic, the statement ‘if P then Q’ doesn’t imply anything about what happens when P is false. By contrast, in Python and most other programming languages, the statement:

if condition == True: 
        do_something()

automatically implies that when the condition is false, the function ‘do_something()’ isn’t executed, unless otherwise specified. Learning to program involves acquiring the particular set of conventionalized rules used within programming languages and a syntax that specifies how the programming language in question expresses logical operations (Dalbey and Linn, 1985; Pea and Kurland, 1984; Pennington, 1987; Robins et al., 2003). We speculate that such knowledge is encoded within the fronto-parietal network identified in the current study. It is also worth pointing out that although we found substantive overlap between the neural networks involved in code and logic, it is possible that, at a finer neural scale, these functions dissociate. Future studies comparing coders with different levels of expertise should test whether learning to code modifies circuits within the code-responsive neural network identified in the current study, and address whether learning to code leads to specialization of a subset of this network for code in particular. A detailed understanding of the neural basis of code will also require the development of cognitive models of code comprehension.
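This contrast can be made concrete (a small illustrative sketch): the material conditional of formal logic is true whenever its antecedent is false, whereas a Python if simply skips its body and implies nothing further.

```python
def material_conditional(p, q):
    """Formal logic: 'if P then Q' is false only when P is true and Q is false."""
    return (not p) or q

def procedural_if(condition):
    """Programming: when the condition is false, the body is skipped
    and nothing else about the program's state is implied."""
    executed = []
    if condition:
        executed.append('do_something')
    return executed

print(material_conditional(False, False))  # True: vacuously satisfied
print(procedural_if(False))                # []: body simply not executed
```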

The involvement of the multiple-demand executive control system in code comprehension

Code comprehension showed partial overlap with the MSIT task, particularly in the parietal cortex and in posterior frontal areas. Previous work has noted cognitive and neural similarity between arbitrary small-scale working memory tasks, such as the MSIT, and culturally-derived formal symbol systems (Anderson, 2005; Qin et al., 2004). As noted in the introduction, the MSIT is a classic localizer for the executive function system, alongside tasks such as Stroop and n-back (Duncan, 2010; Fedorenko et al., 2013; Miller and Cohen, 2001; Woolgar et al., 2011; Zanto and Gazzaley, 2013; Zhang et al., 2013). Like code comprehension, most experimental tasks that activate the central executive system involve the maintenance, manipulation and selection of arbitrary stimulus-response mappings according to a set of predetermined rules (Woolgar et al., 2011; Zhang et al., 2013). For example, in the MSIT task, among the many possible ways to map a visually presented digit triplet to a button press, participants maintain and select the rule ‘press the button whose index corresponds to the value of the unique digit in the triplet.’ The difficult condition requires using this less habitual rule to make a response.
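The rule itself is simple enough to write out (an illustrative sketch; stimulus details vary across versions of the MSIT): respond with the value of the digit that appears exactly once in the triplet.

```python
from collections import Counter

def msit_response(triplet):
    """Return the button to press: the value of the unique digit."""
    counts = Counter(triplet)
    return next(d for d in triplet if counts[d] == 1)

# Easy trial: the unique digit's value matches its position (1-indexed)
# and the distractor 0 is not a valid response
print(msit_response((1, 0, 0)))  # press 1
# Hard trial: digit 1 sits at position 3, flanked by valid responses
print(msit_response((2, 2, 1)))  # press 1
```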

Previous studies also find that the fronto-parietal executive system is involved in rule maintenance and switching, as well as variable representation. In one task-switching study, participants maintained a cued rule in working memory, and the level of fronto-parietal activity increased with the complexity of the rule maintained (Bunge et al., 2003). Patterns of neural activity within the executive system encode which rule is currently being applied, and activity is modulated by rule switching (Buschman et al., 2012; Crittenden and Duncan, 2014; Xu et al., 2017). Finally, studies with non-human primates find that neurons in the frontal lobe encode task-based variables (Duncan, 2010; Kennerley et al., 2009; Nieder, 2013). Such processes, studied in the context of simple experimental tasks, may also play a role in code comprehension.

Although culturally-invented formal symbol systems and simple experimental rule-based tasks (e.g. MSIT) share cognitive elements, the latter, unlike the former, involve simple rules that specify stimulus response mappings, rather than mental manipulations of variables. An intriguing possibility is that code comprehension and other culturally-invented symbol systems recycle a subset of the neural machinery that originally evolved for the maintenance and manipulation of simpler variables and rules (Anderson, 2005; Qin et al., 2004).

Code comprehension and language

In the current study, the perisylvian fronto-temporal network that is selectively responsive to language had low and variable overlap with the neural network involved in code comprehension. The regions that did show some overlap between language and code (i.e. left inferior frontal and middle temporal gyri) have been implicated in high-level linguistic processing, including sentence-level syntax (Friederici, 2017; Hagoort, 2005; Pallier et al., 2011; Bornkessel-Schlesewsky and Schlesewsky, 2013; Fedorenko and Thompson-Schill, 2014; Matchin and Hickok, 2020). The current results therefore do not rule out the possibility that the language system plays some role in code. Nevertheless, on the whole, the results do not support the hypothesis that the language system is recycled for code comprehension (see also Ivanova et al., in press). Previous studies also find that math and formal logic do not depend on classic language networks (Amalric and Dehaene, 2016; Monti et al., 2009). The low degree of overlap between code and language is intriguing given the cognitive similarities between these domains (Fedorenko et al., 2019; Pandža, 2016; Peitek et al., 2018; Portnoff, 2018; Prat et al., 2020; Siegmund et al., 2014). As noted in the introduction, programming languages borrow letters and words from natural language, and both natural language and code have hierarchical, recursive grammars (Fitch et al., 2005).

One possible explanation for the low overlap between the perisylvian fronto-temporal language network and code is that the language system is evolutionarily predisposed to support natural language processing in particular, and is therefore not generalizable even to similar domains, like computer code and formal logic (Dehaene-Lambertz et al., 2006; Fedorenko et al., 2011). Timing could also play a role. The perisylvian fronto-temporal language network may have a sensitive period of development during which it is most capable of learning (Cheng et al., 2019; Mayberry et al., 2018; Cheng et al., 2020; Ferjan Ramirez et al., 2016). By the time people learn to code, the network may be incapable of taking on new cognitive functions. Indeed, even acquiring a second language late in life leads to lower levels of proficiency and responses outside the perisylvian fronto-temporal system (Hartshorne et al., 2018; Johnson and Newport, 1989). These observations suggest that domain-specific systems, like the perisylvian fronto-temporal language network, are not always amenable to ‘recycling’ by cultural inventions. The fronto-parietal system might be inherently more flexible throughout the lifespan and thus more capable of taking on new cultural skills (Riley et al., 2018).

Despite the lack of direct overlap, lateralization patterns of language and coding were highly correlated across individuals; that is, those individuals with highly left-lateralized responses to sentences also showed highly left-lateralized responses to code. This intriguing observation suggests that the relationship between code and language may be ontogenetic as well as phylogenetic. It is hard to imagine how code in its current form could have been invented in the absence of language (Fitch et al., 2005). Ontogenetically, code-relevant neural representations might be enabled by the language system, even though they are distinct from it.

An analogous example comes from the domain of reading (Dehaene et al., 2010; McCandliss et al., 2003). Reading-relevant regions, such as the visual word form area (VWFA), are co-lateralized with the perisylvian fronto-temporal language network across people (Cai et al., 2010). The VWFA has strong anatomical connectivity with the fronto-temporal language network even prior to literacy (Bouhali et al., 2014; Saygin et al., 2016). Analogously, code comprehension may colonize a left-lateralized portion of the central executive system due to its stronger (i.e. within hemisphere) connectivity with the perisylvian fronto-temporal language network.

Relationship to co-published work by Ivanova and colleagues

The current results are consistent with the co-published study by Ivanova and colleagues. Like the current study, Ivanova et al. report the involvement of a fronto-parietal network in code comprehension and low overlap with fronto-temporal language processing systems. The consistency of these two studies is striking in light of the differences in their designs. While the current study compared code comprehension to a working memory control task with linguistic symbols, Ivanova et al. compared code comprehension to matched verbal descriptions of program-like algorithms. In the current study, the value of input variables was provided only after function presentation; by contrast, in Ivanova et al. the value was stipulated as part of the function. While the current study localized the language network by comparing a passive/active sentence task to a math control condition, Ivanova and colleagues compared sentence comprehension to a non-word memory control task. Finally, in addition to examining the neural basis of Python comprehension, Ivanova et al. also studied the neural basis of ScratchJr and found partially similar results.

There are also interesting differences across experiments that may relate to differences in design. Both the current study’s and Ivanova et al.’s Python comprehension tasks revealed robust responses in prefrontal cortices; by contrast, prefrontal responses to ScratchJr were weaker. Unlike Python, ScratchJr does not declare variables and update their values. Previous studies have implicated anterior prefrontal cortices in variable manipulation (Monti et al., 2009; Diester and Nieder, 2007). The degree of prefrontal involvement in code comprehension may therefore depend in part on whether variable manipulation is involved. In the current study, we observed strong left-lateralization of code comprehension and co-lateralization of code comprehension and language across people. By contrast, Ivanova and colleagues did not observe left-lateralization of code comprehension. Since Ivanova and colleagues compared code comprehension to a sentence reading task, left-lateralization may have been obscured by subtracting out a left-lateralized pattern associated with sentence processing. Alternatively, the Ivanova study focused on the so-called ‘syntactic’ aspects of code comprehension rather than the semantics of code by comparing code comprehension to matched sentences describing similar algorithms. It is possible that the semantic or algorithmic aspects of code are more left-lateralized. Notably, since the current sample is relatively small (n = 15), any differences between the current study and the Ivanova findings should be interpreted with caution and replicated in future work.

Conclusions

A fronto-parietal cortical network is consistently engaged in expert programmers during code comprehension. Patterns of activity within this network distinguish between FOR and IF functions. This network overlaps with other culturally-invented symbol systems, in particular formal logic and to a lesser degree math. By contrast, the neural basis of code is distinct from the perisylvian fronto-temporal language network. Rather than recycling domain-specific cortical mechanisms for language, code, like formal logic and math, depends on a subset of the domain general executive system, including anterior prefrontal areas. The executive system may be uniquely suited as a flexible learning mechanism capable of supporting an array of cultural symbol systems acquired in adulthood.

Materials and methods

Participants

Seventeen individuals participated in the study; one did not complete the tasks due to claustrophobia, and another was excluded from analyses due to excessive movement (>2 mm). We report data from the remaining fifteen individuals (three women, age range 20–38, mean age = 27.4, SD = 5.0). All participants had normal or corrected to normal vision and had no known cognitive or neurological disabilities. Participants gave informed consent according to procedures approved by the Johns Hopkins Medicine Institutional Review Board (IRB protocol number: NA_00087983).

All participants had at least 5 years of programming experience (range: 5–22, mean = 10.7, SD = 5.2), and at least 3 years of experience with Python (range: 3–9, mean = 5.7, SD = 1.8).

Behavioral pre-test

In addition to self-reported programming experience, Python expertise was evaluated with two out-of-scanner Python exercises (one easier and one more difficult) the week prior to the fMRI experiment. These exercises also served to familiarize participants with the particular Python expressions that would be used during the fMRI experiment.

The easier exercise consisted of three phases. During the first phase (initial test), we evaluated participants’ knowledge of every built-in Python function that would appear in the stimuli of the fMRI experiment. Participants were asked to type the output of a single-line print() statement (e.g. for print("3.14".split("1")) one should type ['3.', '4']). On average participants answered M = 82.9% (SD = 6.9%) of the questions correctly (range: 70–96%). Since even expert programmers may not have used a particular function in the recent past, the second phase (recap) explicitly reviewed the definitions and purposes of all of the relevant functions and expressions. During the final phase (retest), participants were once again asked to type the output of a single-line statement for each function (M = 92.0% (SD = 7.5%), range: 72.4–100%).
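The quoted item can be checked directly in a Python interpreter:

```python
# Reconstruction of an initial-test item: participants typed the printed
# output of a single-line statement like this one.
answer = "3.14".split("1")  # split the string "3.14" on the substring "1"
print(answer)               # prints ['3.', '4']
```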

The more difficult exercise evaluated the participants’ knowledge about when and how to use Python functions and expressions. Each participant answered sixteen questions consisting of a code snippet with a blank. A prompt was presented alongside the code snippet to explain what the snippet should output if executed. The participant was asked to fill in the blank in order to complete the code (see the subsection ‘example of the difficult out-of-scanner exercise’ in the appendix). The questions were designed by the experimenter to cover some of the objectives specified in the exam syllabus of the Certified Associate in Python Programming Certification held by the Python Institute (https://pythoninstitute.org/certification/pcap-certification-associate/pcap-exam-syllabus/). On average, the participants got 64.6% (SD = 16.6%) of the questions correct (range: 37.5–93.75%).

fMRI task design and stimuli

Code comprehension experiment

In real-code comprehension trials, participants were presented with Python functions designed for the purposes of the experiment. In fake code control trials, they were presented with incomprehensible scrambled versions of the same functions (for details on real and fake code, see below). To help participants distinguish between real and fake code trials and to prevent the participants from erroneously attempting to parse fake code, real-code appeared in white text and fake code in yellow text.

Each trial had three phases: function (24 s), input (6 s), and question (6 s) (Figure 5). First, participants viewed a Python function for 24 s, followed by a 500-millisecond fixation-cross delay. During the input phase, the original function re-appeared on the screen with a potential input below it, consisting of a single-line character string (6 s). Participants were instructed to use the input to mentally derive the function’s output. After the input phase there was a 500-millisecond fixation-cross delay, followed by a proposed output along with the prompt ‘TRUE?’ Participants had to determine whether the output was correct within 6 s. All trial phases had a shortening bar at the bottom of the screen indicating the remaining time in that phase. Each trial was followed by a 5 s inter-trial interval during which the text ‘Your response is recorded. Please wait for the next trial.’ was shown on the screen.

Figure 5. The experiment design.

The FAKE function (bottom row) in this figure is created by scrambling the words and symbols in each line of the REAL function (top row). Note that for the purpose of illustration, the relative font size of the text in each screen shown in this figure is larger than what the participants saw during the actual MRI scan.

Each real-code function consisted of five lines. The first line (def fun(input):) and the last (return result) were always the same. The second line always initialized the result variable, and the third and fourth lines formed a control structure (either a for loop or an if conditional) that could modify the value of result. Real-code trials were divided into two sub-conditions, for and if, according to the control structures the functions contained. Each condition included two variants of the for or if functions (see the subsections ‘detailed information about the stimuli’ and ‘the two variants of each control structure’ in the appendix). All functions took a letter string as input and performed string manipulation.
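For illustration, a hypothetical FOR function in this five-line format (the body is our own; the actual stimuli appear in Supplementary file 1):

```python
# Hypothetical FOR stimulus: fixed first and last lines, result initialized
# on line 2, and a two-line for loop that modifies result.
def fun(input):
    result = ''
    for ii in input:
        result = result + ii.upper()
    return result
```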

Fake code trials were analogous to the real-code trials in temporal structure (i.e. function, input, question). However, no real code was presented. Instead, participants viewed scrambled text and were asked to remember it. During the function phase of a fake code trial, participants saw a scrambled version of a real-code function. Scrambling was done within each line at the word and symbol level (Figure 5, bottom row). Because fake functions were derived from real functions, the words, digits, and operators that existed in real functions were preserved; however, none of the scrambled lines constituted an executable Python statement. During the input phase, an additional fake input line appeared below the fake function. The fake input line did not interact with the fake function; participants only had to memorize it. During the question phase, a new character line appeared along with the prompt ‘SAME?’ Participants judged whether this line had been presented during the function and input phases (including the additional input line), or whether it came from a different fake function. The correct response was ‘true’ for half of the real-code trials and half of the fake code trials.

There were six task runs, each consisting of 20 trials: eight real if trials, eight real for trials, and four fake code trials. Each participant saw a total of 48 for functions (24 per variant), 48 if functions (24 per variant), and 24 fake functions (12 fake for, and 12 fake if functions). After each run of the task, participants saw their overall percent correct and average response time. Participants were divided into two groups such that the variants of the functions were counterbalanced across groups; the same participant never saw different variants of the same function. The order of presentation of the functions was pseudo-randomized and balanced across participants. In total, 192 real functions (96 per group) and 48 fake functions (24 per group) were used in the experiment. All the functions are listed in Supplementary file 1. We permuted the order of the functions systematically such that each participant saw a unique order (see the subsection ‘algorithm for stimulus permutation’ in the appendix).

Localizer tasks

During a separate MRI session, participants took part in two localizer experiments. A single experiment (the logic/math/language localizer) was used to localize responses to formal logic, symbolic math, and language, with each condition serving as the control for the others. The task design was adapted from previous studies (Kanjlia et al., 2016; Monti et al., 2009; Monti et al., 2012). On language trials, participants judged whether two visually presented sentences, one in active and one in passive voice, had the same meaning (e.g. ‘The child that the babysitter chased ate the apple’ vs ‘The apple was eaten by the babysitter that the child chased’). On math trials, participants judged whether the variable X had the same value across two equations (e.g. ‘X minus twenty-five equals forty-one’ vs ‘X minus fifty-four equals twelve’). On formal logic trials, participants judged whether two logical statements were consistent, such that one statement being true implied the other also being true (e.g. ‘If either not Z or not Y then X’ vs ‘If not X then both Z and Y’).

Each trial began with a 1 s fixation cross. One member of a pair appeared first, the other following 3 s later. Both statements remained on the screen for 16 s. Participants pressed the right or left button to indicate true/false. The experiment consisted of 6 runs, each containing 8 trials of each type (language/math/logic) and six rest periods, lasting 5 s each. All 48 statement pairs from each condition were unique and appeared once throughout the experiment. In half of the trials, the correct answer was ‘true’. Order of trials was counterbalanced across participants in two lists.

Although all of the tasks in the language/math/logic localizer contain language stimuli, previous studies have shown that sentences with content words lead to larger responses in the perisylvian fronto-temporal language network than spoken equations or logical statements with variables (Kanjlia et al., 2016; Monti et al., 2009; Monti et al., 2012). The perisylvian fronto-temporal language network shows enhanced activity for stimuli that contain meaningful lexical items and sentence-level syntax (e.g. Fedorenko et al., 2016). Furthermore, previous studies have found that responses to language, logic and math when compared to each other were similar to what was observed for each domain relative to independent control conditions (e.g. sentences relative to lists of non-words for language, and easy vs. hard logic problems; Kanjlia et al., 2016; Monti et al., 2009, Monti et al., 2012).

The multi-source interference task (MSIT) was adapted from Bush and Shin, 2006 to engage executive control processes and localize the multiple-demand network. On each trial, a triplet of digits was shown on the screen, two of which were the same. The participant pressed a button (1, 2, or 3) to indicate the identity of the target digit that was different from the distractors. For example, for ‘131’ the correct response is ‘3’; for ‘233’ it is ‘2’. The participants always pressed buttons ‘1’, ‘2’, and ‘3’ with their index, middle, and ring fingers, respectively.

MSIT consisted of interference blocks and control blocks, each containing 24 trials (1.75 s each). On interference trials, the location of the target digit was inconsistent with the identity of the digit (e.g. trials such as ‘133’ or ‘121’ did not occur). On control trials, the distractors were always ‘0’, and the target digit was always at the same location as its identity. In other words, there were only three kinds of control trial, namely ‘100’, ‘020’, and ‘003’.
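The trial logic above can be sketched as follows (our reconstruction, not the authors' stimulus code):

```python
# Sketch of the MSIT trial rules. Each trial is a 3-character digit string;
# the correct response is the identity of the digit that differs from the
# two matching distractors.
def msit_answer(trial):
    for digit in set(trial):
        if trial.count(digit) == 1:
            return int(digit)

def is_control(trial):
    # Control trials use '0' distractors, and the target's position
    # (1-indexed) matches its identity: only '100', '020', '003' qualify.
    target = msit_answer(trial)
    return trial.count('0') == 2 and trial.index(str(target)) + 1 == target
```

For example, msit_answer('131') returns 3 and msit_answer('233') returns 2, matching the examples in the text; '131' fails is_control because its distractors are not zeros.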

Participants performed 2 runs of MSIT. Each run began with 15 s of fixation, followed by four interference blocks and four control blocks interleaved, and ended with another 15 s of fixation. Each block lasted 42 s. The order of the blocks was balanced within and between participants. Trial order was arranged such that all 12 interference trials appeared exactly twice in an interference block, and all three control trials appeared exactly six times in a control block. Identical trials never appeared in succession, and the order of the trials was different across all 8 blocks of the same kind.

Data acquisition

MRI data were acquired at the F.M. Kirby Research Center of Functional Brain Imaging on a 3T Philips Achieva Multix X-Series scanner. T1-weighted structural images were collected in 150 axial slices with 1 mm isotropic voxels using the magnetization-prepared rapid gradient-echo (MP-RAGE) sequence. T2*-weighted functional BOLD scans were collected in 36 axial slices (2.4 × 2.4 × 3 mm voxels, TR = 2 s). We acquired the data in one code comprehension session (six runs) and one localizer session (2 runs of MSIT followed by 6 runs of language/math/logic), with identical acquisition parameters for both sessions.

The stimuli in both the code comprehension and localizer sessions were presented with custom scripts written in PsychoPy3 (https://www.psychopy.org/, Peirce et al., 2019). The stimuli were presented visually on a rear projection screen, cut to fit the scanner bore, with an Epson PowerLite 7350 projector. The resolution of the projected image was 1600 × 1200. The participant viewed the screen via a front-silvered, 45° inclined mirror attached to the top of the head coil.

fMRI data preprocessing and general linear model (GLM) analysis

Data were analyzed using Freesurfer, FSL, HCP workbench, and custom in-house software written in Python (Dale et al., 1999; Smith et al., 2004; WU-Minn HCP Consortium et al., 2013). Functional data were motion corrected, high-pass filtered (128 s), mapped to the cortical surface using Freesurfer, spatially smoothed on the surface (6 mm FWHM Gaussian kernel), and prewhitened to remove temporal autocorrelation. Covariates of no interest were included to account for confounds related to white matter, cerebrospinal fluid, and motion spikes.

The four real-code (for1, for2, if1, if2) and corresponding fake code conditions were entered as separate predictors in a GLM after convolving with a canonical hemodynamic response function and its first temporal derivative. Only the images acquired during the twenty-four-second function phase were modeled.

For the localizer experiment, a separate predictor was included for each of the three conditions (language, math, and logic) modeling the 16 s during which the statement pair was presented, as well as a rest period (5 s) predictor. In the MSIT task, the interference condition and the control condition were entered as separate predictors.

Each run was modeled separately, and runs were combined within each subject using a fixed-effects model (Dale et al., 1999; Smith et al., 2004). For the group-level analysis across participants, random-effects models were applied, and the models were corrected for multiple comparisons at vertex level with p<0.05 false discovery rate (FDR) across the whole brain. A nonparametric permutation test was further implemented to cluster-correct at p<0.01 family-wise error rate.

ROI definition

For each participant, four code-responsive functional ROIs were defined for use in the MVPA analysis. First, a random-effects whole-brain univariate analysis of the real > fake code contrast revealed four major clusters in the left hemisphere: the intraparietal sulcus (IPS), the posterior middle temporal gyrus (pMTG), the lateral PFC, and early visual cortex (Occ). These clusters were used to define group search spaces. Each search space was defined by combining parcels from Schaefer et al. that encompassed each cluster (400-parcel map, Schaefer et al., 2018). Next, individual functional ROIs were defined within these clusters by taking the top 500 active vertices for the real > fake contrast within each participant.

MVPA

MVPA was used to distinguish for and if functions based on the spatial activation pattern in code-responsive ROIs. Specifically, we used the support vector machine (SVM) implemented in the Python toolbox Scikit-learn (Chang and Lin, 2011; Pedregosa et al., 2011).

For each participant, the spatial activation pattern for each function was defined as the beta parameter estimates from a GLM with each function entered as a separate predictor. Within each ROI in each participant, the 96 spatial patterns elicited by the real functions were collected. Normalization was carried out separately for the for and if conditions such that in either condition, across all vertices and all functions, the mean was set to 0 and the standard deviation to 1. The purpose of the normalization was to eliminate any difference in the baselines of the two conditions while preserving distinctive spatial patterns.

The whole dataset was split into a training set (90%, 86 functions) and a testing set (10%, 10 functions), where in each set half of the patterns came from for functions. A linear SVM (regularization parameter C = 5.0) was trained on the training set and tested on the testing set. Classification was carried out on 100 different train-test splits, and the average accuracy value was recorded as the observed accuracy.
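A minimal sketch of this pipeline, using synthetic data in place of the real beta patterns (scikit-learn's SVC as cited in the text; the array shapes, seed, and random data are our own stand-ins):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
n_vertices = 500                       # top 500 vertices per functional ROI
X = rng.normal(size=(96, n_vertices))  # stand-in for the 96 beta patterns
y = np.repeat([0, 1], 48)              # 0 = FOR functions, 1 = IF functions

# Normalize each condition separately to mean 0, SD 1 across all
# vertices and functions, removing baseline differences between conditions.
for cond in (0, 1):
    X[y == cond] = (X[y == cond] - X[y == cond].mean()) / X[y == cond].std()

# 100 stratified 90/10 train-test splits; average accuracy is the observed value.
splitter = StratifiedShuffleSplit(n_splits=100, test_size=10, random_state=0)
accs = []
for train_idx, test_idx in splitter.split(X, y):
    clf = SVC(kernel='linear', C=5.0).fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
observed_accuracy = float(np.mean(accs))
```

With random data as here, accuracy should hover near the 50% chance level; with real condition-specific patterns it is tested against chance as described below.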

We tested the classifier performance against chance (50%) using a combined permutation and bootstrapping approach (Schreiber and Krekelberg, 2013; Stelzer et al., 2013). We derived the t-statistic of the Fisher-z transformed accuracy values against chance (also Fisher-z transformed). The null distribution for each participant was generated by first shuffling the condition labels 1000 times, then computing the mean accuracy derived from the 100 train-test splits of each shuffled dataset. Then, a bootstrapping method was used to generate an empirical distribution of the t-statistics. In each of the 10⁶ iterations of the bootstrapping phase, one Fisher-z transformed null accuracy value (out of 1000) per participant was randomly selected, and a one-sample t-test was applied to the null sample. The empirical p-value of the real t-statistic was defined as the proportion of null t-statistics greater than the real value.

Overlap analysis

For each participant, and in each hemisphere, we used cosine similarity to quantify the overlap of activated vertices between code comprehension and each of the four localizer contrasts: language (language > math), math (math > language), logic (logic > language), and multi-source interference (hard > easy). First, we generated the binary activation map for each contrast: a vertex was assigned the value 1 if its activation was significant at the FDR-corrected 0.05 threshold, and 0 otherwise. Each binary map was regarded as a vector, and the cosine similarity between two vectors (e.g. code comprehension and logic) was defined as the inner product of the vectors divided by the product of their respective lengths (norms). Note that this measure controls for overall vector length (i.e. the overall number of active vertices in each contrast). The cosine similarities of code to each of the localizer tasks were then compared using a repeated-measures ANOVA and post-hoc pairwise comparisons with false discovery rate (FDR) correction.
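The overlap measure can be written compactly (a minimal sketch with toy binary maps; the real maps are whole-hemisphere vertex vectors):

```python
import numpy as np

def cosine_overlap(map_a, map_b):
    """Cosine similarity of two binary activation maps:
    inner product divided by the product of the vector norms."""
    a = np.asarray(map_a, dtype=float)
    b = np.asarray(map_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

code_map  = np.array([1, 1, 0, 1, 0])  # toy map: code comprehension contrast
logic_map = np.array([1, 0, 0, 1, 0])  # toy map: logic contrast
overlap = cosine_overlap(code_map, logic_map)  # 2 / (sqrt(3)*sqrt(2)) ≈ 0.816
```

Because the denominator scales out each vector's norm, a contrast with many active vertices is not automatically favored, which is the length control noted above.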

The empirical lower bound was calculated separately for each localizer task to account for differences in the number of activated vertices across tasks. For each participant, for each localizer task, we computed the cosine similarity between the binary map for code comprehension and a shuffled binary map for each localizer task. This step was repeated 100 times to generate the null distribution of the similarity values.

We used a bootstrapping approach to test whether each observed cosine similarity value was significantly above the empirical lower bound. For each localizer task, we randomly selected one similarity value from the null distribution of each participant and computed a null group mean similarity. This step was repeated 10⁶ times to derive the null distribution of the group mean similarity. The empirical p-value of the real group mean similarity was defined as the proportion of null values greater than the real value.

We operationalized the empirical upper bound as the cosine similarity of code comprehension and itself. For each participant, we split the data for code comprehension in half, ran a GLM for each half, and derived two binary maps whose cosine similarity was computed. We averaged all the similarity values resulting from the 10 possible splits of the six runs and across all participants.

Appendix

An example of the difficult out-of-scanner exercise

Consider the following code:

###
def xx(a):
    b = a + 3
    return
if xx(1) == ________:
    print("HEY!!")
###

What keyword should be filled in the blank if we want the code to print "HEY!!"?
(It's a keyword. There should be only letters in your answer.)

Detailed information about the stimuli

Across all real functions, the variable names input, result, and ii, were the same, and only the 12 built-in functions (capitalize(), isalnum(), isalpha(), isdigit(), len(), lower(), range(), sorted(), split(), str(), swapcase(), and upper()) and three expressions (list comprehension, slice notation, and string formatting, see Supplementary file 1 for examples) tested during the screening exercise were included in the user-defined functions.

The addition operator (+) occurred in all functions, but always meant string concatenation rather than numeric addition, and never took a numeric operand. In each group, the multiplication operator (*) occurred in 32 out of the 96 real functions, and in 10 of these it took a numeric as one of its operands. However, in all these instances, the ‘multiplication’ meant repetition of strings or lists rather than numeric multiplication (e.g. ‘abc’*3 results in ‘abcabcabc’). In each group, 12 out of the 96 real functions contained a comparison to a numeric value, such as ‘len(input)>5’.

The two variants of each control structure

We designed two variants to implement each control structure, for and if. In the first variant of a for function, the for loop was implemented in the canonical way. In the second variant, the loop was implemented with the Python-specific ‘list comprehension’ expression, in which the operation to be performed on each element of a list (or string) is stated before specifying the list to be iterated over. In the first variant of an if function, the if conditional was implemented in the canonical way. In the second variant, the conditional was implemented by first stating the action to take if the condition is true, then multiplying this action by the true/false value of the condition. There is no formal term for this kind of implementation; for convenience, we call it ‘conditional multiplication’ in this study. Please refer to Supplementary file 1 for examples of each variant.
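Hypothetical illustrations of the two variants of each structure (the function bodies are our own; the actual stimuli are in Supplementary file 1):

```python
# FOR, variant 1: canonical loop.
def for_canonical(input):
    result = ''
    for ii in input:
        result = result + ii.upper()
    return result

# FOR, variant 2: list comprehension; the per-element operation
# (ii.upper()) is stated before the iterated sequence.
def for_list_comprehension(input):
    result = ''.join([ii.upper() for ii in input])
    return result

# IF, variant 1: canonical conditional.
def if_canonical(input):
    result = input
    if len(input) > 5:
        result = input + '!'
    return result

# IF, variant 2: 'conditional multiplication'; the action ('!') is
# multiplied by the condition's truth value (True acts as 1, False as 0
# under string repetition), so it is appended only when the test passes.
def if_conditional_multiplication(input):
    result = input + '!' * (len(input) > 5)
    return result
```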

The algorithm for stimulus permutation

In this experiment, there were five conditions, ‘FOR1’, ‘FOR2’, ‘IF1’, ‘IF2’, and ‘FAKE’. For simplicity, from here on we label them as ‘A’, ‘B’, ‘C’, ‘D’, and ‘E’, respectively.

There are 120 permutations of five distinct labels, such as ‘ABCDE’, ‘BCDEA’, ‘CDEAB’, ‘DEABC’, ‘EABCD’, ‘ACBDE’, ‘CBDEA’, etc. Each run consisted of 20 functions, that is, four permutations of the five labels. Therefore, for each run, we drew four permutations out of the 120 possible permutations. So the order a participant saw in the first run could be:

ABCDE BCDEA CDEAB DEABC

And the order in the second run could be:

EABCD ACBDE CBDEA BDEAC

The permutations were allocated such that every participant saw 24 permutations across all six runs, and every five participants saw all the 120 permutations.
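One simple allocation scheme consistent with these counts can be sketched as follows (our reconstruction; the study's actual draw procedure may have differed):

```python
from itertools import permutations

labels = 'ABCDE'  # FOR1, FOR2, IF1, IF2, FAKE
all_perms = [''.join(p) for p in permutations(labels)]  # the 120 orders

def orders_for(participant):
    """Deal each participant 24 permutations (4 per run x 6 runs) so that
    every group of five consecutive participants exhausts all 120."""
    start = (participant % 5) * 24
    mine = all_perms[start:start + 24]
    # Split the 24 permutations into 6 runs of 4 (20 trials per run).
    return [mine[r * 4:(r + 1) * 4] for r in range(6)]
```

Under this scheme, participants 0 through 4 jointly cover all 120 permutations, and each participant sees 24 of them.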

After determining the order of the conditions, we assigned actual instances of the conditions to the labels. The order of presentation of the functions within each condition was also permuted, such that a function shown in run 1 to participant 1 appeared in run 2 for participant 2, run 6 for participant 6, run 1 for participant 7, and so on. Specifically, the first run of the first participant could be:

A1B1C1D1E1 B2C2D2E2A2 C3D3E3A3B3 D4E4A4B4C4

While the first run of the second participant could be:

A5B5C5D5E5 B6C6D6E6A6 C7D7E7A7B7 D8E8A8B8C8

The second participant still saw A1, B1, C1, … D4, E4, just in later runs.

As a result of permutations of both conditions and functions within condition, all of the participants saw a unique order of presentation.

Overlap analysis in the right hemisphere

Code comprehension (real > fake) overlapped significantly above chance with all localizer tasks: logic, math, language, and MSIT (each task compared to chance, p’s < 10⁻⁶; each compared to the code split-half overlap, p’s < 10⁻⁶). The degree of overlap differed significantly across tasks (repeated-measures ANOVA: F(3,42) = 3.03, p=0.040). Post-hoc paired t-tests (FDR-corrected p’s < 0.05) revealed that the overlap with logic was significantly higher than with math and MSIT, but indistinguishable from the overlap with language. The overlaps with the other three tasks (language, math, MSIT) were statistically indistinguishable from each other (Supplementary file 3).

Data availability

The data include human brain imaging and therefore cannot be posted online without prior consent from the participants. De-identified behavioral data have been posted on OSF.

The following data sets were generated
    1. Liu Y (2020) The neural basis of code comprehension. Open Science Framework. ID 10.17605/OSF.IO/58MWU.

References

    1. Chang C-C
    2. Lin C-J
    (2011) LIBSVM: a library for support vector machines
    ACM Transactions on Intelligent Systems and Technology 2:1–27.
    https://doi.org/10.1145/1961189.1961199
  1. Conference
    1. Duraes J
    2. Madeira H
    3. Castelhano J
    4. Duarte C
    5. Branco MC
    (2016) WAP: understanding the brain at software debugging
    Paper Presented at the 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE). pp. 23–27.
    https://doi.org/10.1109/ISSRE.2016.53
  2. Conference
    1. Floyd B
    2. Santander T
    3. Weimer W
    (2017)
    Decoding the representation of code in the brain: an fMRI study of code review and expertise
    Paper Presented at the Proceedings of the 39th International Conference on Software Engineering.
  3. Conference
    1. Huang Y
    2. Liu X
    3. Krueger R
    4. Santander T
    5. Hu X
    6. Leach K
    7. Weimer W
    (2019) Distilling neural representations of data structure manipulation using fMRI and fNIRS
    Paper Presented at the 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). pp. 25–31.
    https://doi.org/10.1109/ICSE.2019.00053
  4. Conference
    1. Ikutani Y
    2. Uwano H
    (2014) Brain activity measurement during program comprehension with NIRS
    Paper Presented at the 2014 15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD).
    https://doi.org/10.1109/SNPD.2014.6888727
  5. Conference
    1. Krueger R
    2. Huang Y
    3. Liu X
    4. Santander T
    5. Weimer W
    6. Leach K
    (2020)
    Neurological divide: an fMRI study of prose and code writing
    Paper Presented at the 2020 IEEE/ACM 42nd International Conference on Software Engineering.
  6. Software
    1. McCoy LP
    2. Burton JK
    (1988)
    The Relationship of Computer Programming and Mathematics in Secondary Students
    The Relationship of Computer Programming and Mathematics in Secondary Students.
  7. Book
    1. Pandža NB
    (2016)
    Computer programming as a second language Advances
    In: Pandža N. B, editors. Human Factors in Cybersecurity. Springer. pp. 439–445.
    1. Pedregosa F
    2. Varoquaux G
    3. Gramfort A
    4. Michel V
    5. Thirion B
    6. Grisel O
    7. Dubourg V
    (2011)
    Scikit-learn: machine learning in Python
    Journal of Machine Learning Research 12:2825–2830.
  8. Conference
    1. Siegmund J
    (2017)
    Measuring neural efficiency of program comprehension
    Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering.
  9. Book
    1. Weinberg GM
    (1971)
    The Psychology of Computer Programming (Vol. 29)
    New York: Van Nostrand Reinhold.

Decision letter

  1. Andrea E Martin
    Reviewing Editor; Max Planck Institute for Psycholinguistics, Netherlands
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom
  3. William Matchin
    Reviewer; University of South Carolina, United States
  4. Ina Bornkessel-Schlesewsky
    Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Decision letter after peer review:

Thank you for submitting your article "Computer code comprehension shares neural resources with formal logical inference in the fronto-parietal network" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Timothy Behrens as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: William Matchin (Reviewer #1); Ina Bornkessel-Schlesewsky (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, when editors judge that a submitted work as a whole belongs in eLife but that some conclusions require a modest amount of additional new data, as they do with your paper, we are asking that the manuscript be revised to either limit claims to those supported by data in hand, or to explicitly state that the relevant conclusions require additional supporting data.

Our expectation is that the authors will eventually carry out the additional experiments and report on how they affect the relevant conclusions either in a preprint on bioRxiv or medRxiv, or if appropriate, as a Research Advance in eLife, either of which would be linked to the original paper.

First, thank you for taking part in the review process.

As you know, eLife is invested in changing scientific publishing and experimenting to embody that change, even if that involves a degree of risk in order to find workable changes. In this spirit, the remit of the co-submission format is to ask whether the scientific community is enriched more by the data presented in the co-submitted manuscripts together than it would be by the papers apart, or by only one paper presented to the community. In other words, are the conclusions that can be made stronger or clearer when the manuscripts are considered together rather than separately? We felt that, despite significant concerns with each paper individually, especially regarding the theoretical structures in which the experimental results could be interpreted, this was the case.

We want to be very clear that in a non-co-submission case we would have substantial and serious concerns about the interpretability and robustness of the Liu et al. submission given its small sample size. Furthermore, the reviewers' concerns about the suitability of the control task differed substantially between the manuscripts. We share these concerns. However, despite these differences in control task and sample size, the Liu et al., and Ivanova et al. submissions nonetheless replicated each other – the language network was not implicated in processing programming code. The replication substantially mitigates the concerns shared by us and the reviewers about sample size and control tasks. The fact that different control tasks and sample sizes did not change the overall pattern of results, in our view, is affirmation of the robustness of the findings, and the value that both submissions presented together can offer the literature.

In sum, there were concerns that both submissions were exploratory in nature, lacked a strong theoretical focus, and relied on functional localizers for novel tasks. However, these concerns were mitigated by the following strengths. Both studies ask a clear and interesting question, and their results replicate each other despite task differences. In this way, the two papers strengthen each other: the major concerns for each paper individually are ameliorated when considering them as a whole.

In your revisions, please address the concerns of the reviewers, including, specifically, the limits of interpretation of your results with regard to control task choice, the discussion of relevant literature mentioned by the reviewers, and most crucially, please contextualize your results with regard to the other submission's results.

Reviewer #1:

This manuscript is clearly written and the methods appear to be rigorous, although the number of subjects (15) is a bit low; however, this does not appear to critically limit interpretation of the results. I appreciated the focused inclusion on expert coders to make a clear comparison to language. I also thought that the inclusion of multiple domains for comparison (logic, math, executive function, and language) was quite informative. The laterality covariance between code and language was also quite interesting. I do have some concerns with the literature review and discussion of present and previous results.

1) My main concern with this paper is that it does not clearly review previous fMRI studies on code processing. How do the present results compare with previous studies (e.g. Castelhano et al., 2019; Floyd et al., 2017; Huang et al., 2019; Krueger et al., 2020; Siegmund et al., 2014, 2017)? It seems like the localization/lateralization obtained in the present study is largely similar to these previous studies (e.g. Siegmund et al., 2017). If so, this should be discussed: a convergence across multiple methods/authors is useful to know. Any discrepancies are also useful to know. The authors suggest that "Moreover, no prior study has directly compared the neural basis of code to other cognitive domains." However, Krueger et al., (2020) and Huang et al., (2019) appear to have done this.

2) The authors should point out and discuss the difficulty of understanding the psychological and neural structure of coding in the absence of a clear theory of coding, of the kind that exists for language (e.g. Chomsky, 1965; Levelt, 1989; Lewis and Vasishth, 2005). On this point, I appreciate the reference to Fitch et al., (2005) regarding recursion in coding, but I think it would be most helpful to have a clear example of recursion in Python code. However, the authors at least focus their results on neural underpinnings without attempting to make strong claims about cognitive underpinnings.

3) The authors report overlap between code comprehension and language in the posterior MTG and IFG. They note that these activations were somewhat inconsistent; yet they did observe this significant overlap. However, the paper discusses the results as if this overlap did not occur, e.g. "We find that the perisylvian fronto-temporal network that is selectively responsive to language, relative to math, does not overlap with the neural network involved in code comprehension." This is not accurate, as there indeed was overlap. It is important to point out that among language-related regions, these two regions are the most strongly associated with abstract syntax (Friederici, 2017; Hagoort, 2005; Tyler and Marslen-Wilson, 2008; Pallier et al., 2011; Bornkessel-Schlesewsky and Schlesewsky, 2013; Matchin and Hickok, 2019), which could very well be a point of shared resources between code and language (as discussed in Fitch, 2005).

Reviewer #2:

The goal of this fMRI study was to determine which brain systems support coding, by way of extent of overlap of univariate maps with localizer tasks for language, logic, math, and executive functions. The basic conclusion is one we could have anticipated: coding engages a widespread frontoparietal network, with stronger involvement of the left hemisphere. It overlaps with all of the other tasks, but most with the map for logic. This doesn't seem too surprising, but the authors argue convincingly that others wouldn't have predicted that.

It's unfortunate that there are differences in task difficulty among the tasks, in particular, that the logic task was the most difficult of all (both in terms of accuracy and response times), since that happens to be the one that had the largest number of overlapping voxels with the coding task. We can't know whether coding and language task voxels would have overlapped more if the language task had been more difficult.

It seems a shame to present data only from highly experienced coders (11+ years of experience); I can imagine that the investigators are planning to write up another study examining effects of expertise, in comparison with less experienced coders. This seems like an initial paper that's laying the groundwork for a more groundbreaking one.

Reviewer #3:

This fMRI study examines an interesting question, namely how computer code – as a "cognitive/cultural invention" – is processed by the human brain. However, I have a number of concerns with regard to how this question was examined in terms of experimental design, including the choice of control condition (fake code) and the way in which localiser tasks were utilised. In addition, the sample size is very small (n=15) and there appear to be large inter-individual differences in coding performance (in spite of the recruitment of expert programmers). In summary, while promising in its aims, the study's conclusions are weakened by these considerations related to its execution.

1) The control condition

The experiment contrasted real Python code with fake code in the form of "incomprehensible scrambled Python functions". Real and fake code also differed in regard to the task performed (code comprehension versus memory) and were distinguished via colour coding. There is a lot to unpack here in regard to how processing might differ between the two different conditions. For example, the real-code blocks required code comprehension as well as computational problem solving (which does not necessarily require the use of code), while the control task requires neither. As a result of the colour coding, it also appears likely that participants will have approached the fake code blocks with a completely different processing strategy than the real-code blocks. These are just a few obvious differences between the conditions but there are likely many more given how different they are. This, in my view, makes it difficult to interpret the basic contrast between real and fake code.

2) Use of localiser tasks

A similar concern as for point 1 holds in regard to the localiser tasks that were used in order to examine anatomical overlap (or lack thereof) between code comprehension and language, maths, logical problem solving and multiple-demand executive control, respectively. I am generally somewhat sceptical in regard to the use of functional localisers in view of the assumptions that necessarily enter into the definition of a localiser task. This concern is exacerbated by the way in which localisers were employed in the present study. Firstly, in addition to the definition of the localiser task itself, this study used localiser contrasts to define networks of interest. For example, the contrast language localiser > maths localiser served to define the "language network". Thus, assumptions about the nature of the localiser itself are compounded with those regarding the nature of the contrast. Secondly, particularly with regard to language, the localiser task was very high level, i.e. requiring participants to judge whether an active and a passive sentence had the same meaning (with both statements remaining on the screen at the same time). While of course requiring language processing, this task is arguably also a problem solving task of sorts. It is certainly more complex than a typical task designed to probe fast and automatic aspects of natural language processing.

In addition, given that reading is also a cultural invention, is it really fair to say that coding is being compared to the "language network" here rather than to the "reading network" (in view of the visual presentation of the language task)? The possible implications of this for the interpretation of the data should be considered.

More generally, while an anatomical overlap between networks active during code comprehension and networks recruited during other cognitive tasks may shed some initial light on how the brain processes code, it doesn't support any particularly strong conclusions about the neural mechanisms of code processing in my view. While code comprehension may overlap anatomically with regions involved in executive control and logic, this doesn't mean that the same neuronal populations are recruited in each task nor that the processing mechanisms are comparable between tasks.

3) Sample size and individual differences

At n=15, the sample size of this study is quite small, even for a neuroimaging study. This again limits the conclusions that can be drawn from the study results.

Moreover, the results of the behavioural pre-test – which was commendably included – suggest that participants differed considerably with regard to their Python expertise. For the more difficult exercise in this pre-test, the mean accuracy score was 64.6% with a range from 37.5% to 93.75%. These substantial differences in proficiency weren't taken into account in the analysis of the fMRI data and, indeed, it appears difficult to meaningfully do so in view of the sample size.

https://doi.org/10.7554/eLife.59340.sa1

Author response

Reviewer #1:

This manuscript is clearly written and the methods appear to be rigorous, although the number of subjects (15) is a bit low; however, this does not appear to critically limit interpretation of the results. I appreciated the focused inclusion on expert coders to make a clear comparison to language. I also thought that the inclusion of multiple domains for comparison (logic, math, executive function, and language) was quite informative. The laterality covariance between code and language was also quite interesting. I do have some concerns with the literature review and discussion of present and previous results.

1) My main concern with this paper is that it does not clearly review previous fMRI studies on code processing. How do the present results compare with previous studies (e.g. Castelhano et al., 2019; Floyd et al., 2017; Huang et al., 2019; Krueger et al., 2020; Siegmund et al., 2014, 2017)? It seems like the localization/lateralization obtained in the present study is largely similar to these previous studies (e.g. Siegmund et al., 2017). If so, this should be discussed: a convergence across multiple methods/authors is useful to know. Any discrepancies are also useful to know. The authors suggest that "Moreover, no prior study has directly compared the neural basis of code to other cognitive domains." However, Krueger et al., (2020) and Huang et al., (2019) appear to have done this.

In response to the reviewer’s suggestion, we have added a paragraph to the Introduction that reviews the prior literature on code processing. We also added a section to the Discussion integrating the current results with the co-published article by Ivanova and colleagues. Overall, the current results are consistent with the findings of Ivanova et al. in observing fronto-parietal responses to code. The convergences and differences between the current article and that of Ivanova are discussed in detail with regard to lateralization and localization.

Overall, it seems that the current results are convergent with prior work. However, with regard to studies prior to the Ivanova paper, it is difficult to say conclusively because of the differences between the current study and prior research. The handful of prior studies that have examined the neural bases of coding used more complex tasks (e.g. writing code, debugging code, evaluating proposed code edits) rather than code comprehension. Control conditions in prior studies also differ from the current study (e.g. mental rotation) and are often complex (e.g. prose editing). In several cases the contrasts that would enable comparison with the current study are not reported, or the methods and analyses are not described in detail, perhaps partly because all prior findings appear in computer science conference proceedings, which have different reporting practices. Below we summarize the studies noted by the reviewer.

Huang et al., (2019) subtracted neural activation associated with the mental rotation of three-dimensional objects from neural activation associated with manipulation of two types of data structures (sequences and binary search trees). For sequences, but not for binary trees, greater fronto-parietal activity was observed for the code tasks than for the mental rotation task. This study did not directly examine code comprehension per se and did not localize functions, such as language or working memory, in the same group of participants.

Krueger et al., (2020) compared code writing to prose writing, either in a fill-in-the-blank format, or in the format of free writing in response to a prompt. Analyses compared code and prose directly using three contrasts: all code > all prose, fill-in-the-blank code > fill-in-the-blank prose, and free code > free prose. As we now note in the Introduction, this study observed fronto-parietal activity for code relative to prose and is thus generally consistent with the current findings.

Note that neither Huang et al., (2019) nor Krueger et al., (2020) localized specific, previously documented neural networks and compared code comprehension to them. They therefore reported differences in the neural bases of code and other cognitive tasks, but not their shared neural resources.

Siegmund et al., (2017) compared comprehension of different types of Java code. In particular, they compared comprehension of code with meaningful versus meaningless variable and function names, and compared code comprehension of both types to detection of syntactic errors within code. These comparisons are not specifically designed to identify networks involved in code comprehension per se. A collection of regions was more responsive during code comprehension than during syntactic error detection, and some of these regions may overlap with the current fronto-parietal system. However, no coordinate table is reported, and the authors note that some of the areas may overlap with language networks. Direct comparisons are therefore difficult. We now specifically point to the Siegmund et al., 2014 and 2017 papers in the Introduction.

2) The authors should point out and discuss the difficulty of understanding the psychological and neural structure of coding in the absence of a clear theory of coding, of the kind that exists for language (e.g. Chomsky, 1965; Levelt, 1989; Lewis and Vasishth, 2005). On this point, I appreciate the reference to Fitch et al., (2005) regarding recursion in coding, but I think it would be most helpful to have a clear example of recursion in Python code. However, the authors at least focus their results on neural underpinnings without attempting to make strong claims about cognitive underpinnings.

We agree with the reviewer’s point that a theory of coding is needed to properly understand its neurocognitive basis. We added text to the Discussion section pointing this out.

In response to the reviewer’s suggestion, we also added examples of recursion in computer code in the Introduction. Specifically, we mentioned the following examples:

IF conditionals embedded within IF conditionals:

if (condition_1):
    if (condition_2):
        print("Both conditions are True.")
    else:
        print("Condition_1 is True, condition_2 is False.")
else:
    print("Condition_1 is False. Condition_2 not evaluated.")

A function calling itself in its definition:

def factorial(N):
    return N * factorial(N - 1) if (N > 1) else 1
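As a quick check, the recursive factorial definition quoted in the manuscript is ordinary runnable Python; a minimal self-contained sketch exercising the base case and the recursive step:

```python
def factorial(N):
    # ternary form: recurse on N - 1 until the base case N <= 1
    return N * factorial(N - 1) if (N > 1) else 1

# base case followed by two recursive cases
print(factorial(1), factorial(4), factorial(6))  # 1 24 720
```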

3) The authors report overlap between code comprehension and language in the posterior MTG and IFG. They note that these activations were somewhat inconsistent; yet they did observe this significant overlap. However, the paper discusses the results as if this overlap did not occur, e.g. "We find that the perisylvian fronto-temporal network that is selectively responsive to language, relative to math, does not overlap with the neural network involved in code comprehension." This is not accurate, as there indeed was overlap. It is important to point out that among language-related regions, these two regions are the most strongly associated with abstract syntax (Friederici, 2017; Hagoort, 2005; Tyler and Marslen-Wilson, 2008; Pallier et al., 2011; Bornkessel-Schlesewsky and Schlesewsky, 2013; Matchin and Hickok, 2019), which could very well be a point of shared resources between code and language (as discussed in Fitch, 2005).

In response to the reviewer’s suggestion, we changed the wording in the Discussion to state more narrowly that there was low and variable overlap between code comprehension and language, rather than no overlap. It is worth noting that half of the participants showed overlap no different from chance. Nevertheless, we now also point out that the overlapping regions have been implicated in syntactic processing as well as in semantics, and that we cannot rule out the possibility that the language network plays some role in code comprehension.

Reviewer #2:

The goal of this fMRI study was to determine which brain systems support coding, by way of extent of overlap of univariate maps with localizer tasks for language, logic, math, and executive functions. The basic conclusion is one we could have anticipated: coding engages a widespread frontoparietal network, with stronger involvement of the left hemisphere. It overlaps with all of the other tasks, but most with the map for logic. This doesn't seem too surprising, but the authors argue convincingly that others wouldn't have predicted that.

We agree with the reviewer that overlap between the activation for code comprehension and logical reasoning makes a lot of sense but is not represented in the literature.

It's unfortunate that there are differences in task difficulty among the tasks, in particular, that the logic task was the most difficult of all (both in terms of accuracy and response times), since that happens to be the one that had the largest number of overlapping voxels with the coding task. We can't know whether coding and language task voxels would have overlapped more if the language task had been more difficult.

We agree with the reviewer that the difference in difficulty among the localizer tasks is less than ideal. However, in the current study it is also intrinsic to the design. Language is a putatively evolutionarily ancient function that is acquired by children early in life without explicit instruction. By contrast, formal logical reasoning is an explicit skill that requires teaching and is not fully mastered even by many educated adults. These cognitive domains therefore differ in difficulty by design. We could make the language task artificially more difficult, for example by constructing complex grammatical sentences. However, such sentences are not representative of the type of processing we typically do with language. More importantly, previous evidence suggests that when language becomes difficult in this way, non-language-specific networks are recruited for processing. Despite the fact that the language task is easier, large swaths of cortex are more responsive to language than to math, suggesting that responses in this particular network are not related to task difficulty per se.

Finally, because we worried about the same issue, our analyses control for the overall number of active vertices in each contrast. Overlap is calculated as the ratio between “the number of overlapping vertices” and “the square root of the product of the numbers of vertices in the code contrast and each localizer contrast”, thus normalizing for the number of vertices involved in the contrasts. Additionally, we plot the degree of overlap of each task with itself separately and show the degree of overlap with code relative to this measure.
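The normalization described above amounts to a cosine-style overlap index: shared vertices divided by the geometric mean of the two contrasts' active-vertex counts. A minimal sketch, assuming boolean vertex masks for the two contrasts (the function name and toy data are illustrative, not taken from the study):

```python
import numpy as np

def normalized_overlap(mask_a, mask_b):
    """Shared vertices divided by the geometric mean of the two
    contrasts' active-vertex counts; ranges from 0 to 1."""
    n_shared = np.logical_and(mask_a, mask_b).sum()
    return n_shared / np.sqrt(mask_a.sum() * mask_b.sum())

# toy example: 10 vertices, 4 active in each contrast, 2 shared
code_mask = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
logic_mask = np.array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0], dtype=bool)
print(normalized_overlap(code_mask, logic_mask))  # 2 / sqrt(4 * 4) = 0.5
```

A contrast's overlap with itself is exactly 1 under this index, which is why plotting each task's self-overlap gives a natural ceiling against which to compare its overlap with code.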

It seems a shame to present data only from highly experienced coders (11+ years of experience); I can imagine that the investigators are planning to write up another study examining effects of expertise, in comparison with less experienced coders. This seems like an initial paper that's laying the groundwork for a more groundbreaking one.

We appreciate the reviewer’s insight into the future direction of this research project. The current study is indeed an initial paper which serves as the groundwork for further studies. Among the many possible follow-up studies, examining the effects of expertise is definitely one of the most exciting.

Reviewer #3:

This fMRI study examines an interesting question, namely how computer code – as a "cognitive/cultural invention" – is processed by the human brain. However, I have a number of concerns with regard to how this question was examined in terms of experimental design, including the choice of control condition (fake code) and the way in which localiser tasks were utilised. In addition, the sample size is very small (n=15) and there appear to be large inter-individual differences in coding performance (in spite of the recruitment of expert programmers). In summary, while promising in its aims, the study's conclusions are weakened by these considerations related to its execution.

1) The control condition

The experiment contrasted real Python code with fake code in the form of "incomprehensible scrambled Python functions". Real and fake code also differed in regard to the task performed (code comprehension versus memory) and were distinguished via colour coding. There is a lot to unpack here in regard to how processing might differ between the two different conditions. For example, the real-code blocks required code comprehension as well as computational problem solving (which does not necessarily require the use of code), while the control task requires neither. As a result of the colour coding, it also appears likely that participants will have approached the fake code blocks with a completely different processing strategy than the real-code blocks. These are just a few obvious differences between the conditions but there are likely many more given how different they are. This, in my view, makes it difficult to interpret the basic contrast between real and fake code.

We agree with the reviewer that the real-code and fake-code conditions differed in various ways. Since this is one of the first studies to examine code comprehension, the experiment was designed to maximally capture potentially interesting neural structures involved in code comprehension by using a relatively low-level control condition that subtracts out basic reading processes and working memory demands. For this reason, we also presented real and fake code in different colors, so that participants would not try to figure out whether a stimulus was real code, which would contaminate the neural response to fake code with attempted code comprehension. We now explicitly point out these goals in the Materials and methods section.

Because the contrast between code and fake code is broad, it was also important to show that patterns of activity in the code-sensitive network can distinguish IF and FOR code functions. In future work, it will be important to dig more deeply into what the identified network is processing and how. We now specifically point this out in the subsection “Code overlaps with logic”. In particular, as the reviewer correctly points out, the current paper does not determine whether the code-sensitive network identified is involved in computing algorithms (what the reviewer calls computational problem solving) or in code-relevant syntax. In the coding literature this is sometimes referred to as the semantics vs. syntax of code (e.g. citation). Both of these types of processes are part and parcel of code comprehension, although only some of them are unique to code. The results of the Ivanova study speak to this question to some degree. We now discuss this issue in the Discussion section.
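The IF/FOR decoding mentioned above is a multivariate pattern classification; a minimal sketch of that style of analysis using scikit-learn (cited in the References) on synthetic patterns. The data, dimensions, and classifier settings here are illustrative assumptions, not the study's actual pipeline:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_trials, n_vertices = 40, 200

# synthetic per-trial activity patterns: FOR and IF trials differ by a
# small condition-specific signal carried on a subset of vertices
signal = np.zeros(n_vertices)
signal[:20] = 0.8
X = np.vstack([
    rng.normal(0.0, 1.0, (n_trials, n_vertices)) + signal,  # FOR trials
    rng.normal(0.0, 1.0, (n_trials, n_vertices)) - signal,  # IF trials
])
y = np.array([0] * n_trials + [1] * n_trials)

# cross-validated decoding accuracy; chance level is 0.5
accuracy = cross_val_score(LinearSVC(), X, y, cv=5).mean()
print(f"decoding accuracy: {accuracy:.2f}")
```

Above-chance cross-validated accuracy is the evidence that the network's spatial pattern of activity carries information distinguishing the two code constructs, even when their mean activation levels are similar.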

2) Use of localiser tasks

A similar concern as for point 1 holds in regard to the localiser tasks that were used in order to examine anatomical overlap (or lack thereof) between code comprehension and language, maths, logical problem solving and multiple-demand executive control, respectively. I am generally somewhat sceptical in regard to the use of functional localisers in view of the assumptions that necessarily enter into the definition of a localiser task. This concern is exacerbated by the way in which localisers were employed in the present study. Firstly, in addition to the definition of the localiser task itself, this study used localiser contrasts to define networks of interest. For example, the contrast language localiser > maths localiser served to define the "language network". Thus, assumptions about the nature of the localiser itself are compounded with those regarding the nature of the contrast. Secondly, particularly with regard to language, the localiser task was very high level, i.e. requiring participants to judge whether an active and a passive sentence had the same meaning (with both statements remaining on the screen at the same time). While of course requiring language processing, this task is arguably also a problem solving task of sorts. It is certainly more complex than a typical task designed to probe fast and automatic aspects of natural language processing.

We appreciate the reviewer’s point that localizer approaches have limitations. In our view, for the present purposes this was the best approach available, despite these limitations. Nevertheless, the approach has consequences for the inferences that can be made based on the present results. We now talk about this issue in the Discussion section:

“Notably, in the current study neural responses associated with language, math and logic were localized partly relative to each other (e.g. logic > language). This approach focuses the analyses on networks that are uniquely involved in one domain vs. another. The degree to which areas shared across language, logic, and math are implicated in code comprehension remains to be addressed in future work.”

With regard to the reviewer’s specific comment about the language task, we agree that there is a problem-solving component. However, this component is also present in the other control tasks (i.e. logic and math) and is matched as closely as possible across the tasks. Furthermore, the language-related responses observed in the current study are consistent with previous studies using other tasks and contrasts, including comparing passive sentence comprehension to a non-word control task (Fedorenko et al., 2011, 2016). Perhaps even more directly relevant, Ivanova and colleagues used a different language localizer task but found results consistent with the current study. This is now discussed in detail in the Discussion section.

In addition, given that reading is also a cultural invention, is it really fair to say that coding is being compared to the "language network" here rather than to the "reading network" (in view of the visual presentation of the language task)? The possible implications of this for the interpretation of the data should be considered.

We appreciate the reviewer’s point that the current language task used reading rather than spoken language comprehension. Notably, we have previously used the same language task with spoken language stimuli and observed the same neural network responsive to spoken language (Kanjlia et al., 2016). Since the control tasks are also presented in written English, reading-specific activation is likely similar across the language and control conditions.

More generally, while an anatomical overlap between networks active during code comprehension and networks recruited during other cognitive tasks may shed some initial light on how the brain processes code, it doesn't support any particularly strong conclusions about the neural mechanisms of code processing in my view. While code comprehension may overlap anatomically with regions involved in executive control and logic, this doesn't mean that the same neuronal populations are recruited in each task nor that the processing mechanisms are comparable between tasks.

We agree with the reviewer’s point that the current results do not support precise conclusions about the cognitive or neural processes involved in code comprehension, beyond comparisons to the tasks described in the study (see also response to reviewer 1, comment 2). We plan to use the findings of the current study as a necessary step toward a more detailed understanding of the neural and cognitive basis of code. In particular, we agree that future work will need to closely examine the distribution of responses to code within fronto-parietal systems. We now point out in the Discussion that future work will need to examine whether there is sub-specialization within the fronto-parietal network for code as opposed to other functions.

3) Sample size and individual differences

At n=15, the sample size of this study is quite small, even for a neuroimaging study. This again limits the conclusions that can be drawn from the study results.

We have updated the manuscript to point out the sample-size limitation (Discussion section).

Moreover, the results of the behavioural pre-test – which was commendably included – suggest that participants differed considerably with regard to their Python expertise. For the more difficult exercise in this pre-test, the mean accuracy score was 64.6% with a range from 37.5% to 93.75%. These substantial differences in proficiency weren't taken into account in the analysis of the fMRI data and, indeed, it appears difficult to meaningfully do so in view of the sample size.

As the reviewer points out, there is variability in participants’ expertise. Since coding, unlike language, is a cultural invention acquired in adulthood, such variability is to be expected; an absence of variability would more likely indicate an insensitive measure than true uniformity of expertise. We agree with the reviewer that, given the small sample size, the current study is not suited to comprehensively testing effects of expertise level. We did test whether expertise predicted lateralization patterns, and it did not; however, this conclusion is limited by the sample size. An investigation into individual differences in programming expertise, and their influence on the neural response to code comprehension, will be a topic of our future research.

https://doi.org/10.7554/eLife.59340.sa2

Article and author information

Author details

  1. Yun-Fei Liu

    Johns Hopkins University, Baltimore, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    yliu291@jhu.edu
    Competing interests
    No competing interests declared
    ORCID iD: 0000-0001-6644-813X
  2. Judy Kim

    Johns Hopkins University, Baltimore, United States
    Contribution
    Conceptualization, Resources, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Colin Wilson

    Johns Hopkins University, Baltimore, United States
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
  4. Marina Bedny

    Johns Hopkins University, Baltimore, United States
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Validation, Methodology, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared

Funding

Johns Hopkins University

  • Marina Bedny

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Ethics

Human subjects: Participants gave informed consent according to procedures approved by the Johns Hopkins University Institutional Review Board (IRB protocol number: NA_00087983).

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Andrea E Martin, Max Planck Institute for Psycholinguistics, Netherlands

Reviewers

  1. William Matchin, University of South Carolina, United States
  2. Ina Bornkessel-Schlesewsky

Publication history

  1. Received: May 26, 2020
  2. Accepted: November 6, 2020
  3. Version of Record published: December 15, 2020 (version 1)

Copyright

© 2020, Liu et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


