An Information-Theoretic Approach to Reward Rate Optimization in the Tradeoff Between Controlled and Automatic Processing in Neural Network Architectures

  1. NPLab, Network Science Institute, Northeastern University London, London, UK
  2. CENTAI Institute, Turin, Italy
  3. Department of Cognitive, Linguistic, and Psychological Sciences, Brown University, Providence, US
  4. Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany
  5. Princeton Neuroscience Institute, Princeton University, Princeton, USA

Editors

  • Reviewing Editor
    Joshua Gold
    University of Pennsylvania, Philadelphia, United States of America
  • Senior Editor
    Joshua Gold
    University of Pennsylvania, Philadelphia, United States of America

Reviewer #1 (Public Review):

Summary:
A long literature in cognitive neuroscience studies how humans and animals adjudicate between conflicting goals. However, despite decades of research on the topic, a clear computational account of control has been difficult to pin down. In this project, Petri, Musslick, & Cohen attempt to formalize and quantify the problem of control in the context of toy neural networks performing conflicting tasks.

This manuscript builds on the formalism introduced in Petri et al (2021), "Topological limits to the parallel processing capability of network architectures", which describes a set of tasks as a graph in which input nodes (stimuli) are connected to output nodes (responses). Each edge in this graph links an input node to an output node, representing a "task"; i.e. a word reading task connects the input node "word" to the output node "read". Cleverly, patterns of interference and conflict between tasks can be quantified from this graph. In the current manuscript, the authors extend this framework by converting these graphs into neural networks and a) allowing edges to be continuous rather than binary; b) introducing "hidden layers" of units between input and output nodes; and c) introducing a "control" signal that modulates edge weights. The authors then examine how, in such a network, optimal behavior may involve serial versus parallel execution of different sets of tasks.

Strengths:
There is a longstanding belief in cognitive neuroscience that "control" manages conflicts by scheduling tasks to be executed in parallel versus serially; I applaud the efforts of the authors to give these intuitions a more concrete computational grounding.

My main scientific concern is that the authors focus on what seems like an arbitrary set of network architectures. The networks considered here are derived by converting task graphs, which represent a multitasking problem, into networks for _performing_ that multitasking problem. Frankly, these networks do not look like any neural network a computer scientist would use to actually solve a problem, nor do they seem biologically realistic. Furthermore, adding hidden layers to these networks only ever seems to make performance worse (Figures 4, 11), introducing unnecessary noise and interference; it would seem more useful to study a network architecture in which hidden layers fulfilled some useful purpose (as they do in the brain and machine learning).

However, this scientific concern is secondary to the major problem with this paper, which is clarity.

Major problem: A lack of clarity

I found this paper extremely difficult to read. To illustrate my difficulty, I will describe a subset of my confusion.

The authors define the "entropy" of an action in equation 1, but the content of the equation gives what is sometimes referred to as the "surprisal" of the action. Conventionally (as per Wikipedia and any introductory textbook I am familiar with), entropy is the "expected surprisal" of a random variable, not the surprisal of a single action. This creates immediate confusion going into the results. Furthermore, defining "entropy" this way means that "information" is functionally equivalent to accuracy for the purposes of this paper, in which case I do not know what has been gained by this excursion into (non-standard) information-theoretic terminology.

They next assert that equation 1 is the information _cost_ of an action. No motivation is given for this statement and I do not know what it means. In what sense is a "cost" associated with the negative logarithm of a probability?

In the next section II.B, the authors introduce a new formalism in which responses are represented by task graph nodes _R_. What is the relationship between an action _a_ and the responses _R_? Later, in section II.C, edges _f_ in the task graph are used as seemingly drop-in replacements for actions _a_.

I simply have no idea what is going on in equations 31 through 33. Where are the functions _R_ (not to be confused with the response nodes _R_) and _S_ defined? Or how are they approximated? What does the variable _t_ mean and why does it appear and disappear from equations seemingly at random?

Response times seem to be important, but as far as I can tell, nowhere do the authors actually describe how response times are calculated for the simulated networks.

Similar issues persist through the rest of the paper: unconventional formalism is regularly introduced using under-explained notation and without a clear relationship to the scientific questions at hand. As a result, the content and significance of the findings are largely inscrutable to me, and I suspect also to the vast majority of readers.

Reviewer #2 (Public Review):

Summary:
The authors develop a normative account of automaticity-control trade-offs using the mathematics of information theory, which they apply to abstract neural networks. They use this framework to derive optimal trade-off solutions under particular task conditions.

Strengths:
On the positive side, I appreciate the effort to rigorously synthesize ideas about multi-tasking within an information-theoretic framework. There is potentially a lot of promise in this approach. The analysis is quite comprehensive and careful.

Weaknesses:
Generally speaking, the paper is very long and dense. I don't in principle mind reading long and dense papers (though conciseness is a virtue); it becomes more of a slog when it's not clear what new insights are being gained from laboring through the math. For example, after reading the Stroop section, I wasn't sure what new insight was provided by the information-theoretic formalism which goes beyond earlier models. Is this just an elegant formalism for expressing previously conceived ideas, or is there something fundamentally new here that's not predicted by other frameworks? The authors cite multiple related frameworks addressing the same kinds of data, but there is no systematic comparison of predictions or theoretical interpretations. Even in the Discussion, where related work is directly addressed, I didn't see much in terms of explaining how different models made different predictions, or even what predictions any of them make.

After a discussion of the Stroop task early in the paper, the analysis quickly becomes disconnected from any empirical data. The analysis could be much more impactful if it was more tightly integrated with relevant empirical data.

Author Response

We thank both the editors and the Reviewers for their thoughtful comments and recommendations, which will certainly help us improve the manuscript. Below we address in a brief format some of the comments made, and then outline the changes to the manuscript that we plan to implement in the revision.

We see three interrelated issues in the comments of the Reviewers:

• the length and complexity of the manuscript;

• the link to previously proposed formalisms;

• the impact of adopting the proposed information-theoretic framework.

With regard to all of these issues, we would first like to highlight that the overall goal of our effort was to integrate contributions to understanding the mechanisms underlying cognitive control across multiple different disciplines, using the information-theoretic framework as a common formalism, while respecting and building on prior efforts as much as possible. Accordingly, we sought to be as explicit as possible about how we bridge from prior work using information theory, as well as neural networks and dynamical systems theory, which contributed to the length of the original manuscript. While we continue to consider this an important goal, we will do our best to shorten and clarify the main exposition by reorganizing the manuscript as suggested by Reviewer #1 (i.e., in a way that is similar to what we did in our previous Nature Physics paper on multitasking). Specifically, we will move a substantially greater amount of the bridging material to the Supplementary Information (SI), including the detailed discussion of the Stroop task and the description of the link to Koechlin & Summerfield's [L1] information-theory formalism. We will also now include an outline of the full model, including control and learning, at the beginning of the manuscript, and then more succinctly describe simplifications that focus on specific issues and applications in the remainder of the document.

Along similar lines, we will revise and harmonize our presentation of the formalism and notation, to make these more consistent, clearer, and more concise throughout the document. Again, some of the inconsistencies in notation arose from our initial description of previous work, in particular that of Koechlin & Summerfield [L1], which was an important inspiration for our work but used slightly different notation. An important motivation for our introduction of new notation was that their formulation focused on the performance of a single task at a time, whereas a primary goal of our work was to extend the information-theoretic treatment to the simultaneous performance of multiple tasks. That is, in focusing on single tasks, Koechlin & Summerfield could refer to a task simply as a direct association between stimuli and responses, whereas we required a way of referring to sets of tasks performed at once ("multitasks"), which in turn required specification of internal pathways. Moreover, their formalism does not provide a mechanism to explicitly compute the conditional information Q(a|s) of a response/action a conditioned on a stimulus s. Our formalism instead provides a way to explicitly unpack this expression in terms of the efficacies, automatic (Eq. 5) or controlled (Eq. 15), which can also account for the competition between different stimuli {s1, s2, . . . , sn}. It also describes explicitly the competition between multiple tasks (Eq. 18, and Eq. 25 for multiple layers), because different processing schemes for the same combinations of stimuli/responses can incur different levels of internal dependencies and thus require different control strategies.

To mitigate any confusion over terminology we will, as noted above, move the detailed discussion of Koechlin & Summerfield's formulation, and how it maps to the one we present, to the SI, while taking care to introduce ours clearly at the beginning of the main document and to use it consistently throughout the remainder. We will also draw more clearly an important distinction, between informational and cognitive costs, that we did not make adequately in the original manuscript.

Finally, to more clearly and concretely convey what we consider to be the most important contributions, we will restrict the number of examples we present to those that relate most directly to the central points (e.g., the effect and limits of control in the presence of interference, and the differences in control strategy under limited temporal horizons). Accompanying our revision, we will also provide a full point-by-point response to the comments and questions raised by the Reviewers. We summarize some of the key points we will address below.

PRELIMINARY REPLY TO THE REPORT OF REVIEWER #1

We want to thank the Reviewer for the time and effort put into reviewing our paper and for the constructive feedback that was provided. We also thank the Reviewer for recognizing the need for a clear computational account of how "control" manages conflicts by scheduling tasks to be executed in parallel versus serially, and for the positive evaluation of our "efforts of the authors to give these intuitions a more concrete computational grounding." As noted in the general reply above, we regret the lack of clarity in several parts of the manuscript and in our introduction and use of the formalism. We consider the following to be the main points to be addressed:

• the role of task graphs and their mapping to standard neural architectures;

• the description of entropy and related information-theoretic concepts;

• confusing choice of symbols in our notation between stimuli/responses and serialization/reconfiguration costs;

• missing definition of response time.

Regarding the first point, we acknowledge that the network architectures we focus on do not draw direct inspiration from conventional machine learning models. Instead, our approach is rooted in the longstanding tradition of using (often simpler, but also more readily interpretable) neural network models to address human cognitive function and how this may be implemented in the brain [L2]; and, in particular, the mechanisms underlying cognitive control (e.g., [L3, L4]). In this context, we emphasize that, for analytical clarity, we deliberately abstract away from many biological details, in an effort to identify those principles of function that are most relevant to cognitive function. Nevertheless, our network architecture is inspired by two concepts that are central to neurobiological mechanisms of control: inhibition and gain modulation. Specifically, we incorporate mutual inhibition among neural processing units, a feature represented by the parameter β. This aspect of our model is consistent with biologically inspired frameworks of neural processing, such as those discussed by Munakata et al. (2011) [L5], reflecting the competitive dynamics observed in neural circuits. Moreover, we introduce the parameter ν to represent a strictly modulatory form of control, akin to the role of neuromodulators in the brain (e.g., Servan-Schreiber, Printz, & Cohen (1990) [L6]; Aston-Jones & Cohen (2005) [L7]). Finally, as the Reviewer notes, additional hidden layers can improve expressivity in neural networks, enabling the efficient implementation of more complex tasks, and are a universal feature of biological and artificial neural systems. We thus examined multitasking capability under the assumption that multiple hidden layers are present in a network, irrespective of whether they are needed to implement the corresponding tasks.
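The modulatory role of a gain parameter like ν can be illustrated with a toy sketch (our own hypothetical simplification for this reply, not the manuscript's actual equations): a node that selects among competing inputs via a softmax, where raising the gain sharpens the node's sensitivity to differences among those inputs.

```python
import math

def response_probs(inputs, nu=1.0):
    """Toy gain-modulated choice rule (hypothetical sketch, not the
    manuscript's equations): a softmax over competing inputs, where the
    gain nu scales the node's sensitivity to differences among them."""
    z = [math.exp(nu * x) for x in inputs]
    total = sum(z)
    return [v / total for v in z]

inputs = [1.0, 0.8]  # two competing input dimensions

low_gain = response_probs(inputs, nu=1.0)
high_gain = response_probs(inputs, nu=5.0)

# Higher gain sharpens the competition in favor of the stronger input,
# mimicking the modulatory effect of neuromodulators on unit sensitivity.
print(low_gain)   # roughly [0.55, 0.45]
print(high_gain)  # roughly [0.73, 0.27]
```

This is the standard intuition behind modulatory gain control (cf. [L6, L7]): the control signal does not add input of its own, it only rescales how strongly existing input differences drive the response.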

Regarding the second point, as noted above, we believe that the confusion arose from our review of the work by Koechlin & Summerfield. In their formalism, in which an action a is chosen (from a set of potential actions) with probability p(a), the cost of choosing that action is − log p(a). This is usually referred to as the information content or, alternatively, the localized entropy [L8]. As the Reviewer correctly observed, the canonical (Shannon) entropy is actually the expectation E_a[− log p(a)] over the localized entropies of a set of actions. In summarizing their formulation, we misleadingly stated that "they used standard Shannon entropy formalism as a measure of the information required to select the action a." We will now correct this to state: "[..] they used local entropy (− log p(a)) as a measure of the information required to select the action a, which can be treated as the cost of choosing that action." We follow this formulation in our own, referring to the informational cost as Ψ, and generalizing it to include cases in which more than one action may be chosen at a time.
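The distinction between the localized entropy (surprisal) of a single action and the Shannon entropy of the whole action distribution can be made concrete with a short sketch (ours, using base-2 logarithms so the units are bits; the choice of base only changes the units):

```python
import math

def surprisal(p):
    """Information content ("localized entropy") of a single outcome
    with probability p: -log2(p), measured in bits."""
    return -math.log2(p)

def shannon_entropy(dist):
    """Shannon entropy: the expected surprisal over a distribution."""
    return sum(p * surprisal(p) for p in dist if p > 0)

# A toy action distribution: one likely action, three rare ones.
dist = [0.7, 0.1, 0.1, 0.1]

# Selecting the likely action is cheap; selecting a rare one is costly.
print(surprisal(0.7))  # roughly 0.51 bits
print(surprisal(0.1))  # roughly 3.32 bits

# The entropy averages these per-action costs under the distribution.
print(shannon_entropy(dist))  # roughly 1.36 bits
```

The per-action quantity is what the cost of a single choice refers to; the Shannon entropy is its average and characterizes the distribution as a whole, which is why conflating the two terms caused confusion.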

Regarding the third point, the confusion is due to our use of the letters S and R both for the stimulus and response units (in Sec. II.B) and for the serialization and reconstruction costs (in Eqs. 31-33). We will fix this by renaming the serialization and reconstruction costs more explicitly as Ser and Rec.

Finally, we realized we never explicitly stated the expression for the response time we used, but only pointed to it in the literature. In the manuscript we used the expression given in Eq. 53 of [L9], which provides response times as a function of the error rate ER and the number of options.

PRELIMINARY REPLY TO THE REPORT OF REVIEWER #2

We want to thank the Reviewer for recognizing our effort to "rigorously synthesize ideas about multi-tasking within an information-theoretic framework" and its potential. We also thank the Reviewer for the careful comments.

To the best of our understanding, and similarly to Reviewer #1, the main comments of the Reviewer concern:

• the length and density of the paper;

• the presentation of Koechlin & Summerfield's formalism, and the mismatch with, and lack of clarity of, ours at certain points;

• the added value of the information theoretic formalism.

Regarding the first two points, which are shared with Reviewer #1, we plan to move a significant part of the manuscript to the Supplementary Information, both to improve readability and shorten the manuscript, and to provide one consistent and cleaner formalism (in particular with regard to the typos and errors highlighted by the Reviewer). In particular, with respect to the comment on Eqs. 4-6, we will clarify that the probability p[f_ij] is the probability that a certain input dimension (i in this case) is selected by node j to produce its response (averaged over the individual inputs in each input dimension). We will also take care to make sure that the definition and domain of the various probabilities and probability distributions we use are clearly delineated (e.g., where the costs computed for tasks and task pathways come from).

Regarding the third point, we hope that our work offers value in at least two ways: i) it helps bring unity to ideas and descriptions about the capacity constraints associated with cognitive control that have previously been articulated in different forms (viz., neural networks, dynamical systems, and statistical mechanical accounts); and ii) doing so within an information-theoretic framework not only lends rigor and precision to the formulation, but also allows us to cast the allocation of control in normative form, that is, as an optimization problem in which the agent seeks to minimize costs while maximizing gains. While we do not address specific empirical phenomena or datasets in the present treatment, we have done our best to provide examples showing that: a) our information-theoretic formulation aligns with treatments using other formalisms that have been used to address empirical phenomena (e.g., with neural network models of the Stroop task); and b) our formulation can serve as a framework for a normative approach to widely studied empirical phenomena (e.g., the transition from control-dependent to automatic processing during skill acquisition) that, to date, have been addressed largely from a descriptive perspective, providing a formally rigorous way to address such phenomena.

[L1] E. Koechlin and C. Summerfield, Trends in cognitive sciences 11, 229 (2007).

[L2] J. L. McClelland, D. E. Rumelhart, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 2 (MIT Press, 1986).

[L3] J. D. Cohen, K. Dunbar, and J. L. McClelland, Psychological Review 97, 332 (1990).

[L4] E. K. Miller and J. D. Cohen, Annual review of neuroscience 24, 167 (2001).

[L5] Y. Munakata, S. A. Herd, C. H. Chatham, B. E. Depue, M. T. Banich, and R. C. O’Reilly, Trends in cognitive sciences 15, 453 (2011).

[L6] D. Servan-Schreiber, H. Printz, and J. D. Cohen, Science 249, 892 (1990).

[L7] G. Aston-Jones and J. D. Cohen, Annu. Rev. Neurosci. 28, 403 (2005).

[L8] T. F. Varley, Plos one 19, e0297128 (2024).

[L9] T. McMillen and P. Holmes, Journal of Mathematical Psychology 50, 30 (2006).
