Body size as a metric for the affordable world

Xinran Feng; Shan Xu; Yuannan Li; Jia Liu

doi:10.7554/eLife.90583.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Reviewing Editor
Clare Press
University College London, London, United Kingdom
Senior Editor
Timothy Behrens
University of Oxford, Oxford, United Kingdom

Reviewer #1 (Public Review):

Ps observed 24 objects and were asked which afforded particular actions (14 action types). Affordances for each object were represented by a 14-item vector, values reflecting the percentage of Ps who agreed on a particular action being afforded by the object. An affordance similarity matrix was generated which reflected similarity in affordances between pairs of objects. Two clusters emerged, reflecting correlations between affordance ratings in objects smaller than body size and larger than body size. These clusters did not correlate with each other. There was a trough in similarity ratings between objects ~105 cm and ~130 cm, arguably reflecting the body size boundary. The authors subsequently provide some evidence that this clear demarcation is not simply an incidental reflection of body size, but likely causally related. This evidence comes in the flavour of requiring Ps to imagine themselves as small as a cat or as large as an elephant and showing a predicted shift in the affordance boundary. The manuscript further demonstrates that ChatGPT (theoretically interesting because it's trained on language alone without sensorimotor information; trained now on words rather than images) showed a similar boundary.

The authors also conducted a small MRI study task where Ps decided whether a probe action was affordable (graspable?) and created a congruency factor according to the answer (yes/no). There was an effect of congruency in the posterior fusiform and superior parietal lobule for objects within body size range, but not outside. No effects in LOC or M1.

The major strength of this manuscript in my opinion is the methodological novelty. I felt the correlation matrices were a clever method for demonstrating these demarcations, the imagination manipulation was also exciting, and the ChatGPT analysis provided excellent food for thought. These findings are important for our understanding of the interactions between action and perception, and hence for researchers from a range of domains of cognitive neuroscience.

The major elements that limit the conclusions, and that I'd recommend addressing in a revision, include justification of the 80% of Ps removed for the imagination analysis, and consideration that an MRI study with 12 Ps in this context can really only provide pilot data. I'd also encourage the authors to consider theoretically how else this study could really have turned out and therefore the nature of the theoretical progress.

Specifics:
1. The main behavioural work appears well-powered (>500 Ps). This sample reduces to 100 for the imagination study, after removing Ps whose imagined heights fell within the human range (100-200 cm). Why 100-200 cm? 100 cm is pretty short for an adult. Removing 80% of data feels like conclusions from the imagination study should be made with caution.

2. There are only 12 Ps in the MRI study, which I think should mean the null effects are not interpreted. I would not interpret these data as demonstrating a difference between SPL and LOC/M1, but rather that some analyses happened to fall over the significance threshold and others did not.

3. I found the MRI ROI selection and definition a little arbitrary and not really justified, which rendered me even more cautious of the results. Why these particular sensory and motor regions? Why M1 and not PMC or SMA? Why SPL and not other parietal regions? Relatedly, ROIs were defined by thresholding pF and LOC at "around 70%" and SPL and M1 "around 80%", and it is unclear how and why these (different) thresholds were determined.

4. Discussion and theoretical implications. The authors discuss that the MRI results are consistent with the idea we only represent affordances within body size range. But the interpretation of the behavioural correlation matrices was that there was this similarity also for objects larger than body size, but forming a distinct cluster. I therefore found the interpretation of the MRI data inconsistent with the behavioural findings.

5. In the discussion, the authors outline how this work is consistent with the idea that conceptual and linguistic knowledge is grounded in sensorimotor systems. But then reference Barsalou. My understanding of Barsalou is the proposition of a connectionist architecture for conceptual representation. I did not think sensorimotor representation was privileged, but rather that all information communicates with all other to constitute a concept.

6. More generally, I believe that the impact and implications of this study would be clearer for the reader if the authors could properly entertain an alternative concerning how objects may be represented. Of course, the authors were going to demonstrate that objects more similar in size afforded more similar actions. It was impossible that Ps would ever have responded that aeroplanes afford grasping and balls afford sitting, for instance. What do the authors now believe about object representation that they did not believe before they conducted the study? Which accounts of object representation are now less likely?

https://doi.org/10.7554/eLife.90583.1.sa2

Reviewer #2 (Public Review):

Summary
In this work, the authors seek to test a version of an old idea, which is that our perception of the world and our understanding of the objects in it are deeply influenced by the nature of our bodies and the kinds of behaviours and actions that those objects afford. The studies presented here muster three kinds of evidence for a discontinuity in the encoding of objects, with a mental "border" between objects roughly of human body scale or smaller, which tend to relate to similar kinds of actions that are yet distinct from the kinds of actions implied by human-or-larger scale objects. This is demonstrated through observers' judgments of the kinds of actions different objects afford; through similar questioning of AI large-language models (LLMs); and through a neuroimaging study examining how brain regions implicated in object understanding make distinctions between kinds of objects at human and larger-than-human scales.

Strengths 
The authors address questions of longstanding interest in the cognitive neurosciences -- namely how we encode and interact with the many diverse kinds of objects we see and use in daily life. A key strength of the work lies in the application of multiple approaches, as noted in the summary. Examining the correlations among kinds of objects, with respect to their suitability for different action kinds, is novel, as are the complementary tests of judgments made by LLMs.

Weaknesses 
A limitation of the tests of LLMs may be that it is not always known what kinds of training material was used to build these models, leading to a possible "black box" problem. Further, presuming that those models are largely trained on previous human-written material, it may not necessarily be theoretically telling that the "judgments" of these models about action-object pairs show human-like discontinuities. Indeed, verbal descriptions of actions are very likely to mainly refer to typical human behaviour, and so the finding that these models demonstrate an affordance discontinuity may simply reflect those statistics, rather than evidence that affordance boundaries can arise independently even without "organism-environment interactions" as the authors claim here.

The authors include a clever manipulation in which participants are asked to judge action-object pairs, having first adopted the imagined size of either a cat or an elephant, showing that the discontinuity in similarity judgments effectively moved to a new boundary closer to the imagined scale than the veridical human scale. The dynamic nature of the discontinuity suggests a different interpretation of the authors' main findings. It may be that action affordance is not a dimension that stably characterises the long-term representation of object kinds, as suggested by the authors' interpretation of their brain findings, for example. Rather these may be computed more dynamically, "on the fly" in response to direct questions (as here) or perhaps during actual action behaviours with objects in the real world.

https://doi.org/10.7554/eLife.90583.1.sa1

Reviewer #3 (Public Review):

Summary:
Feng et al. test the hypothesis that human body size constrains the perception of object affordances, whereby only objects that are smaller than the body size will be perceived as useful and manipulable parts of the environment, whereas larger objects will be perceived as "less interesting components."

To test this idea, the study employs a multi-method approach consisting of three parts:

In the first part, human observers classify a set of 24 objects that vary systematically in size (e.g., ball, piano, airplane) based on 14 different affordances (e.g., sit, throw, grasp). Based on the average agreement of ratings across participants, the authors compute the similarity of affordance profiles between all object pairs. They report evidence for two homogenous object clusters that are separated based on their size with the boundary between clusters roughly coinciding with the average human body size. In follow-up experiments, the authors show that this boundary is larger/smaller in separate groups of participants who are instructed to imagine themselves as an elephant/cat.
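The affordance-profile analysis summarised above can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure (the reviews refer to correlation matrices but do not specify the exact metric), and the agreement rates, the reduced action list, and the "cup" object are invented purely for illustration; they are not the authors' data.

```python
from math import sqrt

# Hypothetical agreement rates: fraction of participants who judged each
# action to be afforded by the object. All numbers are invented.
ACTIONS = ["grasp", "throw", "sit", "lie"]
PROFILES = {
    "ball":     [0.95, 0.90, 0.05, 0.00],
    "cup":      [0.97, 0.60, 0.02, 0.00],
    "piano":    [0.05, 0.00, 0.40, 0.30],
    "airplane": [0.00, 0.00, 0.70, 0.60],
}

def pearson(x, y):
    """Pearson correlation between two equal-length affordance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def similarity_matrix(profiles):
    """Return object names and the pairwise profile-similarity matrix."""
    names = list(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix

names, sim = similarity_matrix(PROFILES)
for name, row in zip(names, sim):
    print(f"{name:>9}", " ".join(f"{v:+.2f}" for v in row))
```

With these made-up profiles, the two small objects correlate positively with each other and negatively with the two large ones, which is the kind of block structure the reported clusters would produce.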

In the second part, the authors ask different large language models (LLMs) to provide ratings for the same set of objects and affordances and conduct equivalent analyses on the obtained data. Some, but not all, of the models produce patterns of ratings that appear to show similar boundary effects, though less pronounced and at a different boundary size than in humans.

In the third part, the authors conduct an fMRI experiment. Human observers are presented with four different objects of different sizes and asked if these objects afford a small set of specific actions. Affordances are either congruent or incongruent with objects. Contrasting brain activity on incongruent trials against brain activity on congruent trials yields significant effects in regions within the ventral and dorsal visual stream, but only for small objects and not for large objects.

The authors interpret their findings as support for their hypothesis that human body size constrains object perception. They further conclude that this effect is cognitively penetrable, and only partly relies on sensorimotor interaction with the environment (and partly on linguistic abilities).

Strengths:
The authors examine an interesting and relevant question and articulate a plausible (though somewhat underspecified) hypothesis that certainly seems worth testing. Providing more detailed insights into how object affordances shape perception would be highly desirable. Their method of analyzing similarity ratings between sets of objects seems useful and the multi-method approach is quite original and interesting.

Weaknesses:
The study presents several shortcomings that clearly weaken the link between the obtained evidence and the drawn conclusions. Below I outline my concerns in no particular order:

Even after several readings, it is not entirely clear to me what the authors are proposing and to what extent the conducted work actually speaks to this. In the introduction, the authors write that they seek to test if body size serves not merely as a reference for object manipulation but also "plays a pivotal role in shaping the representation of objects." This motivation seems rather vague, and it is not clear to me how it could be falsified.
Similarly, in the discussion, the authors write that large objects do not receive "proper affordance representation," and are "not the range of objects with which the animal is intrinsically inclined to interact, but probably considered a less interesting component of the environment." This statement seems similarly vague and completely beyond the collected data, which did not assess object discriminability or motivational values.
Overall, the lack of theoretical precision makes it difficult to judge the appropriateness of the approaches and the persuasiveness of the obtained results. This is partly due to the fact that the authors do not spell out all of their theoretical assumptions in the introduction but insert new "speculations" to motivate the corresponding parts of the results section. I would strongly suggest clarifying the theoretical rationale and explaining in more detail how the chosen experiments allow them to test falsifiable predictions.
The authors used only a very small set of objects and affordances in their study and they do not describe in sufficient detail how these stimuli were selected. This renders the results rather exploratory and clearly limits their potential to discover general principles of human perception. Much larger sets of objects and affordances and explicit data-driven approaches for their selection would provide a far more convincing approach and allow the authors to rule out that their results are just a consequence of the selected set of objects and actions.
Relatedly, the authors could be more thorough in ruling out potential alternative explanations. Object size likely correlates with other variables that could shape human similarity judgments and the estimated boundary is quite broad (depending on the method, either between 80 and 150 cm or between 105 to 130 cm). More precise estimates of the boundary and more rigorous tests of alternative explanations would add a lot to strengthen the authors' interpretation.
Even though the division of the set of objects into two homogenous clusters appears defensible, based on visual inspection of the results, the authors should consider using more formal analysis to justify their interpretation of the data. A variety of metrics exist for cluster analysis (e.g., variation of information, silhouette values) and solutions are typically justified by convergent evidence across different metrics. I would recommend the authors consider using a more formal approach to their cluster definition using some of those metrics.
While I appreciate the manipulation of imagined body size, as a way to solidify the link between body size and affordance perception, I find it unfortunate that this is implemented in a between-subjects design, as this clearly leaves open the possibility of pre-existing differences between groups. I certainly disagree with the authors' statement that their findings suggest "a causal link between body size and affordance perception."
The use of LLMs in the current study is not clearly motivated and I find it hard to understand what exactly the authors are trying to test through their inclusion. As noted above, I think that the authors should discuss the putative roles of conceptual knowledge, language, and sensorimotor experience already in the introduction to avoid ambiguity about the derived predictions and the chosen methodology. As it currently stands, I find it hard to discern how the presence of perceptual boundaries in LLMs could constitute evidence for affordance-based perception.
Along the same lines, the fMRI study also provides very limited evidence to support the authors' claims. The use of congruency effects as a way of probing affordance perception is not well motivated. What exactly can we infer from the fact a region may be more active when an object is paired with an activity that the object doesn't afford? The claim that "only the affordances of objects within the range of body size were represented in the brain" certainly seems far beyond the data.

Importantly (related to my comments under 2) above), the very small set of objects and affordances in this experiment heavily complicates any conclusions about object size being the crucial variable determining the occurrence of congruency effects.

I would also suggest providing a more comprehensive illustration of the results (including the effects of CONGRUENCY, OBJECT SIZE, and their interaction at the whole-brain level).

Overall, I consider the main conclusions of the paper to be far beyond the reported data. Articulating a clearer theoretical framework with more specific hypotheses as well as conducting more principled analyses on more comprehensive data sets could help the authors obtain stronger tests of their ideas.
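As one illustration of the more formal cluster analysis recommended above, the following is a minimal sketch of mean silhouette values computed from a dissimilarity matrix (for example, one minus the affordance similarity). The matrix and the two candidate partitions are hypothetical, and this is only one of several metrics a convergent analysis would use.

```python
def silhouette_scores(dist, labels):
    """Per-item silhouette values from a symmetric distance matrix."""
    n = len(labels)
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            # By convention, an item alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the item's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in set(labels)
            if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

def mean_silhouette(dist, labels):
    return sum(silhouette_scores(dist, labels)) / len(labels)

# Hypothetical dissimilarities for four objects ordered by size.
DIST = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [1.0, 0.9, 0.2, 0.0],
]
size_split = [0, 0, 1, 1]   # two smallest vs. two largest objects
mixed_split = [0, 1, 0, 1]  # a partition that ignores size

print(mean_silhouette(DIST, size_split))
print(mean_silhouette(DIST, mixed_split))
```

A partition that respects the size boundary should score clearly higher than alternatives if the two-cluster reading is warranted; comparing several candidate boundaries in this way would also give a more precise estimate of where the split lies.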

https://doi.org/10.7554/eLife.90583.1.sa0

Author Response

We appreciate the insightful comments from the three reviewers on our manuscript. These comments help us improve the clarity of this manuscript. We will revise our manuscript comprehensively in a subsequent revision, and enclose a detailed response to each of these comments. In this public reply, we focus on (a) clarifying the theoretical motivation and implication of the present study, and (b) discussing the implications of our LLM study. In addition, we provide a brief justification regarding some methodological concerns shared by the reviewers.

Theoretical rationale and implication

As we stated in the manuscript, the present study tested whether body size serves as a reference for locomotion and object manipulation, or alternatively, plays a pivotal role in shaping the representation of objects as suggested by Protagoras. Behind this question is the long-lasting debate regarding the representation versus direct perception of affordance.

One outstanding theme shared by many embodied theories of cognition is the replacement hypothesis (e.g., Van Gelder, 1998). This hypothesis challenges the necessity of representation in the sense of computationalist cognitive theories (e.g., Fodor, 1975), which implies discretizing/categorizing inputs and then subjecting them to certain abstraction or symbolization so as to create discrete stand-ins for the input (e.g., representations/states). In this sense, our theoretical motivation can be restated explicitly as testing the ‘representationalization’ of affordance. That is, we tested whether object affordance would simply covary with its continuous constraints such as object size, in line with the representation-free view, or whether affordance would be ‘representationalized’, in line with the representation-based view, under the constraint of body size. Such representationalization would generate categorization between the affordable (the objects) and those beyond affordance (the environment).

Debates regarding the replacement hypothesis often turn into wrangling over the definition of representation (Shapiro, 2019). The present study tried to avoid this pitfall and instead examined where the embodied and computational theories make opposite predictions: discontinuity. Specifically, we considered two computationalist propositions about representation: (a) representations entail discretization of continuous input, and (b) the product of such discretization (representations) is supramodally accessible (that is, transcending sensorimotor processes). These claims are opposite to the prediction based on the idea of direct perception and other representation-free embodied theories.

Thus, we tested whether, for continuous action-related physical features (such as object size relative to the agents), affordance perception introduces discontinuity and qualitative dissociation, i.e., to allow the sensorimotor input to be assigned into discrete states/kinds, as representations envisioned by computationalists. Alternatively, does the activity directly mirror the input, free from discretization/categorization/abstraction, as proposed by the replacement hypothesis that organisms do not need to re-present the world as they are always in contact with the world in a continuous way?

All the experiment settings and analyses in the present study were organized around this motivation, following a progressive logic chain.

First, we tested the discretization hypothesis, that is, whether affordance leads to discontinuity in perception. Here, the discontinuity in affordance perception would be in line with the representation-based view instead of the representation-free proposals. Second, to ensure that the observed discontinuity can be attributed to the discretization of sensorimotor input involved in human-object interaction rather than amodal sources, such as the discrete abstract concepts of the objects (independent from agent motor capability), we tested the embodied nature of this discontinuity through the body imagination experiment. If there is discontinuity in representing embodied information, this discontinuity should be locked to the motor capacity (constrained by the physical constitution such as body size) of the agent, rather than reflecting independent categorization of the absolute size of the objects. Finally, we probed the supramodality of this embodied discontinuity: whether this discontinuity is accessible beyond the sensorimotor domain. To do this, we leveraged the recent advance in AI and tested whether the discretization observed in affordance perception is supramodally accessible to disembodied agents which lack access to sensorimotor input but only have access to the linguistic materials built upon discretized representations, such as large language models (LLM).

In this way, the experiments in the present study collectively contributed to the debate on the replacement theme of the embodiment of cognition, which serves as one of the three key themes of embodied theories of cognition (Shapiro, 2019). By addressing this theme, we hope to shed light on the nature of representation in, and resulting from, vision-for-action processing. Our finding regarding discontinuity suggested that sensorimotor input undergoes the discretization implied in the computationalist idea of representation. Further, not contradictory to the claims of the embodied theories, these representations do shape processes outside the sensorimotor domain, but after discretization.

Implication in the development of LLM-based agents

The finding that affordance was representationalized may have profound implications for the development of LLM-based agents. Traditional robots and non-LLM-based agents require implementation-level action instruction, acting as a tool for human beings to achieve desired results. In contrast, LLM-based agents (for a review, see Wang et al., 2023), such as Auto-GPT and BabyAGI, are able to autonomously perform tasks and achieve desired results based on LLMs’ planning ability. In this sense, LLM-based agents show a primary ability to interact on their own with the world. Generative agents, for instance, the agents in Smallville (Park et al., 2023), are a particularly applauded recent advance in the school of LLM-based agents, which show even greater potential in this aspect. Drawing on generative models to simulate human behaviors, these agents can formulate their own memories and goals, generate new environment-dependent behaviors, and interact convincingly with humans, other agents, and their environments in the process. This brings new possibilities for resolving the long-lasting challenge in artificial general intelligence (AGI) development, that is, to bestow AI with human-level ability in agent-environment interactions. However, it is worth noting that the current investigation of LLM-based agents is still largely confined to virtual environments. This leaves an open question as to how to equip these agents with the ability of agent-environment physical interaction. In particular, according to embodied theories of cognition, sensorimotor interactions with the environment provide unique knowledge upon which various cognitive domains are built. From this point of view, building agents with human-level ability in agent-environment physical interactions might provide an irreplaceable missing piece for AGI.

By probing the representation of action possibilities (affordances) provided by the environment to the agent (or the absence of them), the present study provided a clue to achieving such ability by illustrating the representationalization of affordance and the supramodality of these representations. For instance, the finding of supramodality may alleviate doubts about whether LLM-based agents can achieve a physical interaction ability comparable to that of biological agents. Specifically, LLM-based agents can leverage the affordance representation distilled into language to interact with the physical world. Indeed, by clarifying and aligning such representation with the physical constitution of LLM-based agents, and even by explicitly constructing an agent-specific object space, we may facilitate the sensorimotor interactions of LLM-based agents so as to achieve animal-level interaction ability with the world. This in turn may provide new instances for embodied theories.

Clarification on incomplete evidence

In response to the methodological and validity concerns of the reviewers, we will provide a point-by-point detailed response to reviewers enclosed with the revised manuscript. Here, we reply to the most prominent concerns.

Reviewers were concerned about the statistical power of both the body imagination experiment and the fMRI experiment. Regarding the number of participants in the imagination study, we would like to clarify that we did not remove 80% of the participants. Actually, a separate sample of participants was recruited in the body imagination experiment. The sample size for the body imagination experiment (100 participants) was indeed smaller than that recruited for the first experiment (528 participants). This is because the first experiment was set for exploratory purposes, and was designed to be over-powered.

Admittedly, the fMRI experiment recruited a small sample (12 participants), which might lead to low power in estimating the affordance effect. In revision, we will acknowledge this issue explicitly. Having said this, note that the null hypothesis of this fMRI study is the lack of two-way interaction between object size and object-action congruency, which was rejected by the significant interaction. That is, the interpretation of the present study did not rely on accepting any null effect. In addition, the fMRI experiment provided convergent evidence for the affordance discontinuity at the neural level. We showed that behind the behavioral discontinuity in action judgement, neural activity was qualitatively different between objects within the affordance boundary and those beyond, which reinforces our statement that objects were discretized along the continuous size axis into two broad categories.
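To make the logic of the two-way interaction concrete, the following is a minimal sketch of one common way to test it within participants: compute, for each participant, the congruency effect for within-range objects minus the congruency effect for beyond-range objects, then test that difference score against zero. This is an assumed, simplified stand-in and not necessarily the analysis reported in the manuscript; the function names and the response values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def interaction_scores(small_incong, small_cong, large_incong, large_cong):
    """Per-participant interaction: (incongruent - congruent) for small
    objects minus the same difference for large objects."""
    return [
        (si - sc) - (li - lc)
        for si, sc, li, lc in zip(small_incong, small_cong, large_incong, large_cong)
    ]

def one_sample_t(scores):
    """t statistic for the mean of the scores against zero (df = n - 1)."""
    return mean(scores) / (stdev(scores) / sqrt(len(scores)))

# Hypothetical ROI responses (arbitrary units) for six participants.
small_incong = [1.5, 1.7, 1.4, 1.6, 1.8, 1.6]
small_cong   = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
large_incong = [1.1, 1.0, 1.2, 0.9, 1.1, 1.0]
large_cong   = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

scores = interaction_scores(small_incong, small_cong, large_incong, large_cong)
print(one_sample_t(scores))
```

Framed this way, a reliably non-zero difference score supports the interaction itself. It does not by itself establish that the congruency effect is absent for large objects, which is the point at which sample size matters most.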

Reviewers also commented that more objects and actions should be included. We agree, and in revision, we will advocate future studies with more objects and more actions to comprehensively portray discontinuity. The present set of objects was designed to cover a relatively large range of object sizes, ranging from 14 cm to 7,618 cm, to cover most size categories studied in Konkle and Oliva's (2011) work. In addition, the actions were selected to cover daily interactions between humans and objects or environments, from single-point movements (e.g., hand, foot) to whole-body movements (e.g., lying, standing), referencing the Kinetics human action video dataset (Kay et al., 2017). Thus, this set of selected objects and actions is sufficient to test the discontinuity.

References

Fodor, J. A. (1975). The Language of Thought (Vol. 5). Harvard University Press.

Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., & Bernstein, M. S. (2023). Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442.

Shapiro, L. (2019). Embodied Cognition. Routledge.

Van Gelder, T. (1998). The dynamical hypothesis in cognitive science. Behavioral and Brain Sciences, 21(5), 615-628.

Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., ... & Wen, J. R. (2023). A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432.

https://doi.org/10.7554/eLife.90583.1.sa4
