# Abstract

The fact that objects without proper support will fall to the ground is not only a natural phenomenon, but also common sense in mind. Previous studies suggest that humans may infer objects’ stability through a world model that performs mental simulations with *a priori* knowledge of gravity acting upon the objects. Here we measured participants’ sensitivity to gravity to investigate how the world model works. We found that the world model on gravity was not a faithful replica of the physical laws, but instead encoded gravity’s vertical direction as a Gaussian distribution. The world model with this stochastic feature fit nicely with participants’ subjective sense of objects’ stability and explained the illusion that taller objects are perceived as more likely to fall. Furthermore, a computational model with reinforcement learning revealed that the stochastic characteristic likely originated from experience-dependent comparisons between predictions formed by internal simulations and the realities observed in the external world, which illustrated the ecological advantage of stochastic representation in balancing accuracy and speed for efficient stability inference. The stochastic world model on gravity provides an example of how *a priori* knowledge of the physical world is implemented in mind that helps humans operate flexibly in open-ended environments.

# Significance Statement

Humans possess an exceptional capacity for inferring the stability of objects, a skill that has been crucial to the survival of our predecessors and continues to facilitate our daily interactions with the natural world. The present study elucidates that our representation of gravitational direction adheres to a Gaussian distribution, with the vertical orientation as the maximum likelihood. This stochastic representation is likely to have originated from our interactions with the physical world, conferring an ecological advantage of balancing accuracy with speed. Therefore, the world model on gravity in the mind is a distorted replica of the natural world, enabling adaptive functionality in open-ended environments and thus shedding light on developing machines imbued with embodied intelligence.

**eLife assessment**

In this **valuable** study, the authors present findings that suggest that people do not faithfully replicate the physics of the real world but rather have a stochastic world model, specifically a stochastic representation of gravity. This contrasts with prior accounts that suggested a potentially noisy Newtonian model where the noise arises from perceptual uncertainty or (inferred) external perturbations. The experimental evidence is generally **solid**, with all experiments and model simulations being consistent with the proposed account. In the revision, the authors also added a number of control experiments that address some of the most pressing concerns of the original submission.

# Introduction

About two thousand years ago, Confucius warned his disciples that a wise man should not stand next to a collapsing wall. We, wise or not, can easily judge whether a wall is stable or collapsing in a fraction of a second (Battaglia et al., 2013; Kubricht et al., 2017; McCloskey, 1983). This astonishing performance is unlikely to have been achieved by previous visual experience alone. Taking a stack consisting of ten blocks as an example (Fig. 1), we can quickly report its stability with a satisfactory accuracy of 70% on average (Bear et al., 2021; Zhang et al., 2016), but the universal cardinality of possible configurations is at least 1.14×10^{50} (Supplementary Figure 1), which is much larger than the total number of sand grains on Earth (est. 7.5×10^{18}) (Blatner, 2013). Moreover, four-month-old infants, who have little visual experience of the physical world, expect a box to fall if it loses contact with a support platform (Baillargeon, 2004, 1994). Our minds may therefore have devised a mechanism that differs from the widely used discriminative approach in artificial neural networks, which relies on the extensive visual experience of objects and feedback about their stability (Bear et al., 2021; Li et al., 2016; Zhang et al., 2016).

Indeed, both behavioral and neuroimaging studies have suggested that humans possess *a priori* knowledge of Newton’s law of physics in the mind. For example, infants as young as seven months expect a downward-moving object to accelerate and an upward-moving object to decelerate (Friedman, 2002; Kim and Spelke, 1999), and adults can estimate the remaining time to catch a moving ball (McIntyre et al., 2001; Zago and Lacquaniti, 2005) even in the absence of visual information (Lacquaniti and Maioli, 1989; Zago et al., 2009). Further fMRI studies have revealed the parieto-insular vestibular cortex in the brain as the neural basis for gravity-based stability inference, suggesting that this knowledge is encapsulated as a cognitive module (Fischer et al., 2016; Indovina et al., 2005; Pramod et al., 2022). Accordingly, our brain is proposed as a set of generative machines that actively predict future events of the ever-changing physical world through mental simulation with *a priori* knowledge acting upon the world (Battaglia et al., 2013; Hegarty, 2004; Huang and Rao, 2011; Tenenbaum et al., 2011; Ullman et al., 2017). For this reason, the generative machine is also called the world model (Land, 2014; Tenenbaum et al., 2011).

Recently, the idea of the world model has become popular to explain the predictive nature of the brain (Friston et al., 2021) and to improve the generality and robustness of artificial neural networks (Matsuo et al., 2022). However, how *a priori* knowledge is implemented in the world model remains to be determined. A prevailing theory suggests that the world model in the brain accurately mirrors the physical laws of the world (Allen et al., 2020; Battaglia et al., 2013; Zhou et al., 2022). For example, the direction of gravity encoded in the world model, a critical factor in stability inference, is assumed to be straight downward, aligning with its manifestation in the physical world. To explain the phenomenon that tall and thin objects are subjectively perceived as more unstable than short and fat ones (Supplementary Figure 2), external noise, such as imperfect perception and assumed external forces, is introduced to influence the output of the model. However, when the brain actively transforms sensory data into cognitive understanding, these data can become distorted (Kriegeskorte and Douglas, 2019; Naselaris et al., 2011), thereby introducing uncertainty into the representation of gravity’s direction. In this scenario, the world model inherently incorporates uncertainty, eliminating the need for additional external noise to explain the inconsistency between subjective perceptions of stability and the actual stability of objects. Note that the distinction between these two theories is nontrivial: the former implies a deterministic representation of the external world, while the latter suggests a stochastic one. Here, we investigated these two alternative hypotheses regarding the construction of the world model in the brain by examining how gravity’s direction is represented in the world model when participants judged object stability.

To do this, we measured participants’ sensitivity to gravity’s direction in a stability inference task (Battaglia et al., 2013) and found that gravity’s direction was encoded in a Gaussian distribution, with the vertical direction as the maximum likelihood. This stochastic parameter was then built into the world model to simulate the displacement of blocks in a stack under the force of gravity, and the simulation result fits nicely with participants’ judgment of stacks’ stability and explains the daily illusion that taller objects are perceived as more likely to fall. A computational model with a reinforcement learning algorithm was devised to reveal its origin through interactions with the physical world. Finally, we explored the ecological advantage of the stochastic feature of the world model.

# Results

## The direction of gravity in the world model

The direction of gravity is perpendicular to the ground surface. Here, we first tested humans’ sensitivity to gravity’s direction to investigate how faithfully gravity is represented in the world model compared to gravity in the physical world. To do this, we used PyBullet (Coumans and Bai, 2016), a forward physics simulator, to manipulate gravity’s direction. Then, we asked the participants to judge whether the collapse trajectories of unstable stacks were normal (Fig 1a, Supplementary Movie S1). The direction of simulated gravity was specified by a parameter pair (*θ*, *φ*) (Fig 1b), which determines the deviation of the direction of simulated gravity from the direction of gravity in the physical world. Specifically, *θ* is the vertical component of the direction that affects the degree of collapse, and *φ* is the horizontal component that determines the orientation of collapse. We collected participants’ judgments of the normality of collapse trajectories while varying *θ* from 0° to 45° and *φ* from 0° to 360° across the force space, and the confidence level of the judgment for each angle pair was used to index participants’ sensitivity to gravity’s direction (Fig 1c). As expected, when *θ* was equal to 0 (i.e., the direction of the simulated gravity was the direction of natural gravity), the participants were likely to report that the collapse trajectory was normal (accuracy: 91.0%, STD: 8.0%). Then, the critical question is how participants’ subjective sense of the normality of collapse trajectories changes as a function of *θ*. If our world model on gravity is a faithful replica of the physical reality, we should expect the immediate detection of abnormality when *θ* is away from 0.
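The mapping from an angle pair to a simulated gravity vector can be sketched as follows. This is a minimal illustration, assuming that *θ* is the tilt away from straight down and *φ* is the heading of that tilt in the horizontal plane; the function name `gravity_vector` and the magnitude of 9.8 m/s² are assumptions, not values stated in the text.

```python
import math

def gravity_vector(theta_deg, phi_deg, g=9.8):
    """Convert a (theta, phi) direction pair into a 3D gravity vector.

    theta_deg: tilt away from straight down (0 means natural gravity).
    phi_deg:   heading of the tilt in the horizontal plane.
    g:         magnitude in m/s^2 (assumed value, not given in the text).
    """
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    horizontal = g * math.sin(theta)  # size of the sideways component
    return (horizontal * math.cos(phi),
            horizontal * math.sin(phi),
            -g * math.cos(theta))     # z axis points up, so gravity is negative
```

The returned triple has the shape that a simulator’s gravity setter typically expects, for example the three components passed to PyBullet’s `setGravity`.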

Contrary to this expectation, the subjective sense of the abnormality was not immediately apparent as *θ* moved away from 0; instead, the confidence level of reporting normal trajectories decreased gradually as a function of *θ*, which was best fit by a Gaussian function with *σ* = 19.9 (Fig. 1d left). That is, the participants were 50.9% confident in reporting a normal collapse trajectory when the vertical offset of *θ* was 19.9°. In addition, accuracy in detecting the abnormality was not affected by *φ* (Supplementary Figure 3), consistent with the uniformly distributed gravitational field in the physical world. This pattern was observed for all participants tested, with *σ* varying from 11.1 to 37.1 (Supplementary Figure 3), and remained unchanged with the addition of a wall on one side to block potential external disturbances from wind (Supplementary Figure 4). Therefore, the world model on gravity is unlikely to be a faithful replica of the physical world; instead, it encodes gravity’s direction as a Gaussian distribution with the vertical direction as the maximum likelihood (Fig 1d right).
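One simple way to estimate *σ* from confidence ratings is sketched below. The fitting procedure actually used is not described in this section, so the log-linear regression here is an assumption chosen for brevity: it fits c(*θ*) = A·exp(−*θ*²/(2*σ*²)) by regressing ln c on *θ*². The data are synthetic, and the function name `fit_gaussian_sigma` is hypothetical.

```python
import math

def fit_gaussian_sigma(thetas, confidences):
    """Fit c(theta) = A * exp(-theta^2 / (2 * sigma^2)).

    Taking logs gives ln c = ln A - theta^2 / (2 sigma^2), a straight line
    in theta^2, so ordinary least squares recovers A and sigma.
    Requires positive confidences that decrease with theta overall.
    """
    xs = [t * t for t in thetas]
    ys = [math.log(c) for c in confidences]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # equals -1 / (2 sigma^2)
    intercept = mean_y - slope * mean_x    # equals ln A
    sigma = math.sqrt(-1.0 / (2.0 * slope))
    return math.exp(intercept), sigma

# Synthetic, noise-free ratings generated from known parameters.
true_amplitude, true_sigma = 0.9, 20.0
thetas = list(range(0, 50, 5))
confidences = [true_amplitude * math.exp(-t * t / (2 * true_sigma ** 2))
               for t in thetas]
amplitude, sigma = fit_gaussian_sigma(thetas, confidences)
```

With real, noisy ratings a nonlinear least-squares fit would usually be preferable, because the log transform gives extra weight to small confidence values.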

To further test whether the world model on gravity, once established, is encapsulated from visual experience and task context, we inverted the virtual environment upside down with gravity’s direction pointing upward, and then asked the same group of participants to judge whether collapse trajectories were normal (Fig 1e, see Supplementary Movie S2). We found that the confidence level also decreased gradually as a function of *θ* (Fig. 1f, *σ* = 17.2; Supplementary Figure 5 for each participant), which was not significantly different from that in the environment with gravity pointing downward. Indeed, each participant’s *σ* in the upright condition was in high agreement with the *σ* in the upside-down condition (r = 0.91, p < 0.01). That is, the visual experience and task context apparently did not cognitively penetrate humans’ world model on gravity, suggesting that it is likely encapsulated as a cognitive module.

How does the stochastic gravity’s direction in the world model affect our inference on objects’ stability? To answer this question, we recruited an independent group of participants to estimate the stability of 60 stacks of different configurations (Fig 2a), half of which were stable. During the experiment, the participants were required to judge how stable each stack was on a 0-7 scale without feedback, which was used to index their subjective sense about stacks’ stability. Two world models were constructed for comparison. One world model was equipped with a vertically downward direction of gravity without any stochastic variance. This deterministic model is intended to simulate how the stacks fall in the real world, and is therefore called a natural gravity simulator (NGS) (Fig 2b top). The other model is the same as the NGS, except that the deterministic direction of gravity in the NGS was replaced by the stochastic direction obtained from the previous psychophysical experiment. This model is thus called the mental gravity simulator (MGS, Fig 2c top). Both models quantified the stability of a stack as the proportion of blocks that remained unmoved after the simulation.

NGS-estimated stability was significantly correlated with participants’ subjective sense (Fig 2b bottom; r = 0.70, p < 0.01), consistent with previous findings (Battaglia et al., 2013). However, the participants showed an obvious bias towards predicting a collapse for stacks regardless of their actual stability, as the dots in Fig 2b are more concentrated on the lower side of the diagonal line. This phenomenon is referred to as the inference bias, which was indexed as the difference in stability estimates between the participants and the NGS (inference bias = -0.31, p < 0.01) (see Methods). In other words, the participants were unlikely to infer stacks’ stability from simulations with a deterministic direction of gravity pointing vertically downward. In contrast, the MGS randomly sampled pairs of (*θ*_{s},*φ*_{s}) from the Gaussian distribution as gravity’s directions 100 times, and the estimated stability of a stack was the averaged stability of simulations with different angle pairs. Aside from a similar magnitude of the correlation in the stability estimates between the participants and the MGS (Fig 2c bottom; r = 0.75, p < 0.01), the MGS, in contrast to the NGS, more precisely reflected participants’ judgments of stability because the points were evenly distributed along the diagonal line (inference bias = 0.04, p > 0.05; see Supplementary Figure 6 for the agreement when the MGS was implemented with different Gaussian functions). In other words, the magnitude of the correlation coefficients is not the only indicator to evaluate the model’s fitness. In short, the world model that represents gravity’s direction as a Gaussian distribution around the vertical direction properly explains our tendency to judge stacks as more prone to collapse.

The stochastic world model illustrated by the MGS that led to participants’ inference bias may explain the daily illusion that we perceive taller objects to be more unstable than shorter ones (Fig 2d left). An intuitive explanation from physics is that a tall object has a higher center of gravity, and thus an external perturbation makes it more likely to collapse. Our stochastic world model, on the other hand, provides an alternative explanation without introducing external perturbations, because the center of gravity in taller objects is more susceptible to influence when gravity deviates slightly from a strictly downward direction during humans’ internal simulations. To test this conjecture, we constructed a set of stacks with different heights, and estimated the degree of stacks’ stability with the MGS and the NGS, respectively. Because the MGS was considered to be the world model implemented in the brain, the inference bias here was calculated as the difference in stability estimates between the MGS and the NGS, with negative values indicating a tendency to judge a stable stack as an unstable one. Consistent with the inference bias found in humans, the MGS found stacks of all heights to be more prone to collapse (Fig 2d right; inference bias < 0, p < 0.01 for all heights). Critically, the bias increased monotonically with increasing height, consistent with the illusion that taller objects are considered more prone to collapse (see Supplementary Figure 7 for the inference bias when the MGS was equipped with different levels of deviation). In short, the stochastic world model on gravity provides a more concise explanation for the daily illusion that taller objects are perceived as more likely to collapse, without assuming external perturbations.
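The height effect can be reproduced with a deliberately simple toy model, sketched below. It is not the PyBullet simulation used in the study: the stack is replaced by a single rigid column that is assumed not to slide, so it tips whenever gravity is tilted by more than atan(width / height). The function names and the 2D simplification are assumptions made for illustration; only *σ* = 19.9 and the 100 samples per estimate come from the text.

```python
import math
import random

SIGMA_DEG = 19.9  # width of the gravity-direction Gaussian reported in the text

def column_stays(width, height, theta_deg):
    """Toy rule: a non-sliding rigid column stays upright while the tilt of
    gravity is within the critical angle atan(width / height)."""
    critical_deg = math.degrees(math.atan(width / height))
    return abs(theta_deg) <= critical_deg

def ngs_stability(width, height):
    """Deterministic simulator: gravity points straight down (theta = 0)."""
    return 1.0 if column_stays(width, height, 0.0) else 0.0

def mgs_stability(width, height, n_samples=100, sigma=SIGMA_DEG, seed=0):
    """Stochastic simulator: average outcome over sampled gravity tilts."""
    rng = random.Random(seed)
    stays = sum(column_stays(width, height, rng.gauss(0.0, sigma))
                for _ in range(n_samples))
    return stays / n_samples

def inference_bias(width, height, **kwargs):
    """MGS estimate minus NGS estimate; negative means judged less stable."""
    return mgs_stability(width, height, **kwargs) - ngs_stability(width, height)

if __name__ == "__main__":
    for h in (1, 2, 4, 8):
        print(h, inference_bias(1.0, h))
```

In this toy setting every column is stable under vertical gravity, so the bias can only be zero or negative, and it becomes more negative as the column grows taller because the critical angle shrinks.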

## The origin of the stochastic feature of the world model

A deterministic model that combines gravity’s veridical direction with external perturbations, such as an external force or perceptual uncertainty (Allen et al., 2020; Battaglia et al., 2013; Lake et al., 2017; Smith and Vul, 2013), is theoretically equivalent to our stochastic model that represents gravity’s direction in a Gaussian distribution; therefore, it also fits well with humans’ inference on stability by fine-tuning the parameters of external perturbations. While the cognitive impenetrability and the self-consistency observed in this study, without resorting to an external perturbation, favor the stochastic model over the deterministic one, the origin of this stochastic feature of the world model is unclear.

Here we used a reinforcement learning (RL) framework to unveil this origin, because our intelligence emerges and evolves under the constraints of the physical world. Therefore, the stochastic feature may emerge as a biological agent interacts with the environment, where the mismatches between external feedback from the environment and internal expectations from the world model are in turn used to fine-tune the world model (Friston et al., 2021; MacKay, 1956; Matsuo et al., 2022). Note that a key aspect of the framework is determining whether the stochastic nature of the world model on gravity emerges through this interaction, even in the absence of external noise. We therefore designed the RL framework to model this interactive process and to illustrate how the world model on gravity evolves (Fig 3a). Specifically, an agent perceived a stack in the environment, which was then acted upon by a simulated gravity with direction parameters (i.e., *θ* and *φ*) sampled from a spherical direction space. The initial probabilities for the sampling directions were identical (Fig 3b, left). The final state of the stack served as the agent’s expectation under the effect of the simulated gravity. The mismatch between the expectation and the observed final state of the stack under the natural gravity was used to update the sampling probability of the direction space, with a larger discrepancy leading to a larger decrease in probabilities through RL. Within this RL framework, we constructed a large number of stacks of 2 to 15 blocks to train the world model on gravity. As the training progressed, the probabilities of the direction space gradually converged downward (Fig 3b, middle; see Supplementary Figure 8 for the training trajectory).
Although gravity’s direction in the environment was vertical, the distribution of updated probabilities in the direction space was gradational (*σ* = 21.6; Fig 3b, right), which is close to gravity’s direction represented in the world model derived from the psychophysics experiment on human participants. Therefore, the world model representing gravity’s direction in a Gaussian distribution can emerge automatically as the agent interacts with the environment, without the need for any external perturbation.
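The flavor of this mismatch-driven update can be conveyed with a small sketch. It is a toy stand-in, not the RL algorithm used in the study, whose details are not given in this section. The assumptions are: the direction space is reduced to a handful of tilt angles, each training "stack" is a single non-sliding rigid column that tips when the tilt exceeds atan(1 / height), the mismatch is binary, and a sampled direction has its weight multiplied by (1 − learning rate) whenever its prediction disagrees with the observation. All names are hypothetical.

```python
import math
import random

def train_direction_weights(n_trials=5000, lr=0.05, seed=0):
    """Return candidate tilt angles and their learned sampling probabilities."""
    rng = random.Random(seed)
    thetas = list(range(0, 50, 5))   # candidate tilt angles in degrees
    weights = [1.0] * len(thetas)    # identical initial probabilities
    for _ in range(n_trials):
        # Toy "stack": a rigid column of width 1 and random height.
        height = rng.uniform(1.0, 8.0)
        critical_deg = math.degrees(math.atan(1.0 / height))
        # Sample one direction in proportion to the current weights.
        i = rng.choices(range(len(thetas)), weights=weights)[0]
        expected_falls = thetas[i] > critical_deg  # internal simulation
        observed_falls = False                     # under natural gravity it stands
        mismatch = 1.0 if expected_falls != observed_falls else 0.0
        weights[i] *= 1.0 - lr * mismatch          # mismatch lowers the weight
    total = sum(weights)
    return thetas, [w / total for w in weights]

thetas, probabilities = train_direction_weights()
```

Directions close to vertical never contradict the observations and keep their weight, whereas strongly tilted directions are penalized often, so probability mass concentrates around the vertical even though no external noise is present. The exact shape of the resulting distribution depends on the toy assumptions and should not be read as a reproduction of Fig 3b.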

To further illustrate the idea that the environment constrains the form of intelligence, we systematically manipulated the appearance of the physical world while holding the natural gravity constant. Specifically, we constructed 14 worlds, each containing stacks of the same number of blocks, but with different configurations. The number of blocks ranged from 2 to 15. We trained the world model on gravity under the same RL framework for each world, and found that all world models represented gravity’s direction in a Gaussian distribution (Fig 3c left; see Supplementary Figure 9 for all world models). However, the width of the distribution, indexed by the parameter *σ*, decreased monotonically as the number of blocks increased (Fig 3c right). This occurred because stacks containing more blocks were in general more likely to be affected by forces whose directions were not perpendicular to the ground surface, which provided more information about gravity and thus resulted in a more accurate representation of gravity’s direction in the world model. In short, the world model on gravity resonates with not only the physical law governing the environment, but also the specific regularities of the environment the agent encountered.

## The ecological advantage of the stochastic world model

When passing a cliff face, we have to be constantly aware of the stability of the rocks on the cliff. The ideal response would be both accurate and fast, but accuracy and speed are often difficult to achieve simultaneously. Here we investigated how the world model on gravity balances these two factors with its stochastic feature. To answer this question, we used a linear classifier (i.e., logistic regression) to model humans’ decision-making behavior at different stages of the mental simulation. Specifically, we collected all the position coordinates of a stack’s blocks at different stages of the simulation. The position difference between the intermediate states of the stack and the initial state provides information about the stability of the stack. For example, a stable stack should have no difference in the positions of the component blocks at all simulation stages, and an unstable stack should have a gradually increasing position difference. If the linear classifier detected the difference in positions sufficient for the classification at any stage, it classified the stack as unstable, otherwise stable (Fig 4a). The classification accuracy gradually increased as the simulation progressed until it reached the asymptote.

As expected, for the NGS (i.e., the world model with the deterministic direction of gravity), the accuracy at the plateau was close to 100% (95.3% on average, Fig 4b top red box), significantly higher than that for the MGS (80.1% on average, Fig 4b top blue box) (t = 19.59, p<0.001), simply because of the stochastic feature of gravity’s direction. However, while the initial growth rates of both models were comparable, the MGS reached the plateau crucial for decision-making sooner than the NGS (response time, indexed by the ratio between the time to reach the plateau and the time to reach the final stage: 27.1% vs. 75.2%, t = 15.58, p < 0.001) (Fig 4b middle). The same pattern was also observed with different variances of the Gaussian distribution (Supplementary Figure 10). That is, the stochastic world model prioritized speed over accuracy, echoing the basic principle of survival: fleeing potential danger as quickly as possible, rather than making a perfect decision with a dreadful delay. In addition, by integrating the prediction accuracy and the response time as a measure of efficiency, we found that the stochastic world model provided a better balance between accuracy and speed, with an efficiency significantly higher than that provided by the NGS (3.49 vs. 1.32, t = 9.12, p < 0.001; Fig 4b bottom).

On the other hand, if time permits, multiple simulations with the MGS can significantly reduce the variance introduced by the stochastic representation of gravity’s direction (Fig 4c). To explore whether humans adopted this strategy of performing multiple simulations before making a decision, we ran the MGS simulation different numbers of times and then matched the results with humans’ performance. We found that the variance of humans’ inference on stability best matched that of the MGS after three simulations (Fig 4d; see Supplementary Figure 11 for the model-behavior correspondence under different numbers of simulations). Therefore, humans are likely to run simulations a limited number of times to infer stacks’ stability.
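The variance reduction follows from a standard result, stated here under the assumption that the simulations are independent and identically distributed (the symbols *S*_i for the stability estimate of the *i*-th simulation and *n* for the number of simulations are introduced for illustration):

$$
\operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} S_i\right) = \frac{\operatorname{Var}(S)}{n}
$$

Under this assumption, averaging three simulations would cut the variance of the estimate to one third of that of a single simulation.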

# Discussion

In this study, we investigated how the physical law of gravity is embodied in the brain as a world model that guides inferences on objects’ stability. A series of psychophysics experiments showed that the world model on gravity is not a faithful replica of the physical world, but rather a stochastic model that captures the essence of the vertically downward direction of gravity as the maximum likelihood of a Gaussian distribution. The stochastic feature of the world model not only fits humans’ stability inference behavior better than the deterministic model, but also provides new insight into the daily illusion that taller objects are perceived as more likely to collapse. We further illustrated, using reinforcement learning, how the stochastic feature evolved through interactions with the environment, and how it balanced accuracy and speed to produce a unique ecological advantage for our survival in the physical world.

About 300 years ago, the philosopher Immanuel Kant proposed the intuition of space and time as *a priori* knowledge in the mind for us to understand the physical world (Kant, 1781), but only recently have researchers investigated how the intuition is implemented in the brain as intuitive physics (Kubricht et al., 2017; McCloskey, 1983). In the Noisy Newtonian Framework, intuitive physics is depicted as a combination of Newtonian physics and uncertainty generated by noise (Battaglia et al., 2013; Kubricht et al., 2017; Sanborn et al., 2013). The introduction of uncertainty helps to reconcile the misconception occurring under unfavorable conditions, such as unfamiliar events or static scenes (Kaiser et al., 1992, 1986; Kim and Spelke, 1999; McCloskey, 1983; Smith and Vul, 2013), which was once thought to support Aristotelian physics (DiSessa, 1982; Halloun and Hestenes, 1985). The noise in previous studies was thought to originate from sources such as perceptual uncertainty or external perturbations of forces, rather than from the intuitive physical engine itself, which is thought to be a deterministic system. Our study extends these deterministic models by proposing a stochastic world model in which the noise instead comes from the representation of gravity’s direction as a Gaussian distribution. The inherent stochastic feature of gravity’s direction did not need to rely on external noise to explain the illusory instability of taller objects. In addition, this feature was supported by the cognitive impenetrability of the Gaussian distribution of gravity’s direction when gravity’s direction in the physical world was reversed (Pylyshyn, 1980).

With a reinforcement learning framework, we further proposed a possible origin of the stochastic feature of the world model through interactions with the physical world. In contrast to summarizing statistical patterns from experience (Bear et al., 2021; Li et al., 2016; Zhang et al., 2016), this framework was designed to simulate how an agent constructed the world model on gravity through agent-environment interactions. Specifically, a world model with undifferentiated directions of gravity generated a prediction on the stability of an object, and the mismatches between the prediction and the observation of the object from the physical world were used to fine-tune the distribution of the directions in the world model. This process is similar to how humans update their internal knowledge by comparing simulated expectations (Hegarty, 2004; Ullman et al., 2017) with actual observations (Baillargeon, 2004, 1994; Kotovsky and Baillargeon, 2000). After several generations of error minimization, a Gaussian distribution of gravity’s direction emerged, with the vertically downward direction as the maximum likelihood, which was similar to that observed in the human world model. Interestingly, when the physical worlds that the agent interacted with changed their appearance with stacks of different heights, the world models maintained their general patterns, but the stochastic representation of gravity’s direction changed accordingly. This finding not only supports the robustness of active inference (Hegarty, 2004; Ullman et al., 2017), which efficiently encodes critical features under different physical worlds, but also resonates with the idea that intelligence develops under the constraints of the physical world. Taken together, the finding from the RL framework implies that the world model on gravity in humans may also be constructed in the same way, possibly through the mechanism of predictive coding in a generative process (Friston, 2018; Huang and Rao, 2011).

Our world model on gravity provides an example of the world model theory that emphasizes the predictive nature of generative neural networks implemented with *a priori* knowledge of the physical world (Friston et al., 2021; Land, 2014; Matsuo et al., 2022). In contrast to traditional discriminative neural networks that learn statistical patterns for stability from gigantic amounts of labeled stacks, generative models equipped with the physics laws governing the physical world rely much less on experience. Importantly, the stochastic feature of the model further enhances the efficiency by balancing accuracy and speed, which improves our chances of better survival (Cosmides and Tooby, 1997) and adaptation to novel environments (e.g., astronauts in outer space (Wang et al., 2022)). Indeed, the close link between human cognition and the physical world through interaction may shed light on the development of a new generation of AI with human-like intelligence that can work flexibly in open-ended environments (Marcus, 2020, 2018).

# Methods

## Creating stacks with different configurations

We designed a block-stacking procedure in a physical simulation platform (PyBullet) to generate stacks with different configurations. All stacks used in this study were generated using this procedure with the same parameters listed below.

The block-stacking procedure includes three steps (Supplementary Figure 1a): (1) defining the designated area, (2) stacking blocks, and (3) fine-tuning block positions. The first step is to designate a restricted placement area. All blocks of a stack were required to be placed within the designated area. The designated area controls the aggregation level of blocks, with a small area clustering blocks closer than a large area. The designated area is determined by two horizontal parameters *x* and *y*, which separately represent the size of the area in two horizontal directions. Therefore, when the block number is fixed, a smaller area in general constructs a higher stack. After designating the area, in step two we stacked blocks in random horizontal positions within the area one by one. If no block was positioned under a new block, the new block would be directly placed on the ground; otherwise, it would stack on the positioned block. The horizontal position of each block was independently sampled from a uniform distribution, with lower and upper bounds being -*x* and +*x*, or -*y* and +*y* separately (*x* and *y* were each independently sampled from a uniform distribution *U*(0.2, 0.8)). The first two steps allow us to generate a large number of configurations within the designated area, which is the only restriction of the block-stacking procedure. To better control the physical stability of each stack, in step three we fine-tuned blocks in the stack by adjusting the overlap between each pair of neighboring blocks, which was randomly sampled from a uniform distribution *U*(0.2, 0.8). Smaller overlap between neighboring blocks is more likely to construct unstable stacks, whereas more extensive overlap results in more stable stacks. The overlap of neighboring blocks without contact is set to 0.
Note that the overlap between neighboring blocks is not the only factor determining a stack’s stability, and step three is used to generate stacks without consuming too many computational resources.

Each block has a 3D aspect ratio of 3:1:1 (length: width: height), with dimensions of 1.2 × 0.4 × 0.4 in arbitrary units. This yields three types of blocks (in which the length, width, or height is 1.2, respectively; see Supplementary Figure 1b). Each block of a stack was randomly selected as one of the three types. The mass of each block was set to 0.2 kg, and the friction coefficient and the coefficient of restitution between blocks were set to 1 and 0, respectively.
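As a rough illustration of the sampling in steps one and two, the following minimal sketch draws a designated area, block types, and horizontal positions. It is not the authors' implementation: the function name `sample_stack_layout` and the dictionary layout are hypothetical, and the sketch omits the vertical stacking logic and the overlap fine-tuning of step three.

```python
import random

LONG, SHORT = 1.2, 0.4  # 3:1:1 aspect ratio in arbitrary units (from the text)

# The three block types: the long side lies along length, width, or height.
BLOCK_TYPES = [(LONG, SHORT, SHORT), (SHORT, LONG, SHORT), (SHORT, SHORT, LONG)]


def sample_stack_layout(rng, n_blocks, low=0.2, high=0.8):
    """Sample a designated area and one (type, horizontal position) per block.

    Hypothetical helper: x and y are drawn from U(low, high), and each block
    position is drawn uniformly from [-x, x] x [-y, y].
    """
    x = rng.uniform(low, high)
    y = rng.uniform(low, high)
    layout = []
    for _ in range(n_blocks):
        size = rng.choice(BLOCK_TYPES)
        position = (rng.uniform(-x, x), rng.uniform(-y, y))
        layout.append({"size": size, "xy": position})
    return (x, y), layout


# Fixed seed only so that the sketch is reproducible.
area, layout = sample_stack_layout(random.Random(0), n_blocks=10)
```

In a full pipeline, each sampled entry would then be handed to the physics engine to create a rigid body with the mass, friction, and restitution values given above.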

## Estimating the stability of a stack

The stability of a stack was obtained by a rigid-body forward simulation under the natural gravity environment (i.e., natural gravity simulator, NGS). Natural gravity points vertically downward, and all blocks of a stack are subject to the same gravity. Gravity is the only factor that changes the state of each block, and no external force is added during the simulation. Within each simulation, we recorded 500 simulation stages. At each stage, the center position of each block was collected to measure the stability of the stack. If no block changes position during the simulation, the stack is considered stable; otherwise, it is considered unstable. We formulated the stack’s state according to the criteria below:

Where *t* is a simulation stage, m indexes the blocks of a stack, *P*_{tm} is the position of block m at stage *t*, and *ε* is the just noticeable difference (j.n.d.) of perception, which is set to 0.01.

The stability of a stack is further calculated by measuring the proportion of displaced blocks, which is formulated as the following,

Where M is the total number of blocks of a stack, and T is the final stage of the simulation (i.e., T = 500). 𝕀(•) = 1 when |*P*_{Tb} - *P*_{0b}| < *ε*, which denotes that block b has not been displaced.
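The two measures described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: displacement is taken as the Euclidean distance between block centers, `trajectory[t][m]` is a hypothetical data layout, and stability is read as the proportion of blocks whose final displacement stays below *ε*.

```python
import math

EPSILON = 0.01  # just noticeable difference given in the text


def is_stack_stable(trajectory, eps=EPSILON):
    """Return True if no block centre moves by eps or more at any stage.

    trajectory[t][m] is assumed to hold the (x, y, z) centre of block m at
    stage t, with stage 0 as the initial configuration.
    """
    initial = trajectory[0]
    return all(
        math.dist(stage[m], initial[m]) < eps
        for stage in trajectory
        for m in range(len(initial))
    )


def stack_stability(trajectory, eps=EPSILON):
    """Proportion of blocks whose final centre lies within eps of its start."""
    initial, final = trajectory[0], trajectory[-1]
    undisplaced = sum(math.dist(f, i) < eps for f, i in zip(final, initial))
    return undisplaced / len(initial)
```

For a two-block stack in which only the upper block falls, `stack_stability` returns 0.5 and `is_stack_stable` returns `False`.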

## Measuring participants’ sensitivity to gravity’s direction

We decomposed gravity’s direction into three independent components (Fig. 1b).

Where g is the magnitude of gravity (g = 9.8 m/s²), which was fixed in this study. *θ* represents the vertical component, *φ* represents the horizontal component, and x, y, and z are three mutually perpendicular axes. The direction of gravity was determined by the angle pair (*θ*, *φ*), where *θ* affects the extent of the collapse and *φ* affects the orientation of the collapse. When *θ* is 0, gravity’s direction is vertical.
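One way to turn an angle pair into a force vector is the standard spherical-coordinate decomposition sketched below. The axis convention is an assumption (z points up, so that *θ* = 0 gives gravity along -z), and `gravity_vector` is a hypothetical helper name.

```python
import math

G = 9.8  # magnitude of gravity, fixed in the study


def gravity_vector(theta_deg, phi_deg, g=G):
    """Convert (theta, phi) in degrees into Cartesian gravity components.

    Assumed convention: theta is the tilt away from the downward vertical
    and phi is the azimuth in the horizontal plane.
    """
    theta = math.radians(theta_deg)
    phi = math.radians(phi_deg)
    return (
        g * math.sin(theta) * math.cos(phi),
        g * math.sin(theta) * math.sin(phi),
        -g * math.cos(theta),
    )
```

The three components could then be passed to the simulator, for example through PyBullet's `setGravity(gx, gy, gz)`.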

We performed a psychophysics experiment to measure humans’ sensitivity to gravity’s direction. In this experiment, 10 participants (5 female, age range: 21-28) from Tsinghua University were recruited to finish four runs of the behavioral experiment, which measured their ability to detect the abnormality of stacks’ collapse trajectories. The experiment was approved by the Institutional Review Board of Tsinghua University (2022 No. 34), and informed consent was obtained from all participants before the experiment.

The collapse trajectory of a stack was solely determined by the direction of gravity, where larger values of *θ* and *φ* made the trajectories more abnormal. A pilot experiment showed that almost all *θ* values greater than 45 degrees made the collapse trajectory appear abnormal to most participants; therefore, in the experiment *θ* ranged from 0 to 45 degrees in steps of 3 degrees, and *φ* ranged from 0 to 360 degrees in steps of 24 degrees. *θ* and *φ* thus each took 16 values, which were randomly combined into 96 pairs of (*θ*, *φ*), with each value repeated 6 times in each run.

In a trial, an unstable stack was constructed, and then the camera rotated one full circle to show the 3D configuration of the stack to participants (Supplementary Movie S1). The configuration was randomly selected from a dataset of more than 2,000 unstable stacks, which was generated with the block-stacking procedure before the experiment. Each stack in the dataset was constructed with 10 blocks, and the color of each block was randomly rendered. There was a 1-sec delay after the rotation, during which the participants were instructed to infer the collapse trajectory based on the configuration. Then, simulated gravity with a direction determined by an angle pair (*θ*, *φ*) was applied to the stack, and the stack started to collapse. If the collapse trajectory met participants’ expectations, they were instructed to choose ‘Normal’; otherwise, ‘Abnormal’. Once the judgment was made, the subsequent trial started immediately. Each trial lasted about 10 seconds, so a run took about 16 minutes.
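The trial list for one run can be illustrated with a short sketch. It assumes that "randomly combined" means shuffling six copies of each *θ* value and six copies of each *φ* value independently and pairing them by position; the function name `make_angle_pairs` is hypothetical.

```python
import random


def make_angle_pairs(rng, repeats=6):
    """Build the (theta, phi) pairs for one run, each value used `repeats` times."""
    thetas = list(range(0, 46, 3))   # 0, 3, ..., 45 degrees: 16 values
    phis = list(range(0, 361, 24))   # 0, 24, ..., 360 degrees: 16 values
    theta_pool = thetas * repeats
    phi_pool = phis * repeats
    rng.shuffle(theta_pool)
    rng.shuffle(phi_pool)
    return list(zip(theta_pool, phi_pool))


pairs = make_angle_pairs(random.Random(0))  # 16 values x 6 repeats = 96 trials
```

At roughly 10 seconds per trial, the 96 trials account for the run length of about 16 minutes.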

In addition, to test whether participants’ sensitivity to gravity’s direction is encapsulated from visual experience and task context, we flipped gravity’s direction upside down by inverting the camera’s view, while the rest of the procedure remained the same.

To calculate participants’ sensitivity to gravity’s direction, we converted their behavioral judgments into confidence levels about the normal trajectory, defined as the percentage of trials in which a trajectory was judged as normal and calculated as below:

Where *n*_{θ,φ} is the number of trajectories that were judged as ‘Normal’ with the angle pair (*θ*, *φ*), and *N*_{θ,φ} is the total number of trajectories with the same angle pair. Because the angle pairs tested were a subset of all possible angle pairs, we used the average ratio along *φ* as the ratio for the untested angle pairs (Fig. 1c) to acquire each participant’s tuning curve. Finally, we calculated participants’ sensitivity by fitting their confidence levels at different *θ* to a Gaussian distribution.

Where *Ratio*_{θ} is the confidence level of *θ*, which was calculated by averaging the confidence level along all *φ* values, A is the magnitude of the Gaussian curve, and *σ* is the variance of the Gaussian curve. The best-fitted *σ* was used to index participants’ sensitivity to gravity’s direction, and a larger *σ* indicates a lower sensitivity.
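The tuning-curve analysis can be approximated with the sketch below. Several points are assumptions rather than the authors' method: the curve is taken to be a zero-centered Gaussian of the form A·exp(-θ²/(2σ²)), with *σ* acting as the width parameter in the exponent, and the fit is a simple log-linear least-squares fit chosen because it needs only the standard library. The function names are hypothetical.

```python
import math


def normal_ratio(n_normal, n_total):
    """Confidence level: share of trials judged 'Normal' for one condition."""
    return n_normal / n_total


def fit_gaussian_tuning(thetas, ratios):
    """Fit ratio = A * exp(-theta**2 / (2 * sigma**2)) and return (A, sigma).

    Uses ordinary least squares on ln(ratio) against theta**2, which is a
    stand-in for whatever optimiser was actually used. Points with a ratio
    of zero or below are skipped because their logarithm is undefined.
    """
    pts = [(t * t, math.log(r)) for t, r in zip(thetas, ratios) if r > 0]
    n = len(pts)
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    s_uu = sum((u - mean_u) ** 2 for u, _ in pts)
    slope = s_uv / s_uu
    if slope >= 0:
        raise ValueError("ratios do not decrease with theta; no Gaussian width")
    intercept = mean_v - slope * mean_u
    return math.exp(intercept), math.sqrt(-1.0 / (2.0 * slope))
```

A wider fitted curve (larger `sigma`) means that tilted gravity directions are still often accepted as normal, which corresponds to lower sensitivity.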

## Measuring participants’ ability on stability inference

Another group of 11 participants (5 female, age range: 21-32) from Tsinghua University completed a behavioral experiment for judging the stability of 60 stacks. The experiment was approved by the Institutional Review Board of Tsinghua University, and informed consent was obtained from all participants before the experiment. One male participant (age: 25) was excluded from further analyses because his judgment showed an extremely weak correlation with the actual stability of stacks (r_{s} < 0.30 for all experimental runs), as compared to the rest of the participants.

The stacks comprised 26 unstable and 34 stable stacks, which were randomly interleaved in each run. The participants were instructed to judge the stacks’ stability on an 8-point Likert scale, with 0 referring to ‘definitely unstable’ and 7 to ‘definitely stable.’ There was no feedback after each judgment. The participants completed six runs, in which the same group of stacks was presented but the sequence, the blocks’ colors, and the camera’s perspective were all randomized. After the experiment, only two participants reported that they suspected a few stacks were repeated in different runs, but they could not identify the stacks they suspected. Moreover, their behavioral performance was not significantly different from that of the other participants.

Participants’ stability judgments were rescaled to the range from 0 to 1 to match the scale of the stacks’ stability. The participants’ inference bias (IB) was indexed as the difference in stability judgment between the participants and the NGS, shown as

Negative IB indicates that participants tended to consider a stable stack as an unstable one.
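The rescaling and the bias measure can be illustrated with a small sketch. The sign convention (participant judgment minus NGS stability) follows from the statement that a negative IB means a stable stack was judged unstable; averaging a stack's ratings before subtracting is an assumption, and the function names are hypothetical.

```python
def rescale_likert(rating, low=0, high=7):
    """Map a rating on the 0-7 scale onto the 0-1 stability scale."""
    return (rating - low) / (high - low)


def inference_bias(ratings, ngs_stability):
    """Mean rescaled judgment for one stack minus its NGS stability."""
    judged = sum(rescale_likert(r) for r in ratings) / len(ratings)
    return judged - ngs_stability
```

For example, a fully stable stack (NGS stability of 1) rated 0 on every run gives an IB of -1.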

## Estimating the stability of stacks based on the stochastic world model on gravity

The actual stability of a stack can be calculated with a single simulation of the NGS. In contrast, the stochastic nature of mental gravity requires multiple simulations with different gravity directions, which we refer to as the mental gravity simulator (MGS). Specifically, we first randomly sampled several angle pairs (*θ*, *φ*) from the Gaussian distribution of gravity’s directions in humans. The distribution was the average of the two distributions acquired from the real world (i.e., gravity’s direction is downward) and the inverted world (the direction is upward), with angles having larger confidence levels being more likely to be sampled. We then applied simulated gravity with these sampled directions to the stack, and used the stability averaged across these directions as the stability of the stack estimated by the MGS. Similar to the IB between the participants and the NGS, the IB between the MGS and NGS was calculated as

Stacks of different heights were created to investigate whether the stochastic world model on gravity results in the illusion that tall objects are considered less stable than short ones. The height of a stack was correlated with the size of the designated area, with a smaller area corresponding to taller stacks. Therefore, we designated several square areas of different sizes. The side length of the squares ranged from 0.2 to 2.0 in steps of 0.1. For each square, we used the block-stacking procedure to generate 100 stable and 100 unstable stacks consisting of 10 blocks. The height of each stack was the height of its highest block.
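The multiple-simulation estimate described in this section can be sketched as below. This is a simplified stand-in, not the authors' sampler: *θ* is drawn from a zero-centered Gaussian folded onto non-negative values and *φ* is drawn uniformly, whereas the study sampled from the empirically measured distribution. The `simulate` argument is a hypothetical callable that runs one forward simulation and returns a stability between 0 and 1, and the `sigma_deg` value is left to the caller.

```python
import random


def sample_direction(rng, sigma_deg):
    """Draw one gravity direction: folded-Gaussian theta, uniform phi (degrees)."""
    theta = abs(rng.gauss(0.0, sigma_deg))
    phi = rng.uniform(0.0, 360.0)
    return theta, phi


def mgs_stability(simulate, stack, sigma_deg, n_samples, rng):
    """Average the stability from `simulate(stack, theta, phi)` over samples."""
    total = 0.0
    for _ in range(n_samples):
        theta, phi = sample_direction(rng, sigma_deg)
        total += simulate(stack, theta, phi)
    return total / n_samples
```

Subtracting the single-run NGS stability from the value returned by `mgs_stability` gives the model's inference bias for that stack, in the same way as for participants.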

## Investigating the origin of the stochastic world model on gravity

A reinforcement learning (RL) framework was used to simulate the development of the stochastic nature of the world model on gravity. To do this, we first created stacks whose block number ranged from 2 to 15 with the block-stacking procedure, and initialized a spherical force space in which *θ* ranged from 0 to 180 degrees and *φ* from 0 to 360 degrees. Each range was divided into 61 sampling angles across the spherical force space (i.e., the angle density). The spherical space covered all possible force directions, each with an identical initial probability of being sampled by the MGS. During the training, three angle pairs (*θ*, *φ*) were sampled according to the probability of the spherical space, and then applied to a stack to simulate its collapse trajectory, which was divided into 500 stages. We optimized the sampling probability of gravity’s direction by comparing the estimated stability (i.e., expectation) with the actual stability (i.e., observation) as a Q value, with a higher Q value suggesting that the sampled direction was more likely to mismatch the actual direction of gravity. The Q value was calculated as

Where *P*_{m,(θ,φ)} is the final position of block m with gravity’s direction (*θ*, *φ*), *P*_{m} is the final position of block m with the NGS, M is the block number of the stack, and the j.n.d. *ε* is set to 0.01. The mismatch between the expectation and the observation was used to update the sampling probability of the angle pair using a temporal difference optimization

Where *γ* = 0.15 is the learning rate. This process was iterated to update the sampling probability of the angle pairs (*θ*, *φ*) until the training stopped. The angle density and the learning rate are two factors that affect the learning speed. A larger angle density prolongs the time to reach convergence but enables a more detailed force space; a higher learning rate accelerates convergence but incurs larger variance during training. To balance speed and convergence, we used 100,000 configurations for the training.
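A minimal sketch of one learning step is given below, under strong assumptions. The Q value is taken to be the fraction of blocks whose final position under the sampled direction differs from the NGS outcome by at least *ε*, and the update moves an unnormalized weight toward (1 - Q) at the learning rate. Both forms are illustrative guesses consistent with the description, not the authors' equations, and all function names are hypothetical.

```python
import math

EPSILON = 0.01        # j.n.d. from the text
LEARNING_RATE = 0.15  # gamma from the text


def q_value(final_sampled, final_ngs, eps=EPSILON):
    """Fraction of blocks whose final position mismatches the NGS outcome."""
    mismatched = sum(
        math.dist(p, q) >= eps for p, q in zip(final_sampled, final_ngs)
    )
    return mismatched / len(final_ngs)


def update_weight(weight, q, lr=LEARNING_RATE):
    """Move a direction's weight toward (1 - q); the update form is assumed."""
    return weight + lr * ((1.0 - q) - weight)


def normalise(weights):
    """Turn non-negative weights keyed by (theta, phi) into probabilities."""
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}
```

Under this reading, directions that repeatedly reproduce the observed outcome keep a high weight, while directions that mismatch it are sampled less and less often.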

## Evaluating the ecological advantage of the model

To investigate how the world model on gravity balances response accuracy and speed, we trained a linear classifier (i.e., logistic regression) to model humans’ decision-making process at different simulation stages. During the simulation, the same stack was separately simulated using the NGS and MGS, and we collected the position coordinates of all blocks at each stage. Differences in the positions of the blocks between the intermediate stage and the initial stage provided information about the stability of a stack, with more displaced blocks suggesting lower stability of the stack. As the simulation proceeded, differences in position gradually accumulated for unstable stacks but remained unchanged for stable stacks. The linear classifier was trained to judge whether a stack is stable with differences in position as inputs.

We used the block-stacking procedure to create stacks consisting of 2 to 10 blocks, and estimated their stabilities with the NGS for simulation in 500 stages. For each block number, there were 100 stable and 100 unstable stacks to train the linear classifier, and its prediction accuracy was measured with another group of 100 stable and 100 unstable stacks at every simulation stage.

The difference in positions of each block between the intermediate and initial stages was used as the input of the linear classifier. Specifically, we collected all vertex positions of a block during the simulation to acquire the difference in position, which included 8 coordinate points for each block in each stage. We did not collect the central position as previously used in the stability estimation, because it did not provide information on the shape and size of the block. We separately performed the simulation using the MGS and NGS, calculated the difference in position between the intermediate stage and the initial stage, and then flattened the difference to generate 24 position features for each block (i.e., eight positions per block in three-dimensional space). Therefore, for a 10-block stack as an example, 240 position features were prepared as the input of the linear classifier.
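The feature construction can be sketched as follows. The nested-list layout (`stack[block][vertex]` holding an `(x, y, z)` tuple) and the function names are hypothetical; only the counts of 8 vertices, 3 coordinates, and 24 features per block come from the text.

```python
def block_features(vertices_t, vertices_0):
    """Flatten per-vertex displacement of one block: 8 vertices x 3 = 24 values."""
    return [
        a - b
        for vertex_t, vertex_0 in zip(vertices_t, vertices_0)
        for a, b in zip(vertex_t, vertex_0)
    ]


def stack_features(stack_t, stack_0):
    """Concatenate the 24 features of every block into one input vector."""
    features = []
    for vertices_t, vertices_0 in zip(stack_t, stack_0):
        features.extend(block_features(vertices_t, vertices_0))
    return features
```

For a 10-block stack, `stack_features` returns a vector of length 240, matching the input size stated above.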

Prediction accuracy at each stage was estimated by evaluating whether a tested stack was stable with the MGS or with the NGS. The highest accuracy across all simulation stages was used as the prediction accuracy. Accordingly, the first simulation stage to reach the maximum accuracy provided information on response speed: reaching the maximum accuracy in a smaller number of stages indicates that the classifier accomplishes stability inference in a shorter amount of time (i.e., a quick response). Therefore, we measured the response speed by estimating the number of steps needed to reach the accuracy plateau.

Where *Accuracy*_{t} is the accuracy at stage *t*, the response time is indexed by the stage at which the linear classifier first attains its maximum accuracy, and T is the total number of stages in each simulation (T = 500). Higher values indicate a longer response time (i.e., a slower response). Finally, the efficiency of the stability inference, which is the balance between accuracy and speed, was calculated by dividing the prediction accuracy by the response time.
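The speed and efficiency measures can be illustrated with a short sketch. Normalizing the first maximum-accuracy stage by the total number of stages is an assumption about how response time was scaled, and the function names are hypothetical.

```python
def first_max_stage(accuracies):
    """1-based index of the first stage that attains the maximum accuracy."""
    best = max(accuracies)
    return accuracies.index(best) + 1


def response_time(accuracies):
    """First maximum-accuracy stage as a fraction of all stages (assumed scaling)."""
    return first_max_stage(accuracies) / len(accuracies)


def efficiency(accuracies):
    """Prediction accuracy divided by response time."""
    return max(accuracies) / response_time(accuracies)
```

With per-stage accuracies of `[0.5, 0.7, 0.9, 0.9]`, the maximum is first reached at stage 3 of 4, giving a response time of 0.75 and an efficiency of about 1.2.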

# Acknowledgements

We thank all members of the Liu Lab for their valuable comments.

# Funding

This study was funded by the Beijing Municipal Science & Technology Commission and Administrative Commission of Zhongguancun Science Park (Z221100002722012), the Shuimu Tsinghua Scholar Program (T.H.), Tsinghua University Guoqiang Institute (2020GQG1016), Tsinghua University Qiyuan Laboratory, and Beijing Academy of Artificial Intelligence (BAAI).

# Competing interests

The authors declare no competing interests.

# Data and materials availability

All code and data underlying our study and necessary to reproduce the results are available on GitHub: https://github.com/helloTC/GravityWorldModel.

# Appendix: Estimating the lower bound of the possible number of configurations

A configuration is a structure composed of several contact blocks. To simplify the computation of estimating the number of possible configurations, here we constrained the shape of blocks and the position where the blocks were placed.

**The shape constraint**:

*the blocks used to form a configuration are all uniform rectangular blocks with the same aspect ratio*.

**The position constraint**:

*only one block is allowed to be placed on the same layer of the configuration*.

Thus, the problem is simplified to estimating the possible number of configurations when only one rectangular block with the aspect ratio of *α*: *β*: *γ* (i.e., **the shape constraint**) is allowed to be placed in each layer (i.e., **the position constraint**). Note that the constraints significantly reduce the number of estimated configurations.

We illustrated our solution by starting with a simple case: the aspect ratio of blocks is *α*: *α*: *α*.

## The condition when the aspect ratio of blocks is *α*: *α*: *α*

The block with the aspect ratio of *α*: *α*: *α* is a cube (Appendix Fig 1a). The side length of the cube is defined as *α*. Consider a configuration with two stacked blocks: the upper block needs to be placed within a 3*α* × 3*α* area to ensure contact with the bottom block (Appendix Fig 1b). To estimate the number of possible configurations in this simple situation, we defined a visual acuity *υ*, which is the minimum resolution needed to distinguish two stacks (i.e., j.n.d.). Note that *υ* is a small value, and here we set *υ* = 0.01 to match the minimal position difference for stability estimation in the simulation platform (please see Methods). Therefore, the possible number of configurations containing two cubic blocks is

Where the left-hand side indicates the possible number of configurations containing two cubic blocks.

We further consider the situation with more cubic blocks. A stack that contains three cubic blocks can be viewed as a cubic block placed on a two-block stack (Appendix Fig 1c). Therefore, the total possible number of configurations is the product of the numbers of two two-block configurations, which is formulated as

Similarly, the possible number of configurations for stacks containing four cubic blocks is

Accordingly, the possible number of configurations with M cubic blocks is

We have now introduced the basic idea of calculating the number of configurations, using a block with an *α*: *α*: *α* aspect ratio as a special case. We next generalize the idea to estimate the possible number when the block is rectangular with an aspect ratio of *α*: *β*: *β*.
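The cubic special case can be checked numerically with the sketch below. It assumes that the two-block count is the number of *υ*-sized grid positions inside the 3*α* × 3*α* area, i.e., (3*α*/*υ*)², and that each additional block multiplies the count by the same factor. This reading follows the description above but is an assumption about the exact formulas, and the function names are hypothetical.

```python
def two_cube_count(alpha, acuity=0.01):
    """Distinguishable placements of the upper cube in a 3*alpha x 3*alpha area."""
    per_axis = round(3 * alpha / acuity)
    return per_axis ** 2


def cube_stack_count(alpha, n_blocks, acuity=0.01):
    """Count for n_blocks cubes: one two-block factor per added block."""
    return two_cube_count(alpha, acuity) ** (n_blocks - 1)
```

For cubes with side length 0.4 and an acuity of 0.01, the two-block count under these assumptions is 120² = 14,400, and the count grows exponentially with the number of blocks.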

## The condition when the aspect ratio of blocks is *α*: *β*: *β*

A block with the aspect ratio of *α*: *β*: *β* has three types, in which the length, width, or height, respectively, is *α* and the remaining sides are *β* (*α*: *β*: *β*, *β*: *α*: *β*, and *β*: *β*: *α*; see Appendix Fig 2a). For simplicity, we label the three basic blocks as A, B, and C. The three types of blocks can generate 9 (i.e., 3^{2}) two-block configurations in total (Appendix Fig 2b). We calculate the possible number of each two-block configuration below.

The possible number of configurations for stacks containing two rectangular blocks with the aspect ratio of *α*: *β*: *β* is

A configuration containing three blocks can be viewed as a block stacked on a two-block stack (Appendix Fig 2c). Therefore,

Where *N*_{‥A} indicates the possible number when block A is stacked at the upper layer, and each term can be expanded as below.

Combining equations (4), (5) and (6), we have

And

Therefore,

Following a similar logic, the possible number of configurations containing M blocks with an aspect ratio of *α*: *β*: *β* is

## The condition when the aspect ratio of blocks is *α*: *β*: *γ*

We further generalize the problem by considering blocks with an aspect ratio of *α*: *β*: *γ*. This forms six different types: *α*: *β*: *γ*, *α*: *γ*: *β*, *β*: *α*: *γ*, *β*: *γ*: *α*, *γ*: *α*: *β*, and *γ*: *β*: *α*, where the three proportional values of each type correspond to length, width, and height, respectively. We label the six types of blocks as A, B, C, D, E, and F for simplicity.

Following the same logic as above, the different types of blocks generate 36 (i.e., 6^{2}) two-block configurations in total, and the possible number of each two-block configuration is

The possible number of configurations for stacks with M blocks with an aspect ratio of *α*: *β*: *γ* is

Therefore, we can estimate the possible number of configurations when only one rectangular block with the aspect ratio of *α*: *β*: *γ* is allowed to be placed in each layer using formulas (9) and (10).

Finally, in this study we chose blocks with an aspect ratio of 3:1:1 as building blocks for the stacks whose stability was evaluated. Specifically, for stacks consisting of 10 blocks and a j.n.d. of *υ* = 0.01, the number of configurations can be estimated with formula (9), which is 1.14 × 10^{50}.

# Article and author information

## Copyright

© 2023, Taicheng Huang & Jia Liu

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
