Assessing spatial knowledge in non-spatial perception systems using a linear decoding approach in a virtual environment.

(A) A three-dimensional virtual space is created to resemble a realistic laboratory environment with a variety of visual features. An agent moves randomly within a two-dimensional area of the three-dimensional space and processes first-person views of the environment. The central image shows the three-dimensional environment; surrounding images are example views taken by the agent at different locations and heading directions. (B) Top-down view of the area the agent can explore. We define the agent's spatial knowledge with four values: the agent's location is denoted by the Cartesian coordinates t_x, t_y; the agent's heading direction is denoted by the angle t_r; and the distance between the agent and the nearest wall is t_b. (C) Individual views are processed by perception models (deep neural networks for object recognition). We train linear regression models on internal representations from various levels of these networks to assess spatial knowledge related to self-location (t_x, t_y), heading direction (t_r), and distance to the closest wall (t_b).
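As a rough illustration of the decoding setup described in (C), the sketch below fits a ridge-regression readout from layer activations to the four spatial variables. The array names, shapes, and the choice of ridge regression are placeholder assumptions; random arrays stand in for the actual DNN activations and sampled poses (Python):

# Minimal sketch of the linear decoding setup, assuming layer activations have
# already been extracted for each sampled view. Names and shapes are illustrative
# placeholders, not the authors' actual pipeline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_views, n_units = 2000, 4096                      # e.g. activations from one model layer
features = rng.normal(size=(n_views, n_units))     # stand-in for DNN activations
targets = rng.uniform(size=(n_views, 4))           # columns: t_x, t_y, t_r, t_b

X_tr, X_te, y_tr, y_te = train_test_split(features, targets, test_size=0.2, random_state=0)

decoder = Ridge(alpha=1.0).fit(X_tr, y_tr)         # one linear readout per spatial variable
pred = decoder.predict(X_te)
err = np.abs(pred - y_te).mean(axis=0)             # mean absolute decoding error per variable
print(dict(zip(["t_x", "t_y", "t_r", "t_b"], err.round(3))))

In practice the angular variable t_r would typically be decoded via its sine and cosine rather than the raw angle; the sketch omits that detail.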

Perception models without a spatial basis possess extensive spatial knowledge.

(A) Decoding performance across tasks and model layers. Across the three tasks, mid-to-late layers of VGG-16 exhibited lower decoding errors than early layers and the penultimate layer (fc2). All layers outperformed chance (green and blue lines). (B) Decoding performance across a range of deep neural network architectures (penultimate layer; see Appendix for full results), including convolutional networks and vision transformers. All pre-trained and untrained models outperformed baseline measures. Error is in normalized virtual-environment units (see Methods). See the Appendix for full results across models, layers, and sampling rates. Shaded areas in A and error bars in B represent 95% confidence intervals of the mean decoding error (bootstrapped across locations).
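The confidence intervals reported here are bootstrapped across locations. A generic percentile-bootstrap sketch of that computation is shown below; the per-location error array is a placeholder and the exact resampling scheme is assumed rather than taken from the Methods (Python):

# Illustrative bootstrap of the mean decoding error across locations, assuming a
# per-location error array has already been computed.
import numpy as np

def bootstrap_ci(errors, n_boot=10000, ci=95, seed=0):
    """Percentile CI of the mean, resampling locations with replacement."""
    rng = np.random.default_rng(seed)
    means = np.array([rng.choice(errors, size=errors.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return errors.mean(), lo, hi

per_location_error = np.random.default_rng(1).gamma(2.0, 0.05, size=400)  # placeholder data
mean_err, lo, hi = bootstrap_ci(per_location_error)
print(f"mean error {mean_err:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")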

Representations of perception models developed for object recognition exhibit typical spatial cell-like firing profiles.

Active units were classified across model layers based on standard criteria used to identify place cells (P), head-direction cells (D), and border cells (B). (A) Pie charts illustrating the proportions of "spatial" cell types identified in DNNs, including units that are inactive. Many units satisfied the criteria for place, head-direction, and border cells irrespective of layer depth. Many units exhibited mixed selectivity, with a substantial number displaying strong place and directional tuning. (B-E) Spatial firing profiles of model units shown as spatial activation maps and polar plots. For activation maps, each unit's activation was plotted at each location in the two-dimensional area irrespective of heading direction. For direction selectivity, polar plots show the activation map's average activity across locations at a given angle, reflecting the tuning magnitude for each heading direction. (B) Example place-cell units that show strong spatial selectivity with little direction selectivity. (C) Example head-direction cell units that show strong direction selectivity but weak location selectivity. (D) Example border cell units that respond strongly to the boundaries of the environment. (E) Example mixed-selective place and head-direction units with strong spatial and directional tuning. Examples presented are from VGG-16; see Appendix for more examples from other models.
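A minimal sketch of how a unit's spatial activation map and direction-tuning profile could be computed from its activations at sampled locations and headings is given below; the bin counts and placeholder data are illustrative assumptions rather than the paper's exact procedure (Python):

# Sketch of a unit's spatial activation map (pooled over heading) and its
# directional tuning curve (the polar-plot profile). Bin counts are assumptions.
import numpy as np

def activation_map(act, xy, n_bins=20):
    """Mean activation per spatial bin, pooled over heading directions."""
    x_bins = np.digitize(xy[:, 0], np.linspace(0, 1, n_bins + 1)[1:-1])
    y_bins = np.digitize(xy[:, 1], np.linspace(0, 1, n_bins + 1)[1:-1])
    amap = np.full((n_bins, n_bins), np.nan)
    for i in range(n_bins):
        for j in range(n_bins):
            mask = (x_bins == i) & (y_bins == j)
            if mask.any():
                amap[i, j] = act[mask].mean()
    return amap

def directional_tuning(act, heading, n_bins=36):
    """Mean activation per heading-direction bin."""
    bins = np.digitize(heading, np.linspace(0, 2 * np.pi, n_bins + 1)[1:-1])
    return np.array([act[bins == k].mean() if (bins == k).any() else np.nan
                     for k in range(n_bins)])

rng = np.random.default_rng(2)
xy, heading = rng.uniform(size=(5000, 2)), rng.uniform(0, 2 * np.pi, 5000)
act = rng.random(5000)                              # placeholder for one unit's activations
amap, tuning = activation_map(act, xy), directional_tuning(act, heading)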

“Spatial” cells do not play a privileged role in spatial cognition.

Exclusion analyses showed that units exhibiting traditional spatial firing profiles do not form the basis of the model's spatial knowledge. (A) Units in each layer were ranked based on standard criteria for place cells (maximum place-field activity, number of place fields), head-direction cells (strength of directional tuning), and border cells (strength of border tuning). As more highly-ranked spatial units were excluded, overall decoding performance remained relatively stable across all tasks (top). Excluding an equivalent number of units at random yielded similar performance (bottom). (B) Model units were ranked by their contribution to spatial knowledge, measured as their contribution to decoding performance (magnitude of regression coefficients). Excluding more highly-ranked task-relevant units resulted in a marked deterioration of decoding performance for decoders trained on the remaining units (top), whereas randomly excluding an equivalent number of units had minimal impact on performance (bottom). Results shown are from VGG-16; for other models, please refer to the Appendix.
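The exclusion logic can be sketched as follows, assuming a precomputed feature matrix and targets as in the decoding setup above. Ranking by mean absolute regression coefficient stands in for the task-relevance score in (B); the spatial-criteria ranking in (A) would substitute its own per-unit scores (Python):

# Minimal sketch of the exclusion analysis: drop the top-ranked units, refit a
# decoder on the remaining ones, and compare against random exclusion.
import numpy as np
from sklearn.linear_model import Ridge

def decode_error(features, targets, keep):
    """Fit a fresh decoder on the kept units and return mean absolute error."""
    n = features.shape[0]
    tr, te = np.arange(n) < int(0.8 * n), np.arange(n) >= int(0.8 * n)
    model = Ridge(alpha=1.0).fit(features[tr][:, keep], targets[tr])
    return np.abs(model.predict(features[te][:, keep]) - targets[te]).mean()

rng = np.random.default_rng(3)
features, targets = rng.normal(size=(2000, 512)), rng.uniform(size=(2000, 2))

base = Ridge(alpha=1.0).fit(features, targets)
relevance = np.abs(base.coef_).mean(axis=0)         # task-relevance score per unit

for frac in (0.0, 0.25, 0.5, 0.75):
    n_keep = features.shape[1] - int(frac * features.shape[1])
    ranked_keep = np.argsort(relevance)[:n_keep]    # drop the most task-relevant units
    random_keep = rng.permutation(features.shape[1])[:n_keep]
    print(frac, decode_error(features, targets, ranked_keep),
          decode_error(features, targets, random_keep))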

VGG-16 (untrained).

Linear decoders trained on representations from various layers of the untrained VGG-16 achieve low errors across sampling rates, though not as low as those of its trained counterpart. Mid-to-late layers outperform early layers. As more locations are sampled for training the linear decoders, overall decoding performance improves. All model layers decode better than the two visual-invariant baselines.
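For reference, trained and untrained versions of a torchvision VGG-16 can be instantiated and probed as in the sketch below; the layer index and hook-based extraction are illustrative choices, not the specific layers analyzed here (Python):

# Sketch of obtaining trained vs. untrained VGG-16 representations with torchvision.
import torch
from torchvision import models

trained = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
untrained = models.vgg16(weights=None).eval()       # random initialization, no training

def layer_activations(model, images, layer):
    """Return the flattened activations of one layer for a batch of images."""
    captured = {}
    handle = layer.register_forward_hook(lambda m, i, o: captured.update(out=o))
    with torch.no_grad():
        model(images)
    handle.remove()
    return captured["out"].flatten(start_dim=1)

images = torch.randn(8, 3, 224, 224)                # placeholder agent views
feats_trained = layer_activations(trained, images, trained.features[16])
feats_untrained = layer_activations(untrained, images, untrained.features[16])
print(feats_trained.shape, feats_untrained.shape)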

ResNet-50 (trained).

Linear decoders trained on representations from various layers of ResNet-50 pretrained on object recognition achieve low errors across sampling rates. Mid-to-late layers outperform early layers. Decoding performance at the penultimate layer improved less than at intermediate layers as more locations were sampled for training the linear decoders. All model layers decode better than the two visual-invariant baselines.

ResNet-50 (untrained).

Similar to the ResNet-50 pretrained on images, the untrained counterpart can effectively decode spatial knowledge related to location, heading direction, and distance to borders. All layers decode better than the baseline decoders, which do not rely on visual signals from the environment.

ViT-B/16 (trained).

Linear decoders trained on different layers of the pretrained ViT model show very similar decoding performance. Overall, decoding performance is much better than that of the two baseline decoders, which do not incorporate visual signals.

ViT-B/16 (untrained).

Linear decoders trained on the untrained ViT model also achieve accurate decoding of location, heading direction, and distance to the nearest border. As with the trained version, all layers considered in our analysis achieve comparable performance and outperform the baselines.
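Intermediate ViT-B/16 representations can be read out with forward hooks on the encoder blocks, as in the sketch below; the chosen block indices and mean-pooling over tokens are illustrative assumptions (Python):

# Sketch of capturing token representations from intermediate ViT-B/16 encoder blocks.
import torch
from torchvision import models

vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1).eval()

captured = {}
def save(name):
    return lambda module, inputs, output: captured.__setitem__(name, output)

handles = [vit.encoder.layers[i].register_forward_hook(save(f"block_{i}"))
           for i in (3, 7, 11)]                     # example early / mid / late blocks
with torch.no_grad():
    vit(torch.randn(4, 3, 224, 224))                # placeholder agent views
for h in handles:
    h.remove()

# Each block output is (batch, tokens, hidden); pool over tokens before decoding.
features = {name: out.mean(dim=1) for name, out in captured.items()}
print({k: tuple(v.shape) for k, v in features.items()})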

Distribution of different spatial unit types across layers of perception models of object recognition.

Examples of model units exhibiting spatial characteristics.

Excluding units with the strongest spatial profiles had minimal impact on spatial knowledge (ResNet-50).

Excluding units by task-relevance affects spatial decoding performance (ResNet-50).

Excluding units with the strongest spatial profiles had minimal impact on spatial knowledge (ViT-B/16).

Excluding units by task-relevance affects spatial decoding performance (ViT-B/16).