Schematic of the overall framework. Given a task (e.g., an analogy to solve), inputs (denoted as {A, B, C, D}) are represented by the grid cell code, consisting of units (“grid cells”) representing different combinations of frequencies and phases.
Grid cell embeddings (xA, xB, xC, xD) are multiplied elementwise (represented as a Hadamard product ⊙) by a set of learned attention gates g, then passed to the inference module R. The attention gates g are optimized using 𝓛DPP, which encourages attention to grid cell embeddings that maximize the volume of the representational space. The inference module outputs a score for each candidate analogy (consisting of A, B, C, and a candidate answer choice D). The scores for all answer choices are passed through a softmax to generate an answer ŷ, which is compared against the target y to compute the task loss 𝓛task.
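The gating and the two losses described in the caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, and the use of a log-determinant of the Gram matrix as the volume term in 𝓛DPP are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dpp_volume_loss(X, g, eps=1e-6):
    """Sketch of L_DPP: negative log-volume spanned by the gated embeddings.

    X : (n, d) grid cell embeddings; g : (d,) learned attention gates.
    Minimizing this loss increases det(K), i.e., the volume of the
    representational space covered by the attended embeddings.
    (Assumed form for illustration.)
    """
    Xg = X * g                                  # Hadamard product with gates
    K = Xg @ Xg.T                               # Gram matrix of gated embeddings
    _, logdet = np.linalg.slogdet(K + eps * np.eye(len(K)))
    return -logdet

def task_loss(scores, y):
    """Sketch of L_task: softmax over candidate scores, cross-entropy vs. target y."""
    z = scores - scores.max()                   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()             # distribution over answer choices
    return -np.log(p[y])

X = rng.standard_normal((4, 8))                 # embeddings xA, xB, xC, xD (d = 8 assumed)
g = rng.uniform(size=8)                         # attention gates (random stand-in)
scores = rng.standard_normal(4)                 # inference module R output, one per choice
loss = dpp_volume_loss(X, g) + task_loss(scores, y=2)
```

In practice both terms would be differentiable (e.g., in PyTorch) so that the gates g and the inference module R can be optimized jointly by gradient descent.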