Abstract
Many species are able to recognize objects, but it has proven difficult to pinpoint and compare how different species solve this task. Recent research suggested combining computational and animal modelling in order to obtain a more systematic understanding of task complexity and to compare strategies between species. In the present study, we created a large multidimensional stimulus set and designed a visual categorization task partially based upon modelling with a convolutional deep neural network (cDNN). Experiments included rats (N = 11; 1115 daily sessions in total for all rats together) and humans (N = 50). Each species was able to master the task and generalize to a variety of new images. Nevertheless, rats and humans showed very little convergence in terms of which object pairs were associated with high and low performance, suggesting the use of different strategies. There was an interaction between species and whether stimulus pairs favoured early or late processing in a cDNN. A direct comparison with cDNN representations revealed that rat performance was best captured by late convolutional layers while human performance related more to the higher-up fully connected layers. These findings highlight the additional value of using a computational approach for the design of object recognition tasks. Overall, this computationally informed investigation of object recognition behaviour reveals a strong discrepancy in strategies between rodent and human vision.
1 Introduction
Humans have (almost) no difficulty with invariant object recognition, the ability to recognize the same objects from different viewpoints or in different scenes (DiCarlo et al., 2012; Zoccolan, 2015). This ability is supported by the ventral visual stream, the so-called what stream (Logothetis & Sheinberg, 1996). A question that is repeatedly addressed in vision studies is whether and how we can model this stream by means of animal models or computational models to further examine and quantify the representations along the ventral visual stream. Computationally, researchers have recently modelled this stream by using convolutional deep neural networks (cDNNs), as for example done by Avberšek and colleagues (2021), Cadieu and colleagues (2014), Duyck and colleagues (2021), Güçlü & van Gerven (2015), Kalfas and colleagues (2018), Kar and colleagues (2019), Kubilius and colleagues (2016), Pospisil and colleagues (2018) and Vinken & Op de Beeck (2021). Lately, the animal model of choice for vision studies has become the rodent model, motivated by the applicability of molecular and genetic tools rather than by the visual capabilities of rodents. Past studies have examined behavioural (Alemi-Neissi et al., 2013; De Keyser et al., 2015; Djurdjevic et al., 2018; Schnell et al., 2019; Tafazoli et al., 2012; Vermaercke & Op de Beeck, 2012; Vinken et al., 2014; Zoccolan, 2015) (for a review see Zoccolan, 2015) as well as neural (Matteucci et al., 2019; Tafazoli et al., 2017; Vermaercke et al., 2014; Vinken et al., 2016) data of rodents (rats and mice) performing visual pattern recognition tasks. The behavioural findings showed that rats are capable of learning complex visual discrimination tasks. Here we integrate computational and animal modelling approaches, by using data about information processing in artificial neural networks when designing the animal experiments.
One aspect that almost all rodent studies have in common is that the exact task and stimuli are chosen based on what we know from human and monkey studies. Earlier research showed that the intuition of researchers about the complexity of visual tasks can be misleading (Vinken & Op de Beeck, 2021). Through cDNN modelling of the tasks from previous studies, Vinken & Op de Beeck (2021) showed that behavioural strategies that seem complex at first sight might be best modelled through relatively early levels of processing in cDNNs. They recommended that future studies obtain more direct information about the complexity of visual tasks and behavioural strategies by incorporating neural network models in the design phase of the experiment. One way of implementing this is to train rodents in a challenging, multidimensional visual task and use cDNNs to select stimulus examples targeting strategies with different levels of complexity.
In the present study, we implemented this approach and created a large stimulus set that can be used for a variety of visual experiments. We decided to create the stimuli in a way that they are adaptable to different types of tasks, such as a “simple” categorization task or non-linear tasks (e.g. Bossens & Op de Beeck, 2016). We then took a subset of these stimuli and performed a visual categorization experiment in rats (see Figure 1 for the design). The task itself was defined in a stimulus space with two dimensions, here referred to as concavity and alignment, and further complicated by transforming the stimuli along several dimensions that preserve the identity of the object. Once we trained the animals in a base stimulus pair, we used the identity-preserving transformations to test for generalization. After a number of transformation phases, we selected a final stimulus set by choosing a combination of transformations based on the outcomes of a trained cDNN. Using the neural network as a (basic) model for the different stages of ventral visual stream processing, we chose stimulus pairs that require either higher or lower levels of processing and thus allow us to maximally differentiate between the task strategies used by the animals. As a final part of the current study, we performed an online human experiment with the same stimuli and design as the experiment for the rats, providing us with a rich three-way comparison of rat behavioural data with human behavioural data and with cDNN data.
2 Results
In this study, we trained and tested 11 rats and 45 humans on a complex two-dimensional categorization task (see Figure 1 for the design of the rat study, and Supplemental Figure 6 for the design of the human study). Rats and humans were first trained in a base pair. Next we tested their ability to generalize across several image transformations. In the last two protocols of the design, we used a computational approach to select stimuli that require different visual strategies.
2.1 Animal study
Training
We first checked the variation in performance across phases and stimulus pairs during training. In the first Training Phase, animals were trained in the base stimulus pair (maximally different target and distractor in the 4×4 stimulus grid). This training was successful for all twelve animals and lasted on average for 8.62 sessions (SD = 1.61). Animals were trained until they reached 80% performance for two consecutive sessions.
Once the animals were successfully trained, we examined whether they used both dimensions (concavity and alignment) by presenting them with two additional stimulus pairs in which the target and distractor differ in only one dimension (see Figure 1, Dimension learning). Performance on the old pair was similar to training performance (85.83%). The animals performed well on the stimuli that differ only along the concavity dimension (78.79%), although performance was significantly lower than on the base pair (paired t-test on rat performance, t(11) = 3.77, p = 0.003). Performance dropped to 67.83% for the alignment-only pair, yet it remained significantly higher than chance level (one-sample t-test, p < .0001). Overall, the Dimension learning protocol provides evidence that the animals picked up each of the two dimensions. This finding already excludes trivial explanations in terms of simple visual dimensions. For example, while concavity is correlated with horizontal size (distractor wider) and with overall brightness (distractor brighter, thus the opposite relevance as in the shaping phase), these simple dimensions cannot explain above-chance performance on the alignment dimension.
The third training protocol consisted of a number of small transformations, as visualized in Figure 1 (Transformations). The performance of the animals was not affected by these small transformations, with an average performance of 83.05% (see Figure 2). The pairwise percentage matrix in Figure 2 shows that the distractor with the Size transformation (rightmost column in the matrix) affected rat performance the most.
The variation across targets and distractors can be due to a variety of factors, including simple dimensions such as brightness. In the base pair, the distractor is brighter than the target. While this is the opposite of the shaping task of detecting a shape versus a black screen, visual inspection of Figure 2 suggests that the animals perform worse on trials in which the distractor display is not much brighter (e.g., when it is small). To quantify this effect of brightness, we calculated the correlation between the performances in the matrix and the difference in pixel values (and thus brightness) of the stimulus pairs. This resulted in a (Pearson) correlation of −0.59 (p < 0.01), suggesting that there is indeed an effect of brightness. Yet, brightness is at best a partial explanation, because all percentages in the matrix are above chance, with the lowest percentage being 68.83%, even though in some pairs the difference in pixel values is abolished or even reversed relative to the base pair.
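For transparency, this brightness analysis amounts to a simple correlation across stimulus pairs. The sketch below is a minimal illustration, assuming the pairwise performance matrix and the matrix of target-minus-distractor mean pixel values are available as arrays; the placeholder values and array shapes here are purely illustrative:

```python
import numpy as np
from scipy.stats import pearsonr

# Placeholder arrays; in the real analysis these would hold the pairwise
# percent-correct matrix from Figure 2 and the target-minus-distractor mean
# pixel value (brightness difference) for the same stimulus pairs.
performance = np.random.uniform(65, 95, size=(6, 6))       # hypothetical shape
brightness_diff = np.random.uniform(-30, 30, size=(6, 6))  # hypothetical shape

r, p = pearsonr(performance.ravel(), brightness_diff.ravel())
print(f"Pearson r = {r:.2f}, p = {p:.3g}")
```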
Overall the findings from the training phase and the above-chance performance on a variety of dimensions and transformations suggest that the rats have mastered a pattern classification task with a level of complexity that might be competitive with other tasks in the rodent literature.
Testing across transformations
The six protocols that test generalization to various transformations with new, untrained images are associated with performances below 80% but significantly above chance level (binomial tests; see Supplemental Table 5, lower table, for detailed results). The pairwise percentage matrices of the animals in Figure 3 provide a more detailed view of what is happening in every test. In some tests, the distractor has a higher impact on performance than the target. Supplemental Table 6 shows the marginal means and standard deviations for each target and distractor in the Rotation X and Rotation Z protocols. From these means it is clear that there is higher variation in performance between distractors in Rotation X (52%-65%) and Rotation Z (56%-73%) than between targets (55%-60% and 60%-66%, respectively). The same holds for the Size test protocol.
After these first six test protocols, the animals were presented with a protocol in which all three rotations were combined (see Figure 1). On the new stimuli, the animals performed at 58.56% correct, which is rather low, but still significantly different from chance level (binomial test on pooled performance of all animals: p < 0.0001; 95% CI [0.57; 0.60]).
Testing computational levels of complexity
For the final two test protocols, we used a cDNN to find image pairs that would contrast strategies based upon a different stage in visual processing, with either early layers having lower performance than high layers (Zero vs. high), or early layers having better performance than high layers (High vs. zero). Rat performance was particularly low for Zero vs. high (56.47%), yet still significantly different from chance level (binomial test on pooled performance of all animals; p < 0.0001; 95% CI [0.55;0.58]). In contrast, rats were able to solve the High vs. zero pairs not only better than chance (average: 64.84%; binomial test on pooled performance of all animals; p < 0.0001; 95% CI [0.63;0.66]), but also significantly better than Zero vs. high (paired t-test on rat performance, t(10) = −4.49, p = 0.0012). This suggests that rats align with lower levels of processing when we purposely select image pairs that are optimized to contrast different levels of the visual processing hierarchy.
Next we checked how well individual cDNN layers can predict the variation in behavioural performance across image pairs when we take all test protocols together. We calculated the correlation across image pairs between the generalization of the cDNN classifier (summarized in Figure 7) and the rat performance for all nine test protocols, comprising a total of 287 image pairs. We did this by concatenating the performances of the animals into one array and the Classification Scores of the network into another array, and calculating the correlation between these two arrays for each network layer. The results are displayed in Figure 4. Overall, we see quite low correlations, but several convolutional layers nevertheless show a significant positive correlation (permutation test) with the behavioural pattern of performance at the image pair level.
Even though some of the correlations are significant, they are low. This could indicate that no cDNN layer is able to capture what rats do. Alternatively, it could be caused by a very low reliability of the behavioural data. To test the reliability of the variation in behavioural performance between stimulus pairs in all nine test protocols, we calculated the split-half reliability, as previously done in Schnell et al. (2023), resulting in a correlation of 0.40. Applying the Spearman-Brown correction yields a full-set reliability of 0.58. This correlation is much higher than the correlations with individual cDNN layers.
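As a rough sketch of how such an estimate can be obtained (the exact splitting procedure of Schnell et al., 2023 may differ in detail), one can repeatedly split the trials of each image pair into two halves, correlate the per-pair performance of the two halves, and apply the Spearman-Brown prophecy formula to the average split-half correlation; note that 2 × 0.40 / (1 + 0.40) ≈ 0.58, consistent with the value reported above:

```python
import numpy as np

def split_half_reliability(trials_by_pair, n_splits=1000, seed=0):
    """Split-half reliability of per-pair performance.

    trials_by_pair: list of 1-D numpy arrays, one per image pair, each holding
    the binary trial outcomes (1 = correct, 0 = incorrect) for that pair.
    Returns the average split-half correlation and its Spearman-Brown correction.
    """
    rng = np.random.default_rng(seed)
    corrs = []
    for _ in range(n_splits):
        half1, half2 = [], []
        for trials in trials_by_pair:
            perm = rng.permutation(len(trials))
            mid = len(trials) // 2
            half1.append(trials[perm[:mid]].mean())
            half2.append(trials[perm[mid:]].mean())
        corrs.append(np.corrcoef(half1, half2)[0, 1])
    r_half = float(np.mean(corrs))
    r_full = 2 * r_half / (1 + r_half)   # Spearman-Brown prophecy formula
    return r_half, r_full
```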
It is possible that rat performance is based upon multiple levels of processing, in which case we would need a combination of layers to explain the variation in performance across stimulus pairs. Given the low correlation between neighbouring layers (Supplemental Table 7), a multiple linear regression was calculated with the Classification Scores of the 13 layers as 13 regressors and the rat performances as response vector. The results of this regression indicate a significant effect of the Classification Scores (F(287,273) = 2.22, p = 0.00907, R2 = 0.10). Further investigation of the 13 predictors showed that the later convolutional layers 8, 9 and 10 of the network were significant predictors in the regression model (see Supplemental Table 8 for results of the regression model). The R2 of 0.10 for the full model would correspond to a correlation of around 0.32. This is better than the correlation of single layers, but still clearly smaller than the reliability of the rat data of 0.58. In conclusion, the cDNN model provides a partial explanation of how the performance of rats varies across image pairs.
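A minimal sketch of such a regression, using Python/statsmodels for illustration (the original analysis was not necessarily run this way), with a hypothetical 287 × 13 matrix of per-pair Classification Scores and a vector of per-pair rat performance as placeholders:

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data: 287 image pairs x 13 cDNN layers of Classification Scores (X)
# and the corresponding rat percent correct per pair (y). Replace with real data.
X = np.random.randn(287, 13)
y = np.random.rand(287) * 100

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared, model.fvalue, model.f_pvalue)  # R^2, F and p of the full model
print(model.pvalues[1:])                             # p-value per layer predictor
```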
2.2 Human study
A final part of this study was an online human study that follows the same design as the animal part. Figure 5 shows the average performance of humans (dark blue) versus rats (light blue) for all nine test protocols, as well as their performance on the old stimuli that were added during the testing protocols as quality control. Overall, humans performed better than rats on all test protocols, with an average performance over all tests of 94.34% (humans) versus 62.29% (rats). There was already a difference in training performance (humans: 92.86% vs. rats: 77.84%), but the difference on the test protocols is larger. We subtracted the training performance of each species from its testing performance, and even with this normalization for training performance there is still a significantly higher test performance in humans compared to rats (t(16) = −6.47, p < 0.0001). Thus, not surprisingly, the degree of invariance in this object classification task is higher for humans than for rats.
The variation in performance across test protocols and across image pairs can give an indication of the strategies that each species follows. Overall, humans and rats show a mild correspondence in terms of which image pairs are more difficult, with a human-rat correlation of 0.18 across all image pairs of the nine test protocols (p < 0.001 with permutation test). Albeit significant, this correlation is clearly lower than the maximum value that could be obtained given the reliability of the data. The split-half reliability of the human data was 0.46, corresponding to a full-set reliability of 0.63. We reported above that full-set reliability is 0.58 for the rat data, resulting in a combined reliability of 0.60 (calculated as described in Op de Beeck et al., 2008). Thus, after taking data reliability into account there remains a pronounced discrepancy between rats and humans in terms of how performance varies across image pairs.
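For concreteness, the combined reliability reported above is consistent with taking the geometric mean of the two full-set reliabilities, which is how we read the procedure of Op de Beeck et al. (2008):

$$
r_{\text{combined}} = \sqrt{r_{\text{rat}} \times r_{\text{human}}} = \sqrt{0.58 \times 0.63} \approx 0.60
$$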
The main question of the present study is how this discrepancy relates to computationally informed strategies. If we look specifically at the two cDNN-informed test protocols (Zero vs. high and High vs. zero), we see opposite behaviour between animals and humans. Humans performed significantly better in the Zero vs. high protocol, i.e. with stimuli for which the earlier layers of the network perform worse than the higher layers, than in the High vs. zero protocol (paired t-test: t(44) = 2.85, p = 0.0067). Rats, however, show the opposite (see above for statistics). There is even a significant interaction between species and test protocol (unpaired t-test: t(54) = 2.50, p = 0.016). This suggests a different strategy between animals and humans: rats use strategies that are captured in the lower layers of the network, and thus correspond more to low-level visual processing, whereas humans tend to rely more on strategies captured by the higher layers of the network, and thus on more high-level visual processing.
As a next step, we calculated the correlation across image pairs between the generalization of the cDNN classifier and the human performance for all nine test protocols, in an identical manner as for the rat performance (Figure 4). The results are displayed in Figure 6. Overall, we see quite high correlations, especially in the higher layers. This pattern across layers is very different from the pattern in rats, where the highest layers showed no correlations, which again suggests that, despite successful generalization, rats rely on decisively lower-level strategies than humans in the same categorization task.
A multiple linear regression was calculated in an identical manner as for the rat performance. The results of this regression indicate a significant effect of the Classification Scores (F(287,273) = 6.8, p < 0.0001, R2 = 0.25). Further investigation of the 13 predictors showed that in particular the fully connected layers 11, 12 and 13 of the network were strong predictors in the regression model (see Supplemental Table 9 for results of the regression model).
3 Discussion
In the current study, we trained and tested rats and humans in a categorization task using two-dimensional stimuli, with the two dimensions being concavity and alignment. We tested generalization across a range of viewing conditions. For the last two testing protocols, we used a computational approach to select stimuli that specifically dissociate low and high stages of processing. Rats were able to learn both dimensions (concavity and alignment) and showed a preference for concavity. Their performance on the testing protocols revealed a wide range of accuracies: for some test protocols they performed just above chance level, e.g. Zero vs. high, whereas for others they easily reached about 70% correct (Position). Humans, on the other hand, performed better overall, with performances of 80% or higher on the testing protocols. Addressing the question of the complexity of the underlying strategies, rats performed best on the test protocol designed to specifically target lower levels of processing, whereas humans performed best on the high-level processing protocol. Likewise, direct comparisons with artificial neural network layers showed that the variation of rat performance across images was best explained by late convolutional layers, whereas human performance was most associated with representations in fully connected layers.
All animals started by being trained in three training protocols. The first Training protocol included only one image pair, the base pair, containing the most different target and distractor without any further transformations. Learning of the individual dimensions of concavity and alignment was investigated through the Dimension learning protocol. The results from this protocol indicate that our rats had more difficulty learning the alignment dimension than the concavity dimension. One possible explanation for the superior performance on the concavity dimension could be that the animals partially solved the task by treating the brighter stimulus, i.e. the convex base shape, as the distractor, with a strategy of picking the stimulus with the lowest brightness. This was confirmed by analyses of the third training protocol (Transformations), which included small transformations along various dimensions. Nevertheless, the rats still performed above chance level on trials in which the brightness differences were reversed, indicating that other dimensions are involved and overrule a contribution from brightness. Similar findings have been obtained in human behaviour and neuroscience. For example, despite the clear category selectivity in regions such as the fusiform face area, the selectivity in these regions is also modulated very strongly by various low-level dimensions (Yue et al., 2011). With regard to the size and position transformations it is important to keep in mind that the animals were freely moving in the touchscreen chambers, so even the original base pair already underwent changes in retinal size and retinal position. What we manipulate is rather the size and position relative to the rest of the set-up (e.g., relative to screen position and size).
After these three training protocols, the animals were tested for generalization in a variety of testing protocols, each testing a separate transformation of the stimuli. The first six test protocols included rotation along each of the three axes, size, position and light location, followed by a test protocol in which we combined the rotations along the three axes. Overall, we found that the performance of the animals on these test protocols is affected by these transformations, but still significantly above chance in each protocol. Studies in the literature would often stop here, or proceed by systematically testing even larger transformations. Stimulus choices are based upon intuitions of what strategy animals might be using, and upon theories of how visual perception works. However, in some cases, further computational modelling of the task and stimuli shows that what intuitively seems like a task of a particular complexity might not be so complex after all. The first tests of invariant object recognition seemed impressive, but were found to be easily solved with earlier layers of processing (Minini & Jeffery, 2006; Vinken & Op de Beeck, 2021). This was recently also highlighted by relatively simple pixel-based analyses (Kell et al., 2020). As another example, Vinken & Op de Beeck (2021) used a computational approach to further investigate the levels of information processing in rodents by comparing three hallmark studies that provided evidence for higher order visual processing in rodents (Djurdjevic et al., 2018; Vinken et al., 2014; Zoccolan et al., 2009) with cDNNs. They found that for all three studies, the low and mid-level layers captured the rat performances best, thus providing evidence against the previously concluded high-level visual processing in rodents.
For these reasons, we decided to directly test image pairs through computational modelling with cDNNs and to select pairs that are particularly suited to dissociating different levels of processing. Stimuli were chosen by a cDNN from a very large set of possible stimuli and combinations, such that the higher layers and the lower layers of the network make distinct errors when classifying the stimuli (Zero vs. high and High vs. zero protocols), and are thus diagnostic of the level of the underlying visual strategies. The stimuli of the Zero vs. high protocol were stimuli for which the higher layers of the network performed better than the lower layers, and thus they address higher-level visual processing. The opposite holds for the High vs. zero protocol, which includes stimuli that specifically target lower-level visual processing, given that the lower layers of the network perform best on these stimuli. After presenting these stimuli to the animals, we found that our rats performed best in the High vs. zero protocol, suggesting that they focus on low-level visual cues to solve this categorization task. We found the opposite cDNN pattern for humans, indicating that they use high-level visual processing. These findings provide more direct information about the level of processing that underlies the behavioural strategies than overall performance or effects of image manipulations. This is a promising new way to design experiments in a manner that is computationally informed rather than based on researcher intuitions or qualitative predictions.
Partially thanks to these computationally inspired tests, our total dataset reveals a marked dissociation between how humans and rats solve this object recognition task. Our analyses show this most convincingly by correlating the variation in performance across image pairs with the predictions of cDNN layers. There were significant correlations with multiple layers in both species. In humans, the most pronounced correlations were present for the highest, fully connected layers, while in rats correlations were limited to low and middle convolutional layers. This is the most direct evidence available in the literature that rats resolve object recognition tasks through a very different and computationally simpler strategy compared to humans. The cDNN approach does not inform us how we can verbalize this simpler strategy, but based upon earlier work (Schnell et al., 2023; Vermaercke & Op de Beeck, 2012) we would hypothesize that rats rely upon visual contrast features (e.g., this area is darker/lighter than that other area). Such contrast features are also used by humans and monkeys, e.g. for face detection (Ohayon et al., 2012; Sinha, 2002), but in addition humans have access to more complex strategies that refer to complex shape features such as aspect ratio and symmetry (Bossens & Op de Beeck, 2016).
For future studies, it will be highly valuable to apply this computationally informed strategy to a wider battery of behavioural tasks, as well as to a wider range of species such as tree shrews and marmosets (Callahan & Petry, 2000; Kell et al., 2020, 2021; Meyer et al., 2022; Petry et al., 2012; Petry & Bickford, 2019). Going one step further, the information from computational modelling, together with how behaviour differs among stimuli, can be used to select stimuli for neurophysiological investigations of neuronal response properties along the visual information processing hierarchy, in this way following experimental designs that are optimized for highlighting the primary differences between processing stages and between species.
4 Methods
4.1 Animal study
4.1.1 Animals
A total of twelve male outbred Long Evans rats (Janvier Labs, Le Genest-Saint-Isle, France) started this behavioural study. Out of these twelve animals, two were tested extensively in a first pilot study and were included in the remainder of the study as well. All animals were 11 weeks old at the start of shaping and were housed in groups of four per cage. Each cage was enriched with a plastic toy (Bio-Serv, Flemington, NJ), paper cage enrichment and wooden blocks. Near the end of the experiment, one animal had to be excluded because of health issues. During training and testing, the animals were food-restricted to maintain a body weight between 85% and 90% of their undeprived body weight. They received water ad libitum. All experiments and procedures involving living animals were approved by the Ethical Committee of the University of Leuven and were in accordance with the European Commission Directive of September 22, 2010 (2010/63/EU).
4.1.2 Setup
The setup is identical to the one used by Schnell and colleagues (2019) and Schnell and colleagues (2023); a short description follows here. The animals were trained and tested in four automated touchscreen rat-testing chambers (Campden Instruments, Ltd., Leicester, UK) with ABET II controller software (v2.18, WhiskerServer v4.5.0). The animals performed one session per day and each session lasted for 100 trials or 60 minutes, whichever came first. A reward tray in which sugar pellets (45-mg sucrose pellets, TestDiet, St. Louis, MO) could be delivered was installed on one side of the chamber. On the other side of the chamber, an infrared touchscreen monitor was installed. This monitor was covered with a black Perspex mask containing two square response windows (10.0 x 10.0 cm). A shelf (5.4 cm wide) was installed onto this black mask (16.5 cm above the floor) to force the animals to attend to the stimuli and to view the stimuli within their central visual fields. Because the touchscreen was infrared-based, close proximity to the screen was enough to register a response.
4.1.3 Stimuli
Stimuli were created using the Python scripting implementation of the 3D modelling software Blender 3D (version 2.93.3). In general, the stimuli were objects that consisted of a body (base) with three spheres attached to it. A first step was to alter two dimensions of the object, namely the concavity of the base and the alignment of the three spheres. The base was made either concave or convex by increasing (convex) or decreasing (concave) the base parameter. The alignment of the spheres was altered by changing the placement of the left and the right spheres. These spheres could either be horizontally aligned or misaligned. In the misaligned case, the spheres were placed diagonally from upper left to lower right. Supplemental Figure 1a shows two example stimuli, the ones that later were selected as the so-called “base pair”. Next, additional exemplars were created by uniformly tiling the two-dimensional stimulus space between these two example stimuli. We decided to create eleven levels of the concavity dimension and four levels of alignment. This already yields 44 stimuli (see Supplemental Figure 2). We chose these levels of concavity and alignment based on the pixel dissimilarity of the stimuli (see Supplemental Figure 3). The final goal was to construct a 4×4 stimulus grid by selecting a subset of the 4×11 stimulus grid. We chose a large number of concavity levels, as this ensures flexibility in the calibration of the two dimensions relative to each other.
We added identity-preserving transformations to the stimuli, such as rotation around the x-axis, y-axis and z-axis at seven different angles (0° to 180° in steps of 30°), as well as changing the light location (left, under, up, right, front) and finally the size and position. The latter two transformations were implemented using Python (3.7.3). Excluding the size and position transformations, these transformations resulted in a total set of 75460 stimuli (4 (alignment) * 11 (concavity) * 7 (x-axis rotation) * 7 (y-axis rotation) * 7 (z-axis rotation) * 5 (light location) = 75460 stimuli). Supplemental Figure 3 shows examples of these transformations.
4.1.4 Protocols
Once the pilot was finished (see supplementary material for details), we set up the experiment and chose our stimuli. We started by reducing the 4×11 stimulus grid to a 4×4 stimulus grid (see Supplemental Figure 1b). All stimuli on the diagonal can be seen as ambiguous (four stimuli in total), as they can be identified as a target as well as a distractor. The six stimuli above this diagonal form the target part of the grid, and the six stimuli below this diagonal form the distractor sub-grid.
The different phases of the experiment are shown in Figure 1. In the main Training phase, we trained the animals on the maximally different stimuli, which are placed at opposite corners of the grid (Supplemental Figure 1a). We refer to this as the base pair. After this Training phase, the experiment consisted of two further training protocols. In the Dimension learning phase, we pushed the animals to learn both dimensions (concavity and alignment) by presenting them with two additional stimulus pairs from Supplemental Figure 1b in which the target and distractor differ in only one dimension. A third training protocol (Transformations) consisted of stimuli with some small transformations, such as 30° rotation along the x-axis, 30° rotation along the y-axis, 30° rotation along the z-axis, light location below, and size reduction of 80%, resulting in a total of 25 possible stimulus pairs (every combination of target and distractor among the 5 transformed stimuli). During these two training protocols, one third of the trials were so-called “old trials” with the base pair. Correction trials were given if an animal answered incorrectly, i.e. the same trial was repeated until the animal answered correctly. These correction trials were excluded from the analyses. In all trials, rats received a reward for touching the correct screen, i.e. the screen with the target.
After these three training protocols, the testing part of the experiment included nine test protocols. The crucial difference between these test protocols and the prior training protocols is that rats received a reward randomly in 80% of the trials with new stimulus pairs, and no correction trials were given for an incorrect response. This random reward is important to keep the animals motivated during the testing protocols and to measure real generalization rather than trained behaviour. We have used a similar approach in the past, where we rewarded the animals in every testing trial (Schnell et al., 2019; Vinken et al., 2014). One third of the trials in all test protocols consisted of old trials with the base pair, and here the animals received a reward for touching the target and correction trials were shown if necessary. We regularly inserted a Dimension learning session in between two test sessions to keep performance on the training stimuli high, especially for the animals in which we saw a drop in performance on the base pair. We excluded any test sessions in which performance on the base pair stimuli dropped below 65%.
The first six test protocols included one protocol for each transformation, i.e. Rotation X, Rotation Y, Rotation Z, Light Location, Size and Position. The order in which these first six test protocols were given to the animals was counterbalanced between the animals. The stimuli that were used in these six test protocols are shown in Figure 1, and every combination of target and distractor per test protocol was presented to the animals. For the rotation protocols, we used rotation angles in steps of 30°, ranging from 30° to 180°. This resulted in 36 possible stimulus pairs for each of the three rotation protocols. In the Light Location protocol, we used stimuli where the light location was set at four different positions (below, left, right and up), resulting in 16 possible stimulus pairs for this protocol. In the Size protocol, we selected targets and distractors that were reduced in size by 80% and 60% compared to the original training pair. This protocol included 4 possible stimulus pairs. Finally, in the Position protocol, we changed the position of the 80%-reduced stimuli and placed the objects in the lower left corner, lower right corner, centre, upper left corner and upper right corner, giving a total of 25 possible stimulus pairs for this protocol.
After these six test protocols, we presented the animals with six targets and six distractors where all three rotations were combined (Combination rotation), i.e. x-, y- and z-axis were rotated with the same degree (ranging from 30° to 180°, in steps of 30°). This resulted in a total of 36 new stimulus pairs. Again, no correction trials were included after the trials where rotated stimuli were shown and animals received random reward in 80% of the trials. One third of the trials consisted of the stimulus pair from the first Training phase (i.e. the base pair), and here, correction trials were given after an incorrect response and real reward was given to the animals.
In a final set of two test protocols, we created a cDNN-informed stimulus set. The details of the computational modelling are explained in the next section. The first protocol (Zero vs. high) included stimuli in which the lower layers of the network performed around chance level (i.e. target-distractor difference in Classification Scores (difference in signed distance to hyperplane) of about 0), whereas the higher layers scored high (see section 4.2). The second protocol (High vs. zero) included stimuli where the network did the opposite. That is, the earlier layers performed well whereas the higher layers performed around chance level. The order of the two test protocols was counterbalanced between the animals. Each of these test protocols included 7 targets and 7 distractors, giving a total of 49 new stimulus pairs.
Animals stayed in each session for 60 minutes or until they reached 100 training trials or 120 testing trials. We used an intertrial interval (ITI) of 20s and a time-out of 5s during training sessions. From another pilot study in the lab, we noticed we could decrease the ITI and time-out without affecting the rats’ performance. Therefore, we decided to use an ITI of 15s and time-out of 3s during testing, and to increase the number of trials during a testing session to 120 trials.
Each protocol was run for multiple sessions per animal. Given that we were interested in how performance would vary across stimulus pairs, we completed more sessions for the protocols that included more stimulus pairs.
Supplemental Table 1 indicates the average number of trials per test protocol for all rats together.
One animal was not placed in the Transformations phase as it was the slowest animal during training. However, its performance on the test protocols did not significantly differ from the other animals. We tested this by calculating the correlation of the variation of performance across stimulus pairs for each rat with the pooled responses of all other rats. The average correlation for each of the other animals with the pooled response was 0.24 (±0.09), and the correlation of this slowest animal with the others was very similar, 0.23.
4.2 Computational modelling
One important goal of this study was to create a cDNN-informed stimulus set to present to the animals. To do so, we followed the steps of Schnell and colleagues (2023) and Vinken & Op de Beeck (2021) to train a cDNN on the same stimuli on which our animals were trained. The steps of training the network are identical to Schnell and colleagues (2023), and a short description follows here. We used the standard AlexNet cDNN architecture that was pre-trained on ImageNet to classify images into 1000 object categories (MATLAB 2021b Deep Learning Toolbox). Following Vinken & Op de Beeck (2021), we computed the activations in every layer, standardized the values across inputs, and applied principal component analysis to reduce the dimensionality. We then trained a linear support vector machine classifier by using the MATLAB function fitclinear, with limited-memory BFGS solver and default regularization. We did this with the standardized DNN layer activations in the principal component space, taken before ReLU, as inputs, for our 24 training stimuli (see Figure 1), i.e. all stimuli of the Training, Dimension learning and Transformations protocols. The layers of AlexNet were divided into 13 sublayers, as in Schnell and colleagues (2023) and Vinken & Op de Beeck (2021).
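As an illustration of this readout pipeline, the sketch below re-expresses it in Python with scikit-learn; the actual analysis used MATLAB's fitclinear with an L-BFGS solver, so solver and regularization details differ, and the function and variable names here are our own:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def fit_layer_classifier(layer_activations, labels):
    """Fit a linear readout on one AlexNet sublayer.

    layer_activations: (n_stimuli, n_units) activations (before ReLU) of one
    sublayer to the 24 training stimuli; labels: 1 = target, 0 = distractor.
    """
    scaler = StandardScaler().fit(layer_activations)       # standardize across inputs
    pca = PCA().fit(scaler.transform(layer_activations))   # reduce dimensionality
    X = pca.transform(scaler.transform(layer_activations))
    clf = LinearSVC(C=1.0).fit(X, labels)                  # linear SVM readout in PC space
    return scaler, pca, clf
```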
Figure 7 shows the performance of the network, after training on our training stimuli, for all test protocols. We added noise to the inputs of the network such that the average training performance, averaged over 100 iterations, lies around 75%. By adding noise in this way, the performance on the training pairs matches overall with rat performance on those pairs; otherwise the performance of the network would be at 100% on the training pairs and this would complicate comparisons with the animal data (see also Vinken & Op de Beeck, 2021). Note that the results for the Size test are unreliable given the low number of stimulus pairs in that test. The performance of the network on the tests (green line in Figure 7) differs among the tests and across layers, but typically the network had no problem achieving a performance of about 85% in all test protocols in at least some layers. The change in performance across layers is variable across test protocols.
To examine the performance of the model for specific image pairs during training and testing in more detail than is possible with a binary categorization decision, we calculated the distance to the classifier's hyperplane (decision boundary) of the targets and distractors. We did this by computing the difference in signed distance to the hyperplane between target and distractor (target – distractor). This is referred to as the Classification Score. For each stimulus pair in the test protocols we computed this Classification Score, and we have such a score per layer.
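In our notation (with $\mathbf{w}$ and $b$ the weights and bias of a given layer's linear classifier; these symbols are introduced here for illustration), the Classification Score of a target–distractor pair is the difference of the two signed distances to the decision hyperplane:

$$
\mathrm{CS} = d(\mathbf{x}_{\text{target}}) - d(\mathbf{x}_{\text{distractor}}),
\qquad
d(\mathbf{x}) = \frac{\mathbf{w}^{\top}\mathbf{x} + b}{\lVert \mathbf{w} \rVert},
$$

so positive scores indicate that the layer separates the pair in the expected direction, and scores near zero indicate a pair that the layer cannot distinguish.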
We used this Classification Score to select image pairs for a cDNN-informed stimulus set. To do so, we randomly chose one target and one distractor from a subset of the pool of all 4×4 stimuli, including all possible transformations of these stimuli. This resulted in a stimulus pool of 10,290 stimuli (5,145 targets, 5,145 distractors) to randomly choose two from, and 5,145 × 5,145 (26,471,025) possible pairs. Once one random target and one random distractor were chosen, the DNN was tested in a similar manner as for the six test protocols. We performed a total of 10,000 iterations of randomly choosing a target and distractor pair. For each iteration, we calculated the average Classification Score of layers 1-3 and of layers 11-13, as we wanted to compare these two levels of processing (earlier layers vs higher layers). After these 10,000 iterations, we finetuned and filtered the results according to the profile of performance across earlier and higher layers (see Supplemental Table 2). This finetuning started by calculating the distribution and standard deviation for two profiles of interest, i.e. (i) where early layers show an average Classification Score close to zero but higher layers show high Classification Scores (Zero vs High), and (ii) where early layers show high Classification Scores but higher layers show Classification Scores close to zero (High vs Zero). The performance was expressed relative to the distribution of values across all pairs, summarized by the standard deviation of the average target-distractor difference in Classification Scores of the early layers and the higher layers. We found a total of 48 stimulus pairs meeting these two criteria, and we ended up choosing 14 pairs, 7 per criterion, that we used for the final part of the animal and human study (see lower two rows in Figure 1).
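The following sketch outlines this sampling-and-filtering procedure; the score look-up and the two cut-offs are placeholders of our own (the actual selection criteria are listed in Supplemental Table 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n_targets = n_distractors = 5145   # size of the candidate pools

def pair_score(layer, t, d):
    # Stand-in for the precomputed Classification Score of layer `layer`
    # for target t vs distractor d; random here, real values in the analysis.
    return rng.standard_normal()

candidates = []
for _ in range(10_000):
    t = int(rng.integers(n_targets))
    d = int(rng.integers(n_distractors))
    early = np.mean([pair_score(l, t, d) for l in range(0, 3)])    # layers 1-3
    late = np.mean([pair_score(l, t, d) for l in range(10, 13)])   # layers 11-13
    candidates.append((t, d, early, late))

early_sd = np.std([c[2] for c in candidates])
late_sd = np.std([c[3] for c in candidates])
# Illustrative cut-offs only; see Supplemental Table 2 for the criteria we used.
zero_vs_high = [c for c in candidates if abs(c[2]) < 0.1 * early_sd and c[3] > 2 * late_sd]
high_vs_zero = [c for c in candidates if c[2] > 2 * early_sd and abs(c[3]) < 0.1 * late_sd]
```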
Afterwards we also calculated the binary target-distractor cDNN decision performance for the image pairs in the Zero vs High and High vs Zero tests, which is shown in Figure 7 (bottom row). The image pairs in the Zero vs High protocol are more difficult than the other protocols, in particular for the first half of the cDNN layers. In contrast, the High vs Zero protocol is the only protocol associated with chance performance in the last three layers. These analyses confirm that the cDNN-based image pair selection resulted in protocols that are very different from protocols that zoom in on intuitively chosen transformations and their combinations.
Comparing the rat performances to the Classification Scores of the network was done by calculating the correlation across image pairs between these model scores and the rat performances averaged across animals. We concatenated the performance of the animals on all nine test protocols, as well as the distance to hyperplane of the network on all nine test protocols. Correlating these two arrays resulted in the correlations visualized in Figure 4. To test whether these correlations are significant, we performed a permutation test. We permuted these arrays 1000 times, resulting in a distribution of correlations from permuted data for each layer. We then calculated, per layer, how many of the permuted correlations were higher than or equal to the correlation presented in Figure 4, and divided this by the number of permutations.
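A compact sketch of this permutation test (applied one layer at a time), under the assumption that both inputs are already concatenated into 1-D arrays of length 287:

```python
import numpy as np

def permutation_pvalue(model_scores, rat_perf, n_perm=1000, seed=0):
    """Observed correlation and one-sided permutation p-value for one layer."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(model_scores, rat_perf)[0, 1]
    # Null distribution: correlations after shuffling the model scores.
    null = np.array([np.corrcoef(rng.permutation(model_scores), rat_perf)[0, 1]
                     for _ in range(n_perm)])
    p_value = float(np.mean(null >= observed))
    return observed, p_value
```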
4.3 Human study
4.3.1 Participants
Data was collected from 50 participants (average age 33.24 ± 12.23; 34 females) who participated in return for a gift voucher of 10 euro. All participants had normal or corrected-to-normal vision. The experiment was approved by the ethical commission of KU Leuven (G-2020-1902-R3) and each participant digitally signed an informed consent before the start of the experiment.
4.3.2 Setup
For the human part of this study, we developed an online experiment using PsychoPy3 (v2020.1.3, Python version 3.8.10) and placed it on the online platform Pavlovia. All participants received the link and their individual participant number by e-mail with which they could participate in the experiment on their own computer. It took 30-45 minutes to complete the online study.
4.3.3 Stimuli and protocols
We used the same stimuli as in the animal study. The human experiment went through the same phases as depicted in Figure 1, albeit with small changes. We dropped the one third of old trials in the test protocols and included two additional Dimension learning protocols in between the first counterbalanced tests as a quality check (see Supplemental Figure 6). Supplemental Table 3 provides an overview of the number of trials during the human experiment for each phase.
As in Bossens & Op de Beeck (2016), we presented the targets and distractors briefly to the left and right side of a white fixation cross on a grey background. Each stimulus was presented for three frames, followed by a mask (a noise image with a 1/f frequency spectrum) for three frames. We used this fast and eccentric stimulus presentation with a mask to make stimulus perception resemble that of rats more closely. Vermaercke & Op de Beeck (2012) found that human visual acuity in such fast and eccentric presentations is not significantly better than the reported visual acuity of rats. By using this approach we avoid that differences in strategies between humans and rats would be explained by such a difference in acuity. Participants answered using the ‘f’ and ‘j’ keys to indicate which position they thought was the correct one: ‘f’ if they thought the target was on the left side of the fixation cross, and ‘j’ if they thought it was on the right side. Participants received feedback during the shaping and the three training phases. This happened by colouring the fixation cross green if they answered correctly, and red if they answered incorrectly. Each trial was followed by an intertrial interval (ITI) of 0.5 s. During the Shaping and Training phases, we kept a running average of the past 20 (Shaping) and 40 (Training) trials, and participants continued to the next phase when they reached a performance of 80% or higher on these last 20 or 40 trials, similar to Bossens & Op de Beeck (2016). There was no time limit for participants to provide a response. The order of the first six test protocols (Rotation X, Rotation Y, Rotation Z, Size, Light Location and Position) was counterbalanced between participants based on the participant number, as was the order of the last two test protocols (Zero vs. high and High vs. zero), similar to the approach in the rat study. Supplemental Table 1 indicates the average number of trials per test protocol for all human participants together.
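The trial structure can be sketched in PsychoPy roughly as follows; the window settings, image file names and positions are placeholders rather than the settings of the actual Pavlovia experiment:

```python
from psychopy import visual, event

win = visual.Window(size=(1280, 720), color='grey', units='height')      # assumed settings
fixation = visual.TextStim(win, text='+', color='white')
target = visual.ImageStim(win, image='target.png', pos=(-0.3, 0))         # hypothetical files
distractor = visual.ImageStim(win, image='distractor.png', pos=(0.3, 0))
masks = [visual.ImageStim(win, image='noise_1f.png', pos=p) for p in [(-0.3, 0), (0.3, 0)]]

# Brief, eccentric presentation: stimuli for three frames, 1/f noise masks for
# three frames, then wait for the 'f' (left) or 'j' (right) response.
for _ in range(3):
    fixation.draw(); target.draw(); distractor.draw(); win.flip()
for _ in range(3):
    fixation.draw()
    for m in masks:
        m.draw()
    win.flip()
fixation.draw(); win.flip()
keys = event.waitKeys(keyList=['f', 'j'])
correct = keys[0] == 'f'   # in this example trial the target is on the left
```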
In terms of instructions, we explained to participants that they would see two figures appearing at the same time very quickly next to a fixation cross, and they would have to make a decision of which figure is the correct one. We mentioned that during training, the fixation cross would turn green if they answered correctly, and red if they answered incorrectly. Participants were informed that during testing, they would not get feedback (changing colour of the fixation cross) anymore and that they would have to use the knowledge they gained throughout training to make their decision in the testing.
Data availability
The data has been made publicly available via the Open Science Framework and can be accessed at https://osf.io/9eqyz/.
References
- Multifeatural Shape Processing in Rats Engaged in Invariant Visual Object Recognition. Journal of Neuroscience 33:5939–5956. https://doi.org/10.1523/JNEUROSCI.3629-12.2013
- Training for object recognition with increasing spatial frequency: A comparison of deep learning with human vision. Journal of Vision 21. https://doi.org/10.1167/jov.21.10.14
- Linear and Non-Linear Visual Feature Learning in Rat and Humans. Frontiers in Behavioral Neuroscience 10. https://doi.org/10.3389/fnbeh.2016.00235
- Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLOS Computational Biology 10. https://doi.org/10.1371/journal.pcbi.1003963
- Psychophysical measurement of temporal modulation sensitivity in the tree shrew (Tupaia belangeri). Vision Research 40:455–458. https://doi.org/10.1016/S0042-6989(99)00194-7
- Cue-invariant shape recognition in rats as tested with second-order contours. Journal of Vision 15. https://doi.org/10.1167/15.15.14
- How Does the Brain Solve Visual Object Recognition? Neuron 73:415–434. https://doi.org/10.1016/j.neuron.2012.01.010
- Accuracy of Rats in Discriminating Visual Objects Is Explained by the Complexity of Their Perceptual Strategy. Current Biology 28:1005–1015. https://doi.org/10.1016/j.cub.2018.02.037
- How Visual Expertise Changes Representational Geometry: A Behavioral and Neural Perspective. Journal of Cognitive Neuroscience 33:2461–2476. https://doi.org/10.1162/jocn_a_01778
- Deep Neural Networks Reveal a Gradient in the Complexity of Neural Representations across the Ventral Stream. Journal of Neuroscience 35:10005–10014. https://doi.org/10.1523/JNEUROSCI.5023-14.2015
- Representations of regular and irregular shapes by deep Convolutional Neural Networks, monkey inferotemporal neurons and human judgments. PLOS Computational Biology 14. https://doi.org/10.1371/journal.pcbi.1006557
- Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. Nature Neuroscience 22. https://doi.org/10.1038/s41593-019-0392-5
- Conserved core visual object recognition across simian primates: Marmoset image-by-image behavior mirrors that of humans and macaques
- Brain organization, not size alone, as key to high-level vision: Evidence from marmoset monkeys. https://doi.org/10.1101/2020.10.19.345561
- Deep Neural Networks as a Computational Model for Human Shape Sensitivity. PLOS Computational Biology 12. https://doi.org/10.1371/journal.pcbi.1004896
- Visual Object Recognition
- Nonlinear Processing of Shape Information in Rat Lateral Extrastriate Cortex. Journal of Neuroscience. https://doi.org/10.1523/JNEUROSCI.1938-18.2018
- Assessing tree shrew high-level visual behavior using conventional and natural paradigms
- Do rats use shape to solve “shape discriminations”? Learning & Memory 13:287–297. https://doi.org/10.1101/lm.84406
- What Makes a Cell Face Selective? The Importance of Contrast. Neuron 74:567–581. https://doi.org/10.1016/j.neuron.2012.03.024
- Perceived Shape Similarity among Unfamiliar Objects and the Organization of the Human Object Vision Pathway. Journal of Neuroscience 28:10111–10123. https://doi.org/10.1523/JNEUROSCI.2511-08.2008
- The Second Visual System of The Tree Shrew. Journal of Comparative Neurology 527:679–693. https://doi.org/10.1002/cne.24413
- Behavioral measurement of RDK velocity discrimination thresholds in the tree shrew. Journal of Vision 12. https://doi.org/10.1167/12.9.1223
- “Artiphysiology” reveals V4-like shape tuning in a deep network trained for image classification. eLife 7. https://doi.org/10.7554/eLife.38242
- Face categorization and behavioral templates in rats. Journal of Vision 19:9. https://doi.org/10.1167/19.14.9
- The importance of contrast features in rat vision. Scientific Reports 13. https://doi.org/10.1038/s41598-023-27533-3
- Qualitative Representations for Recognition. Biologically Motivated Computer Vision 249–262. https://doi.org/10.1007/3-540-36181-2_25
- Transformation-Tolerant Object Recognition in Rats Revealed by Visual Priming. Journal of Neuroscience 32:21–34. https://doi.org/10.1523/JNEUROSCI.3932-11.2012
- Emergence of transformation-tolerant representations of visual objects in rat lateral extrastriate cortex. eLife 6. https://doi.org/10.7554/eLife.22794
- Functional specialization in rat occipital and temporal visual cortex. Journal of Neurophysiology 112:1963–1983. https://doi.org/10.1152/jn.00737.2013
- A Multivariate Approach Reveals the Behavioral Templates Underlying Visual Discrimination in Rats. Current Biology 22:50–55. https://doi.org/10.1016/j.cub.2011.11.041
- Using deep neural networks to evaluate object vision tasks in rats. PLOS Computational Biology 17. https://doi.org/10.1371/journal.pcbi.1008714
- Neural Representations of Natural and Scrambled Movies Progressively Change from Rat Striate to Temporal Cortex. Cerebral Cortex 26:3310–3322. https://doi.org/10.1093/cercor/bhw111
- Visual Categorization of Natural Movies by Rats. Journal of Neuroscience 34:10645–10658. https://doi.org/10.1523/JNEUROSCI.3663-13.2014
- Lower-Level Stimulus Features Strongly Influence Responses in the Fusiform Face Area. Cerebral Cortex 21:35–47. https://doi.org/10.1093/cercor/bhq050
- Invariant visual object recognition and shape processing in rats. Behavioural Brain Research 285:10–33. https://doi.org/10.1016/j.bbr.2014.12.053
- A rodent model for the study of invariant visual object recognition. Proceedings of the National Academy of Sciences 106:8748–8753. https://doi.org/10.1073/pnas.0811583106
Copyright
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.