The Spatial Frequency Representation Predicts Category Coding in the Inferior Temporal Cortex

  1. School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
  2. Cognitive Systems Laboratory, Control and Intelligent Processing Center of Excellence (CIPCE), School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
  3. School of Cognitive Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
  4. Perception and Plasticity Group, German Primate Center, Leibniz Institute for Primate Research, 37077 Gottingen, Germany
  5. Neural Circuits and Cognition Lab, European Neuroscience Institute Gottingen - A Joint Initiative of the University Medical Center Gottingen and the Max Planck Society, 37077 Gottingen, Germany
  6. Department of Brain and Cognitive Sciences, Cell Science Research Center, Royan Institute for Stem Cell Biology and Technology, ACECR, Tehran, Iran
  7. Department of Physiology, Faculty of Medical Sciences, Tarbiat Modares University, Tehran, Iran
  8. Department of Psychology, Psychology and Educational Science Faculty, University of Tehran, Tehran, Iran

Peer review process

Revised: This Reviewed Preprint has been revised by the authors in response to the previous round of peer review; the eLife assessment and the public reviews have been updated where necessary by the editors and peer reviewers.


Editors

  • Reviewing Editor
    Kristine Krug
    Otto-von-Guericke University Magdeburg, Magdeburg, Germany
  • Senior Editor
    Tirin Moore
    Howard Hughes Medical Institute, Stanford University, Stanford, United States of America

Reviewer #1 (Public Review):

This study reports that spatial frequency representation can predict category coding in the inferior temporal cortex. The original conclusion was based on likely problematic stimulus timing (33 ms, which was too brief). Now the authors claim that they also have a different set of data based on a longer stimulus duration (200 ms).

One big issue in the original report was that the experiments used a stimulus duration that was too brief, which could have weakened the effects of high spatial frequencies and confounded the conclusions. The authors have now provided a new set of data based on a longer stimulus duration and claim that the conclusions are unchanged. According to the authors, these new data were collected at the same time as the data in the original report.

The authors may wish to explain why they performed the same experiments using two stimulus durations yet reported only the data set with the brief duration. They may also explain why they opted not to mention the existence of another data set with a different stimulus duration in the original report, which would otherwise have certainly strengthened their main conclusions.

I suggest the authors upload both data sets and the analysis code, so that the claim can be easily examined by interested readers.

Reviewer #2 (Public Review):

Summary:

This paper aimed to examine the spatial frequency selectivity of macaque inferotemporal (IT) neurons and its relation to category selectivity. The authors suggest in the present study that some IT neurons show a sensitivity for the spatial frequency of scrambled images. Their report suggests a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. In addition, they report that the selectivity for faces and objects, relative to scrambled stimuli, depends on the spatial frequency tuning of the neurons.

Strengths:

Previous studies using human fMRI and psychophysics examined the contribution of different spatial frequency bands to object recognition but, as the authors point out, little is known about the spatial frequency selectivity of single IT neurons. This study addresses this gap and shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly. The authors related this weak spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli they employed to assess category selectivity.

The authors revised their manuscript and provided some clarifications regarding their experimental design and data analysis. They responded to most of my comments, but I find that some issues were addressed only partially or poorly. The new data they provided confirmed my concern about low responses to their scrambled stimuli. Thus, this paper shows spatial frequency selectivity in IT for scrambled stimuli that drive the neurons poorly (see main comments below). They related this (weak) spatial frequency selectivity to category selectivity, but these findings are premature given the low number of stimuli used to assess category selectivity.

Main points.

(1) They have now provided the responses of their neurons in spikes/s and present a distribution of the raw responses in a new figure. These data suggest that their scrambled stimuli were driving the neurons rather poorly, and thus it is unclear how well their findings will generalize to more effective stimuli. Indeed, the mean net firing rate to their scrambled stimuli was very low: about 3 spikes/s. How much can one conclude when the stimuli drive the recorded neurons that poorly? Also, the new Figure 2 - Appendix 1 shows that the mean modulation by spatial frequency is about 2 spikes/s, which is a rather small modulation. Thus, the spatial frequency selectivity the authors describe in this paper is rather small compared to the stimulus selectivity one typically observes in IT (stimulus-driven modulations can be at least 20 spikes/s).
(2) Their new Figure 2-Appendix 1 does not show net firing rates (baseline-subtracted; as I requested) and thus is not very informative. Please provide distributions of net responses so that the readers can evaluate the responses to the stimuli of the recorded neurons.
(3) The poor responses might be due to the short stimulus duration. The authors now report new data using a 200 ms duration, which supported their classification and latency data obtained with the brief duration. It would be very informative if the authors could also provide the mean net responses to their stimuli for the 200 ms duration. Were these responses as low as those for the brief duration? If so, the concern about generalization to effective stimuli that drive IT neurons well remains.
(4) I still do not understand why the analyses of Figures 3 and 4 provide different outcomes on the relationship between spatial frequency and category selectivity. I believe they refer to this finding in the Discussion: "Our results show a direct relationship between the population's category coding capability and the SF coding capability of individual neurons. While we observed a relation between SF and category coding, we have found uncorrelated representations. Unlike category coding, SF relies more on sparse, individual neuron representations.". I believe more clarification is necessary regarding the analyses of Figures 3 and 4, and why they can show different outcomes.
(5) The authors found a higher separability for faces (versus scrambled patterns) for neurons preferring high spatial frequencies. This is consistent across the two monkeys, but we are dealing here with a small number of neurons: only 6% of the neurons (16 neurons) belonged to this high spatial frequency group when pooling the two monkeys. Thus, although both monkeys show this effect, I wonder how robust it is given the small number of neurons per monkey that belong to this spatial frequency profile. Furthermore, the higher separability for faces for the low-frequency profiles is not consistent across monkeys, which should be pointed out.
(6) I agree that CNNs are useful models for ventral stream processing but that is not relevant to the point I was making before regarding the comparison of the classification scores between neurons and the model. Because the number of features and trial-to-trial variability differs between neural nets and neurons, the classification scores are difficult to compare. One can compare the trends but not the raw classification scores between CNN and neurons without equating these variables.

Author response:

The following is the authors’ response to the original reviews.

Public Reviews:

Reviewer #1 (Public Review):

Summary:

This study reports that IT neurons have biased representations toward low spatial frequency (SF) and faster decoding of low SFs than high SFs. High SF-preferred neurons, and low SF-preferred neurons to a lesser degree, perform better category decoding than neurons with other profiles (U and inverted-U shaped). SF coding also shows more sparseness than category coding in the earlier phase of the response and less sparseness in the later phase. The results are also contrasted with predictions of various DNN models.

Strengths:

The study addressed an important issue on the representations of SF information in a high-level visual area. Data are analyzed with LDA which can effectively reduce the dimensionality of neuronal responses and retain category information.

We would like to express our sincere gratitude for your insightful and constructive comments which greatly contributed to the refinement of the manuscript. We appreciate the time and effort you dedicated to reviewing our work and providing suggestions. We have carefully considered each of your comments and addressed the suggested revisions accordingly.

Weaknesses:

The results are likely compromised by improper stimulus timing and unmatched spatial frequency spectrums of stimuli in different categories.

The authors used a very brief stimulus duration (35ms), which would degrade the visual system's contrast sensitivity to medium and high SF information disproportionately (see Nachmias, JOSAA, 1967). Therefore, IT neurons in the study could have received more degraded medium and high SF inputs compared to low SF inputs, which may be at least partially responsible for higher firing rates to low SF R1 stimuli (Figure 1c) and poorer recall performance with medium and high SF R3-R5 stimuli in LDA decoding. The issue may also to some degree explain the delayed onset of recall to higher SF stimuli (Figure 2a), preferred low SF with an earlier T1 onset (Figure 2b), lower firing rate to high SF during T1 (Figure 2c), and somewhat increased firing rate to high SF during T2 (because weaker high SF inputs would lead to later onset, Figure 2d).

We appreciate your concern regarding the coarse-to-fine nature of SF processing in the vision hierarchy and the short exposure time of our paradigm. According to your comment, we repeated the analysis of SF representation with a 200ms exposure time, as illustrated in Appendix 1 - Figure 4. Our recorded data contain the 200ms exposure-time version for all neurons in the main phase. As can be seen, the results are similar to what we found in the 33ms experiments.

Next, we bring your attention to the following observations:

(1) According to Figure 2d, the average firing rate of IT neurons for HSF could be higher than for LSF in the late response phase. Therefore, the amount of HSF input received by the IT neurons is as much as for LSF; however, its impact on the IT response is observable in the later phase of the response. Thus, the LSF preference is due to the temporal advantage of LSF processing rather than to contrast sensitivity.

(2) According to Figure 3a, 6% of the neurons are HSF-preferred, and their firing rate for HSF is comparable to the LSF firing rate in the LSF-preferred group. This analysis is carried out in the early phase of the response (70-170 ms). While most of the neurons prefer LSF, this observation shows that there is an HSF input that excites a small group of neurons. Furthermore, the highest separability index also belongs to the HSF-preferred profile in the early phase of the response, which supports the impact of the HSF part of the input.

(3) Similar LSF-preferred responses are also reported for longer stimulus durations by Chen et al. (2018) (50ms for SC) and Zhang et al. (2023) (3.5-4 s for V2 and V4).

Our results suggest that the LSF-preferred nature of the IT responses, in terms of firing rate and recall, is not due to a weakness or lack of input source (or information) for HSF but rather to the processing nature of SF in the vision hierarchy.

To address this issue in the manuscript:

Appendix 1 - Figure 4 is added to the manuscript and shows the recall value and onset for R1-R5 with 200ms of exposure time.

We added the following description to the discussion:

“To rule out the degraded contrast sensitivity of the visual system to medium and high SF information because of the brief exposure time, we repeated the analysis with 200ms exposure time as illustrated in Appendix 1 - Figure 4, which indicates the same LSF-preferred results. Furthermore, according to Figure 2, the average firing rate of IT neurons for HSF could be higher than LSF in the late response phase. It indicates that the amount of HSF input received by the IT neurons in the later phase is as much as LSF; however, its impact on the IT response is observable in the later phase of the response. Thus, the LSF preference is due to the temporal advantage of the LSF processing rather than contrast sensitivity. Next, according to Figure 3(a), 6% of the neurons are HSF-preferred and their firing rate in HSF is comparable to the LSF firing rate in the LSF-preferred group. This analysis is carried out in the early phase of the response (70-170ms). While most of the neurons prefer LSF, this observation shows that there is an HSF input that excites a small group of neurons. Additionally, the highest SI belongs to the HSF-preferred profile in the early phase of the response, which supports the impact of the HSF part of the input. Similar LSF-preferred responses are also reported by Chen et al. (2018) (50ms for SC) and Zhang et al. (2023) (3.5 - 4 secs for V2 and V4). Therefore, our results show that the LSF-preferred nature of the IT responses, in terms of firing rate and recall, is not due to the weakness or lack of input source (or information) for HSF but rather to the processing nature of the SF in the IT cortex.”

Figure 3b shows greater face coding than object coding by high SF neurons and, to a lesser degree, by low SF neurons. Only the inverted-U-shaped neurons displayed slightly better object coding than face coding. Overall, the results give the impression that IT neurons are significantly more capable of coding faces than objects, which is inconsistent with the general understanding of the functions of IT neurons. The problem may lie with the selection of stimulus images (Figure 1b). To study SF-related category coding, the images in the two categories need to have similar SF spectrums in the Fourier domain. Such efforts are not mentioned in the manuscript, and a look at the images in Figure 1b suggests that such efforts were likely not properly made. The ResNet18 decoding results in Figure 6C, where units of different profiles show similar face and object coding, might be closer to reality.

Because of the limited number of stimuli in our experiments, it is hard to discuss category selectivity, which requires a larger number of stimuli. To overcome this limitation, we fixed 60% of the stimuli (nine out of 15) while varying the remainder to reduce selection bias. To check the coding capability of IT neurons for faces and non-face objects, we evaluated the recall of face vs. non-face classification on intact stimuli (similar to the classifiers described in the manuscript). Results show that at the population level, the recall value is 90.45% for objects and 92.45% for faces; the difference is not significant (p-value=0.44). On the other hand, we note that a large difference in the SI value does not translate directly into classification accuracy; rather, it illustrates the strength of representation.

Regarding the SF spectrums, after matching the luminance and contrast of the images, we matched the power of the images with respect to SF and category. Powers are calculated as the sum of the absolute values of the Fourier transform of the image. Considering all stimuli, an ANOVA shows that the various SF bands have similar power (one-way ANOVA, p-value=0.24). Furthermore, comparing the power of faces and objects in all SF bands (including intact), for both unscrambled and scrambled images, indicates no significant difference between faces and objects (p-value > 0.1). Therefore, the result of Figure 3b suggests that IT employs various SF bands for the recognition of various objects.
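For illustration, a minimal sketch of this power-matching check (assuming the stimuli are available as 2-D grayscale arrays; the grouping structure and function names are ours, not from the analysis code):

```python
import numpy as np
from scipy.stats import f_oneway

def sf_power(image):
    """Total SF power: sum of the absolute values of the 2-D Fourier coefficients."""
    return np.abs(np.fft.fft2(image)).sum()

def check_band_powers(stimuli_by_band):
    """stimuli_by_band: dict mapping an SF-band label (e.g. 'R1'..'R5')
    to a list of 2-D grayscale image arrays belonging to that band."""
    powers = [[sf_power(img) for img in imgs] for imgs in stimuli_by_band.values()]
    return f_oneway(*powers)  # one-way ANOVA of power across the SF bands
```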

Comparing the results of CNNs and IT shows that the CNNs do not capture the complexities of the IT cortex in terms of SF. One source of this difference is the behavioral saliency of face stimuli in the training of the primate visual system.

To address this issue in the manuscript:

The following description is added to the discussion:

“… the decoding performance of category classification (face vs. non-face) in intact stimuli is 94.2%. The recall value for objects vs. scrambled is 90.45%, and for faces vs. scrambled is 92.45% (p-value=0.44), which indicates the high level of generalizability and validity characterizing our results.”

The following description is added to the method section, SF filtering.

“Finally, we equalized the stimulus power in all SF bands (intact, R1-R5). The SF power among all conditions (all SF bands, face vs. non-face and unscrambled vs. scrambled) does not vary significantly (p-value > 0.1). SF power is calculated as the sum of the square value of the image coefficients in the Fourier domain.”

Reviewer #2 (Public Review):

Summary:

This paper aimed to examine the spatial frequency selectivity of macaque inferotemporal (IT) neurons and its relation to category selectivity. The authors suggest in the present study that some IT neurons show a sensitivity for the spatial frequency of scrambled images. Their report suggests a shift in preferred spatial frequency during the response, from low to high spatial frequencies. This agrees with a coarse-to-fine processing strategy, which is in line with multiple studies in the early visual cortex. In addition, they report that the selectivity for faces and objects, relative to scrambled stimuli, depends on the spatial frequency tuning of the neurons.

Strengths:

Previous studies using human fMRI and psychophysics studied the contribution of different spatial frequency bands to object recognition, but as pointed out by the authors little is known about the spatial frequency selectivity of single IT neurons. This study addresses this gap and they show that at least some IT neurons show a sensitivity for spatial frequency and, interestingly, show a tendency for coarse-to-fine processing.

We extend our sincere appreciation for your thoughtful and constructive feedback on our paper. We are grateful for the time and expertise you invested in reviewing our work. Your detailed suggestions have been instrumental in addressing several key aspects of the paper, contributing to its clarity and scholarly merit. We have carefully considered each of your comments and have made revisions accordingly.

Weaknesses and requested clarifications:

(1) It is unclear whether the effects described in this paper reflect a sensitivity to spatial frequency in cycles/deg (which depends on the distance from the observer and changes when rescaling the image) or a sensitivity to cycles/image, which is largely independent of image scale. How is it related to the well-documented size tolerance of IT neuron selectivity?

Our stimuli are filtered in cycles/image and, knowing the distance of the subject from the monitor, we can calculate cycles/degree. To the best of our knowledge, this is also the case for all other SF-related studies. To isolate the effect of cycles/image versus cycles/degree, one should keep one of them fixed while changing the other; for example, changing the subject's distance from the monitor will change the SF content in terms of cycles/degree. With our current data, we cannot discriminate this effect. To address this issue, we added the following description to the discussion:

“Finally, since our experiment maintains a fixed SF content in terms of both cycles per degree and cycles per image, further experiments are needed to discern whether our observations reflect sensitivity to cycles per degree or cycles per image.”
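To make the geometry concrete, the conversion between the two units can be sketched as follows (a minimal example; the stimulus size, viewing distance, and function name are illustrative, not our experimental values):

```python
import math

def cycles_per_degree(cycles_per_image, image_size_cm, viewing_distance_cm):
    """Convert cycles/image to cycles/degree for a given viewing geometry.

    The stimulus subtends 2*atan(size / (2*distance)) degrees of visual angle;
    dividing cycles/image by that angle yields cycles/degree.
    """
    image_deg = 2 * math.degrees(math.atan(image_size_cm / (2 * viewing_distance_cm)))
    return cycles_per_image / image_deg

# Illustrative values only: a 10 cm wide stimulus viewed from 57 cm subtends
# about 10 degrees, so an 8 cycles/image band is roughly 0.8 cycles/degree.
```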

(2) The authors band-pass filtered phase-scrambled images of faces and objects. The original images likely differed in their spatial frequency amplitude spectrum, and thus it is unclear whether the differing bands contained the same power for the different scrambled images. If not, this could have contributed to the frequency sensitivity of the neurons.

After equalizing the luminance and contrast of the images, we equalized their power with respect to SF and category. The powers were calculated as the sum of the absolute values of the Fourier transform of the images. The results of the ANOVA across all stimuli indicate that the various SF bands exhibit similar power (one-way ANOVA, p-value = 0.24). Additionally, a comparison of power between faces and objects in all SF bands (including intact), for both unscrambled and scrambled images, reveals no significant differences (p-value > 0.1). To clarify this point, we have incorporated the following information into the Methods section.

“Finally, we equalized the stimulus power in all SF bands (intact, R-R5). The SF power among all conditions (all SF bands, face vs. non-face and unscrambled vs. scrambled) does not vary significantly (ANOVA, p-value > 0.1).”

(3) How strong were the responses to the phase-scrambled images? Phase-scrambled images are expected to be rather ineffective stimuli for IT neurons. How can one extrapolate the effect of the spatial frequency band observed for ineffective stimuli to that for more effective stimuli, like objects or (for some neurons) faces? A distribution of the net responses (in spikes/s) to the scrambled stimuli should be provided, for both the early and late windows.

The sample neuron in Figure 1c was chosen as a representative of the recorded neurons. In the early response phase, the average firing rate to scrambled stimuli is 26.3 spikes/s, which is significantly higher than the response in the -50 to 50ms window, which is 23.4 spikes/s. In comparison, the mean response to intact face stimuli is 30.5 spikes/s, while object stimuli elicit an average response of 28.8 spikes/s. Moving to the late phase, T2, the responses to scrambled, face, and object stimuli are 19.5, 19.4, and 22.4 spikes/s, respectively. Moreover, when the classification accuracy for SF exceeds chance levels, it indicates a significant impact of SF bands on the IT response. This raises a direct question about the explicit coding of SF bands in the IT cortex observed for ineffective stimuli and how it relates to complex and effective stimuli, such as faces. To show the strength of neuron responses to the SF bands in scrambled images, we added Appendix 1 - Figure 2 and, according to comment 4, also added Appendix 1 - Figure 1, which shows the average and std of the responses to all SF bands. The following description is added to the results section.

“Considering the strength of responses to scrambled stimuli, the average firing rate in response to scrambled stimuli is 26.3 Hz, which is significantly higher than the response observed between -50 and 50 ms, where it is 23.4 Hz (p-value=3×10⁻⁵). In comparison, the mean response to intact face stimuli is 30.5 Hz, while non-face stimuli elicit an average response of 28.8 Hz. The distribution of neuron responses for scrambled, face, and non-face in T1 is illustrated in Appendix 1 - Figure 2.

[…]

Moreover, the average firing rates of scrambled, face, and non-face stimuli are 19.5 Hz, 19.4 Hz, and 22.4 Hz, respectively. The distribution of neuron responses is illustrated in Appendix 1 - Figure 2.”

(4) The strength of the spatial frequency selectivity is unclear from the presented data. The authors provide the result of a classification analysis, but this is in normalized units so that the reader does not know the classification score in percent correct. Unnormalized data should be provided. Also, it would be informative to provide a summary plot of the spatial frequency selectivity in spikes/s, e.g. by ranking the spatial frequency bands for each neuron based on half of the trials and then plotting the average responses for the obtained ranks for the other half of the trials. Thus, the reader can appreciate the strength of the spatial frequency selectivity, considering trial-to-trial variability. Also, a plot should be provided of the mean response to the stimuli for the two analysis windows of Figure 2c and 2d in spikes/s so one can appreciate the mean response strengths and effect size (see above).

The normalization of the classification results is obtained simply by subtracting the chance level, which is 0.2, from all values. Therefore, the values can still be interpreted as percentages, as we did in the results section. To make this clear, we removed the “a.u.” from the figure and added the following description to the results section.

“The accuracy value is normalized by subtracting the chance level (0.2).”

Regarding the selectivity of the neurons, as suggested by your comment, we added a new figure to the appendix, Appendix 1 - Figure 2. This figure shows the strength of SF selectivity, considering trial-to-trial variability. The following description is added to the results section:

“The strength of SF selectivity, considering the trial-to-trial variability is provided in Appendix 1 Figure 2, by ranking the SF bands for each neuron based on half of the trials and then plotting the average responses for the obtained ranks for the other half of the trials.”
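As a sketch, the ranking procedure quoted above can be implemented along the following lines (assuming one neuron's firing rates are arranged as a trials × SF-bands array; the function and variable names are ours, for illustration):

```python
import numpy as np

def split_half_rank_tuning(rates, rng=None):
    """rates: (n_trials, n_bands) firing rates of one neuron across SF bands.

    Rank the SF bands on a random half of the trials, then return the mean
    response per ranked band computed on the held-out half of the trials.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.permutation(rates.shape[0])
    half_a, half_b = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    order = np.argsort(rates[half_a].mean(axis=0))[::-1]  # best-to-worst bands
    return rates[half_b].mean(axis=0)[order]              # held-out, re-ranked means
```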

The firing rates in Figures 2c and 2d are normalized for better illustration, since the variation in firing rates across neurons is high, as can be observed in Appendix 1 - Figure 1. Since we seek trends in the response, the absolute values are not important (the baseline firing rates of neurons differ); rather, the values relative to the baseline firing rate determine the trend. To address the mean response and the strength of the SF response, the following description is added to the results section.

“Considering the strength of responses to scrambled stimuli, the average firing rate in response to scrambled stimuli is 26.3 Hz, which is significantly higher than the response observed between -50 and 50 ms, where it is 23.4 Hz (p-value=3×10⁻⁵). In comparison, the mean response to intact face stimuli is 30.5 Hz, while non-face stimuli elicit an average response of 28.8 Hz. The distribution of neuron responses for scrambled, face, and non-face in T1 is illustrated in Appendix 1 - Figure 2.

[…]

Moreover, the average firing rates of scrambled, face, and non-face stimuli are 19.5 Hz, 19.4 Hz, and 22.4 Hz, respectively. The distribution of neuron responses is illustrated in Appendix 1 - Figure 2.”

Furthermore, we added a figure, Appendix 1 - Figure 3, to illustrate the strength of SF selectivity in our profiles. The following is added to the results section:

“To check the robustness of the profiles, considering the trial-to-trial variability, the strength of SF selectivity in each profile is provided in Appendix 1 - Figure 3, by forming the profile of each neuron based on half of the trials and then plotting the average SF responses with the other half of the trials.”

(5) It is unclear why such brief stimulus durations were employed. Will the results be similar, in particular the preference for low spatial frequencies, for longer stimulus durations that are more similar to those encountered during natural vision?

Please refer to our response to the first comment of Reviewer 1.

(6) The authors report that the spatial frequency band classification accuracy for the population of neurons is not much higher than that of the best neuron (line 151). How does this relate to the SNC analysis, which appears to suggest that many neurons contribute to the spatial frequency selectivity of the population in a non-redundant fashion? Also, the outcome of the analyses should be provided (such as SNC and decoding (e.g. Figure 1D)) in the original units instead of undefined arbitrary units.

The population accuracy is approximately 5% higher than that of the best neuron. However, we have no reference against which to compare this effect size (the value is roughly similar for face vs. object, while the chance levels are different). As stated in the Methods, SNC is calculated for two label modes (LSF and HSF), so it cannot be directly compared to the best-neuron accuracy. Regarding the unit of SNC, it can be interpreted directly as percent by multiplying by a factor of 100. We removed the “a.u.” to prevent misunderstanding and modified the results section for clarity.

“… SNC score for SF (two labels, LSF (R1 and R2) vs. HSF (R4 and R5)) and category … (average SNC for SF=0.51%±0.02 and category=0.1%±0.04 …”
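For readers who want the construction spelled out, the sketch below shows one common way to compute a per-neuron contribution score, as the drop in population decoding accuracy when that neuron is left out. This is an illustrative assumption, not necessarily our exact procedure; the precise SNC definition is the one given in the Methods:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def leave_one_out_contributions(X_train, y_train, X_test, y_test):
    """Per-neuron contribution as the accuracy drop when that neuron is removed.

    X_*: (n_trials, n_neurons) population responses; y_*: binary labels
    (e.g. LSF vs. HSF). A common construction of a single-neuron contribution.
    """
    full_acc = LinearDiscriminantAnalysis().fit(X_train, y_train).score(X_test, y_test)
    n_neurons = X_train.shape[1]
    drops = np.empty(n_neurons)
    for i in range(n_neurons):
        keep = np.r_[0:i, i + 1:n_neurons]  # all neurons except neuron i
        acc = LinearDiscriminantAnalysis().fit(
            X_train[:, keep], y_train).score(X_test[:, keep], y_test)
        drops[i] = full_acc - acc  # positive -> non-redundant contribution
    return drops
```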

(7) To me, the results of the analyses of Figure 3c,d, and Figure 4 appear to disagree. The latter figure shows no correlation between category and spatial frequency classification accuracies while Figure 3c,d shows the opposite.

In Figure 3c,d, following what we observed in Figure 3a,b about the category coding capabilities of populations of neurons grouped by single-neuron profiles, we tested a similar idea: whether the coding capability of single neurons for SF (or category) could predict the coding capability of the population for category (or SF). Therefore, both analyses investigate a relation between a characteristic of single neurons and the coding capability of a population of similar neurons. In Figure 4, on the other hand, the idea is to check the characteristics of the coding mechanisms behind SF and category coding. In Figure 4a, we check whether there is any relation between category and SF coding capability within a single neuron's activity, without the influence of other neurons, to investigate the idea that SF coding may be a byproduct of an object recognition mechanism. In Figure 4b, we investigated the contribution of all neurons to the population decision, again to check whether the mechanisms behind SF and category coding are the same. This analysis shows how individual neurons contribute to SF or category coding at the population level. Therefore, the analyses in Figures 3 and 4 differ in method and in what they were designed to investigate, and we cannot directly compare their results.

(8) If I understand correctly, the "main" test included scrambled versions of each of the "responsive" images selected based on the preceding test. Each stimulus was presented 15 times (once in each of the 15 blocks). The LDA classifier was trained to predict the 5 spatial frequency band labels and they used 70% of the trials to train the classifier. Were the trained and tested trials stratified with respect to the different scrambled images? Also, LDA assumes a normal distribution. Was this the case, especially because of the mixture of repetitions of the same scrambled stimulus and different scrambled stimuli?

In response to your inquiry regarding the stratification of trials, both the training and testing data were representative of the entire spectrum of scrambled images used in our experiment. To address your concern about the assumption of a normal distribution, especially given the mixture of repetitions of the same scrambled stimulus and different stimuli, our analysis of firing rates reveals a slightly left-skewed normal distribution. While there is a deviation from a perfectly normal distribution, we are confident that this skewness does not compromise the robustness of the LDA classifier.
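A minimal sketch of this decoding setup (LDA over the five SF-band labels with a stratified 70/30 split, as described; scikit-learn and the variable names are our stand-ins for the actual analysis code):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

def decode_sf_bands(X, y, seed=0):
    """X: (n_trials, n_neurons) responses; y: SF-band labels R1..R5 (5 classes)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.7, stratify=y, random_state=seed)  # stratified 70/30 split
    clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    return recall_score(y_te, clf.predict(X_te), average=None)  # per-band recall
```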

(9) The LDA classifiers for spatial frequency band (5 labels) and category (2 labels) have different chance and performance levels. Was this taken into account when comparing the SNC between these two classifiers? Details and SNC values should be provided in the original (percent difference) instead of arbitrary units in Figure 5a. Without such details, the results are impossible to evaluate.

For both SNC and CMI calculations in SF, we considered two labels of HSF (R4 and R5) and LSF (R1 and R2). This was mentioned in the Methods section, after equation (5). According to your comment, to make it clear in the results section, we also added this description to the results section.

“… illustrates the SNC score for SF (two labels, LSF (R1 and R2) vs. HSF (R4 and R5)) and category (face vs. non-face) … conditioned on the label, SF (LSF (R1 and R2) vs. HSF (R4 and R5)) or category, to assess the information.”

The value of SNC can be converted directly to percent by multiplying by a factor of 100. To make this clear, we removed “a.u.” from the y-axis.

(10) Recording locations should be described in IT, since the latter is a large region. Did their recordings include the STS? A/P and M/L coordinate ranges of recorded neurons?

We appreciate your suggestion regarding the recording locations. Nevertheless, given the complexities associated with neurophysiological recordings and the limitations imposed by our methodologies, we face challenges in determining precisely whether each unit was located in the STS. To address your comment, we added Appendix 1 - Figure 5, which shows the SF and category coding capability of neurons along their recorded locations.

(11) The authors should show in Supplementary Figures the main data for each of the two animals, to ensure the reader that both monkeys showed similar trends.

We added Appendix 2 which shows the consistency of the main results in the two monkeys.

(12) The authors found that the deep nets encoded better the spatial frequency bands than the IT units. However, IT units have trial-to-trial response variability and CNN units do not. Did they consider this when comparing IT and CNN classification performance? Also, the number of features differs between IT and CNN units. To me, comparing IT and CNN classification performances is like comparing apples and oranges.

Deep convolutional neural networks are currently considered the state-of-the-art models of the primate visual pathway. However, as you mentioned and based on our results, they do not yet capture various complexities of the visual ventral stream. Still, studying the similarities and differences between CNNs and brain regions, such as the IT cortex, is an active area of research; for example:

a. Kubilius, Jonas, et al. "Brain-like object recognition with high-performing shallow recurrent ANNs." Advances in neural information processing systems 32 (2019).

b. Xu, Yaoda, and Maryam Vaziri-Pashkam. "Limits to visual representational correspondence between convolutional neural networks and the human brain." Nature Communications, 12.1 (2021).

c. Jacob, Georgin, et al. "Qualitative similarities and differences in visual object representations between brains and deep networks." Nature Communications, 12.1 (2021).

Therefore, we believe comparing IT and CNN, despite all of the differences in terms of their characteristics, can help both fields grow faster, especially in introducing brain-inspired networks.
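For reference, the comparison can be sketched as follows: extract unit activations from a pretrained ResNet18 and submit them to the same LDA pipeline used for the neural data. The layer, weights, and preprocessing below are standard torchvision choices and are illustrative rather than the exact configuration of our experiments:

```python
import torch
from torchvision import models, transforms

# Pretrained ResNet18 truncated before the classification head, so the forward
# pass returns the 512-d average-pooled feature vector ("unit" activations).
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def unit_responses(pil_image):
    x = preprocess(pil_image.convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    return feature_extractor(x).flatten().numpy()          # 512 activations
```

The resulting activation vectors then play the role of the neuronal population response in the classification analyses.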

(13) The authors should define the separability index in their paper. Since it is the main index to show a relationship between category and spatial frequency tuning, it should be described in detail. Also, results should be provided in the original units instead of undefined arbitrary units. The tuning profiles in Figure 3A should be in spikes/s. Also, it was unclear to me whether the classification of the neurons into the different tuning profiles was based on an ANOVA assessing per neuron whether the effect of the spatial frequency band was significant (as should be done).

Based on your comment, we added the description of the separability index to the Methods section. However, since the separability index is defined as a ratio of two dispersion matrices, it is by nature unitless. The tuning profiles in Figure 3a are normalized for better illustration, since the variation in firing rates is high. Since we seek trends in the response, the absolute values are not important. Regarding the SF profile formation, to better present the SF profile assignment, we updated the Methods section. Furthermore, the strength of responses to scrambled stimuli can be observed in Appendix 1 - Figures 1 and 2.
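For orientation, separability indices of this family are typically built from within-class and between-class scatter matrices; one common scalar form is sketched below as an illustration (the exact formula we use is the one now given in the Methods):

$$
S_W=\sum_{c}\sum_{i\in c}\left(\mathbf{x}_i-\boldsymbol{\mu}_c\right)\left(\mathbf{x}_i-\boldsymbol{\mu}_c\right)^{\top},\qquad
S_B=\sum_{c} n_c\left(\boldsymbol{\mu}_c-\boldsymbol{\mu}\right)\left(\boldsymbol{\mu}_c-\boldsymbol{\mu}\right)^{\top},\qquad
\mathrm{SI}=\frac{\operatorname{tr}\left(S_B\right)}{\operatorname{tr}\left(S_W\right)},
$$

where $\mathbf{x}_i$ are single-trial responses, $\boldsymbol{\mu}_c$ and $n_c$ are the mean and trial count of class $c$ (e.g., face vs. scrambled), and $\boldsymbol{\mu}$ is the grand mean. Larger SI values indicate classes that are farther apart relative to their within-class spread.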

(14) As mentioned above, the separability analysis is the main one suggesting an association between category and spatial frequency tuning. However, they compute the separability of each category with respect to the scrambled images. Since faces are a rather homogeneous category I expect that IT neurons have on average a higher separability index for faces than for the more heterogeneous category of objects, at least for neurons responsive to faces and/or objects. The higher separability for faces of the two low- and high-pass spatial frequency neurons could reflect stronger overall responses for these two classes of neurons. Was this the case? This is a critical analysis since it is essential to assess whether it is category versus responsiveness that is associated with the spatial frequency tuning. Also, I do not believe that one can make a strong claim about category selectivity when only 6 faces and 3 objects (and 6 other, variable stimuli; 15 stimuli in total) are employed to assess the responses for these categories (see next main comment). This and the above control analysis can affect the main conclusion and title of the paper.

We appreciate your concern regarding the category selectivity versus responsiveness of the SF profiles. First, we note that we used the SI because it overcomes the limitations of the accuracy and recall metrics, which are discrete and can saturate. Using the SI, we cannot directly compute face vs. object, since this index reports only one value for the whole discrimination task. Therefore, we have to calculate the SI for face/object vs. scrambled to obtain a value per category. However, as you suggested, this raises the question of whether we are assessing how well the neural responses distinguish between actual images (faces or objects) and their scrambled versions, or merely assessing responsiveness. Based on Figure 3b, since we have face-selective profiles (LSF- and HSF-preferred), an object-selective profile (inverse U), and the U profile, where the SI is the same for faces and objects, we believe the SF profile is associated with category selectivity; otherwise, we would have the same face/object recall in all profiles, as in the U-shaped profile.

To analyze this issue further, we calculated the number of face/object-selective neurons in the 70-170ms window. We found 43 face-selective neurons and 36 object-selective neurons (FDR-corrected p-value < 0.05); thus, the numbers of face-selective and object-selective neurons are similar. Next, we checked the selectivity of the neurons within each profile. The numbers of face/object-selective neurons are LP=13/3, HP=6/2, IU=3/9, and U=14/13, with the remaining neurons belonging to the NP group. The results show more face-selective neurons in LP and HP and more object-selective neurons in the IU class, while the U class contains roughly equal numbers of face- and object-selective neurons. This observation supports the relationship between category selectivity and the profiles.

Next, we examined the average neuron response to faces and objects in each profile. The difference between the firing rates for faces and objects was not significant in any profile (rank-sum test with a significance level of 0.05). The rates are as follows: the average face/object firing rates (spikes/s) are LP=36.72/28.77, HP=28.55/25.52, IU=21.55/27.25, and U=38.48/36.28. While the differences are not significant, they support a relationship between the profiles and categories rather than mere responsiveness.

The following description is added to the results section to cover this point of view.

“To assess whether the SF profiles distinguish category selectivity or merely evaluate the neuron's responsiveness, we quantified the number of face/non-face selective neurons in the 70-170ms time window. Our analysis shows a total of 43 face-selective neurons and 36 non-face-selective neurons (FDR-corrected p-value < 0.05). The results indicate a higher proportion of face-selective neurons in LP and HP, while a greater number of non-face-selective neurons is observed in the IU category (number of face/non-face selective neurons: LP=13/3, HP=6/2, IU=3/9). The U category exhibits a roughly equal distribution of face and non-face-selective neurons (U=14/13). This finding reinforces the connection between category selectivity and the identified profiles. We then analyzed the average neuron response to faces and non-faces within each profile. The difference between the firing rates for faces and non-faces is not significant in any of the profiles (face/non-face average firing rate (Hz): LP=36.72/28.77, HP=28.55/25.52, IU=21.55/27.25, U=38.48/36.28; rank-sum test with a significance level of 0.05). Although the observed differences are not statistically significant, they provide support for the association between profiles and categories rather than mere responsiveness.”

About the low number of stimuli, please check the next comment.

(15) For the category decoding, the authors employed intact, unscrambled stimuli. Were these from the main test? If yes, then I am concerned that this represents a too small number of stimuli to assess category selectivity. Only 9 fixed + 6 variable stimuli = 15 were in the main test. How many faces/ objects on average? Was the number of stimuli per category equated for the classification? When possible use the data of the preceding selectivity test which has many more stimuli to compute the category selectivity.

We used only the data recorded in the main phase, which contains 15 images in each session. Each image yields 12 stimuli (intact and R1-R5, each with a phase-scrambled version), giving a total of 180 unique stimuli per session. Increasing the number of images would have increased the recording time. We compensated for this limitation by increasing the diversity of images in each session, picking the most responsive ones from the selectivity phase. On average, 7.54 of the 15 stimuli in each session were faces. We added this information to the Methods section. Furthermore, as mentioned in the discussion, the number of samples per category is equalized for each classification run. We note that we cannot use the selectivity-phase data for this analysis, since the SF-related stimuli are filtered in different bands.

Recommendations For The Authors:

Reviewer #1 (Recommendations For The Authors):

I suggest that the authors double-check their results by performing control experiments with longer stimulus duration and SF-spectrum-matched face and object stimuli.

Thanks for your suggestion; according to your comment, we added Appendix 1 - Figure 3.

In addition, I had a very difficult time understanding the differences between Figure 3c and Figure 4a. Please rewrite the descriptions to clarify.

Thanks for your suggestion; we revised the descriptions of these two figures. The following description is added to the results section for Figure 3c.

“Next, to examine the relation between the SF (category) coding capacity of single neurons and the category (SF) coding capability at the population level, we calculated the correlation between coding performance at the population level and the coding performance of single neurons within that population (Figure 3c and d). In other words, we investigated the relation between single-neuron and population-level coding capabilities for SF and category. The SF (or category) coding performance of a sub-population of 20 neurons that have roughly the same single-neuron coding capability for category (or SF) is examined.”

Lines 147-148: The text states that 'The maximum accuracy of a single neuron was 19.08% higher than the chance level'. However, in Figure 4, the decoding accuracies of individual neurons for category and SF ranged between 49%-90% and 20%-40%, respectively.

Please explain the discrepancies.

The first number is reported relative to the chance level, which is 20%; thus, the unnormalized number is 39%, which is consistent with the SF accuracy in Figure 4. We added the following description to prevent any misunderstanding.

“… was 19.08% higher than the chance level (unnormalized accuracy is 39.08%, neuron #193, M2).”

Lines 264-265: Should 'the alternative for R3 and R4' be 'the alternative for R4 and R5'?

Thanks for your attention, it's “R4 and R5”. We corrected that mistake.

Lines 551-562: The labels for SF classification are R1-R5. Is it a binary or a multi-class classification task?

It is a multi-class classification (each trial has one of the five SF-band labels). We made this clear in the text.

“… labels were SF bands (R1, R2, ..., R5, multi-class classifier).”

Figure 4b: Neurons in SF/category decoding exhibit both positive and negative weights. However, in the analysis of sparse neuron weights in Equation 6, only the magnitude of the weights is considered. Is the sign of weight considered too?

We used the absolute values of the neuron weights to calculate sparseness. We also corrected Equation 6.
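For illustration, one widely used sparseness measure that operates on absolute weights is the Treves-Rolls index, sketched below; this is a standard example of such a measure and not necessarily identical to Equation 6 in the manuscript:

```python
import numpy as np

def treves_rolls_sparseness(weights):
    """Treves-Rolls index over the absolute values of classifier weights.

    Approaches 1 when |w| is spread evenly across neurons and approaches
    1/n when the weight magnitude is concentrated in a single neuron.
    """
    w = np.abs(np.asarray(weights, dtype=float))
    return (w.mean() ** 2) / (w ** 2).mean()
```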

Reviewer #2 (Recommendations For The Authors):

(1) Line 52: what do the authors mean by coordinate processing in object recognition?

To avoid any potential misunderstanding, we used the exact phrase from Saneyoshi and Michimata (2015). It is, in fact, coordinate relations processing: coordinate relations specify the metric information of the relative locations of objects.

(2) About half of the Introduction is a summary of the Results. This can be shortened.

Thanks for your suggestion.

(3) Line 134: Peristimulus time histogram instead of Prestimulus time histogram.

Thanks for your attention. We corrected that.

(4) Line 162: the authors state that R1 is decoded faster than R5, but the reported statistic is only for R1 versus R2.

It was a typo; the p-value is reported only for R1 versus R5.

(5) Line 576: which test was used to assess the statistical significance?

The test is the Wilcoxon signed-rank test. We added it to the text.

(6) How can one present a 35 ms long stimulus with a 60 Hz frame rate (the stimuli were presented on a 60Hz monitor (line 470))? Please correct.

Thanks for your attention. We corrected that: the stimulus presentation time is 33ms, and the monitor refresh rate is 120Hz.
