
'Artiphysiology' reveals V4-like shape tuning in a deep network trained for image classification

  1. Dean A Pospisil (corresponding author)
  2. Anitha Pasupathy
  3. Wyeth Bair
  1. University of Washington, United States
  2. University of Washington Institute for Neuroengineering, United States
Research Article
Cite this article as: eLife 2018;7:e38242 doi: 10.7554/eLife.38242

Abstract

Deep networks provide a potentially rich interconnection between neuroscientific and artificial approaches to understanding visual intelligence, but the relationship between artificial and neural representations of complex visual form has not been elucidated at the level of single-unit selectivity. Taking the approach of an electrophysiologist to characterizing single CNN units, we found many units exhibit translation-invariant boundary curvature selectivity approaching that of exemplar neurons in the primate mid-level visual area V4. For some V4-like units, particularly in middle layers, the natural images that drove them best were qualitatively consistent with selectivity for object boundaries. Our results identify a novel image-computable model for V4 boundary curvature selectivity and suggest that such a representation may begin to emerge within an artificial network trained for image categorization, even though boundary information was not provided during training. This raises the possibility that single-unit selectivity in CNNs will become a guide for understanding sensory cortex.

https://doi.org/10.7554/eLife.38242.001

Introduction

Deep convolutional neural networks (CNNs) are currently the highest-performing image recognition computer algorithms. While their overall design reflects the hierarchical structure of the ventral (‘form-processing’) visual stream (Hubel and Wiesel, 1962; LeCun et al., 2015), the visual selectivity (i.e., tuning) of single units within the network is not constrained to match neurobiology. Rather, single-unit properties are determined by a performance-based learning algorithm that operates iteratively across many pre-classified training images, tuning the parameters of the network to decrease the error between the network output and the target classification. Nevertheless, first-layer units in these CNNs, following training, often show selectivity for orientation and spatial frequency (Figure 1; see also Krizhevsky et al., 2012) like neurons in primary visual cortex (V1). Attempts to visualize features encoded by single units deeper in such networks (Zeiler and Fergus, 2013; Mahendran and Vedaldi, 2014) show that selectivity becomes increasingly complex and categorical, similar to the progression along the ventral stream. Solidifying this idea, Güçlü and van Gerven (2015) found a corresponding hierarchy of visual features between BOLD signals in the human ventral stream and layers within a CNN. This raises the tentative but exciting possibility that units deeper in the network may approximate tuning observed at mid-level stages of the ventral stream, for example area V4. This is not unreasonable given that artificial networks that perform better at image classification also have population-level representations closer to those in area IT (Yamins et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014; Kriegeskorte, 2015). V4 is a primary input to IT (Felleman and Van Essen, 1991), yet there has been no systematic examination of whether specific form-selective properties found in V4 emerge within a CNN.

The 96 kernels (11 × 11 pixels, by three color channels) of the first layer, Conv1, of the AlexNet model tested here.

Like many V1 receptive fields, many of these kernels are band-limited in spatial frequency and orientation. Each kernel was independently scaled to maximize its RGB dynamic range to highlight spatial structure.

https://doi.org/10.7554/eLife.38242.002

To address this, we tested whether two properties of shape selectivity in V4, tuning for boundary curvature (Pasupathy and Connor, 1999; Pasupathy and Connor, 2001; Cadieu et al., 2007) and translation invariance (Gallant et al., 1996; Pasupathy and Connor, 2001; Rust and DiCarlo, 2010; Rust and DiCarlo, 2012; Nandy et al., 2013; Sharpee et al., 2013), arise within a CNN. In particular, many V4 neurons are selective for boundary curvature, ranging from concave to sharply convex, at particular angular positions around the center of an object. This angular position and curvature (APC) tuning may be important for supporting entire object representations deeper in the ventral stream (Pasupathy and Connor, 2001; Murphy and Finkel, 2007), but it remains uncertain how it arises or is used. Finding APC-like tuning in the middle of an artificial network could help to relate mid-level visual physiology to pressures on visual representation applied by image statistics at the front end and by categorization performance downstream. It could also relate to the recent observation that human perception of shape similarity correlates with response similarity in CNNs (Kubilius et al., 2016).

We take an approach to characterizing single units in an artificial deep network that we refer to as ‘artiphysiology’ because it closely mirrors how an electrophysiologist approaches the characterization of single neurons in the brain. In particular, we presented the original 362 shape stimuli used by Pasupathy and Connor (2001) to AlexNet, a CNN that was the first of its class to make large gains on general object recognition (Krizhevsky et al., 2012) and that continues to be well-studied (Zeiler and Fergus, 2013; Yosinski et al., 2015; Lenc and Vedaldi, 2014; Donahue et al., 2014; Szegedy et al., 2013; Güçlü and van Gerven, 2015; Bau et al., 2017; Tang et al., 2017; Flachot and Gegenfurtner, 2018). Making direct comparisons between CNN units and V4 neurons using V4 data from two previous studies (Pasupathy and Connor, 2001; El-Shamayleh and Pasupathy, 2016), we found that many units in AlexNet would be indistinguishable from good examples of boundary-curvature-tuned V4 neurons. We applied a CNN visualization technique (Zeiler and Fergus, 2013) to examine whether natural image features that best drive such APC-like units are consistent with the notion of selectivity for curvature of object boundaries. We identify specific V4-like units so that other researchers may utilize them for future studies.

Results

AlexNet contains over 1.5 million units organized in eight major layers (Figure 2), but its convolutional architecture means that the vast majority of those units are spatially offset copies of each other. For example, in the first convolutional layer, Conv1, there are only 96 distinct kernels (Figure 1), but they are repeated everywhere on a 55 × 55 grid (Figure 2E). Thus, for the convolutional layers, Conv1 to Conv5, it suffices to study the selectivity of only those units at the spatial center of each layer. These units, plus all units in the subsequent fully-connected layers, comprise the 22,096 unique units (Figure 2D) that we analyzed.
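The relation between unique kernels and total units can be checked with a few lines of arithmetic. The sketch below is illustrative only: the kernel sizes and strides follow the description of the architecture, while the padding values are the usual Caffe AlexNet settings and are an assumption here, as are the function and variable names.

```python
# (name, kernel width, stride, padding, number of unique kernels)
# Padding values are assumed (standard Caffe AlexNet), not taken from the text.
STAGES = [
    ("Conv1", 11, 4, 0, 96),
    ("Pool1", 3, 2, 0, 96),
    ("Conv2", 5, 1, 2, 256),
    ("Pool2", 3, 2, 0, 256),
    ("Conv3", 3, 1, 1, 384),
    ("Conv4", 3, 1, 1, 384),
    ("Conv5", 3, 1, 1, 256),
    ("Pool5", 3, 2, 0, 256),
]


def grid_width(in_width, kernel, stride, pad):
    """Width of the spatial grid produced by a convolution or pooling stage."""
    return (in_width + 2 * pad - kernel) // stride + 1


def layer_table(input_width=227):
    """Return (name, unique kernels, grid width, total units) for each stage."""
    rows = []
    width = input_width
    for name, kernel, stride, pad, n_unique in STAGES:
        width = grid_width(width, kernel, stride, pad)
        # Total units = unique kernels x number of spatial positions.
        rows.append((name, n_unique, width, n_unique * width * width))
    return rows


if __name__ == "__main__":
    for name, n_unique, width, total in layer_table():
        print(f"{name}: {n_unique} kernels on a {width} x {width} grid = {total} units")
```

Under these assumptions Conv1 comes out as 96 kernels on a 55 × 55 grid, and the grid shrinks to 6 × 6 at Pool5, in line with the description above. Only one unit per kernel (the one at the grid center) needs to be characterized.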

Architecture of the Caffe AlexNet CNN.

(A) A one-dimensional scale view of the fan-in and spatial resolution of units for all 21 sublayers, aligned to their names listed in column (C). The color-filled triangles in convolutional (Conv) layers indicate the fan-in to convolutional units, gray triangles indicate the fan-in to max pooling units, and circles (or ovals) indicate the spatial positions of units along the horizontal dimension. For the Conv layers and their sublayers, each circle in the diagram represents the number of unique units listed in column (D). For example, for each orange circle/oval in the four sublayers associated with Conv1, there are 96 different units in the model (the Conv1 kernels are depicted in Figure 1). The 227 pixel wide input image (top, yellow), is subsampled at the Conv1 sublayer (orange; ‘stride 4’ indicates that units occur only every four pixels) and again at each pooling sublayer (‘stride 2’), until the spatial resolution is reduced to a 6 × 6 grid at the transition from Pool5 to FC6. The pyramid of support converging to the central unit in Conv5 (dark blue triangle) is indicated by triangles and line segments starting from Conv1. Each unit in layers FC6, FC7 and FC8 (shades of green; not all units are shown) receives inputs from all units in the previous layer (there is no spatial dimension in the FC layers, units are depicted in a line only for convenience). Green triangles indicate the full fan-in to three example units in each FC layer. (B) The maximum width (in pixels) of the RFs for units in the five convolutional layers (colors match those in (A)) based on fan-in starting from the input image. For the FC layers, the entire image is available to each unit. (C) Names of the sublayers, aligned to the circuit in (A). Names in bold correspond to the eight major layers, each of which begins with a linear kernel (colorful triangles in (A)). (D) The number of unique units, that is feature dimensions, in each sublayer (double quotes repeat values from previous row). 
(E) The width and height of the spatial (convolutional) grid at each sublayer, or ‘1’ for the FC layers. The total number of units in each sublayer can be computed by multiplying the number of unique kernels (D) by the number of spatial positions (E). (F) The kernel size corresponds to the number of weights learned for each unique linear kernel. Pooling layers have 3 × 3 spatial kernels but have no weights—the maximum is taken over the raw inputs. The Conv2 kernels are only 48 deep because half of the Conv2 units take inputs from the first 48 feature dimensions in Conv1, whereas the other half take inputs from the last 48 Conv1 features; inputs are similarly grouped in Conv4 and Conv5 (see Krizhevsky et al.'s Figure 2). The bottom row provides totals. In addition to the weights associated with each kernel, there is also one bias value per kernel (not shown), which adds 10,568 free parameters to the 60.9 million unique weights.

https://doi.org/10.7554/eLife.38242.003

Responses of CNN units to simple shapes

We first establish that the simple visual stimuli used in V4 electrophysiology experiments (Figure 3A) do in fact drive units within the CNN, which was trained on a substantially different set of inputs: natural photographic images from the ImageNet database (Deng et al., 2009). Across the convolutional layers and their sublayers, we found that our shape stimuli typically evoked a range of responses that was on average similar to, or larger than, the range driven by ImageNet images (e.g., Figure 4, Conv1, compare red to dark blue). The ranges for shapes and images became more similar following normalization layers (e.g., Figure 4, Norm1). In contrast, in the subsequent fully-connected layers, the natural images drove a larger range of responses (Figure 4, FC6, dark blue) than did shapes (red line), and from FC6 onwards the range of responses to shapes was about 1/2 to 1/3 of that for images. The wider dynamic range for images in later layers may reflect the sensitivity of deeper units to category-relevant conjunctions of image statistics that are absent in our simple shape stimuli. These results were robust to changes in stimulus intensity and size (see Figure 4, legend); therefore, we settled on a standard size of 32 pixels so that stimuli fit within all RFs from Conv2 onwards (Figure 2B) with room to spare for translation invariance tests (see Materials and methods).
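The choice of a 32-pixel stimulus can be related to receptive field (RF) width with the standard fan-in recurrence for stacked convolution and pooling stages. The following is a minimal sketch, assuming the kernel sizes and strides of the Caffe AlexNet; the function name and the stage list are illustrative, and the printed widths are what this recurrence yields rather than values quoted from Figure 2B.

```python
# (name, kernel width, stride) for each stage, in order from the input image.
STAGES = [
    ("Conv1", 11, 4),
    ("Pool1", 3, 2),
    ("Conv2", 5, 1),
    ("Pool2", 3, 2),
    ("Conv3", 3, 1),
    ("Conv4", 3, 1),
    ("Conv5", 3, 1),
]


def receptive_fields(stages):
    """Return {name: (RF width in pixels, spacing between neighboring units in pixels)}."""
    rf, jump = 1, 1  # a single input pixel, with adjacent pixels 1 apart
    out = {}
    for name, kernel, stride in stages:
        rf += (kernel - 1) * jump  # each extra kernel tap widens the RF by one jump
        jump *= stride             # striding spreads neighboring units further apart
        out[name] = (rf, jump)
    return out


if __name__ == "__main__":
    for name, (rf, jump) in receptive_fields(STAGES).items():
        fits = "fits" if rf > 32 else "does not fit"
        print(f"{name}: RF {rf} px, unit spacing {jump} px; a 32 px shape {fits}")
```

With these settings the Conv1 RF is 11 pixels wide, too small for a 32-pixel shape, whereas the Conv2 RF is 51 pixels, which leaves a margin for shifting the stimulus.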

The angular position and curvature (APC) model and associated stimuli.

(A) The set of 51 simple closed shapes from Pasupathy and Connor, 2001. Shapes are shown to relative scale. Shape size, given in pixels in the text, refers to the diameter of the big circle (top row, 2nd shape from the left). Each shape was shown at up to eight rotations as dictated by rotational symmetry, e.g., the small and large circles (upper left) were only shown at one rotation. This yielded a set of 362 unique shape stimuli. Stimuli were presented as white-on-black to the network (not as shown here). (B) Example shape with points along the boundary (red circles) indicating where angular position and curvature values were included in the APC model. (C) Points from the example shape in (B) are plotted in the APC plane where x-axis is angular position and y-axis is normalized curvature. Note the red circle furthest to the left at 0° angular position and negative curvature corresponds to the concavity at 0° on the example shape in (B). A schematic APC model is shown (ellipse near center of diagram) that is a product of Gaussians along the two axes. This APC model would describe a neuron with a preference for mild concavities at 135°.

https://doi.org/10.7554/eLife.38242.004
Figure 4 with 1 supplement
Response distributions for shapes and natural images in representative CNN layers.

In each panel, the frequency distribution of the response values across all unique units in a designated CNN sublayer is plotted for four stimulus sets: our standard shape set (red; size 32 pixels, stimulus intensity 255, see Materials and methods), larger shapes (cyan; size 64 pixels, intensity 255), dimmer shapes (green; intensity 100, size 32 pixels) and natural images (dark blue). Natural images (n = 362, to match the number of shape stimuli) were pulled randomly from the ImageNet 2012 competition validation set. From top to bottom, panels show results for selected sublayers: Conv1, Relu1, Norm1, Conv2, FC6 and Prob8 (Figure 2C lists sublayer names). The number of points in each distribution is given by the number of stimuli (362) times the number of unique units in the layer (Figure 2D). The vertical axis is log scaled as most distributions have a very high peak at 0. For Conv1, standard shapes drove a wider overall dynamic range than did images because of the high intensity edges that aligned with parts of the linear kernels (Figure 1). This was not the case for larger shapes because they often over-filled the small Conv1 kernels. For Relu1, negative responses are removed by rectification after a bias is added. At Conv2, there is little difference between the four stimulus sets on the positive side of the distribution. This changes from FC6 forward, where natural images drive a wider range of responses. For Prob8, natural images (dark blue line) sometimes result in high probabilities among the 1000 categorical units, whereas shapes do not.

https://doi.org/10.7554/eLife.38242.005

Although our shapes drove responses in all CNN layers, many units responded sparsely to both the shapes and natural images. Across all layers, 13% of units had zero responses to all shape stimuli and 7% had non-zero response to only one stimulus, that is one shape at one rotation. Because we aim to identify CNN units with V4-like responses to shapes, we excluded from further analysis units with response sparseness outside the range observed in V4 (see Materials and methods and Figure 4—figure supplement 1).

Tuning for boundary curvature at RF center

To assess whether CNN units have V4-like boundary curvature selectivity, we measured responses of each unique CNN unit to our shape stimuli (up to eight rotations for each shape in Figure 3A), centered in the RF. We then fit responses with the angular position and curvature (APC) model (Pasupathy and Connor, 2001), which captures neuronal selectivity as the product of a Gaussian tuning curve for curvature and a Gaussian tuning curve for angular position with respect to the center of the shape (Figure 3B,C and Materials and methods). We found that the responses of many units in the CNN were fit well by the APC model. For example, the responses of Conv2 unit 113 (i.e., Conv2-113) were highly correlated (r=0.78, n=362) to those of its best-fit APC model (Figure 5A). The fit parameters indicate selectivity for a sharp convexity (μc=1.0, σc=0.53) pointing to the upper left (μa=135°, σa=23°), and indeed the eight most preferred shapes all include such a feature (Figure 5B, pink), whereas the least preferred shapes (cyan) do not. A second example unit, FC7-3591 (Figure 5C) with a high APC r-value (0.77) had fit parameters (see legend) reflecting selectivity for concavities roughly toward the top of the shape, consistent with most of its preferred shapes (Figure 5D). These results were similar to those for well-fit V4 neurons. For example, the V4 unit a1301 (Figure 5E,F) had an APC fit (r=0.76, p<0.001, n=362) reflecting a preference for a sharp convexity, like the first CNN example unit, except with a different preferred angular position (μa=180°).
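To make the APC model concrete, the following is a minimal sketch of how a predicted response could be computed for one shape. It assumes that a shape is summarized as a list of (angular position, curvature) boundary points, that the prediction is the maximum of the Gaussian product over those points, and that angular distance wraps around 360°; these choices and all names are illustrative, and the fitting procedure itself is described in Materials and methods.

```python
import math


def angular_difference(a_deg, b_deg):
    """Signed difference between two angles in degrees, wrapped to [-180, 180)."""
    return (a_deg - b_deg + 180.0) % 360.0 - 180.0


def apc_response(boundary_points, mu_c, sigma_c, mu_a, sigma_a):
    """Predicted response to one shape under a product-of-Gaussians APC model.

    boundary_points: iterable of (angular position in degrees, normalized curvature).
    Assumption: the shape's response is set by its best-matching boundary point.
    """
    best = 0.0
    for angle_deg, curvature in boundary_points:
        g_curv = math.exp(-((curvature - mu_c) ** 2) / (2.0 * sigma_c ** 2))
        d = angular_difference(angle_deg, mu_a)
        g_ang = math.exp(-(d ** 2) / (2.0 * sigma_a ** 2))
        best = max(best, g_curv * g_ang)
    return best


if __name__ == "__main__":
    # Parameters taken from the Figure 5 legend for Conv2-113.
    params = dict(mu_c=1.0, sigma_c=0.53, mu_a=135.0, sigma_a=23.0)
    # Two made-up shapes: one with a sharp convexity at 135 deg, one without.
    with_feature = [(135.0, 1.0), (0.0, -0.3), (270.0, 0.2)]
    without_feature = [(315.0, 1.0), (45.0, 0.1)]
    print(apc_response(with_feature, **params))
    print(apc_response(without_feature, **params))
```

The APC r-value reported for a unit is then the correlation, across the 362 stimuli, between such predictions and the measured responses.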

Boundary curvature selectivity for CNN units compared to V4 neurons.

(A) APC model prediction vs. CNN unit response for an example CNN unit from an early layer (Conv2-113). (B) The top and bottom eight shapes sorted by response amplitude (most preferred shape is at upper left, least at lower right) reveal a preference for convexity to the upper left (such a feature is absent in the non-preferred shapes). This is consistent with the APC fit parameters, μc=1.0, σc=0.53, μa=135°, σa=23°. (C) Predicted vs. measured responses for another well-fit example CNN unit (FC7-3591) but in a later layer. (D) Top and bottom eight shapes for example unit in (C). The APC model fit was μc=-0.1, σc=0.15, μa=112°, σa=44°. (E) Model prediction vs. neuronal mean firing rate (normalized) for the V4 neuron (a1301) that had the highest APC fit r-value. (F) The top eight shapes (purple) all have a strong convexity to the left, whereas the bottom eight (cyan) do not. The APC model fit was μc=1.0, σc=0.39, μa=180°, σa=23°. (G) The cumulative distributions (across units) of APC r-values are plotted for the first sublayer of each major CNN layer (boldface names in Figure 2C) from Conv1 (black) to FC8 (lightest orange). The other sublayers (distributions not shown for clarity) tended to have lower APC r-values but the trend for increasing APC r-value with layer was similar. For comparison, red line shows cumulative distribution for 109 V4 neurons (Pasupathy and Connor, 2001), and pink line shows V4 distribution corrected for noise (see Materials and methods). (H) The cumulative distribution of r-values for the APC fits for all CNN units (black), CNN units with shuffled responses (green), units in an untrained CNN (blue) and V4 (red and pink). The far leftward shift of the green line shows that fit quality deteriorates substantially when the responses are shuffled across the 362 stimuli within each unit.

https://doi.org/10.7554/eLife.38242.007

For each layer of the CNN, we computed the distributions of the APC fit r-values across units (Figure 5G). There is a clear but modest trend for the cumulative distribution functions to shift rightward for higher layers (orange lines, Figure 5G), indicating that deeper layer units fit better on average to the APC model. The first CNN layer, Conv1 (black line) stands apart as having a far leftward-shifted r-value distribution, but this occurs simply because most of the stimuli overfill the small Conv1 RFs. Compared to V4 neurons studied with the same shape set (red line, Figure 5G), the median r-values (corresponding to 0.5 on the vertical axis) for layers Conv2 to FC8 were somewhat higher than that for V4, but the V4 and CNN curves matched closely at the upper range, with the best V4 unit having a higher APC r-value than any CNN unit.

One factor that could influence our CNN to V4 comparison is that CNN responses are noise-free, whereas V4 responses have substantial trial-to-trial variability. We extended the method of Haefner et al. (2009) to remove the bias that variability introduces into the correlation coefficient (see Materials and methods). The distribution of the corrected estimates of the r-values across the V4 population (pink line, Figure 5G) has a higher median than that for any of the CNN layers. This suggests that, had it been possible to record many more stimulus repeats to eliminate most of the noise in the V4 data, then one would find that the V4 population somewhat out-performs even the deep layers in AlexNet in fitting the APC model. Overall, regardless of whether we consider the raw or corrected V4 r-values, we would still conclude that the CNN contains units that cover the vast majority of the range of APC r-values found in V4 when tested with the same stimuli.

To determine whether the goodness of fit to the APC model was a result of the network architecture alone or if training on the object categorization task played a role, we fit the model to units in an untrained network in which weights were assigned random initial values (see Materials and methods) and found that only 14% had APC r-values above 0.5 (Figure 5H, blue trace) and none reached the upper range of r-values observed in the trained CNN (Figure 5H, black line, aggregate of all layers) or in V4 (red line). This suggests that training is important for achieving an APC r-value distribution consistent with V4.

To control for overfitting, we re-fit the APC model to all CNN units after shuffling the responses of each unit across the 362 shapes. After shuffling, 99% of units had r<0.07 (Figure 5H, green), whereas in the original data (Figure 5H, black) 99% of units had r>0.07. Thus, the APC model largely reflects specific patterns of responses of the units to the shapes, and not an ability of the model to fit any random or noisy set of responses (see also Pasupathy and Connor, 2001).

Translation invariance

To have V4-like boundary curvature tuning, a CNN unit must not only fit the APC model well for stimuli centered in the RF, but must also maintain that selectivity when stimuli are placed elsewhere in the RF; that is, it must show translation invariance like that found in V4 for our stimulus set (Pasupathy and Connor, 2001; El-Shamayleh and Pasupathy, 2016). For example, responses of a V4 neuron to 56 shapes centered in the RF are highly correlated (r=0.97, p<0.0001, n=56) with responses to the same shapes presented at a location offset by 1/6 of the RF diameter (Figure 6A), indicating that shapes that drive relatively high (or low) responses at one location also tend to do so at the other location. This can be visualized across the RF using the position-correlation function (Figure 6B, red), which plots response correlation as a function of distance from a reference position (e.g., RF center). For this V4 neuron, the RF profile, measured by the mean response across all stimuli at each position (Figure 6B, green; see Materials and methods), falls off faster than the position-correlation function, consistent with a high degree of translation invariance.
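The two curves compared in this analysis can be sketched in a few lines. The example below is illustrative: responses are held in a hypothetical dictionary mapping stimulus position to a list of responses (one per shape), the position-correlation function is the Pearson correlation with the responses at a reference position, and the RF profile is taken as the peak-normalized root sum of squared responses (the definition given for CNN units in the Figure 6 legend; the V4 profile uses the mean response instead).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def position_correlation(responses_by_position, reference):
    """Correlation of the responses at each position with those at the reference."""
    ref = responses_by_position[reference]
    return {pos: pearson_r(ref, resp) for pos, resp in responses_by_position.items()}


def rf_profile(responses_by_position):
    """Peak-normalized response magnitude (root sum of squares) at each position."""
    mags = {pos: math.sqrt(sum(r * r for r in resp))
            for pos, resp in responses_by_position.items()}
    peak = max(mags.values())
    return {pos: (m / peak if peak > 0 else 0.0) for pos, m in mags.items()}


if __name__ == "__main__":
    base = [0.0, 1.0, 3.0, 2.0]  # made-up responses to four shapes at RF center
    responses = {
        0: base,
        6: [0.5 * r for r in base],  # weaker response, same shape preferences
        14: [2.0, 3.0, 0.0, 1.0],    # strong response, different preferences
    }
    print(position_correlation(responses, reference=0))
    print(rf_profile(responses))
```

In this toy case the position at 14 pixels keeps a full-strength response while its correlation with the center is negative, which is the signature of low translation invariance.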

Translation invariance as a function of position across the RF.

(A) For an example neuron from the V4 study of El-Shamayleh and Pasupathy (2016), the responses to stimuli shifted away from the RF center by 1/6 of the estimated RF size are plotted against those placed in the RF center. The overall response magnitude decreases with shift, but a strong linear relationship is maintained between responses at the two positions. (B) In green, the RF profile of the same neuron from (A) is plotted (average response at each position). In red, the correlation of the responses at each position with the responses at RF center. (C) For unit Conv2-113, responses to stimuli shifted six pixels to the right are plotted against responses for centered stimuli. (D) For the same unit in (C), responses for stimuli shifted 14 pixels to the left vs. responses for centered stimuli. (E) For unit Conv2-113, the position-correlation function is plotted in red. The RF profile, that is the normalized response magnitude (square root of sum of squared responses) across all shapes is plotted in green. The region over which TI is measured, where all stimuli are wholly within the CRF (see Materials and methods), is within dotted lines bookending horizontal black bar. The unit is less translation invariant because it continues to have a large response even when correlation drops quickly from center. This is reflected in the lower TI score of 0.7. (F) The averages of the correlation and RF profiles across all units in the Conv2 layer show that correlation drops off much more rapidly than the RF profile. (G) Same as in (E) but for a unit in the 4th convolutional layer (Conv4-369). There is a broadened correlation profile compared to the Conv2 unit. (H) For Conv4, the average position-correlation function (red) has a wider peak than that for Conv2, more closely matching the shape of the average RF profile (green). 
It also has serrations that occur eight pixels apart, which corresponds to the pixel stride (discretization) of Conv2 (Figure 2A; see Materials and methods). (I) The shape-tuned example unit FC7-3591 (Figure 5C) in the final layer is highly translation invariant (TI = 0.91). (J) The response profile and correlation stay high across the center of the input field on average across units in FC7.

https://doi.org/10.7554/eLife.38242.008

A similar analysis for the example CNN unit, Conv2-113, reveals a steep drop-off in its position-correlation function (Figure 6E, red) compared to its RF profile (green). In particular, when stimuli were shown 13 pixels to the left of center (black arrow) the aggregate firing rate (see Materials and methods) was 87% of maximum, but the correlation was near zero. The largely uncorrelated selectivity at two points within the RF indicates low translation invariance. Thus, despite its high APC r-value (Figure 5A), its low translation invariance diminishes it as a good model for V4 boundary contour tuning. This behavior was typical in layer Conv2, as demonstrated by the position-correlation function averaged across all units in the layer (Figure 6F). Specifically, the correlation (red) falls off rapidly compared to the RF profile (green) even for small displacements of the stimulus set.

For deeper layers, RFs tend to widen and translation invariance increases. This is exemplified by unit 369 in the fourth convolutional layer (Figure 6G) and the Conv4 layer average (Figure 6H): on average the correlation (red) more closely follows the RF profile (green) and does not drop to zero near the middle of the RF. In the deepest layers, exemplified by the FC7 unit from Figure 5C, the RFs become very broad (Figure 6I, green) and there is very little fall-off in correlation (red) even for shifts larger than the stimulus size. This is true for the layer average as well (Figure 6J). These plots show that shape selectivity becomes more translation invariant relative to RF size, and not just in terms of absolute distance, as signals progress to deeper layers.

To quantify translation invariance for each unit with a single number, we defined a metric, TI, based on the normalized average covariance of the response matrix across positions (see Materials and methods). The values of this metric, which would be one for perfect (and zero for no) correlation across positions, are shown for the example CNN units in Figure 6E,G and I. The trend for increasing TI with layer depth seen in Figure 6 (panels F, H and J) is borne out in the cumulative distributions of TI broken down by CNN layer (Figure 7A). For comparison, the cumulative distribution of our TI metric for 39 V4 neurons from the study of El-Shamayleh and Pasupathy (2016) is plotted (red). Only the deepest four layers (Conv5 to FC8) had median TI values that approximated or exceeded that of our V4 population. Conv1 is excluded because its RFs are far too small to fully contain our stimuli at multiple positions (see Materials and methods). The substantial increase in TI for deeper layers is striking relative to the modest progression in APC r-values observed in Figure 5G.
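One way to read "normalized average covariance across positions" is the sum of covariances between responses at all pairs of distinct positions, divided by the sum of the corresponding products of standard deviations. The sketch below implements that reading as an assumption; the exact definition is the one given in Materials and methods, and the function names are illustrative.

```python
import math


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


def translation_invariance(responses):
    """TI for one unit.

    responses: list of response vectors, one per stimulus position, each holding
    the responses to the same ordered set of shapes.
    Assumed form: sum over i != j of cov(r_i, r_j), divided by the
    sum over i != j of sd(r_i) * sd(r_j).
    """
    sds = [math.sqrt(covariance(r, r)) for r in responses]
    cov_sum = 0.0
    sd_sum = 0.0
    for i, r_i in enumerate(responses):
        for j, r_j in enumerate(responses):
            if i == j:
                continue
            cov_sum += covariance(r_i, r_j)
            sd_sum += sds[i] * sds[j]
    return cov_sum / sd_sum if sd_sum > 0 else 0.0


if __name__ == "__main__":
    base = [0.0, 1.0, 3.0, 2.0]  # made-up responses to four shapes
    # Same shape preferences at every position, only the gain changes.
    invariant = [base, [0.5 * r for r in base], [2.0 * r for r in base]]
    # Shape preferences change between the two positions.
    variable = [base, [2.0, 3.0, 0.0, 1.0]]
    print(translation_invariance(invariant))
    print(translation_invariance(variable))
```

With this form, a unit whose responses at every position are scaled copies of one another scores 1 regardless of how the overall response magnitude varies across the RF, so the metric separates selectivity from RF profile.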

Figure 7 with 1 supplement
Cumulative distributions of the TI metric for the CNN and V4.

(A) The cumulative distributions (across units) of TI are plotted for the first sublayer of each major CNN layer (boldface names in Figure 2C) from Conv2 (black) to FC8 (lightest orange). There is a clear increase in TI moving up the hierarchy. The TI distribution for V4 is plotted in red, and an upper bound for noise correction is plotted in pink (see Materials and methods). The other sublayers (distributions not shown for clarity) tended to have lower TI values but the trend for increasing TI with layer was similar. (B) The cumulative distribution of TI across layers in the untrained CNN. There is a large shift toward lower TI values in comparison to the trained CNN (faint grey and red and pink lines reproduce traces from panel A).

https://doi.org/10.7554/eLife.38242.009

An intuitive motivation for CNN architecture, chiefly convolution (repetition of linear filtering at translated positions) and max pooling, is the desire to achieve a translation invariant representation (Fukushima, 1980; Rumelhart et al., 1986; Riesenhuber and Poggio, 1999; Serre et al., 2005; Cadieu et al., 2007). This might lead to the idea that responses of units within these nets are translation invariant by design, but the observation that strong translation invariance only arises in later layers begins to deflate this notion. Furthermore, we computed TI for the same units and stimuli but in the untrained network. We found that the degradation of TI in an untrained network (Figure 7B) was even more dramatic than the degradation of APC tuning (Figure 5H). Specifically, it was very rare for any FC-layer unit in the untrained network to exceed the median TI values for those layers in the trained network.

To assess the influence of neuronal noise on our comparison of TI between V4 and AlexNet, we estimated an upper bound on how much TI could have been reduced by V4 response variability (see Materials and methods). TI tended to be less influenced by noise for neurons having higher TI; in particular, the upward correction of the r-value was negatively correlated with the raw TI value (r=-0.6, p<0.001, n=39). Thus, for cells at the upper range of TI, we do not expect sampling variability to strongly influence our measurements. The distribution of V4 TI values corrected for noise is superimposed in Figure 7A and B (pink line). The modest rightward shift in the corrected distribution relative to the original raw distribution (red line) does not change our conclusion that only the deepest several layers in AlexNet have average TI values that match or exceed that of V4.

Our TI metric above was measured for horizontal stimulus shifts; however, we also measured TI for vertical shifts and verified that there was a high correlation between these two (r = 0.79) (Figure 7—figure supplement 1), particularly for high TI values.

Identifying and visualizing preferences of candidate APC-like units

We now plot the joint distribution of our metrics for boundary contour tuning and translation invariance described above to identify candidate APC-like CNN units. Figure 8 shows a unit square with APC r-value on the vertical axis and translation invariance, TI, on the horizontal axis. An ideal unit would be represented by the upper right corner, (1,1). The hypothetical best V4 neuron lies within this space at the red X (TI=0.97, r=0.80). This best V4 point is a hybrid of the observed highest APC r-value from the Pasupathy and Connor (2001) study, and the highest TI value from our re-analysis of the El-Shamayleh and Pasupathy (2016) data. In comparison, the most promising CNN unit lies at the orange star (TI=0.91, r=0.77), very close to the hypothetical best V4 point. To demonstrate how the CNN population falls on this map, we plotted 100 randomly chosen units from an early layer, Conv2 (dark brown), and a deep layer, FC7 (orange). Although only a few FC7 units approach the hypothetical best V4 point, many units are better than the average V4 neuron (red lines, Figure 8). In contrast, most units from Conv2 are much further from ideal V4 behavior, but they span a large range, indicating that even in the second convolutional layer, some units have ended up, after training, having high TI and high APC r-values.
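Ranking units by how close they fall to the ideal corner (1,1) of this map can be sketched as follows. Euclidean distance is an assumption made here for illustration, since the text does not name a distance measure, and the second and third entries in the example are made-up units.

```python
import math


def distance_to_ideal(ti, apc_r):
    """Euclidean distance from a (TI, APC r-value) point to the ideal corner (1, 1)."""
    return math.hypot(1.0 - ti, 1.0 - apc_r)


def rank_units(units):
    """Sort unit names from most to least APC-like.

    units: dict mapping unit name -> (TI, APC r-value).
    """
    return sorted(units, key=lambda name: distance_to_ideal(*units[name]))


if __name__ == "__main__":
    units = {
        "FC7-3591": (0.91, 0.77),        # values reported for the orange star
        "hypothetical-A": (0.40, 0.70),  # made-up: good APC fit, poor invariance
        "hypothetical-B": (0.85, 0.30),  # made-up: invariant, poor APC fit
    }
    for name in rank_units(units):
        print(name, round(distance_to_ideal(*units[name]), 3))
```

A combined score of this kind penalizes a unit that does well on only one of the two axes, which matches the requirement that a V4-like unit be both well fit by the APC model and translation invariant.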

Summary of the similarity of CNN units to V4 neurons in terms of translation invariance (TI) and fit to the APC model.

For 100 randomly selected CNN units from Conv2 (brown) and FC7 (orange), APC r-value is plotted against TI. The hypothetical highest scoring V4 unit (red ×) is the combination of the highest TI score and the highest APC fit from separate V4 data sets (0.97, 0.80). The highest scoring unit in the CNN (FC7-3591, from Figure 5C, Figure 6I and Figure 12C) is indicated by the orange star (0.91, 0.77) and is close to the hypothetical best V4 unit. The red lines indicate the mean V4 values along each axis, not including any correction for noise (see Figures 5 and 7 for estimated noise correction, pink lines).

https://doi.org/10.7554/eLife.38242.011

To determine whether units identified as being the most APC-like, that is those closest to (1,1) in Figure 8, respond to natural images in a manner qualitatively consistent with boundary curvature selectivity in an object-centered coordinate frame, we identified image patches that were most facilitatory (drove the greatest positive responses) and most suppressive (drove the greatest negative responses) in the 50,000-image test set from the 2012 ImageNet competition. We then used a visualization technique (Zeiler and Fergus, 2013) to project back (‘deconvolve’) from the unit onto each input image through the connections that most strongly contributed to the response, thereby revealing the regions and features supporting the response. We examined the ten most APC-like units in each of seven layers from Conv2 to FC8. Below we describe major qualitative observations as a function of layer depth.
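The selection of the most APC-like units (those closest to (1,1) in the TI vs. APC r-value plane) can be sketched as follows. This is a minimal illustration that assumes Euclidean distance as the measure of closeness; the function name `rank_apc_like` is hypothetical, and the three example units use the TI and APC r-values listed in Table 1 for Conv2-108, FC7-3591 and Conv2-113.

```python
import numpy as np

def rank_apc_like(ti, apc_r, k=10):
    """Return the indices (and distances) of the k units closest to the
    ideal point (TI = 1, APC r = 1). Euclidean distance is an assumption."""
    ti = np.asarray(ti, dtype=float)
    apc_r = np.asarray(apc_r, dtype=float)
    # Distance of each unit from the upper right corner of the unit square
    dist = np.hypot(1.0 - ti, 1.0 - apc_r)
    order = np.argsort(dist, kind="stable")[:k]
    return order, dist[order]

# Example units: Conv2-108, FC7-3591, Conv2-113 (values from Table 1)
ti = [0.76, 0.89, 0.70]
apc_r = [0.67, 0.78, 0.76]
idx, d = rank_apc_like(ti, apc_r, k=2)
print(idx, d)
```

With these inputs the FC7 unit ranks first, because it is high on both axes, whereas each Conv2 unit is penalized by its lower TI or lower APC r-value.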

Visualizing the ten most APC-like units in Conv2 revealed selectivity for orientation, conjunctions thereof, or other textures. For example, unit Conv2-113 (from Figures 5A and 8E), was best driven by lines at a particular orientation (Figure 9A) and most suppressed by oriented texture running roughly orthogonal to the preferred. This explains why this unit responded well only to shapes that have long contours extending to a point at the upper left, and poorly to shapes having a broad convexity or concavity to the upper left (Figure 5B). Another Conv2 example (Figure 9B) was driven best by the conjunction of a vertical that bends to the upper left and a horizontal near the top of the RF that meet at a point in the upper left. Examining the input images reveals that textures and lines (e.g., the bedspread and rocking chair cushion) are as good at driving the unit as are boundaries of objects. A third unit (Figure 9C) preferred conjunctions of orientations and was suppressed by lines running orthogonal to the preferred vertical orientation. The preferred pattern was usually not an object boundary, but could surround negative space or be surface texture. These observations, taken together with the poor translation invariance of Conv2 relative to deeper layers, suggest that units at this early stage are not coding boundary conformation in an object-centered way, but that any pattern matching the preferred features of the unit, regardless of its position with respect to an object, will drive these units well.

Visualization of APC-like units in layer Conv2.

(A) For unit Conv2-113, the five most excitatory image patches are indicated by red squares superimposed in the raw images (top row, left side, from left to right). The size of the red square corresponds to the maximal extent of the image available to Conv2 units (see Figure 2B). In corresponding order, the five deconvolved features are shown at the upper right, with a 3x scale increase for clarity. The blank rectangular region at the right side of the second feature indicates that this part of the unit RF extended beyond the input image (such regions are padded with zero during response computation). For the same unit, the lower row shows the five most suppressive image patches and their corresponding deconvolved features. We examined the top ten most excitatory and suppressive images, and for all examples in this and subsequent figures, they were consistent with the top five. Below the natural images are the top five and bottom five shapes (white on black background) in order of response from highest (at left) to lowest (at right). Shapes are shown at 2x scale relative to images, for visibility. (B) Same format as (A), but for unit Conv2-108. (C) Same format as (A), but for unit Conv2-126. In all examples, the most suppressive features (bottom row in each panel) tend to run orthogonal to, and at the same RF position as, the preferred features (top row in each panel). For APC fit parameters, see Table 1 in Results text. Thumbnails featuring people were redacted for display in the published article, in line with journal policy. Input image thumbnails were accessed via the ImageNet database and the original image URLs can be found through this site: http://image-net.org/about-overview.

© 2018 Various. Image thumbnails were taken from http://image-net.org and may be subject to copyright. They are not available under CC-BY and are exempt from the CC-BY 4.0 license.

https://doi.org/10.7554/eLife.38242.012

From Conv3 to Conv5, the visualizations of the most APC-like units were more often consistent with an encoding of portions of object boundaries. Unit Conv3-156 was driven best by the broad downward border of light objects (Figure 10A), particularly dog paws. The most suppressive features for this ‘downward-dog-paw’ unit were dark regions, often negative space, with relatively straight edges. The deconvolved features tended to emphasize the lower portion of the object border. A similar example, Conv3-020, had a preference for the upper border of bright forms (e.g., flames; Figure 10B) and was suppressed by the upper border of dark forms (often dark hair on heads). This unit was representative of a tendency for selectivity for bright regions with broad convexities (e.g., Conv4-171, not shown). We assume that more dark-preferring units would have been found had our stimuli been presented as black-on-white. These trends continued with greater category specificity in Conv5. For example, Conv5-161 was driven best by the rounded, convex tops of white dog heads (Figure 10C), including some contribution from the eyes, and was most suppressed by human faces below the eyebrows. Unit Conv5-144 was best driven by the upward facing points of the tops of objects, particularly wolf ears and steeples (Figure 10D). This ‘wolf-ear-steeple’ unit was most suppressed by rounded forms, and may be important for distinguishing between the many dog categories with and without pointed ears. In addition to units like these, which appeared to be selective for portions of boundaries, there were several units that appeared to detect entire circles (Figure 11), and thus fit well to an APC model with specificity for curvature but broadly accepting of any angular position.

Visualization of APC-like units in layers Conv3 to Conv5.

(A) Visualization for unit Conv3-156, using the same format as Figure 9. Deconvolved features are scaled by 1.8 for visibility. (B) Same as (A), for unit Conv3-020. (C) Same for unit Conv5-161, but deconvolved features are scaled by 1.15. (D) Same as (C), but for unit Conv5-144. For APC fit parameters, see Table 1 in main text. Thumbnails featuring people were redacted for display in the published article, in line with journal policy. Input images were accessed via the ImageNet database and the original image URLs can be found through this site: http://image-net.org/about-overview.

© 2018 Various. Image thumbnails were taken from http://image-net.org and may be subject to copyright. They are not available under CC-BY and are exempt from the CC-BY 4.0 license.

https://doi.org/10.7554/eLife.38242.013
Visualization of APC-like units: circle detectors.

These examples are representative of many units that were selective for circular forms. (A) Unit Conv3-334 was selective for a wide variety of circular objects near its RF center and was suppressed by circular boundaries entering its RF from the surround. Deconvolved feature patches are scaled up by 1.8 relative to raw images. (B) Unit Conv4-203 was also selective for circular shapes near the RF center, but showed category specificity for vehicle wheels. Suppression was not category specific but was, like that in (A), related to circular forms offset from the RF center. The higher degree of specificity in (B) is consistent with this unit being deeper than the example in (A). Deconvolved features are scaled by 1.4 relative to raw images. APC fit parameters are given in Table 1. Thumbnails featuring people were redacted for display in the published article, in line with journal policy. Input images were accessed via the ImageNet database and the original image URLs can be found through this site: http://image-net.org/about-overview.

© 2018 Various. Image thumbnails were taken from http://image-net.org and may be subject to copyright. They are not available under CC-BY and are exempt from the CC-BY 4.0 license.

https://doi.org/10.7554/eLife.38242.014

In the FC layers, the most excitatory images were revealing about unit preferences, but the deconvolved features provided less insight because power in the back projection was typically widely distributed across the input image. For example, unit FC6-3030 (Figure 12A) responded best to hourglasses, but deconvolution did not highlight a particular critical feature. The shape stimuli driving the highest and lowest five responses (Figure 12A, bottom row) suggest that a cusp (convexity) facing upward is a critical feature, consistent with the APC model fit (Table 1). The most suppressive natural images (not shown) were more diverse than those for the Conv layers, and thus provided little direct insight. Broadly, many of the top ten APC-like units in the FC layers fell into two categories: those preferring images with rounded borders facing approximately upwards (we refer to these as the ‘balls’ group) and those associated with a concavity between sharp convexities, also facing approximately upwards (the ‘wolf-ears’ group). For example, FC7-3192 (Figure 12B) responded best to images of round objects (e.g., golf balls) and to shapes having rounded tops. FC7-3591 (Figure 12C), which was the most APC-like unit by our joint TI-APC index (orange star in Figure 8), responded best to starfish and rabbit-like ears pointing up. Shapes with a convexity at 112° drove the unit most strongly, whereas shapes with rounded tops and overall vertical orientation yielded the most negative responses. FC7-3639 (Figure 12D) is an example of a wolf-ears unit, and its preferred shapes include those with a convexity pointing upwards flanked by one or two sharp points. In FC8, where there is a one-to-one mapping from units onto the 1000 trained categories, the top ten APC units were evenly split between the wolf-ears group (categories: kit fox, gray fox, impala, red wolf and red fox) and the balls group (categories: ping-pong balls, golf balls, bathing caps, car mirrors and rugby balls). For example, unit FC8-271 (Figure 12E) corresponds to the red wolf category and units FC8-433 and FC8-722 correspond to the bathing cap and ping-pong ball categories, respectively.

Visualization of APC-like units in the FC layers.

(A) For unit FC6-3030, the top five images from the test set are shown above their deconvolved feature maps. The maximal RF for all FC units includes the entire image. At bottom, the top five shapes are shown in order from left to right, followed by the bottom five shapes such that the shape associated with the minimum response is the rightmost. For visibility, shapes are shown here at twice the scale relative to the images. (B) For unit FC7-3192, same format as (A). (C) For unit FC7-3591, same format as (A). (D) For unit FC7-3639, same format as (A). (E) For unit FC8-271, same format as (A), except the category of this output-layer unit is indicated as 'Red wolf.’ (F) For unit FC8-433, same format as (E). (G) For unit FC8-722, same format as (E). See Table 1 for APC fit values for all units. Thumbnails featuring people were redacted for display in the published article, in line with journal policy. Input images were accessed via the ImageNet database and the original image URLs can be found through this site: http://image-net.org/about-overview.

© 2018 Various. Image thumbnails were taken from http://image-net.org and may be subject to copyright. They are not available under CC-BY and are exempt from the CC-BY 4.0 license.

https://doi.org/10.7554/eLife.38242.015
Table 1
Fit parameters and TI metric for example CNN units.

Unit numbers are given starting at zero in each sublayer. The APC model parameters, μc, σc, μa and σa, correspond to those in Equation 2. The TI metric is given by Equation 3. For visualization of preferred stimuli for example units, see Figures 9–12.

https://doi.org/10.7554/eLife.38242.016
Layer   Unit   APC r   μc     σc     μa    σa    TI
Conv2   108    0.67    0.7    0.72   134   34    0.76
Conv2   113    0.76    0.9    0.39   134   22    0.70
Conv2   126    0.67    0.1    0.72   337   51    0.81
Conv3   20     0.68    0.5    0.01   224   171   0.90
Conv3   156    0.67    0.5    0.01   337   171   0.79
Conv3   334    0.73    0.2    0.12   157   171   0.74
Conv4   203    0.71    0.2    0.16   292   171   0.77
Conv5   144    0.65    0.9    0.29   89    30    0.89
Conv5   161    0.72    0.2    0.16   112   87    0.85
FC6     3030   0.73    −0.1   0.16   89    26    0.89
FC7     3192   0.75    0.2    0.16   112   171   0.91
FC7     3591   0.78    −0.1   0.16   112   44    0.89
FC7     3639   0.76    −0.1   0.16   112   114   0.92
FC8     271    0.73    −0.1   0.16   112   114   0.91
FC8     433    0.70    0.3    0.21   112   130   0.91
FC8     722    0.72    0.2    0.08   112   130   0.93

What is most striking about the deep-layer (FC) units is that, in spite of their tendency to be more categorical, that is to respond to a wolf in many poses or a ping-pong ball in many contexts, they still showed systematic selectivity to our simple shapes. We hypothesized that these FC units were driven by a range of image properties that correlated well with the target category, and that shape was simply one among others such as texture and color. We examined how much better the units were driven by the best natural images compared to our best standard shapes. Figure 13 shows, for the top ten APC-like units in each layer, that the best image drove responses on average about 2 times higher than did the best shape for Conv2-4, about 4–5 times higher for FC6-7 and more than 8 times higher for FC8. This is consistent with the hypothesis that shape-tuned mechanisms contribute to the selectivity of these units, but are not sufficient in the absence of other image properties to drive the FC layers strongly. Nevertheless, the selectivity for simple shapes at the final layer appears to be qualitatively consistent with the category label. Notably, only two APC-like units responded better to a shape than to any natural image; both were Conv4 units selective for bright circular regions (not shown), and the best stimulus was our large circle (Figure 3A, second from upper left).

Comparing the maximum responses driven by images to those driven by shapes for APC-like units.

For a given CNN unit, we computed the ratio of the maximum response across natural images (50,000 image test set) to the maximum response across our set of 362 shapes. The average of this ratio across the top ten APC-like units in each of seven layers (Conv2 to FC8) is plotted. Error bars show SD. In a few cases, the maximum response to shapes was a negative value and these cases were excluded: one unit for Conv3 and two for FC6 and FC7.
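The ratio described in this caption can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function names, the array layout (units by stimuli), the use of the sample SD (`ddof=1`), and the treatment of a zero maximum shape response as an exclusion are all assumptions.

```python
import numpy as np

def image_to_shape_ratio(image_resp, shape_resp):
    """Per-unit ratio of max response over images to max response over shapes.

    image_resp: array (n_units, n_images); shape_resp: array (n_units, n_shapes).
    Units whose best shape response is not positive get NaN (excluded),
    mirroring the exclusion of negative maxima described in the caption.
    """
    max_img = np.max(image_resp, axis=1)
    max_shape = np.max(shape_resp, axis=1)
    ratio = np.full(max_img.shape, np.nan)
    valid = max_shape > 0
    ratio[valid] = max_img[valid] / max_shape[valid]
    return ratio

def summarize(ratio):
    """Mean and SD of the ratio across the units that were not excluded."""
    kept = ratio[~np.isnan(ratio)]
    return kept.mean(), kept.std(ddof=1)

# Toy responses for three hypothetical units (not data from the study)
image_resp = np.array([[4.0, 2.0], [10.0, 1.0], [6.0, 3.0]])
shape_resp = np.array([[2.0, 1.0], [-1.0, -3.0], [1.5, 0.5]])
ratio = image_to_shape_ratio(image_resp, shape_resp)
mean_ratio, sd_ratio = summarize(ratio)
```

In this toy case the second unit is excluded because its best shape response is negative, which would otherwise yield a meaningless negative ratio.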

https://doi.org/10.7554/eLife.38242.017

CNN fit to V4 responses

Above, we examined the ability of CNN units to approximate the boundary curvature selectivity of V4 neurons as described by the APC model, but while an APC model provides a good description of the responses of many V4 neurons, there are also neurons for which it explains little response variance across our shape set. We therefore examined whether the CNN units might directly provide a better fit (than the APC model) to the responses of the V4 neurons. We used cross-validation (see Materials and methods) to put these very different models on equal footing. Figure 14 shows the cross-validated, best fit r-values for the APC model plotted against those for the CNN units. Neither model is clearly better on average: just over half (56/109) of neurons were better fit by the APC model, while just under half (53/109) were better fit by a CNN unit. Only 21 of 109 neurons had significant deviations from the line of equality (Figure 14, red) and these were evenly split: 11 better fit by the APC model and 10 by the CNN. The similar performance of the APC model and CNN could be a result of the CNN and APC model explaining the same component of variance in the data, or explaining largely separate components of the variance. To assess this, for each V4 neuron, we removed from its response the component of variance explained by its best-fit APC model. For this APC-orthogonalized V4 response, the CNN model had a median correlation to V4 of r=0.29 (SD = 0.11), much lower than the APC model's median of r=0.47 (SD = 0.12). For 94/109 neurons, the APC model explained more variance than the variance uniquely explained by the CNN. Overall, we conclude that the APC model and the CNN explain similar features of V4 responses for most neurons.
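The orthogonalization step can be illustrated with a short sketch. It assumes that "removing the component explained by the APC model" means taking the residual of a least-squares regression (with intercept) of the V4 response on the APC prediction; the exact, cross-validated procedure is the one given in Materials and methods. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def orthogonalize(v4, apc_pred):
    """Residual of v4 after regressing out apc_pred (intercept included)."""
    X = np.column_stack([np.ones_like(apc_pred), apc_pred])
    beta, *_ = np.linalg.lstsq(X, v4, rcond=None)
    return v4 - X @ beta

def pearson_r(a, b):
    """Pearson correlation between two 1D response vectors."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic responses to 362 shapes (illustrative only, not study data)
rng = np.random.default_rng(0)
n_shapes = 362
apc_pred = rng.normal(size=n_shapes)   # best-fit APC model prediction
other = rng.normal(size=n_shapes)      # a component the APC model misses
v4 = apc_pred + 0.5 * other + 0.3 * rng.normal(size=n_shapes)
cnn = apc_pred + other                 # CNN unit shares both components

resid = orthogonalize(v4, apc_pred)
r_full = pearson_r(v4, cnn)            # CNN fit to the raw V4 response
r_unique = pearson_r(resid, cnn)       # CNN fit to what the APC model missed
```

A large drop from `r_full` to `r_unique` indicates that most of what the CNN unit explains is variance the APC model already accounts for, which is the logic behind the comparison reported above.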

Comparing the ability of the APC model vs. single CNN units to fit V4 neuronal data.

Showing r-values for cross-validated fits from both classes of model, black points correspond to V4 neurons for which neither model performed significantly better at predicting responses to the shape set. The APC model provided a better fit for red points above the line of equality, whereas points below the line correspond to neurons for which at least one unit within the trained CNN provided a better fit than any APC model.

https://doi.org/10.7554/eLife.38242.018

Discussion

We examined whether the CNN known as AlexNet, designed to perform well on image classification, contains units that appear to have boundary curvature selectivity like that of V4 neurons in the macaque brain. Although our simple shape stimuli were never presented to the network during training, we found that many units in the CNN were V4-like in terms of quantitative criteria for translation invariance and goodness of fit to a boundary curvature model. While units throughout AlexNet had good fits to the APC model, relatively poor translation invariance in the early layers meant that only the middle to deeper layers had substantial numbers of units that came close to matching exemplary APC-tuned V4 neurons. Based on our quantitative criteria and on the qualitative visualization of preferred features identified in natural images, we believe that APC-like units within middle layers of trained CNNs currently provide the best image-computable models for V4 boundary curvature selectivity.

Finding such matches at the single unit level is striking because the deep net and our macaques differ dramatically in their inputs, training and architecture. The animals never saw ImageNet images and probably never saw even a single instance of the overwhelming majority of the 1000 output categories of AlexNet. They did not see the forest, ocean, sky or other important contexts for AlexNet categories, nor had AlexNet been trained on the artificial shapes used to characterize V4. While the macaque visual system may be shaped by real-time physical contact with a 3D dynamic world, AlexNet cannot be, and it was not even given information about the locations or boundaries of the targets to be classified within its images during categorization training. AlexNet lacks a retina with a fovea, an LGN, feedback from higher areas, dedicated excitatory and inhibitory neurons, etc., and it does not have to compute with action potentials. Our results suggest that image statistics related to object boundaries may generalize across a wide variety of inputs and may support a broad variety of tasks, thereby explaining the emergence of similar selectivity in such disparate systems.

Visualization of V4-like CNN units

By applying a CNN visualization technique to APC-like units identified by our quantitative criteria, we found that some of these CNN units appeared, qualitatively, to respond to shape boundaries in natural images whereas many others did not. In early layers, particularly Conv2, the strongest responses were not driven specifically by object boundaries but instead by other image features including texture, accidental contours and negative space. In contrast, candidate APC units in intermediate layers often responded specifically to natural image patches containing object boundaries, suggesting that these units are APC-like. In the deeper (FC) layers, units were poorly driven by our shape stimuli relative to natural images, and the preferred natural images for a given unit appeared similar along many feature dimensions (e.g., texture, background context) beyond simply the curvature of object boundaries. We speculate that these units are jointly tuned to many features and that object boundaries alone account for only part of their tuning. More work is needed to understand the FC-layer units with high APC r-values; however, we believe units in the middle layers, Conv3-5, provide good working models for understanding how APC-tuning might arise from earlier representations, how it may depend on image statistics and how it could support downstream representation.

Training and translation invariance

Training dramatically increased the number of units with V4-like translation invariance, particularly in the FC layers (Figure 7A vs. Figure 7B). Since the trained and untrained nets have the same architecture, the increase in TI is not simply a result of architectural features meant to facilitate translation invariance, for example max-pooling over identical, shifted filters. Thus, while CNN architecture is often associated with translation invariance (Fukushima, 1980; Rumelhart et al., 1986; Riesenhuber and Poggio, 1999; Serre et al., 2005; Cadieu et al., 2007; Goodfellow et al., 2009; Lenc and Vedaldi, 2014), we find that high TI for actual single unit responses is only achieved in tandem with the correct weights. We are currently undertaking an in-depth study comparing the trained and untrained networks to elucidate statistical properties of weight patterns that support translation invariance. Our preliminary analyses show that spatial homogeneity of a unit's kernel weights across features correlates with its TI score, but this correlation is weaker in higher layers. Alternative models of translation-invariant tuning in V4 include the spectral receptive field (SRF) model (David et al., 2006) and HMax model (Cadieu et al., 2007). The former made use of the Fourier spectral power, which is invariant to translation of the input image, but this phase insensitivity prevents the SRF model from explaining APC-like shape tuning (Oleskiw et al., 2014). The HMax model of Cadieu et al. (2007) is a shallower network with the equivalent of two convolutional layers and does not achieve the strong translation invariance found in deeper layers here (Popovkina et al., 2017). Overall, translation invariance at the single-unit level is not a trivial result of gross CNN architecture, yet it is crucial for modeling V4 form selectivity.
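The architectural feature mentioned above, max-pooling over identical, shifted filters, can be illustrated in an idealized one-dimensional case. This toy sketch only shows how pooling makes a response independent of stimulus position when the same kernel is applied everywhere; it is not a model of AlexNet, and the kernel, pattern and helper names are hypothetical.

```python
import numpy as np

def conv_responses(signal, kernel):
    """Apply the same kernel at every position ('valid' cross-correlation)."""
    return np.correlate(signal, kernel, mode="valid")

def pooled_response(signal, kernel):
    """Max-pool over all positions of the shared kernel."""
    return conv_responses(signal, kernel).max()

def place(pattern, offset, length=12):
    """Embed a pattern at a given offset in an otherwise blank 1D input."""
    s = np.zeros(length)
    s[offset:offset + len(pattern)] = pattern
    return s

kernel = np.array([1.0, 2.0, 1.0])
pattern = np.array([0.5, 1.0, 0.5])
a = place(pattern, 2)   # stimulus at one position
b = place(pattern, 7)   # same stimulus, shifted

# A single filter location responds very differently to the two inputs...
single_a = conv_responses(a, kernel)[2]
single_b = conv_responses(b, kernel)[2]
# ...whereas the pooled response does not depend on the shift.
pooled_a = pooled_response(a, kernel)
pooled_b = pooled_response(b, kernel)
```

This idealized invariance depends on the pooled filters being identical and on nothing downstream reintroducing position dependence, which is consistent with the observation that, in the full network, high TI is only achieved in tandem with the correct weights.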

Other studies of TI in CNNs

Although other studies have examined translation invariance and related properties (rotation and reflection invariance) in artificial networks (Ranzato et al., 2007; Goodfellow et al., 2009; Lenc and Vedaldi, 2014; Zeiler and Fergus, 2013; Fawzi and Frossard, 2015; Güçlü and van Gerven, 2015; Shang et al., 2016; Shen et al., 2016; Tsai and Cox, 2015), we are unaware of any study that has quantitatively documented a steady layer-to-layer increase of translation invariant form selectivity, measured for single units, across layers throughout a network like AlexNet. For example, using the invariance metric of Goodfellow et al. (2009), Shang et al. (2016; their Figure 4c) averaged over multiple types of invariance (e.g., translation, rotation) and over all units within a layer and found a weak, non-monotonic increase in invariance across layers in a CNN similar to AlexNet. Using the same metric but different stimuli, Shen et al. (2016) found no increase and no systematic trend in invariance across layers of their implementation of AlexNet (their Figure 5). Although Güçlü and van Gerven (2015) plot an invariance metric against CNN layer, their metric is the half-width of a response profile, and thus it is unlike our TI selectivity metric. In spite of the importance of translation invariance in visual processing and deep learning (LeCun et al., 2015), there currently is no standard practice for quantifying it. An important direction for future work will be to establish standard and robust methods for assessing translation invariance and other transformation invariances to facilitate comparisons across artificial networks and the brain.

Comparison to previous work

One way our approach to comparing the representation in a CNN to that in the brain differs from previous work is that we examined the representation of specific visual features at the single-unit level, whereas previous studies took a population level approach. For example, Yamins et al. (2014) modeled IT and V4 recordings using weighted sums over populations of CNN units, and Khaligh-Razavi and Kriegeskorte (2014) examined whether populations of CNN units represented categorical distinctions similar to those represented in IT (e.g., animate vs. inanimate). Also, Kubilius et al. (2016) examined whether forms perceived as similar by humans had similar CNN population representations. Our work is the first to quantitatively compare the single-unit representation in a CNN to that in a mid-level visual cortical area. We tested whether an artificial network matched the neural representation at a fundamental level: the output of single neurons, which is conveyed onward to hundreds or thousands of targets in multiple cortical areas. Unlike previous studies, we focused on specific physiological properties (boundary curvature tuning and translation invariance) with a goal of finding better models where a robust image-computable model is lacking. Furthermore, we use visualization of unit responses to natural images to qualitatively validate whether the representation that these response properties are intended to capture (an object-centered representation of boundary) does in fact hold across natural images. We believe this level of model validation, which includes quantitative and conceptual registration to documented neuronal selectivity, pushes the field beyond what has been done before. Our results allow modelers to focus on specific neural selectivities and work with concrete, identified circuits that have biologically plausible components.

Another major difference with prior work is that we fit the CNN to the APC model as opposed to directly to neural responses. This might seem like an unnecessary layer of abstraction, but the purpose of a model is not just predictive power but also interpretability, and the CNN's complexity runs counter to interpretability. The CNN is necessarily complex in order to encode complex features from raw pixel values, whereas the APC model has five interpretable parameters. The APC model describes responses to complex features while ignoring the details of how those features were computed from an image. By identifying APC-tuned units in the CNN, we gain an image-computable model of neural responses to interpretable features; these units can be studied to understand how and why such response patterns arise. When we separately tested whether the CNN units were able to directly fit the responses of V4 neurons, we found they were no better on average than the APC model, thus for a gain in interpretability, we did not suffer an overall loss of predictive power. Nevertheless, some V4 neurons were better fit directly to a CNN unit than to any APC model, suggesting there may be V4 representations beyond APC tuning that can be synergistically studied with CNNs.

Value of artiphysiology

Comparing artificial networks to the brain can serve both computer and biological vision science (Kriegeskorte, 2015). What can an electrophysiologist learn from this study? First, our results demonstrate that there may already exist image-computable models for complex selectivity that match single-neuron data better than hand-designed models from neuroscientists. Second, finding matches between neuronal selectivity in the brain and artificial networks trained on vast amounts of natural data provides one method for electrophysiologists to validate their findings. For example, our findings support the hypothesis that an encoding of boundary curvature in single units may be generally important for the representation of objects. Third, once a match is found based on limited sets of experimentally practical stimuli, units within deep nets can then be tested with vast and diverse stimuli to attempt to gain deeper understanding. For example, finding the downward-dog-paw and wolf-ear-steeple units raises the question of whether boundary curvature is encoded independent of other visual traits in V4 or in the CNN. Specifically, is it possible that V4 neurons that appear to encode curvature at a particular angular position are in fact also selective for texture or color features associated with a limited set of objects that have relevance to the monkey? Longer experimental sessions with richer stimulus sets will be required to test this in V4. Fourth, concrete, image-computable models can be used to address outstanding debates that may otherwise remain imprecise. For example, by visualizing the preferences of single units for natural stimuli after identifying and characterizing those units with artificial stimuli, our results speak to the debate on artificial vs. natural stimuli (Rust and Movshon, 2005) by showing that artificial stimuli are often able to reveal critical characteristics of the selectivity of units involved in complex mid-level (parts-based) to high-level (categorical) visual encoding, even when the visual dimensions of the artificial set explore only a minority of the feature space represented by the units. As another example, our results can help to address the debate of whether the visual system explicitly represents object boundaries (Adelson and Bergen, 1991; Movshon and Simoncelli, 2014; Ziemba and Freeman, 2015), which Movshon and Simoncelli describe as follows: ‘In brief, the concept is that the visual system is more concerned with the representation of the “stuff” that lies between the edges, and less concerned with the edges themselves (Adelson and Bergen, 1991).’ The models we have identified can now be used to pilot experimental tests of this rather complex, abstract idea.

Our approach also provides potentially valuable insight for machine learning. The connection between deep nets and actual neural circuits is often downplayed, but we found a close match at the level of specific single-unit selectivity. This opens the possibility that future studies could reveal more fine-scale similarities, that is matches of sub-types of single-unit selectivity, between artificial networks and the brain, and that such homology could become a basis for improving network performance. Second, translation invariance, seen as critical for robust visual representation, has not been systematically quantified for units within artificial networks. Determining why deeper layers in the network maintain a wide diversity of TI across units could be important for understanding how categorical representations are built. More generally, the art of characterizing units within complex systems using simple metrics and systematic stimulus sets, as practiced by electrophysiologists, can provide a useful way to interpret the representations learned in deep nets, thereby opening the black box to understand how learned representation contributes to performance.

Further work

Our findings are consistent with the hypothesis that some CNN units share a representation of shape in common with V4 that is captured by the APC model. Examining whether these CNN units demonstrate additional V4 properties, beyond those examined here, would further test this hypothesis. For example, curvature-tuned V4 cells have been shown to (1) have some degree of scale invariance (El-Shamayleh and Pasupathy, 2016), (2) suppress the representation of accidental contours, for example those resulting from occlusion that are unrelated to object shape (Bushnell et al., 2011), (3) be robust against partial occlusions of certain portions of shape (Kosai et al., 2014), and (4) maintain selectivity across a spectrum of color (Bushnell and Pasupathy, 2012). Further studies like these are needed to more deeply probe whether the intermediate representation of shape and objects in the brain is similar to that in artificial networks. In addition to further study of functional response properties, it is important to understand how the network achieves these representations. For example, translation invariance was a key response property that allowed the trained network to achieve a V4-like representation, yet we are just beginning to understand what aspects of kernel weights, receptive field overlap, and convergence are critical to matching the physiological data. For CNNs to be valuable models of the nervous system, it will be important to understand what network properties support their ability to match representations observed in vivo.

Materials and methods

The convolutional neural network

We used an implementation of the well-known CNN referred to as ‘AlexNet,’ which is available from the Caffe deep learning framework (http://caffe.berkeleyvision.org; Jia et al., 2014). Its architecture (Figure 2) is purely feed forward: the input to each layer consists solely of the output from the previous layer. The network can be broken down into eight major layers (Figure 2A, left column), the first five of which are called convolutional layers (Conv1 through Conv5) because they contain linear spatial filters with local support that are repeatedly applied across the image. The last three layers are called fully connected (FC6 through FC8) because they receive input from all units in the previous layer. We next describe in detail the computations of the first major layer, which serves as a model to understand the later layers.

The first major convolutional layer consists of four sublayers (Figure 2A, orange, and Figure 2C–F, top four rows). The first sublayer, Conv1, consists of 96 distinct linear filters (shown in Figure 1) that are spatially localized to 11 × 11 pixel regions and that have a depth of three, corresponding to the red, green and blue (RGB) components of the input color images. The input images used for training and testing are 227 × 227 (spatial) × 3 (RGB) pixels. The output of a Conv1 unit is its linear filter output minus a bias value (a constant, not shown). Conv1 has a stride of 4, meaning that neighboring units have filters that are offset in space by four pixels. The output of each Conv1 unit is processed by a rectified linear unit in the second sublayer, Relu1, the output of which is simply the half-wave rectified value of Conv1. These values are then pooled by units in the third sublayer, Pool1, which compute the maximum over a 3 × 3 pixel region (Figure 2A, gray triangles) with a stride of 2. The outputs of the Pool1 units are then normalized (see below) to become the outputs of units in the fourth sublayer, Norm1. These normalized outputs are the inputs to the Conv2 units in the second major layer, and so on. Figure 2A shows a scale diagram of the spatial convergence in the convolutional layers (major layers are color coded) along one spatial dimension. Starting at the top, the 11 × 11 pixel kernels (orange triangles) sample the image every four pixels, reducing the spatial representation to a 55 × 55 element grid (Figure 2A, column 4, lists spatial dimensions). The Pool1 layer reduces the representation to 27 × 27 because of its stride of 2. The Conv2 unit linear filters are 5 × 5 in space (red triangles) and are 48 deep (not depicted), where the depth refers to the number of unique kernels in the previous layer that provide inputs to the unit (see Krizhevsky et al., 2012, for details, and their Figure 2 for a depiction of the 3D kernel structure).

These operations continue to process and subsample the representation until, after Pool5, there is a 6 × 6 spatial grid that is 256 kernels deep. Given the convergence between layers, the maximum possible receptive field (RF) size (i.e., extent along either the horizontal or vertical dimension) for units in each convolutional layer ranges from 11 to 163 pixels (Figure 2B) for Conv1 to Conv5, respectively. For example, the pyramid of support is shown for the central Conv5 unit (Figure 2A, dark blue triangle shows tip of upside-down pyramid), which has access to the region of width 163 pixels covered by Conv1 kernels (orange triangles). The receptive field sizes of units in the FC layers are unrestricted (not shown in Figure 2B). The last major layer, FC8, has a Prob8 sublayer that represents the final output in terms of the probability that the visual input contains each of 1000 different categories of object (e.g., Dalmatian, Lampshade, etc.; see Krizhevsky et al., 2012, for details).
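The grid sizes and maximum RF sizes quoted above follow from the kernel size, stride and padding of each sublayer. A minimal Python sketch of that bookkeeping is given here as a rough check. The 11 × 11, stride-4 Conv1 kernels, the 3 × 3, stride-2 pooling and the 5 × 5 Conv2 kernels come from the text; the remaining kernel sizes, strides and padding values are assumptions based on the standard AlexNet definition, and the names `LAYERS` and `trace_geometry` are hypothetical.

```python
# Per-sublayer geometry: (name, kernel size, stride, padding).
# Conv1, Pool1 and the Conv2 kernel size are stated in the text; all other
# values are ASSUMED from the standard AlexNet definition.
# Relu and Norm sublayers are omitted because they do not change geometry.
LAYERS = [
    ("Conv1", 11, 4, 0),
    ("Pool1", 3, 2, 0),
    ("Conv2", 5, 1, 2),
    ("Pool2", 3, 2, 0),
    ("Conv3", 3, 1, 1),
    ("Conv4", 3, 1, 1),
    ("Conv5", 3, 1, 1),
    ("Pool5", 3, 2, 0),
]


def trace_geometry(input_size=227, layers=LAYERS):
    """Return {layer: (grid_size, max_rf_size)} along one spatial dimension."""
    size = input_size  # number of units along one axis
    rf = 1             # maximum receptive field width, in input pixels
    jump = 1           # input-pixel offset between neighboring units
    geometry = {}
    for name, kernel, stride, pad in layers:
        size = (size + 2 * pad - kernel) // stride + 1
        rf = rf + (kernel - 1) * jump
        jump = jump * stride
        geometry[name] = (size, rf)
    return geometry


geometry = trace_geometry()
for name, (size, rf) in geometry.items():
    print(f"{name}: {size} x {size} grid, max RF {rf} px")
```

Under these assumptions the sketch reproduces the 55 × 55 Conv1 grid, the 27 × 27 Pool1 grid, the 6 × 6 Pool5 grid and the 163-pixel maximum RF of Conv5 given in the text.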

Units in the Norm1 and Norm2 sublayers carry out local response normalization by dividing their input value by a function (see Krizhevsky et al., 2012, their section 3.3) of the sum of squared responses to five consecutive kernels (indices from +2 to −2) along the axis of unique kernel indices (e.g., in Conv1, the indices go from 0 to 95 for the filters shown in Figure 1, from the upper left towards the right and down), thereby creating inhibition among kernels. Figure 2D (bottom row) lists the total number of units with unique kernels in each layer, and this defines the number of units that we examine here. In the Conv layers, we only test the units that lie at the central spatial position because they perform the same computation as their spatially offset counterparts. We analyzed a total of 22,096 units. To identify units for reproducibility in future studies, we refer to units by their layer name (e.g., Conv1) and a unit number, where the unit number is the index, starting at zero, within each sublayer and proceeding in the order defined in Caffe.
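To make the normalization step concrete, the following Python sketch applies local response normalization across five neighboring kernel indices at a single spatial location. The functional form and the default constants (k = 2, α = 10⁻⁴, β = 0.75) are assumptions taken from section 3.3 of Krizhevsky et al. (2012); the Caffe implementation used here may scale these terms differently, and the function name `local_response_norm` is hypothetical.

```python
import numpy as np


def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize responses across neighboring kernel indices.

    a : 1-D array of responses of all unique kernels at one spatial location.
    n : number of consecutive kernels in the pool (indices -2 to +2 for n=5).
    k, alpha, beta : ASSUMED constants (Krizhevsky et al., 2012, section 3.3).
    """
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    half = n // 2
    for i in range(a.size):
        # The pool is clipped at the first and last kernel index.
        lo, hi = max(0, i - half), min(a.size, i + half + 1)
        out[i] = a[i] / (k + alpha * np.sum(a[lo:hi] ** 2)) ** beta
    return out
```

Because each unit is divided by a term that grows with the squared responses of its neighbors along the kernel axis, strongly driven kernels suppress their neighbors, which is the inhibition among kernels described above.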

We tested the network in two states: untrained and fully trained. The untrained network has all weights (i.e., values within the convolutional kernels and input weights for FC layers) initialized to Gaussian random values with mean 0 and SD 0.01, except for FC6 and FC7 where SD = 0.005, and all bias values initialized to a constant of 0 (Conv1, Conv3, FC8) or 1 (Conv2, Conv4, Conv5, FC6, FC7). These initial bias values are relatively low to minimize the number of unresponsive units, which in turn guarantees a back propagation gradient for each unit during training. The fully trained network (available from Caffe) has been trained with stochastic gradient descent on a large database of labeled images, ImageNet (Deng et al., 2009), with the target that the final sublayer, Prob8, has value 0 for all units except for a value of 1 for the unit corresponding to the category of the currently presented training image. To speed up training and mitigate overfitting, an elaborate training procedure was used that included a number of heuristics described in detail in Krizhevsky et al. (2012).

Visual stimuli

Our stimulus set (Figure 3A) is that used by Pasupathy and Connor (2001) to assess tuning for boundary curvature in V4 neurons. The set consists of 51 different simple closed shapes that are presented at up to eight rotations (fewer rotations for shapes with rotational symmetry), giving a total of 362 unique stimulus images. We rendered the shapes within a 227 by 227 pixel field with RGB values set to the same amplitude, thus creating an achromatic stimulus. The background value was 0, and the foreground amplitude was varied up to 255, the maximum luminance. This format matched the size and amplitude of the JPEG images on which the CNN was originally trained. The center of each shape was taken to be the centroid of all points on the finely sampled shape boundary. We fixed the foreground amplitude to 255 after varying it to lower values and finding that it made little difference to the response levels through the network because of the normalization layers (see Results).

We set the size of our stimuli to be 32 pixels, meaning that the largest shape, the large circle (Figure 3A second shape from upper left), had a diameter of 32 pixels and all stimuli maintained the relative scaling shown in Figure 3A. This ensured the stimuli fit within the calculated RF of all layers except Conv1 with additional room for translations (see Maximum RF size, Figure 2B) and allowed all layers to be compared with respect to the same stimuli. We excluded Conv1 from our analysis for three reasons: fitting the stimuli within the 11 by 11 pixel RFs would corrupt their boundary shape, it would leave no room for testing translation invariance, and Conv1 is of less interest because of its simple function. In the V4 electrophysiological experiments of Pasupathy and Connor, stimuli were sized proportionally to each neuronal RF, as it can be difficult to drive a cell with stimuli that are much smaller than the RF. We tested sizes larger than 32 pixels (see Results) and found it did not substantially change our results.

Electrophysiological data

For comparison to the deep network model, we re-analyzed data from two previous single-unit, extracellular studies of parafoveal V4 neurons in the awake, fixating rhesus monkey (Macaca mulatta). Data from the first study, Pasupathy and Connor (2001), consists of the responses of 109 V4 neurons to the set of 362 shapes described above. There were typically 3–5 repeats of each stimulus, and we used the mean firing rate averaged across repeats and during the 500 ms stimulus presentation to constrain a model of tuning for boundary curvature in V4 (Figure 3C). To constrain translation invariance, we used data from a second study, El-Shamayleh and Pasupathy (2016), because the first study used only two stimuli (one preferred and one antipreferred) to coarsely assess translation invariance. The data from the second study consists of responses of 39 neurons tested for translation invariance. The stimuli were the same types of shapes as in the first study, but the position of the stimuli within the RF was also varied. Each neuron was tested with up to 56 shapes (some of which are rotations of others) presented at 3–5 positions within the receptive field. Each unique combination of stimulus and RF position was presented for 5–16 repeats, and spike counts were averaged over the 300 ms stimulus presentation. Experimental protocols for both studies are described in detail in the original publications.

Response sparsity

While many units in the CNN responded well to our shape set, there were also many units, particularly in the rectified (Relu) sublayers, that responded to very few or none of our shape stimuli. It was important to identify these very sparsely responding units because they could bias our comparison between the CNN units and V4 neurons. We quantified response sparsity using the fourth moment, kurtosis (Field, 1994),

(1) $K = \frac{1}{n}\sum_{i=1}^{n}\frac{(x_i-\bar{x})^4}{\sigma^4}$,

where $x_i$ is the response to the ith stimulus, n is the number of stimuli, and $\bar{x}$ and σ are the mean and SD of the response across stimuli. This metric works for both non-negative and signed random variables, thus covering the outputs of all layers of the CNN. We excluded CNN units where response sparsity was outside the range observed in V4: 2.9 to 42 (Figure 4—figure supplement 1; see Results). We also found that such units gave degenerate fits to the APC model.
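As an illustration of Equation 1 and the exclusion criterion, a minimal Python sketch follows. It assumes the population SD (dividing by n) and treats a unit with constant responses as having undefined kurtosis; neither choice is specified in the text, and the function names are hypothetical.

```python
import numpy as np


def response_kurtosis(responses):
    """Kurtosis (Equation 1) of one unit's responses across the stimulus set."""
    x = np.asarray(responses, dtype=float)
    sd = x.std()  # ASSUMPTION: population SD (ddof=0)
    if sd == 0:
        return np.nan  # constant response: kurtosis is undefined
    return np.mean((x - x.mean()) ** 4) / sd ** 4


def within_v4_range(responses, low=2.9, high=42.0):
    """True if sparsity falls inside the range observed in V4 (2.9 to 42)."""
    k = response_kurtosis(responses)
    return bool(low <= k <= high)  # NaN compares False, so such units are excluded
```

A unit that responds to only one of many stimuli yields a large kurtosis, whereas a unit with a flat, broad response distribution yields a small one; both extremes fall outside the V4 range and would be excluded.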

Placing stimuli in the classical receptive field

In keeping with neurophysiology, we defined the classical receptive field (CRF) of a CNN unit as the region of the input from which our stimuli can elicit a response different from baseline, where baseline is defined as the response to the background input (all zeros). For example, to determine the horizontal extent of the CRF of a unit, we started with our stimulus set centered (in x and y) on the spatial location of the unit and determined whether there was a driven response (deviation from baseline) to any stimulus. We then moved the stimulus set left and right to cover a 100 pixel span in two pixel increments to find the longest set of contiguous points from which any response was elicited at each point. In other words, stimuli were centered on pixels ranging from 64 to 164 in the 227 pixel wide image. To account for the finite width of the stimuli, we subtracted the maximum stimulus width from the length of the contiguous response region and added one to arrive at the estimated extent of the CRF in pixels along the horizontal axis. Any unit with a CRF wide enough to contain three 2-pixel translations of our stimulus set was included in our analyses. Generally, this provided a conservative estimate of the receptive field, because most stimuli were narrower than the maximal-width stimulus, as observed in Figure 3A.

All analyses and plots of responses to translated shapes were made with respect to horizontal shifts of our vertically centered shape set. To verify that our conclusions did not depend on testing only horizontal shifts, we recalculated our metrics for vertical shifts and found them to be strongly correlated with those for horizontal shifts (Figure 7—figure supplement 1).

The APC model

Our study focuses on the ability of CNN units to display a particular physiological property of V4 neurons, tuning for boundary conformation, which has been modeled using the angular position and curvature (APC) model introduced by Pasupathy and Connor (2001). Conceptually, APC tuning refers to the ability of a neuron to respond selectively to simple shape stimuli that have a boundary curvature feature (a convexity or concavity) at a particular angular position with respect to the center of the shape. Unlike the CNN, the APC model does not operate on raw image pixel values, but instead on the carefully parameterized curvature and angular position of diagnostic elements of the boundaries of simple closed shapes (see example shape, Figure 3B). Each boundary element along the border of a shape can be mapped to a point in a plane hereafter referred to as the APC plane (Figure 3C). The response, $R_i$, of a unit to the ith shape is given by:

(2) $R_i = k \max_j \left[ \exp\left(-\frac{(c_{i,j}-\mu_c)^2}{2\sigma_c^2}\right) \exp\left(-\frac{(a_{i,j}-\mu_a)^2}{2\sigma_a^2}\right) \right]$,

where the expression inside the square brackets is the product of two Gaussian tuning curves, one for curvature with mean $\mu_c$ and SD $\sigma_c$, and one for angular position with mean $\mu_a$ and SD $\sigma_a$. The curvature axis extends from −1 (sharp concavity) to +1 (sharp convexity), and the angular position is defined with respect to the center of the shape. The jth curvature value of the ith shape is encoded as $c_{i,j}$ and the angular position of that curvature element is $a_{i,j}$. The factor k is a scaling constant. The max is taken over these boundary elements, so the response depends only on the most preferred feature. In the original study (Pasupathy and Connor, 2001), these parameters were fit using a gradient descent method, the Gauss-Newton algorithm, from a grid of starting points across the APC plane. We instead discretely sampled the parameter space, taking the Cartesian product of 16 values each of $\mu_c$, $\sigma_c$, $\mu_a$ and $\sigma_a$, where the means were linearly spaced, the SDs were logarithmically spaced, and the end-points were set to match the range of values observed for the V4 cells when fit by the original Gauss-Newton method ($\mu_c \in [-0.5, 1]$, $\sigma_c \in [0.01, 0.98]$, $\mu_a \in [0, 338]$ and $\sigma_a \in [23, 171]$). We defined the best-fit model to be that which maximized Pearson's correlation coefficient between observed and predicted responses. We then found k using a least squares fit. This grid search was faster than the Gauss-Newton procedure, gave the same median correlation for the original V4 neurons to two decimal places (0.48), and ensured that the same models were tested on all units. We used the two-tailed p-value of Pearson’s correlation coefficient to test for significance.
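The APC response and the grid-search fit described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: angular positions are assumed to be in degrees with wrap-around differences, k is fit by least squares without an intercept, and the names `apc_response`, `make_grid` and `fit_apc` are hypothetical.

```python
from itertools import product

import numpy as np


def apc_response(curv, ang, mu_c, sd_c, mu_a, sd_a, k=1.0):
    """Equation 2 for one shape; curv and ang list its boundary elements."""
    curv = np.asarray(curv, dtype=float)
    ang = np.asarray(ang, dtype=float)
    # ASSUMPTION: angles in degrees, differences wrapped to [-180, 180).
    d_ang = (ang - mu_a + 180.0) % 360.0 - 180.0
    g_curv = np.exp(-((curv - mu_c) ** 2) / (2.0 * sd_c ** 2))
    g_ang = np.exp(-(d_ang ** 2) / (2.0 * sd_a ** 2))
    return k * np.max(g_curv * g_ang)  # only the best boundary element counts


def make_grid(n=16):
    """Candidate values: means linearly spaced, SDs logarithmically spaced."""
    return {
        "mu_c": np.linspace(-0.5, 1.0, n),
        "sd_c": np.geomspace(0.01, 0.98, n),
        "mu_a": np.linspace(0.0, 338.0, n),
        "sd_a": np.geomspace(23.0, 171.0, n),
    }


def fit_apc(shapes, observed, grid):
    """Grid search maximizing Pearson's r, then a least-squares scale k.

    shapes : list of (curvatures, angular_positions) pairs, one per shape.
    """
    observed = np.asarray(observed, dtype=float)
    best_r, best_params, best_pred = -np.inf, None, None
    for params in product(grid["mu_c"], grid["sd_c"], grid["mu_a"], grid["sd_a"]):
        pred = np.array([apc_response(c, a, *params) for c, a in shapes])
        if pred.std() == 0:  # constant prediction: r is undefined
            continue
        r = np.corrcoef(pred, observed)[0, 1]
        if r > best_r:
            best_r, best_params, best_pred = r, params, pred
    if best_params is None:
        raise ValueError("no candidate model produced a varying prediction")
    # ASSUMPTION: k is a pure scale factor (no intercept).
    k = best_pred @ observed / (best_pred @ best_pred)
    return best_r, best_params, k
```

With `make_grid(16)` the search covers 16⁴ = 65,536 candidate models, and because the same candidates are evaluated for every unit, fits are directly comparable across V4 neurons and CNN units.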

Measuring translation invariance

To visualize translation invariance we created position-correlation functions by plotting the r-value of responses between a reference and an offset location as a function of distance (e.g., Figure 6B and E–J, red). To compare the fall-off in correlation to the fall-off in the RF profile (e.g., Figure 6E–J, green) of the CNN unit, we computed an aggregate firing rate metric—the square root of the sum of the squared responses across the stimulus set at each spatial position. For CNN units, this was used rather than the mean firing rate because responses could be positive or negative.

To quantify translation invariance in neuronal and CNN unit responses, we defined a metric, TI, that can be thought of as a generalization of the correlation coefficient. The correlation coefficient,

(3) $r = \frac{\mathrm{Cov}(p_1, p_2)}{\mathrm{SD}(p_1)\,\mathrm{SD}(p_2)}$,

which is bounded between −1 and 1, measures how similar the response pattern is across two locations, where p1 and p2 are vectors containing the responses to all stimuli at positions 1 and 2. Our TI metric is,

(4) $\mathrm{TI} = \frac{\sum_{i \neq j}\mathrm{Cov}(p_i, p_j)}{\sum_{i \neq j}\mathrm{SD}(p_i)\,\mathrm{SD}(p_j)}$,

 where the sums are taken over all unique pairs of locations, and pi is the mean-subtracted column of responses at the ith RF position. The numerator is the sum of the non-diagonal entries in the covariance matrix of the responses, and the denominator is the sum of the products of each corresponding pair of SDs. Thus, this metric is also bounded to lie between −1 and 1, but it has an advantage over the average r-value across all unique pairs of locations because the latter would weight the r-value from RF locations with very weak responses just the same as those with very strong responses. For a simple model of neuronal translation invariance in which the variations of responses are described as the product of a receptive field profile and a shape selectivity function, our TI metric would take its maximum possible value, 1. If responses at all positions were uncorrelated, it would be 0.
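The TI metric can be computed directly from the covariance matrix of the responses across positions. The Python sketch below assumes the responses are arranged as a stimuli-by-positions matrix with at least two positions; the function name `translation_invariance` is hypothetical.

```python
import numpy as np


def translation_invariance(responses):
    """TI metric (Equation 4).

    responses : array of shape (n_stimuli, n_positions), n_positions >= 2.
    """
    r = np.asarray(responses, dtype=float)
    cov = np.cov(r, rowvar=False)          # positions are the variables (columns)
    sd = np.sqrt(np.diag(cov))             # SD of the responses at each position
    off_diag = ~np.eye(cov.shape[0], dtype=bool)
    # Summing both (i, j) and (j, i) doubles numerator and denominator alike,
    # so the ratio equals the sum over unique pairs.
    return cov[off_diag].sum() / np.outer(sd, sd)[off_diag].sum()


# A separable response matrix (shape tuning times RF profile) gives TI = 1.
shape_tuning = np.array([1.0, 2.0, 3.0, 4.0])
rf_profile = np.array([0.5, 1.0, 0.2])
ti_separable = translation_invariance(np.outer(shape_tuning, rf_profile))
```

Because each pair of positions contributes its covariance rather than its correlation, positions with weak responses carry little weight, which is the advantage over averaging r-values noted above.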

We also evaluated an alternative metric, the separability index (Mazer et al., 2002; Hinkle and Connor, 2002) based on the singular value decomposition of the response matrix, but we found that it was biased to report higher translation invariance values for response matrices that reflected tuning that was more confined in space (i.e., smaller RF sizes) or more limited to a small range of shapes (i.e., higher shape selectivity). According to our simulations, our TI metric has the benefit of being unbiased with respect to receptive field size or selectivity of our response matrices, thereby facilitating comparisons across layers and diverse response distributions.

In testing the CNN, we finely sampled horizontal shifts of the stimulus set, as described above in ‘Placing stimuli in the classical receptive field’. The TI metric for any unit was computed only for the set of contiguous locations for which the entire shape set was within the RF of the unit.

Comparing CNN and APC model fits to V4 data

We examined whether the CNN units might directly provide a better fit to the V4 neural responses than does the APC model. This required us to compare, for each of the 109 V4 neurons, the best-fit unit in the pool of CNN units to the best fit provided by the APC model. In the case of the CNN, there are 22,096 units to consider (Figure 2D). In the case of the APC model, there are five parameters (see ‘The APC model’ above). We employed cross-validation to ensure that any differences in fit quality were not the result of one fitting procedure being more flexible than the other. In particular, we performed 50 fits on a random subset of 4/5 of the neural data, then measured the correlation of the fit model on the remaining 1/5. We took the mean of these 50 fits for each unit to be the estimate of test correlation, and used the 95th percentiles of the distribution of fits to identify cells that deviate in their fit quality between two models (e.g., APC model and the CNN). To judge whether the variance explained by the CNN was largely distinct from that explained by the APC model, we fit each V4 neuron's best-fit CNN unit to the residual of the APC model fit to that neuron. If the correlation of the CNN unit to the V4 neuron remains high, then the APC model and CNN explain different features of the response of the V4 neuron.

Estimating the effect of the stochastic nature of neuronal responses

AlexNet produces deterministic, noise-free responses, whereas the responses of V4 neurons are stochastic. This raises the possibility that our conclusions might have been different if more trials of V4 data had been collected to reduce the noise in the estimates of the mean neuronal responses. In particular, trial-to-trial variability will tend to lower the correlation coefficient (r-value) between model and data.

To address this, we used the methods of Haefner and Cumming (2009) to remove the downward bias that trial-to-trial variability imparts on the r-value for our fits of the APC model to neuronal data. The method of Haefner and Cumming assumes that the neural responses have been appropriately transformed to have equal variance across stimuli and that the averaged responses for each stimulus are normally distributed. For the case where the variance-to-mean relationship is $\sigma^2(\lambda) = a\lambda$, where λ is the mean response and a is a constant (i.e., Fano factor is constant across firing rates), an often used transformation is the square root of the responses. Empirically, we have found that this transformation works well even when neural responses have a quadratic variance-to-mean relationship. After taking the square root of the responses, we estimated sample variance for each stimulus across trials and then averaged across stimuli to get $\bar{s}^2$. We made a least-squares fit of the model to the centered mean responses (grand mean subtracted from the mean for each stimulus). We then calculated the corrected estimate of explained variance:

(5) $\hat{R}_c^2 = \frac{\hat{\beta}^2 - \frac{\bar{s}^2}{n}}{\hat{\alpha}^2 + \hat{\beta}^2 - \frac{\bar{s}^2}{n}(m-1)}$,

where $\hat{\beta}^2$ is the sum of squares of the model predictions (explained variance), $\hat{\alpha}^2$ is the sum of squares of the residuals from the model (unexplained variance), $\bar{s}^2$ is sample variance across trials, averaged for all stimuli, m is the number of stimuli, and n is the number of trials.
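The correction can be sketched in Python as follows, under several assumptions that the text does not fix: the estimator has the form (β̂² − s̄²/n) / (α̂² + β̂² − (m − 1)s̄²/n), every stimulus has the same number of trials, the model enters through a single least-squares scale factor applied to centered predictions, and the sample variance uses n − 1 in the denominator. The function name `corrected_r2` is hypothetical.

```python
import numpy as np


def corrected_r2(trials, model_pred):
    """Noise-corrected explained variance (Equation 5).

    trials     : array (n_trials, m_stimuli) of non-negative raw responses.
    model_pred : length-m model prediction, on the square-root scale.
    """
    y = np.sqrt(np.asarray(trials, dtype=float))  # variance-stabilizing transform
    n, m = y.shape
    s2 = y.var(axis=0, ddof=1).mean()             # s-bar^2: trial variance, averaged over stimuli
    mean_resp = y.mean(axis=0)
    centered = mean_resp - mean_resp.mean()       # grand mean subtracted
    x = np.asarray(model_pred, dtype=float)
    x = x - x.mean()                              # ASSUMPTION: prediction is centered too
    scale = x @ centered / (x @ x)                # least-squares fit of a single scale
    fit = scale * x
    beta2 = np.sum(fit ** 2)                      # explained sum of squares
    alpha2 = np.sum((centered - fit) ** 2)        # residual sum of squares
    noise = s2 / n
    return (beta2 - noise) / (alpha2 + beta2 - noise * (m - 1))
```

When the trial-to-trial variance is zero the expression reduces to the ordinary explained variance, β̂² / (α̂² + β̂²), so the correction only matters when responses are noisy.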

We used a different approach to estimate how much our TI metric for V4 neurons might be degraded by noise because TI is not a correlation coefficient and does not lend itself to the methods described above. In particular, for each V4 neuron tested with stimuli at multiple positions, we built an ideal model with perfect TI by taking the responses at the position that produced the greatest response and replicating them at the other positions, but scaling them to match the original mean at each RF position. We then used this set of sample means, which has TI = 1, to generate Poisson responses, simulating the original experiment 100 times and computing the TI value for each case. We took the average drop in TI (compared to 1) to be an estimate of the upper bound of how much the V4 neuron TI values could have been degraded by noise.
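The simulation just described can be sketched in Python as follows. The sketch assumes that the position producing the greatest response is the one with the largest mean response across stimuli, that the means are expressed as spike counts, and that every stimulus and position has the same number of trials; the function names are hypothetical, and the TI function is repeated so that the block runs on its own.

```python
import numpy as np


def translation_invariance(responses):
    """TI metric (Equation 4) for an (n_stimuli, n_positions) array."""
    cov = np.cov(np.asarray(responses, dtype=float), rowvar=False)
    sd = np.sqrt(np.diag(cov))
    off_diag = ~np.eye(cov.shape[0], dtype=bool)
    return cov[off_diag].sum() / np.outer(sd, sd)[off_diag].sum()


def ti_noise_drop(mean_counts, n_trials, n_sims=100, seed=0):
    """Average drop in TI (relative to 1) caused by Poisson variability.

    mean_counts : (n_stimuli, n_positions) mean spike counts of one neuron.
    """
    m = np.asarray(mean_counts, dtype=float)
    pos_means = m.mean(axis=0)
    best = np.argmax(pos_means)          # ASSUMPTION: greatest mean response
    template = m[:, best]
    # Replicate the best position's tuning at every position, rescaled so that
    # each position keeps its original mean; this ideal model has TI = 1.
    ideal = np.outer(template / template.mean(), pos_means)
    rng = np.random.default_rng(seed)
    ti_values = []
    for _ in range(n_sims):
        counts = rng.poisson(ideal, size=(n_trials,) + ideal.shape)
        ti_values.append(translation_invariance(counts.mean(axis=0)))
    return 1.0 - float(np.mean(ti_values))
```

Since the ideal model has TI = 1 by construction, any value below 1 in the simulated experiments is attributable to Poisson variability alone, which is why the average drop serves as an upper bound on the degradation.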

Visualization

To visualize the features that drove a particular unit in the CNN to its highest and lowest response levels, we first ranked all images (or image patches) based on the response of the unit to the standard test set of 50,000 images for AlexNet. For units in the convolutional layers, we considered the responses at all x-y locations for a particular unique kernel. Thus, we found not just the optimal image, but also the optimal patch within the image that drove the kernel being examined. We then performed a visualization technique on the 10 most excitatory images and on the 10 most suppressive images. We followed the methods of Zeiler and Fergus (2013), and used a deconvnet to project the response of the unit onto successive layers until we reached the input image. The deconvolved features can then be examined, as an RGB image, to provide a qualitative sense of what features within the image drove the unit to such a large positive or negative value.

References

  1. EH Adelson, JR Bergen (1991) The plenoptic function and the elements of early vision. In: M Landy, JA Movshon, editors. Computational Models of Visual Processing. Cambridge: MIT Press. pp. 3–20.
  8. J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang, E Tzeng, T Darrell (2014) DeCAF: a deep convolutional activation feature for generic visual recognition. ICML.
  16. I Goodfellow, H Lee, QV Le, A Saxe, AY Ng (2009) Measuring Invariances in Deep Networks. In: Y Bengio, D Schuurmans, JD Lafferty, CKI Williams, A Culotta, editors. Advances in Neural Information Processing Systems 22. Curran Associates, Inc. pp. 646–654.
  19. RM Haefner, BG Cumming (2009) An improved estimator of Variance Explained in the presence of noise. In: D Koller, D Schuurmans, Y Bengio, L Bottou, editors. Advances in Neural Information Processing Systems 21. Curran Associates, Inc. pp. 585–592.
  26. A Krizhevsky, I Sutskever, GE Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. In: F Pereira, CJC Burges, L Bottou, KQ Weinberger, editors. Advances in Neural Information Processing Systems 25. Curran Associates, Inc. pp. 1097–1105.
  28. Y LeCun, BE Boser, JS Denker, D Henderson, RE Howard, WE Hubbard, LD Jackel (1990) Handwritten Digit Recognition with a Back-Propagation Network. In: DS Touretzky, editor. Advances in Neural Information Processing Systems 2. Morgan-Kaufmann. pp. 396–404.
  36. A Nguyen, J Yosinski, J Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. Computer Vision and Pattern Recognition (CVPR’15), IEEE. https://doi.org/10.1109/CVPR.2015.7298640
  40. DV Popovkina, W Bair, A Pasupathy (2017) Representation of Outlines and Interiors in Primate Area V4. University of Washington.
  41. D Pospisil, A Pasupathy, W Bair (2016) Comparing the Brain’s Representation of Shape to That of a Deep Convolutional Neural Network. Proceedings of the 9th EAI International Conference on Bio-Inspired Information and Communications Technologies (Formerly BIONETICS). pp. 516–523.
  44. DE Rumelhart, GE Hinton, RJ Williams (1986) Learning Internal Representations by Error Propagation. In: DE Rumelhart, JL McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: Bradford Books. pp. 318–362.
  48. T Serre, L Wolf, T Poggio (2005) Object recognition with features inspired by visual cortex. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2005.254

Decision letter

  1. Eilon Vaadia
    Reviewing Editor; The Hebrew University of Jerusalem, Israel
  2. Joshua I Gold
    Senior Editor; University of Pennsylvania, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: the authors were asked to provide a plan for revisions before the editors issued a final decision. What follows is the editors’ letter requesting such plan.]

Thank you for sending your article entitled "'Artiphysiology' reveals V4-like shape tuning in a deep network trained for image classification" for peer review at eLife. Your article is being evaluated by two peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Joshua Gold as the Senior Editor. The reviewers have opted to remain anonymous.

Given the list of essential revisions, including new experiments, the editors and reviewers invite you to respond within the next two weeks with an action plan and timetable for the completion of the additional work. We plan to share your responses with the reviewers and then issue a binding recommendation.

Summary:

The reviewers who read this paper had quite different views. In my role as the reviewing editor, here I integrate and summarize the different impressions and comments and provide the authors with some recommendations to revise the paper and best respond to the comments to improve the manuscript, with my hope that it will be possible to reconsider publication of this study.

This paper examines the hidden units of deep networks trained on image classification in order to determine whether their tuning properties are similar to those that have been reported previously in primate visual area V4. The authors probe a pre-trained deep network (AlexNet, used in many other computational neuroscience studies) with simple closed-contour binary shape stimuli. These stimuli were used in previous impactful neurophysiology studies from this group to probe contour curvature and angle tuning in area V4.

The paper reports three main general results. First, it shows that some units within the deep network can be well described by a contour curvature model of V4 selectivity (the APC model) that was developed previously by the authors. Second, it shows that the deep network units well described by the APC model also exhibit position invariance, similar to that observed in real V4 neurons and reported in several previous studies. Third, the authors use a network visualization method to show that deep network units well described by the APC model are predicted to be selective for a range of complex natural images and that the best natural images evoke larger responses from the APC units than are obtained using the simpler stimuli.

This is a well written and clear paper. The computational analyses appear to be solid. Treating a deep network trained on a natural image classification task as an object of synthetic neurophysiological investigation is an interesting enterprise, and it is gratifying to see the correspondence between units in the deep network and V4 neurons that are well fit by the APC model.

The "major concerns" section below provides details of the major question that we have faced, which is: What is the added value of this well-written paper? We see it in two almost opposite ways:

1) The paper has a significant and general value in highlighting the potential advantages of computational models in interpreting neurophysiological data and cortical computations, as well as the potential for learning from neurophysiological data how to improve computational models.

2) The paper does not provide new insights. Instead, it is a reasonably good paper, oriented to a local community, that confirms the notion that hierarchical networks (either biological or artificial) end up representing similar hierarchical structure in natural images.

Essential revisions:

Most researchers agree that the critical questions regarding modeling of sensory systems at large are the practical ones: (1) What stimulus should be used in neurophysiology data collection? (2) How much data should be collected? (3) How should models be validated? (4) What modeling framework should we use? The authors are invited to clarify and explain how the paper relates to these questions.

At some level, it seems like this study has to work out the way that it did. As several recent neurophysiology modeling studies using deep networks have argued, both deep networks and the primate visual system were "trained" (or evolved) using natural images, and it is not surprising that both hierarchical networks end up representing similar hierarchical structure in natural images. From that point of view the results of the manuscript are unsurprising. The two major conclusions of the paper are that units in deep networks trained on classification and well fit by the APC model show selectivity for contour curvature and that they are locally position invariant. Both of these observations are completely consistent with an enormous amount of prior data. A wealth of prior research suggests that object borders are important for vision. Theoretical arguments suggest that object borders are a critical component of any viable representation of the objects in natural scenes, and many previous neurophysiology studies using both synthetic and natural images have shown that units in V2 and V4 are sensitive to object borders including curved borders. The demonstration that deep networks are positionally invariant is also not surprising. After all, the deep networks tested here are convolutional and therefore are designed to be locally position invariant. In fact, convolutional networks were inspired originally by the finding of position invariance in primate vision.

The important question, therefore, is not whether there is any correspondence between the tuning of visual neurons and the tuning of units in deep networks trained for image classification. Instead, the important questions concern the nature of this tuning, the distributional relationships between primate neurons and units in the deep networks and, ultimately, what that can tell us about the biophysical mechanisms underpinning the observed functional selectivity.

We would have been more enthusiastic if the analysis could reveal some new principles that could provide impetus to further experimental studies, or which could be used to help improve current models. For example, the current deep network framework might support some fundamentally new approach to interpreting functional responses, one that would provide a foundation for building a mathematical model that can explain observed functional properties in terms of known biophysical building blocks present in cortical neurons; or the deep networks might be used as the building blocks for such a model.

At this point, this is a well-done study that supports the current view that the primate visual system and deep networks trained on natural image classification both encode intermediate object structure such as contour curvature. However, this paper as written doesn't resolve controversies and doesn't provide information that would be useful for designing future experiments or models.

https://doi.org/10.7554/eLife.38242.023

Author response

[Editors' note: the authors’ plan for revisions was approved and the authors made a formal revised submission.]

[…] The "major concerns" section below provides details of the major question that we have faced, which is: What is the added value of this well-written paper? We see it in two almost opposite ways:

1) The paper has a significant and general value in highlighting the potential advantages of computational models in interpreting neurophysiological data and cortical computations, as well as the potential for learning from neurophysiological data how to improve computational models.

2) The paper does not provide new insights. Instead, it is a reasonably good paper, oriented to a local community, that confirms the notion that hierarchical networks (either biological or artificial) end up representing similar hierarchical structure in natural images.

Below we provide arguments, now highlighted in our manuscript, against three ideas articulated above: i) that our paper “does not provide new insights”, ii) that it is merely confirmatory, and iii) that it targets a local community.

i) Our manuscript offers the following novel insights, highlighted in our revised Discussion:

1) We show that there are one-to-one correspondences between AlexNet model units and single V4 neurons in terms of tuning for translation-invariant boundary curvature [Discussion, first paragraph]. This is despite the remarkable differences between the two systems in terms of architecture (e.g., lack of feedback in AlexNet), properties of individual units (e.g., no dedicated inhibitory neurons in AlexNet), training (limited visual environment of colony-raised animals) and behavioral tasks (no motor output requirements for AlexNet) [Discussion, second paragraph]. This correspondence at the level of single units was unexpected and was not pursued in previous studies (e.g., Yamins et al., 2014) [Discussion subsection “Comparison to previous work”, first paragraph].

2) Our results suggest that comparing properties of single units in AlexNet with cortical neurons can be a powerful tool to clarify complex processing beyond the earliest stages in primate cortex. We now emphasize that single-neuronal selectivity described by electrophysiologists might be questioned, but can gain more traction once similar selectivity is seen within an artificial network trained on a large bank of naturalistic inputs [Discussion subsection “Value of artiphysiology”]. Our paper highlights a general way forward for systems neuroscience research.

3) Despite the CNN’s apparent complexity, form selectivity of some units in AlexNet can be described by a simple 5-parameter model. This is a novel strategy that complements the traditional approach of unit visualization, helping to defeat the notion of CNNs as mysterious black boxes. [Discussion subsection “Comparison to previous work”​]

4) Our work shows that single units of AlexNet achieve V4-like translation invariance only after extensive training (our Figure 8B and Discussion ‘Training and translation invariance’). This will help to dispel the expectation that units in deep networks are translation invariant by design. Our paper can move the field beyond this misconception so that investigation can begin in earnest to understand how units within complex networks, including the brain, achieve translation invariance; we now emphasize this [Discussion subsection “Training and translation invariance”].

5) Our results speak to the artificial vs. natural stimulus debate. They reveal that artificial stimuli can be quite diagnostic even in systems trained only on natural stimuli and even in deep layers of the system where the artificial stimuli may drive only a small fraction of the response dynamic range. [Discussion subsection “Value of artiphysiology”​]

6) Our study identifies the first image-computable, biologically plausible circuit model that quantitatively matches the boundary shape tuning and translation invariance (TI) in V4 [Discussion, first paragraph]. We thus provide a concrete, openly available tool to test complex theories of image-statistic vs. object-centered encoding in V4 (Ziemba and Freeman, 2015).

ii) Our results are not merely confirmatory​: there is little consensus about processing in area V4 and there are major differences between the primate brain and CNNs in terms of architecture and training. Thus, there is not a solid basis to predict, at the single-unit level, the outcome of a comparison of form selectivity between deep nets and area V4.

1) There is currently no accepted model among electrophysiologists of what V4 does; therefore, to say that one knows in advance that a particular single-neuron selectivity in V4 (here, selectivity for boundary shape of simple forms) will emerge from a given artificial network is not tenable. Neuroscientists do not yet agree on what V4 neurons do, and different studies use widely different experimental paradigms and propose different circuit models that make different and sometimes directly contradictory predictions. This fact is demonstrated nicely by the current review, which questions whether the APC model is really any good at describing important properties of V4 neurons, and whether it is even valid to use in our study. This reflects the reality that we face regularly, and it is based on disagreements in published papers, public talks and anonymous critique that demonstrate the overall feeling of doubt that anybody understands what is going on in an area like V4. This view conflicts with the idea that, “we already know what features are represented in single units in the brain and in units in deep nets (and we know these are the same)”. So which is it: do we already know all of this, or are we confused about basic V4 selectivity? Because of this lack of clarity about V4, we believe that our paper will be of great interest and spur important debate in the community. This highlights an important insight of our paper: that single-neuronal selectivity described by electrophysiologists might seem questionable, but can gain more traction once it is seen that such selectivity can arise within an artificial network trained on a large bank of naturalistic inputs; we now discuss this point in our revised manuscript [Discussion subsection “Value of artiphysiology”]. Importantly, this is not local to V4: the circuits and roles of neurons in V2, V3, IT, and beyond the visual cortex are not agreed upon either. Thus, our paper highlights a general way forward in neuroscience.

2) There is wide disagreement about the validity of deep convolutional networks as robust models of human visual function. Many see these models as using tricks of image statistics to achieve good results – for example, journal articles highlight adversarial images, where deep nets fall down, and cases where artificial networks can fail spectacularly in challenging situations that are trivial for the primate visual system. The argument has been made that the primate brain depends on feedback and structural knowledge of the world, and these elements are not yet properly incorporated into the design and training of artificial networks. Primates have different training inputs and different tasks compared to the type of deep net studied here: they manipulate objects and experience them in 3D and over time, whereas AlexNet does not; we now include this point in our manuscript ​[​Discussion, second paragraph​]​. We and others have considered that border ownership and boundary shape selectivity are critical for organisms that must form their hands to grasp objects with great certainty, or perish. It remains uncertain going forward to what extent we should expect specific representations at the single unit level to align across such diverse systems. This argues against the predictability of what types of specific neuronal selectivity might be shared between the brain and deep nets.

3) There is wide agreement that deep nets are difficult to understand and provide little insight in spite of their amazing performance on specific tasks. This contradicts the notion that one already knows what to expect when examining the internal representation of single units in the deep net. Here we show that a simple, insightful 5-parameter model can be related to parts of a deep net [Discussion​ subsection “Comparison to previous work”]​. Previous studies that have compared responses have not gone down to the single unit level, and thus have left the mystery of the black box intact.

4) If another study comes out next month where it is found that a major property of V4 neurons (or of neurons in V2 or IT) is not present in the deep net (e.g., blur tuning, accidental contour discounting, color-invariant shape tuning, etc.), can one not equally well then say, “Of course this is not surprising because deep nets and the brain are so fundamentally different: different architecture, different sensory inputs and different tasks?” The fact that the opposite view (to that in Summary #2 above) could also be backed up by a large set of viable arguments suggests that our results cannot be taken as known in advance.

5) A phrase is repeated in the summary and below, “that hierarchical networks (either biological or artificial) end up representing similar hierarchical structure in natural images.” To argue that this obviates our results would imply that one has accepted the idea that the visual system is best described as hierarchical and that we know what features are represented therein. However, this is not the case. While there is general consensus that boundaries (“edges”) and textures in images are critical, it is still unknown what features are encoded beyond V1, how visual receptive fields are built, how selectivity actually arises in a circuit, and how best to conceptualize visual processing. There is still deep debate as to the function of feedback, and it is actively investigated whether predictive coding is a better model for the cortex at many levels. Visual neuroscientists work in a world where currently very little is established and agreed upon about circuitry and single neurons in mid-level processing. The phrase quoted above about hierarchy presents a broad, classical notion that has been a useful point of reference but has remained under debate for fifty years. Currently, vast amounts of funding and scientific effort are aimed at trying to understand cortical circuitry and function. This argues against the ideas that (a) we already know the ventral stream is a hierarchy like a feedforward deep net, (b) we know which sensory features are represented throughout, and (c) we have already uncovered image-computable models to explain visual representation that aligns with that in neuronal circuitry.

iii) ​Our paper does not merely target a local community​. We believe that our paper takes a novel approach in connecting two substantial communities – electrophysiologists who depend upon simple and limited stimulus paradigms and the machine learning community that aims to improve intelligent performance in artificial systems. Only recently have electrophysiologists had access to artificial networks that rival their physiological model systems in terms of having daunting complexity and strikingly good “behavioral” performance. Our results suggest to electrophysiologists that they may be able to harness deep nets for single-unit level comparisons in terms of neural representation. To the machine learning community, it reinforces the notion that the brain remains an important system for comparison and guidance, and that the craft of interpreting the innards of black boxes, honed by electrophysiologists, might be usefully applied to make sense of internal representation in deep nets. Our paper should have broad interest beyond local communities because it goes toward the question of the degree to which man-made intelligent systems can end up looking like the brain at a fine scale. We believe our paper both connects and transcends local communities. We now emphasize this starting with the opening line of our Abstract and by making specific sections to explain what electrophysiologists can learn and what the machine learning community can learn [​Discussion subsection “Value of artiphysiology”​].

Essential revisions:

Most researchers agree that the critical questions regarding modeling of sensory systems at large are the practical ones: (1) What stimulus should be used in neurophysiology data collection? (2) How much data should be collected? (3) How should models be validated? (4) What modeling framework should we use? The authors are invited to clarify and explain how the paper relates to these questions.

We highlight below how our paper relates to these four questions and will emphasize the main points within our revised manuscript. Due to the open-ended, general nature of this invitation, we attempt to keep our replies brief to focus more on specific criticisms further below.

1) What stimuli to use? A well-known debate in neurophysiology is whether to use artificial or natural stimuli (e.g., Rust and Movshon, 2005). Our artificial shape stimuli were substantially different from the training images for AlexNet, thus one could expect there to be little relationship between responses of the units to shapes and responses to natural images. We found a close relationship in many cases where the shape tuning of the units was reflective of their shape tuning over natural stimuli (Figures 11-12). Even for units at the deepest level of the network, responses to the artificial electrophysiological stimuli could correlate strongly with the output category. There were also cases where characterization with the artificial shapes did not appear to relate to that with natural images, and this provides concrete examples for further research to gain insight into the determining factors [Discussion subsection “Visualization of V4-like CNN units”].

More importantly, by identifying an image-computable model for V4 boundary-form selective units, we offer the community an opportunity to test and optimize their stimuli on a model that is better than any V4 form-selective model as far as we know [​Discussion subsection “Comparison to previous work”​]. By showing not only which stimuli best drive these units, but also which most suppress them, we provide completely novel insight into stimulus design to test opponency and inhibition in mid-level form processing. We are using these novel CNN-unit models to design stimulus selection algorithms in our electrophysiological lab, and have just collected data from our first V4 units using this method. Such methods are generally of interest to others (e.g., Cowley et al., 2017, NIPS).

These models are also useful for developing stimulus sets to quantitatively understand form vs. texture tuning at the level of single units and circuits, and can be used to operationalize the difference between the notion of selectivity for things vs. stuff (Adelson, 1982), e.g., for object boundaries vs. general image statistics [Discussion subsection “Value of artiphysiology”].

2) How much data to collect? Our paper relates to this in several ways.

A) We show that TI can be estimated along one axis of translation (horizontal) and this metric has strong correlation along the other axis (vertical), and with 2D translation [​Results subsection “Translation Invariance”, last paragraph​]. This is important for studies that try to understand TI in the cortex (another debated area: e.g., Sharpee et al., 2013; Rust and DiCarlo, 2010), where stimulus sets may need to be large to have ample diversity but then cannot be repeated in their entirety across a dense 2D grid.

B) Our simulations and noise-correction procedures [Materials and methods​ subsection “Estimating the effect of the stochastic nature of neuronal responses”] show that with ~5 stimulus repeats, our R-values for APC model fits are under-estimated only by a small amount [now shown as pink line added into Figure 5G].

C) We provide novel V4 models as tools for the community to estimate when they have collected enough different stimulus dimensions to have fully understood a unit in a complex, non-linear network.

3) How should models be validated? We add substantively to a larger discussion on model validation. Prior studies on the effectiveness of deep neural networks for modeling the nervous system have not provided insight at the level of single neurons and electrophysiological selectivity (see for example Yamins et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014). Their approaches emphasized explained variance for population-level fits. Here we take a different approach by choosing specific known response properties of single units (translation invariance and shape selectivity) and determining whether units in an artificial net achieve these actual neural response properties. Furthermore, we use visualization of unit responses to natural images to qualitatively validate whether the representation that these response properties are intended to capture (an object-centered representation of boundary) does in fact hold across natural images. We believe this level of model validation, which includes quantitative and conceptual agreement with documented neuronal selectivity, pushes the field significantly beyond what has been done; we now state this in the revised manuscript [Discussion subsection “Comparison to previous work”]. Our results allow modelers to focus on specific neural selectivities and work with concrete, identified circuits that have biologically plausible components, which we now emphasize. Furthermore, the units in AlexNet that we point to are publicly available, image-computable models, thus allowing maximal transparency for others to validate or invalidate the models by any method they choose.

4) What modeling framework to use​? We have used two fundamentally different modeling frameworks in this study: the APC model, a simple descriptive model, and AlexNet, an image-computable statistical machine-learning model, and from this we have identified elements that can be extracted and analyzed to drive modeling at the circuit and biophysically plausible level. We now clarify in the Discussion [​second paragraph in subsection “Comparison to previous work”​] how using modeling frameworks at different levels of complexity (the 5-parameter APC model vs. the elaborate CNN) generates insight. We discuss how the combination of modeling frameworks is crucial, where an image computable model sheds light on potential circuit-level implementation, while the functional APC model captures the abstract representations (e.g., object-centeredness) theorized to be encoded and provides broad interpretation of neural responses. Because AlexNet is image-computable and feedforward, it allows one to extract and analyze the modular upstream circuitry driving identified units. These excerpts can then be compared to the best attempts at hand-made circuits (e.g., H-Max and many other cortical models of visual neuroscientists), refined to reach any desired level of biological plausibility, and analyzed to extract novel general principles. We believe this will be an important way forward in understanding form selectivity and translation invariance in the visual system.
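As a rough illustration of how compact such a descriptive model can be, the sketch below implements one plausible 5-parameter angular-position-and-curvature tuning function: a Gaussian in curvature multiplied by a von Mises-like factor in angular position, evaluated at each boundary point and max-pooled over the shape. The functional form, the choice of a gain as the fifth parameter, the name `apc_response`, and the toy shapes are all assumptions made for illustration; the Materials and methods of the manuscript give the definition actually used.

```python
import math

def apc_response(points, mu_c, sigma_c, mu_a, kappa_a, gain):
    """Hypothetical APC-style response to one shape.

    points: iterable of (curvature, angular_position_in_radians) pairs
            sampled along the shape boundary.
    The five parameters (assumed, for illustration): preferred curvature
    mu_c, curvature width sigma_c, preferred angular position mu_a,
    angular concentration kappa_a, and an overall gain.
    """
    best = 0.0
    for curv, ang in points:
        # Gaussian tuning for boundary curvature.
        curv_term = math.exp(-((curv - mu_c) ** 2) / (2 * sigma_c ** 2))
        # Circular (von Mises-like) tuning for angular position, peak = 1.
        ang_term = math.exp(kappa_a * (math.cos(ang - mu_a) - 1.0))
        # Max-pool over boundary points.
        best = max(best, curv_term * ang_term)
    return gain * best

# Toy shapes: one sharp convexity (curvature 1.0) on an otherwise
# shallow boundary, placed at the top or at the bottom of the shape.
shape_top = [(1.0, math.pi / 2), (0.1, 0.0), (0.1, math.pi), (0.1, -math.pi / 2)]
shape_bottom = [(1.0, -math.pi / 2), (0.1, 0.0), (0.1, math.pi), (0.1, math.pi / 2)]

# A unit preferring a sharp convexity at the top of the shape.
params = dict(mu_c=1.0, sigma_c=0.2, mu_a=math.pi / 2, kappa_a=4.0, gain=2.0)
r_top = apc_response(shape_top, **params)
r_bottom = apc_response(shape_bottom, **params)
```

In this toy case the unit responds at its full gain to the shape whose convexity sits at the preferred angular position and only weakly to the same shape rotated by 180 degrees, which is the kind of object-centered selectivity the descriptive model is meant to summarize.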

At some level, it seems like this study has to work out the way that it did. As several recent neurophysiology modeling studies using deep networks have argued, both deep networks and the primate visual system were "trained" (or evolved) using natural images, and it is not surprising that both hierarchical networks end up representing similar hierarchical structure in natural images. From that point of view the results of the manuscript are unsurprising.

For general arguments as to why our results did not have to work out the way they did, see our replies to the Summary above. To this more specific point, it is rational to expect that similar systems trained on similar inputs for similar tasks could achieve similarity in representation. But, before deciding whether CNNs and the macaques in this study can be considered sufficiently similar systems, several specific questions would need to be answered. Namely,

1) How similar is the macaque to AlexNet in terms of simple visual input? The macaques in the study were raised in an animal colony, and thus in a limited visual environment. They never saw ImageNet images and probably never saw even a single instance of the overwhelming majority of the 1000 image categories of AlexNet. They did not see the forest, the ocean, the sky or other important contexts for AlexNet categories. These animals were never exposed to natural images as part of an experiment or training. We are aware of no scientific study that estimates how close the input statistics are for such macaques and AlexNet, and it would be an oversimplification to say there was one set of natural scene statistics that caused all visual systems to develop the same way (there is great diversity in biological visual systems).

2) How similar are the training signals? The macaque visual system could be shaped by the need to physically interact in real time in a 3D dynamic world. It is unknown how much influence this might have on representation. AlexNet does not interact with the world that generated the images that it senses, nor is it even given information about the location of object boundaries in its images for the categorization task for which it was trained.

3) How do the vastly different architectures of the macaque and AlexNet alter single-unit representation? AlexNet does not have a retina or LGN nor feedback from higher areas, nor dedicated excitatory and inhibitory neurons, nor does it have to compute with action potentials. It is unknown how these elements influence single-unit representation in an appropriate context.

4) Will a small set of artificial stimuli from the electrophysiology lab be sufficient to probe a network that was never trained on such impoverished and noise-free inputs? Overall, without knowing in advance a lot more about the answers to such questions, we do not see how one could scientifically predict our results about the similarity of mid-level single-unit representations across these two very different systems. We discuss these points in the revised manuscript [Discussion, second paragraph].

For example, Yamins et al. (2014) in part motivate the regression of many units’ responses in the CNN onto multi-unit data in cortex by arguing it would be unlikely to see one-to-one correspondences between CNN units and neurons in the brain. This argument assumed a highly distributed representation in which two axes (single units) would be unlikely to align, so that a shared representation could be discovered only by looking at a population. Our results add to this debate by opening the possibility that a match could be made at the level of single-unit selectivity.

The two major conclusions of the paper are that units in deep networks trained on classification and well fit by the APC model show selectivity for contour curvature and that they are locally position invariant. Both of these observations are completely consistent with an enormous amount of prior data.

To clarify, we show that many units can fit the boundary curvature (APC) model for stimuli centered in the RF. But not all of these units are locally translation invariant. When the additional criterion of translation invariance (TI) is added, then fewer units are good fits to the APC/TI model (relative to V4 selectivity). When visualization and response dynamic range are also considered, only a minority of units appear to match the APC/TI model quantitatively and conceptually. We added additional clarification in the text that this match only occurs for some units in the middle layers [Discussion, first paragraph​].

No study has ever fit a quantitative model of boundary curvature tuning to units in AlexNet, and no study has ever systematically estimated TI for units as we have done (using a systematic set of stimuli like those in electrophysiology and with a metric that controls for RF size and other factors). Studies of translation invariance in these nets have had mixed and conflicting results, as we report in the Discussion [subsection “Other studies of TI in CNNs”]. Deep net studies disagree about TI across layers, and V4 studies disagree about TI in neurons. Thus, our observations are not ‘completely consistent with an enormous amount of prior data’ as far as the published literature is concerned (see below for further details). Nevertheless, our results are consistent with some conclusions of previous work, and because all correct studies should be consistent with each other, this is not a shortcoming of our work. Overall, there is no other study in the category of our study, comparing single-unit selectivity for mid-level vision in the macaque and a CNN [Discussion subsection “Comparison to previous work”].

A wealth of prior research suggests that object borders are important for vision. Theoretical arguments suggest that object borders are a critical component of any viable representation of the objects in natural scenes, and many previous neurophysiology studies using both synthetic and natural images have shown that units in V2 and V4 are sensitive to object borders including curved borders.

We generally agree with these statements; they are not criticisms, but premises, of our work aiming to find image-computable models to advance our understanding of the primate visual system. In this context it is interesting to note the ideas of Movshon and Simoncelli (2014) and Ziemba and Freeman (2015), who argue that the whole demonstration of border selectivity could just be a result of selectivity for higher order image statistics. Thus, there is still disagreement about how such basic features are processed in the cortex, and having deep nets as useful working models could add substantially to this debate and could ultimately resolve it. We have revised our manuscript to emphasize these points [Discussion subsection “Value of artiphysiology”].

The demonstration that deep networks are positionally invariant is also not surprising. After all, the deep networks tested here are convolutional and therefore are designed to be locally position invariant. In fact, convolutional networks were inspired originally by the finding of position invariance in primate vision.

While it is a commonly held belief that deep networks are translation invariant by design, we find that they only achieve actual translation invariance in their units after extensive training (see Figure 7B and Discussion ‘Training and translation invariance’). It is important that this commonly held belief is dispelled so that investigation can begin in earnest to understand how units within complex networks including the brain achieve translation invariance. We now emphasize this point in our revised manuscript [​Discussion​ subsection “Other studies of TI in CNNs”].

Our results demonstrate that TI of single-unit form selectivity tends to increase with depth in the network but is highly variable from unit to unit even within a layer. We discuss in “Other studies of TI in CNNs” how prior work published in the computer vision literature reports conflicting results as to the progression of invariance in layers of CNNs. We believe differences in reported TI across layers are the result of a lack of consistent definitions and an absence of careful controls. Again, this shows that the results we present are not simply ‘completely consistent with an enormous amount of prior data’ as far as the published literature is concerned. We appreciate being made aware of the study by Güçlü and van Gerven (2015) that used an invariance metric in a deep network. Their measure is fundamentally different from our TI: they fit a Gaussian surface to responses of units to translations of a single preferred stimulus, then take the median extent of the Gaussian across units in a layer as a measure of invariance for the layer. This metric is more similar to an electrophysiological estimate of average receptive field width across units in a layer; it does not measure the consistency of form selectivity to diverse stimuli across space. Our TI metric was designed to do the latter, in keeping with notions from other electrophysiological studies. We now discuss their study in our revised manuscript [Discussion subsection “Other studies of TI in CNNs”].
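To make the distinction concrete, the following is a minimal sketch of a covariance-based index of the kind described here: it asks whether the pattern of responses across a set of shapes is preserved from one position to the next, instead of asking how far across space a single preferred stimulus still drives the unit. The particular formula (summed between-position covariances divided by the summed products of standard deviations), the name `translation_invariance`, and the toy responses are assumptions for illustration and should not be read as the exact definition in the manuscript.

```python
from itertools import combinations
from statistics import mean

def _cov(x, y):
    """Population covariance of two equal-length response lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def translation_invariance(responses):
    """Hypothetical TI index.

    responses[p][s] is the response at position p to stimulus s.
    Returns the sum of between-position covariances divided by the sum
    of the matching products of standard deviations, so positions where
    responses are barely modulated carry little weight.
    """
    num = 0.0
    den = 0.0
    for p, q in combinations(range(len(responses)), 2):
        num += _cov(responses[p], responses[q])
        var_p = _cov(responses[p], responses[p])
        var_q = _cov(responses[q], responses[q])
        den += (var_p * var_q) ** 0.5
    return num / den if den > 0 else 0.0

# Unit A: same shape preferences at every position, only the gain changes.
unit_a = [[1, 2, 3, 4], [2, 4, 6, 8], [0.5, 1, 1.5, 2]]
# Unit B: responds just as strongly everywhere (a "wide" receptive field),
# but its shape preferences reverse at the middle position.
unit_b = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 2, 3, 4]]

ti_a = translation_invariance(unit_a)  # close to 1: selectivity is preserved
ti_b = translation_invariance(unit_b)  # negative: selectivity is not preserved
```

Both toy units respond over the same spatial extent, so a receptive-field-width measure would treat them alike, whereas an index of this form separates them.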

The important question, therefore, is not whether there is any correspondence between the tuning of visual neurons and the tuning of units in deep networks trained for image classification. Instead, the important questions concern the nature of this tuning, the distributional relationships between primate neurons and units in the deep networks and, ultimately, what that can tell us about the biophysical mechanisms underpinning the observed functional selectivity.

We generally agree. Our study is focused on the nature of form tuning in V4 and in AlexNet. We test whether the nature of the encoding for form is similar in terms of a specific model of object-centered boundary curvature derived from electrophysiology. We ask whether, in the eyes of a scientist using common experimental paradigms, units in the network could be indistinguishable from neurons in the macaque brain, and we characterized the distribution of those units within and across layers in the network. We highlight specific units for further study, thus providing the entire community of experimentalists and modelers with the first set of specific, novel circuit models for a type of neuronal form selectivity (Table 1). Other studies have reported more general observations and have not looked at the nature of specific single-neuron representations, or have simply focused on other questions. Figures 5G and 7A in our manuscript report the distribution of curvature-tuned and translation-invariant units, respectively, across layers of AlexNet.

As for biophysical mechanisms, the operations in a deep neural network are generally biologically plausible (addition, multiplication, division, max-pooling) and have been used by neuroscientists attempting to hand-tune cortical models (e.g., the HMax model and various normalization models). Because of this, the units that we have identified, which are publicly available in a well-documented system (Caffe), can be used to make predictions about cortical circuits in terms of the number and types (of selectivity) of units that could be combined to achieve selectivity and invariance in V4. This can guide predictions about how features are linked together from V1 to V2 to V4. We are currently using the models identified in our paper to work backwards and forwards in the deep net to understand what computations are critical to achieving V4-like boundary shape selectivity. With respect to TI, we are just beginning to understand what properties of kernel weights, receptive field overlap, and convergence are critical to matching the physiological data. We describe these efforts in our manuscript [Discussion subsection “Further work”].
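As a rough illustration of the operations named above, the sketch below chains a weighted sum, a threshold, max-pooling, and a divisive normalization on a one-dimensional toy input. The function names, the kernel, and the particular normalization formula are assumptions made for this example; AlexNet's own normalization layer uses a different formula, and nothing here reproduces a unit from the paper.

```python
import numpy as np


def weighted_sum(signal, kernel):
    # Addition and multiplication: slide the kernel over the input.
    return np.correlate(signal, kernel, mode="valid")


def threshold(x):
    # Rectification: negative drive produces no output.
    return np.maximum(x, 0.0)


def max_pool(x, size, stride):
    # Keep the largest response within each local window.
    starts = range(0, len(x) - size + 1, stride)
    return np.array([x[s:s + size].max() for s in starts])


def divisive_normalization(x, sigma=1.0):
    # Division: scale each response by the pooled activity (assumed form).
    return x / (sigma + np.sqrt(np.mean(x ** 2)))


signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])  # toy input with one bump
kernel = np.array([1.0, -1.0])                      # hypothetical edge-like weights

drive = weighted_sum(signal, kernel)                # [0, -1, 1, 0, 0]
rectified = threshold(drive)                        # [0, 0, 1, 0, 0]
pooled = max_pool(rectified, size=3, stride=2)      # [1, 1]
normalized = divisive_normalization(pooled)         # [0.5, 0.5]
```

Positive and negative kernel weights stand in for excitation and inhibition here, which is the sense in which such a stage could later be mapped onto a more detailed circuit model.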

We would have been more enthusiastic if the analysis could reveal some new principles that could provide impetus to further experimental studies, or which could be used to help improve current models. For example, if the current deep network framework could support some fundamentally new approach to interpreting functional responses that would provide a foundation for building a mathematical model that can explain observed functional properties in terms of known biophysical building blocks present in cortical neurons, or if the deep networks could be used as the building blocks for such a model.

Our paper does help to improve current models of V4. Specifically, we provide a novel set of models (i.e., clearly identified units and their supporting circuits in AlexNet) that are the best and only models for a mid-level, single-neuron, visual form selectivity that meet critical quantitative criteria used in electrophysiology. This opens up many lines of hypothesis testing, model refinement and stimulus selection. In terms of biophysical building blocks, synaptic weights, thresholds, pooling, inhibition and excitation (positive and negative weights), and normalization are plausible operations at the circuit and system level, and these can be fairly easily translated into more refined biologically plausible models (e.g., E-I spiking circuits). We now emphasize these points in the revised manuscript [Discussion, subsection “Comparison to previous work”].

At this point, this is a well-done study that supports the current view that the primate visual system and deep networks trained on natural image classification both encode intermediate object structure such as contour curvature. However, this paper as written doesn't resolve controversies and doesn't provide information that would be useful for designing future experiments or models.

Resolving controversies: Our paper provides novel insight bearing on several important controversies listed below. For a description of the value of our study in identifying novel, useful models and designing experiments, please see our replies above.

Natural vs. artificial stimuli: We show that artificial stimuli can be quite diagnostic even in systems only trained on natural stimuli and even in deep layers of the system where the artificial stimuli may drive only a small fraction of response dynamic range. We directly show that artificial and natural stimuli can give highly consistent impressions of mid-level shape selective units in a complex non-linear network. We now discuss how this paper contributes to this debate in the revised manuscript [Discussion subsection “Value of artiphysiology”].

Object boundaries vs. ‘stuff’: By identifying the first image-computable, biologically plausible circuit model that quantitatively matches the boundary shape tuning and TI in V4, we provide a concrete, openly available tool to test complex theories of image statistics vs. representations of object boundaries. Movshon and Simoncelli (2014) describe this debate: “In brief, the concept is that the visual system is more concerned with the representation of the “stuff” that lies between the edges, and less concerned with the edges themselves (Adelson and Bergen, 1991).” We show how our paper contributes to this debate in the revised manuscript [Discussion subsection “Value of artiphysiology”].

Is translation invariance built-in or learned?: As noted above, it is a widely held belief that translation invariance is built into convolutional neural networks. Our work helps to move beyond this belief by demonstrating the importance of measuring TI for single units. We emphasize this important point in our revised manuscript [Discussion subsection “Training and translation invariance”].

Do deep nets develop representations like those in the brain?: Our paper is the first to show a single-unit level correspondence for mid-level visual form processing between deep nets and the brain [Discussion subsection “Comparison to previous work”].

https://doi.org/10.7554/eLife.38242.024

Article and author information

Author details

  1. Dean A Pospisil

    Department of Biological Structure, Washington National Primate Research Center, University of Washington, Seattle, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Funding acquisition, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    deanp3@uw.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-5793-2517
  2. Anitha Pasupathy

    1. Department of Biological Structure, Washington National Primate Research Center, University of Washington, Seattle, United States
    2. University of Washington Institute for Neuroengineering, Seattle, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-3808-8063
  3. Wyeth Bair

    1. Department of Biological Structure, Washington National Primate Research Center, University of Washington, Seattle, United States
    2. University of Washington Institute for Neuroengineering, Seattle, United States
    3. Computational Neuroscience Center, University of Washington, Seattle, United States
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared

Funding

National Science Foundation (Graduate Research Fellowship)

  • Dean A Pospisil

National Science Foundation (CRCNS Grant IIS-1309725)

  • Anitha Pasupathy
  • Wyeth Bair

Google (Google Faculty Research Award)

  • Wyeth Bair

National Institutes of Health (Grant R01 EY-018839)

  • Anitha Pasupathy

NIH Office of Research Infrastructure Programs (Grant RR-00166 to the Washington National Primate Research Center)

  • Anitha Pasupathy

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was funded by a National Science Foundation (NSF) Graduate Research Fellowship (DAP), a Google Faculty Research Award (WB and AP), the National Science Foundation CRCNS Grant IIS-1309725 (WB and AP), National Institutes of Health (NIH) Grant R01 EY-018839 (AP), NIH Office of Research Infrastructure Programs Grant RR-00166 to the Washington National Primate Research Center (AP), and NIH Grant R01 EY-027023 (WB). We thank Yasmine El-Shamayleh for sharing V4 data. We thank Blaise Aguera y Arcas for helpful suggestions and advice.

Ethics

Animal experimentation: All animal procedures for this study, including implants, surgeries and behavioral training, conformed to NIH and USDA guidelines and were performed under an institutionally approved protocol at the Johns Hopkins University (Pasupathy and Connor, 2001) protocol #PR98A63 and the University of Washington (El-Shamayleh and Pasupathy, 2016) UW protocol #4133-01.

Senior Editor

  1. Joshua I Gold, University of Pennsylvania, United States

Reviewing Editor

  1. Eilon Vaadia, The Hebrew University of Jerusalem, Israel

Publication history

  1. Received: May 14, 2018
  2. Accepted: December 17, 2018
  3. Accepted Manuscript published: December 20, 2018 (version 1)
  4. Version of Record published: January 16, 2019 (version 2)

Copyright

© 2018, Pospisil et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 967
    Page views
  • 184
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
