(A) Generative IT model based on real IT population responses. Top left box: schematic illustration of the neuronal representation space of the IT population with a multi-dimensional Gaussian (MDG) model. Each point of the Gaussian cloud is one IT neural site. Middle box: an example of a simulated IT neural site. The distribution of object preference for all 64 objects is created by a sample randomly drawn from the MDG (highlighted as a red dot; each color indicates a different object). Then, a size-tuning kernel is randomly drawn from a pool of size-tuning curves (upper right box; kernels fit to real IT data) and multiplied by the object response distribution (outer product), resulting in a fully size-tolerant (i.e., separable) neural response matrix (64 objects × 3 sizes). To simulate the final mean response to individual images with different backgrounds, we added a ‘clutter’ term to each element of the response matrix (σ²_clutter; see Materials and methods). To simulate the trial-by-trial ‘noise’ in the response trials, we added a repetition variance (σ²_repeats; see Materials and methods). Bottom box: another example of a simulated IT site. (B) Response distance matrices for neuronal responses from real IT neuronal activity (n = 168 sites) and one simulated IT population (n = 168 model sites) generated from the model. Each matrix element is the distance of the population response between pairs of objects as measured by Pearson correlation (64 objects, 2016 pairs). (C) Similarity of the model IT response distance matrix to the actual IT response distance matrix. Each dot represents one unique element of the two matrices (n = 2016 object pairs), calculated for the real IT population sample and the model IT population sample (r = 0.93 ± 0.01). (D) Determination of the two hyperparameters of the IT-to-behavior-linking model. 
Each panel shows performance (d’) as a function of the number of recording sites (training images fixed at m = 20) for model (red) and real IT responses (black) for two object discrimination tasks (task 1 is easy, human pre-exposure d’ is ~3.5; task 2 is hard, human pre-exposure d’ is ~0.8; indicated by dashed lines). In both tasks, the number of IT neural sites needed for the IT-to-behavior decoder to match human performance is very similar (n ~ 260 sites), and this was also true for all 24 tasks (see E), demonstrating that a single set of hyperparameters (m = 20, n = 260) could explain human pre-exposure performance over all 24 tasks (as previously reported by Majaj et al., 2015). (E) Consistency between human performance and model IT-based performance across 24 different tasks for a given pair of parameters (number of training samples m = 20 and number of recording sites n = 260). The consistency between model prediction and human performance is 0.83 ± 0.05 (Pearson correlation ± SEM). (F) Manifold of the two hyperparameters (number of recording sites and number of training images) where each such pair (each dot on the plot) yields IT-based performance that matches initial (i.e., pre-exposure) human performance (i.e., each pair yields a high consistency match between IT model readout and human behavior, as in E). The dashed line is an exponential fit to those dots.
... at any of the three sizes as the outer product of the object and size-tuning curves (A, bottom). However, since most measured size-tuning curves are not perfectly separable across objects (DiCarlo et al., 2012; Rust and DiCarlo, 2010) and because the tested conditions included arbitrary backgrounds for each condition, we introduced independent clutter variance caused by backgrounds on top of this for each size of an object (A) by randomly drawing from the distribution of variance across different image exemplars for each object. 
We then introduced trial-wise variance for each image based on the distribution of trial-wise variance of the recorded IT neural population (Figure 3—figure supplement 1E). In sum, this model can generate a new, statistically typical pattern of IT response over a population of any desired number of simulated IT neural sites to different image exemplars within the representation space of 64 base objects at a range of sizes (here targeting ‘small,’ ‘medium,’ and ‘big’ sizes to be consistent with human behavioral tasks; see Materials and methods for details). The simulated IT population responses were all constrained by recorded IT population statistics (Figure 3—figure supplement 1). These statistics define the initial simulated IT population response patterns, and thus they ultimately influence the predicted unsupervised neural plasticity effects and the predicted behavioral consequences of those neural effects.
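The generative steps described above (an object-preference draw from the MDG, an outer product with a size-tuning kernel, then clutter and repetition variance) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `simulate_site`, the zero mean and random covariance standing in for the fitted MDG, the three-entry kernel pool, the fixed scalar variances (the text draws these from distributions measured in real IT data), the exemplar and repetition counts, and the use of 1 − r as the correlation distance are all assumptions made for illustration.

```python
import numpy as np

N_OBJECTS, N_SIZES = 64, 3

def simulate_site(rng, obj_mean, obj_cov, kernel_pool,
                  clutter_var, repeat_var, n_exemplars=4, n_repeats=10):
    """Simulate one model IT site (hypothetical helper, illustrative only)."""
    # Object preference: one random draw from the multi-dimensional Gaussian.
    obj_pref = rng.multivariate_normal(obj_mean, obj_cov)      # (64,)
    # Size tuning: one kernel drawn at random from the pool.
    kernel = kernel_pool[rng.integers(len(kernel_pool))]       # (3,)
    # Fully separable (size-tolerant) response matrix: objects x sizes.
    separable = np.outer(obj_pref, kernel)                     # (64, 3)
    # Clutter: shift the mean response of each image exemplar.
    clutter = rng.normal(0.0, np.sqrt(clutter_var),
                         size=separable.shape + (n_exemplars,))
    image_means = separable[..., None] + clutter               # (64, 3, E)
    # Repetition variance: trial-by-trial noise around each image mean.
    noise = rng.normal(0.0, np.sqrt(repeat_var),
                       size=image_means.shape + (n_repeats,))
    trials = image_means[..., None] + noise                    # (64, 3, E, R)
    return separable, image_means, trials

rng = np.random.default_rng(0)

# Placeholder MDG parameters (assumption): the paper fits these to real IT data.
a = rng.normal(size=(N_OBJECTS, N_OBJECTS))
obj_cov = a @ a.T / N_OBJECTS          # symmetric positive definite
obj_mean = np.zeros(N_OBJECTS)

# Placeholder pool of size-tuning kernels (assumption): small, medium, big.
kernel_pool = np.array([[0.6, 1.0, 0.8],
                        [1.0, 0.9, 0.5],
                        [0.4, 0.8, 1.0]])

# Simulate a population of 168 model sites, matching the recorded sample size.
sites = [simulate_site(rng, obj_mean, obj_cov, kernel_pool,
                       clutter_var=0.1, repeat_var=0.5)
         for _ in range(168)]

# Mean response of each site to each object (averaged over sizes,
# exemplars, and repetitions), giving a sites x objects matrix.
pop = np.stack([trials.mean(axis=(1, 2, 3)) for _, _, trials in sites])

# Object-by-object response distance, taken here as 1 - Pearson r (assumption).
dist = 1.0 - np.corrcoef(pop.T)                       # (64, 64)
pairs = dist[np.triu_indices(N_OBJECTS, k=1)]         # unique object pairs
```

With 64 objects, the upper triangle holds 64 × 63 / 2 = 2016 unique object pairs, the count used when comparing the model and real response distance matrices. Setting `clutter_var` and `repeat_var` to zero recovers the purely separable response matrix for every exemplar and trial.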