Task, stimuli and the WPPM.

(A) 3AFC oddity task. On each trial, participants viewed a triplet of stimuli—two identical references and one different comparison—and identified the odd one out. (B) Stimuli are constrained to lie in the isoluminant plane that passes through the display’s gray point. Data are represented and fit in a transformation of this plane, which we refer to as model space. The grid of dots illustrates the transformation between the plane in the RGB and model space. (C) Example of a smoothly varying covariance matrix field produced by the WPPM. The field is generated by sampling from a finite-basis Wishart random process with a smooth prior (ϵ = 0.5 and γ = 0.0003; Prior over the weight matrix). Although the field is illustrated on a 7 × 7 grid, it specifies a covariance matrix for every stimulus in the plane, as shown in the heat maps. (D) Example of a less smoothly varying covariance matrix field. This field is obtained by drawing from a finite-basis Wishart random process with a less smooth prior (ϵ = 0.8 and γ = 0.0003). (E) Observer model. For each stimulus triplet , internal representations are drawn from multivariate Gaussian distributions centered at the reference stimulus with noise characterized by the corresponding covariance matrices. The model determines whether the observer correctly identifies the odd stimulus by comparing the squared Mahalanobis distances between the representation pairs. (F) Derivation of elliptical threshold contour. One-dimensional psychometric functions are approximated using Monte Carlo simulations (10,000 samples per stimulus pair shown for illustration; 2,000 used during model fitting). For each selected chromatic direction, we derive the threshold distance corresponding to 66.7% correct. An ellipse is then fit to the threshold distances to describe the discrimination threshold contour.

Threshold results and validation.

(A) Adaptively sampled trials. AEPsych-driven stimulus pairs are adaptively sampled for estimating thresholds across the psychometric field. Of the 6,000 trials, the first 900 are Sobol’-sampled; the remaining 5,100 (shown here) are adaptively selected using the EAVC acquisition function, based on a non-parametric GP model that is updated every 20 trials. (B) Discrimination threshold contours (66.7% correct) read out from the WPPM on a grid of reference stimuli for a representative participant, based on fits to the 6,000 AEPsych trials. (C) Group summary of WPPM readouts (N = 8), evaluated on the same grid of reference stimuli. (D) Validation trials for the same participant. The validation conditions (reference stimuli and chromatic directions along which the comparison stimulus varies) are randomly generated for each participant (see Appendix 4 for validation conditions used for the remaining participants). (E) Comparison of thresholds. Ellipses represent discrimination threshold contours read out from the WPPM fit (same fit as in panel B), evaluated at the 25 reference stimuli used in the validation trials. Gray lines: the validation directions; black bars: the 95% bootstrapped confidence intervals for the corresponding validation thresholds. (F) Comparison of psychometric functions. Only two validation conditions are shown for illustration (see Appendix 4.1 for all 25 conditions for each participant). (G) Linear regression of thresholds predicted by the WPPM against validation thresholds for the same participant. Horizontal and vertical error bars represent 95% confidence intervals for the validation thresholds and the WPPM predictions, respectively. (H) Summary of regression slopes and correlation coefficients for all participants. Error bars: 95% confidence intervals. As a benchmark, the same analysis is performed on a dataset simulated using a ground truth WPPM instance that approximates CIELab ΔE94 (Appendix 5).

Comparison of color discrimination thresholds with previous measurements.

Across all panels, black contours represent thresholds from prior studies, whereas colored contours represent the 66.7% discrimination thresholds estimated in our study. Colored shaded regions indicate 95% confidence intervals computed from 120 bootstrapped datasets. (A) MacAdam 1942.Top panel: MacAdam’s original threshold ellipses, magnified 10× for visualization. Bottom panel: Threshold contours measured from one participant in our study and transformed from the model space into the CIE 1931 chromaticity diagram. Reference stimuli are sampled from a 5 × 5 grid spanning [−0.7, 0.7] along each dimension of the model space. To reduce visual clutter, MacAdam ellipses falling within the gamut of our isoluminant plane (parallelogram) are shown only by arrows indicating their major axes. For visual comparability, our ellipses are magnified 2× to approximately match the scale of MacAdam’s data. The triangle indicates our monitor’s gamut. (B) Danilova and Mollon 2025. Left panel: Original threshold contours (79.4% correct) from their study, magnified by 4×. Right panel: Threshold contours from one participant in our study, transformed from the model space into the scaled MacLeod–Boynton space used in their study. Reference points are sampled on a 5 × 5 grid ranging from –0.7 to 0.7. As in panel A, to reduce visual clutter, their ellipses that fall within the gamut of our isoluminant plane (parallelogram) are shown as black arrows indicating only their major axes. For visual comparability, our ellipses are magnified 1.5×. (C) Krauskopf and Gegenfurtner 1992. Left panel: 79.4% threshold contours (Fig. 14 from their study, reproduced under Creative Commons CC BY-NC-ND 4.0). Right panel: 66.7% threshold contours from one participant in our study, transformed into DKL space with the axes scaled for each participant to equate thresholds along the L-M and S axes at the adapting chromaticity, as was done in their study. All contours are shown at their original sizes in this scaled representation. (D) CIELab ΔE76, ΔE94, and ΔE00. Threshold is defined as ΔE = 2.5, chosen to approximately match the scale of our measured thresholds, which are shown at their original sizes. See Appendix 6 - Appendix 9 for additional details.

The finite-basis Wishart Process Psychophysical Model (WPPM).

(A) Model overview. In our implementation, we use a set of 5 × 5 two-dimensional Chebyshev polynomial basis functions, denoted ϕi,j (x), where i, j ∈ {0, 1, …, 4}. These basis functions are summed using a learnable weight matrix W to produce an overcomplete representation Uk,l(x), where k ∈ 1, 2 and l ∈ 1, 2, 3. The resulting representation Uk,l is then combined with its own transpose to produce a field of symmetric, positive semi-definite covariance matrices. Each matrix specifies the internal noise in terms of the variance along the two model dimensions and their covariance (σdim1,dim2). For this example covariance matrix field, the weights used to generate the field correspond to the best-fitting parameters for participant CH (see Figure S3 for all participants). (B) Model readouts. Internal noise can be read out anywhere in the model space, illustrated here on a grid of 7 × 7 reference stimuli (solid lines), from which threshold contours (dashed lines) can be derived.

Corner vertices in the DKL, LMS, RGB, and model spaces.

Transformation matrices between DKL, RGB and model spaces.

AEPsych-driven trials (900 Sobol’-sampled and 5,100 adaptively sampled), fallback trials, and WPPM predictions for all participants.

Each row represents data from one participant.

AEPsych-driven trials (900 Sobol’-sampled and 5,100 adaptively sampled), fallback trials, and WPPM predictions for all participants (continued).

Each row represents data from one participant. Note that for participant CH, no pre-generated Sobol’ trials were used, as the fallback strategy was implemented later in the study to maintain experimental continuity and reduce delays between trials.

Percent correct as a function of the angular difference between reference and comparison stimuli for all participants.

The number of trials within each bin (bin width = 4°) is encoded by both marker size and color, with larger markers and brighter colors indicating a greater number of trials.

Covariance matrix fields obtained from the best-fitting WPPM for all participants.

Rows show (top to bottom): noise variance along the first model dimension, noise variance along the second model dimension, and the covariance between the two dimensions.

The best-fitting weights for all participants.

The symmetric solid curves represent the prior imposed on the model. The prior is implemented by specifying the variance of the weights, η, as a function of the polynomial order of the Chebyshev basis functions and two hyperparameters ϵ and γ (Equation 16). The solid curves indicate for our hyperparameter choice, corresponding to two standard deviations of the prior distribution.

Task timing and real-time trial scheduling.

(A) Trial sequence: a 0.5 s fixation cross was followed by a 0.2 s blank interval, then a 1 s presentation of three blobby stimuli. Participants responded at their own pace to identify the odd-one-out, after which a 0.2 s blank screen and a 0.5 s feedback were shown. The inter-trial interval (ITI) was 1.5 s. (B) A schematic representation of the trial timing and computational responsibilities of the two computers.

Validation for participant ME.

Same format as Figure 2D-G in the main text.

Validation for participant SG.

Validation for participant DK.

Validation for participant BH.

Validation for participant FM.

Validation for participant HG.

Validation for participant FW.

Validation for participant CH.

Threshold residuals.

Data are pooled across all validation conditions and all participants (N = 8). For all panels, color codes for the surface color of the reference stimulus, and the y-axis limits are set to ± the mean of the validation thresholds. (A) Residuals as a function of the absolute angular difference between the major axis of the elliptical threshold contours read out from the WPPM fits and the chromatic direction of the validation condition. (B) Residuals as a function of the aspect ratio (major/minor axis) of the WPPM threshold contours. (C) Residuals as a function of thresholds estimated from validation trials.

Linear regression results assessing the relationship between WPPM–validation threshold residuals and three predictors:(1) the absolute angular difference between the chromatic direction of the validation condition and the major axis of the contours read out from the WPPM fits, (2) the aspect ratio of the contours, and (3) the magnitude of the validation threshold.

This analysis was done on human data.

Catch trial performance summary across all sessions.

The proportion correct reflects the total number of correct responses divided by the total number of catch trials. Lower and upper bounds indicate the participant’s lowest and highest session-level performance, respectively.

Derivation of the ground-truth Wishart fits based on CIELab ΔE94.

(A–B) Comparison stimuli at the iso-distance contours in the isoluminant plane, shown in both RGB and model spaces. Note that the reference grid and fixed set of directions shown here are for illustration only; the actual sampling did not use a fixed grid or evenly spaced chromatic directions. (C) The Weibull psychometric function used to simulate binary (correct or incorrect) responses given ΔE values. (D) Sampled reference-comparison stimulus pairs. Reference colors and chromatic directions were sampled using Sobol’ sequences, and comparison stimuli were jittered around the iso-distance contour. A total of 18,000 trials were simulated; only the first 200 are shown here for clarity. (E) Comparison between readouts from the WPPM fit and from CIELab ΔE94. The WPPM fit was subsequently treated as the ground truth for simulating AEPsych and validation trials.

AEPsych-driven trials and WPPM readouts for a simulated observer.

Note that the ground-truth thresholds shown in (C) is the same WPPM readouts from Figure S15E.

Validation trials and WPPM readouts for a simulated observer.

Linear regression results for the simulated dataset.

Threshold residuals for a simulated dataset.

For all panels, color codes for the surface color of the reference stimulus, and the y-axis limits are set to ± the mean of the validation thresholds. (A) Residuals as a function of the absolute angular difference between the major axis of the elliptical threshold contours read out from the WPPM fits and the chromatic direction of the validation condition. (B) Residuals as a function of the aspect ratio (major/minor axis) of the WPPM threshold contours. (C) Residuals as a function of thresholds estimated from validation trials.

Deviation of WPPM estimates from the ground truth.

(A) BW distance between WPPM-estimated thresholds and the ground-truth ellipses. The upper limit of the color map (0.17) corresponds to the maximum BW distance between each ground-truth ellipse and a reference circle whose radius equals the largest major axis length among all ground-truth ellipses. The maximum BW distance between WPPM estimates and the ground truth (0.03) is substantially lower than this reference value. (B) Difference in major axis length between WPPM-readouts and ground-truth ellipses. The colormap limits (±0.17) reflect the ± maximum ground-truth major axis length. Again, the maximum deviation observed (0.03) is small relative to this range.

Comparison with MacAdam 1942.

First panel: MacAdam’s original threshold ellipses, magnified 10× for visualization. Rest panels: 66.7% threshold contours (colored lines) measured from all participants in our study and transformed from the model space into the CIE 1931 chromaticity diagram. Colored shaded regions indicate 95% confidence intervals computed from 120 bootstrapped datasets. Reference stimuli were sampled from a 5 × 5 grid spanning [−0.7, 0.7] along each dimension of the model space. To reduce visual clutter, MacAdam ellipses falling within the gamut of our isoluminant plane (parallelogram) are shown only by arrows indicating their major axes. For visual comparability, our ellipses are magnified 2× to approximately match the scale of MacAdam’s data. The triangle indicates the monitor gamut.

Comparison with Danilova and Mollon 2025 in the scaled MacLeod–Boynton space.

Top left: threshold contours from their study (black ellipses), enlarged by 4×. Remaining panels: threshold contours from all participants (colored ellipses). We sampled a grid of reference points evenly spaced from –0.7 to 0.7 (5 steps) in our model space, read out the corresponding threshold contours, and transformed them into the same scaled MacLeod–Boynton space. The parallelogram indicates the gamut of the isoluminant plane. To reduce visual clutter, ellipses from Danilova & Mollon that fall within our gamut are represented by red arrows indicating only their major axes. For visual comparability, our ellipses are enlarged by 1.5× to roughly match the size of those in their study.

Transformation from the model space to a stretched DKL space used in Krauskopf and Gegenfurtner 1992 for participant CH.

(A) Model space. Threshold contours were read out in this space based on each participant’s WPPM fit. Notably, our data were collected on a much larger region of the isoluminant plane than they characterized. (B) The intermediate, unstretched DKL space. Transformations between this space and both the model space and the stretched DKL space are affine. (C) Stretched DKL space, in which the cardinal axes of the original DKL space are rescaled such that the threshold at the achromatic reference point is normalized.

Comparison with Krauskopf and Gegenfurtner 1992 for the remaining seven participants.

Top left: original threshold contours reported by Krauskopf and Gegenfurtner 1992, reproduced under Creative Commons CC BY-NC-ND 4.0). Remaining panels: 66.7% threshold contours (colored lines) for the remaining participants, transformed into the stretched DKL space using participant-specific scaling of the cardinal axes. Colored shaded regions indicate 95% confidence intervals computed from 120 bootstrapped datasets. All contours are plotted at their original sizes.

Comparison with CIELab ΔE76 color-difference (Robertson et al., 1977).

ΔE values were converted to percent correct using a Weibull psychometric function, and threshold was defined as ΔE = 2.5, chosen to approximately match the scale of the measured thresholds in our data. Black contours represent the CIE predictions, whereas colored contours represent the measured thresholds transformed from the model space into CIELab space. Colored shaded regions indicate 95% confidence intervals computed from 120 bootstrapped datasets. Measured thresholds are shown at their original scales.

Transformation matrix from LMS to XYZ.

Comparison with predictions based on the CIELab ΔE94 color-difference metric (McDonald and Smith, 1995).

Comparison with CIELab ΔE00 color-difference metric (Sharma et al., 2005).

The effects of ϵ and γ on the variance of model weights.

(A) Variance of the Chebyshev basis weights as a function of polynomial order (i + j). The top panel illustrates the effect of varying ϵ while holding γ fixed, whereas the bottom panel shows the effect of varying γ while holding ϵ fixed. The yellow dashed curve indicates the hyperparameter values used in the main analyses (ϵ = 0.4, γ = 0.0003). (B) Two-dimensional Chebyshev basis functions arranged in order of increasing total polynomial degree (i + j).

Effects of ϵ on WPPM-predicted psychometric field for a representative participant.

Gray line and shaded region indicate the mean and full range of negative log likelihood (nLL) on the training set across five repetitions of five-fold cross-validation. Green line and shaded region indicate the mean and full range of nLL on the test set. Panels (a)–(g) show the model-predicted thresholds on a 7 × 7 reference grid for selected values of ϵ.

Effects of ϵ on model performance for the remaining seven participants.

Effects of γ on WPPM-predicted psychometric field for a representative participant.

Gray line and shaded region indicate the mean and full range of negative log likelihood (nLL) on the training set across five repetitions of five-fold cross-validation. Green line and shaded region indicate the mean and full range of nLL on the test set. Panels (a)–(g) show the model-predicted thresholds on a 7 × 7 reference grid for selected values of γ.

Effects of γ on model performance for the remaining seven participants.

The slope of linear regression assessing the relationship between WPPM–validation threshold residuals and three predictors reported in Table S3.

The hyperparameter γ was fixed at 0.0003, while ϵ was varied.

Effects of ϵ on the slope and correlation coefficient of the linear regression between WPPM-predicted thresholds and validation thresholds.

Stimuli and equipment used for calibration.

(A) The stimulus setup during calibration was identical to that used in the main experiment. The surface color of both the cubic room and the blobby stimulus (shown here as the top-position stimulus) was varied during the calibration procedure. The shaded gray circular region on the stimulus indicates the area measured by the spectroradiometer’s lens. (B) A SpectraScan PR-670 used for all calibration measurements.

Characterization of display output via Unity’s rendering pipeline.

(A) Gamma functions for red, green and blue primaries. Note that Unity’s internal correction places them above the identity line. (B) Spectral power distributions (SPDs) of the three primaries across a range of intensity levels. (C) The chromaticity of each primaries in the CIE chromaticity diagram at different intensity levels. (D) Normalized SPDs for each primary, showing that SPD shape is stable across intensity levels. (E) Linearity tests comparing predicted and measured chromaticity and luminance across two independent measurement runs. (F) Deviations from linearity. (G) Effect of the cubic room’s background color on the SPD of the blobby stimulus, showing no detectable influence.

Comparison of display output across stimulus locations.

(A) Spectral power distributions (SPDs) for each stimulus location: Ref Cal (bottom right), Cal 2 (bottom left), and Cal 3 (top). (B) Ambient light SPDs measured during calibration. (C) Gamma functions for each primary (red, green, blue) across all three stimulus locations. (D) Differences in normalized output for each pairwise comparison of stimulus locations, plotted separately for each primary. (E) Chromaticity coordinates of each primary in the CIE diagram, shown for all three stimulus locations.

Gamma correction.

(A) Measured gamma functions and their corresponding inverse functions for the red, green, and blue primaries, used to construct the gamma correction lookup table. (B) Gamma functions re-measured after applying the correction in Unity, showing close alignment with the identity line for all three primaries.

Comparison of display output with gamma correction over time.

(A) Spectral power distributions (SPDs) when measured at the bottom-right blobby stimulus location. Ref Cal (initial measurements prior to the experiment) and Cal 2 (follow-up measurements roughly one month after data collection began). (B) Ambient light SPDs measured during each calibration. (C) Gamma functions for the red, green, and blue primaries across both sessions, with gamma correction applied. (D) Chromaticity coordinates of each primary plotted in the CIE diagram for both calibration runs.

Evidence of spatial dithering by Unity’s rendering pipeline.

(A) The stimulus setup during measurements was identical to that used in the main experiment. The surface color of both the cubic room and the blobby stimulus (shown here as the top-position stimulus) was varied across trials during the measurements. The shaded gray circular region on the stimulus indicates the area measured by the colorimeter’s aperture. (B) Spatial dithering by Unity’s standard shader is suggested by comparing the luminance measurements from the Klein K-10A (averaged across a circular region on the blobby object) with the RGB values stored in the frame buffer. The measured luminance shows small incremental changes as the RGB settings increase in steps of 1/4095. These measurements are consistent with what we obtain by averaging over pixels in a saved image of the frame buffer (saved from Unity in .exr format). The averaged pixel values exhibit 12-bit quantization even though individual pixel values exhibit 8-bit quantization. (C) Top row: mean R channel values averaged vertically within a horizontal slice of the blobby object. Bottom row: differences in the R channel values between the minimum target R channel setting and each of the rest settings. Different shades of gray represent different target R settings. For illustration, only a portion of the horizontal slice is shown, and solid lines in the bottom row are scaled by a factor of 0.1. Dashed lines: the mean difference averaged across all pixels within each slice.